{"id":156383,"date":"2006-01-01T00:00:00","date_gmt":"2006-01-01T00:00:00","guid":{"rendered":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/msr-research-item\/a-state-space-model-with-neural-network-prediction-for-recovering-vocal-tract-resonances-in-fluent-speech-from-mel-cepstral-coefficients\/"},"modified":"2018-10-16T20:22:56","modified_gmt":"2018-10-17T03:22:56","slug":"a-state-space-model-with-neural-network-prediction-for-recovering-vocal-tract-resonances-in-fluent-speech-from-mel-cepstral-coefficients","status":"publish","type":"msr-research-item","link":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/publication\/a-state-space-model-with-neural-network-prediction-for-recovering-vocal-tract-resonances-in-fluent-speech-from-mel-cepstral-coefficients\/","title":{"rendered":"A State-space Model with Neural-Network Prediction for Recovering Vocal Tract Resonances in Fluent Speech from Mel-cepstral Coefficients"},"content":{"rendered":"<p>In this paper, we present a state-space formulation of a neural-network-based hidden dynamic model of speech whose parameters are trained using an approximate EM algorithm. This efficient and effective training makes use of the output of an off-the-shelf formant tracker (for the vowel segments of the speech signal), in addition to the Mel-cepstral observations, to simplify the complex sufficient statistics that would be required in the exact EM algorithm. The trained model, consisting of the state equation for the target-directed vocal tract resonance (VTR) dynamics on all classes of speech sounds (including consonant closure and constriction) and the observation equation for mapping from the VTR to Mel-cepstral acoustic measurement, is then used to recover the unobserved VTR based on the extended Kalman filter. The results demonstrate accurate estimation of the VTR, especially during rapid consonant\u2013vowel or vowel\u2013consonant transitions and during consonant closure when the acoustic measurement alone provides weak or no information to infer the VTR values. The practical significance of correctly identifying the VTRs during consonantal closure or constriction is that they provide target frequency values for the VTR or formant transitions from adjacent sounds. Without such target values, the VTR transitions from vowel to consonant or from consonant to vowel are often very difficult to extract accurately by the previous formant tracking techniques. With the use of the new technique reported in this paper, the consonantal VTRs and the related transitions become more reliably identified from the speech signal.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this paper, we present a state-space formulation of a neural-network-based hidden dynamic model of speech whose parameters are trained using an approximate EM algorithm. This efficient and effective training makes use of the output of an off-the-shelf formant tracker (for the vowel segments of the speech signal), in addition to the Mel-cepstral observations, to [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_publishername":"","msr_publisher_other":"","msr_booktitle":"","msr_chapter":"","msr_edition":"","msr_editors":"","msr_how_published":"","msr_isbn":"","msr_issue":"","msr_journal":"Speech Communication","msr_number":"","msr_organization":"","msr_pages_string":"971-988","msr_page_range_start":"971","msr_page_range_end":"988","msr_series":"","msr_volume":"48","msr_copyright":"","msr_conference_name":"","msr_doi":"","msr_arxiv_id":"","msr_s2_paper_id":"","msr_mag_id":"","msr_pubmed_id":"","msr_other_authors":"Roberto Togneri","msr_other_contributors":"","msr_speaker":"","msr_award":"","msr_affiliation":"","msr_institution":"","msr_host":"","msr_version":"","msr_duration":"","msr_original_fields_of_study":"","msr_release_tracker_id":"","msr_s2_match_type":"","msr_citation_count_updated":"","msr_published_date":"2006-01-01","msr_highlight_text":"","msr_notes":"","msr_longbiography":"","msr_publicationurl":"https:\/\/www.researchgate.net\/publication\/222680881_Deng_L_A_state-space_model_with_neural-network_prediction_for_recovering_vocal_tract_resonances_in_fluent_speech_from_mel-cepstral_coefficients_Speech_Commun_488_971-988","msr_external_url":"","msr_secondary_video_url":"","msr_conference_url":"","msr_journal_url":"","msr_s2_pdf_url":"","msr_year":2006,"msr_citation_count":0,"msr_influential_citations":0,"msr_reference_count":0,"msr_s2_match_confidence":0,"msr_microsoftintellectualproperty":true,"msr_s2_open_access":false,"msr_s2_author_ids":[],"msr_pub_ids":[],"msr_hide_image_in_river":0,"footnotes":""},"msr-research-highlight":[],"research-area":[13556,13545,13554],"msr-publication-type":[193715],"msr-publisher":[],"msr-focus-area":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-156383","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-artificial-intelligence","msr-research-area-human-language-technologies","msr-research-area-human-computer-interaction","msr-locale-en_us"],"msr_publishername":"","msr_edition":"","msr_affiliation":"","msr_published_date":"2006-01-01","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"971-988","msr_chapter":"","msr_isbn":"","msr_journal":"Speech Communication","msr_volume":"48","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"https:\/\/www.researchgate.net\/publication\/222680881_Deng_L_A_state-space_model_with_neural-network_prediction_for_recovering_vocal_tract_resonances_in_fluent_speech_from_mel-cepstral_coefficients_Speech_Commun_488_971-988","msr_doi":"","msr_publication_uploader":[{"type":"url","title":"https:\/\/www.researchgate.net\/publication\/222680881_Deng_L_A_state-space_model_with_neural-network_prediction_for_recovering_vocal_tract_resonances_in_fluent_speech_from_mel-cepstral_coefficients_Speech_Commun_488_971-988","viewUrl":false,"id":false,"label_id":0}],"msr_related_uploader":"","msr_citation_count":0,"msr_citation_count_updated":"","msr_s2_paper_id":"","msr_influential_citations":0,"msr_reference_count":0,"msr_arxiv_id":"","msr_s2_author_ids":[],"msr_s2_open_access":false,"msr_s2_pdf_url":null,"msr_attachments":[{"id":0,"url":"https:\/\/www.researchgate.net\/publication\/222680881_Deng_L_A_state-space_model_with_neural-network_prediction_for_recovering_vocal_tract_resonances_in_fluent_speech_from_mel-cepstral_coefficients_Speech_Commun_488_971-988"}],"msr-author-ordering":[{"type":"text","value":"Roberto Togneri","user_id":0,"rest_url":false},{"type":"user_nicename","value":"deng","user_id":31602,"rest_url":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=deng"}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[],"msr_group":[],"msr_project":[169434],"publication":[],"video":[],"msr-tool":[],"msr_publication_type":"article","related_content":{"projects":[{"ID":169434,"post_title":"Acoustic Modeling","post_name":"acoustic-modeling","post_type":"msr-project","post_date":"2004-01-29 16:42:42","post_modified":"2019-08-14 14:50:04","post_status":"publish","permalink":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/project\/acoustic-modeling\/","post_excerpt":"Acoustic modeling of speech typically refers to the process of\u00a0establishing statistical\u00a0representations for the feature vector sequences\u00a0computed from the speech waveform. Hidden Markov Model (HMM) is one most common type of acoustuc models. Other acosutic models include segmental models, super-segmental models (including hidden dynamic models), neural networks, maximum entropy models, and (hidden) conditional random fields, etc. Acoustic modeling also encompasses \"pronunciation modeling\", which describes how a sequence or multi-sequences of fundamental speech units\u00a0(such as phones or&hellip;","_links":{"self":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/169434"}]}}]},"_links":{"self":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/156383","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":2,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/156383\/revisions"}],"predecessor-version":[{"id":434868,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/156383\/revisions\/434868"}],"wp:attachment":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/media?parent=156383"}],"wp:term":[{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=156383"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=156383"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=156383"},{"taxonomy":"msr-publisher","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-publisher?post=156383"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=156383"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=156383"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=156383"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=156383"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=156383"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=156383"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=156383"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=156383"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}