{"id":170475,"date":"2010-05-31T21:23:41","date_gmt":"2010-05-31T21:23:41","guid":{"rendered":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/project\/photo-real-talking-head\/"},"modified":"2017-06-02T11:36:36","modified_gmt":"2017-06-02T18:36:36","slug":"photo-real-talking-head","status":"publish","type":"msr-project","link":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/project\/photo-real-talking-head\/","title":{"rendered":"Photo-Real Talking Head"},"content":{"rendered":"\t<div data-wp-context='{\"items\":[]}' data-wp-interactive=\"msr\/accordion\">\n\t\t\t\t\t<div class=\"clearfix\">\n\t\t\t\t<div\n\t\t\t\t\tclass=\"btn-group align-items-center mb-g float-sm-right\"\n\t\t\t\t\tdata-bi-aN=\"accordion-collapse-controls\"\n\t\t\t\t>\n\t\t\t\t\t<button\n\t\t\t\t\t\tclass=\"btn btn-link m-0\"\n\t\t\t\t\t\tdata-bi-cN=\"Expand all\"\n\t\t\t\t\t\tdata-wp-bind--aria-controls=\"state.ariaControls\"\n\t\t\t\t\t\tdata-wp-bind--aria-expanded=\"state.ariaExpanded\"\n\t\t\t\t\t\tdata-wp-bind--disabled=\"state.isAllExpanded\"\n\t\t\t\t\t\tdata-wp-class--inactive=\"state.isAllExpanded\"\n\t\t\t\t\t\tdata-wp-on--click=\"actions.onExpandAll\"\n\t\t\t\t\t\ttype=\"button\"\n\t\t\t\t\t>\n\t\t\t\t\t\tExpand all\t\t\t\t\t<\/button>\n\t\t\t\t\t<span aria-hidden=\"true\"> | <\/span>\n\t\t\t\t\t<button\n\t\t\t\t\t\tclass=\"btn btn-link m-0\"\n\t\t\t\t\t\tdata-bi-cN=\"Collapse all\"\n\t\t\t\t\t\tdata-wp-bind--aria-controls=\"state.ariaControls\"\n\t\t\t\t\t\tdata-wp-bind--aria-expanded=\"state.ariaExpanded\"\n\t\t\t\t\t\tdata-wp-bind--disabled=\"state.isAllCollapsed\"\n\t\t\t\t\t\tdata-wp-class--inactive=\"state.isAllCollapsed\"\n\t\t\t\t\t\tdata-wp-on--click=\"actions.onCollapseAll\"\n\t\t\t\t\t\ttype=\"button\"\n\t\t\t\t\t>\n\t\t\t\t\t\tCollapse all\t\t\t\t\t<\/button>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t\t\t<ul class=\"msr-accordion\">\n\t\t\t\t\t\t\t\t<li class=\"m-0\" data-wp-context='{\"id\":\"accordion-content-2\"}' data-wp-init=\"callbacks.init\">\n\t\t<div 
class=\"accordion-header\">\n\t\t\t<button\n\t\t\t\taria-controls=\"accordion-content-2\"\n\t\t\t\tclass=\"btn btn-collapse\"\n\t\t\t\tdata-wp-bind--aria-expanded=\"state.isExpanded\"\n\t\t\t\tdata-wp-on--click=\"actions.onClick\"\n\t\t\t\tid=\"accordion-button-1\"\n\t\t\t\ttype=\"button\"\n\t\t\t>\n\t\t\t\tHonors &amp; Awards\t\t\t<\/button>\n\t\t<\/div>\n\t\t<div\n\t\t\taria-labelledby=\"accordion-button-1\"\n\t\t\tclass=\"msr-accordion__content\"\n\t\t\tdata-wp-bind--inert=\"!state.isExpanded\"\n\t\t\tdata-wp-run=\"callbacks.run\"\n\t\t\tid=\"accordion-content-2\"\n\t\t>\n\t\t\t<div class=\"msr-accordion__body\">\n\t\t\t\t<ul>\n<li>The <i>3D Photo-Real talking head<\/i> project won \u201c<b>Demo of the Year<\/b>\u201d for 2011 at MSRA, and was also shown at <b>Craig Mundie<\/b>\u2019s Techforum 2011, Techfest 2011 (including public day), Exec Retreat 2011, and <b>MGX<\/b> 2011, with extensive press coverage (<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.msnbc.msn.com\/id\/21134540\/vp\/41980049#41980049\">MSNBC<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.pcworld.com\/article\/240642\/how_microsoft_research_helped_craig_mundie_speak_chinese.html\">PCWorld<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/news.cnet.com\/8301-10805_3-20034995-75.html\">CNET<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" 
href=\"http:\/\/seattletimes.nwsource.com\/html\/microsoftpri0\/2014443637_techfesta3dtalkingheadphoto.html\">The Seattle Times<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, etc.).<\/li>\n<li><i>Dictionary Talking Head<\/i> was selected as one of the <b>18 MSR-highlighted \u201ctech transfers\u201d<\/b> (i.e., those with significant product impact) of 2010 from the worldwide labs (reported by <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.pcworld.com\/article\/217470\/microsoft_uses_karaoke_feature_on_chinas_bing_dictionary.html\">PCWorld<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>).<\/li>\n<li>The <i>Photo-Real talking head<\/i> project won <b>first place<\/b> in the audio-visual consistency test of the LIPS Challenge 2009, an international audio\/visual lips-rendering contest held at the AVSP Workshop.<\/li>\n<\/ul>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/li>\n\t\t<li class=\"m-0\" data-wp-context='{\"id\":\"accordion-content-4\"}' data-wp-init=\"callbacks.init\">\n\t\t<div class=\"accordion-header\">\n\t\t\t<button\n\t\t\t\taria-controls=\"accordion-content-4\"\n\t\t\t\tclass=\"btn btn-collapse\"\n\t\t\t\tdata-wp-bind--aria-expanded=\"state.isExpanded\"\n\t\t\t\tdata-wp-on--click=\"actions.onClick\"\n\t\t\t\tid=\"accordion-button-3\"\n\t\t\t\ttype=\"button\"\n\t\t\t>\n\t\t\t\t2D Photo-Real Talking Head\t\t\t<\/button>\n\t\t<\/div>\n\t\t<div\n\t\t\taria-labelledby=\"accordion-button-3\"\n\t\t\tclass=\"msr-accordion__content\"\n\t\t\tdata-wp-bind--inert=\"!state.isExpanded\"\n\t\t\tdata-wp-run=\"callbacks.run\"\n\t\t\tid=\"accordion-content-4\"\n\t\t>\n\t\t\t<div class=\"msr-accordion__body\">\n\t\t\t\t<h3>abstract<\/h3>\n<p align=\"left\">We propose an HMM trajectory-guided, real image sample concatenation approach to photo-real talking head synthesis. It renders a smooth and natural video of articulators in sync with given speech signals. 
With audio\/video footage as short as 20 minutes from a speaker, the proposed system can synthesize a highly photo-real video in sync with the given speech signals. This system won first place in the audio-visual match contest of the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" href=\"http:\/\/www.lips2008.org\/\" target=\"_blank\">LIPS2009 Challenge<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n<p align=\"left\"><span id=\"85025886-4b74-4e15-b3ba-eb0a34c6216e\" class=\"ImageBlock fn\"><img decoding=\"async\" id=\"Image85025886-4b74-4e15-b3ba-eb0a34c6216e\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/photo-real_talking_head-pic1.png\" alt=\"\" \/><span id=\"ImageCaption85025886-4b74-4e15-b3ba-eb0a34c6216e\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><\/p>\n<h3>Video Demo:<\/h3>\n<ul>\n<li><a title=\"\" href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/photo-real_talking_head-obama_story-1-30.avi\" target=\"_self\">Demo 1: Text-driven photo-real talking head (British English)<\/a><\/li>\n<li><a title=\"\" href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/photo-real_talking_head-sus001_volx8.avi\" target=\"_self\">Demo 2: Speech-driven photo-real talking head (British English)<\/a><\/li>\n<li><a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/photo-real_talking_head-00001_11.wmv\" target=\"_self\">Demo 3: Text-driven photo-real talking head (American English)<\/a><\/li>\n<li><a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/photo-real_talking_head-xiaocui.avi\" target=\"_self\">Demo 4: Virtual talk show host (in Chinese)<\/a><\/li>\n<\/ul>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/li>\n\t\t<li class=\"m-0\" data-wp-context='{\"id\":\"accordion-content-6\"}' 
data-wp-init=\"callbacks.init\">\n\t\t<div class=\"accordion-header\">\n\t\t\t<button\n\t\t\t\taria-controls=\"accordion-content-6\"\n\t\t\t\tclass=\"btn btn-collapse\"\n\t\t\t\tdata-wp-bind--aria-expanded=\"state.isExpanded\"\n\t\t\t\tdata-wp-on--click=\"actions.onClick\"\n\t\t\t\tid=\"accordion-button-5\"\n\t\t\t\ttype=\"button\"\n\t\t\t>\n\t\t\t\tAn HMM-based Singing and Talking Head\t\t\t<\/button>\n\t\t<\/div>\n\t\t<div\n\t\t\taria-labelledby=\"accordion-button-5\"\n\t\t\tclass=\"msr-accordion__content\"\n\t\t\tdata-wp-bind--inert=\"!state.isExpanded\"\n\t\t\tdata-wp-run=\"callbacks.run\"\n\t\t\tid=\"accordion-content-6\"\n\t\t>\n\t\t\t<div class=\"msr-accordion__body\">\n\t\t\t\t<h3>abstract<\/h3>\n<p>This demo shows a trainable, Hidden Markov Model (HMM)-based talking and singing head that can synthesize speech from given text, or a singing voice from given lyrics and a music score (melody).<\/p>\n<p>In training, audio\/visual features, along with the corresponding scripts (text, or lyrics and melody), are used to train statistical HMMs in which the key features of basic audio\/visual components and their dynamics are captured and parameterized statistically. In speech synthesis, a given text is first analyzed and decomposed into a sequence of phonemes along with their corresponding durations and f0 prosody, which drive the trained HMMs to generate speech parameter trajectories. The generated trajectories are then used to synthesize the final speech waveform. In singing voice synthesis, the given lyrics and melody of a song are used to determine the pitch trajectory and phoneme durations, and this information drives the trained HMMs to synthesize a singing voice.<\/p>\n<p>Since the HMMs are trained with a person&#8217;s speech or a singer&#8217;s voice data, personalized speech or a personalized singing voice can be optimally reproduced in the maximum likelihood sense. 
Head motions and synchronized lip movements can be synthesized automatically from the corresponding prosodic cues and viseme sequence, and they can also be modified interactively.<\/p>\n<p><span id=\"fddec45d-3f46-439c-8961-2e2297a38247\" class=\"ImageBlock fn\"><img decoding=\"async\" id=\"Imagefddec45d-3f46-439c-8961-2e2297a38247\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/photo-real_talking_head-pic5.png\" alt=\"\" \/><span id=\"ImageCaptionfddec45d-3f46-439c-8961-2e2297a38247\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><\/p>\n<h3>Video Demo:<\/h3>\n<ul>\n<li><a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/photo-real_talking_head-youandme_samanthav2.wmv\">Demo 5: &#8220;You and Me&#8221; by cartoon talking head (girl).<\/a><\/li>\n<li><a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/photo-real_talking_head-youandme_tomv2.wmv\">Demo 6: &#8220;You and Me&#8221; by cartoon talking head (boy).<\/a><\/li>\n<li><a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/photo-real_talking_head-introduction.wmv\">Demo 7: Self-introduction by cartoon talking head (boy).<\/a><\/li>\n<\/ul>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/li>\n\t\t<li class=\"m-0\" data-wp-context='{\"id\":\"accordion-content-8\"}' data-wp-init=\"callbacks.init\">\n\t\t<div class=\"accordion-header\">\n\t\t\t<button\n\t\t\t\taria-controls=\"accordion-content-8\"\n\t\t\t\tclass=\"btn btn-collapse\"\n\t\t\t\tdata-wp-bind--aria-expanded=\"state.isExpanded\"\n\t\t\t\tdata-wp-on--click=\"actions.onClick\"\n\t\t\t\tid=\"accordion-button-7\"\n\t\t\t\ttype=\"button\"\n\t\t\t>\n\t\t\t\tComputerized Audio-Visual Language 
Learning\t\t\t<\/button>\n\t\t<\/div>\n\t\t<div\n\t\t\taria-labelledby=\"accordion-button-7\"\n\t\t\tclass=\"msr-accordion__content\"\n\t\t\tdata-wp-bind--inert=\"!state.isExpanded\"\n\t\t\tdata-wp-run=\"callbacks.run\"\n\t\t\tid=\"accordion-content-8\"\n\t\t>\n\t\t\t<div class=\"msr-accordion__body\">\n\t\t\t\t<h3>abstract<\/h3>\n<p>For foreign language learners, learning correct pronunciation is considered by many to be one of the most arduous tasks if one does not have access to a personal tutor. The most common method for learning pronunciation, listening to and repeating audio recordings, has two important deficiencies: completeness and engagement. It lacks completeness in that audio alone does not show users how to move their mouth and lips to sound out phonemes that may not exist in their mother tongue. It also lacks engagement: audio alone is less motivating and personalized for learners, and, as studies in Cognitive Informatics support, humans process information more efficiently when it is presented both aurally and visually.<\/p>\n<p>The ambition is to create a visualized language teacher that can engage in many aspects of language learning, from detailed pronunciation training to conversational practice. An initial implementation is a photo-realistic talking head for pronunciation training, demonstrating highly precise lip-sync animation for arbitrary text input. 
ESL users can thus watch synthesized videos of many sample sentences on Bing Dictionary (Engkoo) to learn, through lip-synced animation, how the mouth moves with speech.<\/p>\n<p><span id=\"e64c2915-3695-45d1-848d-995e52e38830\" class=\"ImageBlock fn\"><img decoding=\"async\" id=\"Imagee64c2915-3695-45d1-848d-995e52e38830\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/photo-real_talking_head-pic3.png\" alt=\"\" \/><span id=\"ImageCaptione64c2915-3695-45d1-848d-995e52e38830\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><\/p>\n<h3>Video Demo:<\/h3>\n<p>A live demo can be found on Bing Dictionary (<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/dict.bing.com.cn\/\">http:\/\/dict.bing.com.cn<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>).<\/p>\n<h3>News Coverage:<\/h3>\n<ul>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.pcworld.com\/article\/217470\/microsoft_uses_karaoke_feature_on_chinas_bing_dictionary.html\">Microsoft Uses Karaoke Feature on China&#8217;s Bing Dictionary<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.pcworld.com\/article\/217470\/microsoft_uses_karaoke_feature_on_chinas_bing_dictionary.html\"> | PCWorld<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<\/ul>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/li>\n\t\t<li class=\"m-0\" data-wp-context='{\"id\":\"accordion-content-10\"}' data-wp-init=\"callbacks.init\">\n\t\t<div class=\"accordion-header\">\n\t\t\t<button\n\t\t\t\taria-controls=\"accordion-content-10\"\n\t\t\t\tclass=\"btn 
btn-collapse\"\n\t\t\t\tdata-wp-bind--aria-expanded=\"state.isExpanded\"\n\t\t\t\tdata-wp-on--click=\"actions.onClick\"\n\t\t\t\tid=\"accordion-button-9\"\n\t\t\t\ttype=\"button\"\n\t\t\t>\n\t\t\t\t3D Photo-Real Talking Head\t\t\t<\/button>\n\t\t<\/div>\n\t\t<div\n\t\t\taria-labelledby=\"accordion-button-9\"\n\t\t\tclass=\"msr-accordion__content\"\n\t\t\tdata-wp-bind--inert=\"!state.isExpanded\"\n\t\t\tdata-wp-run=\"callbacks.run\"\n\t\t\tid=\"accordion-content-10\"\n\t\t>\n\t\t\t<div class=\"msr-accordion__body\">\n\t\t\t\t<h3>abstract<\/h3>\n<p>We propose a new 3D photo-real talking head with a personalized, photo-realistic appearance. Different head motions and facial expressions can be freely controlled and rendered. It extends our prior, high-quality, 2D photo-real talking head to 3D.<\/p>\n<p>Around 20 minutes of 2D audio-visual video are first recorded, with prompted sentences read by a speaker. We use a 2D-to-3D reconstruction algorithm to automatically wrap the 3D geometric mesh with 2D frames to construct a training database. In training, super feature vectors consisting of 3D geometry, texture and speech are formed to train a statistical, multi-streamed, Hidden Markov Model (HMM). The HMM is then used to synthesize both the trajectories of geometry animation and dynamic texture. The 3D talking head animation can be controlled by the rendered geometric trajectory, while the facial expressions and articulator movements are rendered with the dynamic 2D image sequences. Head motions and facial expressions can also be separately controlled by manipulating the corresponding parameters. 
The new 3D talking head has many useful applications, such as voice agents, tele-presence, gaming, and speech-to-speech translation.<\/p>\n<p><span id=\"b3ca581e-b5ae-4ded-b0b4-39d3a408562f\" class=\"ImageBlock fn\"><img decoding=\"async\" id=\"Imageb3ca581e-b5ae-4ded-b0b4-39d3a408562f\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/photo-real_talking_head-pic2.png\" alt=\"\" \/><span id=\"ImageCaptionb3ca581e-b5ae-4ded-b0b4-39d3a408562f\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><\/p>\n<h3>Video Demo:<\/h3>\n<ul>\n<li><a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/photo-real_talking_head-3d_intro_2.avi\">Demo 8: 3D Photo-Realistic Talking Head (US male)<\/a><\/li>\n<li><a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/photo-real_talking_head-craig_eng_hd.wmv\">Demo 9: 3D Talking Head for Craig Mundie<\/a><\/li>\n<\/ul>\n<h3>News Coverage:<\/h3>\n<ul>\n<li>CNET News: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/news.cnet.com\/8301-10805_3-20034995-75.html\">Microsoft demos 3D photo avatars, display tech <span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<li>MSNBC Homepage Story: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.msnbc.msn.com\/id\/21134540\/vp\/41980049\">Realistic 3-D talking head made from 2-D video <span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<li>The Seattle Times: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/seattletimes.nwsource.com\/html\/microsoftpri0\/2014443637_techfesta3dtalkingheadphoto.html\">TechFest<span class=\"sr-only\"> (opens in new 
tab)<\/span><\/a><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/seattletimes.nwsource.com\/html\/microsoftpri0\/2014443637_techfesta3dtalkingheadphoto.html\">: Animating a 3-D photo avatar<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<\/ul>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/li>\n\t\t<li class=\"m-0\" data-wp-context='{\"id\":\"accordion-content-12\"}' data-wp-init=\"callbacks.init\">\n\t\t<div class=\"accordion-header\">\n\t\t\t<button\n\t\t\t\taria-controls=\"accordion-content-12\"\n\t\t\t\tclass=\"btn btn-collapse\"\n\t\t\t\tdata-wp-bind--aria-expanded=\"state.isExpanded\"\n\t\t\t\tdata-wp-on--click=\"actions.onClick\"\n\t\t\t\tid=\"accordion-button-11\"\n\t\t\t\ttype=\"button\"\n\t\t\t>\n\t\t\t\tA Multi-lingual, 3D Photo-realistic Talking Head\t\t\t<\/button>\n\t\t<\/div>\n\t\t<div\n\t\t\taria-labelledby=\"accordion-button-11\"\n\t\t\tclass=\"msr-accordion__content\"\n\t\t\tdata-wp-bind--inert=\"!state.isExpanded\"\n\t\t\tdata-wp-run=\"callbacks.run\"\n\t\t\tid=\"accordion-content-12\"\n\t\t>\n\t\t\t<div class=\"msr-accordion__body\">\n\t\t\t\t<h3>abstract<\/h3>\n<p>Speaking a foreign language fluently without ever attending a traditional or self-paced language course seems incredible, if not impossible. In this demo, we create a talking head that can speak foreign languages. We use Chinese (to be learned) and English (native language) as the language pair to demonstrate this technology: an English speaker\u2019s talking head speaks authentic Chinese lip-synchronously, in the original speaker\u2019s voice. The talking head and the corresponding Mandarin TTS are trained with the English speaker\u2019s audio\/video recordings. Two advanced technologies, 3D photo-realistic talking head and cross-lingual TTS (Text-to-Speech) synthesis, are combined seamlessly. 
The Mandarin Chinese TTS was trained with 1 hour of the speaker\u2019s English data. The synthesized Chinese is then lip-synced with the English speaker\u2019s 3D photo-realistic talking head, by matching corresponding inter-language lip articulations between the English speaker and a reference Chinese speaker. We predict trajectories of the talking head with a statistically trained Hidden Markov Model (HMM) and render natural facial expressions and lip movements time-synchronously with the corresponding speech. The prototype is useful for applications like speech-to-speech translation, voice agents, gaming, tele-presence, and computer-assisted language learning.<\/p>\n<p><span id=\"02f7f470-586e-4310-bcb7-8d4468f3a4ce\" class=\"ImageBlock fn\"><img decoding=\"async\" id=\"Image02f7f470-586e-4310-bcb7-8d4468f3a4ce\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/photo-real_talking_head-pic6.png\" alt=\"\" \/><span id=\"ImageCaption02f7f470-586e-4310-bcb7-8d4468f3a4ce\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><\/p>\n<h3>Video Demo:<\/h3>\n<ul>\n<li>\n<div><a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/photo-real_talking_head-craigengv3.wmv\">Demo 10: Craig Mundie&#8217;s talking head speaks in English.<\/a><\/div>\n<\/li>\n<li>\n<div><a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/photo-real_talking_head-craigchnv2.wmv\">Demo 11: Craig Mundie&#8217;s talking head speaks in Chinese.<\/a><\/div>\n<\/li>\n<li>\n<div><a title=\"\" href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/photo-real_talking_head-craig_jpn.wmv\" target=\"_self\">Demo 12: Craig Mundie&#8217;s talking head speaks in Japanese (for fun ^_^).<\/a><\/div>\n<\/li>\n<\/ul>\n<h3>News Coverage:<\/h3>\n<ul>\n<li>PCWorld: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener 
noreferrer\" target=\"_blank\" href=\"http:\/\/www.pcworld.com\/article\/240642\/how_microsoft_research_helped_craig_mundie_speak_chinese.html\">How Microsoft Research helped Craig Mundie speak Chinese<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<\/ul>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/li>\n\t\t<li class=\"m-0\" data-wp-context='{\"id\":\"accordion-content-14\"}' data-wp-init=\"callbacks.init\">\n\t\t<div class=\"accordion-header\">\n\t\t\t<button\n\t\t\t\taria-controls=\"accordion-content-14\"\n\t\t\t\tclass=\"btn btn-collapse\"\n\t\t\t\tdata-wp-bind--aria-expanded=\"state.isExpanded\"\n\t\t\t\tdata-wp-on--click=\"actions.onClick\"\n\t\t\t\tid=\"accordion-button-13\"\n\t\t\t\ttype=\"button\"\n\t\t\t>\n\t\t\t\tA New Language Independent, Photo-realistic Talking Head Driven by Voice Only\t\t\t<\/button>\n\t\t<\/div>\n\t\t<div\n\t\t\taria-labelledby=\"accordion-button-13\"\n\t\t\tclass=\"msr-accordion__content\"\n\t\t\tdata-wp-bind--inert=\"!state.isExpanded\"\n\t\t\tdata-wp-run=\"callbacks.run\"\n\t\t\tid=\"accordion-content-14\"\n\t\t>\n\t\t\t<div class=\"msr-accordion__body\">\n\t\t\t\t<h3>abstract<\/h3>\n<p>We present a high-fidelity, speech-to-lips conversion talking head that requires no linguistic knowledge of the input speech. 
A context-dependent, multi-layer Deep Neural Network (DNN) is first trained with an error back-propagation procedure over thousands of hours of speaker-independent data. A highly discriminative mapping between acoustic speech input and 9k tied states is thus established. Additionally, an HMM-based lips motion synthesizer is trained over a speaker\u2019s audio\/visual data, where each state is statistically mapped to its corresponding lips images. At test time, for a given speech input, the DNN predicts likely states in terms of their posterior probabilities. Photorealistic lips animation is then rendered through the DNN-predicted state lattice with the HMM lips motion synthesizer. In addition to speaker independence, the DNN can also be trained language-independently for corresponding gaming or telepresence applications.<\/p>\n<p><span id=\"98e6af65-15ba-4a00-ac4b-8e69b61874a7\" class=\"ImageBlock fn\"><img decoding=\"async\" id=\"Image98e6af65-15ba-4a00-ac4b-8e69b61874a7\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/voice_driven_talking_head-fig1.png\" alt=\"\" \/><span id=\"ImageCaption98e6af65-15ba-4a00-ac4b-8e69b61874a7\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><\/p>\n<h3>Video Demo<\/h3>\n<table>\n<tbody>\n<tr>\n<td>\n<p align=\"center\">English (en-US)<\/p>\n<\/td>\n<td>\n<p align=\"center\">Chinese (zh-CN)<\/p>\n<\/td>\n<td>\n<p align=\"center\">Japanese (ja-JP)<\/p>\n<\/td>\n<td>\n<p align=\"center\">Spanish (es-ES)<\/p>\n<\/td>\n<td>\n<p align=\"center\">French (fr-FR)<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"center\"><a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/voice_driven_talking_head-0000000004_en-us.mp4\" target=\"_self\">MP4<\/a><\/p>\n<\/td>\n<td>\n<p align=\"center\"><a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/voice_driven_talking_head-0000000001_zh-cn.mp4\" target=\"_self\">MP4<\/a><\/p>\n<\/td>\n<td>\n<p 
align=\"center\"><a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/voice_driven_talking_head-100004_ja-jp.mp4\" target=\"_self\">MP4<\/a><\/p>\n<\/td>\n<td>\n<p align=\"center\"><a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/voice_driven_talking_head-0000000001_es-es.mp4\" target=\"_self\">MP4<\/a><\/p>\n<\/td>\n<td>\n<p align=\"center\"><a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/voice_driven_talking_head-0000000004_fr-fr.mp4\" target=\"_self\">MP4<\/a><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"center\"><a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/voice_driven_talking_head-0000000008_en-us.mp4\" target=\"_self\">MP4\u00a0<\/a><\/p>\n<\/td>\n<td>\n<p align=\"center\"><a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/voice_driven_talking_head-0000000004_zh-cn.mp4\" target=\"_self\">MP4<\/a><\/p>\n<\/td>\n<td>\n<p align=\"center\"><a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/voice_driven_talking_head-100007_ja-jp.mp4\" target=\"_self\">MP4\u00a0<\/a><\/p>\n<\/td>\n<td>\n<p align=\"center\"><a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/voice_driven_talking_head-0000000005_es-es.mp4\" target=\"_self\">MP4<\/a><\/p>\n<\/td>\n<td>\n<p align=\"center\"><a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2016\/02\/voice_driven_talking_head-0000000008_fr-fr.mp4\" target=\"_self\">MP4<\/a><\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/li>\n\t\t<li class=\"m-0\" data-wp-context='{\"id\":\"accordion-content-16\"}' data-wp-init=\"callbacks.init\">\n\t\t<div class=\"accordion-header\">\n\t\t\t<button\n\t\t\t\taria-controls=\"accordion-content-16\"\n\t\t\t\tclass=\"btn 
btn-collapse\"\n\t\t\t\tdata-wp-bind--aria-expanded=\"state.isExpanded\"\n\t\t\t\tdata-wp-on--click=\"actions.onClick\"\n\t\t\t\tid=\"accordion-button-15\"\n\t\t\t\ttype=\"button\"\n\t\t\t>\n\t\t\t\tTalking Robot\t\t\t<\/button>\n\t\t<\/div>\n\t\t<div\n\t\t\taria-labelledby=\"accordion-button-15\"\n\t\t\tclass=\"msr-accordion__content\"\n\t\t\tdata-wp-bind--inert=\"!state.isExpanded\"\n\t\t\tdata-wp-run=\"callbacks.run\"\n\t\t\tid=\"accordion-content-16\"\n\t\t>\n\t\t\t<div class=\"msr-accordion__body\">\n\t\t\t\t<h3>abstract<\/h3>\n<p>In this work, we turn our high-quality, 3D photo-realistic talking head into a talking robot. Instead of displaying the 3D talking head on a flat-screen display, our new 3D physical robot has its 2D rendered image sequence projected onto the robot\u2019s plastic face. The 3D talking robot has photo-realistic facial animation that is lip-synced with the corresponding audio speech signals. The system consists of three components: a plastic face mask for the robot, a mini-projector that back-projects rendered video images onto the plastic mask, and a laptop computer that renders high-quality audio\/video for any given text input. 
The technology can drive different robots for many natural and user friendly applications.<\/p>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/li>\n\t\t\t\t\t\t<\/ul>\n\t<\/div>\n\t\n","protected":false},"excerpt":{"rendered":"","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13554],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-170475","msr-project","type-msr-project","status-publish","hentry","msr-research-area-human-computer-interaction","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"2011-11-25","related-publications":[160942,160943,160944,160945,160946,162082,162083,162084,162087,162088,162089,162090],"related-downloads":[],"related-videos":[185942,187076,187105],"related-groups":[],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[],"slides":[],"related-researchers":[{"type":"user_nicename","value":"lijuanw","display_name":"Lijuan Wang","author_link":"<a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/people\/lijuanw\/\" aria-label=\"Visit the profile page for Lijuan Wang\">Lijuan Wang<\/a>","is_active":false,"user_id":32680,"last_first":"Wang, 
Lijuan","people_section":0,"alias":"lijuanw"}],"msr_research_lab":[199560],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/170475","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":2,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/170475\/revisions"}],"predecessor-version":[{"id":388403,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/170475\/revisions\/388403"}],"wp:attachment":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/media?parent=170475"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=170475"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=170475"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=170475"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=170475"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}