{"id":586960,"date":"2019-05-15T08:00:29","date_gmt":"2019-05-15T15:00:29","guid":{"rendered":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/?p=586960"},"modified":"2019-06-17T11:16:26","modified_gmt":"2019-06-17T18:16:26","slug":"speech-and-language-the-crown-jewel-of-ai-with-dr-xuedong-huang","status":"publish","type":"post","link":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/podcast\/speech-and-language-the-crown-jewel-of-ai-with-dr-xuedong-huang\/","title":{"rendered":"Speech and language: the crown jewel of AI with Dr. Xuedong Huang"},"content":{"rendered":"<p><a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2019\/05\/Xuedong-Huang_Podcast_Site_05_2019_1400x788.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-586963 size-large\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2019\/05\/Xuedong-Huang_Podcast_Site_05_2019_1400x788-1024x576.png\" alt=\"Dr. Xuedong Huang\" width=\"1024\" height=\"576\" srcset=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2019\/05\/Xuedong-Huang_Podcast_Site_05_2019_1400x788-1024x576.png 1024w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2019\/05\/Xuedong-Huang_Podcast_Site_05_2019_1400x788-300x169.png 300w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2019\/05\/Xuedong-Huang_Podcast_Site_05_2019_1400x788-768x432.png 768w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2019\/05\/Xuedong-Huang_Podcast_Site_05_2019_1400x788-1066x600.png 1066w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2019\/05\/Xuedong-Huang_Podcast_Site_05_2019_1400x788-655x368.png 655w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2019\/05\/Xuedong-Huang_Podcast_Site_05_2019_1400x788-343x193.png 343w, 
https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2019\/05\/Xuedong-Huang_Podcast_Site_05_2019_1400x788.png 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><\/p>\n<h3>Episode 76, May 15, 2019<\/h3>\n<p>When was the last time you had a meaningful conversation with your computer\u2026 and felt like it truly understood you? Well, if <a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/people\/xdh\/\">Dr. Xuedong Huang<\/a>, a Microsoft Technical Fellow and head of Microsoft\u2019s Speech and Language group, is successful, you will. And if his track record holds true, it\u2019ll be sooner than you think!<\/p>\n<p>On today\u2019s podcast, Dr. Huang talks about his role as Microsoft\u2019s Chief Speech Scientist, gives us some inside details on the latest milestones in speech and language technology, and explains how mastering speech recognition, translation and conversation will move machines further along the path from \u201cperceptive AI\u201d to \u201ccognitive AI\u201d and that much closer to truly human intelligence.<\/p>\n<h3>Related:<\/h3>\n<ul type=\"disc\">\n<li><a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/podcast\">Microsoft Research Podcast<\/a>: View more podcasts on Microsoft.com<\/li>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/itunes.apple.com\/us\/podcast\/microsoft-research-a-podcast\/id1318021537?mt=2\">iTunes<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>: Subscribe and listen to new podcasts each week on iTunes<\/li>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/subscribebyemail.com\/www.blubrry.com\/feeds\/microsoftresearch.xml\">Email<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>: Subscribe and listen by email<\/li>\n<li><a 
class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/subscribeonandroid.com\/www.blubrry.com\/feeds\/microsoftresearch.xml\">Android<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>: Subscribe and listen on Android<\/li>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/open.spotify.com\/show\/4ndjUXyL0hH1FXHgwIiTWU\">Spotify<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>: Listen on Spotify<\/li>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.blubrry.com\/feeds\/microsoftresearch.xml\">RSS feed<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/note.microsoft.com\/ww-registration-microsoft-research-newsletter-s.html?wt.mc_id=S-webpage_podcast\">Microsoft Research Newsletter<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>: Sign up to receive the latest news from Microsoft Research<\/li>\n<\/ul>\n<hr \/>\n<h3>Transcript<\/h3>\n<p>Xuedong Huang: At some point, let\u2019s say computers can understand three hundred languages, can fluently communicate and converse. I have not run into a person who can speak three hundred languages. And not only machines can fluently communicate and converse, but can comprehend, understand and learn and reason and can really finish all the PhD courses in all subjects. The knowledge acquisition, reasoning, is beyond anyone\u2019s individual capability. 
When that moment is here, you can think about how intelligent that AI is going to be.<\/p>\n<p><strong>Host: You\u2019re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I\u2019m your host, Gretchen Huizinga.<\/strong><\/p>\n<p><strong>Host: When was the last time you had a meaningful conversation with your computer\u2026 and felt like it truly understood you? Well, if Dr. Xuedong Huang, a Microsoft Technical Fellow and head of Microsoft\u2019s Speech and Language group, is successful, you will. And if his track record holds true, it\u2019ll be sooner than you think!<\/strong><\/p>\n<p>On today\u2019s podcast, Dr. Huang talks about his role as Microsoft\u2019s Chief Speech Scientist, gives us some inside details on the latest milestones in speech and language technology, and explains how mastering speech recognition, translation and conversation will move machines further along the path from \u201cperceptive AI\u201d to \u201ccognitive AI\u201d and that much closer to truly human intelligence. That and much more on this episode of the Microsoft Research Podcast.<\/p>\n<p><strong>Host: Xuedong Huang, welcome to the podcast.<\/strong><\/p>\n<p>Xuedong Huang: Thank you.<\/p>\n<p><strong>Host: You are a Microsoft Technical Fellow in the speech and language group, and you lead Microsoft\u2019s spoken language efforts. So, we\u2019re going to talk in depth about these in a bit, but first, as the company\u2019s Chief Speech Scientist, give us a general view of what you do for a living and why you do it. What gets you up in the morning?<\/strong><\/p>\n<p>Xuedong Huang: Well, what we do is really make sure we have the best speech and language technology that can be used to empower a wide range of scenarios. 
The reason we have a group to do that is really I feel that, you know, this is not only the most natural way for people to communicate, as we\u2019re doing right now, but it\u2019s really the hardest AI challenge we\u2019re facing. So, that\u2019s what we do, trying to really drive breakthroughs, deliver these awesome services on our cloud, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/azure.microsoft.com\/en-us\/\">Azure Services<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and make sure we are satisfying a wide range of customers both inside Microsoft and outside of Microsoft. There are three things, really, if you want to frame this whole thing.<\/p>\n<p><strong>Host: Yeah.<\/strong><\/p>\n<p>Xuedong Huang: The first, we have the horsepower to really drive speech recognition accuracy. To drive the naturalness of our synthesis effort. To make sure translation quality is accurate when you translate from English to Chinese or French or German. So, there\u2019s really a lot of science behind that, making sure the accuracy, naturalness, latency, they are really world-class. So that\u2019s one. The second one is really, we not only provide technology, we deliver services on Azure. That from Office to Windows, Cortana, they are all depending on the same cloud services. And we also have edge devices like our speech device, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/cognitive-services\/speech-service\/get-speech-devices-sdk\">SDK<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. 
So, we want to make sure the speech on the edge and the cloud, they are really delivered in the modern fashion.<\/p>\n<p><strong>Host: Mm-hmm.<\/strong><\/p>\n<p>Xuedong Huang: That\u2019s the platform in the cloud and embedded. So, that\u2019s the second: the platform is modern. The third one is really, to show our love to the customer, because we have wide range of customers worldwide. We want to really delight and make sure our customer experience using speech translation is top notch.<\/p>\n<p><strong>Host: Yeah.<\/strong><\/p>\n<p>Xuedong Huang: That\u2019s actually the three key things I do: AI horsepower, modernize our platform in the cloud and on the edge, and love our customers.<\/p>\n<p><strong>Host: Well, and you\u2019ve got a lot of teams working in these groups to tackle each of these \u201cpillars\u201d we might call them.<\/strong><\/p>\n<p>Xuedong Huang: Yes. We have teams worldwide as well.<\/p>\n<p><strong>Host: Yeah.<\/strong><\/p>\n<p>Xuedong Huang: And so, the diversity is amazing because we are really trying to address the language barriers.<\/p>\n<p><strong>Host: Yeah.<\/strong><\/p>\n<p>Xuedong Huang: Trying to remove the language barriers. So, we do have teams in China. We have teams in Germany, in Israel, in India and in the US, of course. So, we really work around the globe trying to deal with these language challenges.<\/p>\n<p><strong>Host: So, I want to start by quoting you to set the stage for our conversation today. You said, \u201cSpeech and language is the crown jewel of AI.\u201d So, unpack that for us.<\/strong><\/p>\n<p>Xuedong Huang: Mm-hmm. Well, we can think in the scale of human\u2019s evolution. At some point, the language was born. That accelerated human\u2019s evolution. 
If you think about all the animals on this planet, you know, there are animals running faster than humans, they can see better\u2026<\/p>\n<p><strong>Host: Their teeth are sharper.<\/strong><\/p>\n<p>Xuedong Huang: \u2026especially in the night.<\/p>\n<p><strong>Host: They\u2019re stronger.<\/strong><\/p>\n<p>Xuedong Huang: Yep. They can actually hear better, smell better\u2026 Only we, humans, have the language. We can organize better. We can describe in science-fiction terms. We can really organize ourselves, create a constitution. So, if you look at the humans, it is speech and language that set us apart from other animals. For artificial intelligence, speech and language would drive the evolution of AI, just like it did to humans. That\u2019s why it\u2019s the crown jewel of AI.<\/p>\n<p><strong>Host: All right.<\/strong><\/p>\n<p>Xuedong Huang: And it\u2019s a tough one to crack.<\/p>\n<p><strong>Host: Yeah. There\u2019s a whole philosophical discussion on that topic alone, but it leads to some interesting questions about, you know, if you are wildly successful with machine language, what are these machines?<\/strong><\/p>\n<p>Xuedong Huang: So, let\u2019s just actually, you know, set our imagination\u2026<\/p>\n<p><strong>Host: Yeah, let\u2019s do.<\/strong><\/p>\n<p>Xuedong Huang: \u2026off a little bit, right? At some point, let\u2019s say computers can understand three hundred languages, can fluently communicate and converse. I have not run into a person who can speak three hundred languages. And not only machines can fluently communicate and converse, but can comprehend, understand and learn and reason and can really finish all the PhD courses in all subjects. The knowledge acquisition, reasoning, is beyond anyone\u2019s individual capability. 
When that moment is here, you can think about how intelligent that AI is going to be.<\/p>\n<p><strong>Host: Is this something you envision?<\/strong><\/p>\n<p>Xuedong Huang: Yes.<\/p>\n<p><strong>Host: Do we want that?<\/strong><\/p>\n<p>Xuedong Huang: Yes. I think this world will be a much better place. I was in Japan just a few weeks ago, carrying <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/translator.microsoft.com\/\">Microsoft Translator<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> on my mobile devices. I was able to really communicate with Japanese who do not speak Chinese or English. It\u2019s already there. Microsoft translator can speak the language I do not speak and help me to be more productive when I was in Japan.<\/p>\n<p><strong>Host: So, I\u2019m all about that. It just scares me a little bit to think about a machine\u2026 \u201cwe weren\u2019t first, we\u2019re not last, we\u2019re just next\u2026\u201d<\/strong><\/p>\n<p>Xuedong Huang: But you know, there are two levels of intelligence. The first level is really perceptive intelligence. That is the ability to see, to hear, to smell. Then the high level is cognitive intelligence. That is the ability to reason, to learn and to acquire knowledge. Most of the AI breakthroughs we have today, they are in the perceptive level such as speech recognition, speech synthesis, computer vision. But this high-level reasoning and knowledge acquisition, cognitive capability, is still far from being close to human\u2019s level.<\/p>\n<p><strong>Host: Right<\/strong><\/p>\n<p>Xuedong Huang: And what I\u2019m excited about translation, it is really something between perceptive intelligence and cognitive intelligence. 
And the fact that we are actually able to really build the success on the perceptive intelligence and expand into cognitive intelligence is quite a journey.<\/p>\n<p><strong>Host: Right.<\/strong><\/p>\n<p>Xuedong Huang: And uh, I do not know when we are going to reach that milestone. But that one is coming. It\u2019s just a matter of time. It could take fifty years, but I think it is going to happen.<\/p>\n<p><strong>Host: We\u2019ll have to come back for another podcast to talk about that milestone because we\u2019re going to talk about a couple of milestones in a minute. But first I want to do a little bit of backtracking, because you\u2019ve been around for a while and you started in Microsoft Research right about the time Rick Rashid was setting the organization up and speech was one of the first groups that was formed. And according to MSR lore, the goal of the group was to \u201cmake speech mainstream.\u201d So, give us a brief history of speech at MSR. How has the research gone from \u201cnot mainstream\u201d in those early \u201ctake risks and look far out days\u201d to being a presence in nearly every Microsoft product today?<\/strong><\/p>\n<p>Xuedong Huang: Before I joined Microsoft Research, I was also on the faculty at CMU in Pittsburgh. So, Rick Rashid was a professor there. I was a junior faculty member. So, I was doing my research, mostly at CMU, on speech. Microsoft reached out and they wanted to set up a speech group. So, I moved, actually, on the first day of 1993, after New Year\u2019s break. I flew from Pittsburgh to Seattle and started that journey and never changed. So, that was the beginning of Microsoft Speech. We were the research group that really started working on bringing speech to the developers.<\/p>\n<p><strong>Host: Right.<\/strong><\/p>\n<p>Xuedong Huang: So\u2026<\/p>\n<p><strong>Host: Not just blue-sky research anymore\u2026<\/strong><\/p>\n<p>Xuedong Huang: Not blue-sky research. So, we licensed technology from CMU. 
That\u2019s how we started. So, we\u2019re very grateful to CMU\u2019s pioneering research in this area. So, we were the research group, but we delivered the first speech API, SAPI, on Windows \u201995. As a research group, we were pretty proud of that because usually research is doing only blue-sky research. We not only did blue-sky research, we continued to push the envelope, continued to improve the recognition accuracy, but we also worked with Windows, brought that technology to Windows developers. So, SAPI was the first speech API in the industry on Windows.<\/p>\n<p><strong>Host: Wow.<\/strong><\/p>\n<p>Xuedong Huang: And that was really quite a journey. And then, I eventually left research, joined the product group. I took the team! And it was also an exceptional Microsoft speech research group that came with me. Went to the product group. So, this has been really a fascinating twenty-seven years\u2019 experience at Microsoft. I stopped doing speech after 2004, after we shipped the speech server, and I started many different things including running the incubation for research as a startup.<\/p>\n<p><strong>Host: Yeah.<\/strong><\/p>\n<p>Xuedong Huang: And I also worked as an architect for Satya Nadella when he was running Bing.<\/p>\n<p><strong>Host: Okay.<\/strong><\/p>\n<p>Xuedong Huang: And then, when Harry was running the Research and Technology group, I was helping incubate a wide range of AI projects from foundational pieces like a GPU cluster, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/phillywiki.azurewebsites.net\/\">Project Philly<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, the deep learning tool kit, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/docs.microsoft.com\/en-us\/cognitive-toolkit\/\">CNTK<span 
class=\"sr-only\"> (opens in new tab)<\/span><\/a>. And of course, speech research, all the way to the high-end solutions like customer care intelligence.<\/p>\n<p><strong>Host: Yeah.<\/strong><\/p>\n<p>Xuedong Huang: And about three years ago, I had the privilege to return to run a combined Speech and Language group. So basically, we were able to consolidate all the resources working on speech and translation and that was the story, really, you know, the journey of my experience. A fascinating twenty-seven years.<\/p>\n<p><strong>Host: Where does Speech and Language live right now?<\/strong><\/p>\n<p>Xuedong Huang: So, as I said, we moved back and forth multiple times between research and product groups. Right now, we are sitting in the Cloud and AI group. This is a product group. We\u2019re part of these cloud services and we provide company-wide and industry-wide speech and translation services. We also have speech and dialog research. They are really operating like a research group.<\/p>\n<p><strong>Host: Yeah.<\/strong><\/p>\n<p>Xuedong Huang: They\u2019re all researchers in that team. As what Rick has been saying, tech transfer is a full-contact sport. We are not just, you know, a full-contact sport, we\u2019re one body sport. So, it\u2019s actually a very exciting group, with a group of very talented, very innovative people.<\/p>\n<p><strong>Host: So, it\u2019s still forward-thinking in the research mode\u2026<\/strong><\/p>\n<p>Xuedong Huang: It\u2019s both forward-thinking and well-grounded. We have to be grounded to deliver services from infrastructure to cost of serving, and we also have to be standing high to see the future, to define what is the solution that the people need and people want, even though the solution might not have existed and they may not know what it is at this moment.<\/p>\n<p>(music plays)<\/p>\n<p><strong>Host: Well, let\u2019s talk about some specific research milestones that you\u2019ve been involved in. 
They are really interesting. Three areas you\u2019ve been involved in: conversational speech recognition, machine translation and conversational Q&A. So, let\u2019s start with the recognition. In 2016, you led a team that reached historical human parity in transcribing conversational speech. Tell us about this. What was it a part of, how did it come about?<\/strong><\/p>\n<p>Xuedong Huang: So, in 2016, we reached human parity on the broadly used <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www1.icsi.berkeley.edu\/Speech\/stp\/description.html\">Switchboard Conversational Transcription<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> task. That task has been used in the research community and industry probably over ten years. In 2017, we redefined the human parity milestone, so we\u2019re not competing with only one single person, we\u2019re competing with a group of people to transcribe the same task. So, I would say 2017 is really a historical moment. In comparison to a group of people transcribing the same task, Microsoft Speech Stack outperformed all four teams combined together. When I challenged our research group, nobody thought that was even feasible. But in less than two years, amazingly, when we had the conviction and the resource and the focus, magic indeed happened. So, that was actually a fantastic moment for the team, for science, for the technology stack. That was the first human parity milestone for my personal professional career.<\/p>\n<p><strong>Host: So, I want to go in the weeds a little bit on this because this is interesting what you say, in two years, nobody thought it was possible and then you did it. 
Tell us a little more about the technical aspects of how you accomplished this.<\/strong><\/p>\n<p>Xuedong Huang: So, if you look at the history of speech research, the speech group pioneered many breakthroughs that got reused by others. Let\u2019s take translation as an example. So, even for speech, in the early 70s, the speech recognition used more traditional AI, like rule-based approach, expert system. And IBM Watson research pioneered statistical speech recognition, using <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/Hidden_Markov_model\">Hidden Markov Model<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, using, you know, statistical language model. They really pushed the envelope and advanced the field. So, that was a great moment. It was the same group of IBM speech researchers, they borrowed the same idea from speech recognition, applied that to translation. They rewrote translation history. Really advanced the quality of translation substantially. And after Hidden Markov Model, it was deep learning that started with speech recognition, neural speech recognition. And once again, translation borrowed the same thing with neural machine translation and also advanced. So, you can see the mirror of using technology speech people pioneered. Actually, speech guys have been doing this, you know, systematic benchmarking, funded by DARPA, very rigorous evaluation, that really changed how science and engineering could be evaluated.<\/p>\n<p><strong>Host: Right.<\/strong><\/p>\n<p>Xuedong Huang: So, there are many broad lessons from the speech technology community that could have been used broadly, beyond speech. So, we got trained to deal with tough problems. 
It\u2019s no wonder the same group of people could have achieved this historic milestone.<\/p>\n<p><strong>Host: Well, let\u2019s talk about another human parity milestone: the automatic Chinese to English news translation for the WMT-2017 task. And I had Arul Menezes on the show to talk all about that. But I\u2019d love your perspective on whether and how \u2013 this kind of goes back to what we talked about at the beginning \u2013 whether and how you think machines can now compare to traditional human translation services and why this work is an important breakthrough for barriers between people and cultures.<\/strong><\/p>\n<p>Xuedong Huang: So, the second human parity breakthrough from my team is equally exciting. As I said, transcribing Switchboard Conversational Speech is a great milestone. But it\u2019s really at the very low level, at the perceptive AI level. Translation is a task that is between perceptive AI and cognitive AI. Of course, translation is a harder task, and nobody believed we could have achieved this. So, we set a goal: in five years, let\u2019s see if we can achieve translation human parity on the sentence by sentence basis. So, I want to really put that condition here. When human, like you and me translate, we are looking at the whole paragraph, we have the broader context, we do a better job. So, we limited ourselves because, for the broader use, the WMT, which is just news translation measured on the sentence by sentence level\u2026<\/p>\n<p><strong>Host: Um-hum.<\/strong><\/p>\n<p>Xuedong Huang: \u2026and so, it\u2019s a broadly open research, public benchmark. Even for that one, we thought it could have taken five years. So, we applied the same principle: do it on the success we had on transcribing Switchboard Speech Recognition. But this time, we actually went one step beyond. We partnered with Microsoft Research Group in Beijing because it\u2019s a Chinese to English translation. 
So, across Pacific, multiple teams in Microsoft Research Asia, worked together days and nights. Amazingly, this group of people surprised everyone. We delivered this in less than a year, reaching human parity, an historical translation level, better than professional people on the same task, as measured by our scientists. So, this time, really, we did something magic. I\u2019m very proud of the team. I\u2019m very proud of the collaboration.<\/p>\n<p><strong>Host: Well, another super interesting area that I\u2019d love to talk about with you is what you call <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/stanfordnlp.github.io\/coqa\/\">COQA<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. And that\u2019s C-O-Q-A. Conversational Q&A. So, obviously we\u2019re talking about computers having this conversation with us, question and answer. Tell us about the work that\u2019s going on in this most human, and perhaps most difficult, of tasks in speech recognition technology.<\/strong><\/p>\n<p>Xuedong Huang: So, this task is pioneered by Stanford researchers. It\u2019s even one step closer to cognitive AI. This is really machine reading comprehension task with conversation, with dialogue, about the task. Let\u2019s say you read a paragraph. Then we challenge the reader to answer correctly with a sequence of questions that are related. For example, if you read the paragraph about Bill Gates, the first question could have been, \u201cWho is the founder of Microsoft?\u201d The second question could be related to the first one, \u201cHow old is the person when the person started?\u201d Or you could have said, \u201cAnd when the person retired, how old was he?\u201d So, that context relevancy is harder than simple machine reading comprehension because there\u2019s a sequence of related questions you have to answer, given the context. 
So, for this latest breakthrough, and I have to give credit mostly to our colleagues in Beijing research lab, we have been pioneering this working together using shared resources and the infrastructure. It\u2019s just amazing. I\u2019m so impressed with the agility and the speed we have to achieve this amazing conversational question and answering challenge. So, the leading researchers, they are all in Beijing, will play a great and supporting role, helping Microsoft, once again, be the first to achieve human parity on this broadly watched AI task. Nobody believed anyone could have achieved this conversational Q&A human parity in such a short time. And so, we thought it might take two years. Once again, we broke historical record.<\/p>\n<p><strong>Host: Well, we\u2019ve talked a little bit about the more technical aspects of what you are doing and how you are doing this. So, on this last one, are there any other methodologies or techniques that you brought to the table to conquer this Q&A task?<\/strong><\/p>\n<p>Xuedong Huang: So, Microsoft has accumulated thirty years of research and experiences in AI, right? The natural language group in Beijing, they have been doing this in the last twenty years and they have accumulated lots of talents, a lot of experiences. And we basically use deep learning and transfer learning. Also, we built our success on top of the whole community.<\/p>\n<p><strong>Host: Mm-hmm.<\/strong><\/p>\n<p>Xuedong Huang: For example, Google, they delivered this fascinating technology called BERT. And\u2026<\/p>\n<p><strong>Host: Is that an acronym?<\/strong><\/p>\n<p>Xuedong Huang: Yes, it\u2019s an acronym. It\u2019s embedding technology. We built the success on top of that, expanded that. That\u2019s how we achieved the human parity breakthrough.<\/p>\n<p><strong>Host: Mm-hmm.<\/strong><\/p>\n<p>Xuedong Huang: So, it\u2019s really a reflection of the collective community. 
And I talked about the collaboration between Microsoft Research in Asia and our team in the US. Actually, this is a great example of collaboration of the whole industry.<\/p>\n<p>(music plays)<\/p>\n<p><strong>Host: On the heels of everything that could possibly go right \u2013 and it\u2019s pretty exciting what you\u2019ve described to us in this podcast \u2013 we do have to address what could possibly go wrong, if you are successful.<\/strong><\/p>\n<p>Xuedong Huang: Mm-hmm.<\/p>\n<p><strong>Host: You want to enable computers to listen, hear, speak, translate, answer questions \u2013 basically, communicate \u2013 with people. Does anything about that keep you up at night?<\/strong><\/p>\n<p>Xuedong Huang: Yes, absolutely. My worry is really, someday, humans can be too dependent on AI. And AI will never be perfect. AI would have a unique sort of biases. So, I worry about that unconscious influence.<\/p>\n<p><strong>Host: Right.<\/strong><\/p>\n<p>Xuedong Huang: So, how to deal with that is really a broad societal issue that we have to be aware and we have to address. Because just like anyone, if you have an assistant you depend on, you absolutely know how much that assistant can influence you, change your agenda, change your opinion. And AI, one day, is going to play the same role. AI will be biased. And how do we deal with that is my top concern.<\/p>\n<p><strong>Host: Yeah.<\/strong><\/p>\n<p>Xuedong Huang: If everything goes well. That is really, you know, a top issue we have to deal with. We have to learn how to deal with it. We do not know because we are not there yet.<\/p>\n<p><strong>Host: So, what kinds of \u201cdesign thinking\u201d are you bringing to this as you build these tools that can speak and listen and converse, because one of the biggest things is that human ability to impute human qualities to something that\u2019s not human\u2026<\/strong><\/p>\n<p>Xuedong Huang: I think just, you know, there are enough responsible people working on AI. 
And the good news is that we\u2019re not there yet, right? So, we have time to work together to deal with that and make sure AI is going to really serve mankind, not destroy mankind. So that\u2019s my top worry\u2026<\/p>\n<p><strong>Host: Yeah.<\/strong><\/p>\n<p>Xuedong Huang: \u2026what keeps me awake. But my short-term worry is really that AI is not good enough! Not yet!<\/p>\n<p><strong>Host: Okay.<\/strong><\/p>\n<p>Xuedong Huang: And, as Bill Gates used to say, people always overestimate what they can do in the short term and underestimate the impact in the long term. In this case, we cannot underestimate the long-term impact.<\/p>\n<p><strong>Host: Right.<\/strong><\/p>\n<p>Xuedong Huang: The long-term milestone.<\/p>\n<p><strong>Host: Okay. It\u2019s story time.<\/strong><\/p>\n<p>Xuedong Huang: Mmmm. Good!<\/p>\n<p><strong>Host: Tell us a bit about your life. What\u2019s your story? What got you interested in research, particularly the speech and language technology research, and what was your path to MSR?<\/strong><\/p>\n<p>Xuedong Huang: Good. Um, I was a graduate student at Tsinghua University in Beijing. At that time, my first computer was an Apple II. Because, you know, the Chinese language is not easy to type. It was very cumbersome. So, that necessity brought me to speech recognition. My dream at that time, as a graduate student at Tsinghua, actually was in AI. And Tsinghua\u2019s, you know, graduate school\u2026<\/p>\n<p><strong>Host: Yeah.<\/strong><\/p>\n<p>Xuedong Huang: \u2026was fantastic \u2013 to have, you know, so many professors and faculty members who had that long-term vision and set up the pioneering environment for us to explore and experiment in. So, I finished my master\u2019s degree. I was in the PhD program, and I had been working on speech recognition since \u201982 because I was admitted to Tsinghua in 1982. That dream, to make it easier for people to really communicate with machines, never disappeared. 
So, I have been working on this for over thirty years. Even though, at Microsoft, for a short period of time, I stepped out of speech, I was still doing something related. So, it\u2019s really a fascinating story. And I have a really interesting personal story. As I said, you know, it was hard to type in Chinese when I was at Tsinghua University. And I didn\u2019t finish my PhD at Tsinghua. I went to the University of Edinburgh\u2026<\/p>\n<p><strong>Host: Okay.<\/strong><\/p>\n<p>Xuedong Huang: \u2026in Scotland. And I did finish my PhD there. But my personal pain point when I first landed in Edinburgh was really \u2013 I learned English, mostly American English, in China. It wasn\u2019t that good because it wasn\u2019t my native language. But listening to a Scottish professor\u2026<\/p>\n<p><strong>Host: Oh, my goodness!<\/strong><\/p>\n<p>Xuedong Huang: \u2026talking was always challenging. But I was so grateful the BBC had closed captioning.<\/p>\n<p><strong>Host: Oh, funny.<\/strong><\/p>\n<p>Xuedong Huang: So, I really learned my Scottish English from watching the BBC. And I have to say, that automatic captioning technology is available in Microsoft PowerPoint today. 
And that journey, from a personal pain point to what the Office PowerPoint teams can bring together, is fascinating and personally extremely rewarding.<\/p>\n<p><strong>Host: Yeah.<\/strong><\/p>\n<p>Xuedong Huang: I\u2019m so grateful to see the technology I have worked on is going to help many other people who are attending Scottish universities!<\/p>\n<p><strong>Host: You know, Arul talked about that PowerPoint\u2026<\/strong><\/p>\n<p>Xuedong Huang: Yeah.<\/p>\n<p><strong>Host: \u2026service and he was talking about people who had hearing disabilities.<\/strong><\/p>\n<p>Xuedong Huang: Mm-hmm.<\/p>\n<p><strong>Host: You give it a whole new\u2026<\/strong><\/p>\n<p>Xuedong Huang: It\u2019s much broader\u2026<\/p>\n<p><strong>Host: Oh, absolutely!<\/strong><\/p>\n<p>Xuedong Huang: \u2026because the language barrier is always there. Not everyone is as fluent. And I host many visitors. Almost every year I host Tsinghua University MBA students, and they all learn English, but their ability to converse and listen simply is not as good as that of native speakers here. So, the simple fact that we were able to provide captioning on the PowerPoint presentation helped all of them\u2026<\/p>\n<p><strong>Host: Yeah.<\/strong><\/p>\n<p>Xuedong Huang: \u2026to learn and understand much better. So, this is actually a fairly broad scenario without even translating. Just by having captioning, we enhance the communication.<\/p>\n<p><strong>Host: Right. And you know, we talked earlier about the different languages and we talked a little bit about dialects, but we didn\u2019t really talk about accents within a language. I mean, even in the United States, you go to various parts of the country and have a more difficult time understanding people, even from your own country, just because of the accent.<\/strong><\/p>\n<p>Xuedong Huang: That\u2019s why my Scottish English is a good story! And I hope I still have a little bit of a Scottish accent!<\/p>\n<p><strong>Host: I hear it! 
Well at the end of every podcast, I give my guests the last word. And since you are in human language technologies, it\u2019s particularly apropos for you. Now\u2019s your chance to say whatever you want to our listeners who might be interested in enabling computers to converse and communicate. What ought they to put boots on for?<\/strong><\/p>\n<p>Xuedong Huang: Working on speech and language! This is really the crown jewel of AI. You know, there\u2019s no more challenging task than this one, in my opinion. Especially if you want to move from perceptive AI to cognitive AI. To get the ability to reason, to understand, to acquire knowledge by reading, by conversing, is just, you know, such a fundamental area that can improve everyone\u2019s life, improve everyone\u2019s productivity, make this world a much better place without language barriers, without the communication barriers, without understanding barriers.<\/p>\n<p><strong>Host: Xuedong Huang, thank you for joining us on the podcast today. It\u2019s been fantastic.<\/strong><\/p>\n<p>Xuedong Huang: My pleasure.<\/p>\n<p>(music plays)<\/p>\n<p>To learn more about Dr. Xuedong Huang and the science of machine speech and language, visit <a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/\">Microsoft.com\/research<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Episode 76, May 15, 2019 When was the last time you had a meaningful conversation with your computer\u2026 and felt like it truly understood you? Well, if Dr. Xuedong Huang, a Microsoft Technical Fellow and head of Microsoft\u2019s Speech and Language group, is successful, you will. 
And if his track record holds true, it\u2019ll be [&hellip;]<\/p>\n","protected":false},"author":38022,"featured_media":586963,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"https:\/\/player.blubrry.com\/id\/43946866\/","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[],"msr_hide_image_in_river":0,"footnotes":""},"categories":[240054],"tags":[],"research-area":[13545],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-586960","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-msr-podcast","msr-research-area-human-language-technologies","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"https:\/\/player.blubrry.com\/id\/43946866\/","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[548625,664548],"related-projects":[],"related-events":[],"related-researchers":[],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2019\/05\/Xuedong-Huang_Podcast_Site_05_2019_1400x788.png\" class=\"img-object-cover\" alt=\"Xuedong Huang wearing glasses and smiling at the camera\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2019\/05\/Xuedong-Huang_Podcast_Site_05_2019_1400x788.png 1400w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2019\/05\/Xuedong-Huang_Podcast_Site_05_2019_1400x788-300x169.png 300w, 
https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2019\/05\/Xuedong-Huang_Podcast_Site_05_2019_1400x788-768x432.png 768w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2019\/05\/Xuedong-Huang_Podcast_Site_05_2019_1400x788-1024x576.png 1024w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2019\/05\/Xuedong-Huang_Podcast_Site_05_2019_1400x788-1066x600.png 1066w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2019\/05\/Xuedong-Huang_Podcast_Site_05_2019_1400x788-655x368.png 655w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2019\/05\/Xuedong-Huang_Podcast_Site_05_2019_1400x788-343x193.png 343w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"","formattedDate":"May 15, 2019","formattedExcerpt":"Episode 76, May 15, 2019 When was the last time you had a meaningful conversation with your computer\u2026 and felt like it truly understood you? Well, if Dr. Xuedong Huang, a Microsoft Technical Fellow and head of Microsoft\u2019s Speech and Language group, is successful, 
you&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/posts\/586960","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/users\/38022"}],"replies":[{"embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/comments?post=586960"}],"version-history":[{"count":6,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/posts\/586960\/revisions"}],"predecessor-version":[{"id":593530,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/posts\/586960\/revisions\/593530"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/media\/586963"}],"wp:attachment":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/media?parent=586960"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/categories?post=586960"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/tags?post=586960"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=586960"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=586960"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=586960"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/newed.any0.dp
dns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=586960"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=586960"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=586960"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=586960"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=586960"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}