In October 2016, in a big milestone for artificial intelligence, Microsoft unveiled a system that can transcribe the contents of a phone call as well as or better than human professionals.
But while Microsoft’s system had fewer transcription errors than the average human transcriptionist, it still couldn’t best a team of trained humans. So, the world of academia fired back with a new challenge: Lower the error rate to below what human teams can do.
Now Microsoft has done just that. In a blog entry on Sunday, Xuedong Huang, Microsoft Research’s chief speech scientist, reported that the company had broken even that barrier.
It’s a major milestone, Huang wrote, and it gives the company a sound foundation to go from mere transcription to understanding the meaning of what’s being said. Speech recognition, he noted, is a fundamental building block for more robust artificial intelligence.
“Moving from recognizing to understanding speech is the next major frontier for speech technology,” Huang wrote.
Microsoft’s voice recognition system has been improving rapidly. Transcription accuracy is judged by error rate: the proportion of words a system gets wrong in a given recording of speech. That error rate is measured on Switchboard, a standard benchmark for voice transcription accuracy widely used in the industry, including by IBM and Google.
As recently as September 2016, Microsoft’s error rate on Switchboard was 6.3%, meaning that out of every 100 words, the system was getting more than six wrong. By comparison, a single human transcriptionist has an average error rate of 5.9%, and a team of trained humans clocks in at around 5.1%.
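The error-rate arithmetic above can be sketched in code. The following is a minimal, illustrative word-error-rate calculation, not Microsoft’s implementation and a simplification of how benchmark scoring normalizes text: it counts the word-level edits (substitutions, deletions, insertions) needed to turn the system’s output into the reference transcript, divided by the reference length.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by the
    number of words in the reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting every remaining reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting every remaining hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word out of ten reference words is a 10% error rate.
print(word_error_rate("the quick brown fox jumps over the lazy dog today",
                      "the quick brown fox jumped over the lazy dog today"))
# → 0.1
```

By this measure, a 6.3% rate means roughly six such edits per 100 spoken words.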
Microsoft matched the former error rate in October and just beat the latter.
That’s far sooner than the company expected. Indeed, back in 2015, Huang himself told Business Insider that building a system capable of surpassing a human at transcription was “four to five years away.” Less than two years later, we’re well past that point.
Still, challenges remain. Microsoft’s transcription system is tuned to the clean, stable audio of a landline telephone, Geoffrey Zweig, formerly a principal researcher at the company, told Business Insider last October. The next frontier for voice recognition is to accurately transcribe speech even when it’s coming over a lousy cell connection or an echoing McDonald’s drive-thru speaker.
Speech science “still has many challenges to address, such as achieving human levels of recognition in noisy environments with distant microphones, in recognizing accented speech, or speaking styles and languages for which only limited training data is available,” Huang wrote in his blog post on Sunday.