The field of Artificial Intelligence (AI) is developing at breakneck speed these days, fueled mainly by breakthroughs in Machine Learning (ML), an area within AI. The Neural Networks used in these algorithms keep growing larger as computing power increases, boosting their ability to learn ever more difficult tasks. For example, Stanford’s Machine Learning Group recently claimed to have developed an algorithm that diagnoses pneumonia better than radiologists, and researchers at Oxford and Yale even concluded, based on a survey amongst AI experts, that “AI will be able to beat us at everything by 2060”. So expectations are high, but can they be met? Or are we at the peak of the hype cycle, with the next AI Winter just around the corner?
Speech recognition, the process of turning recorded speech into text, is a task that has been around in Computer Science since the 1950s, when three Bell Labs researchers started working on it. Their system was able to recognise 10 different words (digits, actually). In August 2017, Microsoft proclaimed to have reached an important milestone by matching human accuracy in speech recognition. Or did it?
Well, at least some news websites seem to think so! ZDNet wrote “Microsoft’s new record: Speech recognition AI now transcribes as well as a human” and Business Insider even stated, “Microsoft’s voice-recognition tech is now better than even teams of humans at transcribing conversations”. No wonder our customers expect our speech recognition software to be almost perfect. They read headlines emphasising the incredible power of AI every day. Some expectation management is required though.
Even though state-of-the-art machine learning algorithms do a great job at many different challenging tasks, they’re still pretty dumb. A computer that “sees” a picture of two children playing frisbee at the beach and suggests the caption “two children playing frisbee at the beach” seems to have an actual understanding of the world. But when it suggests the caption “boy holding a baseball bat” for a picture of a girl brushing her teeth, you realise how thin that understanding really is, if it was ever there at all.
It’s important to realise that current machine learning algorithms are taught one very specific task using loads of training examples. The Artificial Intelligence that beat the world’s best Go player was initially trained on 30 million moves from historical games before being further improved using reinforcement learning. The algorithm can beat every human competitor, but doesn’t have a clue about anything other than Go.
The same holds for Microsoft’s speech recognition breakthrough. Their algorithm was taught to perform well on the “Switchboard speech recognition task”, a scientific speech recognition benchmark introduced in 1992. The task at hand is to transcribe a phone conversation between two known American English-speaking people talking about one predefined, coherent topic. Each of the speakers is recorded on a separate channel, so they can be told apart even when they talk at the same time. That is an entirely different kettle of fish compared to speaker-independent, large-vocabulary speech recognition of untrained speakers in noisy, unorganised, YouTube-like videos, which is what we happen to do at Media Distillery.
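Accuracy on benchmarks like Switchboard is typically reported as word error rate (WER): the word-level edit distance between the system’s transcript and a human reference, divided by the number of reference words. A minimal sketch of how that metric is computed (the example sentences below are made up, not Switchboard data):

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of six reference words: WER = 1/6 ≈ 0.167
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A lower WER is better; “human parity” claims mean the system’s WER on this one benchmark matches that of professional human transcribers on the same recordings.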
Although some expectation management is needed to give a realistic idea of what AI can and cannot achieve nowadays, the field has taken some impressive steps forward. Since we started using deep learning in our speech recognition software, its accuracy has increased by more than 40%. Our deep learning-based logo recognition (using a Faster R-CNN deep network) proved to be on par with Google’s. And the first version of our topic detection algorithm (based on Word2Vec-like word embeddings) allows our customers to accurately find any of the predefined topics. Advances in computing power (e.g. new generations of GPUs) will allow us to keep improving our products and introducing new algorithms. Even though the next AI winter might be coming and the trough of disillusionment is lurking, we plan to take advantage of this technology while it is progressing.
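The embedding-based approach to topic detection can be illustrated by comparing the average word vector of a text fragment against the average vector of a topic’s keywords: the higher the cosine similarity, the more likely the fragment is about that topic. The tiny 3-dimensional vectors and words below are made-up illustrations, not our actual model (real Word2Vec vectors have hundreds of dimensions and are learned from large corpora):

```python
import math

# Toy "embeddings" (illustrative values only, chosen so that sports words
# cluster together and politics words cluster together).
EMBEDDINGS = {
    "football": [0.9, 0.1, 0.0],
    "goal":     [0.8, 0.2, 0.1],
    "match":    [0.7, 0.3, 0.0],
    "election": [0.1, 0.9, 0.1],
    "vote":     [0.0, 0.8, 0.2],
}

def average_vector(words):
    """Average the embeddings of the known words in a word list."""
    vecs = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

sports = average_vector(["football", "goal"])
politics = average_vector(["election", "vote"])
fragment = average_vector(["match"])

# A fragment mentioning a "match" lands closer to the sports topic.
print(cosine(sports, fragment) > cosine(politics, fragment))
```

The appeal of this design is that a fragment never has to contain a topic’s exact keywords to be matched; semantically related words end up close together in the embedding space.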
November 1, 2017