Why do our voice assistants struggle so much to understand us? The answer, of course, is in the math.
What Does 'Phoneme' Mean?
Article Feb 05, 2020
Rob Keefer
The proliferation of voice assistants has brought with it new memes and funny misinterpretations of speech, as Lindsay McGregor relayed in this tweet:
This brings up the question of why our voice assistants struggle so much to understand us.
As with so many topics in computer science, the answer to this question comes down to math, and probabilities in particular. Essentially, your voice assistant makes an educated guess at what you said based on the phonemes it detects in the sounds it hears.
Ok, what does that mean?
Your voice assistant breaks up a spoken phrase into small sounds called phonemes. A phoneme is a basic distinctive unit of speech sound by which words and sentences are represented. English has roughly 44 phonemes, though the exact count varies with dialect and with how the sounds are classified. For example, the word ‘phone’ consists of three phonemes: ‘F’, ‘OA’, and ‘N’. Sounds such as ‘F’, ‘N’, and ‘M’ are all distinct on their own, but when combined with other sounds they may not be as distinct as you might think.
Humans understand words within sentences, and sentences within a larger context. Often it is the context that enables us to differentiate and understand a given word, even when the word is not enunciated clearly. Voice assistants are trained to identify the most common of the more than 170,000 words in English, but even with this vast amount of training, they lack the broader context in which to interpret what is said.
I wonder if Alexa would have understood Lindsay McGregor’s request had it also known that Lindsay was headed into the family room, rather than just interpreting a phrase with no context.
This is illustrated in the short video below in which I ask Alexa, Google, and Siri what the word ‘phoneme’ means.
Each voice assistant fails differently. The phonetic spelling of ‘phoneme’ is F OA N EE M, which is close to F U N EE, F OA N [N], and F OA N EE for funny, phoning, and phony, respectively.
Now this is where the math comes in. Let’s imagine that each of these voice assistants is trained on only 5000 English words (I’m sure it is more than that, so humor me). I think it is reasonable to assume that the word ‘phoneme’ would not make the list of the top 5000 English words, and therefore no voice assistant would be taught the word. So the likelihood of a voice assistant understanding the word ‘phoneme’ is zero. Since the assistant doesn’t know the word we used, it takes a guess at what was meant based on the phoneme patterns that it has learned.
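To make that guessing step concrete, here is a minimal sketch in Python. It is not how Alexa, Google, or Siri actually work: real recognizers score audio with learned acoustic and language models. This toy version instead assumes a hypothetical four-word vocabulary written in the informal phoneme spellings used in this article, measures how many phoneme edits separate what was heard from each known word, and turns those distances into scores that sum to 1. The names VOCAB, edit_distance, and guess are made up for illustration.

```python
from math import exp

# Toy pronunciation dictionary (an illustrative assumption, not real training data).
# Phoneme spellings follow the informal notation used in the article.
VOCAB = {
    "funny": ["F", "U", "N", "EE"],
    "phoning": ["F", "OA", "N", "[N]"],
    "phony": ["F", "OA", "N", "EE"],
    "phone": ["F", "OA", "N"],
}


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))  # distance from empty prefix of a to each prefix of b
    for i, pa in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(
                prev[j] + 1,         # drop a phoneme from a
                curr[j - 1] + 1,     # insert a phoneme from b
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def guess(heard, vocab=VOCAB):
    """Return (word, score) pairs, best guess first; scores sum to 1."""
    # Fewer edits means a larger weight. exp(-distance) is an arbitrary
    # choice for this sketch, not a calibrated probability model.
    weights = {w: exp(-edit_distance(heard, p)) for w, p in vocab.items()}
    total = sum(weights.values())
    ranked = [(w, wt / total) for w, wt in weights.items()]
    return sorted(ranked, key=lambda pair: -pair[1])


# 'phoneme' is not in the vocabulary, so the best the model can do is guess.
heard = ["F", "OA", "N", "EE", "M"]
for word, score in guess(heard):
    print(f"{word}: {score:.2f}")
```

With this toy vocabulary, ‘phony’ ranks first because it differs from the heard sequence by a single dropped phoneme, while the other three words are each two edits away. Change the vocabulary or the weighting and the winner can change, which is the point: the word that was actually spoken can never win if it is not on the list.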
Based on the results of the experiment in the video, it seems that the Google speech recognition model assigned a higher probability to the user having said F U N EE (funny) than to the other choices. By comparison, Alexa calculated F OA N [N] (phoning) as the most probable, and Siri chose F OA N EE (phony).
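One way to picture why the three assistants can disagree is the textbook formulation of speech recognition, shown below. This is the standard classroom framing, not a description of what any of these products actually implement, and the symbols are chosen here for illustration: x is the audio that was heard, w is a candidate word, and V is the vocabulary the model was trained on.

```latex
\hat{w} = \arg\max_{w \in V} P(w \mid x) = \arg\max_{w \in V} P(x \mid w)\, P(w)
```

Here P(x | w) is how well the sounds match the word, and P(w) is how likely the word is to be spoken in the first place. Two models that hear the same audio can land on different words if they assign different values to either term. And because the maximum is taken only over words in V, a word outside the vocabulary can never be the answer.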
As time passes, improvements in sound processing technology and speech recognition models should lead to more accurate word recognition. Eventually, Alexa will turn the lights on as requested, rather than playing the lullaby version of Livin’ on a Prayer.