Voice is a technology that lets you control computers with spoken commands. Because the vast majority of computers in the world are connected to the internet, you can also give voice commands online and so control devices remotely, perform searches, book a vacation or place orders in an online shop. Digital speech assistance is being used more and more: not only at home, but also at work, in the car and in public places.
From visual interface to audio interface
My prediction is that in the coming years voice will bring about a huge change in the way we deal with online technology. Today almost everything is visually oriented and depends on a screen; in many areas this will shift to voice, and a screen will no longer be needed. This will have major consequences for the way we communicate, and also, for example, for how we shop online and receive information.
Brands will also have to start thinking about how they want to present themselves in a world where voice becomes leading in a number of areas. How are you going to present yourself in a recognizable way if there is no longer always a screen on which a consumer sees your product or logo? And if a consumer wants to buy a product, how do you ensure that your brand is first choice in the mind of the consumer?
A number of major players have been working for years to improve the quality of their speech assistants. Apple has Siri, Google has the Google Assistant, Amazon has Alexa and Microsoft has Cortana.
How does voice actually work?
You can activate a digital assistant via a smart speaker by saying a wake word such as “Alexa” or “Hey Siri”. You can then ask a question or give a command, to which the digital assistant will respond. The response may be an immediate answer or the requested action, but the assistant may also ask you a follow-up question. But how does this actually work technically, and how is it possible that a digital assistant “understands” your question or command? This has everything to do with a smart combination of hardware and software.
The operation of digital speech interaction can be divided into three core steps:
- Speech to text
- Text to intention
- Intention to action
The first step, speech to text, converts voice commands into the kind of text input that your computer or smartphone normally receives when you type. Good speech to text software, such as Apple Dictation, Google Docs voice typing and Dragon NaturallySpeaking, adapts to background noise and to variation in tone, pitch and accent in order to provide accurate transcriptions in multiple languages. The software breaks your speech into small, recognizable units called phonemes; there are around 40 in Dutch and about 44 in English. From the order, combination and context of these phonemes, the audio analysis software works out what exactly you are saying. For words that are pronounced in the same way, the software analyzes the context and syntax of the sentence to decide which spelling best fits the word you spoke. Finally, the software matches the analyzed sounds against its database and selects the text that best matches what you said.
The second step is text to intention. This step interprets what exactly the user means. For example, if you say “tell me about Amsterdam” in a conversation, how does a digital speech assistant know what you mean by this? Are you asking for the latest local news about Amsterdam, do you want flight options to Amsterdam, or do you want to know the weather in Amsterdam? And when a word has a double meaning, this interpretation becomes even more difficult.
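To make the idea of "text to intention" concrete, here is a deliberately simple sketch. It assumes a hypothetical keyword table (`INTENT_KEYWORDS`) and counts how many trigger words of each intent occur in the request; real assistants use trained language models rather than word lists, so treat this only as an illustration of why “tell me about Amsterdam” is ambiguous.

```python
import re

# Hypothetical intents and trigger words, chosen only for illustration.
INTENT_KEYWORDS = {
    "weather": {"weather", "rain", "temperature", "forecast"},
    "flights": {"flight", "flights", "fly", "ticket"},
    "news": {"news", "headlines", "latest"},
}


def rank_intents(utterance):
    """Return (intent, score) pairs, highest score first."""
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    scores = [
        (intent, len(words & keywords))
        for intent, keywords in INTENT_KEYWORDS.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def detect_intent(utterance):
    """Return the best intent, or None when the request is ambiguous."""
    ranked = rank_intents(utterance)
    best_intent, best_score = ranked[0]
    runner_up_score = ranked[1][1]
    if best_score == 0 or best_score == runner_up_score:
        # No clear winner: the assistant should ask a follow-up question.
        return None
    return best_intent
```

With this table, a request that mentions the weather resolves to a single intent, while “tell me about Amsterdam” matches no trigger words at all and returns `None`, which is the point at which an assistant would have to ask what you mean.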
Web search engines solve this challenge by listing answers to the ‘query’ in descending order of inferred intention. A search engine can get away with presenting a list of answers; for a digital speech assistant, this ranking must ultimately lead to the single best answer.
A number of possible answers are each put in a so-called thread. Each thread uses hundreds of algorithms to study the evidence, looking at factors such as what the information is, what kind of information it is, how reliable it is and how likely it is to be relevant, and then assigns an individual weighting based on what the software has already learned.
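The weighting described above can be sketched in a few lines. This is a toy version under stated assumptions: the two factors (`relevance`, `reliability`), the fixed `WEIGHTS` and the `Candidate` class are hypothetical names invented for the example, whereas a real assistant combines many more signals and learns the weights from data.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    relevance: float    # assumed score between 0 and 1
    reliability: float  # assumed score between 0 and 1


# Hypothetical fixed weights; a real system would learn these.
WEIGHTS = {"relevance": 0.6, "reliability": 0.4}


def score(candidate):
    """Combine the factors into a single weighted score."""
    return (WEIGHTS["relevance"] * candidate.relevance
            + WEIGHTS["reliability"] * candidate.reliability)


def best_answer(candidates):
    """Return the one candidate with the highest weighted score."""
    return max(candidates, key=score)


candidates = [
    Candidate("Flights to Amsterdam start at ...", relevance=0.9, reliability=0.5),
    Candidate("It is 14 degrees in Amsterdam.", relevance=0.7, reliability=0.9),
    Candidate("Amsterdam is the capital of the Netherlands.", relevance=0.4, reliability=1.0),
]
print(best_answer(candidates).text)
```

Unlike a search results page, `best_answer` returns exactly one candidate, which mirrors the constraint that a speech assistant has to commit to a single spoken answer.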
The third and final step is intention to action. This step is aimed at meeting the needs of the user. Most digital speech assistants evolve from answering simple questions, such as what the weather will be, to carrying out actions once they are integrated into other devices. Think of cars, thermostats, light bulbs, door locks, refrigerators, washing machines, alarm systems and coffee machines.
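Once the intention is known, turning it into an action is largely a matter of routing it to the right device. The sketch below assumes hypothetical intent names, handler functions and "slots" (the details extracted from the request, such as a room or a temperature); real device integrations go through each vendor's own API, which is not shown here.

```python
def set_thermostat(slots):
    """Hypothetical handler: pretend to set the thermostat."""
    return f"Thermostat set to {slots['temperature']} degrees"


def switch_light(slots):
    """Hypothetical handler: pretend to switch a light on or off."""
    return f"{slots['room'].capitalize()} light switched {slots['state']}"


# Map each recognized intent to the function that carries it out.
ACTIONS = {
    "set_temperature": set_thermostat,
    "switch_light": switch_light,
}


def act(intent, slots):
    """Run the handler for an intent, or apologize if none exists."""
    handler = ACTIONS.get(intent)
    if handler is None:
        return "Sorry, I cannot do that yet."
    return handler(slots)


print(act("switch_light", {"room": "kitchen", "state": "on"}))
```

Adding support for a new device then means registering one more handler in `ACTIONS`, without touching the speech or intent steps.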
These three core functions of digital speech interaction not only get better with more data, but are also available as APIs (application programming interfaces) from multiple providers. Companies can take advantage of that modularity and choose the best options and combinations to build integrated solutions for their customers.
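The modularity can be pictured as three interchangeable functions chained together. The sketch below uses stand-in implementations (`fake_speech_to_text`, `keyword_intent`, `respond`), all hypothetical; in practice each slot would be filled by a call to a provider's API, and any one of them could be swapped without changing the other two.

```python
from typing import Callable

# Each step is just a function with a fixed input and output type.
SpeechToText = Callable[[bytes], str]
TextToIntent = Callable[[str], str]
IntentToAction = Callable[[str], str]


def build_assistant(stt: SpeechToText,
                    nlu: TextToIntent,
                    act: IntentToAction) -> Callable[[bytes], str]:
    """Chain the three steps into one assistant function."""
    def assistant(audio: bytes) -> str:
        return act(nlu(stt(audio)))
    return assistant


def fake_speech_to_text(audio: bytes) -> str:
    # Stand-in: pretends the audio bytes already contain the spoken text.
    return audio.decode("utf-8")


def keyword_intent(text: str) -> str:
    # Stand-in: a single keyword check instead of a language model.
    return "weather" if "weather" in text.lower() else "unknown"


def respond(intent: str) -> str:
    # Stand-in: a canned reply per intent.
    replies = {"weather": "Fetching the forecast"}
    return replies.get(intent, "Sorry, I did not understand that")


assistant = build_assistant(fake_speech_to_text, keyword_intent, respond)
print(assistant(b"What is the weather today?"))
```

Because `build_assistant` only depends on the function signatures, a company could replace one step with a different provider's implementation and leave the rest of the chain as it is.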
Although good progress has already been made with digital speech assistants, their development is still in its infancy and we will see this technology come to fruition in the coming years.
The future of speech recognition (voice)
The beauty of the future is that nobody knows exactly what is going to happen, but that need not stop you from making predictions. Such predictions are also fun to look back on years later. We all know the video from 1998 in which people are asked whether they have a mobile phone.
It is, however, impressive to see how quickly speech assistance technology is developing and how it is finding its way into our homes. The technology is now seen as the natural way to operate the smart home, thanks to how inexpensive it has become to add speech to your installation. It has reached a good level in a relatively short time, which makes the future all the more exciting.
Siri was the first to make speech technology mainstream. When it was launched in 2011 with the iPhone 4S, it was a pretty revolutionary addition. At the time, Siri was mainly focused on the device it shipped with and was therefore, although functional, quite limited. It was also beta technology that was still under development. At its introduction, Siri immediately became a victim of its own success: because of its great popularity, the inevitable bugs surfaced, and the back-end servers were not prepared for the questions of millions of iPhone users.
Despite these teething problems, Siri cleared the way for competitors’ speech assistants and proved that controlling devices with your voice was a function consumers were happy to embrace. Google and Amazon then set about developing their own speech assistants. Google Assistant initially resembled Siri in applying natural language processing to interpret the user’s question, and then uses Google’s gigantic databases to look up the answer.
Google has also presented an interesting extension of Assistant under the name Google Duplex. This technology is not only designed to answer questions and make lists, but to become a credible personal assistant who can communicate autonomously and as naturally as possible with others and thus, for example, call a restaurant on your behalf to reserve a table.
A cautious estimate is that between 60 and 80 million people worldwide now have access to a smart speaker, and the vast majority of those speakers are equipped with Alexa. Amazon’s range lets you buy complete speakers with Alexa built in, or make any existing speaker smart with the Amazon Echo Dot. The voice OS has also found its way onto Amazon tablets and, via third parties, into everything from fridges to robots.
The smart home or office is a complex web of disparate products, but Alexa simplifies your installation by linking them all together. The smartest thing Amazon did with Alexa was to take it beyond the Amazon ecosystem and open it up to as many partners as possible. Still, being the market leader does not mean that the future of a technology is secure or that you can rest on your laurels. An operating system that functions completely naturally through speech is still a long way off. There are still many challenges in the areas of language, grammar, pronunciation and dialects.
However, the real future of speech assistants could lie in ending our addictive relationship with our smartphones. What do we actually use our phone for? Information, games, communication and maybe even calling people. This is exactly where speech assistance can play a major role. If you use your smartphone as an alarm, why not wake up to one of your favorite songs instead? And why not have the lights in your house and the coffee machine switched on, and information displayed on your bathroom mirror? All at a set time, or because you say “Good morning Alexa”.
The future is not that we look down and scroll, but that we do two things that are much more instinctive and efficient: speaking and listening. In the beginning I was quite skeptical about audio as an interface, certainly when it comes to presenting online information. We are so quick at reading and scrolling; is audio not less convenient and slower? But consider the rapidly growing popularity of podcasts and audiobooks, and the emerging possibilities for quickly skimming or searching through audio. Or does the solution lie in adding a screen? Google and Amazon already give some of their smart speakers a screen, as on the Echo Show.
Will Google stay King of Search?
Google is now the absolute market leader in search, but it is by no means certain that this will remain the case in new areas such as voice or visual search. The company has a considerable lead when it comes to speech recognition and conversational search, but the competition is certainly not standing still. Google is no longer the young start-up that managed to push competitors such as AltaVista, Excite and Ask Jeeves out of the market. Moreover, Google’s revenue model is largely based on the current way of searching and advertising, and the company will want to maintain this for as long as possible.
So where does all this leave the future of voice optimization? Many of these scenarios have a real chance of success, but it is impossible to predict in which direction things will move. That makes it quite difficult for brands to develop a strategy for emerging speech recognition technology. A good place to start is to ask what the voice that belongs to your brand should sound like. And that’s just the beginning.