Voice-enabled devices are already ubiquitous in our homes and workplaces. As voice-recognition technology continues to gain popularity, we depend on touchscreens less and less. It’s now easier than ever to manage a device with a built-in voice assistant.
As we consider voice tech specific to user experience and device interaction, let’s look at some heuristics we use for designing a powerful voice solution.
Seems pretty straightforward, right? Well, you’d be surprised how many different ways there are to say just one simple thing. Let’s start by examining a voice command:
To ensure that the app can accurately recognize user intent, it’s essential to anticipate what the user might say and how they’ll say it (the utterance). There can be vast differences in the way different individuals speak with the same or similar intent. Most often, the way you might word a sentence is different from how someone else might say it.
For the system to understand these utterances, we need to train it to accurately recognize a variety of vocal inputs. To this end, it’s helpful to conduct research and document the natural language people use when completing a task.
Once you have an idea of the different voice commands someone might say, rigorously test how the speech recognition system responds to them. You might have to train the AI to understand and adapt to different pronunciations of the same word. This ensures that similar-sounding words map to the correct one.
For example: when ordering a pizza, the system might mistakenly detect the word “ham” as “him”, but since that doesn’t make sense in the context of ordering pizza, “him” could be automatically mapped to the word “ham”.
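The “him” → “ham” remapping can be sketched as a context check: only swap a recognized word for its confusable twin when the twin fits the app’s domain vocabulary. The confusion pairs and vocabulary below are illustrative assumptions, not output from any particular speech API:

```python
# Commonly confused word pairs (recognized -> likely intended).
CONFUSIONS = {"him": "ham", "for": "four", "to": "two"}

# Words that make sense in this app's ordering context.
DOMAIN_VOCAB = {"ham", "cheese", "pepperoni", "mushroom", "pizza"}

def correct(tokens: list[str]) -> list[str]:
    """Swap a token for its confusable twin only when the twin
    belongs to the ordering domain; otherwise keep the original."""
    out = []
    for tok in tokens:
        candidate = CONFUSIONS.get(tok.lower())
        out.append(candidate if candidate and candidate in DOMAIN_VOCAB else tok)
    return out

print(correct("add him and cheese".split()))
# ['add', 'ham', 'and', 'cheese']
```

Because “two” is not in the domain vocabulary, a phrase like “deliver to my door” passes through unchanged, so the correction only fires where the context supports it.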
Lastly, use prototype dialog flows as a powerful visual tool for choreographing a chatbot-style conversation.
Even with the proliferation of voice-powered digital products (estimated to reach 8.4 billion units in 2024, a number higher than the world’s population), users still might be inclined to use touch when voice input is available. In multi-modal interfaces (where both voice and touch are available), a person’s inclination to use voice depends on how well they understand what voice can do.
The challenge with voice is that users never know exactly what they can do with it.
When a user is unsure how to use voice, they are inclined to simply read what they see on screen aloud, word for word. They might not be fully aware of any advanced in-app tasks they can accomplish through voice.
It’s important to make voice features obvious to the user. This can be achieved with onboarding and with prompts in specific situations where voice commands are available.
Since a VUI has limited visual affordances, the GUI should work to promote engagement with the VUI. Consider an accessible and intuitive placement of the voice button, as well as its behavior (listening, transcribing & processing states). Adding haptics can also make it more fun and intuitive to use.
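The listening, transcribing, and processing states mentioned above can be modeled explicitly so the UI never skips a feedback step. This is one way to sketch it; the state names mirror the behavior described in this article, and the transition rules are an assumption about a hypothetical app:

```python
from enum import Enum, auto

class VoiceButtonState(Enum):
    IDLE = auto()
    LISTENING = auto()      # mic open: e.g. pulsing animation plus a haptic tap
    TRANSCRIBING = auto()   # live text appears as the user speaks
    PROCESSING = auto()     # spinner while the intent is resolved

# Allowed transitions; anything else is rejected so feedback stays visible.
TRANSITIONS = {
    VoiceButtonState.IDLE: {VoiceButtonState.LISTENING},
    VoiceButtonState.LISTENING: {VoiceButtonState.TRANSCRIBING, VoiceButtonState.IDLE},
    VoiceButtonState.TRANSCRIBING: {VoiceButtonState.PROCESSING, VoiceButtonState.IDLE},
    VoiceButtonState.PROCESSING: {VoiceButtonState.IDLE},
}

def can_transition(current: VoiceButtonState, nxt: VoiceButtonState) -> bool:
    return nxt in TRANSITIONS[current]

print(can_transition(VoiceButtonState.IDLE, VoiceButtonState.LISTENING))  # True
```

Keeping the transitions in one table also gives a natural place to trigger haptics: fire a tap whenever the state changes, rather than scattering feedback calls across the UI code.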
And remember - sometimes it might just make more sense for a user to tap rather than use their voice. Don’t force the user to use voice.
With AI, the positive and negative extremes of user experience are amplified. Imagine a self-driving car: when it works it’s amazing, and when it doesn’t...it fails spectacularly. Voice is similar - users tend to have high expectations of how speech recognition technology performs. It’s important to meet these expectations with a credible error-handling process.
Verbal communication is quite error-prone, but fortunately humans are good at recovering from these voice errors. We can try to make AI smarter and smarter to the point where it performs flawlessly, or we can ensure that failures cost less.
When voice changes the screen content dynamically (e.g., filtering a product list), implementing a live visual feedback loop makes it easy to identify and recover from errors. In this example, the cart item flashes green, letting the user see how the system processed their voice input:
If the system got it wrong, the user would be able to spot it immediately. Voice-related errors should be communicated to the user as soon as possible. Since visual comprehension is faster than voice comprehension, leverage UI elements to alert the user and guide them.
Pairing notifications with haptics is an effective, multi-sensory way to indicate an error was encountered. Create a guided multi-turn experience by prompting for clarification when necessary.
We can’t treat human-technology communication the same way that we treat human-human communication. Making the voice AI too human can make the user feel uneasy. We refer to this feeling of strangeness as the “uncanny valley”.
You can avoid triggering this “uncanny valley” feeling by ensuring the user understands that the app isn’t a person; rather, it’s a high-functioning machine that can be controlled through voice. We want our AI to be helpful and to speed up tasks. We don’t want the AI to mimic human appearance, behavior, or emotions too closely. Avoid having a multi-turn voice assistant attempt to express empathy (e.g., “I see that you’re upset”). Rather, communicate factual statements about the situation (“Let me fix that”).
The back-end system might hold lots of data on the user. Consider how you present this data, and only show information that is relevant and necessary. Start by implementing voice for simple user interactions such as searching or customizing an item - remembering that less is more.
With the rapid evolution of voice in digital products, these tips will help build trust, empowering users to execute tasks by voice as quickly as possible with minimal back-and-forth.
When we change our design approach to voice experiences, we truly unlock the power of voice.
For a demo of voice capabilities, check out this voice-powered pizza app built by Willowtree. And if you have any questions, thoughts or ideas that are ready to implement - please reach out.