Raise Your Voice.

A New Call to Action for Digital Voice Experiences

Voice assistants like Alexa and Siri have penetrated the American consumer household faster than any new technology in history — significantly faster than smartphones. As of 2020, some 4.2 billion digital voice assistants are in use, and by 2024 that number is forecast to double to 8.4 billion units — exceeding the global population.

If literally everyone has access to voice technology, why haven’t we seen a similar explosion in voice-based business cases? Why hasn’t voice transformed the world?

The quick answer: it will, once we figure out how to use it correctly.

The power of multimodal.

How can voice improve UX?

Here's the critical design flaw: attempting to create voice-only conversations divorced from screens and other digital experiences.

Voice is still being treated as a standalone, self-contained system: we ask Alexa what movies are playing and she rambles out seven different movie titles with five showtimes for each. That was called Moviefone, and it was slow and frustrating when it was invented 25 years ago.

Imagine instead: Hit the mic button on your favorite movie ticket app and ask, “What movies are showing tonight?” Receive the full list of movies and showtimes on your smartphone screen. Then say, “Book me two tickets for Star Wars at 9 p.m.” Immediately receive confirmation on your screen, with tickets available in your app. Total elapsed time: eight seconds.
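The scenario above can be sketched in code. This is a minimal illustration, not a real ticketing integration: the `MultimodalResponse` shape, the `SHOWTIMES` data, and the keyword matching are all hypothetical, chosen only to show the pattern of pairing a short spoken reply with a richer on-screen payload.

```python
from dataclasses import dataclass, field

@dataclass
class MultimodalResponse:
    """What a multimodal app returns: a short utterance to speak
    aloud plus a richer payload to render on screen."""
    speech: str                                  # brief spoken confirmation
    screen: dict = field(default_factory=dict)   # full detail for the display

# Hypothetical in-memory showtime data standing in for a real ticketing API.
SHOWTIMES = {"Star Wars": ["18:30", "21:00"], "Dune": ["19:15", "22:00"]}

def handle_voice_command(text: str) -> MultimodalResponse:
    text = text.lower()
    if "what movies" in text:
        # Speak a one-line summary; push the full list to the screen.
        return MultimodalResponse(
            speech=f"{len(SHOWTIMES)} movies are showing tonight.",
            screen={"movies": SHOWTIMES},
        )
    if "book" in text:
        # A real app would call the ticketing backend here.
        return MultimodalResponse(
            speech="Done. Two tickets for Star Wars at 9 p.m.",
            screen={"confirmation": {"title": "Star Wars", "time": "21:00", "seats": 2}},
        )
    return MultimodalResponse(speech="Sorry, I didn't catch that.")

resp = handle_voice_command("What movies are showing tonight?")
print(resp.speech)   # short enough to listen to
print(resp.screen)   # detailed enough to read at a glance
```

The key design choice is that voice carries the quick intent in and a brief confirmation out, while the screen carries everything worth reading.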

At WillowTree, we call this concept of mixed voice-and-visual interactions multimodal experiences. Multimodal will soon transform how we interact with technology, ushering in a new era.

We are in The Coming Age of Voice.

Why is multimodal better?

We speak three times as fast as we type…

…but we read twice as fast as we listen.

Don't take our word for it. Check out this side-by-side comparison for a fictional pizza ordering app (turn on your sound)…

Again, the core idea is simple: humans SPEAK 3x faster than we TYPE. But we READ 2x faster than we LISTEN.
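A quick back-of-the-envelope calculation shows why those two ratios favor "voice in, screen out." The specific words-per-minute figures below are illustrative assumptions, chosen only to be consistent with the 3x and 2x claims above, not measured data.

```python
# Illustrative rates consistent with the 3x / 2x ratios above
# (these specific numbers are assumptions, not measurements).
TYPE_WPM, SPEAK_WPM = 40, 120      # speaking ~3x faster than typing
LISTEN_WPM, READ_WPM = 140, 280    # reading ~2x faster than listening

def seconds(words: int, wpm: int) -> float:
    return words / wpm * 60

query, reply = 10, 60  # a 10-word request, a 60-word response

voice_in_screen_out = seconds(query, SPEAK_WPM) + seconds(reply, READ_WPM)
type_in_listen_out  = seconds(query, TYPE_WPM) + seconds(reply, LISTEN_WPM)

print(f"speak + read:  {voice_in_screen_out:.0f}s")
print(f"type + listen: {type_in_listen_out:.0f}s")
```

Under these assumptions, speaking the request and reading the response takes roughly half the time of the reverse pairing, and the gap widens as the response gets longer.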

The era of having to swipe, type and tap to navigate apps and websites is coming to an end — the future of digital is the voice command married with screen response.

The question is, how do we get there?

Expert Tip – Surveying the voice landscape and building a solution takes time, effort, and expertise. Want help from our Strategy and Design teams? Let’s talk.


Voice capabilities will soon become integrated into every facet of human-facing technology. Does that mean voice will replace screens? Certainly not. What we’re seeing instead is the emergence of VoiceCases — those moments when the easiest way of getting something done is via voice.

VoiceCases are present in our daily lives, and pop culture imagination has been conjuring futuristic multimodal applications for decades. Consider Marvel’s J.A.R.V.I.S. or the “Enhance” meme that’s become a staple of every procedural crime drama.

We know what the future looks like. What are the business implications of voice today?

Voice User Interfaces

Voice User Interfaces (VUIs) promise easier, more efficient experiences. However, we are often met with discoverability challenges that result in frustration and unmet expectations.

We’re invested in this problem space at WillowTree. We think discoverability is a solvable problem, one that can be mitigated through thoughtful user experience design, clever engineering, and an embrace of a multimodal approach. To set a Voice User Interface up for success, we’ve outlined a series of steps in our free, open-source Figma file.

There are obviously a multitude of decisions to make within this framework, but we think it provides a great place to start for most VUIs.

Expert Tip – Consider how multimodal voice applies to your business. (Do you employ lots of field service technicians with aging hardware? Are you a B2C retailer looking to ease customer pain points? Need help answering these questions? Get in touch.)

How to get started.

1. Know the current state.

As with any design or technology initiative, you need to start with the user and the current state of their experience. What technology is available to create voice experiences? What are your customers’ expectations of those experiences, and how can you meet or exceed those expectations with a digital solution?

Voice technology has been evolving rapidly, with new platforms, new devices, new industries, and new integrations cropping up regularly.

2. Find use cases.

The unignorable VoiceCases

WillowTree conducted a nationwide survey of 824 people to get a pulse on which voice use cases were resonating with users. We asked people to rate fifteen use cases on two criteria: usefulness and efficiency. Here’s what we found.

Based on this research and on what we’re seeing in the market, voice is poised to become the preferred interaction model for at least three unignorable use cases that currently happen primarily via screens: Specific Search, Composition & Logging, and Coaching & Instruction.

  • Specific Search: finding a known item or piece of information (locating media, directions, weather, FAQs; finding and ordering a part in inventory)
  • Composition & Logging: developing content and data entry (writing emails, completing forms, composing lists, all paperwork)
  • Coaching & Instruction: any skill that requires guidance, especially hands- or eyes-free tasks (driving a vehicle, piloting aircraft, performing medical procedures)
3. Build the foundation.

Regardless of the VoiceCases you’re focusing on, or even the platforms you intend to start with, position your company or product to move quickly as the voice space continues to develop.

Take a platform-agnostic approach

The form factor, capabilities, and market share of voice-enabled devices will continue to shift rapidly over the coming years. Design a backend for your voice ecosystem that can support voice applications wherever your users are: on Google Home, in a car, or on a device yet to be invented. You should also build APIs that enable easy access for new VoiceCases.
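One way to structure that platform-agnostic backend (a sketch of a common pattern, not WillowTree's actual architecture) is to keep intent handling in a shared core and push platform specifics into thin adapters. The intent name, slot names, and request shape below are hypothetical.

```python
from typing import Callable, Dict

# Core, platform-agnostic layer: named intents in, plain-text replies out.
IntentHandler = Callable[[dict], str]
HANDLERS: Dict[str, IntentHandler] = {}

def intent(name: str):
    """Register a handler for a named intent."""
    def register(fn: IntentHandler) -> IntentHandler:
        HANDLERS[name] = fn
        return fn
    return register

@intent("order_status")
def order_status(slots: dict) -> str:
    # Hypothetical business logic; a real handler would query your backend.
    return f"Order {slots['order_id']} is out for delivery."

# Thin adapter: translate one platform's request shape into (intent, slots),
# call the shared core, then wrap the reply back up. Adding a new platform
# means adding another adapter, not rewriting the handlers.
def handle_platform_request(request: dict) -> dict:
    name = request["intent"]["name"]
    slots = {k: v["value"] for k, v in request["intent"]["slots"].items()}
    text = HANDLERS[name](slots)
    return {"outputSpeech": {"type": "PlainText", "text": text}}
```

The payoff is that when a new surface appears (a car, a watch, a device yet to be invented), only the adapter layer changes.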

Start documenting knowledge and capabilities

Voice engineering activities are a little more advanced than those required for conventional software applications. In addition to standard UI and back-end technologies, voice requires two additional layers: Automated Speech Recognition (ASR) and Machine Learning (ML).

  • Automated Speech Recognition (ASR) is the technology by which a device translates human sounds into recognized speech with (hopefully) a high degree of accuracy.
  • Machine Learning (ML) model development is required for non-trivial voice interactions. These models need to be “trained” to translate the output of the ASR into meaningful software commands.
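To make the second layer concrete, here is a toy stand-in for the ML step: it maps an already-transcribed utterance (the ASR output) to an intent by keyword overlap. A production system would use a trained classifier; the intent names and keywords here are hypothetical, but the interface (text in, intent out) is the same.

```python
# Toy stand-in for the ML layer: score each intent by keyword overlap
# with the ASR transcript, and fall back when nothing matches.
INTENT_KEYWORDS = {
    "get_showtimes": {"movies", "showing", "showtimes", "tonight"},
    "book_tickets": {"book", "tickets", "reserve"},
}

def classify(transcript: str) -> str:
    words = set(transcript.lower().split())
    scores = {name: len(words & kw) for name, kw in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "fallback"

print(classify("what movies are showing tonight"))  # get_showtimes
print(classify("book me two tickets"))              # book_tickets
print(classify("hello there"))                      # fallback
```

Even this crude version shows why training data matters: the classifier is only as good as the phrasings it has seen.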

Gather training data

The content required to “train” your AI can come from many sources; some of the most common are in-person user interviews, customer support transcripts, customer-facing knowledge bases, and blog content.
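Whatever the source, the raw utterances need to be normalized into one labeled training set. A minimal sketch, with hypothetical source tags, utterances, and intent labels:

```python
# Hypothetical raw utterances pulled from sources like those above,
# tagged with where they came from and the intent they express.
RAW = [
    ("support_transcript", "where is my order", "order_status"),
    ("user_interview",     "Where IS my order?", "order_status"),
    ("knowledge_base",     "how do I reset my password", "reset_password"),
]

def build_training_set(rows):
    """Lowercase, strip punctuation, and de-duplicate utterances."""
    seen, out = set(), []
    for source, text, label in rows:
        norm = "".join(c for c in text.lower() if c.isalnum() or c.isspace()).strip()
        if norm not in seen:
            seen.add(norm)
            out.append({"utterance": norm, "intent": label, "source": source})
    return out

data = build_training_set(RAW)
print(len(data), "unique training examples")  # duplicates collapse
```

Keeping the source tag alongside each example makes it easy to audit later which channels produced the phrasings your model handles well or poorly.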

Reuse existing content

By the same token, you should audit all customer-facing content for useful solutions to user problems. If you’ve surfaced VoiceCases in your discovery process that you don’t yet have a solution for, you’ll need to develop new content to address them.

4. Design and develop.

So you’ve got some VoiceCases in mind and you’ve got a backend that can support the conversations you need to have with your users. It’s time to launch something into the world!

Here are a few thought starters:

  • What platforms are your audience currently using? (Alexa, Google Assistant, Bixby, Facebook Messenger)
  • Have you considered a branded personality for your voice apps (a “voice for your voice”)? Hint: don’t get too hung up on this.
  • Have you established KPIs in advance of building your voice-enabled product?
  • Are you prepared to run regular user testing to generate further training data and hone the conversation structure?
  • Have you considered design flows across multiple devices—allowing you to create a fully realized “multimodal” experience?

Rethink your UX

You need to rethink all of your customer’s interactions with your products and services now that there is an additional tool—voice—at your disposal.

When the steam engine was invented, the winners rethought their entire approach to transportation rather than simply hooking a steam engine to a carriage. Similarly, you can’t just create apps for your company on Google Home and Alexa and think you’ve checked the “voice” box. You have to think about how and where voice will be the most efficient, most delightful way for your users to complete a task—even if what they’re doing via voice is just one part of a flow that happens across multiple devices or channels.

Expert Tip – The voice technology landscape changes daily. Subscribe to our newsletter to stay on top of new developments in voice, mobile, and emerging technology.

The future.

While voice may seem like a “UX enhancement” in today’s conventional applications, it will be a mandatory input method in the next generation of software.

In the next hardware generation (let’s call it the next 2-10 years), we will see the emergence of virtual displays that will displace physical monitors and touchscreens. Without a physical keyboard or touchscreen, additional input systems, including voice, will be required for user/software interaction.

Voice is the future of UX, and multimodal applications can unleash the power of voice today. WillowTree is proud to be an industry leader in the research and development of voice-based technology.


Let's talk.

From full cycle product builds to supporting an existing team, we’re here to help.