Two pieces of advice when you’re ready to integrate voice technology into your business: First, train your voice model on your own data. Doing so is how you differentiate from the competition when it comes to generative AI.
Publicly available models like ChatGPT make deploying generative AI easier, but you can’t control how they respond. That means building and training your own voice model is key to creating the personalized multimodal experiences that customers will flock to during the mass adoption of voice.
The second piece of advice: Don’t jump to building an AI-powered voice tool right away.
Building a voice model is more efficient than it’s ever been. Yet it’s still a painstaking process that requires working closely with leaders from your marketing, sales, customer service, and product design teams, along with other key stakeholders. It also means choosing the right AI partner to help you shepherd the project from beginning to end.
This cross-collaborative process is needed because while AI is very good at figuring out how to do something, it’s not very good at figuring out what to do. That’s where humans come in — specifically, many humans with a thorough knowledge of your customers’ experience needs, interests, fears, concerns, and desires.
Here’s what the process of building and training an AI voice model that speaks your customers’ language looks like. For an even more detailed look, check out my Wall Street Journal bestseller, The Sound of the Future: The Coming Age of Voice Technology.
In this initial stage, you fall in love with your customers’ problems: immerse yourself in the challenges and moments of friction your customers experience when dealing with your organization (Note: “customers” is meant broadly throughout, encompassing external consumers and internal users such as frontline employees, corporate staff, and financial partners). Only after pinpointing what makes their lives and work more difficult can you decide if a new technology, like voice, can help make it easier.
As you find opportunities where voice can offer a potential solution (e.g., capturing and processing data in real-time or automating transactions), identify these as your voice use cases. From there, your next step is to create a specific, concrete jobs-to-be-done list for each use case.
Popularized by Harvard Business School professor Clayton Christensen, the jobs-to-be-done (JTBD) framework aims to discover customers’ unmet needs. Within these hidden, unmet needs live your strongest opportunities for making customers’ lives easier through voice.
Create your JTBD lists by analyzing how current communication methods impact customers’ experiences from start to finish. For example, let’s look at a top voice use case for many businesses: customer service. A smart place to begin would be:
From there, separate the calls into buckets, each for a specific kind of question or challenge, and rank them by frequency. The list might look something like:
You can see a JTBD list emerge for the customer service team. Working from this list, you can analyze how each job-to-be-done is currently handled, asking if new tools — including voice technology — would make each job quicker and easier.
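The bucketing and ranking step described above can be sketched in a few lines of Python. This is a minimal illustration, not the method from the original post: the bucket names and the call log are invented, and it assumes someone has already labeled each call with a bucket.

```python
from collections import Counter

# Hypothetical call log: one label per call, assigned during review.
# These bucket names are made up for illustration only.
call_buckets = [
    "billing question",
    "password reset",
    "billing question",
    "order status",
    "password reset",
    "billing question",
]


def rank_buckets(buckets):
    """Return (bucket, count) pairs, most frequent first."""
    return Counter(buckets).most_common()


ranked = rank_buckets(call_buckets)
for bucket, count in ranked:
    print(f"{count:>3}  {bucket}")
```

The ranked output is a reasonable starting point for a JTBD list, since the most frequent buckets are usually the first candidates to examine.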
Creating a complete, accurate journey map for a typical user is highly detailed work. But the effort is worth it because the biggest insights often come from understanding the smallest details. Involve your team and AI partner in generating these journey maps. Drawing on a large cross-section of knowledge will help you see each moment of the journey with a fine level of detail.
Create your journey maps by chronologically formatting the activities and thoughts of a typical user, from beginning to end, as they interact with one or more of your systems. Let’s return to the customer service department example.
A customer calls your contact center with a question or a problem. Doing so means moving through a series of steps (e.g., answering voice prompts, entering information to authenticate identity, holding for a representative, etc.) that, taken together, constitute the customer’s journey map. Each step may present an opportunity for improved efficiency, speed, accuracy, convenience, or some other positive impact on the customer experience.
This step brings your customers’ hidden, unmet needs to the surface. Your JTBD maps organically emerge from your journey maps. By visualizing and tracing each stage in the typical customer’s journey, you can pinpoint the micro jobs-to-be-done that customers must do in each interaction.
For each task identified on your JTBD maps, ask customers two key questions:
Use the data you collect to score each job-to-be-done by importance and satisfaction level, looking for high opportunity scores. That is, jobs-to-be-done with high importance but low satisfaction levels. Notice the three tasks highlighted below in the JTBD map.
A high opportunity score means the odds are good you can make that job-to-be-done more efficient, perhaps by applying voice technology.
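One way to turn importance and satisfaction ratings into a single number is sketched below. The post does not say which formula it uses, so this is an assumption: it applies a commonly cited opportunity formula from outcome-driven innovation, importance plus the gap between importance and satisfaction (never negative), on a 0 to 10 scale. The job names echo the contact-center example, and the survey ratings are invented.

```python
def opportunity_score(importance, satisfaction):
    """Importance plus the unmet gap; the gap is floored at zero.

    Assumes both ratings are averages on a 0-10 scale.
    """
    return importance + max(importance - satisfaction, 0)


# Hypothetical survey averages: job -> (importance, satisfaction).
jobs = {
    "authenticate identity": (9.0, 3.0),
    "answer voice prompts": (6.0, 7.0),
    "hold for a representative": (8.0, 4.0),
}

# Highest opportunity first.
scored = sorted(
    ((opportunity_score(imp, sat), job) for job, (imp, sat) in jobs.items()),
    reverse=True,
)
for score, job in scored:
    print(f"{score:>5.1f}  {job}")
```

Jobs that are important but poorly served rise to the top, which matches the "high importance, low satisfaction" pattern described above.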
If a fully operational voice system is like a Hollywood movie that’s been filmed, edited, and ready to view, the prototype is like the screenplay. Your goal with an AI voice prototype isn’t to write lots of code for a polished product. Instead, it’s to create a useful guide for the software developers who’ll eventually write the code.
To do that, focus on understanding how customers will want to interact with your voice system:
The good news here is that much of this information will probably be readily available from your journey mapping work. As for using this information to drive conversational exchanges, we have two valuable techniques: observing customers in their real-life context and applying conversation-modeling exercises.
Observing customers in their real-life context, or recreating it as closely as possible, is a valuable method of collecting data. When building Vocable, an augmentative and alternative communication (AAC) app that helps speech-impaired people communicate with caregivers, the WillowTree team worked with patients and speech pathology experts at Duke University and WakeMed Hospital.
That real-life context helped us observe and understand the issues people with paralysis commonly experience and how we could help them communicate those issues to their caregivers. Of course, not all instances allow such direct observation. Let’s look at a different example: developing a voice-driven smartphone app for a dental flossing machine.
In this case, the Data and AI Research Team (DART) here at WillowTree analyzed all the points of contact a user could potentially have with the dental flossing machine:
The team also studied phone records of customer service calls to identify common questions and complaints, particularly among new users. Gradually, they sorted these conversation topics into buckets that represented the issues the app would have to address — specifically, the jobs-to-be-done by users of the flossing device.
From there, the team organized conversations by developing a series of flowcharts. The result: conversational abilities such as responding with technical information from the product manual when users ask about water pressure.
Using conversation-modeling exercises means asking pairs of people (members of your development team, for example) to improvise dialogue between a user and the voice tool. You can even isolate them from each other, a technique known as Wizard of Oz testing. The rest of your team observes and takes notes, paying attention to:
Start by asking your improvisers to model the shortest route to complete each job-to-be-done. From there, gradually imagine dialogues of more complexity and variation, allowing them to snowball into a multitude of conversations that mimic the real-world conditions your voice system will have to deal with. Push your team hard toward building a multimodal solution so you don’t default to a back-and-forth, call-and-response approach.
These two steps (observing customers in their real-life context and using conversation-modeling exercises) will give you plenty of insights for sketching out a multimodal voice prototype. Companies like Sayspring, a division of Adobe, can help guide your prototyping process. Additionally, your team could build a functional prototype with minimal code using advanced AI tools like GPT-4.
Once you’ve built a prototype capable of basic voice interaction, you can begin training your natural language processing (NLP) voice model, such as GPT-4, to communicate effectively with real-world customers. Training requires identifying the most common words, phrases, and expressions used to handle each of your customers’ jobs-to-be-done and specifying the appropriate response.
The work of training a voice interface tends to be complicated and detail-driven. Resist the urge to rush it or shortchange the resources required. Your competitive advantage ultimately will come from understanding how to engage customers with voice as effectively as possible.
In addition to learning your customers’ language, training involves identifying and supplying your underlying language model with all the knowledge it needs to generate a response. Imagine you sell bicycles. Your conversational AI assistant must answer questions such as “How do I adjust the seat?” and “Is this bike available in a different color?”
Because large language models (LLMs) are trained on general world knowledge, you’ll need to train your model and give it access to custom private knowledge (e.g., a database of your stock with notable features for each product). Techniques like retrieval augmented generation (RAG) are highly effective here.
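To make the retrieval idea concrete, here is a deliberately tiny sketch of the "retrieve, then prompt" pattern behind RAG. Everything in it is an assumption for illustration: the product records are invented, retrieval uses simple word overlap instead of the embedding search a real system would likely use, and the final call to a language model is left out.

```python
import re

# Hypothetical private knowledge: one text record per product.
PRODUCTS = [
    "Trail 200 mountain bike. Colors: red, black. "
    "Seat adjusts with a quick-release lever.",
    "City 50 commuter bike. Colors: blue. "
    "Seat adjusts with a 5 mm hex key.",
]


def tokens(text):
    """Lowercase a string and return its set of words and numbers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, k=1):
    """Return the k documents sharing the most words with the question."""
    q = tokens(question)
    ranked = sorted(documents, key=lambda d: len(q & tokens(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, documents):
    """Combine the retrieved context and the question into one prompt."""
    context = "\n".join(retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


prompt = build_prompt("What colors does the Trail 200 come in?", PRODUCTS)
print(prompt)
```

The resulting prompt would then be sent to whichever language model you use, so the answer is grounded in your own catalog instead of general world knowledge.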
One of the best ways to align customers’ jobs-to-be-done with the correct vocabulary is to study past interactions (e.g., phone, email, text). Enterprises, especially Fortune 500 companies, have an edge over smaller companies here. But small and midsize businesses can also identify the words, phrases, and expressions needed to train highly effective AI voice models, thanks to synthetic data.
Driven by for-profit and non-profit organizations such as the Stanford Open Virtual Assistant Lab (OVAL), the synthetic data movement has shown that a relatively tiny amount of real-world information may be just as powerful for training generative AI models as the vast archives giant companies own.
Not to mention, synthetic data greatly benefits data security. That’s because using synthetic data to train AI models removes the risk of compromising actual customer data.
Early facial recognition tools performed poorly at recognizing the faces of people of color because predominantly white teams built the early prototypes. The lesson: You can’t create the world’s best products if everyone in the room looks the same.
Businesses of all sizes can build and train more effective voice models by ensuring the team building and training these models is as diverse as possible. The more points of view, modes of speech, and ways of thinking your AI model experiences during training, the more likely your voice tool will serve as wide an assortment of human beings as possible.
Make diversity a priority throughout the entire process when testing, evaluating, and improving your voice model.
Pretesting and improving your voice model before its formal launch is essential for usability. Your goal is to achieve a high intent match rate: most user requests fall within your voice system’s domain, and most of those requests are understood and acted upon correctly.
When pretesting, pose as wide a variety of challenges to your voice system as possible. This recreates the diversity of real-world issues likely to emerge once the system goes live.
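As a rough illustration, an intent match rate can be computed from a labeled pretest log. This sketch assumes each test utterance has been tagged with the intent the user actually meant and the intent the system chose; the intent names and results below are invented.

```python
def intent_match_rate(results):
    """Share of requests whose predicted intent equals the expected intent.

    `results` is a list of (expected_intent, predicted_intent) pairs.
    """
    if not results:
        raise ValueError("no test results to score")
    matched = sum(1 for expected, predicted in results if expected == predicted)
    return matched / len(results)


# Hypothetical pretest log.
results = [
    ("check_order", "check_order"),
    ("reset_password", "reset_password"),
    ("check_order", "cancel_order"),  # misunderstood request
    ("reset_password", "reset_password"),
]

rate = intent_match_rate(results)
print(f"Intent match rate: {rate:.0%}")
```

Tracking this number across test rounds gives a simple signal of whether vocabulary and design fixes are actually helping.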
Once testing begins, it’s wise to enlist the help of a voice development company like Bespoken to track performance metrics such as word error rate (i.e., the share of words spoken by users that the system misunderstands or misinterprets).
Don’t expect market-ready results right away. Bespoken CEO John Kelvie reports error rates of 20% or higher are possible during early rounds. This error rate is obviously unacceptably high. But thankfully, the errors often point to their solutions, which are usually more straightforward than they first appear.
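For readers who want to see how word error rate is typically calculated, here is a minimal sketch. It uses the standard definition (word-level edit distance between what the user said and what the system transcribed, divided by the number of words the user said). It is not Bespoken’s tooling, and the sample transcripts are invented.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length.

    `reference` is what the user actually said; `hypothesis` is the
    system's transcript. Comparison ignores letter case.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from transcript
                curr[j - 1] + 1,     # extra word in transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of five words is misheard.
wer = word_error_rate("book two tickets for tonight",
                      "book to tickets for tonight")
print(f"WER: {wer:.0%}")
```

Because insertions count as errors, the rate can exceed 100% when a transcript contains many extra words.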
Knowing the most common sources of error helps expedite the pretesting and improvement phase. There are three buckets to look into: vocabulary, system design, and failure to anticipate potential user statements.
When helping to create a voice tool for a cosmetics company, Bespoken found the system often misinterpreted “ageless” as “age list,” confusing the bot and leaving it unable to fulfill the customer’s request. The fix: train the system to treat “age list” (a phrase very unlikely for a user to ever say) as a synonym for “ageless.” Similar vocabulary adjustments resulted in a more than 80% reduction in error rates.
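A vocabulary fix like this can be pictured as a small correction step applied to the transcript before intent matching. The sketch below is an assumption about how such a step might look, not a description of Bespoken’s implementation; the function name and the sample sentence are hypothetical, and only the “age list” to “ageless” pair comes from the example above.

```python
import re

# Misheard phrase -> intended phrase.
MISHEARD = {"age list": "ageless"}


def normalize_transcript(text, corrections=MISHEARD):
    """Replace known misrecognitions, matching whole words in any case."""
    for wrong, right in corrections.items():
        pattern = rf"\b{re.escape(wrong)}\b"
        text = re.sub(pattern, right, text, flags=re.IGNORECASE)
    return text


print(normalize_transcript("Do you have the Age List foundation?"))
```

Many speech platforms also let you register custom vocabulary or synonyms directly, which may be preferable to post-processing when available.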
Minor flaws in system design can create an outsized impact. Voice is a great example because statements made to the user have to be brief. Otherwise, it quickly becomes too much information to understand.
A rule of thumb: when asking the user to make a choice, never present more than three options at a time. Realities like these emphasize the need for multimodal UX/UI design for conversational AI assistants, like coordinating a voice response with options displayed on a touchscreen.
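The three-option rule of thumb is easy to enforce in code. The following sketch splits a longer list of choices into groups of at most three and builds a short spoken prompt for each group; the function names, prompt wording, and color options are all hypothetical.

```python
def chunk_options(options, max_per_turn=3):
    """Split options into consecutive groups of at most max_per_turn."""
    if max_per_turn < 1:
        raise ValueError("max_per_turn must be at least 1")
    return [options[i:i + max_per_turn]
            for i in range(0, len(options), max_per_turn)]


def spoken_prompt(group, more_remaining):
    """Build one short prompt for a group of options."""
    if len(group) == 1:
        listed = group[0]
    elif len(group) == 2:
        listed = f"{group[0]} or {group[1]}"
    else:
        listed = ", ".join(group[:-1]) + f", or {group[-1]}"
    suffix = " You can also say 'more options'." if more_remaining else ""
    return f"Would you like {listed}?{suffix}"


colors = ["red", "blue", "green", "black", "white"]
groups = chunk_options(colors)
for index, group in enumerate(groups):
    print(spoken_prompt(group, more_remaining=index < len(groups) - 1))
```

In a multimodal design, the full list could appear on screen while the voice channel reads only the current group.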
Imagine someone responding to a text message via voice while driving. This potentially chaotic real-world scenario will present problems that don’t arise in conversation modeling at the office. Someone driving could be easily distracted, prompting them to say, “Wait,” or, “I didn’t catch that last part.”
When unanticipated user statements like these happen, they must be accounted for. That might mean revisiting the system design to present information more simply or giving the user slightly different options for a greater set of circumstances.
“Error flow handling” is the craft of repairing a conversation when faced with a potential error. Imagine, for example, all the mechanisms needed to make a customer saying, “Wait a sec, make it four tickets for the 8:15 p.m. movie” — right before they almost book two tickets for 7:30 p.m. — flow as a seamless voice experience.
Building this level of error flow handling isn’t easy, but it’s crucial for delivering great voice experiences. Reducing error rates to zero isn’t reasonable. Plus, your voice system will always have to contend with moments of miscommunication beyond its control (e.g., background noise, or someone who speaks very softly). Customers still deserve a positive voice experience under those conditions, even when it means handing them off to live support.
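One small piece of error flow handling, deciding when to re-prompt and when to hand off to live support, can be sketched as follows. This is a simplified illustration under stated assumptions: the action names and the two-attempt threshold are hypothetical, and a production system would handle far more cases (corrections, interruptions, noisy audio).

```python
def next_action(understood, failed_attempts, max_attempts=2):
    """Decide how to continue after one user turn.

    Returns (action, updated_failed_attempts). The counter resets once
    a request is understood.
    """
    if understood:
        return "fulfill", 0
    failed_attempts += 1
    if failed_attempts >= max_attempts:
        return "handoff_to_live_support", failed_attempts
    return "reprompt", failed_attempts


# Simulated conversation: two misunderstood turns in a row.
action, failures = next_action(understood=False, failed_attempts=0)
print(action)
action, failures = next_action(understood=False, failed_attempts=failures)
print(action)
```

Capping the number of re-prompts keeps customers from getting stuck in a loop, which is one way to preserve a positive experience when the system cannot recover on its own.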
The best way to get data on your voice system is to put it in the wild, see how it performs, and iterate accordingly. Your first few months will likely reveal new issues that didn’t arise during initial testing, prompting continued refinement and evaluation to deliver more value to users.
When you should formally launch your voice tool will depend on your unique situation. You'll have to weigh factors such as the potential impact of errors against the value you’re providing. A mistake by an AI-powered medical assistant, for instance, would have more severe consequences than one by an interactive trivia app.
As this blog shows, building and training AI voice tools to speak your customers’ language requires exceptional digital craftsmanship, from the initial planning stages to continually iterating the final product.
If you’re ready to build the voice and conversational AI applications your organization needs, the DART team at WillowTree is prepared to help. Our eight-week GenAI Jumpstart program has helped businesses rapidly identify and prototype new voice solutions, like a safe and secure conversational AI assistant for financial services.
Get started by learning more about our eight-week GenAI Jumpstart program.