Editor’s Note: This is the first article in our AI Hallucinations series. The second reviews WillowTree's specific methods for reducing hallucination rates.
Generative AI is transforming how we do business. But early adopters have discovered that large language models (LLMs) can occasionally provide responses that are out of left field, off-brand, heavily biased, or just plain wrong. The industry has termed these types of completions "hallucinations."
AI hallucinations aren't the laser-wielding space kittens one might see on a psychedelic trip but rather the result of LLMs working as designed. An LLM such as OpenAI's ChatGPT is a top-notch predictive text machine; it functions by predicting the next word based on a vast textual context (e.g., the internet or particular data on which the model was trained).
This context includes…
A set of hyper-parameters guides this effort. The most commonly adjusted one is called "temperature," which controls the diversity of the next token (a token is a fundamental unit of text in an LLM). The higher the temperature, the wider the net of plausible tokens that the LLM could select from to produce the completion.
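To make the effect of temperature concrete, here is a minimal sketch of temperature scaling as it is commonly described: raw token scores are divided by the temperature before being turned into probabilities. The candidate tokens and their scores are made up for illustration, and real model providers may apply additional sampling controls on top of this.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw token scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens (illustrative only).
candidates = ["policy", "vacation", "kittens"]
logits = [2.0, 1.0, 0.1]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, dict(zip(candidates, (round(p, 3) for p in probs))))
```

At a low temperature nearly all of the probability lands on the top-scoring token. At a higher temperature the unlikely tokens get a larger share, which is the "wider net" described above.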
Hallucinations tend to occur when the in-context data in the prompt contradicts the LLM’s language function in some way. Imagine a custom chatbot designed to answer employees' benefits-related questions. Suppose a user asks, “What is our PTO policy?” and that organization’s PTO policy is neither included in the data set on which the LLM was trained nor passed to the LLM as additional context to the prompt. In that case, the LLM might respond with information that reflects an average or aggregate of PTO policy based on the large amount of internet data it has ingested. A chatbot's facility with language — and users’ inherent confidence in such a powerful piece of technology — can make these inaccurate statements all the more believable.
How should organizations respond? Should they wait until LLM technology entirely solves the problem of hallucinations? What if hallucinations can never be solved? What threshold of inaccuracy is acceptable? Meanwhile, as competitors race toward AI transformation, what is the path forward for companies that want to use chatbots responsibly?
After decades of digital experience, people have accepted that complete cybersecurity is not feasible. However, we’ve learned to employ safeguards that minimize the risk of a breach and mitigate the harm from such a breach.
WillowTree’s Data and AI Research Team (DART) operates from a similar premise: hallucinations will always be part of working with LLMs and can never be entirely prevented. In light of this, we use the layered "defense-in-depth" cybersecurity framework to protect against AI-generated misinformation.
A defense-in-depth approach to artificial intelligence hallucinations means taking actions toward primary security (minimizing the risk) and toward secondary security (mitigating the harm).
“Our team is trying to understand, ‘Can we detect a hallucination if we can't prevent it? Can we stop that misinformation from getting to the user?’” explains Patrick Wright, Chief Data and AI Officer at WillowTree. “Maybe we give them an error message or detect it after the fact and then inform the user there has been this hallucination.”
Wright stresses that, as in cybersecurity, a multi-layered, user-centered approach is critical. From concept to delivery, at every stage of the application development process, business leaders, engineers, developers, and designers can take decisive action to minimize and mitigate risk.
Here are some ways WillowTree suggests applying a defense-in-depth approach to a development project lifecycle.
Before defining the data required (a key step to reducing AI-generated misinformation), you must clarify the business problem you want to solve. This is a critical step in understanding the kinds of questions users will ask the LLM. And defining those questions is the foundation for ensuring you have gathered the appropriate, broad data set to reduce the risk of hallucination.
This is not a new concept: A problem-first approach is always the most effective way to innovate.
Internal workshops can clarify the problem you want to solve and how an LLM solution can help. From these workshops, you can begin generating a list of questions users might ask. However, we strongly recommend going to users directly. At WillowTree, our researchers engage users through various methods to uncover how people will want the tool to work. This kind of research can help complete the list of data points (or answers) your data scientists will need to gather, make ready, and eventually introduce to your model.
“You need the data to be wide-ranging enough to cover the topics you're going to ask about it,” explains Michael Freenor, Applied AI Director at WillowTree. “If the data is very narrow in scope, but you're asking broader questions, you're more likely to get a hallucination.” (Remember that hallucinations often occur when users ask the model questions whose answers are outside the model’s data set.)
Since hallucinations happen due to how models interpret directives (as described above: to choose the next word, based on constraints in the prompt, given the emerging context, within a flexibility parameter), there is no universally accepted “best” LLM, and no LLM with objectively lower rates of hallucination across every application.
That said, WillowTree is experienced in working with and developing prompts for many LLMs, in addition to fine-tuning custom machine-learning models. Our engineers can help determine which model best suits your business needs. We can help you think through the complexity of working with off-the-shelf models, including challenges in protecting private data, avoiding drift, and keeping source material current. We can also help you determine when a more custom model is better.
By choosing the LLM that best matches your business problem, you’re ensuring the LLM can better answer the questions your users are likely to ask, further lowering the risk of hallucinations.
You can minimize the likelihood of false answers by ensuring that you’re providing the most effective prompts. Start with initial prompts that guide your LLM toward accurate answers to user questions, then test the accuracy of prompt–model combinations and iterate on the prompts to improve accuracy over time.
DART’s data scientists have built a testing methodology that accurately predicts the likelihood of prompt–model combinations to deliver a hallucination. Using this benchmark, WillowTree’s engineers refine prompts and retest to reduce the possibility that hallucinations will occur.
There are also many UI/UX choices a team of product developers can make to reduce the harm caused by a hallucination. For example, the application should help users calibrate their trust in the system (now considered an industry best practice, according to Google). When applications make the fact of the LLM explicit, users bring a different set of expectations to the experience than if they believe they are communicating with a person.
Additionally, moderating layers between the LLM and the user during moments of input and output offers opportunities to minimize hallucination risk further and reduce harm from inappropriate answers.
“For most clients, we do not recommend they develop apps where the end user’s input is sent directly to the LLM or that the output of the LLM is shown directly to the user without some kind of moderation layer in between,” explains Freenor. “That moderation layer lets us protect against hallucinations. It helps the input prompt be less likely to generate a hallucination, and it lets us monitor the response from the LLM and evaluate whether it looks like a hallucination.”
Not only does a moderating layer reduce the risk of hallucinations, but it also is an opportunity to flag any information a brand doesn’t want its application to send to users. For example, information on a company’s competitors might be flagged as inappropriate and filtered out at this stage. In the moderating layer, an LLM can measure its degree of confidence in the accuracy of its answer, and the product designers can direct this information back to users, further mitigating the risk of harm when answers are not 100% accurate but are still informative.
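A minimal sketch of the output side of such a moderating layer might look like the following. Everything here is hypothetical: the blocked terms, the 0.7 cutoff, and the assumption that some upstream step supplies a confidence score between 0 and 1. It is meant to show the shape of the idea, not WillowTree's implementation.

```python
from dataclasses import dataclass

# Hypothetical configuration: names a brand does not want surfaced.
BLOCKED_TERMS = {"acme corp", "globex"}
LOW_CONFIDENCE_CUTOFF = 0.7  # illustrative value, tuned per application

@dataclass
class ModeratedReply:
    text: str
    confidence: float  # assumed to come from an upstream scoring step
    blocked: bool

def moderate_output(raw_text: str, confidence: float) -> ModeratedReply:
    """Check an LLM response before it is shown to the user."""
    lowered = raw_text.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        # Replace the whole response rather than leak filtered content.
        return ModeratedReply("Sorry, I can't help with that request.", confidence, True)
    if confidence < LOW_CONFIDENCE_CUTOFF:
        # Still informative, so pass it through with a visible caveat.
        note = "Note: this answer may be incomplete. Please verify it."
        return ModeratedReply(f"{raw_text}\n\n{note}", confidence, False)
    return ModeratedReply(raw_text, confidence, False)
```

Returning a small structured object rather than a bare string lets the UI decide how to present confidence to the user.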
According to Schillace’s Laws of Semantic AI, it is also possible to raise uncertainty as an exception. This means that the moderating layer can flag uncertain responses (based on the language used in the response). Rather than feeding these back to the user, where they become active hallucinations, the moderating layer can prompt the tool to clarify with the user by asking for additional details or clarity around the user’s intent in their original request.
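The "uncertainty as an exception" idea can be sketched in a few lines. The hedging phrases below are a hypothetical starting list, and simple substring matching is only one assumed way to judge the language of a response. A production system would likely use something more robust.

```python
class UncertainResponseError(Exception):
    """Raised when a response reads like a guess rather than an answer."""

# Hypothetical hedging phrases; a real list would be tuned per application.
HEDGE_PHRASES = ("i'm not sure", "i am not sure", "i don't have", "it depends")

def check_certainty(response: str) -> str:
    """Return the response unchanged, or raise if it sounds uncertain."""
    lowered = response.lower()
    for phrase in HEDGE_PHRASES:
        if phrase in lowered:
            raise UncertainResponseError(phrase)
    return response

def reply_to_user(response: str) -> str:
    """Show confident responses; otherwise ask the user to clarify."""
    try:
        return check_certainty(response)
    except UncertainResponseError:
        return "Could you tell me a bit more about what you're looking for?"
```

Treating uncertainty as an exception keeps the normal path simple and forces the calling code to decide explicitly what happens when the model is guessing.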
These are just a few ways a development team can design an LLM application to minimize risk and mitigate harm from hallucinations. WillowTree’s DART is uncovering new approaches every day.
Given the business problem an LLM is helping to solve, the next step is establishing an acceptable hallucination rate. Different business problems and industries allow for varying levels of risk. Applications in healthcare and finance will likely have lower risk tolerance than applications for user-generated graphic design, for instance.
Once the team determines the acceptable hallucination rate, it will need to measure the actual hallucination rate of the application and work to reduce it. DART’s predictive hallucination measurement for relational data (we call it “The Benchmark Exam”) is one method for determining this rate.
“If you make changes to your application, that could increase or decrease your hallucination rate,” explains Freenor. “So you need that continuous testing to understand how your changes affect that hallucination rate and then actively work that down.”
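As a simple illustration of comparing a measured rate against an acceptable one, the sketch below assumes you already have a set of test responses that reviewers have labeled as hallucination or not. It is not the Benchmark Exam itself, and the function names and the 2% figure in the usage note are made up for the example.

```python
def hallucination_rate(labels):
    """labels: booleans, True where a reviewed response was a hallucination."""
    labels = list(labels)
    if not labels:
        raise ValueError("need at least one labeled response")
    return sum(labels) / len(labels)

def within_tolerance(labels, acceptable_rate):
    """True when the measured rate does not exceed the agreed threshold."""
    return hallucination_rate(labels) <= acceptable_rate

# Hypothetical test runs before and after a prompt change.
before_change = [True] * 6 + [False] * 194   # 6 of 200 flagged
after_change = [True] * 3 + [False] * 197    # 3 of 200 flagged
```

With an acceptable rate of 2%, `within_tolerance(before_change, 0.02)` fails and `within_tolerance(after_change, 0.02)` passes, which is the kind of comparison continuous testing would repeat after every change.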
Once you have the LLM application in production, you want to continue monitoring your hallucination rates, partly because they could spike.
“A spike could indicate that you have different content you weren't using previously,” says Freenor. “Or, if you're not controlling the model, it could indicate that something has changed with the model provider. You want to be able to detect those shifts quickly and then figure out what happened.”
When designing and implementing monitoring systems, it is crucial to monitor hallucination rates throughout the application stack and to set alerts for spikes. DART has developed an audit process, colloquially called “Bot Court,” for detecting high-probability hallucinations in a log of responses.
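One assumed way to alert on spikes is to track the hallucination rate over a rolling window of recently audited responses. The class name, window size, and alert threshold below are hypothetical, and the audit step that decides whether a response was a hallucination is outside the sketch.

```python
from collections import deque

class HallucinationMonitor:
    """Tracks the hallucination rate over the most recent audited responses."""

    def __init__(self, window_size=100, alert_rate=0.05):
        self.window = deque(maxlen=window_size)  # old entries drop off automatically
        self.alert_rate = alert_rate

    def record(self, was_hallucination: bool) -> bool:
        """Add one audited response; return True if an alert should fire."""
        self.window.append(was_hallucination)
        if len(self.window) < self.window.maxlen:
            return False  # wait for a full window before alerting
        return sum(self.window) / len(self.window) > self.alert_rate
```

A rolling window reacts to recent shifts, such as new content or a change by the model provider, without being diluted by months of older data.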
Even after the app launches, the development team can enlist users to help monitor for hallucinations by including an easy way to flag inaccurate responses and otherwise give feedback about how accurate, informative, and useful the LLM application is.
From here, user feedback informs how the team should revise. Beginning the next iteration, the app development team takes user responses back to the development stage, where they:
The diagram above shows a multi-layered, defense-in-depth approach to app development that increases protection from LLM hallucinations.
Risk tolerance is not one-size-fits-all. For example, because healthcare and financial services applications have lower risk tolerance, as mentioned above, perhaps only a “human-in-the-loop” framework for artificial intelligence will be acceptable for them. Contrast that with a sneaker company that allows some customization of sneakers using generative AI input: it will likely have a higher risk tolerance for hallucinations. The good news is that organizations can (and must) raise or reduce their risk exposure by making certain prioritization decisions, given the following trade-offs.
For each application, there will be a trade-off between useful responses and cautious responses (“I can’t help with that” is cautious but not useful). For some use cases (“Can you interpret my medical chart for me?” or “Which companies should I invest in?”), a company might decide it can never risk giving factually incorrect information to a user. For these use cases, the model would lean deeply into truthfulness and more often return the cautious “I can’t help with that” response.
For other use cases (“What are some good resources for new cancer patients?” or “What questions should I ask a financial advisor?”), where a user asks for information rather than a definitive answer, it would be acceptable to deliver a response that is 80% or 90% accurate but may contain some errors.
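The useful-versus-cautious decision can be expressed as a per-use-case threshold. In this sketch the use-case names, the threshold values, and the notion of a single confidence score per answer are all assumptions for illustration. A confidence score is also not the same thing as measured accuracy.

```python
# Hypothetical minimum confidence required before answering, per use case.
MIN_CONFIDENCE = {
    "medical_chart": 0.99,
    "investment_advice": 0.99,
    "patient_resources": 0.80,
    "advisor_questions": 0.80,
}
CAUTIOUS_REPLY = "I can't help with that."

def choose_reply(use_case: str, answer: str, confidence: float) -> str:
    """Return the answer only when it clears the bar set for this use case."""
    # Unknown use cases fall back to the strictest threshold.
    threshold = MIN_CONFIDENCE.get(use_case, max(MIN_CONFIDENCE.values()))
    return answer if confidence >= threshold else CAUTIOUS_REPLY
```

The same answer with the same score can be shown in one use case and withheld in another, which is how an organization would encode different risk profiles in one application.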
No matter the use case, good practice in AI usage has cohered around always letting users know that a response has been AI-generated and should be checked for accuracy, which can further mitigate harm from inaccurate statements. Organizations must determine their risk profiles for each use case or group of use cases.
Organizations must also prioritize and find a balance between accuracy and speed. Most hallucination minimization techniques introduce latency into the system, and customers can experience frustrating response delays (imagine an hourglass icon and, “We’re working on a response for you.”). Alternatively, an organization might prioritize speed, understanding that a user might experience a hallucination, but review the LLM’s responses shortly after delivery to check for hallucinations. If one occurred, a fast follow-up could communicate this to the user (via the LLM interface, via email, etc.) and make necessary corrections.
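The speed-first option described above, deliver now and review shortly after, could be sketched as follows. The function names are hypothetical, and the slow check is passed in as a placeholder because the article does not specify how the review is performed.

```python
from collections import deque

# Responses already shown to users, waiting for a slower accuracy check.
review_queue = deque()

def deliver_fast(user_id: str, response: str) -> str:
    """Return the response immediately and queue it for later review."""
    review_queue.append((user_id, response))
    return response

def review_pending(looks_like_hallucination) -> list:
    """Run the slow check on queued responses; return those needing a follow-up."""
    needs_follow_up = []
    while review_queue:
        user_id, response = review_queue.popleft()
        if looks_like_hallucination(response):
            needs_follow_up.append((user_id, response))
    return needs_follow_up
```

The list returned by `review_pending` is what a follow-up step would use to contact affected users through the LLM interface or by email.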
Generative AI is a potent tool that has lit fires in all of us to imagine new ways of working, reaching customers, and creating seamless experiences. And like the technological advances before it, generative AI brings additional risks.
Thankfully, humanity already has practice adopting new digital approaches, and the “defense in depth” techniques we’ve come to rely on to minimize and mitigate cyber attacks offer a practical framework for evaluating and modulating one’s exposure to the risk of LLM hallucination.
If you’re curious about how to apply generative AI solutions to your internal or customer-facing workflows, WillowTree can help. Our data and AI capabilities and deep bench of world-class data scientists, data engineers, software engineers, test engineers, user researchers, and product designers can handle an enterprise’s most complex data challenges.
Schedule a call to talk with one of our Data and AI Research Team members or book an exploratory workshop to learn more about how WillowTree can help you minimize and mitigate the risks of hallucination with LLMs. Let us help you stay ahead of the competition and innovate for your customers — responsibly. Check out Part 2 in our series for a deeper dive into how we prevent AI hallucinations, or reach out to get started.