
Detecting AI Hallucination Risk Using a CIA Technique

Madeline Woodruff
Software Engineer
Michael Freenor
Director of Applied AI
January 19, 2024

As large language models (LLMs) continue flooding mainstream AI software development, retrieval augmented generation (RAG) has emerged as an early best practice. RAG systems are typically composed of two layers:

  1. A semantic search layer, where text embeddings are used to search for information semantically related to a user prompt or question, which supplies…
  2. the generation layer with in-context data (ICD) to answer the user directly.

While using ICD, a serious risk of AI hallucinations rears its ugly head. Sometimes, the ICD conflicts with the LLM’s world knowledge (i.e., the general domain-independent knowledge the LLM was trained on). When this happens, a game of tug-of-war plays out.

If the ICD wins, the LLM replies with accurate data in the context of the use case in question. But if world knowledge (WK) wins, the LLM answers the user’s prompt as if there were no ICD at all, potentially answering out of context or with data that’s no longer accurate.

Even more problematic is the case where the ICD and WK agree: the system appears to respond accurately when, technically, it’s hallucinating the right answer. It’s the AI equivalent of “even a broken clock is right twice a day.”

Winning this tug-of-war in favor of our ICD has been a major undertaking for WillowTree’s Data & AI Research Team (DART) because doing so ensures the reliability and safety of our clients’ RAG systems. One of the most effective strategies we’ve found comes from an espionage technique used by organizations like the CIA to detect information leaks: the canary trap.

A canary trap involves telling different stories to different people to see who’s leaking information (i.e., which canary is singing). In our case, we found that by using fictive data — data made intentionally false and thus easily recognizable — we could see when an AI system “sang” using ICD or WK, enabling us to detect and mitigate WK spillover.

Why RAG Systems Need Routine Testing for Hallucination Risk

RAG systems help LLMs generate more relevant, accurate responses by retrieving and summarizing custom domain-specific knowledge. That enhances what the LLM already knows. But if, for some reason, the LLM ignores the ICD, it risks hallucinating.  

To illustrate, imagine your company makes its HR data queryable through an LLM agent interface, like a conversational AI assistant. Imagine, too, that your company has completely fixed age discrimination.

Assuming this were true, your data would contain nothing that explicitly states the absence of age discrimination at your company (consider how large a database would need to be to express that everything typically true in the world is not true in your organization). This pits the fact that there is no age discrimination at your company (a fact not listed in the ICD at all) against the WK that age discrimination exists in the world.

Thus, we create a tug-of-war between abstaining from answering at all (given that the necessary information is not included in the ICD) and answering according to WK (which, in context, would contradict our ICD).

Setting the Canary Trap: Finding the Source of AI Hallucinations Using Fictive Data

The example above shows the second of the two most common instances of tension between ICD and WK:

  1. the ICD makes an assertion that, without context, contradicts WK
  2. a relevant fact is not stated in the ICD at all

In the second case, WK has the only and last word on the matter, and the model must go beyond the ICD to answer with what it thinks is correct.

But a third case also exists, where ICD and WK agree with each other. This can happen when the ICD chunks pulled back by your RAG system contain information that hasn’t changed in many months or years. In that case, it’s likely your LLM of choice was trained on that very information, so its WK holds the same answer you’d get from your ICD. Such cases are hard to test directly since we’d expect the same factual information regardless of which source was used.

Case in point: Early tests from a recent project for a major financial institution showed surprisingly high accuracy scores. But what we found after inspecting the specific ICD chunks from our RAG search was that a fortuitous hallucination produced the correct answer. WK and ICD agreed, making the system appear to work as intended even when the ICD chunks failed to contain the correct information.

How we deployed a fictive database for testing

To ensure the LLM would respond with ICD and not WK, even when both were correct, we relied on a fictive test database. We took our initial data and changed the numbers to false (and thus recognizable) values. Now, we could infer whether the RAG system replied based on the intended ICD chunks or based on WK.

Since the false values do not exist within the LLM’s WK, we built regression tests around the fictive values. Because the probability of the LLM hallucinating precisely those numbers out of nowhere is practically zero, a response containing them lets us verify that the RAG system is indeed passing information from the ICD chunks as intended.

To generate this fictive database, we started with our client’s ground truth data. We chunked these pages based on different strategies, took the chunk text, and embedded it. The numerical values within the chunk text were then halved; we used Azure OpenAI GPT-4 for this step. An example:

Real chunk text: I have 8 apples and 5 oranges. 
Fake chunk text: I have 4 apples and 2.5 oranges.
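
The halving transformation can be illustrated without an LLM. The sketch below is a stand-in, not the GPT-4 step described above: it assumes a simple regular expression is enough to find the numbers, and the name `halve_numbers` is hypothetical. It would mishandle formats such as "1,000" or dates, which is one reason a more flexible approach may be preferable on real data.

```python
import re

# Matches plain integers and decimals such as "8" or "2.5".
# Assumption: no thousands separators, dates, or units glued to digits.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def halve_numbers(text: str) -> str:
    """Return `text` with every numeric value replaced by half its value."""

    def _halve(match: re.Match) -> str:
        value = float(match.group()) / 2
        # Render whole numbers without a trailing ".0" (4.0 -> "4").
        return str(int(value)) if value.is_integer() else str(value)

    return NUMBER.sub(_halve, text)


print(halve_numbers("I have 8 apples and 5 oranges."))
# prints: I have 4 apples and 2.5 oranges.
```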

We stored the counterfactual chunk text in our fictive database, along with the original embedding of the real chunk. A similarity search therefore matches on the real chunk’s embedding but returns the fake chunk’s counterfactual text. These counterfactual values were then passed into the system prompt.
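
A minimal sketch of that pairing, assuming an in-memory list in place of a real vector database and toy three-dimensional vectors in place of real embeddings. The names `FictiveRecord`, `cosine`, and `search` are hypothetical and exist only for this illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class FictiveRecord:
    embedding: list[float]  # embedding computed from the REAL chunk text
    fake_text: str          # counterfactual text handed to the LLM


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(store: list[FictiveRecord], query: list[float], k: int = 3) -> list[str]:
    """Rank by similarity to the real-chunk embeddings, return the fake text."""
    ranked = sorted(store, key=lambda r: cosine(r.embedding, query), reverse=True)
    return [r.fake_text for r in ranked[:k]]


# Toy vectors stand in for real embeddings (illustrative values only).
store = [
    FictiveRecord([0.9, 0.1, 0.0], "I have 4 apples and 2.5 oranges."),
    FictiveRecord([0.0, 0.2, 0.9], "The meeting lasts 15 minutes."),
]
top = search(store, [1.0, 0.0, 0.0], k=1)
print(top)
```

Because the stored vector comes from the real text, retrieval behaves as it would in production, while the text reaching the prompt carries the recognizable false values.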

We then generated a set of questions and answers based on the fictive chunks per experiment. For example:

Fictive chunk: I have 4 apples and 3.5 oranges.
Q: “How many apples and oranges do I have?”
A: “You have 4 apples and 3.5 oranges.”

Each question would be embedded. From there, a RAG query would grab the fictive documents most similar to the generated test question. The fictive chunks would then pass to the chatbot for completion.

Finally, our LLM evaluator system would evaluate the response.

  • If the results matched the counterfactual ICD, we could conclude the chatbot used the ICD.
  • If the results failed to match the counterfactual values, this indicated that the chatbot had used WK to respond instead.

[Figure: Diagram of the manual review process used to evaluate results of large language model (LLM) accuracy testing]

Evaluating these results required a manual review process, as diagrammed above. We varied two factors in this testing strategy: chunking strategies and LLMs.
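
The core pass/fail idea can be approximated with a plain number comparison. This sketch is a simplified stand-in for the LLM evaluator and manual review described above, not a reproduction of them; it assumes a response passes when every number in the expected fictive answer also appears in the chatbot's response. The names `numbers_in` and `passes` are hypothetical.

```python
import re

# Matches plain integers and decimals such as "4" or "3.5".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def numbers_in(text: str) -> set[float]:
    """Collect every numeric value that appears in `text`."""
    return {float(n) for n in NUMBER.findall(text)}


def passes(response: str, expected_answer: str) -> bool:
    """True if the response contains all fictive values from the expected answer."""
    expected = numbers_in(expected_answer)
    # An expected answer with no numbers cannot be checked this way.
    return bool(expected) and expected <= numbers_in(response)


expected = "You have 4 apples and 3.5 oranges."
print(passes("You have 4 apples and 3.5 oranges.", expected))  # fictive values present
print(passes("You have 8 apples and 7 oranges.", expected))    # fictive values missing
```

A check like this only covers numeric facts, which is why it complements rather than replaces a review of the actual responses.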

Running Our Canary Trap Experiments

Our team suspected two reasons why a chatbot might use WK instead of ICD in a RAG system:

  1. suboptimal chunking strategies
  2. suboptimal choice of LLM

Regarding chunking strategies, if a chunk is too large, it could confuse the LLM. Likewise, if it’s too small, it might not provide enough contextual data.

The knowledge database for our RAG system included information that originated from webpages and contained HTML markup. Here, we hypothesized that stripping the HTML from the knowledge chunks and providing only plain text would make each knowledge chunk more concise and information-dense.
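
One way to produce a plain-text version is Python's standard-library `html.parser`. This is a minimal sketch, not necessarily the stripping method used in these experiments; `strip_html` is a hypothetical helper, and it keeps the contents of `<script>` and `<style>` tags, which a production pipeline would likely drop.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_html(html: str) -> str:
    """Return the visible text of `html` with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Join text nodes with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


print(strip_html("<p>I have <b>4</b> apples.</p>"))
# prints: I have 4 apples.
```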

Last, we hypothesized that the choice of LLM and its training would also affect whether responses are biased toward ICD or WK.

Test 1: Chunking strategies with a fixed LLM

The chunking strategies we used in our canary trap experiments included chunking by:

  • sentence
  • paragraph
  • 500 characters with 100 characters of overlap
  • 1,000 characters with 200 characters of overlap
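
The four strategies above can be sketched in a few lines. The helpers below are hypothetical illustrations, and they assume that paragraphs are separated by blank lines and that sentences end in ".", "!" or "?" followed by whitespace; the experiments may have split text differently.

```python
import re


def chunk_by_chars(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size character windows where consecutive chunks share `overlap` chars."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def chunk_by_paragraph(text: str) -> list[str]:
    """Split on blank lines (assumed paragraph separator)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def chunk_by_sentence(text: str) -> list[str]:
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


print(chunk_by_chars("abcdefghij", size=4, overlap=1))
# prints: ['abcd', 'defg', 'ghij']
```

With these helpers, the two character-based strategies would correspond to `chunk_by_chars(text, 500, 100)` and `chunk_by_chars(text, 1000, 200)`.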

Before chunking the documents, we ran a fact-extraction step that reprocessed each document’s text in both its “with-HTML” and “no-HTML” versions. We then chunked these two new documents with each of the four strategies. This process leaves a total of eight strategies: four for the reprocessed with-HTML document and four for the reprocessed no-HTML document. For this experiment, we chose Azure OpenAI GPT-4 as our fixed LLM.

We generated a set of question/answer pairs from the fictive chunks: 10 questions per chunking strategy, for a total of 80 questions. We hypothesized that the paragraph strategy and/or the 1,000 characters with 200 characters of overlap strategy would prove most effective. The chunks returned by our RAG query (the “document” in the prompt below) were limited to three and inserted into the system prompt, as shown.

systemPrompt = f"Answer ONLY from the document provided. The document is delimited by backticks: `{results}`"
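
For context, here is one way the `results` value could be assembled before it is interpolated into that prompt. The function name `build_system_prompt` and the choice to join chunks with blank lines are assumptions made for this sketch; only the prompt wording and the limit of three chunks come from the text above.

```python
def build_system_prompt(chunks: list[str], max_chunks: int = 3) -> str:
    """Insert at most `max_chunks` retrieved chunks into the system prompt."""
    # Assumption: retrieved chunks are joined with blank lines into one string.
    results = "\n\n".join(chunks[:max_chunks])
    return (
        "Answer ONLY from the document provided. "
        f"The document is delimited by backticks: `{results}`"
    )


prompt = build_system_prompt(["chunk-1", "chunk-2", "chunk-3", "chunk-4"])
print(prompt)
```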

Here are the results for each chunking strategy.

Pass rate for 1000-200char-nohtml: 100.0%
Pass rate for 500-100char-html: 100.0%
Pass rate for sentence-html: 100.0%
Pass rate for sentence-nohtml: 90.0%
Pass rate for 1000-200char-html: 80.0%
Pass rate for paragraph-html: 80.0%
Pass rate for paragraph-nohtml: 80.0%
Pass rate for 500-100char-nohtml: 50.0%
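
Each pass rate is the share of a strategy's 10 questions whose response matched the fictive values. A minimal sketch of that arithmetic follows; the `pass_rates` helper and the sample input are illustrative only (a 90% rate over 10 questions corresponds to 9 passes), not the experiment's raw data.

```python
from collections import defaultdict


def pass_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each strategy name to its pass rate as a percentage."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for strategy, ok in results:
        totals[strategy] += 1
        passed[strategy] += int(ok)
    return {s: 100.0 * passed[s] / totals[s] for s in totals}


# Illustrative input: 9 passes and 1 failure for a single strategy.
sample = [("sentence-nohtml", True)] * 9 + [("sentence-nohtml", False)]
for strategy, rate in pass_rates(sample).items():
    print(f"Pass rate for {strategy}: {rate}%")
# prints: Pass rate for sentence-nohtml: 90.0%
```

With only 10 questions per strategy, a single answer moves the rate by 10 percentage points, so small differences between strategies should be read with caution.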


The no-HTML version of the 1,000 characters with 200 characters of overlap chunking strategy achieved a 100% pass rate, tying for the best performance with 500-100char-html and sentence-html. These results support our hypothesis about the 1,000 characters with 200 characters of overlap strategy, at least in its no-HTML form (the with-HTML version passed 80%), but not our hypothesis about the paragraph strategy, which passed 80% in both versions.

Test 2: GPT-3.5 vs. GPT-4 vs. Claude

In the next experiment, we fixed the chunking strategy to 1,000 characters with 200 characters of overlap and varied the LLM used for chatbot completion in the RAG rewrite step. The LLMs we used were:

  • Azure OpenAI GPT-3.5 Turbo
  • Azure OpenAI GPT-4
  • Claude 2 (claude-v2)

We used the same system prompt for all LLMs. The dataset was generated from 100 chunks produced by the no-HTML version of the 1,000 characters with 200 characters of overlap strategy. We hypothesized that GPT-4 would perform best and that Claude might do better than GPT-3.5. The table below shows the accuracy of each LLM.

[Table: Experiment results comparing AI hallucination risk of Azure OpenAI GPT-3.5, GPT-4, and Claude 2]


Surprisingly, GPT-3.5 Turbo and GPT-4 had the same pass rate, whereas claude-v2 performed slightly behind them. The identical GPT results are particularly unexpected, considering GPT-4 generally outperforms GPT-3.5 Turbo in accuracy and logic capabilities. One possible explanation for Claude’s result is that it slightly outperforms GPT on factual knowledge, which could make its WK more likely to win the tug-of-war against the ICD.

Build the AI Hallucination Detection Process Your Business Needs

As our canary trap experiments show, using a fictive database is extremely helpful in evaluating whether an LLM uses WK or ICD. In our tests, a no-HTML chunking strategy of 1,000 characters with 200 characters of overlap was among the best performers at minimizing WK usage. Azure OpenAI GPT-3.5 Turbo and GPT-4 also outperformed Claude 2 at ensuring the LLM uses ICD.

For many use cases, it's crucial that chatbots and conversational AI assistants utilize context-specific data, which could potentially involve extracting private or real-time sensitive information. These findings underscore the delicate balancing act of ensuring that language model systems reliably extract relevant ICD while minimizing skew by WK.  

If you need help detecting hallucination risk in your generative AI applications, the DART team at WillowTree is ready to help. Get started by learning about our eight-week GenAI Jumpstart program.
