As large language models (LLMs) continue flooding mainstream AI software development, retrieval-augmented generation (RAG) has emerged as an early best practice. RAG systems are typically composed of two layers: a retrieval layer, which searches a knowledge base for in-context data (ICD) relevant to the user’s query, and a generation layer, in which an LLM composes its response from that retrieved data.
While using ICD, a serious risk of AI hallucinations rears its ugly head. Sometimes, the ICD conflicts with the LLM’s world knowledge (i.e., the general domain-independent knowledge the LLM was trained on). When this happens, a game of tug-of-war plays out.
If the ICD wins, the LLM replies with accurate data in the context of the use case in question. But if world knowledge (WK) wins, the LLM answers the user’s prompt as if there were no ICD at all, potentially answering out of context or with data that’s no longer accurate.
Even more problematic is when the ICD and WK agree: the system will appear to respond accurately when, technically, it’s hallucinating the right answer. It’s the AI equivalent of “even a broken clock is right twice a day.”
Winning this tug-of-war in favor of our ICD has been a major undertaking for WillowTree’s Data & AI Research Team (DART) because doing so ensures the reliability and safety of our clients’ RAG systems. One of the most effective strategies we’ve found comes from an espionage technique used by organizations like the CIA to detect information leaks: the canary trap.
A canary trap involves telling different stories to different people to see who’s leaking information (i.e., which canary is singing). In our case, we found that by using fictive data — data made intentionally false and thus easily recognizable — we could see when an AI system “sang” using ICD or WK, enabling us to detect and mitigate WK spillover.
RAG systems help LLMs generate more relevant, accurate responses by retrieving and summarizing custom domain-specific knowledge. That enhances what the LLM already knows. But if, for some reason, the LLM ignores the ICD, it risks hallucinating.
To illustrate, imagine your company makes its HR data queryable through an LLM agent interface, like a conversational AI assistant. Imagine, too, that your company has completely fixed age discrimination.
Assuming this were true, your data would contain nothing explicitly stating the absence of age discrimination at your company (consider how large a database would need to be to express that everything typically true in the world is not true in your organization). This pits the fact that there is no age discrimination at your company (a fact not listed in the ICD at all) against the WK that age discrimination is a thing that exists on Earth.
Thus, we create a tug-of-war between abstaining from answering at all (given that the necessary information is not included in the ICD) and answering according to WK (which, in context, would contradict our ICD).
The example above shows one of the two most common instances of tension between ICD and WK:
In the second case, WK alone has the last word on the matter, and the model must ignore the ICD to answer with what it thinks is correct.
But a third case also exists: one where ICD and WK agree with each other. This scenario can arise when the ICD chunks retrieved by your RAG system contain information that hasn’t changed in months or years. In that case, it’s likely your LLM of choice was trained on that very information, holding the answer you’d get from your ICD in its WK as well. Such cases are hard to test outright, since we’d expect the same factual answer regardless of which source was used.
Case in point: Early tests from a recent project for a major financial institution showed surprisingly high accuracy scores. But what we found after inspecting the specific ICD chunks from our RAG search was that a fortuitous hallucination produced the correct answer. WK and ICD agreed, making the system appear to work as intended even when the ICD chunks failed to contain the correct information.
To ensure the LLM would respond with ICD and not WK, even when both were correct, we relied on a fictive test database. We took our initial data and changed the numbers to false (and thus recognizable) values. Now, we could infer whether the RAG system replied based on the intended ICD chunks or based on WK.
Since the false values do not exist within the LLM’s WK, we built regression tests around the fictive values instead. Because the probability of the LLM hallucinating precisely those numbers out of nowhere is practically zero, the tests let us verify that the RAG system is indeed passing information from the ICD chunks as intended.
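The resulting regression check is simple: an answer can only contain the planted value if it came from the ICD. A minimal sketch, using a hypothetical `canary_trap_check` helper and made-up dollar values:

```python
def canary_trap_check(response: str, fictive_value: str, real_value: str) -> str:
    """Classify which knowledge source a RAG answer drew from.
    The fictive value can only appear if the ICD chunk was used;
    the real value signals world-knowledge (WK) spillover."""
    if fictive_value in response:
        return "ICD"  # the canary sang with our planted value
    if real_value in response:
        return "WK"   # the model fell back on world knowledge
    return "ABSTAIN_OR_OTHER"

# Hypothetical example: the real chunk said "$2,500"; the fictive chunk says "$1,250".
assert canary_trap_check("Your deductible is $1,250.", "$1,250", "$2,500") == "ICD"
assert canary_trap_check("Deductibles are typically $2,500.", "$1,250", "$2,500") == "WK"
```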
To generate this fictive database, we started with our client’s ground truth data. We then chunked these pages using different strategies, took the chunk text, and embedded it. The numerical values within the chunk text were then halved (we used Azure OpenAI GPT-4 for this step). An example:
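As a rough illustration of that halving step, here is a deterministic sketch (the project itself used GPT-4 for the rewrite; this regex version only shows the idea):

```python
import re

def make_fictive(chunk_text: str) -> str:
    """Halve every numeric value in a chunk so the result is
    recognizably false ("fictive") yet structurally identical to
    the original. A simplified stand-in for the GPT-4 rewrite."""
    def halve(match: re.Match) -> str:
        value = float(match.group(0).replace(",", ""))
        halved = value / 2
        # Keep whole numbers formatted as integers for readability.
        return str(int(halved)) if halved.is_integer() else str(halved)

    return re.sub(r"\d[\d,]*(?:\.\d+)?", halve, chunk_text)

real = "The fund returned 8% in 2022 with fees of $1,250."
fictive = make_fictive(real)
# → "The fund returned 4% in 1011 with fees of $625."
# (Note how literal this stand-in is: even the year gets halved.)
```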
We stored the counterfactual chunk text in our fictive database, alongside the original embeddings of the real chunks. A similarity search therefore behaves exactly as it would against the real data, but each match returns the fake chunk’s counterfactual values, which were then passed into the system prompt.
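This pairing can be sketched as a small in-memory store: real-chunk embeddings index counterfactual text, so retrieval quality is unchanged while the payload is fictive. A minimal illustration using cosine similarity (not our production index):

```python
import numpy as np

class FictiveStore:
    """Pairs each ORIGINAL chunk embedding with its COUNTERFACTUAL text,
    so a similarity search behaves as in production but returns fictive values."""

    def __init__(self):
        self.embeddings = []  # embeddings computed from the real chunks
        self.texts = []       # counterfactual (fictive) chunk text

    def add(self, original_embedding, fictive_text):
        self.embeddings.append(np.asarray(original_embedding, dtype=float))
        self.texts.append(fictive_text)

    def search(self, query_embedding, k=3):
        q = np.asarray(query_embedding, dtype=float)
        sims = [
            float(e @ q / (np.linalg.norm(e) * np.linalg.norm(q)))
            for e in self.embeddings
        ]
        top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
        return [self.texts[i] for i in top]
```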
We then generated a set of questions and answers based on the fictive chunks per experiment. For example:
Each question would be embedded. From there, a RAG query would grab the fictive documents most similar to the generated test question. The fictive chunks would then be passed to the chatbot for completion.
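Put together, one pass of the test harness looks roughly like this, where `embed`, `search`, and `complete` are hypothetical stand-ins for the embedding model, the fictive database lookup, and the chatbot completion call:

```python
def run_canary_query(question, embed, search, complete, k=3):
    """One pass of the canary-trap harness: embed the test question,
    retrieve the k most similar fictive chunks, and hand them to the
    chatbot for completion. All three callables are placeholders for
    real components (embedding model, fictive store, LLM client)."""
    fictive_docs = search(embed(question), k)
    return complete(context=fictive_docs, question=question)
```

The same function works against the real knowledge base by swapping in a different `search`, which is what makes side-by-side ICD/WK comparisons cheap.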
Finally, our LLM evaluator system would evaluate the response.
Evaluating these results required a manual review process, as diagrammed above. We used two changing variables in this testing strategy: chunking strategies and LLM models.
Our team suspected two reasons why a chatbot might use WK instead of ICD in a RAG system:
Regarding chunking strategies, if a chunk is too large, it can bury the relevant facts and confuse the model. Likewise, if it’s too small, it might not provide enough contextual data.
The knowledge database for our RAG system included information originating from webpages that included HTML. Here, we hypothesized that stripping the HTML from the knowledge chunks and only providing plain text (no HTML) would make the knowledge chunk more concise and information-dense.
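Stripping HTML down to plain text needs nothing exotic. A minimal sketch with Python’s standard-library parser (our actual pipeline may differ):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())

def strip_html(html: str) -> str:
    """Return the plain-text content of an HTML fragment."""
    extractor = TextExtractor()
    extractor.feed(html)
    return " ".join(extractor.parts)
```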
Last, we hypothesized that the choice of LLM and its training would also affect whether responses are biased toward ICD or WK.
The chunking strategies we used in our canary trap experiments included chunking by:
Before chunking the documents, we ran a fact-extraction step in which we reprocessed each document’s text (both the “with-HTML” version and the “no-HTML” version). We then chunked these two new documents. This process yields eight strategies in total: four for the reprocessed with-HTML document and four for the reprocessed no-HTML document. For this experiment, we fixed Azure OpenAI GPT-4 as our LLM.
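Of the fixed-size strategies, the 1,000 characters with 200 characters of overlap variant can be sketched as:

```python
def chunk_with_overlap(text: str, size: int = 1000, overlap: int = 200):
    """Fixed-size chunking: each chunk repeats the last `overlap`
    characters of the previous one, so a fact that straddles a chunk
    boundary survives intact in at least one chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```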
We generated a set of question/answer pairings from the fictive chunks: 10 questions per chunking strategy, for a total of 80 questions. We hypothesized that the paragraph strategy and/or the 1,000 characters with 200 characters of overlap strategy would prove the most effective. The chunks from our RAG query, referred to as documents below, were limited to three and inserted into the system prompt, as shown.
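The insertion of the three retrieved documents into the system prompt can be illustrated as follows (the wording here is hypothetical, not the exact prompt from our experiments):

```python
def build_system_prompt(documents):
    """Assemble a system prompt from at most three retrieved fictive
    chunks. Illustrative wording only; the experiments used their own
    prompt template."""
    doc_block = "\n\n".join(
        f"Document {i + 1}:\n{doc}" for i, doc in enumerate(documents[:3])
    )
    return (
        "You are a helpful assistant. Answer the user's question using "
        "ONLY the documents below. If the answer is not in the documents, "
        "say you do not know.\n\n" + doc_block
    )
```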
And here are the results of the chunking strategy experiment.
Experiment results showed the best performance from the no-HTML version of the 1,000 characters with 200 characters of overlap chunking strategy. Thus, we confirmed our hypothesis about the 1,000/200-overlap strategy but disproved our hypothesis about the paragraph strategy.
In the next experiment, we fixed the chunking strategy to 1,000 characters with 200 characters of overlap and varied the LLM used for chatbot completion in the RAG rewrite step. The LLMs we used were:

- Azure OpenAI GPT-3.5 Turbo
- Azure OpenAI GPT-4
- Anthropic Claude 2 (claude-v2)
We used the same system prompt for all LLMs. The dataset was generated from 100 chunks produced by the 1,000 characters with 200 characters of overlap no-HTML strategy. We hypothesized that GPT-4 would perform best, though Claude might do better than GPT-3.5. The table below illustrates the accuracy of each LLM.
Surprisingly, GPT-3.5 Turbo and GPT-4 had the same accuracy, whereas claude-v2 trailed slightly behind. Claude’s lag might be because it slightly outperforms GPT on factual knowledge, so its stronger world knowledge may pull harder against the ICD. The identical pass rate for GPT-3.5 Turbo and GPT-4 was especially unexpected, considering GPT-4 generally outperforms GPT-3.5 Turbo in accuracy and reasoning.
As our canary trap experiments show, using a fictive database is extremely helpful in evaluating whether an LLM uses WK or ICD. Furthermore, the fictive database showed that a chunking strategy of 1,000 characters with 200 characters of overlap performed best at minimizing WK usage. It also showed that Azure OpenAI GPT-3.5 Turbo and GPT-4 outperformed Claude 2 at ensuring an LLM uses ICD.
For many use cases, it’s crucial that chatbots and conversational AI assistants rely on context-specific data, which may include private or real-time sensitive information. These findings underscore the delicate balancing act of ensuring that language model systems reliably surface relevant ICD while minimizing skew from WK.