
Recent studies have explored how large language model (LLM) performance may be changing over time (notably, a July 2023 study by researchers at Stanford and UC Berkeley provided evidence that GPT-3.5 and GPT-4 had “drifted substantially” from March to June 2023).
This degradation can occur for many reasons: an influx of new data and user feedback, updated model weights, or the addition of new moderation layers designed to filter or refine output, for instance. Any of these changes can shift a system's behavior and erode its performance over time.
Preventing degradation in LLM-powered systems is crucial, because even a minor change in behavior can snowball into significant issues: decreased productivity, skewed analytics, inaccurate answers or predictions (often termed “hallucinations”), and other faulty conclusions that can misinform strategic decision-making. In scenarios where real-time answers are required, such as healthcare diagnostics or financial transactions, any delay or error could lead to critical failures.
In short, unmeasured and unchecked system degradation could not only disrupt operations but also erode trust and loyalty among end-users, affecting a business's bottom line.
So…how can we identify degradation before it impacts our customers? How can we take this time- and resource-intensive benchmarking process and make it something we can do regularly, to ensure the LLM we’ve deployed is working for us as intended?
At WillowTree, our Data and AI Research Team (DART) is tackling this issue of system degradation head-on, using LLMs themselves to benchmark the accuracy of LLM-powered systems.
We fully recognize the inherent eyebrow-raise here: a potentially error-prone LLM is evaluating itself? Can we trust a use case with this kind of self-analysis and self-regulation?
The core issue is that manually comparing texts requires substantial time and effort from subject matter experts, and human comparison is subject to reviewer bias and error in much the same way an LLM may be. We’re not suggesting removing humans from the loop. But instead of laborious (and similarly error-prone) manual text comparisons, we've established a system that leverages an LLM to compare two texts. Those two responses will never be exactly the same, because LLM responses naturally vary slightly from one inquiry to the next due to the models' non-deterministic nature. So we needed to develop a new testing approach, and the concept described in this article gives us a reliable baseline against which to measure future tweaks and adjustments to our LLM systems.
By balancing human expertise with AI efficiency, we've found an innovative, less resource-intensive way to maintain operational health and accuracy. Regular benchmarking allows us to spot symptoms of system drift and anomalies and address them swiftly.
In this article (the first of a three-part series on benchmarking) we suggest that — to save time and cost and therefore enable regular LLM benchmarking — LLMs can do the heavy lifting of evaluation for us.
By leveraging LLMs within the benchmarking process, DART has developed a framework for automatically evaluating different states of large language model systems.
While future experiments might explore using multiple LLMs in tandem, we exclusively used OpenAI’s GPT-4 application programming interface (API) to conduct our investigations for this article.
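The sketches in this article all follow the same basic GPT-4 call pattern. Here is a minimal version, assuming the `openai` Python SDK with an `OPENAI_API_KEY` set in the environment; the `ask_gpt4` helper name is ours, purely for illustration:

```python
# A minimal sketch of the GPT-4 call pattern assumed in the examples below.
# Assumes the `openai` Python SDK and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def ask_gpt4(prompt: str) -> str:
    """Send a single prompt to GPT-4 and return the text of its reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the evaluator as repeatable as possible
    )
    return response.choices[0].message.content.strip()
```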
Our process moved through a few key phases: defining the qualities we want to measure, creating a gold standard dataset of question–answer pairs, generating fresh responses from the system under test, and using an LLM to evaluate those responses against the gold standard.
You can build an evaluation tool that measures any quality you are looking for in your content: for example, you could track “relevance” or “tone of voice.” We chose to measure the qualities of truthfulness and informativeness to showcase how the LLM can make an evaluation.
1. Truthfulness:
This metric assesses the alignment of the LLM-generated answer with the gold standard answer. It considers questions such as: Is the answer complete? Does it contain half-truths? How does it respond when information is present but not acknowledged?
2. Informativeness:
This metric examines the LLM-generated answer's ability to provide all the necessary information as compared to the gold standard. It looks for any missing or additional information that may affect the overall quality of the response.
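For illustration, one simple way to record the outcome of evaluating a single answer on these two metrics might look like the following (the field names are just one possible choice, not a prescribed schema):

```python
# Illustrative only: one possible record for a single evaluated answer,
# with one boolean verdict per metric.
from dataclasses import dataclass

@dataclass
class EvaluationResult:
    question: str          # the gold standard question
    gold_answer: str       # the reference ("ground-truth") answer
    candidate_answer: str  # the answer produced by the system under test
    truthful: bool         # does the candidate align with the gold answer?
    informative: bool      # does it convey all the information the gold answer does?
```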
Consider this excerpt from Wikipedia about the history of Earth (a rather static domain). Let’s use it as the context to allow for ground-truth annotation. We use the known facts in the paragraph to generate question–answer pairs, and then test the system on those questions to measure truthfulness and informativeness.
"The Hadean eon represents the time before a reliable (fossil) record of life; it began with the formation of the planet and ended 4.0 billion years ago. The following Archean and Proterozoic eons produced the beginnings of life on Earth and its earliest evolution. The succeeding eon is the Phanerozoic, divided into three eras: the Palaeozoic, an era of arthropods, fishes, and the first life on land; the Mesozoic, which spanned the rise, reign, and climactic extinction of the non-avian dinosaurs; and the Cenozoic, which saw the rise of mammals. Recognizable humans emerged at most 2 million years ago, a vanishingly small period on the geological scale."
—History of Earth (Wikipedia)
A gold standard question–answer pair for this piece of text might be:
Q: What is the chronological order of the eons mentioned in the paragraph?
A: The chronological order of the eons mentioned is the Hadean, followed by the Archean and Proterozoic. The Phanerozoic is the last one and it is divided into the Paleozoic, Mesozoic, and Cenozoic eras.
This question–answer pair can now serve as a reference entry in the gold standard dataset.
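In practice, we want many such pairs. Below is a rough sketch of how pairs like this could be auto-generated from a context passage with GPT-4; the prompt wording and JSON output format are illustrative, and the generated pairs should still be reviewed by a human before being treated as ground truth:

```python
# A sketch of auto-generating gold standard question–answer pairs from a
# context passage. Assumes the `openai` Python SDK; prompt wording is illustrative.
import json
from openai import OpenAI

client = OpenAI()

def generate_gold_pairs(context: str, n_pairs: int = 3) -> list[dict]:
    """Ask GPT-4 for question–answer pairs grounded only in `context`."""
    prompt = (
        f"Read the passage below and write {n_pairs} question-answer pairs "
        "that can be answered using only the passage. Respond with JSON: "
        'a list of objects with "question" and "answer" keys.\n\n'
        f"Passage:\n{context}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # The generated pairs should be reviewed by a subject matter expert
    # before being added to the gold standard dataset.
    return json.loads(response.choices[0].message.content)
```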
Now, suppose you make some enhancements to your LLM-based application and need to re-evaluate the overall system accuracy. You would then compare your new batch of responses to the dataset you previously created.
To illustrate, let's generate a new LLM response to evaluate.
Output (Correct Answer):
The chronological order of the eons mentioned in the paragraph is: Hadean, Archean, Proterozoic, and Phanerozoic.
Now that we have our correct answer, we can evaluate it against our ground-truth.
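In a real benchmarking run, you would collect one fresh response per gold standard question. A minimal sketch of that loop is below; `query_system` is a placeholder for however your LLM-powered application is actually invoked (a chat endpoint, an internal API, and so on):

```python
# A sketch of collecting a fresh batch of responses from the system under test.
def query_system(question: str) -> str:
    """Placeholder: call your deployed LLM-powered application here."""
    raise NotImplementedError

def collect_candidate_answers(gold_pairs: list[dict]) -> list[dict]:
    """Pair every gold standard question with the system's current answer."""
    return [
        {
            "question": pair["question"],
            "gold_answer": pair["answer"],
            "candidate_answer": query_system(pair["question"]),
        }
        for pair in gold_pairs
    ]
```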
What interests us most in this experiment is that the LLM can play a valuable role in evaluation. During this phase, we used an LLM to systematically compare the answers generated by the system we were evaluating with the gold standard set of answers (that we consider to be the truth).
Again, we know what you’re thinking: we used an LLM to evaluate the accuracy of an LLM? Isn’t that circular? Remember, the LLM is not cheating. It’s simply comparing sentences and paragraphs for accuracy.
“What we care about is the ability to compare two pieces of text,” says WillowTree’s Michelle Avery, Group VP of AI. “Can we take two answers to the same question and put this in ChatGPT, and say, ‘Answer 1 is our ground-truth. Does answer 2 completely represent answer 1?’ The LLM can compare the two sentences. It can tell us how close they are to each other, and therefore, how accurate the answer is when compared to the ground-truth.”
We used an LLM to evaluate these responses against a specific set of metrics; the examples below show how we evaluated responses for truthfulness and informativeness. The evaluation approach involves two principal aspects: prompting the LLM to act as an evaluator, and compiling its verdicts into summary scores.
We prompt the LLM to act as an evaluator, comparing the gold standard answer with the new LLM-generated answer. The LLM is tasked to respond with a binary 'Yes' or 'No' for each of the two key features being evaluated (truthfulness and informativeness). These two metrics were chosen so we can evaluate how factual the response is and whether the presented facts are explained fully when compared against our base truth.
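A minimal sketch of how this evaluator prompt can be issued programmatically is below (it mirrors the prompt wording shown in the examples that follow; the `judge` helper and its naive Yes/No parsing are illustrative):

```python
# A sketch of the LLM-as-evaluator call. Assumes the `openai` Python SDK;
# the Yes/No parsing is deliberately simplistic.
from openai import OpenAI

client = OpenAI()

def judge(metric: str, gold_answer: str, candidate_answer: str) -> bool:
    """Ask GPT-4 whether the candidate answer is `metric` (e.g. truthful) versus the gold standard."""
    prompt = (
        f"Considering the gold standard answer:\n'{gold_answer}'\n\n"
        f"and the language model's response:\n'{candidate_answer}'\n\n"
        f"Would you say that the language model's response is {metric} when "
        "compared to the gold standard? Please respond with Yes or No."
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content
    return reply.strip().lower().startswith("yes")

# Each candidate answer is judged twice: once for truthfulness, once for informativeness.
# truthful = judge("truthful", gold_answer, candidate_answer)
# informative = judge("informative", gold_answer, candidate_answer)
```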
Correct Example:
Considering the gold standard answer…
'The chronological order of the eons mentioned is the Hadean, followed by the Archean and Proterozoic. The Phanerozoic is the last one and it is divided into the Paleozoic, Mesozoic, and Cenozoic eras.'
and the language model's response:
'The chronological order of the eons mentioned in the paragraph is: Hadean, Archean, Proterozoic, and Phanerozoic.'
Would you say that the language model's response is truthful when compared to the gold standard? Please respond with Yes or No.
Output:
Result: Yes
The result shows the LLM's answer correctly gave the chronological order of the eons.
As mentioned above, in addition to evaluating how factual the response is (truthfulness) we’re also evaluating whether the presented facts are explained fully when compared against our base truth (informativeness).
The language model’s response truthfully listed the chronological order of the eons, but did not provide the deeper layer of information — namely, that the last eon is further “divided into the Paleozoic, Mesozoic, and Cenozoic eras.”
Would you say that the language model’s response is informative when compared to the gold standard? Please respond with Yes or No.
Output:
Result: No
The results show the LLM answer is not informative as compared to the gold standard.
Of course, this opens up a new can of worms: can we score parameters like truthfulness and informativeness on more of a sliding scale, and does doing so introduce new layers of subjective bias? For instance, one could argue that the answer above (which did not include as much detail) should still be a Yes, just with a low score, much as a “half-truth” can be technically true. We’ll be digging deeper into this question in future posts in this series, so stay tuned.
When the LLM completes its evaluation, we compile and summarize the results. This is executed by calculating the percentage of accuracy for each characteristic: truthfulness and informativeness.
For instance, if out of 100 answers, the LLM finds that 85 of its responses were truthful (in accordance with the gold standard) and 70 were informative, we can deduce that the LLM was 85% truthful and 70% informative.
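A small sketch of that roll-up, assuming one Yes/No verdict per metric has already been collected for each answer:

```python
# A sketch of summarizing per-answer verdicts into percentage scores.
def summarize(verdicts: list[dict]) -> dict[str, float]:
    """`verdicts` holds one dict per answer, e.g. {"truthful": True, "informative": False}."""
    total = len(verdicts)
    return {
        "truthfulness": 100 * sum(v["truthful"] for v in verdicts) / total,
        "informativeness": 100 * sum(v["informative"] for v in verdicts) / total,
    }

# e.g. 85 truthful and 70 informative verdicts out of 100 answers
# yields {"truthfulness": 85.0, "informativeness": 70.0}
```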
To test, we can add an incorrect answer to see how the results are captured.
Incorrect Example:
Considering the gold standard answer…
'The chronological order of the eons mentioned is the Hadean, followed by the Archean and Proterozoic. The Phanerozoic is the last one and it is divided into the Paleozoic, Mesozoic, and Cenozoic eras.'
and the language model's response:
'The chronological order of the eons mentioned in the paragraph is the Phanerozoic, Proterozoic, Archean, and then the Hadean.'
Would you say that the language model's answer is truthful? Please respond with Yes or No.
Output:
Result: No
As we can see, the LLM-generated response lists the eons in reverse chronological order, and the evaluator correctly judges it not truthful.
This framework comes with limitations and ethical considerations. Scaling up to assess large datasets can be challenging, and relying on human-generated or auto-generated gold standards introduces potential biases and subjectivity. The accuracy and completeness of LLM-generated gold standards can impact the quality of evaluation.
However, this is true of any evaluation dataset, human-generated or automated: the quality of the gold standard questions and answers will always affect the quality of the evaluation. And crafting questions and answers (based on context the knowledge expert knows the LLM has access to) takes time and expertise; using an LLM to draft them alleviates this, but the output still requires manual review.
Finally, an automated evaluator built from the same kind of system it evaluates may introduce biases and lack objectivity. Any biases the LLM exhibits stem from biases in its training data and are thus inherent to the model. To ensure responsible AI, we must adhere to principles that minimize harm and avoid promoting biases.
In this blog post, we outlined an approach for benchmarking the accuracy of LLM-powered systems to prevent degradation, using a three-step framework: creating a gold standard dataset of question–answer pairs, generating fresh responses from the system under test, and using an LLM to evaluate those responses against the gold standard.
This approach provides a structured methodology to assess the performance and reliability of these models. By continually refining and iterating upon this framework, we can contribute to the ongoing discussions around the capabilities and limitations of LLM-based systems and foster the development of more reliable and interpretable AI technologies.