
Preventing System Degradation: How We Benchmark Accuracy of LLM-Based Systems

Milton Leal
Platform Data Scientist
Christopher Frenchi
Senior Software Test Engineer
September 21, 2023

Introduction

Recent studies have explored how large language model (LLM) performance may be changing over time (notably, a July study by researchers at Stanford and UC Berkeley provided evidence that ChatGPT-3.5 and ChatGPT-4 had “drifted substantially” from March to June 2023).

This degradation can occur for many reasons: an influx of new data and user feedback, recently updated model weights, or the addition of new moderation layers designed to filter or refine output, for instance. Any of these changes can alter a system's behavior and erode its performance over time.

Preventing degradation in LLM-powered systems is crucial as even a minor change in behavior can snowball into significant issues: decreased productivity, skewed analytics, inaccurate answers or predictions (often termed “hallucinations”), and other faulty conclusions that can misinform strategic decision-making. In scenarios where real-time answers are required, such as in healthcare diagnostics or financial transactions, any delay or error could lead to critical failures.

In short, unmeasured and unchecked system degradation could not only disrupt operations but also erode trust and loyalty among end-users, affecting a business's bottom line.

So…how can we identify degradation before it impacts our customers? How can we take this time- and resource-intensive benchmarking process and make it something we can do regularly, to ensure the LLM we’ve deployed is working for us as intended?

Our Approach to Benchmarking Accuracy

At WillowTree, our Data and AI Research Team (DART) is tackling this issue of system degradation head-on, using the LLMs themselves to benchmark the accuracy of LLM-powered systems.

We fully recognize the inherent eyebrow-raise here: a potentially error-prone LLM is evaluating itself? Can we trust a use case with this kind of self-analysis and self-regulation?

The core issue is that manually comparing texts requires substantial time and effort from subject matter experts, and human comparison is subject to reviewer bias and error in much the same way an LLM may be. We're not suggesting removing humans from the loop. But instead of laborious (and similarly error-prone) manual text comparisons, we've established a system that leverages an LLM to compare two texts. Because LLMs are non-deterministic, two responses to the same question will never be exactly the same, so simple exact-match testing won't work; we needed a new testing approach. The concept described in this article gives us a reliable baseline against which to measure future tweaks and adjustments to our LLM systems.

By balancing human expertise with AI efficiency, we've found an innovative, less resource-intensive way to maintain operational health and accuracy. Regular benchmarking allows us to spot symptoms of system drift and anomalies and address them swiftly.

In this article (the first of a three-part series on benchmarking), we suggest that LLMs can do the heavy lifting of evaluation for us, saving the time and cost that otherwise make regular benchmarking impractical.

  • This first article establishes the overall evaluation metric and how it can be implemented to benchmark accuracy and prevent system degradation;
  • Our second article will discuss how numerical values can be used to ‘score’ an AI response;
  • Our third article will examine how we can use these techniques to evaluate the output quality of a retrieval augmented generation (RAG) system (a natural language processing system combining information retrieval with generative AI).

By leveraging LLMs within the benchmarking process, DART has developed a framework for automatically evaluating different states of large language model systems.

While future experiments might explore using multiple LLMs in tandem, we exclusively used OpenAI’s GPT-4 application programming interface (API) to conduct our investigations for this article.

Our process considered the following key phases:

  1. Creating a Gold Standard Dataset. Industry-wide, it is best practice for human subject matter experts to create a dataset of questions and known answers (see Meta with its Llama 2, or Lee et al. on Scaling Reinforcement Learning from Human Feedback with AI Feedback). We use this Gold Standard Dataset of question–answer pairs as our trusted yardstick for performance measurement and system enhancement.
  2. Generating responses. After we generate a list of questions and answers for the Gold Standard Dataset, we ask these questions to the LLM system and let it generate new responses. These new responses reflect our current system state — that is, they reflect the quality and accuracy of answers the system will likely give to user questions.
  3. Comparing the LLM answer to the Gold Standard. Traditionally, human experts have also evaluated the quality of the LLM's response in comparison to the Gold Standard. In this article, however, we demonstrate that we can effectively use an LLM to compare the answers from the system we're benchmarking to the gold standard answers, evaluating the benchmarked system for properties such as truthfulness and informativeness.

Evaluation Metrics

You can build an evaluation tool that measures any quality you are looking for in your content: for example, you could track “relevance” or “tone of voice.” We chose to measure the qualities of truthfulness and informativeness to showcase how the LLM can make an evaluation.

1. Truthfulness:

This metric assesses the alignment of the LLM-generated answer with the gold standard answer. It considers questions such as: Is the answer complete? Does it contain half-truths? How does it respond when information is present but not acknowledged?

2. Informativeness:

This metric examines the LLM-generated answer's ability to provide all the necessary information as compared to the gold standard. It looks for any missing or additional information that may affect the overall quality of the response.

Creating a Gold Standard Dataset

Consider this excerpt from Wikipedia about the history of Earth (a rather static domain). Let’s use it as the context to allow for ground-truth annotation. We use the known facts in the paragraph to generate question–answer pairs, and then test the system on those questions to measure truthfulness and accuracy.

"The Hadean eon represents the time before a reliable (fossil) record of life; it began with the formation of the planet and ended 4.0 billion years ago. The following Archean and Proterozoic eons produced the beginnings of life on Earth and its earliest evolution. The succeeding eon is the Phanerozoic, divided into three eras: the Palaeozoic, an era of arthropods, fishes, and the first life on land; the Mesozoic, which spanned the rise, reign, and climactic extinction of the non-avian dinosaurs; and the Cenozoic, which saw the rise of mammals. Recognizable humans emerged at most 2 million years ago, a vanishingly small period on the geological scale."

—History of Earth (Wikipedia)

A gold standard question–answer pair for this piece of text might be:

Q: What is the chronological order of the eons mentioned in the paragraph?

A: The chronological order of the eons mentioned is the Hadean, followed by the Archean and Proterozoic. The Phanerozoic is the last one and it is divided into the Paleozoic, Mesozoic, and Cenozoic eras.

This response can now serve as a candidate answer in the Gold Standard Dataset.
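
In code, these pairs can live in any simple structure. Below is a minimal sketch of one way to hold them (the variable name and fields are our own, for illustration only):

# Gold Standard Dataset: question-answer pairs tied to their source context (illustrative structure)
context = "The Hadean eon represents the time before a reliable (fossil) record of life..."  # full excerpt above

gold_standard_dataset = [
    {
        "context": context,
        "question": "What is the chronological order of the eons mentioned in the paragraph?",
        "answer": (
            "The chronological order of the eons mentioned is the Hadean, followed by the "
            "Archean and Proterozoic. The Phanerozoic is the last one and it is divided into "
            "the Paleozoic, Mesozoic, and Cenozoic eras."
        ),
    },
]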

Generating Responses

Now, suppose you run some enhancements to your LLM-based application and you need to re-evaluate the overall system accuracy. You would then compare your new batch of responses to the dataset you previously created.

To illustrate, let's generate a new LLM response to evaluate.

import openai

# Generate LLM answer
def generate_LLM_answer(context, GS_question):
    # Ask the model to answer a gold standard question using only the provided context
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "You are an LLM evaluation assistant that answers the question correctly.",
            },
            {
                "role": "user",
                "content": f"Based on the context: '{context}', please answer the question: '{GS_question}'",
            },
        ],
    )
    return response.choices[0].message["content"].strip()
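
For instance, a call for our example pair might look like the following sketch (assuming context holds the Wikipedia excerpt and GS_question holds the gold standard question above; the variable names are illustrative):

# Example call (illustrative)
GS_question = "What is the chronological order of the eons mentioned in the paragraph?"
LLM_answer = generate_LLM_answer(context, GS_question)
print(LLM_answer)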

Output (Correct Answer):

The chronological order of the eons mentioned in the paragraph is: Hadean, Archean, Proterozoic, and Phanerozoic.

Now that we have our correct answer, we can evaluate it against our ground-truth.

Comparing the LLM Answer to the Gold Standard

Figure 1: Overview of WillowTree’s Accuracy Benchmarking Framework

What interests us most in this experiment is that the LLM can play a valuable role in evaluation. During this phase, we used an LLM to systematically compare the answers generated by the system we were evaluating with the gold standard set of answers (that we consider to be the truth).

Again, we know what you’re thinking: we used an LLM to evaluate the accuracy of an LLM? Isn’t that circular? Remember, the LLM is not cheating. It’s simply comparing sentences and paragraphs for accuracy.

“What we care about is the ability to compare two pieces of text,” says WillowTree’s Michelle Avery, Group VP of AI. “Can we take two answers to the same question and put this in ChatGPT, and say, ‘Answer 1 is our ground-truth. Does answer 2 completely represent answer 1?’ The LLM can compare the two sentences. It can tell us how close they are to each other, and therefore, how accurate the answer is when compared to the ground-truth.”

We used an LLM to evaluate these responses against a specific set of metrics. In the examples below we will show how we evaluated responses for truthfulness and informativeness. The evaluation process is broken down into specific parts as follows:

Evaluation Approach

The evaluation approach involves two principal aspects:

1. LLM as Evaluator

We prompt the LLM to act as an evaluator, comparing the gold standard answer with the new LLM-generated answer. The LLM is tasked with responding with a binary 'Yes' or 'No' for each of the two key features being evaluated (truthfulness and informativeness). These metrics let us evaluate how factual the response is and whether the presented facts are explained fully when compared against our base truth.

# Evaluate Answers
def evaluate_answers(GS_answer, LLM_answer):
    # Ask the model to judge the generated answer against the gold standard answer
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "You are an LLM evaluation assistant that will evaluate the given information.",
            },
            {
                "role": "user",
                "content": f"Given the gold standard answer '{GS_answer}' and the language model's response '{LLM_answer}', would you say that the language model's response is truthful when compared to the gold standard? Please respond with Yes or No.",
            },
        ],
    )
    return response.choices[0].message["content"].strip()
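
Calling this function with the gold standard answer and the generated answer from the previous section might look like the sketch below (variable names are illustrative):

# Example call (illustrative)
GS_answer = gold_standard_dataset[0]["answer"]
result = evaluate_answers(GS_answer, LLM_answer)
print(f"Result: {result}")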

Truthfulness

Correct Example:

Considering the gold standard answer…

‘The chronological order of the eons mentioned is the Hadean, followed by the Archean and Proterozoic. The Phanerozoic is the last one and it is divided into the Paleozoic, Mesozoic, and Cenozoic eras.'

and the language model's response:

'The chronological order of the eons mentioned in the paragraph is: Hadean, Archean, Proterozoic, and Phanerozoic.'

Would you say that the language model's response is truthful when compared to the gold standard? Please respond with Yes or No.

Output:

Result: Yes

The result shows the LLM answer correctly reproduces the chronological order of the eons.

Informativeness

As mentioned above, in addition to evaluating how factual the response is (truthfulness) we’re also evaluating whether the presented facts are explained fully when compared against our base truth (informativeness).

The language model’s response truthfully listed the chronological order of the eons, but did not provide the deeper layer of information — namely, that the last eon is further “divided into the Paleozoic, Mesozoic, and Cenozoic eras.”

Would you say that the language model’s response is informative when compared to the gold standard? Please respond with Yes or No.

Output:

Result: No

The results show the LLM answer is not informative as compared to the gold standard.
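
The code above only shows the truthfulness prompt. A minimal sketch of how the informativeness check could mirror it is below (the function name and exact prompt wording are our assumptions, not the production implementation):

# Evaluate informativeness (hypothetical variant mirroring evaluate_answers above)
def evaluate_informativeness(GS_answer, LLM_answer):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "You are an LLM evaluation assistant that will evaluate the given information.",
            },
            {
                "role": "user",
                "content": f"Given the gold standard answer '{GS_answer}' and the language model's response '{LLM_answer}', would you say that the language model's response is informative when compared to the gold standard, providing all the information the gold standard contains? Please respond with Yes or No.",
            },
        ],
    )
    return response.choices[0].message["content"].strip()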

Of course, this opens up a new can of worms: can we examine parameters like truthfulness and informativeness on more of a sliding scale, and does that introduce new layers of subjective bias? For instance, one could argue that the answer above (which omitted some detail) should still receive a 'Yes' but with a low score, much as a "half-truth" can be technically true. We'll be digging deeper into this big question in future posts in this series, so stay tuned.

2. Summarizing Results

When the LLM completes its evaluation, we compile and summarize the results. This is executed by calculating the percentage of accuracy for each characteristic: truthfulness and informativeness.

For instance, if the LLM evaluator finds that 85 out of 100 responses were truthful (in accordance with the gold standard) and 70 were informative, we can deduce that the system under evaluation was 85% truthful and 70% informative.
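
As a rough illustration, the summary step might be as simple as the sketch below (the helper and its names are ours, not part of the original framework):

# Summarize evaluation results (illustrative helper)
def summarize_results(truthful_flags, informative_flags):
    # Each list holds the 'Yes'/'No' strings returned by the evaluator
    truthful_pct = 100 * truthful_flags.count("Yes") / len(truthful_flags)
    informative_pct = 100 * informative_flags.count("Yes") / len(informative_flags)
    return truthful_pct, informative_pct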

To test, we can add an incorrect answer to see how the results are captured.

Incorrect Example:

Considering the gold standard answer…

'The chronological order of the eons mentioned is the Hadean, followed by the Archean and Proterozoic. The Phanerozoic is the last one and it is divided into the Paleozoic, Mesozoic, and Cenozoic eras.'

and the language model's response:

'The chronological order of the eons mentioned in the paragraph is the Phanerozoic, Proterozoic, Archean, and then the Hadean.'

Would you say that the language model's answer is truthful? Please respond with Yes or No.

Output:

Result: No

As we can see, the LLM-generated response lists the eons in reverse chronological order, so it is not truthful and the evaluator correctly returns No.

Limitations

This framework has limitations and ethical considerations. Scaling up to assess large datasets can be challenging, and relying on human-generated or auto-generated gold standards introduces potential biases and subjectivity. The accuracy and completeness of LLM-generated gold standards can impact the quality of evaluation.

However, this is true of any evaluation dataset, human-generated or automated: the quality of the gold standard questions and answers will always affect the quality of the evaluation. Crafting those questions and answers also takes time and expertise, since knowledge experts must work from the same context the LLM has access to. Using an LLM to draft them alleviates this burden but still requires manual review.

Finally, building an automated evaluator from the same kind of system it evaluates may introduce biases and limit objectivity. Any biases the LLM exhibits stem from biases in its training data and are thus inherent to the model. To ensure responsible AI, we must adhere to principles that minimize harm and avoid promoting those biases.

Conclusion

In this blog post, we outlined an approach that uses LLMs to benchmark the accuracy of LLM-powered systems and prevent degradation. The framework, tied together in the sketch below, has three steps:

  1. Creating a gold standard dataset
  2. Generating LLM responses
  3. Evaluating the quality of those responses using binary metrics
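
Putting the pieces together, a minimal end-to-end sketch (using the illustrative helpers from earlier sections, not a production implementation) might look like:

# End-to-end benchmark run (illustrative sketch)
results = {"truthful": [], "informative": []}
for pair in gold_standard_dataset:
    llm_answer = generate_LLM_answer(pair["context"], pair["question"])
    results["truthful"].append(evaluate_answers(pair["answer"], llm_answer))
    results["informative"].append(evaluate_informativeness(pair["answer"], llm_answer))

truthful_pct = 100 * results["truthful"].count("Yes") / len(results["truthful"])
informative_pct = 100 * results["informative"].count("Yes") / len(results["informative"])
print(f"Truthful: {truthful_pct:.0f}%, Informative: {informative_pct:.0f}%")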

This approach provides a structured methodology to assess the performance and reliability of these models. By continually refining and iterating upon this framework, we can contribute to the ongoing discussions around the capabilities and limitations of LLM-based systems and foster the development of more reliable and interpretable AI technologies.

Additional References

  1. G-EVAL: NLG Evaluation using GPT-4 with Better Human Alignment
  2. Constitutional AI: Harmlessness from AI Feedback