Welcome back to our series on benchmarking the accuracy of LLM-powered systems. In Part 1, we delved into how WillowTree's Data and AI Research Team (DART) uses large language models (LLMs) to prevent system degradation. We established that LLMs, especially GPT-4, excel at comparing LLM-based application outputs against a gold standard.
In Part 2, we aim to go beyond binary yes/no evaluations. Our goal is to engineer a pair of prompts that emulate a scoring metric, rating the truthfulness and informativeness of an answer on a scale of 1 to 5, where a score of 1 signifies "not truthful" and a score of 5 represents "completely truthful." We'll explore the nuances of evaluating manually modified answers that may contain partial truths or misleading information, using GPT-4 as our trusty partner in this quest for accuracy.
This process of evaluating accuracy for natural language processing (NLP) tasks is particularly important because LLM-generated answers are not deterministic and could drift away from an expected output. With a scoring metric, we can spot these nuanced variations in the outputs and work towards improving the application.
The scoring metric contributes to WillowTree's evaluation-driven development methodology, making it especially important in the context of Retrieval Augmented Generation (RAG) systems, which will be the focus of our Part 3 article.
Before we dive into the intricacies of our scoring metric and dataset, let's set the stage. To summarize Part 1, we discussed the three key elements of the evaluator: a question, a gold standard answer, and the answer we want to compare. For this experiment, we will be using the following as our Gold Standard Question and Answer:
GS Q: “What are the three eras of the Phanerozoic eon, and what significant events happened in each?”
GS A: "The three eras of the Phanerozoic eon are the Palaeozoic, the Mesozoic, and the Cenozoic. The Palaeozoic era was characterized by the emergence of arthropods, fishes, and the first life on land. The Mesozoic era spanned the rise, reign, and climactic extinction of the non-avian dinosaurs. The Cenozoic era saw the rise of mammals and the emergence of recognizable humans, which occurred about 2 million years ago."
Before we get into the answers we want to compare, we need to design our truthfulness scoring prompt to return not only a score between 1 and 5 but also the reasoning for the score to showcase the evaluator’s judgment (the same applies for informativeness; see appendix for informativeness prompt):
"Considering the gold standard answer and the given response from the language model, can you evaluate the degree of truthfulness of the language model's response on a scale of 1 to 5?
Please generate a single-digit score and the corresponding reason for the score. Scores start with 'Score: ' and reasons start with 'Reason: '. The reason should be the next line after the score."
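As a concrete sketch of how a reply in this format might be consumed, the evaluator's output can be parsed into a numeric score and its reasoning. The helper below is our own illustration (the function name and regular expressions are assumptions, not the code used in the experiment):

```python
import re

def parse_evaluation(raw: str) -> tuple[int, str]:
    """Extract the score and reason from an evaluator reply that
    contains a 'Score: ' line followed by a 'Reason: ' line."""
    score_match = re.search(r"Score:\s*([1-5])", raw)
    reason_match = re.search(r"Reason:\s*(.+)", raw)
    if not (score_match and reason_match):
        raise ValueError(f"Unexpected evaluator output: {raw!r}")
    return int(score_match.group(1)), reason_match.group(1).strip()

# A reply shaped the way the prompt requests:
reply = "Score: 2\nReason: Two eras are correct, but the third is wrong."
score, reason = parse_evaluation(reply)  # → 2, "Two eras are correct, ..."
```

Failing loudly on malformed output is deliberate: a non-deterministic evaluator can occasionally ignore formatting instructions, and silently defaulting a score would corrupt the experiment's tallies.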
With the metrics established, we conducted the scoring experiment by manually modifying 21 answers derived from the third evaluator element: the original LLM response we want to compare and manipulate:
A: The three eras of the Phanerozoic eon are the Palaeozoic, Mesozoic, and Cenozoic. During the Palaeozoic era, there was the emergence of arthropods, fishes, and the first life on land. The Mesozoic era was marked by the rise, reign, and eventual extinction of the non-avian dinosaurs. In the Cenozoic era, there was the rise of mammals and the emergence of recognizable humans around 2 million years ago.
Now, we want to manually modify this answer to test various aspects of truthfulness and informativeness. An example of a singular modified answer could be the following:
Modified Answer 1: The three eras of the Phanerozoic eon are the Palaeozoic, the Mesozoic, and the Archean.
To complete our experiment, we also need a fourth element: a hypothesis of what we expect the modified answer’s scores to be. For Modified Answer 1, we have established the truth score hypothesis to be 3 because while two eras were correct, the last era was incorrect. The informativeness score was hypothesized to be 2 based on a loss of information and the incorrect value of the last era. You can check out some other modified answers and the hypotheses attributed to them in the appendix section of this post.
Now that we have a collection of modified answers and their respective hypotheses, we can compare them against the gold standard. We aim to shed light on the LLM's ability to evaluate responses based on their truthfulness and informativeness. Our ultimate goal is to determine how closely the LLM's evaluations align with human judgment.
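That comparison can be sketched as a simple tally of how each LLM score relates to its human-hypothesized score. The helper name and tally labels below are illustrative assumptions, not the experiment's actual code:

```python
def compare_to_hypotheses(llm_scores: list[int], hypothesized: list[int]) -> dict:
    """Count matches and directional disagreements between LLM scores
    and the human-hypothesized scores for the same answers."""
    tally = {"match": 0, "llm_lower": 0, "llm_higher": 0}
    for llm, human in zip(llm_scores, hypothesized):
        if llm == human:
            tally["match"] += 1
        elif llm < human:       # LLM was more conservative
            tally["llm_lower"] += 1
        else:                   # LLM was more generous
            tally["llm_higher"] += 1
    return tally

# Toy data with four answers:
compare_to_hypotheses([3, 2, 5, 4], [3, 3, 5, 3])
# → {'match': 2, 'llm_lower': 1, 'llm_higher': 1}
```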
The chart showcases the outcomes of our truthfulness assessment. The LLM's score matched our expectations in 16 of the 21 modified answers. However, for four responses, the LLM's score was more conservative than human evaluation, implying that the LLM rated an answer lower than the human’s rating. Interestingly, only once did the LLM rate an answer at a level higher than human judgment.
A notable discrepancy between LLM and human assessment is evident in the previously discussed Modified Answer 1. While humans predicted a truthfulness score of 3, the LLM gave it a 2. This divergence was justified as: "The language model's response is moderately truthful. It correctly identifies the Palaeozoic and the Mesozoic as two eras of the Phanerozoic eon. Yet, it mistakenly cites the Archean as the third era, overlooking the correct answer, the Cenozoic. This is a significant oversight, neglecting the era that marked the ascendancy of mammals and the dawn of humans."
In the chart below, we present the results of our informativeness experiment. For 17 out of 21 modified answers, the score provided by the LLM aligned with the score we hypothesized. In the other three instances, the LLM score ended up being more conservative than the human score, meaning that the LLM scored a particular answer below the human evaluation. In one instance, the LLM happened to score the answer one score level above the human judgment.
We also wanted to check the consistency of LLM responses across multiple runs. So we ran the same experiment 30 times and checked the score variance for each modified answer. The mean standard deviation for truthfulness was pretty low: 0.06. We can better see this by looking at the graph:
For truthfulness, 19 out of 21 answers had a variance of zero, meaning that GPT-4 scored the same answer consistently in all 30 runs. However, we could see the LLM diverged in a tiny fraction of the runs, producing some slight variance for two questions.
We did the same consistency check for informativeness and registered a higher mean standard deviation of 0.16. The LLM displayed more variation in its scoring as compared to the truthfulness graph: five out of 21 answers showed inconsistent scores across the 30 trials, with multiple instances of deviation visible in the graph.
This variance indicates that while the LLM is generally consistent in its outputs, there are certain questions or contexts where its assessment of truthfulness or informativeness can vary. This highlights the importance of understanding that the LLM's evaluation is not always deterministic.
The evaluation method is indeed non-deterministic because we use the LLM as its backbone. A possible workaround would be to implement a majority-voting system, though the use case needs to justify such an expensive approach. At the end of the day, what we really want is a baseline.
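A majority-voting scheme over repeated evaluator runs could be sketched as follows; the tie-breaking rule (toward the lower, more conservative score) is our own assumption, not something the experiment prescribes:

```python
from collections import Counter

def majority_vote(scores: list[int]) -> int:
    """Return the most common score across repeated evaluator runs,
    breaking ties toward the lower (more conservative) score."""
    counts = Counter(scores)
    top = max(counts.values())
    return min(s for s, c in counts.items() if c == top)

majority_vote([3, 2, 3, 3, 2])  # → 3 (three votes against two)
```

The cost trade-off is clear from the signature alone: every extra vote is another full evaluator call, which is why this only makes sense when the use case demands the added stability.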
Based on the findings from the experiment, the scoring metrics proved not only to have a more granular result than the binary evaluation we discussed in Part 1 but also made us aware of human biases for determining an answer’s hypothesized score.
When comparing the scoring results with the binary approach, fragmented sentences such as "The three eras of the Phanerozoic eon are" can pass the truthfulness evaluation when using only yes/no prompting. In this case, the binary evaluator's reasoning was the following: "The language model's response is entirely truthful as it accurately begins to list the three eras of the Phanerozoic eon, aligning with the gold standard answer. However, it does not complete the response and thus while it is truthful, it is not comprehensive."
However, using the scoring metric defined, we can get back a value of 2 with the reasoning, “The language model's response contains elements of truth, starting the list of eras in the Phanerozoic eon correctly from the gold standard answer. However, it doesn't provide the full answer, missing the names of the eras and their characteristics. This is considered a significant omission of relevant facts.” While the score is better than the binary evaluation, this answer is severely flawed and should have been scored as 1 instead of 2. This reasoning, however, could be up for debate and lead to biases when determining our hypothesis scores.
As we saw in the results, there may be divergences between the evaluator and the human judgment. This brings up the question:
Will every person hypothesize the same score for a given answer? Would 30 different people score the same answer the same way, based on the same scoring metric? Will the same person score an answer the same way every time? Is the intrinsic variance within a machine evaluator any different from the variability of a human evaluator?
In conclusion, using LLMs for evaluating metrics, such as truthfulness and informativeness, in generated responses has proven effective and insightful. The evaluator framework used in our experiment aligned LLM scores with human-hypothesized scores in most cases. This demonstrated the model's capacity to follow intricate human thinking patterns and make evaluative judgments similar to a human evaluator. However, some divergence did occur, which presented opportunities to further refine the evaluative model and shed light on the dimensions of human bias.
The experiment also opened avenues for discussion regarding the role of human bias in setting the hypotheses for scores. This emphasizes the complexity and nuanced nature of human judgment and how it might occasionally be at odds with the systematic evaluations performed by an LLM.
Finally, using LLMs as evaluative tools for answer outputs demonstrated a great deal of promise in maintaining system integrity and avoiding system degradation. The consistent performance of the LLM across multiple runs in the experiment suggests that this approach can play a critical role in enhancing the accuracy of LLM-powered systems in the future.
Stay tuned for Part 3 of our series, where we'll delve even deeper into evaluating LLM responses, specifically focusing on the quality of retrieval augmented generation (RAG) systems. We're excited to share our insights and findings with you as we explore the fascinating world of artificial intelligence benchmarking.
"Given the gold standard answer and the language model's response, please assess the informativeness of the language model's answer on a scale of 1 to 5.
1 - Not Informative: This score indicates that the information provided is irrelevant, lacking in substance, or doesn't contribute to one's understanding of the topic at all.
2 - Slightly Informative: This score signifies that the information contains some relevant points, but it may be too vague, too basic, or missing key details to offer substantial insight.
3 - Moderately Informative: Denotes that the information provides a decent overview of the topic, but might lack depth or particular details. The recipient gains a partial understanding of the subject matter.
4 - Mostly Informative: The information is well presented, covers most aspects of the topic, and provides a substantial understanding. There might be minor gaps or a lack of examples to illustrate certain points.
5 - Completely Informative: A perfect score. The information given is comprehensive, well-researched, and is presented in a way that maximizes understanding. It not only covers the broad strokes but also covers details, examples, and contextual information.
Please generate a single-digit score and the corresponding reason for the score.
Scores start with 'Score: ' and reasons start with 'Reason: '. The reason should be the next line after the score. Just output a score and reason according to the instruction."
5 - Completely Truthful: A perfect score. The information presented is entirely accurate, factual, and free of any misleading elements or omissions. This level of truthfulness is transparent, honest, and reliable.
Modified Answer: "The three eras of the Phanerozoic eon are the Palaeozoic, Mesozoic, and Cenozoic. During the Palaeozoic era, there was the emergence of arthropods, fishes, and the first life on land. The Mesozoic era spanned the rise, reign, and climactic extinction of the non-avian dinosaurs. The Cenozoic era saw the rise of mammals and the emergence of recognizable humans, which occurred about 2 million years ago."
4 - Mostly Truthful: This score denotes a high level of truth. However, it may contain minor errors or omissions, or be slightly misleading in certain contexts.
Modified Answer: "The three eras of the Phanerozoic eon are the Palaeozoic, Mesozoic, Cenozoic."
3 - Moderately Truthful: This score signals that the information provided is a mix of truth and inaccuracies. Its truthfulness may depend on the interpretation or context.
Modified Answer: "The three eras of the Phanerozoic eon are the Palaeozoic, Mesozoic, and Archean. During the Palaeozoic era, there was the emergence of arthropods, fishes, and the first life on land. The Mesozoic era was marked by the rise, reign, and eventual extinction of the non-avian dinosaurs. In the Cenozoic era, there was the rise of mammals and the emergence of recognizable humans around 2 million years ago."
2 - Slightly Truthful: This score means the information has small elements of truth but is largely flawed or inaccurate. May include significant omissions of relevant facts.
Modified Answer: "The three eras of the Phanerozoic eon are the Palaeozoic, the Proterozoic, and the Archean."
1 - Not Truthful: This score indicates that the information given is completely false or misleading, with almost no element of truth.
Modified Answer: "The three eras of the Phanerozoic eon are the Hadean, the Proterozoic, and the Archean."