
Evaluating RAG: Using LLMs to Automate Benchmarking of Retrieval Augmented Generation Systems

NOTE: This article is Part III in our series on LLM Benchmarking. In Part I, we established how to benchmark LLM-based systems' accuracy to prevent system degradation. In Part II, we took a deeper dive into evaluating truthfulness as an element of LLM accuracy. If you need help setting up a cost-efficient evaluation process for your generative AI applications, the DART team at WillowTree is ready to help. Get started by learning about our eight-week GenAI Jumpstart program.


Having established the foundations of large language model (LLM) benchmarking in Parts I and II of this series, this article will explore using an LLM to evaluate the results of retrieval augmented generation (RAG) systems.

With retrieval augmented generation, you essentially hook up a database to a large language model and then bias a chatbot or AI-enabled assistant to retrieve information stored in that database rather than relying on broader external knowledge. We then use a separate LLM to evaluate whether the RAG system is performing as intended. More LLMs testing LLMs!

As we begin testing RAG systems, generating data, and interpreting the results, manual comparisons can become challenging, raising questions about bias, truth, and utility. Our premise for this experiment is: Can we automate the process to create a self-evaluating generative AI system?

In this article, we walk through how the Data and AI Research Team (DART) at WillowTree created an evaluation framework to compare an end-to-end RAG chatbot LLM output against a known ground truth. Using this evaluation methodology, we can better understand how to efficiently test a RAG system.

What is Retrieval Augmented Generation?

RAG systems utilize embeddings and semantic search to retrieve stored knowledge that augments a large language model’s external knowledge (aka, “world knowledge”). We can think of embeddings as a mathematical representation of an object that captures similarities or relationships.

Essentially operating as AI-powered search engines, RAG systems retrieve and summarize specific data based on semantic similarity. By storing information that the LLM doesn’t necessarily know in a database, we can use RAG to query organization-specified documents and return the information a chatbot or AI assistant needs to answer questions. Manual and automated evaluation matter in these systems because they verify the accuracy and effectiveness of the responses.
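
To make "semantic similarity" concrete, here is a minimal sketch of how a retriever might rank stored chunks against a query using cosine similarity. The three-dimensional vectors and chunk names are made up for illustration; a real system would get its embeddings from an embedding model and would usually keep them in a vector database.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings (hypothetical values, not produced by a real model)
chunks = {
    "photosynthesis_chunk": [0.9, 0.1, 0.0],
    "billing_policy_chunk": [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]

# Pick the stored chunk whose embedding is closest to the query
best_chunk = max(
    chunks, key=lambda name: cosine_similarity(query_embedding, chunks[name])
)
print(best_chunk)  # photosynthesis_chunk
```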

But how do we evaluate the RAG LLM response? How do we know that what we are getting back is correct? Let’s think about how we would test this manually before showing how to automate it.

Manual testing would involve the following steps:

  1. Create a series of questions, expected answers, and expected sources — your “gold standard dataset.”
  2. Ask the chatbot a question from your dataset.
  3. Receive the chatbot LLM response.
  4. Compare the LLM response to the expected answer.
  5. Assign a Yes/No determination or score to the LLM response (subject to personal bias).
  6. Compare the expected source to the source returned by the LLM.
  7. Document and share results.

We would then repeat steps 2 through 7 for the next question for that source page, then start over at step 1 for the next source page. Rinse and repeat.

In short, manually creating and evaluating hundreds of prompts and responses is downright awful.

Now, let’s explore how to transform these manual processes into an automated series of tests using the evaluation framework.

Building an Evaluation Framework for RAG

The following walkthrough will be broken into steps to better illustrate the automated process.

To set the stage, let's recap our goal when evaluating our RAG system:

Test the LLM response to ensure that the information returned from the LLM is an appropriate, correct, and accurate response to the user query.

  • This testing often involves looking through results and determining if the LLM response accurately passed back an appropriate response to the user. Several metrics can be used, but for this experiment, we will focus on accuracy and sources.

1. Context - Understanding what is stored in the “knowledge” database

In our example, we will grab some data regarding the earliest evidence of photosynthetic organisms, using this information as a stand-in for the kind of information an organization might include in its database.

For the purposes of this experiment, we’ll consider the data chunk below, taken from an article in the scientific journal Plant Physiology and hosted in the National Library of Medicine’s database, which suggests photosynthetic organisms may have been present as early as 3.5 billion years ago. While perhaps controversial, we are not debating the scientific accuracy of this data chunk as compared to others; we’re explicitly training a RAG system to consider this passage as our source of truth.

An untrained model might find other sources — for instance, this article from Wikipedia — that suggest evidence of photosynthetic organisms appearing only as early as 3.2 billion years ago. We don’t want the LLM drawing from this source but rather from the sources we specify.

This is the point of using RAG systems: for many use cases and specific tasks, especially in highly regulated industries like financial services or healthcare, organizations strive to retain total control over the knowledge database an LLM references. Doing so ensures the system responds only with approved information that conforms to regulatory frameworks.

We’re using this paragraph from Plant Physiology as an example of an expected chunk we might find in our knowledge store that we DO want our LLM to use as a source. In other words, we are evaluating to ensure the LLM is NOT referencing broader world knowledge and returning answers based on, in this case, Wikipedia.

# NIH - Early Evolution of Photosynthesis
chunk = 'There is suggestive evidence that photosynthetic organisms were present approximately 3.2 to 3.5 billion years ago, in the form of stromatolites, layered structures similar to forms that are produced by some modern cyanobacteria, as well as numerous microfossils that have been interpreted as arising from phototrophs (Des Marais, 2000).'

2. Gold Standard Dataset - Questions, Answers, Sources

To evaluate RAG, we need a gold standard (GS) dataset against which to test. This gold standard will be our ground truth question/answer set. There are several ways to create this dataset: we can write entries directly from our knowledge chunks, mine our chatbot usage logs for frequently asked questions, or draw on internal bug bashes to find edge cases. For this illustration, let’s create a simple example of a question, answer, and source based on the data chunk outlined above. We will create both a correct and an incorrect entry to test against, establishing our source of truth for the evaluation.

As Part II of this series explains, we can create our own question/answer datasets or use an LLM to help us generate these datasets. By passing in knowledge chunks and asking an LLM to create questions and answers, we can quickly generate large question/answer datasets across multiple pages. While the automatic generation of these datasets makes the process easier, these datasets do need to be reviewed by humans and, therefore, incur some costs.

# Example QA generation prompt
# (the variable name QA_GENERATION_PROMPT is illustrative)
QA_GENERATION_PROMPT = "Given the following paragraph, please generate a single question and its corresponding answer.\n Questions start with a 'Q:' and answers start with an 'A:'. The answer should be on the next line after the question. Just output a question and answer according to the instruction."
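
Once an LLM replies in the 'Q:' / 'A:' format that the prompt above requests, the text still has to be split into a question and an answer before it can join the dataset. The sketch below shows one way to do that; `parse_qa` is a hypothetical helper, and the sample string is hand-written rather than real model output.

```python
def parse_qa(generated: str) -> tuple[str, str]:
    """Split LLM output containing 'Q:' and 'A:' lines into (question, answer)."""
    question, answer = None, None
    for line in generated.splitlines():
        line = line.strip()
        # Keep only the first question and the first answer we encounter
        if line.startswith("Q:") and question is None:
            question = line[2:].strip()
        elif line.startswith("A:") and answer is None:
            answer = line[2:].strip()
    if not question or not answer:
        raise ValueError(f"Could not find both a question and an answer in: {generated!r}")
    return question, answer

# Hand-written string standing in for a real LLM completion
sample_output = "Q: When did photosynthetic organisms emerge?\nA: Between 3.2 and 3.5 billion years ago."
print(parse_qa(sample_output))
```

Raising an error on malformed output, rather than returning empty strings, makes it easier to flag generated pairs that need human review.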

Note: Consider storing your question/answer datasets in a .csv file or database. If your RAG system returns source information, you might consider also storing source information in your question/answer dataset. We will hard code some variables to showcase the system and make it easier to understand.
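
As a rough sketch of the .csv option, the snippet below reads gold standard rows with Python's standard `csv` module. The column names and the `plant_physiology_article` source identifier are assumptions made for this example, and an in-memory string stands in for a file on disk.

```python
import csv
import io

# Stand-in for a file on disk; column names and source id are hypothetical
GOLD_STANDARD_CSV = """question,answer,source
When did photosynthetic organisms emerge?,Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago.,plant_physiology_article
"""

def load_gold_standard(csv_file) -> list[dict]:
    """Read gold standard rows into a list of dicts keyed by column name."""
    return list(csv.DictReader(csv_file))

gold_standard = load_gold_standard(io.StringIO(GOLD_STANDARD_CSV))
for row in gold_standard:
    print(row["question"], "->", row["answer"])
```

To read a real file instead, pass `open("gold_standard.csv", newline="")` (a hypothetical path) in place of the `io.StringIO` object.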

question = 'When did photosynthetic organisms emerge?'

answer = 'Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago.'

source = ''

wrong_answer = 'Photosynthetic organisms emerged between 3.2 and 2.4 billion years ago.'

wrong_source = ''

3. RAG Request/Response - Generate an answer from the chatbot to compare with the gold standard

Now that we understand our context and question/answer/sources dataset, we can pass our generated question(s) to the RAG chatbot and compare its response to the gold standard answer. We don’t need to worry about the embeddings or semantic search behind the request, because those details are abstracted away from the evaluation: we only see the question going in and the response coming out.

Let’s assume your RAG chatbot can be called through an HTTP API. If we iterate through our question dataset, a typical request/response could look like the following:

# Question passed into the RAG Chatbot
question = "When did photosynthetic organisms emerge?"

# This would actually be populated from calling our RAG system API
rag_api_response = {
    "sources": [""],
    "chatbot_response": [
        {
            "message": "Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago.",
        }
    ],
    "cost": "0.001662",
}

One thing to note in the response is the call out to cost. It’s important to track and understand the cost and/or “tokens” used during the chatbot API calls to effectively calculate the cost of running the chatbot as well as evaluating it. Let’s look at what is returned and what we will be storing for evaluation.

rag_source = rag_api_response['sources']
rag_answer = rag_api_response['chatbot_response'][0]['message']
rag_cost = float(rag_api_response['cost'])

faked_rag_response = [rag_answer, rag_source, rag_cost]

['Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago.', [''], 0.001662]
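
Iterating over a whole question dataset could look something like the sketch below. It assumes the response shape shown above; `extract_rag_fields`, `run_questions`, and the `fake_ask_rag` stub are hypothetical names, and the stub stands in for whatever HTTP call your chatbot actually exposes.

```python
from typing import Callable

def extract_rag_fields(rag_api_response: dict) -> list:
    """Pull the answer, sources, and cost out of one chatbot response."""
    rag_answer = rag_api_response["chatbot_response"][0]["message"]
    rag_source = rag_api_response.get("sources", [])
    rag_cost = float(rag_api_response.get("cost", 0))
    return [rag_answer, rag_source, rag_cost]

def run_questions(questions: list[str], ask_rag: Callable[[str], dict]) -> list[list]:
    """Send every question to the chatbot and collect the fields we evaluate."""
    return [extract_rag_fields(ask_rag(q)) for q in questions]

# Hypothetical stub standing in for the real HTTP call to the chatbot
def fake_ask_rag(question: str) -> dict:
    return {
        "sources": ["plant_physiology_article"],
        "chatbot_response": [{"message": f"Stub answer to: {question}"}],
        "cost": "0.001662",
    }

responses = run_questions(["When did photosynthetic organisms emerge?"], fake_ask_rag)
print(responses)
```

Passing the API call in as a function keeps the loop testable without network access; in a real run, `ask_rag` would wrap the HTTP request.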

4. Evaluation - Does the chatbot response match the expectation?

Ultimately, RAG evaluation compares two pieces of text based on a metric. We will compare the gold standard question/answer/source to the response from our chatbot. Existing tools such as ragas or MLflow can evaluate responses; in this post, we take a behind-the-scenes look at implementing a simple evaluator with an LLM call.

Let’s define a metric we will showcase for evaluation; in this example, we will use accuracy. Depending on the metric in question, this can be specific to the type of response you want to test against. In addition, multiple metrics can be used, but be aware of costs associated with more extensive or additional LLM calls.

# Metric Prompt for Evaluation
ACCURACY_PROMPT = "Looking at the gold standard answer \n'{answer}'\n and the language model's answer \n'{faked_rag_response}'\n, would you state that the language model's answer is completely accurate?"

# Evaluate Answers
# Assumption: OpenAI Python SDK (v1+) chat completions call with GPT-4; adapt to your provider
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def evaluate_answers(metric_prompts: list[str]) -> tuple[list[str], int, int]:
    """Main function to evaluate answers.

    Args:
        metric_prompts (list[str]): Prompts to be evaluated

    Returns:
        (list[str], int, int): A list with prompts evaluation results,
            plus the input and completion tokens used
    """
    evaluations = []
    input_tokens = 0
    completion_tokens = 0
    for prompt in metric_prompts:
        text = prompt
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {
                    "role": "system",
                    "content": "You are an LLM evaluation assistant that will evaluate the given information. Evaluate the following results with 'Yes/No' followed by the 'reason' on a newline. 'Yes/No\nreason'. If you are unsure, please respond with 'I don't know.\nNo'.",
                },
                {"role": "user", "content": text},
            ],
        )
        # Keep the evaluator's reply and tally token usage for cost tracking
        evaluations.append(response.choices[0].message.content)
        input_tokens += response.usage.prompt_tokens
        completion_tokens += response.usage.completion_tokens

    return evaluations, input_tokens, completion_tokens

4a. Augmentation Check

The first test we will showcase is the LLM rewrite. We will compare the GS answer against the LLM response from our chatbot. In the following examples, we will showcase both a correct and incorrect answer during the comparison. Using the function above, we can check out how the evaluator works.

accuracy = ACCURACY_PROMPT.format(answer=answer, faked_rag_response=rag_answer)
evaluations, eval_input_tokens, eval_completion_tokens = evaluate_answers([accuracy])

Looking at the gold standard answer
'Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago.'
and the language model's answer
'Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago.'
, would you state that the language model's answer is completely accurate?
["Yes\nThe language model's answer is completely accurate because it matches the gold standard answer exactly."]

bad_accuracy = ACCURACY_PROMPT.format(answer=wrong_answer, faked_rag_response=rag_answer)
bad_evaluations, bad_eval_input_tokens, bad_eval_completion_tokens = evaluate_answers([bad_accuracy])

Looking at the gold standard answer
'Photosynthetic organisms emerged between 3.2 and 2.4 billion years ago.'
and the language model's answer
'Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago.'
, would you state that the language model's answer is completely accurate?
["No\nThe language model's answer is not completely accurate because it states that photosynthetic organisms emerged between 3.2 and 3.5 billion years ago, which is different from the gold standard answer that states they emerged between 3.2 and 2.4 billion years ago."]
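
Because the evaluator replies in free text, it helps to turn each reply into a structured pass/fail value before aggregating results. The following is a minimal sketch, assuming the two-part verdict-then-reason format requested in the system prompt; `parse_evaluation` is a hypothetical helper, not part of the original framework.

```python
def parse_evaluation(evaluation: str) -> tuple[bool, str]:
    """Turn the evaluator's two-part reply into (passed, reason)."""
    # Split only on the first newline so a multi-line reason stays intact
    verdict, _, reason = evaluation.strip().partition("\n")
    # Only an explicit "Yes" counts as a pass; "No", "I don't know.", or
    # anything unexpected is treated as a failure to stay conservative.
    passed = verdict.strip().lower().rstrip(".") == "yes"
    return passed, reason.strip()

print(parse_evaluation("Yes\nThe answers match exactly."))  # (True, 'The answers match exactly.')
print(parse_evaluation("I don't know.\nNo"))                # (False, 'No')
```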

4b. Retrieval Check

The next thing we want to test is the sources returned.

# check if the rag response for sources matches the source
def source_check(responses: list[str], source: str) -> bool:
    """Checks if the source in the rag response matches the source in the prompt

    Args:
        responses: list of rag responses
        source: source in the prompt

    Returns:
        True if the source in the rag response matches the source in the prompt
        False otherwise
    """
    for response in responses:
        if response == source:
            print("LLM Response:", response, "\nGS Source:", source)
            print('The source matches the source in the prompt')
            return True
        print("LLM Response:", response, "\nGS Source:", source)
        print('The source does NOT match the source in the prompt')
    return False

source_result = source_check(rag_source, source)

LLM Response:
GS Source:
The source matches the source in the prompt

bad_source_result = source_check(rag_source, wrong_source)

LLM Response:
GS Source:
The source does NOT match the source in the prompt

5. Review Results

5a. Cost

It’s important to understand the cost of using an LLM to evaluate the output of a RAG query. Asking the evaluator to return the reason for its verdict increases the number of completion tokens, which increases the cost.

# Print the RAG cost with a precision of 6 decimal places
print(f"RAG Cost: ${round(rag_cost, 6):.6f}")

# Print the number of input tokens and completion tokens for the Chatbot Evaluation
print(f"\nChatbot Evaluation:\n Input tokens = {eval_input_tokens}\n Completion tokens = {eval_completion_tokens}")

# Calculate and print the evaluation cost
eval_cost = eval_input_tokens / 1000 * 0.003 + eval_completion_tokens / 1000 * 0.06
print(f" Evaluation Cost: ${round(eval_cost, 6):.6f}")
print(f"\nBad Chatbot Evaluation:\n Input tokens = {bad_eval_input_tokens}\n Completion tokens = {bad_eval_completion_tokens}")

# Calculate and print the bad evaluation cost
bad_eval_cost = bad_eval_input_tokens / 1000 * 0.003 + bad_eval_completion_tokens / 1000 * 0.06
print(f" Bad Evaluation Cost: ${round(bad_eval_cost, 6):.6f}")

# Calculate the total evaluation cost considering the cost of OpenAI GPT-4 and print it
total_input_tokens = eval_input_tokens + bad_eval_input_tokens
total_completion_tokens = eval_completion_tokens + bad_eval_completion_tokens

total_cost = (total_input_tokens / 1000 * 0.003) + (total_completion_tokens / 1000 * 0.06) + rag_cost
total_cost = round(total_cost, 6)
print(f"\nTotal Cost: ${total_cost:.6f}")

RAG Cost: $0.001662

Chatbot Evaluation:
Input tokens = 136
Completion tokens = 19
Evaluation Cost: $0.001548

Bad Chatbot Evaluation:
Input tokens = 136
Completion tokens = 60
Bad Evaluation Cost: $0.004008

Total Cost: $0.007218
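
The arithmetic behind these figures is token count divided by 1,000, multiplied by the per-1,000-token rate, summed over input and completion tokens. For the first evaluation that is 136 / 1000 * 0.003 + 19 / 1000 * 0.06 = 0.000408 + 0.00114 = 0.001548. A small helper makes the rates explicit parameters; `estimate_cost` is a hypothetical name, and the rates are simply the ones used in the example above, so substitute your model's current pricing.

```python
def estimate_cost(input_tokens: int, completion_tokens: int,
                  input_rate_per_1k: float, completion_rate_per_1k: float) -> float:
    """Cost in dollars given token counts and per-1,000-token rates."""
    return (input_tokens / 1000 * input_rate_per_1k
            + completion_tokens / 1000 * completion_rate_per_1k)

# Rates copied from the example above; replace with your model's pricing
INPUT_RATE = 0.003
COMPLETION_RATE = 0.06

print(f"${estimate_cost(136, 19, INPUT_RATE, COMPLETION_RATE):.6f}")  # $0.001548
print(f"${estimate_cost(136, 60, INPUT_RATE, COMPLETION_RATE):.6f}")  # $0.004008
```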

5b. Results

Sharing results is crucially important once we evaluate a RAG response. Where do we save our results: in a .csv file, in a database, or in our experiment tracker? Ultimately, it is up to the team to determine where these metrics can be reviewed and shared during development.

The breakdown of the different outputs above showcases what we need when reviewing the evaluator responses. When it comes to sharing results, the findings must be actionable. How is your team saving and sharing these results with developers and stakeholders? What do you do when something fails or behaves incorrectly? If you're using a fine-tuned model, how do you improve it?

def generate_results(question: str, answer: str, rag_answer: str, binary_evaluation: str, binary_reason: str, source_result: bool):
    """ Gather evaluation result to print a full report of the results"""
    print(f'''
Question: {question}
Answer: {answer}
LLM Response: {rag_answer}
Accuracy Pass/Fail: {binary_evaluation}
Reasoning: {binary_reason}
Source Pass/Fail: {source_result}''')

# Split on the first newline only, so a multi-line reason does not break unpacking
binary_evaluation, binary_reason = evaluations[0].split('\n', 1)

generate_results(question, answer, rag_answer, binary_evaluation, binary_reason, source_result)

Question: When did photosynthetic organisms emerge?          
Answer: Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago.          
LLM Response: Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago.          
Accuracy Pass/Fail: Yes          
Reasoning: The language model's answer is completely accurate because it matches the gold standard answer exactly.          
Source Pass/Fail: True
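
If the team settles on a .csv file for sharing, the report fields can be written out with Python's standard `csv` module. This is a minimal sketch: the column names and the `write_results` helper are assumptions for illustration, and an in-memory buffer stands in for a file.

```python
import csv
import io

# Hypothetical column names for the shared results file
FIELDNAMES = ["question", "answer", "rag_answer", "accuracy", "reason", "source_match"]

def write_results(rows: list[dict], out_file) -> None:
    """Write one CSV row per evaluated question."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)

results = [{
    "question": "When did photosynthetic organisms emerge?",
    "answer": "Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago.",
    "rag_answer": "Photosynthetic organisms emerged between 3.2 and 3.5 billion years ago.",
    "accuracy": "Yes",
    "reason": "The language model's answer is completely accurate because it matches the gold standard answer exactly.",
    "source_match": True,
}]

buffer = io.StringIO()  # swap for open("results.csv", "w", newline="") to write a file
write_results(results, buffer)
print(buffer.getvalue())
```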


Using an evaluation framework to measure the responses from our RAG chatbot provides insights into the quality of the LLM's responses. We need an evaluation framework because the results from an LLM are not deterministic, so the same question can produce differently worded answers. We can, however, use LLMs to our advantage: as shown above, an LLM can compare two answers and provide a meaningful determination for different metrics. Accuracy is a simple metric for comparing a response against a desired result.

Like writing automated tests, using an evaluation method on your RAG chatbot helps ensure quality. Also as with tests, the results are only meaningful if we’ve constructed good-quality datasets and metrics to use during evaluation.

With the code above as a guideline, a Python file can be created and run to evaluate your RAG system. This file can be used locally or through CI/CD. The next steps are to consider larger datasets and understand how your team will use and share these results with developers and stakeholders.
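
For a CI/CD run, one possible approach is to turn the per-question outcomes into a pass rate and fail the pipeline when it drops below an agreed threshold. The sketch below assumes a list of boolean outcomes from the evaluator; the function names and the 0.9 default threshold are hypothetical choices, not recommendations from the original experiment.

```python
def pass_rate(outcomes: list[bool]) -> float:
    """Fraction of evaluated questions that passed."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

def ci_exit_code(outcomes: list[bool], threshold: float = 0.9) -> int:
    """0 if the pass rate meets the threshold, 1 otherwise."""
    return 0 if pass_rate(outcomes) >= threshold else 1

# Hypothetical outcomes; in practice these come from the evaluator
outcomes = [True, True, True, False]
exit_code = ci_exit_code(outcomes)
print(f"Pass rate: {pass_rate(outcomes):.0%}, exit code: {exit_code}")
# In a CI script, finish with sys.exit(exit_code) so a low pass rate fails the job
```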


If you haven’t reviewed Part I and Part II of this series on LLM benchmarking, you may find additional answers to your questions in those articles.

Christopher Frenchi
