Continuous Evaluation of Generative AI Using CI/CD Pipelines

Iago Brandão
Senior Data Scientist
January 16, 2024

The adoption of artificial intelligence continues at a robust pace. According to IBM’s Global AI Adoption Index, 42% of enterprise businesses have actively deployed AI, with an additional 40% actively exploring AI solutions. Overall, we’re adopting AI faster than we adopted smartphones.

Most AI applications today are powered by large language models (LLMs). LLMs are a powerful, exciting technology. But they also present new risks, including this mission-critical challenge: How can you ensure your LLM-based solution behaves as expected at every new release or even at every development change?

Without continuous evaluation and monitoring, LLM-based AI systems may drift, exhibit undesirable behaviors, or simply stop working correctly, leading to:

  • task failures
  • a frustrating customer experience
  • financial impacts from improperly handling sensitive data

To help our clients prevent system degradation, WillowTree's Data and AI Research Team (DART) integrates the continuous evaluation of LLMs within continuous integration and continuous delivery (CI/CD) pipelines, helping to ensure AI safety.

Typical flow of a CI/CD pipeline with the integration of continuous LLM evaluation

These CI/CD pipelines are integral to modern agile software development methodologies. They ensure changes to the code base are reliable and stable and do not disrupt existing features. That gives us the potential to use these pipelines for automating LLM evaluations.

How to Set Up Continuous LLM Evaluation in Your CI/CD Pipelines

For this example scenario, we’ll use GitHub Actions for CI/CD pipelines and OpenAI's Python package for the evaluation process. But you can choose any tool, LLM, and language you want.

First, let’s start with the CI pipeline, where early tests are usually run. Configure the pipeline in a YAML file placed in the .github/workflows folder:

name: CI Step    # Name of the CI/CD pipeline

on:    # Events that trigger this workflow
  pull_request:    # This workflow gets triggered on pull requests
    branches: [ main ]  # Particularly, pull requests targeting the main branch

jobs:   # Here we define the jobs to be run 
  build:  # Defining a job called build
    runs-on: ubuntu-latest   # The type of runner that the job will run on

    steps:  # Steps represent a sequence of tasks that will be executed as part of the job
    - name: Check out repo  # Step 1: It is checking out the repository   
      uses: actions/checkout@v2 # Using Github’s official checkout action to accomplish this

    - name: Setup Python  # Step 2: It is setting up Python environment
      uses: actions/setup-python@v4 # Using an action from the marketplace to setup Python
      with:  # Inputs passed to the action
        python-version: '3.10' # Specifies the Python version

    - name: Install Dependencies  # Step 3: Installing project dependencies
      run: |  # Running the commands
        python -m pip install --upgrade pip  
        pip install -r requirements.txt  # Requirements file should list all Python dependencies

    - name: Run Tests (Replace with your test command)  # Step 4: Running evaluation
      run: |
        python tests/ 

Code the continuous evaluation process in a Python file, following the logic for evaluating LLMs for truthfulness. To summarize, the main steps are:

  1. Identify your gold standard questions (GS Q) and answers (GS A).
  2. Design a metric prompt that returns a score between 1 and 5 along with a judgment explaining that score.
  3. Ask your LLM-based AI solution to answer the gold standard questions.
  4. Compare the gold standard answers to the LLM’s responses using the metric prompt.
  5. Store the evaluation in a CSV file or upload it to a database.
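
The five steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `Score: <1-5>` reply format, the CSV column names, and the `answer_fn` and `judge_fn` callables are all assumptions. In practice, `answer_fn` would wrap your LLM-based solution and `judge_fn` would wrap a call to the evaluator LLM.

```python
import csv
import re
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical metric prompt; the wording is an assumption, not the authors' prompt.
METRIC_PROMPT = (
    "You are grading an answer for truthfulness.\n"
    "Question: {question}\n"
    "Gold standard answer: {gold}\n"
    "Candidate answer: {candidate}\n"
    "Reply with 'Score: <1-5>' on the first line, then a one-sentence judgment."
)

FIELDS = ["question", "gold_answer", "llm_answer", "score", "judgment"]


def parse_score(judge_reply: str) -> int:
    """Extract the 1-5 score from the judge's reply; raise if it is missing."""
    match = re.search(r"Score:\s*([1-5])\b", judge_reply)
    if match is None:
        raise ValueError(f"No score found in judge reply: {judge_reply!r}")
    return int(match.group(1))


def evaluate(
    gold_standard: Iterable[Tuple[str, str]],  # (GS Q, GS A) pairs
    answer_fn: Callable[[str], str],           # your LLM-based solution
    judge_fn: Callable[[str], str],            # the evaluator LLM
    out_path: str,
) -> List[Dict[str, object]]:
    """Answer each gold standard question, score the answer, and save a CSV."""
    rows: List[Dict[str, object]] = []
    for question, gold in gold_standard:
        candidate = answer_fn(question)
        prompt = METRIC_PROMPT.format(
            question=question, gold=gold, candidate=candidate
        )
        reply = judge_fn(prompt)
        rows.append({
            "question": question,
            "gold_answer": gold,
            "llm_answer": candidate,
            "score": parse_score(reply),
            "judgment": reply,
        })
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Passing the two LLM calls in as plain functions keeps the loop independent of any particular provider and lets you exercise it with stubs before spending tokens.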

Here’s how that process looks as a workflow.

Workflow diagram showing how to evaluate a large language model (LLM) for truthfulness

When triggered, the CI pipeline should run the evaluation Python script and store the results. The CI pipeline should fail with an error message if the scores generated fall below a minimum required threshold. This monitoring will help avoid an undesired change reaching the CD pipeline and going into production.
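
One way to turn stored scores into a pass or fail signal is a small gate function whose return value becomes the process exit code, since a CI step fails when its command exits with a non-zero status. The sketch below assumes the results live in a CSV file with an integer `score` column and that the mean score is the quantity being gated; the function names and the threshold value are hypothetical.

```python
import csv
import statistics
import sys


def mean_score(csv_path: str) -> float:
    """Read the 'score' column of the results CSV and return its mean."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        scores = [int(row["score"]) for row in csv.DictReader(f)]
    if not scores:
        raise ValueError(f"No scores found in {csv_path}")
    return statistics.mean(scores)


def gate(csv_path: str, threshold: float) -> int:
    """Return an exit code: 0 if the mean score meets the threshold, else 1."""
    mean = mean_score(csv_path)
    if mean < threshold:
        print(
            f"Evaluation failed: mean score {mean:.2f} is below {threshold:.2f}",
            file=sys.stderr,
        )
        return 1
    print(f"Evaluation passed: mean score {mean:.2f}")
    return 0


# In a CI script, the exit code would be propagated like this
# (file name and threshold are placeholders):
#     sys.exit(gate("results.csv", threshold=4.0))
```

Gating on the mean is only one choice. Failing on the minimum score, or on any single score below the threshold, is stricter and may suit safety-sensitive applications better.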

And with that, we’ve integrated continuous evaluation of LLM performance into our CI/CD pipelines. But our work isn’t done yet. We still need to make key decisions, like what to measure and how often to measure it, before making meaningful efficiency gains in our LLM evaluation process.

Guidance for Continuous Evaluation of LLMs

To continuously evaluate your AI application properly, we recommend prioritizing the following topics when creating your evaluation strategy.

Know your resource constraints

When integrating a continuous evaluation process into your CI/CD pipelines, consider what should be measured and when based on constraints such as budget, time, and labor. In our case, we chose to measure LLM truthfulness as our key performance indicator (KPI). Depending on your constraints, a different KPI (e.g., reasoning, common sense, toxicity) may be a better metric for evaluating your AI system.

Tip: You can track token usage and time spent running two or more measurements during your evaluation to help decide whether to include them in your CI/CD pipeline.
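
A lightweight way to collect these numbers is to wrap each measurement in a timer and record the tokens it consumed. The sketch below is an illustration only: `RunStats`, `measure`, and `run_fn` are hypothetical names, and `run_fn` is assumed to return the number of tokens it used. With OpenAI's Python package, that count can be read from the `usage.total_tokens` field of a chat completion response.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunStats:
    name: str       # which measurement was run, e.g. "truthfulness"
    seconds: float  # wall-clock duration of the run
    tokens: int     # total tokens reported by the run


def measure(name: str, run_fn: Callable[[], int]) -> RunStats:
    """Time one measurement; run_fn must return the tokens it consumed."""
    start = time.perf_counter()
    tokens = run_fn()
    elapsed = time.perf_counter() - start
    return RunStats(name=name, seconds=elapsed, tokens=tokens)
```

Comparing the `RunStats` of two candidate measurements side by side gives a concrete basis for deciding which ones are affordable on every pipeline run.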

Decide when to evaluate your LLM-based system

The timing and frequency of your continuous evaluation strategy is a key decision. For example, evaluating your application at every commit can lead to a significant increase in lead time. Alternatively, evaluating only at merges could work as a behavior double-check after implementing a modification. Evaluation should happen often enough to ensure your LLM-based AI system behaves as expected, for example, each time a pull request is opened.

Leverage historical data from legacy AI applications

By keeping the results of past LLM evaluations, you can analyze how your application has evolved over time. This gives you an even richer understanding of your AI application’s behavior.

Tip: Consider storing the evaluation results in a persistent file storage system or database. Check carefully that the results are stored properly for future usage.
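
As a minimal sketch of persistent storage, the example below uses Python's built-in `sqlite3` module. The table name, its columns, and the helper functions are assumptions made for illustration; a production setup might use a managed database instead.

```python
import sqlite3
from datetime import datetime, timezone
from typing import List, Tuple


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) eval_runs table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS eval_runs ("
        "run_at TEXT NOT NULL, commit_sha TEXT NOT NULL, "
        "metric TEXT NOT NULL, mean_score REAL NOT NULL)"
    )
    conn.commit()


def record_run(
    conn: sqlite3.Connection, commit_sha: str, metric: str, mean_score: float
) -> None:
    """Store one evaluation result with a UTC timestamp."""
    run_at = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO eval_runs VALUES (?, ?, ?, ?)",
        (run_at, commit_sha, metric, mean_score),
    )
    conn.commit()


def score_history(conn: sqlite3.Connection, metric: str) -> List[Tuple[str, float]]:
    """Return (commit_sha, mean_score) pairs in insertion order."""
    cursor = conn.execute(
        "SELECT commit_sha, mean_score FROM eval_runs "
        "WHERE metric = ? ORDER BY rowid",
        (metric,),
    )
    return cursor.fetchall()
```

Reading the history back after writing it, as `score_history` does, is also a simple way to check that results were stored properly.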

Consider a retrieval augmented generation (RAG) system

Because a retrieval augmented generation system adds new or custom information to an LLM’s trained knowledge, you can use it with your gold standard questions (GS Q) and gold standard answers (GS A) to compare LLM evaluations over time and spot any discrepancies. This might be the most crucial takeaway because it enables you to identify undesired behavior and act on it in the early stages.

Cost Considerations When Evaluating LLM-based Systems

Cost is one of the biggest concerns when evaluating and benchmarking LLMs.

Cost estimation will depend on your solution and evaluation architecture. From there, adopt at least one metric for cost tracking. For a managed LLM such as OpenAI’s GPT-4 Turbo, it could be token usage. For a self-managed LLM such as the Llama 2 model, it could be execution duration.

Whichever direction you go, some key considerations should be part of your cost estimation.

AI assistant

When estimating the costs of evaluating your LLM-based system, some key items should be considered, such as expected incoming requests, average time by request, and average in and out token count by request. If using a RAG system, take your estimated vector database usage into account.


Factor in how often you expect the evaluator to assess your AI assistant. From there, calculate how long it takes to execute each evaluation, and how many tokens in and out are used. This is important because it will help shape our evaluation strategy.
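
For a managed LLM billed by token, the cost of a single evaluation run can be approximated from these averages. The function below is a rough sketch: its name and parameters are hypothetical, the token averages are assumed to cover both the assistant call and the evaluator call for each question, and the prices in the usage note are placeholders rather than real rates.

```python
def cost_per_evaluation(
    num_questions: int,
    avg_tokens_in: float,
    avg_tokens_out: float,
    price_in_per_1k: float,
    price_out_per_1k: float,
) -> float:
    """Estimate the token cost of one evaluation run over all questions."""
    per_question = (
        (avg_tokens_in / 1000) * price_in_per_1k
        + (avg_tokens_out / 1000) * price_out_per_1k
    )
    return num_questions * per_question
```

For example, 50 gold standard questions averaging 800 input and 200 output tokens, at placeholder prices of 0.01 and 0.03 per 1,000 tokens, would cost roughly 0.70 per run in the same currency unit.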

For instance, if we need to track every possible minor change, we can run the AI evaluator at each new commit, which may mean many evaluations. But if we only need to evaluate bigger releases, we can assess performance only at merges or pull requests. Once we’re confident about expected behavior and ready to accept all possible risks, we can run the AI evaluation only on demand.

Evaluation strategy

The evaluator’s estimated cost will be multiplied by the number of times it runs (N times) in your development and deployment stages. Depending on your strategy, the evaluator may run many times, just a few times, or only on demand. To account for this, budget the cost of evaluating your AI system at each new commit, at every open pull request, on merge into the main branch, or through workflow dispatch (i.e., manually).

Here’s what that evaluation strategy looks like illustrated as a formula.
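
The formula figure is not reproduced in this text. A minimal reconstruction from the preceding paragraph, with symbol names that are assumptions rather than the authors' notation, might read:

```latex
C_{\text{total}} = N \times C_{\text{eval}},
\qquad
N = N_{\text{commit}} + N_{\text{pull request}} + N_{\text{merge}} + N_{\text{manual}}
```

Here $C_{\text{eval}}$ is the estimated cost of one evaluation run, and each term of $N$ counts the runs triggered by that event type over the budgeting period. A term is zero for any trigger your strategy does not use.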

Note that accounting for the differences between deployment models can lead to better cost estimates. With a self-managed LLM, for example, you need to take into account the computation, storage, and network layers, asking guiding questions such as:

  • Are there costs to making the LLM available?
  • Do we need a public cloud provider, or do we have enough on-premises resources?
  • Is there a capable team to manage these resources?

On the other hand, if you choose a managed LLM (e.g., OpenAI GPT-3.5 Turbo), you need to pay more attention to token usage at each LLM call.

Furthermore, to follow and analyze LLM usage, you can implement some usage metrics for your AI solution. Time spent per request and token usage, for instance, can be stored and then analyzed, enabling you and your team to estimate future costs, trends, and anomalies in your system.

Limitations to Take into Account

Integrating continuous LLM evaluation into CI/CD pipelines helps you spot functional drift in your AI solution. In that sense, it resembles an automated end-to-end test that another LLM runs on your behalf. But note that this solution does not replace other CI/CD tests, such as unit and integration tests.

Prevent Drift in Your Generative AI Applications

As AI-based systems become the norm, the complexity of their development and deployment processes becomes increasingly apparent. Given these intricate processes, it's vital to keep a vigilant eye on LLM-powered generative AI solutions so they retain their expected behavior with each code modification.

Using CI/CD pipelines gives software engineers a powerful way to automate the evaluation of LLMs for unexpected or undesired changes in the early stages of each new version. Time and cost must be responsibly managed in this approach, of course. But overall, those costs are low compared to the price of putting a poor LLM solution in production.

If you need help setting up a cost-efficient evaluation process for your generative AI applications, the DART team at WillowTree is ready to help. Get started by learning about our eight-week GenAI Jumpstart program.  
