The adoption of artificial intelligence continues at a robust pace. According to IBM’s Global AI Adoption Index, 42% of enterprise businesses have actively deployed AI, with an additional 40% actively exploring AI solutions. Overall, we’re adopting AI faster than we adopted smartphones.
Most AI applications today are powered by large language models (LLMs). LLMs are a powerful, exciting technology. But they also present new risks, including this mission-critical threat: How can you ensure your LLM-based solution behaves as expected at every new release or even at every development change?
Without continuous evaluation and monitoring, LLM-based AI systems may drift, exhibit undesirable behaviors, or simply stop working correctly.
To help our clients prevent system degradation, WillowTree's Data and AI Research Team (DART) integrates the continuous evaluation of LLMs into continuous integration and continuous delivery (CI/CD) pipelines, helping to ensure AI safety.
These CI/CD pipelines are integral to modern agile software development. They ensure changes to the codebase are reliable and stable and don't disrupt existing features, which makes them a natural place to automate LLM evaluations.
For this example scenario, we'll use GitHub Actions for the CI/CD pipelines and OpenAI's Python package for the evaluation process. But you can choose any tool, LLM, and language you want.
First, let’s start with the CI pipeline, where early tests are usually run. Configure the pipeline as described in the YAML file from the .github/workflows folder:
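A minimal workflow might look like the following sketch. The file name, trigger events, Python version, and secret name are illustrative assumptions, not the article's exact configuration:

```yaml
# .github/workflows/llm_evaluation.yml (hypothetical file name)
name: LLM Continuous Evaluation

on:
  pull_request:        # evaluate when a PR is opened or updated
  workflow_dispatch:   # allow manual, on-demand runs

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install openai
      - name: Run LLM evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python llm_system_evaluation.py
      - name: Store evaluation results
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-results
          path: evaluation_results.json
```

The `pull_request` and `workflow_dispatch` triggers match the evaluation timing options discussed later; swap them for other events as your strategy dictates.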
Code the continuous evaluation process in the llm_system_evaluation.py Python file, using the logic described for evaluating LLMs for truthfulness. To summarize, the main steps are:
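Those steps could be sketched in Python as follows. The gold-standard pairs, prompt wording, model name, and function names are illustrative assumptions, not the article's exact implementation:

```python
"""A minimal sketch of llm_system_evaluation.py: send gold-standard
questions to the system under test, then have an evaluator LLM score
each answer for truthfulness."""
import json

# Gold standard questions (GS Q) and answers (GS A) -- sample data
GOLD_STANDARD = [
    {"question": "In what year did Apollo 11 land on the Moon?",
     "answer": "1969"},
]

def score_with_llm(client, question, gold_answer, system_answer,
                   model="gpt-4-turbo"):
    """Ask an evaluator LLM to rate truthfulness on a 0.0-1.0 scale."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {gold_answer}\n"
        f"Candidate answer: {system_answer}\n"
        "Reply with only a number from 0.0 to 1.0 rating the "
        "candidate's truthfulness."
    )
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return float(response.choices[0].message.content.strip())

def evaluate(ask_system, score_fn):
    """Send every gold-standard question to the system under test,
    then score each answer with the supplied scoring function."""
    results = []
    for pair in GOLD_STANDARD:
        system_answer = ask_system(pair["question"])
        score = score_fn(pair["question"], pair["answer"], system_answer)
        results.append({"question": pair["question"], "score": score})
    return results

def main():
    # Wired up in CI; requires OPENAI_API_KEY in the environment.
    from openai import OpenAI
    client = OpenAI()
    ask = lambda q: client.chat.completions.create(
        model="gpt-4-turbo", messages=[{"role": "user", "content": q}]
    ).choices[0].message.content
    score = lambda q, gold, ans: score_with_llm(client, q, gold, ans)
    with open("evaluation_results.json", "w") as f:
        json.dump(evaluate(ask, score), f)
```

Passing `ask_system` and `score_fn` as callables keeps the evaluation loop testable without live API calls.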
Here’s how that process looks as a workflow.
When triggered, the CI pipeline should run the evaluation Python script and store the results. The CI pipeline should raise an error if the generated scores fall below the minimum required threshold. This monitoring helps prevent an undesired change from reaching the CD pipeline and going into production.
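A small gate script can implement that failure behavior by returning a nonzero exit code. The threshold value and results file name below are assumptions:

```python
# Hypothetical CI gate: fail the job when the mean evaluation score
# falls below a minimum threshold.
import json
import sys

MIN_SCORE = 0.8  # assumed minimum acceptable mean truthfulness score

def mean_score(results):
    """Average the per-question scores produced by the evaluation step."""
    return sum(r["score"] for r in results) / len(results)

def gate(results, threshold=MIN_SCORE):
    """Return 0 when the mean score meets the threshold, 1 otherwise,
    so CI can use the exit code to pass or fail the job."""
    score = mean_score(results)
    if score < threshold:
        print(f"FAIL: mean score {score:.2f} is below the "
              f"{threshold} threshold")
        return 1
    print(f"PASS: mean score {score:.2f}")
    return 0

# In a CI step, this would run as:
#   sys.exit(gate(json.load(open("evaluation_results.json"))))
```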
And with that, we’ve integrated continuous evaluation of LLM performance into our CI/CD pipelines. But our work isn’t done yet. We still need to make key decisions, like what to measure and how often to measure it, before making meaningful efficiency gains in our LLM evaluation process.
To continuously evaluate your AI application properly, we recommend prioritizing the following topics when creating your evaluation strategy.
When integrating a continuous evaluation process into your CI/CD pipelines, consider what should be measured and when based on constraints such as budget, time, and labor. In our case, we chose to measure LLM truthfulness as our key performance indicator (KPI). Depending on your constraints, a different KPI (e.g., reasoning, common sense, toxicity) may be a better metric for evaluating your AI system.
Tip: Track the token usage and time spent by each candidate metric during evaluation to help decide which measurements are worth including in your CI/CD pipeline.
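One way to collect those numbers is to wrap each candidate metric with a small decorator. The `(score, tokens)` return convention is an assumption; with the OpenAI client you would read the response's `usage.total_tokens` field rather than a hard-coded count:

```python
# Sketch: wrap a metric so every call also reports elapsed time
# and token usage, making metrics easy to compare for cost.
import time

def tracked(metric_fn):
    """Decorate a metric that returns (score, tokens_used) so each
    call also reports wall-clock seconds."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        score, tokens = metric_fn(*args, **kwargs)
        return {"score": score, "tokens": tokens,
                "seconds": time.perf_counter() - start}
    return wrapper

@tracked
def truthfulness_metric(question, answer):
    # Stub standing in for a real LLM-backed metric call.
    return 1.0, 42
```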
When and how often to evaluate are key decisions in your continuous evaluation strategy. Evaluating your application at every commit, for example, can significantly increase lead time, while evaluating on merges works as a behavior double-check after a modification lands. Evaluation should happen often enough to ensure your LLM-based AI system behaves as expected by the time a pull request is opened.
Stored LLM evaluations also let you analyze how your application has evolved over time, giving you an even richer understanding of your AI application's behavior.
Tip: Store the evaluation results in a persistent file storage system or database, and verify they're saved correctly for future use.
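As one possible approach (an assumption, not the article's storage choice), each run could be appended as a timestamped record to a JSON Lines history file:

```python
# Persist evaluation runs so they can be compared over time.
import json
from datetime import datetime, timezone
from pathlib import Path

def store_results(results, path="evaluation_history.jsonl"):
    """Append one evaluation run, timestamped in UTC, to the history."""
    record = {"timestamp": datetime.now(timezone.utc).isoformat(),
              "results": results}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

def load_history(path="evaluation_history.jsonl"):
    """Read back every stored run for later analysis."""
    p = Path(path)
    if not p.exists():
        return []
    return [json.loads(line) for line in p.read_text().splitlines() if line]
```

A database would serve equally well; the key property is that every run survives the ephemeral CI environment.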
Because a retrieval-augmented generation (RAG) system adds new or custom information to an LLM's trained knowledge, you can use it with your gold standard questions (GS Q) and gold standard answers (GS A) to compare LLM evaluations over time and spot any discrepancies. This might be the most crucial takeaway because it enables you to identify undesired behavior and act on it early.
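A simple drift check over the stored history might compare the newest run against a historical baseline; the 0.1 tolerance below is an assumed value, not one from the article:

```python
# Flag a run whose mean score drops noticeably below the historical
# average of all earlier runs.
def detect_regression(mean_scores, tolerance=0.1):
    """Return True when the newest run drops more than `tolerance`
    below the average of all previous runs (`mean_scores` is
    oldest-first)."""
    if len(mean_scores) < 2:
        return False  # not enough history to compare against
    baseline = sum(mean_scores[:-1]) / len(mean_scores[:-1])
    return mean_scores[-1] < baseline - tolerance
```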
Cost is one of the biggest concerns when evaluating and benchmarking LLMs.
Cost estimation will depend on your solution and evaluation architecture. From there, adopt at least one metric for cost tracking. For a managed LLM such as OpenAI's GPT-4 Turbo, it could be token usage. For a self-managed LLM such as Llama 2, it could be execution duration.
Whichever direction you go, some key considerations should be part of your cost estimation.
When estimating the costs of evaluating your LLM-based system, consider items such as expected incoming requests, average time per request, and average input and output token counts per request. If using a RAG system, take your estimated vector database usage into account.
Factor in how often you expect the evaluator to assess your AI assistant. From there, calculate how long each evaluation takes to execute and how many input and output tokens it uses. This is important because it will help shape your evaluation strategy.
For instance, if we need to track every possible minor change, we can run the AI evaluator at each new commit, leading to possible multiple evaluations. But if we only need to evaluate bigger releases, we can assess performance only at merges or pull requests. Once we’re confident about expected behavior and ready to accept all possible risks, we can run the AI evaluation only on demand.
The evaluator’s estimated cost will be multiplied by the number of times it runs (N times) in your development and deployment stages. Depending on your strategy, the evaluator may run multiple times, just a few times, or only on demand at each performance assessment. For this, budget the cost to evaluate your AI system at each new commit, at every open pull request, on merge into the main branch, or through workflow dispatch (i.e., manually).
Here’s what that evaluation strategy looks like illustrated as a formula.
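The original illustration isn't reproduced here, but a plausible form of the formula (the symbols and structure are assumptions) multiplies the per-run evaluation cost by the number of runs:

```latex
\text{Total cost} \approx N_{\text{runs}} \times C_{\text{eval}}, \qquad
C_{\text{eval}} =
\begin{cases}
T_{\text{in}} P_{\text{in}} + T_{\text{out}} P_{\text{out}} & \text{managed LLM (token pricing)} \\
D \times P_{\text{compute}} & \text{self-managed LLM (duration pricing)}
\end{cases}
```

Here \(N_{\text{runs}}\) is the number of evaluator runs per period, \(T_{\text{in}}\)/\(T_{\text{out}}\) are average input/output tokens per run, \(P_{\text{in}}\)/\(P_{\text{out}}\) are per-token prices, \(D\) is average run duration, and \(P_{\text{compute}}\) is the hourly compute rate.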
Note that some deployment differences will change how you estimate costs. With a self-managed LLM, for example, you need to take into account the compute, storage, and network layers, asking guiding questions such as:
On the other hand, if you choose a managed LLM (e.g., OpenAI GPT-3.5 Turbo), you need to pay closer attention to token usage at each LLM call.
Furthermore, to follow and analyze LLM usage, you can implement usage metrics for your AI solution. Time spent per request and token usage, for instance, can be stored and analyzed, enabling you and your team to estimate future costs and spot trends and anomalies in your system.
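A minimal usage metric with anomaly flagging might look like this sketch, using token counts as the tracked quantity; the z-score threshold of 3.0 is an assumed default:

```python
# Collect per-request token counts and flag unusually large requests.
import statistics

class UsageTracker:
    """Record token usage per request and detect outliers."""

    def __init__(self, z_threshold=3.0):
        self.tokens = []
        self.z_threshold = z_threshold

    def record(self, tokens_used):
        """Store one request's token count for later trend analysis."""
        self.tokens.append(tokens_used)

    def is_anomalous(self, tokens_used):
        """True when a request's token count sits far above the
        recorded mean, measured in standard deviations."""
        if len(self.tokens) < 2:
            return False  # too little history to judge
        mean = statistics.mean(self.tokens)
        stdev = statistics.stdev(self.tokens)
        if stdev == 0:
            return tokens_used != mean
        return (tokens_used - mean) / stdev > self.z_threshold
```

The same pattern applies to time spent per request; record durations instead of token counts.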
Integrating continuous LLM evaluation into CI/CD pipelines helps you spot functional drifts in your AI solution; you can think of it as an automated end-to-end test that another LLM runs on your behalf. Note, however, that this approach does not replace other CI/CD tests, such as unit and integration tests.
As AI-based systems become the norm, the complexity of their development and deployment processes continues to grow. Given these intricate processes, it's vital to keep a vigilant eye on LLM-powered generative AI solutions so they retain their expected behavior with each code modification.
Using CI/CD pipelines gives software engineers a powerful way to automate the evaluation of LLMs, catching unexpected or undesired changes early in each new version. Time and cost must be responsibly managed in this approach, of course. But those costs are low compared to the price of putting a poor LLM solution into production.
If you need help setting up a cost-efficient evaluation process for your generative AI applications, the DART team at WillowTree is ready to help. Get started by learning about our eight-week GenAI Jumpstart program.