LLM Benchmarks Whitepaper

Can Data Subsampling Make Evaluating LLMs Faster and Cheaper?

Large language models, or LLMs, help businesses deliver better customer service, build better products, and run tighter operations. However, LLMs are not created equal, and creating LLM benchmarks to find the right model for the right application is costly.

But what if technology leaders could confidently benchmark LLMs using only a fraction of their current data?

Download the Whitepaper

Why We Did This Study

Something caught WillowTree's Data and AI Research Team (DART) by surprise recently. Upon conducting a dependency analysis on the Hugging Face Open LLM Leaderboard scores — one of the top resources for assessing open LLMs — we saw remarkably high accuracy correlations among the leaderboard’s four reported tests.
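
To give a rough sense of the kind of analysis involved — not the leaderboard's actual data — a pairwise correlation check across per-model benchmark scores takes only a few lines of pandas. The model names and score values below are placeholders.

```python
import pandas as pd

# Hypothetical per-model accuracies on the four leaderboard tests
# (placeholder numbers, not actual Open LLM Leaderboard scores).
scores = pd.DataFrame(
    {
        "ARC": [0.52, 0.61, 0.47, 0.66, 0.58],
        "HellaSwag": [0.77, 0.83, 0.72, 0.85, 0.80],
        "MMLU": [0.41, 0.55, 0.38, 0.60, 0.49],
        "TruthfulQA": [0.38, 0.45, 0.35, 0.48, 0.42],
    },
    index=["model_a", "model_b", "model_c", "model_d", "model_e"],
)

# Pairwise Pearson correlations between tests, computed across models.
print(scores.corr(method="pearson"))
```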

That made us wonder: Would it be possible to create accurate LLM benchmarks using only a fraction of the tests’ datasets? If so, technology leaders could evaluate and deploy LLMs much more efficiently.

We discovered that, yes, evaluating on subsampled datasets is an efficient proxy for full-dataset benchmarking on specific tasks. This means technology leaders can create LLM benchmarks faster and at lower cost, improving their odds of finding the right language model for the right application.
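
The intuition can be sketched in a few lines of Python. The snippet below uses synthetic per-item correctness flags in place of a real model run (the whitepaper's actual methodology evaluates real models on ARC, HellaSwag, MMLU, and TruthfulQA): score a random 10% subsample and compare its accuracy, plus a binomial error bar, against the full-set accuracy.

```python
import random

# Synthetic stand-in for a full evaluation: each entry is 1 if the model
# answered that benchmark item correctly, 0 otherwise.
random.seed(0)
full_results = [1 if random.random() < 0.63 else 0 for _ in range(10_000)]
full_accuracy = sum(full_results) / len(full_results)

# Score only a 10% random subsample of the items.
subsample = random.sample(full_results, k=len(full_results) // 10)
sub_accuracy = sum(subsample) / len(subsample)

# Rough binomial standard error for the subsample estimate.
n = len(subsample)
std_err = (sub_accuracy * (1 - sub_accuracy) / n) ** 0.5

print(f"full-set accuracy:  {full_accuracy:.3f}")
print(f"subsample accuracy: {sub_accuracy:.3f} ± {1.96 * std_err:.3f} (95% CI)")
```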


In this study, you’ll learn:

  • How various tests affect the time and cost required to benchmark an LLM
  • How WillowTree used data subsampling to evaluate eight open-source LLMs against four tasks (ARC, HellaSwag, MMLU, TruthfulQA)
  • How to replicate these evaluations to use for your own LLM benchmarking
