Content for craftspeople. By the craftspeople at WillowTree.

RAG: How Retrieval Augmented Generation Systems Work

The AI world took a great leap forward with the development of large language models (LLMs). Trained on massive amounts of data, LLMs discover language patterns and relationships, enabling AI tools to generate more accurate, contextually relevant responses.

But LLMs also confront AI engineers with new challenges, from LLM benchmarking to monitoring. One of the best options engineers can turn to is retrieval augmented generation, or RAG. A RAG system enhances (i.e., augments) LLM performance by retrieving and summarizing contextually relevant information the LLM wasn’t originally trained on.

This is important because training equips an LLM only with world knowledge, some common understanding, and general information from internet documents. Yet a classic application for LLMs is chatbots or more advanced conversational AI assistants that must respond to queries about domain-specific information without hallucinating (inventing answers that are incorrect or contextually irrelevant).

Overview of retrieval augmented generation (RAG) system architecture with user prompt and response example

But this training process means LLMs sometimes aren’t exposed to all the information they need. Knowledge also changes over time. So, how can an LLM answer questions about evolving information it hasn’t yet observed? And how can we mitigate the chance of AI hallucinations in those scenarios?

These are dilemmas RAG helps us solve.

What is Retrieval Augmented Generation?

Retrieval augmented generation (RAG) is a generative AI method that enhances LLM performance by combining world knowledge with custom or private knowledge. These knowledge sets are formally referred to as parametric and nonparametric memories, respectively[1]. Combining these knowledge sets is helpful for several reasons:

  • Providing LLMs with up-to-date information: LLM training data is sometimes incomplete. It may also become outdated over time. RAG allows adding new and/or updated knowledge without retraining the LLM from scratch.
  • Preventing AI hallucinations: The more accurate and relevant in-context information LLMs have, the less likely they’ll invent facts or respond out of context.
  • Maintaining a dynamic knowledge base: Custom documents can be updated, added, removed, or modified anytime, keeping RAG systems up-to-date without retraining.

With this high-level understanding of the purpose of RAG, let’s dig deeper into how RAG systems actually work.

Key Components of a RAG System and Their Function

A RAG system usually consists of two layers:

  • a semantic search layer composed of an embedding model and vector store, and
  • a generation layer (also called the query layer) composed of an LLM and its associated prompt

These layers help the LLM retrieve relevant information and generate the most valuable answers.

Figure 1. Simplified breakdown of a RAG system showing semantic search and generation layers

Figure 1 shows how RAG systems layer new information onto the world knowledge LLMs already have. To see an example of RAG applied to a real-world use case, check out how we built a safe AI assistant for a financial services firm.

Semantic search layer

The semantic search layer comprises two key components: an embedding model and a vector store or database. Together, these components enable the semantic search layer to:

  1. Build a knowledge base by gathering custom or proprietary documents such as PDFs, text files, Word documents, voice transcriptions, and more.
  2. Read and segment these documents into smaller pieces, commonly called "chunks." Our upcoming blog post on advanced RAG will discuss chunking strategies and LLM-powered fact-extraction preprocessing.
  3. Transform the chunks into embedding vectors and store the vectors in a vector database alongside the original chunk text.
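The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production pipeline: `chunk_text`, `toy_embed`, and `build_index` are hypothetical names, fixed-size word chunking is only one possible strategy, and `toy_embed` is a hashed bag-of-words stand-in for a real embedding model.

```python
import hashlib
import math


def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping chunks of at most chunk_size words."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reached the end of the text
    return chunks


def toy_embed(text: str, dims: int = 64) -> list[float]:
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalized."""
    vector = [0.0] * dims
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dims
        vector[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


def build_index(documents: dict[str, str]) -> list[dict]:
    """Chunk every document and keep each vector alongside its chunk text."""
    index = []
    for doc_id, text in documents.items():
        for position, chunk in enumerate(chunk_text(text)):
            index.append({
                "doc_id": doc_id,
                "position": position,
                "text": chunk,
                "vector": toy_embed(chunk),
            })
    return index


# A 96-word example document (hypothetical content).
docs = {"handbook": "Refunds are accepted within thirty days of purchase. " * 12}
index = build_index(docs)
```

In a real system, `toy_embed` would be replaced by a call to an embedding model, and the list returned by `build_index` would be written to a vector database instead of kept in memory.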

It’s worth examining how the embedding model and vector store make semantic search possible. Search is enriched by understanding a query's intent and contextual meaning (i.e., semantics) rather than just looking for literal keyword matches.

Embedding model

Embedding models are in charge of encoding text. They project text into a numerical representation that captures the original text’s semantic meaning[2], as depicted in Figure 2. For instance, the sentence “Hi, how are you?” could be represented as a numerical (embedding) vector [0.12, 0.2, 2.85, 1.33, 0.01, ..., -0.42] with N dimensions.

Figure 2. Breakdown of how an embedding model enables semantic search in RAG

This illustrates a key takeaway about embeddings: Embedding vectors that represent texts with similar meanings tend to cluster together within the N-dimensional embedding space.
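One common way to measure how close two embedding vectors are is cosine similarity. The sketch below uses three hand-made 3-dimensional vectors (illustrative values we made up, not real model output) to show that vectors pointing in similar directions score close to 1, while unrelated ones score close to 0.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings; a real model would produce hundreds of dimensions.
greeting_1 = [0.9, 0.1, 0.0]  # "Hi, how are you?"
greeting_2 = [0.8, 0.2, 0.1]  # "Hello, how's it going?"
invoice = [0.0, 0.2, 0.9]     # "Your invoice is overdue."

similar = cosine_similarity(greeting_1, greeting_2)    # close to 1
different = cosine_similarity(greeting_1, invoice)     # close to 0
```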

Some examples of embedding models are OpenAI's text-embedding-ada-002, Jina AI's jina-embeddings-v2, and SentenceTransformers’ multi-QA models.

Vector store

Vector stores are specialized databases for handling high-dimensional data representations. They have specific indexing structures optimized for the efficient retrieval of vectors.

Some examples of open-source vector stores are Facebook’s FAISS, Chroma DB, and even PostgreSQL with the pgvector extension. Vector stores can be in-memory, on disk, or even fully managed, like Pinecone and Weaviate.
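To make the idea concrete, here is a toy in-memory vector store. It is only a sketch under simplifying assumptions: the class name `InMemoryVectorStore` is hypothetical, and it scans every stored vector on each search, whereas the vector stores named above rely on specialized indexes to avoid that exhaustive scan.

```python
import heapq
import math


class InMemoryVectorStore:
    """Toy vector store: keeps (vector, text) pairs and scans them all on search."""

    def __init__(self) -> None:
        self._records: list[tuple[list[float], str]] = []

    def add(self, vector: list[float], text: str) -> None:
        self._records.append((vector, text))

    def search(self, query_vector: list[float], k: int = 3) -> list[tuple[float, str]]:
        """Return the k stored texts with the highest cosine similarity."""
        scored = [
            (self._cosine(query_vector, vector), text)
            for vector, text in self._records
        ]
        return heapq.nlargest(k, scored, key=lambda pair: pair[0])

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0


# Hypothetical 2-dimensional vectors standing in for real embeddings.
store = InMemoryVectorStore()
store.add([1.0, 0.0], "Refund policy: 30 days.")
store.add([0.0, 1.0], "Office hours: 9 to 5.")
store.add([0.9, 0.1], "Returns need a receipt.")
results = store.search([1.0, 0.05], k=2)
```

The search returns the two refund-related texts ahead of the unrelated office-hours text, because their vectors point in nearly the same direction as the query vector.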

Generation layer

The generation layer consists of an LLM and its associated prompt. The generation layer takes a user query (text) as input and does the following:

  1. Executes a semantic search to find the most relevant information for the query.
  2. Inserts the most relevant chunks of text into an LLM prompt along with the user's query and invokes the LLM to generate a response for the user.
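The two steps above can be sketched as a single function. Everything here is a simplified assumption: `answer_query` is a hypothetical name, and `search` and `llm` are plain callables standing in for a real semantic search layer and a real model client, so the example runs without any external service.

```python
from typing import Callable


def answer_query(
    query: str,
    search: Callable[[str, int], list[str]],
    llm: Callable[[str], str],
    top_k: int = 3,
) -> str:
    """Retrieve the most relevant chunks, build a prompt, and invoke the LLM."""
    chunks = search(query, top_k)          # step 1: semantic search
    context = "\n\n".join(chunks)
    prompt = (                             # step 2: insert chunks and query
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return llm(prompt)


def fake_search(query: str, top_k: int) -> list[str]:
    """Stub retriever that ignores the query and returns canned chunks."""
    knowledge = [
        "Refunds are accepted within 30 days.",
        "Support is open on weekdays.",
    ]
    return knowledge[:top_k]


def fake_llm(prompt: str) -> str:
    """Stub model that echoes its prompt so the flow can be inspected."""
    return "ECHO: " + prompt


response = answer_query(
    "How long do I have to request a refund?", fake_search, fake_llm, top_k=1
)
```

Passing the search and model in as callables keeps the orchestration independent of any particular vector store or LLM provider.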

Here’s a deeper look at how the LLM and prompt interact with a RAG system.

LLM

Large language models are built on the Transformer architecture, which uses an attention mechanism to help the model decide which parts of a sentence or text to pay more or less attention to. LLMs are trained on massive amounts of data drawn mainly from public sources available on the internet.

In RAG systems, LLMs can generate better answers because they draw on the context retrieved through semantic search. The LLM can then tailor its answers to better align with each query’s intent and meaning.

Some examples of managed LLM services are OpenAI’s ChatGPT, Google’s Bard, and Perplexity AI’s Perplexity. Some LLMs are available for self-managed scenarios, such as Meta’s Llama 2, TII’s Falcon, Mistral AI’s Mistral 7B, and Databricks’ Dolly.

Prompt

A prompt is a text input given to an LLM that effectively programs it by tailoring, augmenting, or sharpening its functionalities[3]. With RAG systems, the prompt contains the user’s query alongside relevant contextual information retrieved from the semantic search layer that the model can use to answer the query.

“You are a helpful assistant, here is a user’s query: ``` ${query}```.
Here is some relevant data to answer the user’s question: ``` ${relevantData}```.
Please answer the user’s query concisely.”

The template above shows an example of a naive prompt that answers a query by considering the context retrieved from the semantic search layer.
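As a rough sketch, the naive prompt can be filled in with Python’s standard `string.Template`, which happens to use the same `${name}` placeholder syntax. The function name `build_prompt` and the choice to join chunks with newlines are our assumptions for illustration.

```python
from string import Template

DELIM = "`" * 3  # the triple-backtick delimiter used in the naive prompt

PROMPT_TEMPLATE = Template(
    "You are a helpful assistant, here is a user's query: ${delim} ${query}${delim}.\n"
    "Here is some relevant data to answer the user's question: "
    "${delim} ${relevantData}${delim}.\n"
    "Please answer the user's query concisely."
)


def build_prompt(query: str, chunks: list[str]) -> str:
    """Insert the user's query and the retrieved chunks into the template."""
    relevant_data = "\n".join(chunks)
    return PROMPT_TEMPLATE.substitute(
        delim=DELIM, query=query, relevantData=relevant_data
    )


prompt = build_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days.", "A receipt is required."],
)
```

Wrapping the query and the retrieved data in delimiters makes it easier for the model to tell instructions apart from inserted content.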

Essential Things to Know When Considering RAG

In addition to the practical and theoretical applications of RAG, AI practitioners should also be aware of the ongoing monitoring and optimization commitments that come with it.

RAG evaluation

RAG systems should be evaluated as they change to ensure that behavior and quality are improving and not degrading over time. RAG systems should also be red-teamed to evaluate their behavior when faced with jailbreak prompts or other malicious or poor-quality inputs. Learn more in our blog about evaluating RAG systems.
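One simple check that can be rerun whenever the system changes is a retrieval hit rate: for a small set of test queries, how often does the expected chunk appear in the top k search results? The sketch below is a minimal example of that idea. The function name `hit_rate_at_k`, the chunk IDs, and the canned retriever are all hypothetical, and a full evaluation would also score the generated answers.

```python
from typing import Callable


def hit_rate_at_k(
    test_cases: list[tuple[str, str]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 3,
) -> float:
    """Fraction of queries whose expected chunk ID is among the top k results."""
    hits = 0
    for query, expected_id in test_cases:
        if expected_id in retrieve(query, k):
            hits += 1
    return hits / len(test_cases)


# Canned search results standing in for a real semantic search layer.
canned_results = {
    "refund window": ["policy-2", "policy-7", "faq-1"],
    "office hours": ["faq-3", "faq-9", "policy-2"],
    "shipping cost": ["faq-1", "faq-4", "faq-8"],
}


def fake_retrieve(query: str, k: int) -> list[str]:
    return canned_results[query][:k]


cases = [
    ("refund window", "policy-2"),
    ("office hours", "faq-9"),
    ("shipping cost", "policy-5"),
]
score_at_3 = hit_rate_at_k(cases, fake_retrieve, k=3)  # 2 of 3 queries hit
score_at_1 = hit_rate_at_k(cases, fake_retrieve, k=1)  # 1 of 3 queries hit
```

Tracking a number like this across releases makes it visible when a change to chunking, embeddings, or ranking degrades retrieval.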

Quality and quantity of RAG knowledge

A RAG system is only as good as the content available in the knowledge database. Furthermore, even if the knowledge database has the correct information, if the semantic search does not retrieve it or rank it highly enough in the search results, the LLM will not see the information and will likely respond unsatisfactorily.

Moreover, if the retrieved content has low information density, or is entirely irrelevant, the LLM’s response will also be unsatisfactory. In this case, it is tempting to use a model with a larger context window so that more semantic search results can be provided to the LLM. But this comes with tradeoffs: increased cost, and the risk of diluting the relevant information with irrelevant information, which can “confuse” the model.
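Instead of simply passing more results to a bigger context window, one option is to filter what goes into the prompt. The sketch below keeps only chunks above a similarity threshold and stops adding them once a token budget is used up. The function name `select_context`, the threshold, the budget, and the word-count token estimate are all illustrative assumptions, not recommended values.

```python
def select_context(
    scored_chunks: list[tuple[float, str]],
    min_score: float = 0.75,
    max_tokens: int = 50,
) -> list[str]:
    """Keep the highest-scoring chunks that clear min_score and fit the budget."""
    selected = []
    used_tokens = 0
    for score, text in sorted(scored_chunks, key=lambda pair: pair[0], reverse=True):
        if score < min_score:
            break  # every remaining chunk scores even lower
        cost = len(text.split())  # crude token estimate: one token per word
        if used_tokens + cost > max_tokens:
            continue  # skip chunks that would exceed the budget
        selected.append(text)
        used_tokens += cost
    return selected


# Hypothetical (similarity score, chunk text) pairs from a semantic search.
candidates = [
    (0.91, "Refunds are accepted within 30 days of purchase."),
    (0.42, "Our office dog is named Biscuit."),
    (0.83, "A receipt is required for all returns."),
]
context = select_context(candidates)
tight_context = select_context(candidates, max_tokens=8)
```

With the default budget, both refund-related chunks are kept and the irrelevant one is dropped. With a budget of 8 tokens, only the top-scoring chunk fits.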

RAG cost

Since embedding models usually have an insignificant cost nowadays, RAG’s main costs arise from vector database hosting and LLM inference. The biggest driver of LLM inference cost in RAG systems is likely the number of semantic search results inserted into the prompt. A larger LLM prompt with more semantic search results could potentially yield a higher-quality response, but it will also use more tokens and may increase response latency.

However, a larger prompt with more information does not necessarily guarantee a better response. The optimal number of results to insert into the prompt will be different for every system and is impacted by factors such as chunk size, chunk information density, the extent of information duplication in the database, the scope of user queries, and much more. An evaluation-driven development approach is likely the most reliable way to find the right number for your system.
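The effect of inserting more search results on inference cost can be estimated with simple arithmetic. In the sketch below, the function name `estimate_query_cost`, the token counts, and the per-1,000-token prices are hypothetical placeholders, so substitute your own model’s pricing and measured chunk sizes.

```python
def estimate_query_cost(
    num_chunks: int,
    tokens_per_chunk: int,
    base_prompt_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Approximate LLM inference cost of one RAG query, in currency units."""
    input_tokens = base_prompt_tokens + num_chunks * tokens_per_chunk
    input_cost = input_tokens / 1000 * input_price_per_1k
    output_cost = output_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost


# Hypothetical numbers: 200-token chunks, a 100-token instruction and query,
# a 150-token answer, and made-up prices per 1,000 tokens.
shared = dict(
    tokens_per_chunk=200,
    base_prompt_tokens=100,
    output_tokens=150,
    input_price_per_1k=0.001,
    output_price_per_1k=0.002,
)
cost_with_3_chunks = estimate_query_cost(num_chunks=3, **shared)
cost_with_10_chunks = estimate_query_cost(num_chunks=10, **shared)
```

Under these assumed numbers, going from 3 to 10 chunks raises the input from 700 to 2,100 tokens and makes each query roughly 2.4 times as expensive, which is the kind of tradeoff an evaluation should weigh against any gain in answer quality.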

Is RAG Right for Your Generative AI Applications?

Retrieval augmented generation systems mark a significant advancement in AI, enhancing LLM performance by reducing hallucinations and ensuring knowledge base information is current, accurate, and relevant. Balancing information retrieval against cost and latency while maintaining a high-quality knowledge database is essential for effective use. Future advancements, including techniques like hypothetical document embeddings (HyDE), promise to further improve RAG systems.

Despite its costs, RAG undeniably improves user interaction, creating stickier, more delightful generative AI experiences for customers and employees alike.

For more help on crafting exceptional digital experiences, find the latest research, thought leadership, and best practices on WillowTree’s data and AI knowledge hub.

For help getting started with your genAI projects, learn about our eight-week GenAI Jumpstart program and future-proof your company against asymmetric genAI tech innovation with our Fuel iX enterprise AI platform.


[1] Lewis, Patrick, et al. "Retrieval-augmented generation for knowledge-intensive NLP tasks." Advances in Neural Information Processing Systems 33 (2020): 9459-9474.

[2] Neelakantan, Arvind, et al. "Text and code embeddings by contrastive pre-training." arXiv preprint arXiv:2201.10005 (2022).

[3] White, Jules, et al. "A prompt pattern catalog to enhance prompt engineering with ChatGPT." arXiv preprint arXiv:2302.11382 (2023).

[4] OpenAI. "Introducing text and code embeddings." Web document.

Zakey Faieq
Iago Brandão
