
Partial Intent Classification: How We Made a Safe Conversational AI Assistant for a Financial Services Firm

One of the most rewarding things about working in generative AI is discovering how emerging technologies fit together into new solutions. Case in point: When designing a conversational AI assistant for one of North America’s top financial services firms, the Data and AI Research Team (DART) here at WillowTree discovered that using intent classification in concert with a large language model (LLM) improves performance and safety while lowering costs.

We think of this interweaving of intent classification and an LLM alongside a retrieval-augmented generation (RAG) system as “partial intent classification.” This practice — and the story behind it — shows how important it is for AI professionals to think creatively and continually experiment toward new solutions.

After all, the development of powerful LLMs like GPT-4, along with RAG systems, allows developers to create chatbots that are far more capable and self-sufficient than those that rely on legacy techniques. However, these new models and approaches still carry risks. They may produce uninformative or false outputs (i.e., AI hallucinations), or they may be susceptible to jailbreaking (i.e., exploiting vulnerabilities to generate content outside of ethical bounds).  

Because concerns about answer reliability and model security remain top of mind for business and technology leaders, we propose partial intent classification as a mitigation strategy for these problems.

Why we turned to intent classification

Intent mappings enable chatbots to offer deterministic, curated responses to user prompts that align with specific user intents, which often cover frequently asked questions (FAQs). Doing so gives us another lever of control over these often unpredictable systems.

But intent classification can also detect malicious or off-topic behavior — prompts designed to jailbreak the system or engage it outside of its sanctioned topics. So, by introducing deterministic use case identification, we create a robust frontline of defense against jailbreak attempts.

Intent classification also benefits chatbot performance by reducing latency and cost. That’s because the intent classifier relies only on the embeddings of the prompt and a lightweight model, not on any LLM completions.

How we applied intent classification to a financial services AI assistant

To support an intent classification system, we first need a set of common intents. We can generate these the same way as a set of webpage FAQs or even pull them directly from an existing collection of FAQs. The idea is to capture a set of common user prompts in your set of intents. Generating several examples of each intent is necessary so your model has sufficient examples on which to train and test. Consider these variations of the same user intent:

  • “How do I dispute a credit card charge?”
  • “There’s suspicious activity on my most recent credit card statement.”
  • “I don’t recognize a credit card purchase. What should I do?”

Each of these intents needs a predetermined output to be returned whenever that intent is recognized in the input. In our case, we began experimenting with intent classification by checking whether the embeddings of different intents were geometrically separable.
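One way to picture an intent and its predetermined output is as a small record that pairs several example prompts with one curated response. The sketch below is only an illustration: the `Intent` class, the `dispute_credit_card_charge` name, and the placeholder response text are hypothetical and not taken from the production system.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Intent:
    """Hypothetical record: one intent, its example prompts, and its curated response."""
    name: str
    examples: Tuple[str, ...]
    response: str


DISPUTE_CHARGE = Intent(
    name="dispute_credit_card_charge",  # assumed label, for illustration only
    examples=(
        "How do I dispute a credit card charge?",
        "There’s suspicious activity on my most recent credit card statement.",
        "I don’t recognize a credit card purchase. What should I do?",
    ),
    response="<company-approved steps for disputing a charge>",  # placeholder text
)


def training_pairs(intents: List[Intent]) -> List[Tuple[str, str]]:
    """Flatten intents into (prompt, label) pairs ready to be embedded and used for training."""
    return [(example, intent.name) for intent in intents for example in intent.examples]


pairs = training_pairs([DISPUTE_CHARGE])
for prompt, label in pairs:
    print(label, "<-", prompt)
```

Keeping the examples and the response in one record makes it harder for a label to exist without an approved answer attached to it.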

Comparison of truncated SVD clustering with five intent categories vs. seven
A comparison of truncated SVD clustering of embeddings with five intent categories (left) and seven intent categories (right). While the left plot shows generally strong geometric separability of the categories, the right plot shows that as more categories are added, the categories are less easily separable in the reduced space. 

After looking into K-means clustering and comparing it with singular value decomposition (SVD) and t-distributed stochastic neighbor embedding (t-SNE) clustering, we found that using a K-nearest neighbors (KNN) model to perform classification on the original 1,536-dimensional embeddings could be effective. While initial trials showed strong classification accuracy on the user intents, accurate intent mapping is not the only goal. The intent classifier must also handle prompts that do not fit the given list of intents. These may be prompts that:

  • should result in RAG
  • attempt to jailbreak the system
  • try to pull the chatbot out of scope

For the classifier to handle prompts that don’t map to a desired intent with a curated response, we had to create new categories in addition to our initial set of intents: out-of-scope, jailbreak, and related-but-non-intent prompts. We don’t want our chatbot to engage with out-of-scope and jailbreak prompts, so from this point on we’ll refer to them as “non-answer” prompts.

As for related but non-intent prompts, these fall under our chatbot’s purview but don’t fit our intent categories. These prompts should be sent through the rest of the chatbot architecture to be answered via RAG. In light of these new categories, we’ll refer to our original set of intents as “in-scope” intents.
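The classification and routing described above can be sketched as a plain KNN majority vote followed by a lookup from label to action. This is a minimal illustration under stated assumptions: the four-dimensional vectors stand in for the real 1,536-dimensional embeddings, and the label names, `knn_predict`, and `route` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(query, train, k=3):
    """Return the majority label among the k training examples most similar to the query."""
    nearest = sorted(train, key=lambda item: cosine(query, item[0]), reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


NON_ANSWER = {"jailbreak", "out_of_scope"}  # prompts the chatbot should not engage with
RELATED = "related"                         # on-topic, but no curated response exists


def route(label):
    """Map a predicted label to one of three actions."""
    if label in NON_ANSWER:
        return "refuse"
    if label == RELATED:
        return "rag"
    return "curated_response"  # any remaining label is treated as an in-scope intent


# Toy embeddings (assumption): real ones would come from an embedding model.
TRAIN = [
    ([1.0, 0.0, 0.0, 0.0], "dispute_charge"),
    ([0.9, 0.1, 0.0, 0.0], "dispute_charge"),
    ([0.0, 1.0, 0.0, 0.0], "related"),
    ([0.1, 0.9, 0.0, 0.0], "related"),
    ([0.0, 0.0, 1.0, 0.0], "jailbreak"),
    ([0.0, 0.1, 0.9, 0.0], "jailbreak"),
    ([0.0, 0.0, 0.0, 1.0], "out_of_scope"),
    ([0.0, 0.1, 0.0, 0.9], "out_of_scope"),
]

for query in ([0.95, 0.05, 0.0, 0.0], [0.05, 0.95, 0.0, 0.0], [0.0, 0.05, 0.95, 0.0]):
    label = knn_predict(query, TRAIN)
    print(label, "->", route(label))
```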

Adjusting for a larger set of intent categories

With this larger group of categories, intent classification accuracy decreased, and considerations beyond accuracy came into play. We now needed to separate the inaccurate classifications into several different categories, each with its own set of consequences. The size of the training set also became important: a larger training set provides more coverage for each intent, for related prompts, and for non-answer prompts.

Of course, total coverage is impossible, and increasing the size of the training set comes with diminishing returns. Beyond the cost and time of generating more examples, increasing the number of stored examples that each user prompt must be compared against also increases the latency and cost of the intent classifier, the two key pain points we set out to improve.

A heuristics-based system enhanced confidence

In testing the KNN with this more diverse set of categories, we saw that in some cases, prompts that fit one in-scope intent were incorrectly classified as another in-scope intent, or related prompts were classified as in-scope intents. There were also some non-answer prompts classified as related prompts. Our approach to these problems was to modify the classifier via two heuristics.

The first heuristic measures the model’s confidence in its prediction of a given intent. When the KNN predicts that a prompt matches one of our in-scope intents, we sum the cosine similarities between the prompt and those of its k nearest neighbors that carry the predicted intent label. This approach attempts to measure the “closeness” of the prompt to the predicted intent in the embedding space. If the prompt is determined to be sufficiently close (i.e., its heuristic value exceeds some predetermined threshold), the intent classifier returns the KNN’s prediction. Otherwise, the prediction is ignored, and the prompt triggers RAG.
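A minimal sketch of this first heuristic follows. It reads the description as: take the k nearest neighbors, find the majority label, and sum the similarities of only those neighbors that share it. The function name `classify_in_scope`, the two-dimensional toy embeddings, the `check_balance` intent, and the threshold of 2.7 are all assumptions chosen for illustration, and the toy training set contains in-scope intents only.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_in_scope(query, train, k=5, threshold=2.7):
    """Return (decision, confidence); the decision falls back to "rag" below the threshold."""
    nearest = sorted(
        ((cosine(query, emb), label) for emb, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    predicted = Counter(label for _, label in nearest).most_common(1)[0][0]
    # Sum similarities only over neighbors that carry the predicted label.
    confidence = sum(sim for sim, label in nearest if label == predicted)
    decision = predicted if confidence > threshold else "rag"
    return decision, confidence


# Toy in-scope training set (assumption); real embeddings are much higher-dimensional.
TRAIN = [
    ([1.0, 0.0], "dispute_charge"),
    ([0.95, 0.1], "dispute_charge"),
    ([0.9, 0.2], "dispute_charge"),
    ([0.0, 1.0], "check_balance"),
    ([0.1, 0.95], "check_balance"),
    ([0.2, 0.9], "check_balance"),
]

print(classify_in_scope([1.0, 0.05], TRAIN))   # close to one intent
print(classify_in_scope([0.75, 0.65], TRAIN))  # between two intents, so it goes to RAG
```

With k=5 the heuristic can never exceed 5, so the threshold effectively asks how many of the neighbors agree and how close they are at the same time.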

Note that sending some low-confidence predictions to RAG decreases the percentage of in-scope intents that are correctly classified. But it also reduces the percentage of prompts incorrectly classified as in-scope intents and the number of in-scope intents incorrectly classified as the wrong in-scope intents.

The second heuristic measures the model’s confidence that a prompt is a non-answer prompt. Similarly, this heuristic equals the sum of the cosine similarities of the k nearest neighbors labeled as non-answer prompts. If this heuristic is above a certain threshold, the prompt is classified as a non-answer, and the chatbot returns a predetermined response, refusing to engage with the user’s prompt.

This heuristic is not a confidence bar that a rejection must clear after the fact. Instead, it rejects any prompt that elicits a certain level of confidence in it being a non-answer prompt.
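The second heuristic can be sketched the same way. In the toy example below, the first query’s five nearest neighbors are mostly related prompts, so a plain majority vote would send it to RAG, yet the two jailbreak neighbors are similar enough to push the non-answer score over the threshold. The function names, the three-dimensional toy vectors, and the threshold of 1.5 are assumptions for illustration only.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


NON_ANSWER = {"jailbreak", "out_of_scope"}


def nearest_neighbors(query, train, k=5):
    """Return the k most similar training examples as (similarity, label) pairs."""
    scored = ((cosine(query, emb), label) for emb, label in train)
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def majority_label(query, train, k=5):
    """Plain KNN vote, shown only for comparison with the heuristic."""
    return Counter(label for _, label in nearest_neighbors(query, train, k)).most_common(1)[0][0]


def non_answer_score(query, train, k=5):
    """Sum the similarities of the nearest neighbors labeled as non-answer prompts."""
    return sum(sim for sim, label in nearest_neighbors(query, train, k) if label in NON_ANSWER)


def should_reject(query, train, k=5, threshold=1.5):
    """Reject whenever the non-answer score exceeds the threshold, whatever the vote says."""
    return non_answer_score(query, train, k) > threshold


# Toy training set (assumption).
TRAIN = [
    ([1.0, 0.0, 0.0], "related"),
    ([0.9, 0.1, 0.0], "related"),
    ([0.8, 0.2, 0.0], "related"),
    ([0.0, 1.0, 0.0], "jailbreak"),
    ([0.1, 0.9, 0.0], "jailbreak"),
    ([0.0, 0.0, 1.0], "dispute_charge"),
    ([0.0, 0.1, 0.9], "dispute_charge"),
]

suspicious = [0.6, 0.8, 0.0]
benign = [1.0, 0.1, 0.0]
print(majority_label(suspicious, TRAIN), should_reject(suspicious, TRAIN))
print(majority_label(benign, TRAIN), should_reject(benign, TRAIN))
```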

A closer look at the heuristics’ impact

The colored labels in the graph below are based solely on the KNN (k=5) prediction, not factoring in the thresholds. This plot demonstrates the purpose of the thresholds. For example, the pink and orange points are all prompts incorrectly mapped to an in-scope intent based on the KNN. But if we incorporate the shown threshold, all of those points would be sent to RAG instead of returning the incorrect deterministic response.

Threshold measurements using k-nearest neighbors for intent classification

The blue points to the left of the threshold are correctly classified intents that would also be sent to RAG, but that's simply the tradeoff we're making with our threshold. We can also see that our intent classifier would reject the purple points just above the rejection threshold, which would prevent our bot from answering these undesirable prompts.

Precision-recall curves helped us optimize tradeoffs

Balancing the tradeoffs discussed above is essential when setting the thresholds for these heuristics or deciding whether to use different heuristics entirely. Thinking in terms of maximizing accuracy is likely not the best approach. There are different consequences associated with sending a true intent prompt to RAG (where it will likely receive an adequate response anyway) and giving a pre-canned response to a prompt that didn’t actually call for that response.

Additionally, there are different consequences to classifying an in-scope prompt as a jailbreak attempt and refusing to answer it versus sending a jailbreak prompt through the rest of the system because the intent classifier missed it. These tradeoffs are best understood through precision-recall curves.

Example of a precision-recall curve in the context of AI intent classification
This plot is an example of a precision-recall curve. Each point corresponds to a specific threshold value for the in-scope intent heuristic. This demonstrates how decreasing your threshold to increase the percentage of true in-scope intents being discovered also decreases the percentage of prompts classified as in-scope intents that are actually correct.

Recall, also called true positive rate (TPR), is the percentage of all true intent (or jailbreak/out-of-scope) prompts that are correctly classified. Precision is the percentage of all of the prompts classified as intents (or jailbreak/out-of-scope) that are correctly classified. These need to be viewed separately for intents and jailbreak/out-of-scope since they rely on different heuristics.

In each case, increasing the threshold will cause fewer prompts to be classified as either intents or jailbreak/out-of-scope. This will result in lower recall because more true intent prompts or true jailbreak/out-of-scope prompts will not be classified as such due to the higher threshold.

However, increasing the threshold may also increase your classifier’s precision. In the case of the intent threshold, a higher degree of confidence will be required to classify a prompt as an intent. Therefore, prompts classified as intents should be more accurate, which is what precision is measuring.

Turning to the case of the jailbreak/out-of-scope threshold, increasing it means more confidence is required before a prompt is automatically classified as a jailbreak/out-of-scope prompt. In theory, this should decrease the number of prompts incorrectly classified as such.
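To make the threshold tradeoff concrete, the sketch below sweeps a threshold over a handful of heuristic scores and reports precision and recall at each step, using the standard definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN). The scores and ground-truth flags are invented toy values, not results from the project.

```python
def precision_recall(scores, is_positive, threshold):
    """Precision and recall when every score above the threshold is predicted positive."""
    predicted = [score > threshold for score in scores]
    tp = sum(p and y for p, y in zip(predicted, is_positive))
    fp = sum(p and not y for p, y in zip(predicted, is_positive))
    fn = sum((not p) and y for p, y in zip(predicted, is_positive))
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention when nothing is predicted positive
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy heuristic values and whether each prompt truly was an in-scope intent (assumption).
SCORES = [4.6, 4.2, 3.9, 3.1, 2.8, 2.5, 1.9, 1.2]
TRUTH = [True, True, True, False, True, False, False, False]

for threshold in (1.0, 2.0, 3.0, 3.5, 4.0):
    precision, recall = precision_recall(SCORES, TRUTH, threshold)
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

In this toy data, raising the threshold from 1.0 to 4.0 moves precision from 0.50 up to 1.00 while recall falls from 1.00 to 0.50. Running the same sweep separately for the in-scope and non-answer heuristics gives one curve per heuristic.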

Finding the right AI solution takes a team

Despite the huge strides in generative AI over the past year alone, chatbots that rely on generative AI still have elements of unpredictability. If there’s a specific set of user intents your conversational AI assistant must handle in a sensitive or precise company-approved way, partial intent classification is a powerful tool to consider.

Of course, there will always be use cases that push AI into new frontiers to find solutions. If that’s your scenario, DART at WillowTree is here to help, from conversational AI and voice to generative AI solutions.

Learn more by checking out our data and AI consulting services. Get started with our eight-week GenAI Jumpstart program and future-proof your company against asymmetric genAI technology innovation with our Fuel iX enterprise AI platform.

Will Rathgeb
Michael Freenor
