
OpenAI or Azure OpenAI: Can models be more deterministic depending on API?

Michael Freenor
Director of Applied AI
Milton Leal
Platform Data Scientist
October 11, 2023

Comparing OpenAI Models: OpenAI vs. Microsoft Azure OpenAI APIs

Suppose you repeatedly made the same API call to an OpenAI Large Language Model (LLM) that generates text; would you expect to get the same output every time? Given that GPT models are non-deterministic by nature (even with temperature=0), the answer should be no. However, what if you made the same API call to an OpenAI Embedding Model? As the embeddings are supposed to be a frozen layer of a particular LLM, the answer should be yes.

Over at WillowTree’s Data and AI Research Team (DART), alongside other experiments in artificial intelligence and machine learning, we are focused on discovering and developing methods that provide guaranteed reliability in LLM performance. When we discovered that calling text-embedding-ada-002 through OpenAI’s API produced a variety of results given the same input string, we immediately began to measure that error. At this stage of the LLM game, any source of indeterminacy must be measured so its potential effect on downstream applications might be discovered and understood.

Another post will cover what we learned about the distribution of noise in OpenAI’s text-embedding-ada-002 output. However, our focus on the issue led to a happy coincidence: we were in the middle of running these studies when we switched our API key from OpenAI’s standard API to Microsoft’s Azure OpenAI endpoint. In doing so, the noise in text-embedding-ada-002 output completely disappeared! This finding led us to a direct comparison of the behavior of the gpt-4-0613 model between OpenAI’s and Microsoft Azure OpenAI’s APIs.

In this post, we will share some of these findings. The story that emerges is quite clear: for some natural language and generative AI tasks, like creative writing and open-ended factual questions, Azure’s API has significantly less variation in output when compared to the same model, parameters, and input on OpenAI’s API. On the other hand, if you are using LLMs for translation or simple factual question answering, the two APIs currently provide similarly consistent performance given the same model and parameters.

Comparison of text-embedding-ada-002

The comparison between OpenAI and Azure on text-embedding-ada-002 is easy to characterize: Azure’s outputs are identical given the same input, whereas OpenAI’s outputs are noisy. In other words, don’t expect to get exactly the same embedding vector back from OpenAI’s ada-002 implementation. As can be seen in Figure 1, OpenAI produces roughly 10 unique embeddings per 100 trials of the same input sentence, whereas Azure produces exactly 1 in each case.

Figure 1. Number of distinct outputs per 100 identical API calls (sentences in appendix).
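For readers who want to run a similar check, here is a minimal sketch of how distinct outputs per 100 identical calls could be counted. It is not the code behind Figure 1. The `embed` callable is a stand-in for whatever client wrapper you use to request an embedding, and `make_noisy_embedder` is a hypothetical fake that exists only so the example runs without network access.

```python
import itertools
from typing import Callable, Sequence


def count_distinct_embeddings(
    embed: Callable[[str], Sequence[float]],
    text: str,
    trials: int = 100,
) -> int:
    """Call `embed` on the same text `trials` times and count distinct vectors.

    Two vectors count as the same only if every component is exactly equal.
    """
    seen = set()
    for _ in range(trials):
        vector = embed(text)
        seen.add(tuple(vector))  # tuples are hashable; comparison is exact
    return len(seen)


def make_noisy_embedder(base: Sequence[float], offsets: Sequence[float]):
    """Hypothetical fake embedder that cycles through small additive offsets."""
    cycle = itertools.cycle(offsets)

    def embed(_text: str) -> list:
        delta = next(cycle)
        return [x + delta for x in base]

    return embed


# A fake that always returns the same vector (assumption: stands in for a
# fully repeatable endpoint).
def stable_embedder(_text: str) -> list:
    return [0.1, 0.2, 0.3]


noisy_embedder = make_noisy_embedder([0.1, 0.2, 0.3], [i * 1e-6 for i in range(10)])

print(count_distinct_embeddings(stable_embedder, "same sentence"))  # 1
print(count_distinct_embeddings(noisy_embedder, "same sentence"))   # 10
```

Exact equality is the strictest possible comparison. If you only care about differences above some tolerance, you would round or quantize the vectors before adding them to the set.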

In another post, we’ll dive more into the actual shape of the noise in OpenAI’s output. How this impacts a retrieval augmented generation (RAG) system or another system reliant on text embeddings depends critically on the use case and data in question. DART continues to look into the general question and the tradeoffs involved as we strive for greater reliability and consistency in LLM output. Still, if you need absolute consistency in embedding outputs, such as for a reverse-lookup from embeddings to text, Azure OpenAI’s API can support those needs.
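To illustrate why a reverse lookup needs exact consistency, here is a small sketch of an exact-match table keyed on the raw bytes of an embedding. The `ReverseLookup` class and `embedding_key` helper are hypothetical names for this example, not part of any library. Any difference in a single component, however small, produces a different key and therefore a miss.

```python
import struct
from typing import Dict, Optional, Sequence


def embedding_key(vector: Sequence[float]) -> bytes:
    """Pack a vector into bytes so it can serve as an exact-match dictionary key."""
    return struct.pack(f"{len(vector)}d", *vector)


class ReverseLookup:
    """Maps an embedding back to the text that produced it (exact match only)."""

    def __init__(self) -> None:
        self._table: Dict[bytes, str] = {}

    def add(self, text: str, vector: Sequence[float]) -> None:
        self._table[embedding_key(vector)] = text

    def find(self, vector: Sequence[float]) -> Optional[str]:
        return self._table.get(embedding_key(vector))


lookup = ReverseLookup()
lookup.add("Dreams are the bridge...", [0.1, 0.2, 0.3])

print(lookup.find([0.1, 0.2, 0.3]))        # identical vector: text is found
print(lookup.find([0.1, 0.2, 0.3000001]))  # slightly noisy vector: None
```

With a noisy endpoint, a design like this would have to fall back to nearest-neighbor search with a similarity threshold, which is a different tradeoff.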

Comparison of gpt-4-0613

Discovering a large-scale difference between what should have been the same model with the same parameters led to an investigation of GPT-4’s completions. It’s been known for a while now that temperature=0, which should produce mostly deterministic output, produces a variety of potential completions (we have observed this with gpt-3, gpt-3.5-turbo, and gpt-4-0613).

If Azure’s ada-002 is actually deterministic (or at least much more so than OpenAI’s), we figured that maybe Azure OpenAI’s gpt-4-0613 might produce deterministic output with temperature=0. What we found was a mixed picture: Azure’s gpt-4-0613 is not deterministic with temperature=0, but it is decidedly more deterministic than OpenAI’s.

Figure 2. Proportion of distinct completions, gpt-4-0613 with temperature=0. A higher score means more variability in response. A lower score indicates more deterministic output.
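As a rough sketch of the metric in Figure 2, the proportion of distinct completions can be computed as the number of unique response strings divided by the number of calls. This definition is our reading of the figure caption and is stated here as an assumption; the function name `distinct_proportion` is illustrative.

```python
from collections import Counter
from typing import Iterable


def distinct_proportion(completions: Iterable[str]) -> float:
    """Return (number of unique completions) / (number of completions).

    1.0 means every call returned different text; a value near 0 means
    nearly every call returned the same text.
    """
    counts = Counter(completions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one completion")
    return len(counts) / total


# Four calls with the same prompt, two distinct responses.
print(distinct_proportion(["Paris", "Paris", "Paris.", "Paris"]))  # 0.5
```

Note that this compares strings exactly, so a trailing period or a changed word counts as a distinct completion.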

What started as a somewhat jumbled set of prompts became more structured as we noticed trends. For example, prompts from index 0 up through index 9 might be classed as “generation” tasks: writing poems, interpreting history, describing a real place, etc. Prompts 10 through 13 are about relatively straightforward facts. The remainder (14 and to the right) are all translation tasks.

A picture emerges where Azure OpenAI’s temperature=0 gpt-4-0613 replies with more deterministic responses. The exceptions are simple factual answers and translation, which are largely consistent across both APIs: both implementations return similar numbers of distinct completions per input and, by and large, produce the same variants. Looking at the overlap in completions between the two APIs, the following picture arose:

Figure 3. Variation in response versus the overlap between API outputs. Note that in translation and simple factual answer tasks, the low number of unique completions is associated with a high overlap between models. However, with higher numbers of distinct completions, the overlap between OpenAI and Azure outputs becomes minimal.
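The post does not spell out how overlap was computed, so the sketch below uses one plausible definition as an assumption: the Jaccard overlap between the sets of distinct completions returned by each API for the same prompt. The function name `completion_overlap` is illustrative.

```python
from typing import Iterable


def completion_overlap(api_a: Iterable[str], api_b: Iterable[str]) -> float:
    """Jaccard overlap between the distinct completions from two APIs.

    1.0 means both APIs produced exactly the same set of variants;
    0.0 means they never produced the same text.
    """
    set_a, set_b = set(api_a), set(api_b)
    union = set_a | set_b
    if not union:
        raise ValueError("need at least one completion from either API")
    return len(set_a & set_b) / len(union)


# Translation-like case: both APIs return the same two variants.
print(completion_overlap(["Bonjour", "Salut", "Bonjour"], ["Salut", "Bonjour"]))  # 1.0

# Generation-like case: no shared text between the two APIs.
print(completion_overlap(["story one", "story two"], ["story three"]))  # 0.0
```

Other definitions are possible, for example weighting each variant by how often it appeared, and they could give a different picture for prompts with many rare variants.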

So, not only do the completion APIs have different statistical properties (a different expected proportion of distinct replies), but for some tasks their outputs barely agree with each other. We conclude that if your use case revolves around translation or simple factual answers, OpenAI’s and Azure’s APIs perform equivalently. For almost any other task, however, if consistency is required, Azure has an overall better performance profile.

What does the future hold for Azure vs. OpenAI APIs?

To state the obvious, we need more information from OpenAI on how these models are being deployed. Still, it seemed essential to report our findings to date, as a lack of consistency between temperature=0 responses makes development and testing more complicated than need be.

There might be something in the software implementation, hardware implementation, or both that differs between OpenAI and Azure and affects numeric error in some way. Another possibility is that both have the same implementation, but their ability to perform at scale differs (perhaps due to Azure’s greater capacity). For GPT prompt engineers, developers, and testers grappling with the indeterminacy of embeddings and completions, the message is clear: stick to Azure where possible.

Stay tuned for more of DART's latest explorations in generative AI, machine learning, and responsible AI.


Sentences used to compare text-embedding-ada-002:

1: The whisper of leaves on a summer evening carries stories from ancient times.

2: In the vast expanse of the cosmos, our planet is but a fleeting speck of light.

3: Every time we laugh, a moment of joy is etched into the tapestry of the universe.

4: Lost within the pages of a book, one can travel worlds without ever taking a step.

5: Dreams are the bridge between our deepest desires and the reality we construct.

Prompts used to compare gpt-4-0613:

0: Write a short story about a robot who discovers it has emotions.

1: Write a poem about the beauty of the night sky.

2: Create a recipe for a vegan chocolate cake.

3: Write me the kind of note a parent would leave in their kid's lunchbox.

4: What caused Napoleon's exile?

5: What is Godzilla?

6: Explain photosynthesis to me as if I were five years old.

7: Write a brief description of New York City.

8: Describe the main differences between classical and modern art.

9: Explain the concept of gravity in a way that a middle school student would understand.

10: What kind of fees does <client> have for standard checking accounts?

11: What was the federal funds rate in early 2017?

12: When did the Soviet Union break up?

13: When did the Beatles come to America?

14: Translate the following English sentence into French: "Can you use dogs to hunt at this national park?"

15: Translate the following English sentence into French: "What movies with 'dog' in the title have this dog breed appeared in?"

16: Translate the following English sentence into French: "What fraternity was this president part of?"

17: Translate the following English sentence into French: "Did this president have both a son and a daughter?"

18: Translate the following English sentence into French: "Where are the places to go spelunking in this national park, are permits required, and can you rent equipment?"

19: Translate the following English sentence into French: "Is this dog breed commonly taller than two feet?"

20: Translate the following English sentence into French: "Can i bring my cat with me to this national park?"

21: Translate the following English sentence into French: "What awards has this president won for their humanitarianism?"

22: Translate the following English sentence into French: "What four-legged animals can be seen in this national park and are we allowed to feed them?"

23: Translate the following English sentence into French: "Was this president older than 55 when he was elected?"
