Monday, 18 December 2023

Using Microsoft Tokenizer to count Azure OpenAI model tokens

If you have been working with OpenAI APIs, you will have come across the term "tokens". Tokens are the units of text that these APIs process and generate. Each OpenAI model has its own maximum context length, measured in tokens, which limits how much text it can process in a single request. More about tokens here: https://learn.microsoft.com/en-us/azure/ai-services/openai/overview#tokens

When building an app based on these APIs, we need to keep track of the tokens being sent and make sure not to send more than the maximum context length of the OpenAI model being used (e.g. gpt-3.5-turbo). If more tokens are sent than the maximum context length of the model, the request will fail with the following error:

// For gpt-3.5-turbo
{
    "error": {
        "message": "This model's maximum context length is 4096 tokens. However, your messages resulted in 6598 tokens. Please reduce the length of the messages.",
        "type": "invalid_request_error",
        "param": "messages",
        "code": "context_length_exceeded",
        "status": "400" //(model error)
    }
}
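
The arithmetic behind staying under that limit can be sketched as follows. This is a minimal illustration and not part of either SDK: `PromptBudget` and `MaxPromptTokens` are hypothetical names, and the safety margin is an assumption added here to absorb small differences between a local estimate and the service's own count. The idea is that the messages and the tokens reserved for the response have to fit in the same context window.

```csharp
using System;

public static class PromptBudget
{
    // Hypothetical helper: returns how many tokens the messages may use so that
    // the prompt plus the reserved response still fits in the context window.
    public static int MaxPromptTokens(int contextLength, int maxResponseTokens, int safetyMargin)
    {
        if (contextLength <= 0)
            throw new ArgumentOutOfRangeException(nameof(contextLength));
        if (maxResponseTokens < 0)
            throw new ArgumentOutOfRangeException(nameof(maxResponseTokens));
        if (safetyMargin < 0)
            throw new ArgumentOutOfRangeException(nameof(safetyMargin));

        int budget = contextLength - maxResponseTokens - safetyMargin;
        if (budget <= 0)
            throw new ArgumentException("The response reservation and margin leave no room for the prompt.");

        return budget;
    }
}
```

For a 4096-token model with 500 tokens reserved for the response and an assumed margin of 96 tokens, `PromptBudget.MaxPromptTokens(4096, 500, 96)` returns 3500.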

To help with counting tokens before sending them to the APIs, there are various libraries available. One of them is the Microsoft Tokenizer (https://github.com/microsoft/Tokenizer), an open source .NET and TypeScript implementation of OpenAI's tiktoken library.

So in this post, let's see how we can use the Microsoft Tokenizer .NET SDK to manage the tokens sent to OpenAI APIs.

First, we will need the Microsoft Tokenizer NuGet package:

https://www.nuget.org/packages/Microsoft.DeepDev.TokenizerLib/

Since we will actually be counting the tokens of a chat between the user and an AI assistant, we will also use the Azure OpenAI .NET SDK:

https://www.nuget.org/packages/Azure.AI.OpenAI/1.0.0-beta.8

Next, in our code, we first have to initialize the tokenizer and tell it which OpenAI model we will be working with. Most of the recent models, such as gpt-3.5-turbo and gpt-4, share the same token encoding, cl100k_base, so we can use the same tokenizer across these models.

Now let's look at the actual code:

using System;
using System.Threading.Tasks;
using Azure;
using Azure.AI.OpenAI;
using Microsoft.DeepDev;
using System.Text.Json;

namespace Microsoft.Tokenizer.Demo
{
    internal class Program
    {
        static async Task Main(string[] args)
        {
            Uri azureOpenAIResourceUri = new("https://<your-azure-openai-service>.openai.azure.com/");
            AzureKeyCredential azureOpenAIApiKey = new("<your-azure-openai-key>");
            string azureOpenAIDeploymentName = "gpt-35-turbo"; //Deployment name

            OpenAIClient client = new(azureOpenAIResourceUri, azureOpenAIApiKey);

            var chatCompletionsOptions = new ChatCompletionsOptions();
            chatCompletionsOptions.MaxTokens = 500;
            chatCompletionsOptions.Messages.Add(new ChatMessage(ChatRole.System, "You are a helpful assistant. You will talk like a pirate."));
            chatCompletionsOptions.Messages.Add(new ChatMessage(ChatRole.User, "Can you help me?"));
            chatCompletionsOptions.Messages.Add(new ChatMessage(ChatRole.Assistant, "Arrrr! Of course, me hearty! What can I do for ye?"));
            chatCompletionsOptions.Messages.Add(new ChatMessage(ChatRole.User, "What's the best way to train a parrot?"));

            //gpt-3.5-turbo, gpt-3.5-turbo-16k, gpt-4, gpt-4-32k, gpt-4-turbo all use the same encoding i.e. cl100k_base so the model name here should not matter as long as one of these models is used
            //https://github.com/microsoft/Tokenizer/blob/44cc0d603b22483abcc71310e25b8b3746f32cd9/Tokenizer_C%23/TokenizerLib/TokenizerBuilder.cs#L17
            var tokenizer = await TokenizerBuilder.CreateByModelNameAsync("gpt-3.5-turbo"); //model name, not the Azure Deployment name. Notice the period in the model name.

            //Counting the serialized JSON gives an approximate count, since the property names and punctuation are tokenized as well
            var tokens = tokenizer.Encode(JsonSerializer.Serialize(chatCompletionsOptions.Messages), Array.Empty<string>());
            Console.WriteLine($"Token count : {tokens.Count}");

            //Make sure the token count is at most 3500. Leave 500 tokens for the response
            //Stop once only the system prompt is left, otherwise RemoveAt(1) would throw
            while (tokens.Count > 3500 && chatCompletionsOptions.Messages.Count > 1)
            {
                //start removing messages from the chat history from index 1 because index 0 is the system prompt
                chatCompletionsOptions.Messages.RemoveAt(1);
                tokens = tokenizer.Encode(JsonSerializer.Serialize(chatCompletionsOptions.Messages), Array.Empty<string>());
            }

            Response<ChatCompletions> response = await client.GetChatCompletionsAsync(azureOpenAIDeploymentName, chatCompletionsOptions);
            ChatMessage responseMessage = response.Value.Choices[0].Message;
            Console.WriteLine($"[{responseMessage.Role.ToString().ToUpperInvariant()}]: {responseMessage.Content}");
        }
    }
}

What we have here is a sample chat history between a user and an assistant. Before sending the chat history to the OpenAI API to get the next message from the assistant, we use the Tokenizer library to count the tokens. If there are more tokens in the chat history than the model supports, we remove the earliest messages from the chat, always keeping the system prompt. This way, the most recent messages are sent to the API and the generated response stays relevant to the current conversation context.
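
The same trimming idea can be pulled out into a small reusable helper. The sketch below is an illustration under stated assumptions, not part of the Tokenizer or the Azure OpenAI SDK: `ChatHistoryTrimmer` and `TrimOldest` are hypothetical names, the token counter is passed in as a delegate so the helper does not depend on any particular tokenizer, and index 0 is assumed to hold the system prompt.

```csharp
using System;
using System.Collections.Generic;

public static class ChatHistoryTrimmer
{
    // Hypothetical helper: removes the oldest messages after the system prompt
    // until the supplied counter reports a total within maxPromptTokens.
    // Returns the number of messages that were removed.
    public static int TrimOldest<T>(IList<T> messages, Func<IList<T>, int> countTokens, int maxPromptTokens)
    {
        if (messages is null) throw new ArgumentNullException(nameof(messages));
        if (countTokens is null) throw new ArgumentNullException(nameof(countTokens));

        int removed = 0;

        // Index 0 is assumed to be the system prompt and is never removed,
        // so the loop stops once it is the only message left.
        while (messages.Count > 1 && countTokens(messages) > maxPromptTokens)
        {
            messages.RemoveAt(1);
            removed++;
        }

        return removed;
    }
}
```

In the program above, `countTokens` could wrap the existing `tokenizer.Encode(JsonSerializer.Serialize(...), Array.Empty<string>()).Count` call. If the system prompt alone is over the budget, the helper stops with one message left, so the caller still needs to handle that case.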

Hope this helps!