Monday 14 October 2024

Working with the OpenAI Assistants API: Using file search

This is the second post in the series where we explore the OpenAI Assistants API. In this post, we will be looking at the file search capability, which allows us to upload files to the Assistants API and chat with them. See the following posts for the entire series:

Working with the OpenAI Assistants API: Create a simple assistant

Working with the OpenAI Assistants API: Using file search (this post)

Working with the OpenAI Assistants API: Using code interpreter (coming soon)

The file search API uses the Retrieval Augmented Generation (RAG) pattern, which has become popular recently. The added advantage of using the Assistants API for this is that the API manages document chunking, vectorizing and indexing for us. Without the Assistants API, we would have to use a separate service like Azure AI Search and manage the document indexing ourselves.

To upload and chat with documents using the Assistants API, we have to work with the following moving pieces:

  • First, we need to create a Vector Store in the Assistants API.
  • Then, we need to upload files using the OpenAI File client and add them to the vector store.
  • Finally, we need to connect the vector store to either an assistant or a thread, which enables the assistant to answer questions based on the documents.
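
Putting these steps together, here is a minimal sketch of how the wiring could look in C#. The client and method names below are based on a beta release of the Azure OpenAI .NET SDK and may differ in the version you use; the endpoint, key, file name and deployment name are placeholders.

```csharp
// Sketch: create a vector store, upload a file into it, and attach it to an assistant.
// Based on a beta Azure OpenAI .NET SDK -- exact names may vary between releases.
using System.ClientModel;
using Azure.AI.OpenAI;
using OpenAI.Assistants;
using OpenAI.Files;
using OpenAI.VectorStores;

AzureOpenAIClient azureClient = new(
    new Uri("https://<your-resource>.openai.azure.com/"),
    new ApiKeyCredential("<your-api-key>"));

// 1. Create a vector store in the Assistants API.
VectorStoreClient vectorStoreClient = azureClient.GetVectorStoreClient();
VectorStore vectorStore = vectorStoreClient.CreateVectorStore(
    new VectorStoreCreationOptions { Name = "sharepoint-docs" });

// 2. Upload a file with the file client and add it to the vector store.
FileClient fileClient = azureClient.GetFileClient();
OpenAIFileInfo file = fileClient.UploadFile("sharepoint-manual.pdf", FileUploadPurpose.Assistants);
vectorStoreClient.AddFileToVectorStore(vectorStore.Id, file.Id);

// 3. Connect the vector store to an assistant via the file_search tool.
AssistantClient assistantClient = azureClient.GetAssistantClient();
Assistant assistant = assistantClient.CreateAssistant("gpt-4o", new AssistantCreationOptions
{
    Instructions = "Answer the user's questions using the uploaded documents.",
    Tools = { new FileSearchToolDefinition() },
    ToolResources = new ToolResources
    {
        FileSearch = new FileSearchToolResources { VectorStoreIds = { vectorStore.Id } }
    }
});
```

Alternatively, the vector store can be attached to an individual thread's tool resources instead of the assistant, which scopes the documents to a single conversation.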

For the demo code, we will be using the Azure OpenAI service to work with the OpenAI gpt-4o model, and since we will be using .NET code, we will need the Azure OpenAI .NET SDK as well as the Azure.AI.OpenAI.Assistants NuGet packages.

Limitations

As per the OpenAI docs, there are some limitations for the file search tool:

  • Each vector store can hold up to 10,000 files.
  • The maximum size of a file which can be uploaded is 512 MB, and each file can contain at most 5,000,000 tokens (computed automatically when you attach a file).

When querying the documents in the vector store, we have to be aware of the following things which are not possible right now. However, the OpenAI team is working on this and some of these features will be available soon:

  • Support for deterministic pre-search filtering using custom metadata.
  • Support for parsing images within documents (including images of charts, graphs, tables etc.)
  • Support for retrievals over structured file formats (like csv or jsonl).
  • Better support for summarization — the tool today is optimized for search queries.

Currently supported file types can be found in the OpenAI docs.

Hope this helps!

Monday 7 October 2024

Working with the OpenAI Assistants API: Create a simple assistant

With OpenAI's recently released Assistants API, building AI bots becomes a lot easier. Using the API, an assistant can leverage custom instructions, files and tools (previously called functions) and answer user questions based on them.

Before the Assistants API, building such assistants was possible, but for a lot of things we had to use our own services, e.g. vector storage for file search, a database for maintaining chat history, etc.

The Assistants API gives us a handy wrapper on top of all these disparate services and a single endpoint to work with. So in this series of posts, let's have a look at what the Assistants API can do.

Working with the OpenAI Assistants API: Create a simple assistant (this post)

Working with the OpenAI Assistants API: Using file search

Working with the OpenAI Assistants API: Using code interpreter (coming soon)

The first thing we are going to do is build a simple assistant which has a "SharePoint Tutor" personality. It will be used to answer questions for users who are learning to use SharePoint. Before deep diving into the code, let's understand the different moving pieces of the Assistants API:

An assistant is a container in which all operations between the AI and the user are managed.

A thread is a list of messages which were exchanged between the user and AI. The thread is also responsible for maintaining the conversation history.

A run is a single invocation of an assistant based on the history in the thread as well as the tools available to the assistant. After a run is executed, new messages are generated and added to the thread.

For the demo code, we will be using the Azure OpenAI service to work with the OpenAI gpt-4o model, and since we will be using .NET code, we will need the Azure OpenAI .NET SDK as well as the Azure.AI.OpenAI.Assistants NuGet packages.
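
To make the moving pieces above concrete, here is a rough sketch of creating the "SharePoint Tutor" assistant, starting a thread, and executing a run. It is based on a beta release of the SDK, so exact type and method names may differ; the endpoint, key and deployment name are placeholders.

```csharp
// Sketch: assistant -> thread -> run, based on a beta Azure OpenAI .NET SDK.
using System.ClientModel;
using Azure.AI.OpenAI;
using OpenAI.Assistants;

AzureOpenAIClient azureClient = new(
    new Uri("https://<your-resource>.openai.azure.com/"),
    new ApiKeyCredential("<your-api-key>"));
AssistantClient assistantClient = azureClient.GetAssistantClient();

// The assistant holds the instructions (the "SharePoint Tutor" personality).
Assistant assistant = assistantClient.CreateAssistant("gpt-4o", new AssistantCreationOptions
{
    Name = "SharePoint Tutor",
    Instructions = "You are a friendly tutor who helps users learn SharePoint."
});

// The thread holds the conversation history.
AssistantThread thread = assistantClient.CreateThread();
assistantClient.CreateMessage(thread.Id, MessageRole.User,
    ["How do I create a document library?"]);

// A run is a single invocation of the assistant over the thread; poll until it finishes.
ThreadRun run = assistantClient.CreateRun(thread.Id, assistant.Id);
while (!run.Status.IsTerminal)
{
    Thread.Sleep(TimeSpan.FromSeconds(1));
    run = assistantClient.GetRun(thread.Id, run.Id);
}

// After the run completes, new messages are appended to the thread.
foreach (ThreadMessage message in assistantClient.GetMessages(thread.Id))
{
    foreach (MessageContent content in message.Content)
    {
        Console.WriteLine($"{message.Role}: {content.Text}");
    }
}
```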

This was a simple assistant creation, just to get us familiar with the Assistants API. In the next posts, we will dive deeper into the API and explore more advanced concepts. Stay tuned!

Monday 23 September 2024

Using gpt-4o vision to understand images

OpenAI released gpt-4o recently, which is the new flagship model that can reason across audio, vision, and text in real time. It's a single model which can be provided with multiple types of input (multimodal), and it can understand and respond based on all of them.

The model is also available on Azure OpenAI and today we are going to have a look at how to work with images using the vision capabilities of gpt-4o. We will be providing it with images directly as part of the chat and asking it to analyse the images before responding. Let's see how it works:

We will be using the Azure OpenAI service for working with the OpenAI gpt-4o model, and since we will be using .NET code, we will need the Azure OpenAI .NET SDK v2.

1. Basic image analysis

First, let's start with a simple scenario of sending an image to the model and asking it to describe it.
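
As a sketch of what that looks like with the Azure OpenAI .NET SDK v2 (the image URL, endpoint and deployment name below are placeholders):

```csharp
using System.ClientModel;
using Azure.AI.OpenAI;
using OpenAI.Chat;

AzureOpenAIClient azureClient = new(
    new Uri("https://<your-resource>.openai.azure.com/"),
    new ApiKeyCredential("<your-api-key>"));
ChatClient chatClient = azureClient.GetChatClient("gpt-4o");

// Send a text part and an image part together in the same user message.
ChatCompletion completion = chatClient.CompleteChat(
[
    new UserChatMessage(
        ChatMessageContentPart.CreateTextPart("Describe this image."),
        ChatMessageContentPart.CreateImagePart(new Uri("https://example.com/photo.jpg")))
]);

Console.WriteLine(completion.Content[0].Text);
```

The scenarios that follow all use this same shape; only the prompt and the image parts change.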


2. Answer questions based on details in images

Next, let's give it a slightly more complex image of some ingredients and ask it to create a recipe:

Image source: allrecipes.com

3. Compare images

This one is my favourite: let's give it two images and ask it to compare them against each other. This can be useful in scenarios where there is a single "standard" image and we need to determine if another image adheres to the standard.

4. Binary data

If the URL of the image is not accessible anonymously, then we can also give the model binary data of the image:
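
A sketch of the binary data approach, reading the image from disk and passing its bytes along with the media type (the file name is a placeholder, and `chatClient` is a `ChatClient` for the gpt-4o deployment as created earlier):

```csharp
// Pass raw image bytes instead of a URL.
byte[] imageBytes = File.ReadAllBytes("photo.jpg");

ChatCompletion completion = chatClient.CompleteChat(
[
    new UserChatMessage(
        ChatMessageContentPart.CreateTextPart("Describe this image."),
        ChatMessageContentPart.CreateImagePart(
            BinaryData.FromBytes(imageBytes), "image/jpeg"))
]);
```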


5. Data URI

We can also use Data URIs instead of direct URLs:
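
For example, building a base64 data URI from the image bytes and passing it in place of a web URL (again a sketch; the file name is a placeholder):

```csharp
// Build a data URI from the image bytes and use it where a web URL would go.
byte[] imageBytes = File.ReadAllBytes("photo.jpg");
string dataUri = $"data:image/jpeg;base64,{Convert.ToBase64String(imageBytes)}";

ChatMessageContentPart imagePart =
    ChatMessageContentPart.CreateImagePart(new Uri(dataUri));
```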



6. Limitations

As per OpenAI docs, there are some limitations of the vision model that we should be aware of:

Medical images: The model is not suitable for interpreting specialized medical images like CT scans and shouldn't be used for medical advice.

Non-English: The model may not perform optimally when handling images with text of non-Latin alphabets, such as Japanese or Korean.

Small text: Enlarge text within the image to improve readability, but avoid cropping important details.

Rotation: The model may misinterpret rotated / upside-down text or images.

Visual elements: The model may struggle to understand graphs or text where colors or styles like solid, dashed, or dotted lines vary.

Spatial reasoning: The model struggles with tasks requiring precise spatial localization, such as identifying chess positions.

Accuracy: The model may generate incorrect descriptions or captions in certain scenarios.

Image shape: The model struggles with panoramic and fisheye images.

Metadata and resizing: The model doesn't process original file names or metadata, and images are resized before analysis, affecting their original dimensions.

Counting: The model may give approximate counts for objects in images.

CAPTCHAs: For safety reasons, OpenAI has implemented a system to block the submission of CAPTCHAs.


Overall, I do think the ability to combine text and image input as part of the same chat is a game changer! This could unlock a lot of scenarios which were not possible with just a single mode of input. Very excited to see what is next!

Hope you found the post useful!


Sunday 10 March 2024

Create a Microsoft 365 Copilot plugin: Extend Microsoft 365 Copilot's knowledge

Microsoft 365 Copilot is an enterprise AI tool that is already trained on your Microsoft 365 data. If you want to "talk" to data such as your emails, Teams chats or SharePoint documents, then all of it is already available as part of its "knowledge".

However, not all the data you want to work with will live in Microsoft 365. There will be instances when you want to use Copilot's AI on data residing in external systems. So how do we extend the knowledge of Microsoft 365 Copilot with real time data coming from external systems? The answer is by using plugins! Plugins not only help us do Retrieval Augmented Generation (RAG) with Copilot, but they also provide a framework for writing data to external systems. 

To know more about the different Microsoft 365 Copilot extensibility options, please have a look here: https://learn.microsoft.com/en-us/microsoft-365-copilot/extensibility/decision-guide

So in this post, let's have a look at how to build a plugin which talks to an external API and then infuses real time knowledge into Copilot's AI. At the time of this writing, there is nothing more volatile than cryptocurrency prices! So, I will be using a cryptocurrency price API to enhance Microsoft 365 Copilot's knowledge with real time Bitcoin and Ethereum rates!


So let's see the different moving parts of the plugin. We will be using a Microsoft Teams message extension built on the Bot Framework as a base for our plugin:  

1) App manifest

This is by far the most important part of the plugin. The name and description (both short and long) are what tell Copilot about the nature of the plugin and when to invoke it to get external data. We have to be very descriptive and clear about the features of the plugin here, as this is what Copilot will use to determine whether the plugin is invoked. The parameter descriptions are used to tell Copilot how to create the parameters required by the plugin based on the conversation.
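
As an illustration, the relevant part of the Teams app manifest could look something like this (the command id, parameter name and descriptions below are examples, not the exact values from my solution):

```json
"composeExtensions": [
  {
    "botId": "<bot-app-id>",
    "commands": [
      {
        "id": "cryptoSearch",
        "title": "Cryptocurrency rates",
        "description": "Retrieves real time rates for cryptocurrencies like Bitcoin and Ethereum",
        "parameters": [
          {
            "name": "cryptoName",
            "title": "Cryptocurrency name",
            "description": "The name or symbol of the cryptocurrency to get the current rate for, e.g. Bitcoin or BTC"
          }
        ]
      }
    ]
  }
]
```

Copilot matches the user's intent against these descriptions, so the more specific they are, the more reliably the plugin gets invoked with sensible parameter values.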

2) Teams messaging extension code

This function does the heavy lifting in our code. It is called by Copilot with the parameters specified in the app manifest. Based on the parameters, we can fetch external data and return it as adaptive cards.
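
In a Bot Framework based message extension, that function is the query handler on the `TeamsActivityHandler`. A simplified sketch follows: the command and parameter names are illustrative, `CryptoService.GetRateAsync` is a hypothetical helper that calls the price API, and a hero card is used here for brevity where the actual solution returns adaptive cards.

```csharp
using Microsoft.Bot.Builder;
using Microsoft.Bot.Builder.Teams;
using Microsoft.Bot.Schema;
using Microsoft.Bot.Schema.Teams;

public class MessageExtension : TeamsActivityHandler
{
    protected override async Task<MessagingExtensionResponse> OnTeamsMessagingExtensionQueryAsync(
        ITurnContext<IInvokeActivity> turnContext,
        MessagingExtensionQuery query,
        CancellationToken cancellationToken)
    {
        // Copilot fills this parameter from the conversation, guided by the manifest description.
        string cryptoName = query.Parameters?
            .FirstOrDefault(p => p.Name == "cryptoName")?.Value?.ToString() ?? "Bitcoin";

        // Fetch the external data (hypothetical helper, shown in the next section).
        decimal rate = await CryptoService.GetRateAsync(cryptoName);

        // Return the data as a card attachment for Copilot to ground its answer on.
        var card = new HeroCard(title: cryptoName, text: $"Current rate: {rate} USD");
        var attachment = new MessagingExtensionAttachment(HeroCard.ContentType, null, card);

        return new MessagingExtensionResponse
        {
            ComposeExtension = new MessagingExtensionResult("list", "result", new[] { attachment })
        };
    }
}
```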

3) Talk to the external system (Cryptocurrency API)

This is a helper function which is used to actually talk to the cryptocurrency API and return the rates.
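
A sketch of such a helper, using a hypothetical price endpoint (the URL and JSON shape here are placeholders; substitute the actual API used):

```csharp
using System.Text.Json;

public static class CryptoService
{
    private static readonly HttpClient httpClient = new();

    // Calls a (hypothetical) price API and extracts the current rate from the JSON response.
    public static async Task<decimal> GetRateAsync(string cryptoName)
    {
        string json = await httpClient.GetStringAsync(
            $"https://api.example.com/rates?symbol={Uri.EscapeDataString(cryptoName)}");

        using JsonDocument doc = JsonDocument.Parse(json);
        return doc.RootElement.GetProperty("rate").GetDecimal();
    }
}
```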

Hope you found this post useful! 

The code for this solution is available on GitHub: https://github.com/vman/M365CopilotPlugin

Thursday 15 February 2024

Generate images using Azure OpenAI DALL·E 3 in SPFx

DALL·E 3 is the latest AI image generation model coming out of OpenAI. It is leaps and bounds ahead of the previous model, DALL·E 2. Having explored both, I found that the image quality as well as the adherence to text prompts is much better in DALL·E 3. It is now available as a preview in the Azure OpenAI Service as well.

Given all this, it is safe to say that if you are working on the Microsoft stack and want to generate images with AI, using the Azure OpenAI DALL·E 3 model would be the recommended option.

In this post, let's explore the image generation API for DALL·E 3 and also how to use it from a SharePoint Framework (SPFx) solution. The full code of the solution is available on GitHub: https://github.com/vman/Augmentech.OpenAI

First, let's build the web API which will wrap the Azure OpenAI API to create images. This will be a simple ASP.NET Core Web API which accepts a text prompt and returns the generated image to the client.

To run this code, we will need the following NuGet package: https://www.nuget.org/packages/Azure.AI.OpenAI/1.0.0-beta.13/
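
A sketch of such an endpoint as an ASP.NET Core minimal API, using the image generation types from the 1.0.0-beta.13 SDK (names may differ in later versions; the endpoint, key and deployment name are placeholders):

```csharp
using Azure;
using Azure.AI.OpenAI;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

var openAIClient = new OpenAIClient(
    new Uri("https://<your-resource>.openai.azure.com/"),
    new AzureKeyCredential("<your-api-key>"));

app.MapPost("/api/generateimage", async (ImageRequest request) =>
{
    // Ask the DALL-E 3 deployment to generate an image for the given prompt.
    Response<ImageGenerations> response = await openAIClient.GetImageGenerationsAsync(
        new ImageGenerationOptions
        {
            DeploymentName = "dalle3",   // name of your DALL-E 3 deployment
            Prompt = request.Prompt,
            Size = ImageSize.Size1024x1024
        });

    // Return the URL of the generated image to the client.
    return Results.Ok(new { Url = response.Value.Data[0].Url });
});

app.Run();

record ImageRequest(string Prompt);
```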

Now for calling the API, we will use a standard React based SPFx webpart. The webpart will use Fluent UI controls to grab the text prompt from the user and send it to our API.

Hope this helps!

Thursday 25 January 2024

Get structured JSON data back from GPT-4-Turbo

With the latest gpt-4-turbo model out recently, there is one very helpful feature which came with it: the JSON mode option. Using JSON mode, we are able to predictably get responses back from OpenAI in a structured JSON format.

This can help immensely when building APIs using Large Language Models (LLMs). Even though the model could previously be instructed to return JSON in its system prompt, there was no guarantee that the model would return valid JSON. With the JSON mode option, we can now specify the required format and the model will return data according to it.

To know more about JSON mode, have a look at the official OpenAI docs: https://platform.openai.com/docs/guides/text-generation/json-mode

Now let's look at some code to see how this works in action:

I am using the Azure OpenAI service to host the gpt-4-turbo model and I am also using the v1.0.0-beta.12 version of the Azure OpenAI .NET SDK found on NuGet here:

https://www.nuget.org/packages/Azure.AI.OpenAI/1.0.0-beta.12
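
The sample below sketches the idea: a system prompt describing the expected JSON shape, plus setting `ResponseFormat` to JSON on the chat options. The endpoint, key, deployment name and sample text are placeholders.

```csharp
using Azure;
using Azure.AI.OpenAI;
using System.Text.Json;

var client = new OpenAIClient(
    new Uri("https://<your-resource>.openai.azure.com/"),
    new AzureKeyCredential("<your-api-key>"));

var options = new ChatCompletionsOptions
{
    DeploymentName = "gpt-4-turbo",
    // Explicitly tell the model to return a JSON object.
    ResponseFormat = ChatCompletionsResponseFormat.JsonObject,
    Messages =
    {
        new ChatRequestSystemMessage(
            "Analyse the text provided by the user and extract the cities mentioned in it. " +
            "Return them as JSON in the format: { \"cities\": [\"city1\", \"city2\"] }"),
        new ChatRequestUserMessage(
            "I flew from London to New York, with a stopover in Reykjavik.")
    }
};

Response<ChatCompletions> response = await client.GetChatCompletionsAsync(options);
string json = response.Value.Choices[0].Message.Content;

// Deserialize the JSON response into strongly typed objects.
CityResult? result = JsonSerializer.Deserialize<CityResult>(json);

record CityResult(List<string> cities);
```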

What is happening in the code is that in the system message, we are instructing the LLM to analyse the text provided by the user, extract the cities mentioned in it, and return them in the specified JSON format.

Also important is the line where we explicitly specify JSON as the response format.

Next, we provide the actual text to parse in the user message.

Once we get the data back in the expected JSON schema, we are able to convert it to objects which can be used in code.

And as expected, we get the cities back in the specified JSON format as output.


Hope this helps!

Monday 18 December 2023

Using Microsoft Tokenizer to count Azure OpenAI model tokens

If you have been working with the OpenAI APIs, you will have come across the term "tokens". Tokens are the units in which these APIs process and output text. Various versions of the OpenAI APIs have different token context lengths, which means there is a limit to the text they can process in a single request. More about tokens here: https://learn.microsoft.com/en-us/azure/ai-services/openai/overview#tokens

When building an app based on these APIs, we need to keep track of the tokens being sent and make sure not to send more than the maximum context length of the OpenAI model being used (e.g. gpt-3.5-turbo). If more tokens are sent than the maximum context length of the model, the request fails with an error saying the model's maximum context length has been exceeded.

To help with counting tokens before sending text to the APIs, there are various libraries available. One of them is the Microsoft Tokenizer: https://github.com/microsoft/Tokenizer, an open source .NET and TypeScript implementation of OpenAI's tiktoken library.

So in this post, let's see how we can use the Microsoft Tokenizer .NET SDK to manage the tokens sent to OpenAI APIs.

First, we will need the Microsoft Tokenizer NuGet package:

https://www.nuget.org/packages/Microsoft.DeepDev.TokenizerLib/

Since we will actually be counting the tokens of a chat between the user and an AI assistant, we will also use the Azure OpenAI .NET SDK:

https://www.nuget.org/packages/Azure.AI.OpenAI/1.0.0-beta.8

Next, in our code we will first have to initialize the tokenizer and let it know which OpenAI model we will be working with. Most of the recent models like gpt-3.5-turbo, gpt-4 etc. share the same token encoding, i.e. cl100k_base, so we can use the same tokenizer across these models.

Now let's look at the actual code:
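
A sketch of that trimming logic follows. The exact `TokenizerBuilder`/`Encode` signatures are from the version of the library I used and may have changed; the chat history and token limit are samples.

```csharp
using Microsoft.DeepDev;

// Initialize a tokenizer for the cl100k_base family of models (gpt-3.5-turbo, gpt-4).
ITokenizer tokenizer = await TokenizerBuilder.CreateByModelNameAsync("gpt-3.5-turbo");

// Sample chat history between a user and an assistant.
var chatHistory = new List<(string Role, string Content)>
{
    ("system", "You are a helpful assistant."),
    ("user", "What is the capital of France?"),
    ("assistant", "The capital of France is Paris."),
    ("user", "And what is its population?")
};

const int maxTokens = 4096;

int CountTokens() =>
    chatHistory.Sum(m => tokenizer.Encode(m.Content, new HashSet<string>()).Count);

// Trim the oldest non-system messages until the history fits the model's context,
// so the most recent conversation is what gets sent to the API.
while (CountTokens() > maxTokens && chatHistory.Count > 1)
{
    chatHistory.RemoveAt(1);
}
```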

What we have here is a sample chat history between a user and an assistant. Before sending the chat history to the OpenAI API to get the next message from the assistant, we use the Tokenizer library to count the tokens. If it turns out that there are more tokens present in the chat than the model supports, we remove the earlier messages from the chat. This way, the most recent conversations are sent to the API and the response generated stays relevant to the current conversation context.

Hope this helps!