Monday, 23 September 2024

Using gpt-4o vision to understand images

OpenAI recently released gpt-4o, its new flagship model that can reason across audio, vision, and text in real time. It's a single model that can be provided with multiple types of input (multimodal) and can understand and respond based on all of them.

The model is also available on Azure OpenAI and today we are going to have a look at how to work with images using the vision capabilities of gpt-4o. We will be providing it with images directly as part of the chat and asking it to analyse the images before responding. Let's see how it works:

We will be using the Azure OpenAI service to work with gpt-4o, and since we will be writing .NET code, we will need the Azure OpenAI .NET SDK v2 (the `Azure.AI.OpenAI` NuGet package, version 2.x):

1. Basic image analysis

First, let's start with a simple scenario of sending an image to the model and asking it to describe it.

using Azure;
using Azure.AI.OpenAI;
using OpenAI.Chat;

namespace OpenAI.SDK.Test
{
    internal class Program
    {
        static async Task Main(string[] args)
        {
            string endpoint = "https://<myopenaiservice>.openai.azure.com/";
            string key = "<my-open-ai-service-key>";
            string deploymentName = "gpt-4o";

            var openAiClient = new AzureOpenAIClient(
                new Uri(endpoint),
                new AzureKeyCredential(key),
                new AzureOpenAIClientOptions(AzureOpenAIClientOptions.ServiceVersion.V2024_06_01));

            var chatClient = openAiClient.GetChatClient(deploymentName);

            List<ChatMessage> messages = [
                new UserChatMessage(
                    ChatMessageContentPart.CreateImageMessageContentPart(
                        new Uri("https://upload.wikimedia.org/wikipedia/commons/thumb/6/68/Orange_tabby_cat_sitting_on_fallen_leaves-Hisashi-01A.jpg/360px-Orange_tabby_cat_sitting_on_fallen_leaves-Hisashi-01A.jpg"),
                        ImageChatMessageContentPartDetail.High)),
                new UserChatMessage("Describe the image to me")
            ];

            ChatCompletion chatCompletion = await chatClient.CompleteChatAsync(messages);
            Console.WriteLine($"[ASSISTANT]: {chatCompletion.Content[0].Text}");
        }
    }
}

[ASSISTANT]: The image shows a ginger and white cat sitting on a ground covered with dry leaves. The cat has a white chest and paws, with a ginger coat on its back and head. Its ears are perked up, and it appears to be looking intently at something. The background is out of focus, highlighting the cat as the main subject.
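As a side note, if you would rather not manage API keys, the `AzureOpenAIClient` constructor also accepts a token credential. A minimal sketch, assuming the `Azure.Identity` NuGet package is installed and your identity has an appropriate role on the Azure OpenAI resource:

```csharp
using Azure.AI.OpenAI;
using Azure.Identity;

// Authenticate with Microsoft Entra ID instead of an API key.
var openAiClient = new AzureOpenAIClient(
    new Uri("https://<myopenaiservice>.openai.azure.com/"),
    new DefaultAzureCredential());
```

The rest of the code stays the same; `GetChatClient` and the chat calls are unchanged.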



2. Answer questions based on details in images

Next, let's give it a slightly more complex image of some ingredients and ask it to create a recipe:

Image source: allrecipes.com
var chatClient = openAiClient.GetChatClient(deploymentName);

List<ChatMessage> messages = [
    new UserChatMessage(
        ChatMessageContentPart.CreateImageMessageContentPart(
            new Uri("https://www.allrecipes.com/thmb/HbnN9fkzDBmzI83sbxOhtbfEQUE=/750x0/filters:no_upscale():max_bytes(150000):strip_icc():format(webp)/AR-15022-veggie-pizza-DDMFS-4x3-step-01-a32ad6054e974ecd9f79c8627bc9e811.jpg"),
            ImageChatMessageContentPartDetail.High)),
    new UserChatMessage("What can I cook with these ingredients?")
];

ChatCompletion chatCompletion = chatClient.CompleteChat(messages);
Console.WriteLine($"[ASSISTANT]: {chatCompletion.Content[0].Text}");

[ASSISTANT]: With these ingredients, you can make a vegetable pizza on a crescent roll crust. Here's a simple recipe:

Ingredients

  • Crescent roll dough (premade)
  • Cream cheese
  • Sour cream
  • Grated carrots
  • Broccoli florets
  • Chopped red bell pepper
  • Sliced radishes
  • Diced onion
  • Sliced celery
  • Ranch dressing mix (or powdered seasoning)

Instructions

  1. Prepare the Crust:

    • Preheat your oven according to the crescent roll package instructions.
    • Roll out the crescent dough onto a baking sheet, pressing the seams together to form a crust.
    • Bake until golden brown. Let it cool completely.
  2. Make the Spread:

    • In a mixing bowl, combine the cream cheese, sour cream, and ranch dressing mix until smooth.
  3. Assemble the Pizza:

    • Spread the cream cheese mixture evenly over the cooled crust.
    • Top with chopped broccoli, grated carrots, diced red bell pepper, sliced radishes, diced onion, and sliced celery.
  4. Serve:

    • Cut into squares and serve chilled.

This is a refreshing and colorful appetizer or snack!


3. Compare images

This one is my favourite: let's give it two images and ask it to compare them against each other. This can be useful in scenarios where there is a single "standard" image and we need to determine whether another image adheres to the standard.

var chatClient = openAiClient.GetChatClient(deploymentName);

List<ChatMessage> messages = [
    new UserChatMessage(
        ChatMessageContentPart.CreateImageMessageContentPart(
            new Uri("https://dalleprodsec.blob.core.windows.net/private/images/generated_00.png"),
            ImageChatMessageContentPartDetail.High)),
    new UserChatMessage(
        ChatMessageContentPart.CreateImageMessageContentPart(
            new Uri("https://dalleprodsec.blob.core.windows.net/private/images/generated_01.png"),
            ImageChatMessageContentPartDetail.High)),
    new UserChatMessage("Using the first image as a reference, what should change in the second image so that both images are similar?")
];

ChatCompletion chatCompletion = chatClient.CompleteChat(messages);
Console.WriteLine($"[ASSISTANT]: {chatCompletion.Content[0].Text}");

[ASSISTANT]: To make the second image more similar to the first image, consider the following changes:

  1. Noodle Type: Use spaghetti instead of rotini pasta.

  2. Ingredients: Incorporate cherry tomatoes and black olives, as seen in the first image, and reduce the variety of vegetables.

  3. Color Scheme: Emphasize more red and yellow colors from the tomatoes and peppers, similar to the first image.

  4. Garnish and Seasoning: Add fresh herbs like basil and a sprinkle of grated cheese or visible seasoning such as pepper.

  5. Presentation: Arrange the dish more concentrically for a spiral design, as in the first image.

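As a variation, both images and the question can also be packed into a single user message, since `UserChatMessage` accepts multiple content parts. A sketch using the same SDK surface (the `CreateTextMessageContentPart` factory is my assumption from this SDK version, and the image URLs are placeholders):

```csharp
// Both images and the question combined into one multi-part user message.
List<ChatMessage> messages = [
    new UserChatMessage(
        ChatMessageContentPart.CreateImageMessageContentPart(
            new Uri("https://<reference-image-url>"), ImageChatMessageContentPartDetail.High),
        ChatMessageContentPart.CreateImageMessageContentPart(
            new Uri("https://<candidate-image-url>"), ImageChatMessageContentPartDetail.High),
        ChatMessageContentPart.CreateTextMessageContentPart(
            "Using the first image as a reference, what should change in the second image so that both images are similar?"))
];
```

Either shape works for this scenario; a single message simply keeps the images and the question grouped as one turn.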

4. Binary data

If the URL of the image is not accessible anonymously, we can instead pass the model the binary data of the image:


var chatClient = openAiClient.GetChatClient(deploymentName);

List<ChatMessage> messages = [
    new UserChatMessage(
        ChatMessageContentPart.CreateImageMessageContentPart(
            BinaryData.FromStream(File.OpenRead("C:\\images\\ROBOT ASTRONAUT .png")),
            "image/png",
            ImageChatMessageContentPartDetail.High)),
    new UserChatMessage("What is in this image?")
];

ChatCompletion chatCompletion = chatClient.CompleteChat(messages);
Console.WriteLine($"[ASSISTANT]: {chatCompletion.Content[0].Text}");

[ASSISTANT]: The image depicts a futuristic humanoid robot standing in an alien landscape. The robot has a detailed, intricate design with visible mechanical components and a spacesuit-like exterior. In the background, a colorful cosmic scene with stars, a luminous nebula, and a distant planet can be seen, creating a sci-fi atmosphere.


5. Data URI

We can also use data URIs instead of direct URLs:

var chatClient = openAiClient.GetChatClient(deploymentName);

string dataURI = "data:image/jpeg;base64,<long-data-uri-of-image>";

// Extract the base64 payload after the comma and convert it to bytes
byte[] binaryData = Convert.FromBase64String(dataURI.Split(',')[1]);

List<ChatMessage> messages = [
    new UserChatMessage(
        ChatMessageContentPart.CreateImageMessageContentPart(
            BinaryData.FromBytes(binaryData),
            "image/jpeg",
            ImageChatMessageContentPartDetail.High)),
    new UserChatMessage("What is in this image?")
];

ChatCompletion chatCompletion = chatClient.CompleteChat(messages);
Console.WriteLine($"[ASSISTANT]: {chatCompletion.Content[0].Text}");

[ASSISTANT]: The image depicts a dramatic scene of a dragon with red scales and glowing eyes emerging from the clouds. Sunlight beams illuminate the creature, highlighting its sharp features and wings. The setting gives a mystical and powerful atmosphere.
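Going the other way, building a data URI from a local image file is just a base64 encode (the file path below is a hypothetical example):

```csharp
// Read a local image and wrap its bytes in a data URI (hypothetical path).
byte[] imageBytes = File.ReadAllBytes("C:\\images\\dragon.jpg");
string dataUri = $"data:image/jpeg;base64,{Convert.ToBase64String(imageBytes)}";
```

Keep the media type in the prefix consistent with the actual file format, since the model relies on it when decoding the image.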


6. Limitations

As per OpenAI docs, there are some limitations of the vision model that we should be aware of:

Medical images: The model is not suitable for interpreting specialized medical images like CT scans and shouldn't be used for medical advice.

Non-English: The model may not perform optimally when handling images with text of non-Latin alphabets, such as Japanese or Korean.

Small text: Enlarge text within the image to improve readability, but avoid cropping important details.

Rotation: The model may misinterpret rotated / upside-down text or images.

Visual elements: The model may struggle to understand graphs or text where colors or styles like solid, dashed, or dotted lines vary.

Spatial reasoning: The model struggles with tasks requiring precise spatial localization, such as identifying chess positions.

Accuracy: The model may generate incorrect descriptions or captions in certain scenarios.

Image shape: The model struggles with panoramic and fisheye images.

Metadata and resizing: The model doesn't process original file names or metadata, and images are resized before analysis, affecting their original dimensions.

Counting: The model may give only approximate counts of objects in images.

CAPTCHAs: For safety reasons, OpenAI blocks the submission of CAPTCHAs to the model.


Overall, I do think the ability to combine text and image input as part of the same chat is a game changer! This could unlock a lot of scenarios that were not possible with just a single mode of input. Very excited to see what comes next!

Hope you found the post useful!