(3/6) OpenAI API Mastery: Innovating with GPT-4 Turbo and DALL·E 3 – GPT-4 Turbo and Vision

Welcome back to part 3 of this course, where we’ll be looking at the newest version of GPT-4, named GPT-4 Turbo!

GPT-4 Turbo is even more capable and now has a knowledge cutoff of April 2023, so it is much more up to date on recent events.

GPT-4 Turbo is also 3X cheaper for input tokens and 2X cheaper for output tokens compared to the original GPT-4 model. The maximum number of output tokens for this model is 4096.

It also introduces a 128k maximum input context, which is absolutely insane, and there is also a ‘vision’ version that can analyze our images. More details on all of this as we go through this part, which is going to be a lot of fun!

First, create a new folder in your base directory named 3_GPT4_turbo. Inside this folder, create a new file named gpt4_turbo.py:

📁FINX_OPENAI_UPDATES (root project folder)
    📁1_Parallel_function_calling
    📁2_JSON_mode_and_seeds
    📁3_GPT4_turbo
        📄gpt4_turbo.py
    📄.env

Inside this file let’s start with our imports:

from decouple import config
from openai import OpenAI
from pathlib import Path

client = OpenAI(api_key=config("OPENAI_API_KEY"))

And then let’s give GPT-4 Turbo a quick test run (You can skip this one and just look along with my results, we’ll be removing this basic code again in a second):

messages = [
    {
        "role": "system",
        "content": "You make up fun children's short stories according to the user request.",
    },
    {"role": "user", "content": "Tell me a story about a unicorn in space."},
]

response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=messages,
    stop=["\n"],  # stop at the first newline so the sample story stays short
)

print(response.choices[0].message.content, "\n")

Now if I run this same request with GPT-3.5 Turbo and then with GPT-4 Turbo, we can really feel the difference in quality:

3.5: Once upon a time, in a magical land far, far away, there lived a majestic unicorn named Luna. Luna was not an ordinary unicorn - she had the ability to travel through space and visit other worlds. One day, as she was wandering through the enchanted forest, she stumbled upon a mysterious portal that appeared out of nowhere.

4: Once upon a time, in a shimmering galaxy far, far away, there lived a unique and magical creature known as Stellar, the space-faring unicorn. Stellar had a coat that sparkled with the radiance of a thousand stars and a mane that flowed behind him like the trails of comets across the night sky. His horn, forged from cosmic crystal, glowed with an ethereal light that reflected the colors of distant nebulae.

We can see that 4 is much more descriptive and at a higher level of writing. It really paints a vivid picture like a writer would.

Pricing

We said that GPT-4 Turbo is cheaper than GPT-4, which effectively makes GPT-4 Turbo the new GPT-4, as it’s both more capable and cheaper.

Input tokens are 3x cheaper than GPT-4 and output tokens are 2x cheaper, and the new GPT-3.5 Turbo is also cheaper than the older 3.5 Turbo version. But what are the costs exactly?

The new GPT-4 Turbo model is priced at Input: $0.01 and Output: $0.03 per 1000 tokens, which while indeed cheaper than the old GPT-4, is still 10 times more expensive than the new version of GPT-3.5 Turbo, which is priced at Input: $0.001 and Output: $0.002 per 1000 tokens.

This is why it’s good to use the new 3.5 Turbo model whenever you can as it’s often simply good enough and much cheaper.
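To put those prices in perspective, here’s a minimal back-of-the-envelope sketch of what a single call costs at the per-1,000-token rates quoted above. The model names and prices are just a snapshot of the current preview models, so treat the numbers as illustrative rather than authoritative:

PRICES_PER_1K = {
    "gpt-4-1106-preview": {"input": 0.01, "output": 0.03},
    "gpt-3.5-turbo-1106": {"input": 0.001, "output": 0.002},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    # cost = (tokens / 1000) * price per 1000 tokens, for input and output separately
    price = PRICES_PER_1K[model]
    return (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]

print(estimate_cost("gpt-4-1106-preview", 2_000, 500))  # 0.035 -> about 3.5 cents
print(estimate_cost("gpt-3.5-turbo-1106", 2_000, 500))  # 0.003 -> about 0.3 cents

The same 2,000-tokens-in / 500-tokens-out call is roughly ten times cheaper on the new 3.5 Turbo, which is exactly the point made above.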

Both the new 3.5 Turbo and 4 Turbo models have the parallel function calling feature, JSON mode, seeds, and other improvements, so the 3.5 Turbo model certainly isn’t shabby either.

One of the points that stands out about GPT-4 Turbo is that we no longer have to worry much about the context limit, or how big a prompt we can send to ChatGPT at once. The new GPT-4 Turbo has a whopping context limit of 128k tokens, which is about 300 pages of text in a single prompt, although there are some caveats to this as we’ll soon find out.

The new GPT-3.5 Turbo we have been using so far also has an increased context limit of 16k tokens by the way, which in itself is very high already.

Testing Out the Context Limit

So let’s test out the context limit.

In order to do this we first need a very large text, so let’s go to our trusty copyright-free book website, Project Gutenberg, and download a book.

I’ll be using “The Importance of Being Earnest: A Trivial Comedy for Serious People”, which is actually a historical play that is now in the public domain. You can download a full-text version from the Project Gutenberg website.

Inside the 3_GPT4_turbo folder, create a new file called book.txt and just copy and paste the entire text from the website inside this text file:

📁FINX_OPENAI_UPDATES (root project folder)
    📁1_Parallel_function_calling
    📁2_JSON_mode_and_seeds
    📁3_GPT4_turbo
        📄book.txt (has the whole book text inside)
        📄gpt4_turbo.py
    📄.env

Now go back to your 'gpt4_turbo.py' file and remove everything except the imports and the client variable, as we don’t want to keep generating stories about unicorns, I think we’ve had enough of those for now!

This is all that will be left in your 'gpt4_turbo.py' file:

from decouple import config
from openai import OpenAI
from pathlib import Path

client = OpenAI(api_key=config("OPENAI_API_KEY"))

Now we’re going to use pathlib to define a variable that holds the path to our book:

path_to_book = Path(__file__).parent / "book.txt"

This line uses the Path class from the pathlib module, __file__ is a built-in variable in Python that contains the path of the Python file that is currently being executed.

The .parent attribute of a Path object gives the directory containing the file, in this case, the '3_GPT4_turbo' folder. As this is also the folder that contains our book.txt file, we can use the / operator to append the filename to the path and voila we have a path to our book.txt file!
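If you want to quickly convince yourself the path resolves correctly, a purely optional sanity check looks like this:

print(path_to_book)           # something like .../3_GPT4_turbo/book.txt on your machine
print(path_to_book.exists())  # True once book.txt has been saved in the folder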

Book Summarizer

So let’s write a book summarizer. As it’s fairly simple, we’ll just go over it in one go:

def book_summary_GPT(file_path):
    with open(file_path, "r", encoding="utf8") as file:
        book = file.read()

    messages = [
        {
            "role": "system",
            "content": "You are a book-summarizing AI. You will receive a book as the query and you will return a summary that is not too long and summarizes the important main points and happenings of the book. Make sure to use only the text provided for your summary, and not any other knowledge you may have.",
        },
        {
            "role": "user",
            "content": book,
        },
    ]

    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=messages,
    )

    content = response.choices[0].message.content
    # write the output to summary.txt (created in the current working directory)
    with open("summary.txt", "w", encoding="utf8") as file:
        file.write(content)

    print(content)
    return content

We take a file_path as input and then open it in read mode with the encoding utf8. We then read the text into a variable named book. We set up the messages with a system instruction for summarizing a book.

We make a call to "gpt-4-1106-preview" which is GPT-4 Turbo, and we pass in our messages.

We then get the content and write it to a file named summary.txt where 'w' means write mode. This file will be created if it doesn’t exist. We then print and return the content.

The Issue: Rate Limits

Ok sounds simple enough, so let’s run it.

(Hint, we’re going to run into some trouble!)

Alternatively, if you’re concerned about the costs of sending this many tokens you can look along on my screen for this part, but it won’t be super expensive unless you make hundreds of calls:

book_summary_GPT(path_to_book)

Now if we run this! BAM error!

openai.RateLimitError: Error code: 429 - {'error': {'message': 'Request too large for gpt-4-1106-preview in organization org-oMYMXpp7Cr9pG1rG5Z8a1T2w on tokens per min (TPM): Limit 10000, Requested 34407. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}

So what’s going on here? I thought the model had a 128k limit and we only tried to send 34k tokens.

Well, the model may have a context limit of 128,000 tokens, but at this tier we’re only allowed to send 10,000 tokens per minute, which makes that huge context window effectively unusable for now!

Why is that?

It turns out there is a tier list based on how much you have used the API so far:

Tier 1: $5 used (plus a certain number of days elapsed), 10,000 tokens per minute
Tier 2: $50 used (plus a certain number of days elapsed), 40,000 tokens per minute
Tier 3: $100 used (plus a certain number of days elapsed), 80,000 tokens per minute
Tier 4: $250 used (plus a certain number of days elapsed), 300,000 tokens per minute
Tier 5: $1,000 used (plus a certain number of days elapsed), 300,000 tokens per minute

So yeah, that’s something to be aware of. As of now, the start of November 2023, you will not be able to utilize the full potential of this model unless you’ve already been using the API heavily, for example professionally on your website.

Even though I use it extensively for local development and test projects, I never exceed 1-2 dollars per month, so I’ve yet to even reach Tier 2!

These limits will likely be loosened in the future though, so by the time you watch this tutorial, this may no longer be an issue. Note that the whole book we wanted to pass in was still only 34,407 tokens in total, meaning without the tokens per minute rate limit we could have passed in this whole book three times over in a single prompt!
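If you want to know up front whether a text will fit under your tokens-per-minute limit, you can count its tokens locally before making the call. Here’s a small sketch that assumes you have the tiktoken package installed (pip install tiktoken); cl100k_base is the tokenizer used by the GPT-3.5/GPT-4 family:

import tiktoken

def count_tokens(text: str) -> int:
    # encode the text with the GPT-3.5 / GPT-4 tokenizer and count the tokens
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

book_text = path_to_book.read_text(encoding="utf8")
print(count_tokens(book_text))  # roughly 34,000 tokens for this book

This only counts the raw book text; the system message and the rest of the request add a little on top, which lines up with the 34,407 requested tokens in the error above.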

A Shorter Test

So let’s run it with just the first 30,000 characters then, shall we? That’s still a load of text (roughly 7,500 tokens, comfortably under the 10,000-per-minute limit), and it should be enough to show the potential of handling this much text at once.

In your book_summary_GPT function, find the following:

    {
        "role": "user",
        "content": book,
    },

and simply change it to:

    {
        "role": "user",
        "content": book[:30_000],
    },

Now if you run the function again, you’ll get a second surprise and a wise lesson:

book_summary_GPT(path_to_book)

Note that interestingly, the entire play is summarized even though we only provided a portion. I suspect the play and other public domain works are part of the training data that ChatGPT was trained on, or at the very least some summaries of the play were.

I tried to find a public domain book that is not already present in ChatGPT’s training data, but it’s really hard to do so, which kind of proves a point.

If we need a summary of an existing work we can just not pass in all this data and ask for a summary of “The Importance of Being Earnest” by Oscar Wilde to achieve the same thing!

ChatGPT has so much knowledge that often you really won’t need to pass in all that data to ask it to do something on a certain topic.

It’s still nice to know you’ll be able to do so after the limits are loosened, and this will mostly be useful for proprietary business data that is unique to your business, because if it’s part of the open internet, at this rate, chances are ChatGPT already knows it.

Also keep in mind that passing such massive prompts should be avoided when possible considering the costs, which go up linearly with the enormous number of tokens you can theoretically pass to GPT-4 Turbo.

Where a normal prompt is often only a couple of lines or paragraphs, a whole book equates to many, many of those normal calls at once.

GPT-4 Turbo Vision

Now let’s have a look at something very exciting. GPT-4 Turbo Vision! Go ahead and make a new file in the '3_GPT4_turbo' directory named 'gpt4_turbo_vision.py' and let’s get started.

📁FINX_OPENAI_UPDATES (root project folder)
    📁1_Parallel_function_calling
    📁2_JSON_mode_and_seeds
    📁3_GPT4_turbo
        📄book.txt (has the whole book text inside)
        📄gpt4_turbo.py
        📄gpt4_turbo_vision.py
    📄.env

First get started with your imports:

from decouple import config
from openai import OpenAI

client = OpenAI(api_key=config("OPENAI_API_KEY"))

And let’s just do a test run with the simplest possible way to use GPT-4 Turbo Vision: linking to an image that is already on the internet. I’ll be using a random image hosted on Imgur, as Imgur allows us to hotlink to their files directly.

Now add the following API call:

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    # the URL is wrapped in an object with a "url" key, consistent with the later examples
                    "image_url": {"url": "https://i.imgur.com/M6bTG.jpeg"},
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)

Make sure you pass in the specific "gpt-4-vision-preview" model and notice how simple this syntax is. We pass in a user message, but in this case the content is a list. The first entry is of type text and the second is of type image_url, allowing us to send both as our query.

Just a quick note on the max_tokens parameter: usually you don’t really have to worry about setting it unless you want a limit, but for this model there seems to be a pretty low default, so go ahead and set it to something like 300 to make sure your response message does not get cut off halfway through.

So let’s run the file to test it out.

The image URL we chose deliberately does not give any hints as to what is in the image, so any analysis will be purely from GPT vision. My output was:

This image shows a view from inside a refrigerator looking outwards. There's a person who appears to be a man standing and looking into the fridge, holding a cat. The fridge contains various items such as bottles, what appears to be mustard, and other condiments. There's a dish rack with yellow utensils above on one of the fridge shelves, indicating that there might be more storage space above the fridge. The kitchen seems to have a window with blinds, allowing natural light into the room, and there's a plant visible through the window.

Now that is seriously impressive, and also really fun. Before we look at a more practical use case for this, let’s discuss some of the details and specs of this model.

Costs and Limitations

If you do this on hundreds of images, especially high-resolution ones, your usage costs will go up faster than with text-only calls, as the images have to be analyzed and processed by the API.

For example, a 1024 x 1024 square image in 'detail: high' mode costs 765 tokens (more on detail settings in a moment), which amounts to $0.00765 at current pricing.

That’s not a huge deal if you process a couple of images but if you do thousands it racks up quickly.

The model also still has some limitations, which I’m sure will largely disappear with future iterations. For now, it struggles with non-English text (especially non-Latin alphabets), rotated images, graphs, the exact spatial location of objects, and tasks like counting.

Building DinnerGPT

So let’s look at a more realistic implementation where the image is hosted locally on our server.

Say the user uploads an image to our server through our website or mobile application. Our application is called “What’s for dinner!”, and the service we provide is simple: the user uploads a picture of their fridge and we tell them what they can cook with the ingredients they have.

We want to make a call to GPT-4 Turbo and upload our image to ask it for the answer.

First of all, let’s get two images of fridge contents to test with. If you want to really make this fun, take one picture each of your open fridge and your open freezer and use those!

I’ll be using two images from the internet, as I’m about to head out on a short trip to Japan and the current contents of our fridge and freezer are rather shabby at the moment.

First go to https://unsplash.com/photos/green-and-pink-plastic-container-AEU9UZstCfs and download this image using the ‘Download free’ button in the top right. I will be using the ‘medium’ size from the choice menu.

These images are free for us to use and you don’t even have to log in or sign up. Thanks “Ello” from Unsplash!

Save this image as 'fridge.jpg' in the '3_GPT4_turbo' folder.

Then for the second image, I will use https://unsplash.com/photos/a-refrigerator-filled-with-lots-of-food-and-drinks-ZUhM8LE_HGc with a thank you to Unsplash user “Staton” for this one. Save it as fridge2.jpg in the same folder.

Finally, make a new file called dinner.py for our code, your folders should now look like this:

📁FINX_OPENAI_UPDATES (root project folder)
    📁1_Parallel_function_calling
    📁2_JSON_mode_and_seeds
    📁3_GPT4_turbo
        📄book.txt
        📄dinner.py (new empty file)
        📄fridge.jpg (new image)
        📄fridge2.jpg (new image)
        📄gpt4_turbo.py
        📄gpt4_turbo_vision.py
    📄.env

Now inside our 'dinner.py' file let’s get started on our imports:

from decouple import config
from openai import OpenAI
from pathlib import Path
import base64

client = OpenAI(api_key=config("OPENAI_API_KEY"))
get_path = lambda filename: Path(__file__).parent / filename

We import decouple, openai, and pathlib as before, and this time we also import base64, which we’ll go over in a second.

We initialize the client as usual and then we define a simple lambda that takes a filename as input and returns the path to that file using the current directory trick we used before. (Assuming the file exists in the same directory as the file we’re executing, of course).
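As a side note, assigning a lambda to a name works fine here, but PEP 8 generally prefers a regular def for this; if you’d rather write it that way, this equivalent helper does exactly the same thing:

def get_path(filename: str) -> Path:
    # return the path to a file sitting next to this script
    return Path(__file__).parent / filename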

Base64 Encoding

The problem, as always, is that ChatGPT only takes text input, so we need to encode our images as text.

Let’s write a simple function to do just that:

def image_encoder(image_path: str):
    with open(image_path, "rb") as image:
        return base64.b64encode(image.read()).decode("utf-8")

It takes an image_path and returns a base64 encoded string representation of the image file at that path.

We then open the file at the path specified by image_path in binary mode ("rb"). This is necessary because image files are binary files, and attempting to open them in text mode would result in errors.

The read method is called on the file object to read the entire contents of the file. The result, a bytes object containing the binary data from the file, is then passed to the b64encode function from the base64 module.

This encodes the binary data into a base64 encoded bytes object.

Base64 encoding is a way of taking binary data and turning it into text so that it’s more easily transmitted in things like email and HTML form data. Finally, the decode method is called on the base64 encoded bytes object to convert it into a string using UTF-8 encoding.

This string is then returned by the function.
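If you want to see what the encoder actually produces, here’s a quick optional check (assuming the fridge.jpg we saved earlier is in this folder):

encoded = image_encoder(get_path("fridge.jpg"))
print(encoded[:40] + "...")  # a very long base64 string; base64-encoded JPEGs start with "/9j/"
print(len(encoded))          # roughly 4/3 the size of the binary file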

Creating the DinnerGPT Function

Ok, now that we have a way to send our picture to ChatGPT in text format, let’s create our dinnerGPT bot.

def dinnerGPT(image_name: str):
    image_path = get_path(image_name)
    base64_image = image_encoder(image_path)

First we go ahead and take the image name, which should be the name of an image in the same folder as our 'dinner.py' file.

We then get the path to this image using our lambda function and then we encode the image using our image_encoder function.

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "system",
                "content": "You are a dinner AI. You will receive a picture of a fridge and recommend what the user can cook using the ingredients you see.",
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What can I cook from the stuff in my fridge?",
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}",
                        },
                    },
                ],
            },
        ],
        max_tokens=500,
    )
    answer = response.choices[0].message.content
    return answer

Now we make a call using the 'vision-preview' version of the GPT-4 Turbo model.

The messages list contains a fairly simple system message followed by the user message.

Again we have a content key with a question. We don’t take the question from the user but just hardcode it in here as we want dinnerGPT to always answer this question, and the user to only upload an image.

The second entry is an image_url, which may seem a bit odd, but this is actually treated as a URL that starts with data:image/jpeg;base64, and then the base64 encoded image data.

This is a special URL that is used to embed images. Remember to set the max_tokens with GPT vision to make sure your response message doesn’t get cut off.
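For reference, the final URL value we send looks roughly like this (the real base64 payload runs to tens of thousands of characters; "/9j/" is simply how a base64-encoded JPEG happens to start):

data:image/jpeg;base64,/9j/4AAQSkZJRg... (followed by the rest of the encoded image data)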

Finally, we return the answer from the response. Now let’s test it out:

print(dinnerGPT("fridge.jpg"))

And we get some recommendations!

Based on the ingredients visible in your fridge, here are a few meal suggestions you could consider:

1. Vegetable Stir-fry: Utilize the green and red bell peppers, mushrooms, and greens (like kale) from your fridge. Sauté them in a hot pan with a bit of oil, add soy sauce or your favorite stir-fry sauce, and serve over rice or noodles. You can also add some green onions for extra flavor.

2. Garden Salad: Create a fresh salad using the greens, slice some of those strawberries for a sweet touch, add diced bell peppers for crunch, and if there are any nuts or seeds in your pantry, sprinkle those on top as well. A light vinaigrette would work well as a dressing.

3. Stuffed Bell Peppers: Hollow out the green or red bell peppers, stuff them with a mixture of rice, sautéed mushrooms, spices, and if you have any cheese or protein sources on hand, include those as well. Bake in the oven until the peppers are soft and the filling is heated through.

4. Brussels Sprouts and Kale Sauté: Trim the Brussels sprouts, slice them in half, and sauté with chopped kale. Add garlic and onions if available, season with salt and pepper, and finish with a squeeze of lemon juice for a healthy side dish.

5. Fruit Salad: Combine strawberries with any other available fruits and some lemon zest for a simple and refreshing dessert or snack.

Make sure to season your dishes well with salt, pepper, and any herbs or spices you may have on hand to enhance the flavors. If there are any additional ingredients in containers that are not visible, those could also offer more possibilities for meal creation.

Very impressive, dinnerGPT! The recommendations are based on the products seen in our image, how cool is that!

The Detail Setting

There is a 'detail' setting, which can be defined in the image URL object in the user message, like below:

{
    "type": "image_url",
    "image_url": {
        "url": f"data:image/jpeg;base64,{base64_image}",
        "detail": "high",
    },
},

The default setting is 'auto', which leaves the choice to the model, and small details are generally picked up quite well. The other explicit options are high and low, with low being faster and cheaper but obviously capturing less detail.

The low option reduces the image to 512×512 pixels and uses very few tokens, whereas the high option starts from this same low-resolution pass but then also scales the original image down and cuts it into 512×512 squares, going over each of these for detailed analysis.

The number of these squares, and thus the cost, is determined by the size of the image.
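If you’re curious where the 765-token figure from earlier comes from, here is a small estimator based on OpenAI’s published pricing formula: a flat 85 tokens for the low-resolution pass, plus 170 tokens per 512×512 tile, after the image is scaled to fit within 2048×2048 and then scaled so its shortest side is 768 pixels. Treat it as a sketch of the pricing logic rather than an official calculator:

import math

def vision_token_estimate(width: int, height: int, detail: str = "high") -> int:
    # low detail always costs a flat 85 tokens, regardless of image size
    if detail == "low":
        return 85
    # high detail: first scale the image to fit within a 2048 x 2048 square
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # then scale so the shortest side is 768 pixels
    scale = 768 / min(width, height)
    width, height = width * scale, height * scale
    # count the 512 x 512 tiles: 170 tokens each, plus the 85-token base pass
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

print(vision_token_estimate(1024, 1024))                # 765, matching the example above
print(vision_token_estimate(1024, 1024, detail="low"))  # 85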

We’ll stick with high for the detailed fridge analysis, but do keep in mind that if you’re going to be making hundreds of calls, the cost is going to be a bit higher than for general text-only GPT calls.

Taking DinnerGPT to the Next Level

So now let’s take this to the next level.

Obviously, most users will have a freezer as well, some may even have multiple fridges or freezers, and an image of the pantry would be nice to include as well, so how can we send multiple pictures at once?

Make sure to comment out the print(dinnerGPT("fridge.jpg")) line and let’s continue in our 'dinner.py' file, writing a second version of our dinnerGPT bot below the existing code:

def dinnerGPTmulti(image_name_list: list[str]):
    messages = [
        {
            "role": "system",
            "content": "You are a dinner AI. You will receive one or more pictures of fridges/refrigerators/pantries/etc and recommend what the user can cook using the ingredients you see.",
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What can I cook from the stuff in my fridge?",
                },
                # This is where the images will be appended
            ],
        },
    ]

Ok, so we define dinnerGPTmulti, which takes a list of strings with image names this time, so we can pass in any number of images, as we won’t know how many there are.

We then define our messages list, which is the same as before except we left out the entire image_url object, as we’ll be appending this dynamically, based on how many images we have.

    image_path_list = [get_path(image_name) for image_name in image_name_list]
    base64_image_list = [image_encoder(image_path) for image_path in image_path_list]
    for image in base64_image_list:
        messages[1]["content"].append(
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{image}",
                },
            },
        )

For each image in the list of images, we get the path using our lambda function.

We then encode the images using our image_encoder function and then loop over each image in the list and append a new image_url object to the content list of the second message.

This is the same as before, except we’re doing it in a loop now.

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=messages,
        max_tokens=500,
    )
    answer = response.choices[0].message.content
    return answer

Now we simply make the call and return the answer! Your whole dinnerGPTmulti function looks like this:

def dinnerGPTmulti(image_name_list: list[str]):
    messages = [
        {
            "role": "system",
            "content": "You are a dinner AI. You will receive one or more pictures of fridges/refrigerators/pantries/etc and recommend what the user can cook using the ingredients you see.",
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What can I cook from the stuff in my fridge?",
                },
                # This is where the images will be appended
            ],
        },
    ]
    image_path_list = [get_path(image_name) for image_name in image_name_list]
    base64_image_list = [image_encoder(image_path) for image_path in image_path_list]
    for image in base64_image_list:
        messages[1]["content"].append(
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{image}",
                },
            },
        )

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=messages,
        max_tokens=500,
    )
    answer = response.choices[0].message.content
    return answer

Pretty simple right? Now let’s test it out passing in both our images at once!

print(dinnerGPTmulti(["fridge.jpg", "fridge2.jpg"]))

I won’t paste the entire response in here again but this time it mentions ingredients from both images.

Many ingredients are present in both images, like the bell peppers, eggs, and strawberries, but the jar of minced garlic is specifically called out in the answer, proving that both images have been analyzed and the results combined. I don’t know about you, but my mind is blown.

As computing power becomes cheaper and faster, computers and robots are going to see and analyze the world around us faster and faster and with ever-increasing understanding.

All those futuristic robot movies seem to be getting less and less futuristic. Anyway, that’s it for part 3. I hope you enjoyed it, and I’ll see you in part 4, where we’ll look at the new DALL·E 3 model for generating images, as well as AI image edits and variations.

👉 See you there!