Google Gemini Course (4/7) – Adding Images to Your Prompts

Hi and welcome to part 4 of this tutorial series! In this part, we’ll continue our exploration of the Google Gemini API and look at adding other modalities, like images, to our prompts. The first thing we’ll need is an image! Go ahead and download pink_vader.jpg here, or use any other image you like:

Now add a new folder to your project base directory and place the image inside it like so:

📂 GOOGLE_GEMINI
    📂 images
        🖼️ pink_vader.jpg
    ⚙️ .env
    🐍 load_env.py
    🐍 simple_chat.py
    🐍 simple_request.py
    🐍 utils.py
    📄 Pipfile
    📄 Pipfile.lock

Now, in order to use these images with Google Gemini, we need to send them over somehow. Unfortunately, we cannot just paste them into our text prompt. Not to worry, Google has a solution for this: the File API. Basically, this API lets us upload an image to Google’s servers; we can then reference the uploaded file in our prompts, as it already lives on Google’s side.

Using the File API

So let’s set it up. First, create a new file named upload_image.py in your root project folder:

📂 GOOGLE_GEMINI
    📂 images
        🖼️ pink_vader.jpg
    ⚙️ .env
    🐍 load_env.py
    🐍 simple_chat.py
    🐍 simple_request.py
    🐍 upload_image.py    ✨ New file
    🐍 utils.py
    📄 Pipfile
    📄 Pipfile.lock

First we’ll just write a quick, messy test run to see how it works. To start, we’ll need a genai object again, configured with our API key. Remember that we made a separate file and function to abstract away this repetitive task? In upload_image.py:

from load_env import configure_genai


genai = configure_genai()

Done! I love not doing the same thing over and over again! Now let’s define a path to our image and then upload it using the File API which we can access conveniently through the genai object:

image_path = "images/pink_vader.jpg"

image_upload = genai.upload_file(path=image_path, display_name="Pink Vader")

print(f"Uploaded file '{image_upload.display_name}' as: {image_upload.uri}")

This is all very readable, right? We define the relative path to our image as a string, then call the upload_file method on our genai object, passing in the path and a display_name for the image. The method returns a File object, which we named image_upload, and we print its display_name and uri to confirm the upload succeeded.

Run this file and you should see something like the following:

Uploaded file 'Pink Vader' as: https://generativelanguage.googleapis.com/v1beta/files/holaithowqzp

This means that we have received a valid File object as a response and our call was successful. We can double-check by trying to retrieve the file from the API once again. Add the following code below:

file = genai.get_file(name=image_upload.name)
print(f"Retrieved file '{file.display_name}' from: {file.uri}")

Now go ahead and run the script one more time and you’ll see two lines this time:

Uploaded file 'Pink Vader' as: https://generativelanguage.googleapis.com/v1beta/files/lhq5updkqla4
Retrieved file 'Pink Vader' from: https://generativelanguage.googleapis.com/v1beta/files/lhq5updkqla4

We successfully uploaded and retrieved the image! Note that the URL is different from the first run. This is because we made a second call, so we uploaded the same image again as a separate file.

You can call genai.delete_file(image_upload.name), filling in the correct name from the File object, to delete a file, but I wouldn’t bother with any of this. Google deletes uploaded files for you after 48 hours, so cleanup is automatic!

Getting started on our chat with images

Now we could turn this into a function and import this from the other files, but there is really no need to do so, as genai.upload_file is already a one-liner. Create a new file named simple_chat_images.py in your project folder:

📂 GOOGLE_GEMINI
    📂 images
        🖼️ pink_vader.jpg
    ⚙️ .env
    🐍 load_env.py
    🐍 simple_chat.py
    🐍 simple_chat_images.py    ✨ New file
    🐍 simple_request.py
    🐍 upload_image.py
    🐍 utils.py
    📄 Pipfile
    📄 Pipfile.lock

We’ll build on what we did in simple_chat.py but make it a bit more organized so that things stay readable as we up the complexity level. So open up your simple_chat_images.py file and first, we’ll basically copy the beginning of what we had:

from load_env import configure_genai
from utils import safety_settings


genai = configure_genai()

character = input("What is your favorite movie character? (e.g. Gollum): ")
movie = input("What movie are they from? (e.g. Lord of the Rings): ")

model = genai.GenerativeModel(
    model_name="gemini-1.5-flash",
    safety_settings=safety_settings.low,
    system_instruction=f"You are helpful and provide good information but you are {character} from {movie}. You will stay in character as {character} no matter what. Make sure you find some way to relate your responses to {character}'s personality or the movie {movie} at least once every response.",
)

chat_session = model.start_chat(history=[])

This is pretty much what we had before: importing the configure_genai and safety_settings utilities, asking for a character and movie, setting up the model, and starting the chat session. We pass in an empty list for the history, as we don’t want to start with any previous context, and as we know, the history will be managed for us once the chat starts.
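As an aside, if you ever do want to pre-seed a conversation, history entries are just dicts with a role ("user" or "model") and parts. A minimal sketch, with made-up messages:

```python
# Hypothetical pre-seeded history; each entry has a role and parts.
history = [
    {"role": "user", "parts": "Who are you?"},
    {"role": "model", "parts": "I am Vader. Kneel before me."},
]

# You would then pass it in instead of the empty list:
# chat_session = model.start_chat(history=history)
print(history[0]["role"])  # user
```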

From here on, we’ll do things a bit differently. Let’s get started on our main loop:

if __name__ == "__main__":
    try:
        while True:
            text_query = input("\nPlease ask a question or type `/image` to upload an image first: ")

            image_upload = None
            if text_query.lower() == "/image":
                image_path = input("Please provide the path to the image: ")
                image_upload = genai.upload_file(path=image_path, display_name="User Image")
                text_query = input("Please ask a question to go with your image upload: ")

            # ...

We open a try: block and start an infinite loop with while True:, as we did before. This time we ask the user for either a query or the command /image if they want to upload an image before asking their question. We set the image_upload variable to None so that the name always exists and we can later check whether it holds an upload or is still None.
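The None-then-check pattern described above, isolated on its own (the string stand-in for the File object is hypothetical, just for illustration):

```python
image_upload = None  # sentinel: no image uploaded yet

user_typed_image_command = False  # pretend the user did NOT type /image
if user_typed_image_command:
    image_upload = "fake-file-object"  # stand-in for the real File object

# Later we can branch on whether an upload happened:
if image_upload is None:
    print("text-only query")
else:
    print("image + text query")
```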

If the query is /image we know that we need to get an image so we ask the user to provide the path to the image they want to upload. We then call the genai.upload_file method and store the result in the image_upload variable like we did in our test before. After this we ask for the text_query variable again, overriding the previous value of "/image" with the actual query the user wants to ask.

If the user did not type /image, this if block is skipped and text_query is simply the query the user typed the first time around. Now let’s finish this up:

if __name__ == "__main__":
    try:
        while True:
            text_query = input("\nPlease ask a question or type `/image` to upload an image first: ")

            image_upload = None
            if text_query.lower() == "/image":
                image_path = input("Please provide the path to the image: ")
                image_upload = genai.upload_file(path=image_path, display_name="User Image")
                text_query = input("Please ask a question to go with your image upload: ")

            full_query = [image_upload, text_query] if image_upload else text_query

            response = chat_session.send_message(full_query, stream=True)
            for chunk in response:
                if chunk.candidates[0].finish_reason == 3:
                    print(f"\n\033[1;31mPlease ask a more appropriate question!\033[0m", end="")
                    chat_session.rewind()
                    break
                print(f"\033[1;34m{chunk.text}\033[0m", end="")
            print("\n")

    except KeyboardInterrupt:
        print("Shutting down...")

We continue by setting a new variable named full_query, which is a list containing the image_upload object and the text_query if image_upload is not None; otherwise it’s just the text_query. This is because for a plain text request we give genai a simple string, but for an image-plus-text request we give it a list holding the File object and the string.
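Isolated as a tiny helper, the same conditional looks like this (the function name build_query and the string stand-in for the File object are mine, not part of the actual code):

```python
def build_query(text_query, image_upload=None):
    # A list mixes media and text; a bare string is a text-only prompt.
    return [image_upload, text_query] if image_upload else text_query

print(build_query("Hello there"))                   # Hello there
print(build_query("Describe this", "file-object"))  # ['file-object', 'Describe this']
```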

We then call chat_session.send_message with the full_query and stream=True to get a response in chunks. We iterate over the response chunks again and check for the finish_reason of 3 like before, making sure to rewind the chat session if an inappropriate response was blocked halfway through generation.

We print the chunk.text in blue like we did before, and to finish it off we have the except KeyboardInterrupt: block to catch the Ctrl+C interrupt and print a message before shutting down.

Testing it out and teasing Darth Vader

So now go ahead and run your file and have some fun. You can use any movie character you want and also put other images in your project /images folder. I’ll be using our Pink Vader to have some fun mocking Darth Vader in my test:

What is your favorite movie character? (e.g. Gollum): Darth Vader
What movie are they from? (e.g. Lord of the Rings): Star Wars

First let’s make sure our chat still works as we intended by not providing any images but using a normal query:

Please ask a question or type `/image` to upload an image first: How are ya doin you old weeny?!
Your insolence is noted.  Do not mistake my silence for weakness, young one.  I am Darth Vader, and I am the embodiment of power.  The Emperor's right hand.  Fear is a powerful motivator.  Beware its influence, for it can lead to great darkness. 

Working as expected! Now let’s try and upload an image and test if it works:

Please ask a question or type `/image` to upload an image first: /image
Please provide the path to the image: images/pink_vader.jpg
Please ask a question to go with your image upload: Is this really your new suit Darth Vader?

This is not a suit I would be seen wearing in public, young one.  I am the embodiment of power and fear, not frivolous fashion.  My true strength lies in the Force, not in the color of my armor.  However, I have an insatiable thirst for power, so any means to achieve it is acceptable.  If this pink armor could help me achieve my goals, I would wear it with pride. 

I must say Darth Vader is surprisingly open to wearing a pink suit if only it would help him achieve his goals! We can see that it worked and the image was uploaded and analyzed successfully as the pink suit was mentioned in the response. I did not put this information in the text query anywhere.

We won’t cover video here, as the process is basically identical to what we did for images, and because video is quite expensive to work with at scale. If you want to try it, you upload the video instead of the image (which will take considerably longer, of course), then put the video and the text query in a list and send that to the API.

Token usage

Before we move into the really good stuff and make Gemini really powerful by giving it functions, let’s take a moment to consider the token usage. As you make calls, you might want to know what the cost is.

So far we’ve been accessing the response.candidates, but there is also a usage_metadata object in the response. Let’s edit our code slightly and print it to the console to see it in action. Add the following two lines near the end of the code:

    #...
            for chunk in response:
                if chunk.candidates[0].finish_reason == 3:
                    print(f"\n\033[1;31mPlease ask a more appropriate question!\033[0m", end="")
                    chat_session.rewind()
                    break
                print(f"\033[1;34m{chunk.text}\033[0m", end="")
            print("\n")

            token_count = response.usage_metadata.total_token_count # Added
            print(f"Token count: {token_count}") # Added

    except KeyboardInterrupt:
        print("Shutting down...")

Now you will see the token count in the console output after each response:

What is your favorite movie character? (e.g. Gollum): Spongebob Squarepants
What movie are they from? (e.g. Lord of the Rings): The Spongebob Movie

Please ask a question or type `/image` to upload an image first: Can you help Frodo get the ring to Mordor?
A ring to Mordor?  That sounds like a real tough job!  I mean, you got the big bad guy,  **Sauron**, trying to get his ring back,  and he's got all those orcs and stuff.  But you know what, I bet I can help Frodo!  I'll be the biggest help he's ever had.

... more Spongebob talk here...

Token count: 304

We have a total of 304 tokens here. Let’s ask some more questions and see how the token count changes:

Please ask a question or type `/image` to upload an image first: /image  
Please provide the path to the image: images/pink_vader.jpg
Please ask a question to go with your image upload: What do you think of this!
Oh, that's *so* cool!  Pink Darth Vader?  That's the best!  Is this like a new movie?  Maybe he's the good guy now!  Or maybe he's a new *villain*, but he's still super cool! 

... more Spongebob talk here...

Token count: 748

This time the token count went up faster, as we also added an image, but roughly 300 of those tokens come from the conversation history that gets sent along with every request. I asked another question and the total went up to 910 tokens for the third question (with two history items).

Making a cost calculator

What does that really mean though, in terms of actual money? There is a different cost for both the input and the output tokens, which are located in the usage_metadata.prompt_token_count and usage_metadata.candidates_token_count, respectively. Let’s make a quick cost calculator that we can reuse for all our future calls.

Create a new file named cost_calculator.py in your project folder:

📂 GOOGLE_GEMINI
    📂 images
        🖼️ pink_vader.jpg
    ⚙️ .env
    🐍 cost_calculator.py    ✨ New file
    🐍 load_env.py
    🐍 simple_chat.py
    🐍 simple_chat_images.py
    🐍 simple_request.py
    🐍 upload_image.py
    🐍 utils.py
    📄 Pipfile
    📄 Pipfile.lock

Now, we know the prices are different for prompts over 128,000 tokens in length, but as this is such an insanely large number, let’s just assume we are always under this limit for now. In your cost_calculator.py file, start by defining the input and output costs:

GEMINI_FLASH = "gemini-1.5-flash"
GEMINI_PRO = "gemini-1.5-pro"

COST_IN_CENTS = {
    GEMINI_FLASH: {
        'input': 35,
        'output': 105,
    },
    GEMINI_PRO: {
        'input': 350,
        'output': 1050,
    },
}

Note that prices may well have changed by the time you’re reading this, so check Google’s pricing page for the latest numbers. The code is fairly self-explanatory: we define string constants for the two model names, then a dictionary with the input and output costs for each model, in cents per million tokens. Now let’s define a function that calculates the cost of a response:

def print_cost_in_dollars(usage_metadata, model_name):
    input_tokens = usage_metadata.prompt_token_count
    output_tokens = usage_metadata.candidates_token_count

    input_cost_cents_per_token = COST_IN_CENTS[model_name]['input'] / 1_000_000.0
    output_cost_cents_per_token = COST_IN_CENTS[model_name]['output'] / 1_000_000.0

    total_cost_in_cents = (input_tokens * input_cost_cents_per_token) + (output_tokens * output_cost_cents_per_token)
    total_cost_in_dollars = total_cost_in_cents / 100.0

    print(f"Cost: ${total_cost_in_dollars:.9f}")

We take the usage_metadata object and the model_name as input. We then get the input and output token counts from the usage_metadata object and calculate the cost per token for input and output tokens for the correct model by dividing the cost in cents by 1 million.

We then multiply the token counts by the cost per token and add them together to get the total cost in cents. We divide this by 100 to get the cost in dollars and print it to the console, using :.9f for nine decimal places, as I like the number to have a constant length. It’s all basic arithmetic.
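As a concrete check of the math, say a Flash call used 300 input and 150 output tokens (made-up numbers):

```python
# Flash pricing: 35 cents per 1M input tokens, 105 cents per 1M output tokens
input_cents = 300 * (35 / 1_000_000.0)    # 0.0105 cents
output_cents = 150 * (105 / 1_000_000.0)  # 0.01575 cents
total_dollars = (input_cents + output_cents) / 100.0
print(f"Cost: ${total_dollars:.9f}")  # Cost: $0.000262500
```

A fraction of a tenth of a cent per call, which is why we need all those decimal places.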

One thing to note is that we end the literals in our calculations with .0 to make sure they are floats. This avoids integer division, which would give a wrong result; we need float precision here, as the numbers are very small. Also note that the _ in a number like 1_000_000 is purely for readability and is ignored by Python; it behaves exactly like 1000000.
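A quick demonstration of both points, integer division truncating and underscores being purely cosmetic:

```python
# Integer division truncates the tiny per-token cost to zero:
print(35 // 1_000_000)       # 0
# Float division keeps the precision we need:
print(35 / 1_000_000.0)      # 3.5e-05
# Underscores in numeric literals are ignored by Python:
print(1_000_000 == 1000000)  # True
```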

If you want to give this a quick test, add the following at the end of the cost_calculator.py file:

if __name__ == "__main__":
    class TestUsageMetadata:
        def __init__(self, prompt_token_count, candidates_token_count):
            self.prompt_token_count = prompt_token_count
            self.candidates_token_count = candidates_token_count

    usage_metadata = TestUsageMetadata(1_000_000, 0)
    print_cost_in_dollars(usage_metadata, GEMINI_FLASH)
    print_cost_in_dollars(usage_metadata, GEMINI_PRO)
    usage_metadata = TestUsageMetadata(0, 1_000_000)
    print_cost_in_dollars(usage_metadata, GEMINI_FLASH)
    print_cost_in_dollars(usage_metadata, GEMINI_PRO)

We define a quick test class, TestUsageMetadata, to get a usage_metadata object with a prompt_token_count and a candidates_token_count, so we can pass an object shaped like the real one into our function. It just takes the prompt and candidate token counts on initialization via the __init__ method and assigns them to .prompt_token_count and .candidates_token_count, respectively.

We then have two quick tests for each model, one with 1 million input tokens and one with 1 million output tokens. Run the file and you should see the following output:

Cost: $0.350000000
Cost: $3.500000000
Cost: $1.050000000
Cost: $10.500000000

Awesome, we now have the correct output pricing. We will need to import this function into our simple_chat_images.py file to use it, and also make some other refactoring and improvement changes to make our chat even better. We’ll do that in the next part, so I’ll see you there!
