Hi and welcome to part 4 of this tutorial series! In this part, we'll continue our exploration of the Google Gemini API and look at adding multiple modalities, like images, to our prompts. The first thing we'll need is an image! Go ahead and download pink_vader.jpg here, or use any other image you like:

Now add a new folder to your project base directory and place the image inside it like so:
```
📁 GOOGLE_GEMINI
    📁 images
        🖼️ pink_vader.jpg
    ⚙️ .env
    📄 load_env.py
    📄 simple_chat.py
    📄 simple_request.py
    📄 utils.py
    📄 Pipfile
    📄 Pipfile.lock
```
Now in order for us to use these images with Google Gemini we need to send them over somehow. We cannot just put them into our text prompt, unfortunately. Not to worry, Google has a solution for this! We can use the File API. Basically, this API will allow us to upload an image to Google’s servers and we can then reference this uploaded image in our prompts as it is already on the Google servers.
Using the File API
So let’s set it up. First, create a new file named upload_image.py in your root project folder:
```
📁 GOOGLE_GEMINI
    📁 images
        🖼️ pink_vader.jpg
    ⚙️ .env
    📄 load_env.py
    📄 simple_chat.py
    📄 simple_request.py
    📄 upload_image.py    ✨ New file
    📄 utils.py
    📄 Pipfile
    📄 Pipfile.lock
```
First we'll just create a messy test run to see how it works. To start, we'll need a `genai` object again, which has our API key and configuration inside. Remember we made a separate file and function to abstract away this repetitive task? In `upload_image.py`:
```python
from load_env import configure_genai

genai = configure_genai()
```
Done! I love not doing the same thing over and over again! Now let's define a path to our image and then upload it using the File API, which we can access conveniently through the `genai` object:
```python
image_path = "images/pink_vader.jpg"
image_upload = genai.upload_file(path=image_path, display_name="Pink Vader")
print(f"Uploaded file '{image_upload.display_name}' as: {image_upload.uri}")
```
This is all very readable, right? We just define the relative path to our image in a string. Then we call the `upload_file` method on our `genai` object and pass in the path and a `display_name` for the image. The method returns a `File` object, which we called `image_upload`, and we print the `display_name` and the `uri` to see if it worked successfully.
Run this file and you should see something like the following:
Uploaded file 'Pink Vader' as: https://generativelanguage.googleapis.com/v1beta/files/holaithowqzp
This means that we have received a valid `File` object as a response and our call was successful. We can double-check by trying to retrieve the file from the API once again. Add the following code below:
```python
file = genai.get_file(name=image_upload.name)
print(f"Retrieved file '{file.display_name}' from: {file.uri}")
```
Now go ahead and run the script one more time and you’ll see two lines this time:
Uploaded file 'Pink Vader' as: https://generativelanguage.googleapis.com/v1beta/files/lhq5updkqla4 Retrieved file 'Pink Vader' from: https://generativelanguage.googleapis.com/v1beta/files/lhq5updkqla4
We successfully uploaded and retrieved the image! Note that we have a different URL this time. This is because we made a second call so we uploaded the same image again to a separate file.
You can call `genai.delete_file(image_upload.name)`, filling in the correct name from the `File` object, to delete the file, but I would advise you not to bother with this. Google will delete old files for you after 48 hours, so there is an automatic cleanup!
Getting started on our chat with images
Now we could turn this into a function and import it from the other files, but there is really no need, as `genai.upload_file` is already a one-liner. Create a new file named `simple_chat_images.py` in your project folder:
```
📁 GOOGLE_GEMINI
    📁 images
        🖼️ pink_vader.jpg
    ⚙️ .env
    📄 load_env.py
    📄 simple_chat.py
    📄 simple_chat_images.py    ✨ New file
    📄 simple_request.py
    📄 upload_image.py
    📄 utils.py
    📄 Pipfile
    📄 Pipfile.lock
```
We'll build on what we did in `simple_chat.py` but make it a bit more organized so that things stay readable as we up the complexity level. So open up your `simple_chat_images.py` file and first, we'll basically copy the beginning of what we had:
```python
from load_env import configure_genai
from utils import safety_settings

genai = configure_genai()

character = input("What is your favorite movie character? (e.g. Gollum): ")
movie = input("What movie are they from? (e.g. Lord of the Rings): ")

model = genai.GenerativeModel(
    model_name="gemini-1.5-flash",
    safety_settings=safety_settings.low,
    system_instruction=f"You are helpful and provide good information but you are {character} from {movie}. You will stay in character as {character} no matter what. Make sure you find some way to relate your responses to {character}'s personality or the movie {movie} at least once every response.",
)

chat_session = model.start_chat(history=[])
```
This is pretty much what we had before: we import the `configure_genai` and `safety_settings` utilities, ask for a character and movie, set up the model, and start the chat session. We pass in an empty list for the history, as we don't want to start with any previous history, and we know the history will be managed for us once the chat starts.
From here on, we’ll do things a bit differently. Let’s get started on our main loop:
```python
if __name__ == "__main__":
    try:
        while True:
            text_query = input("\nPlease ask a question or type `/image` to upload an image first: ")
            image_upload = None

            if text_query.lower() == "/image":
                image_path = input("Please provide the path to the image: ")
                image_upload = genai.upload_file(path=image_path, display_name="User Image")
                text_query = input("Please ask a question to go with your image upload: ")

            # ...
```
We open up a `try:` block and start an infinite loop with `while True:` as we did before. This time we ask for either a query, or for the user to type the command `/image` if they want to upload an image before they start their query. We set the `image_upload` variable to `None` so that the variable name exists and we can check later on whether it has content or is still set to `None`.
If the query is `/image`, we know that we need to get an image, so we ask the user to provide the path to the image they want to upload. We then call the `genai.upload_file` method and store the result in the `image_upload` variable, like we did in our test before. After this we ask for the `text_query` variable again, overriding the previous value of `"/image"` with the actual query the user wants to ask.
If the user did not type `/image`, this `if` block is skipped and `text_query` is simply the query the user typed the first time around. Now let's finish this up:
```python
if __name__ == "__main__":
    try:
        while True:
            text_query = input("\nPlease ask a question or type `/image` to upload an image first: ")
            image_upload = None

            if text_query.lower() == "/image":
                image_path = input("Please provide the path to the image: ")
                image_upload = genai.upload_file(path=image_path, display_name="User Image")
                text_query = input("Please ask a question to go with your image upload: ")

            full_query = [image_upload, text_query] if image_upload else text_query
            response = chat_session.send_message(full_query, stream=True)

            for chunk in response:
                if chunk.candidates[0].finish_reason == 3:
                    print(f"\n\033[1;31mPlease ask a more appropriate question!\033[0m", end="")
                    chat_session.rewind()
                    break
                print(f"\033[1;34m{chunk.text}\033[0m", end="")
            print("\n")
    except KeyboardInterrupt:
        print("Shutting down...")
```
We continue by setting a new variable named `full_query`, which is a list containing the `image_upload` object and the `text_query` if `image_upload` is not `None`; otherwise it's just the `text_query`. This is because for a normal request we just give `genai` a simple string, but for an image/text request we give it a list with the image object and the string.
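To see the two shapes of `full_query` on their own, here is a tiny sketch that doesn't call the API; `FakeUpload` is a hypothetical stand-in for the `File` object, not part of the real library:

```python
class FakeUpload:
    """Hypothetical stand-in for the File object returned by upload_file."""
    display_name = "Pink Vader"

def build_query(text_query, image_upload=None):
    # Same conditional as in our chat loop: a plain string for text-only
    # turns, a [file, text] list when an image was uploaded first.
    return [image_upload, text_query] if image_upload else text_query

print(build_query("Hello there!"))                                 # just the string
print(type(build_query("What is this?", FakeUpload())).__name__)   # list
```

Either shape is accepted by `send_message`, which is why the one-line conditional is all we need.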
We then call `chat_session.send_message` with the `full_query` and `stream=True` to get a response in chunks. We iterate over the response chunks again and check for a `finish_reason` of `3` like before, making sure to `rewind` the chat session if an inappropriate response was blocked halfway through generation.
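In case you're wondering where the magic number `3` comes from: `finish_reason` is an enum in the API, and as far as I can tell from the documentation its values map roughly as follows (a sketch, so double-check against the current docs):

```python
# Rough map of the Gemini API's FinishReason values (verify against the docs).
FINISH_REASONS = {
    0: "FINISH_REASON_UNSPECIFIED",  # default, unused
    1: "STOP",         # natural end of generation
    2: "MAX_TOKENS",   # hit the output token limit
    3: "SAFETY",       # blocked by the safety filters <- what we check for
    4: "RECITATION",   # blocked for reciting training data
    5: "OTHER",        # anything else
}

print(FINISH_REASONS[3])
```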
We print the `chunk.text` in blue like we did before, and to finish it off we have the `except KeyboardInterrupt:` block to catch the `Ctrl+C` interrupt and print a message before shutting down.
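In case those escape codes look like magic: `\033` is the ESC character, `[1;34m` switches the terminal to bold blue, `[1;31m` to bold red, and `[0m` resets the styling. A small self-contained sketch:

```python
BLUE = "\033[1;34m"   # bold, blue foreground
RED = "\033[1;31m"    # bold, red foreground
RESET = "\033[0m"     # back to the terminal default

def colorize(text, color):
    # Wrap the text in a color code and a reset, like our print calls do.
    return f"{color}{text}{RESET}"

print(colorize("model output", BLUE))
print(colorize("Please ask a more appropriate question!", RED))
```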
Testing it out and teasing Darth Vader
So now go ahead and run your file and have some fun. You can use any movie character you want and also put other images in your project's `/images` folder. I'll be using our Pink Vader to have some fun mocking Darth Vader in my test:
```
What is your favorite movie character? (e.g. Gollum): Darth Vader
What movie are they from? (e.g. Lord of the Rings): Star Wars
```
First let’s make sure our chat still works as we intended by not providing any images but using a normal query:
```
Please ask a question or type `/image` to upload an image first: How are ya doin you old weeny?!

Your insolence is noted. Do not mistake my silence for weakness, young one. I am Darth Vader, and I am the embodiment of power. The Emperor's right hand. Fear is a powerful motivator. Beware its influence, for it can lead to great darkness.
```
Working as expected! Now let’s try and upload an image and test if it works:
```
Please ask a question or type `/image` to upload an image first: /image
Please provide the path to the image: images/pink_vader.jpg
Please ask a question to go with your image upload: Is this really your new suit Darth Vader?

This is not a suit I would be seen wearing in public, young one. I am the embodiment of power and fear, not frivolous fashion. My true strength lies in the Force, not in the color of my armor. However, I have an insatiable thirst for power, so any means to achieve it is acceptable. If this pink armor could help me achieve my goals, I would wear it with pride.
```
I must say Darth Vader is surprisingly open to wearing a pink suit if it would help him achieve his goals! We can see that the image was uploaded and analyzed successfully, as the pink suit was mentioned in the response even though I did not mention it anywhere in the text query.
We won't go over video here, as the process is basically the same as what we have done for images, and because videos are quite expensive to work with at scale. If you want to try it, you just upload the video instead of the image, which will of course take considerably longer, and then send the API a list with the video followed by the text query.
Token usage
Before we move into the really good stuff and make Gemini really powerful by giving it functions, let’s take a moment to consider the token usage. As you make calls, you might want to know what the cost is.
So far we've been accessing `response.candidates`, but there is also a `usage_metadata` object in the response. Let's edit our code slightly and print it to the console to see it in action. Add the following two lines near the end of the code:
```python
            # ...
            for chunk in response:
                if chunk.candidates[0].finish_reason == 3:
                    print(f"\n\033[1;31mPlease ask a more appropriate question!\033[0m", end="")
                    chat_session.rewind()
                    break
                print(f"\033[1;34m{chunk.text}\033[0m", end="")
            print("\n")

            token_count = response.usage_metadata.total_token_count  # Added
            print(f"Token count: {token_count}")  # Added
    except KeyboardInterrupt:
        print("Shutting down...")
```
Now you will see the token count in the console output after each response:
```
What is your favorite movie character? (e.g. Gollum): Spongebob Squarepants
What movie are they from? (e.g. Lord of the Rings): The Spongebob Movie

Please ask a question or type `/image` to upload an image first: Can you help Frodo get the ring to Mordor?

A ring to Mordor? That sounds like a real tough job! I mean, you got the big bad guy, **Sauron**, trying to get his ring back, and he's got all those orcs and stuff. But you know what, I bet I can help Frodo! I'll be the biggest help he's ever had.

... more Spongebob talk here...

Token count: 304
```
We have a total of 304 tokens here. Let's ask some more questions and see how the token count changes:
```
Please ask a question or type `/image` to upload an image first: /image
Please provide the path to the image: images/pink_vader.jpg
Please ask a question to go with your image upload: What do you think of this!

Oh, that's *so* cool! Pink Darth Vader? That's the best! Is this like a new movie? Maybe he's the good guy now! Or maybe he's a new *villain*, but he's still super cool!

... more Spongebob talk here...

Token count: 748
```
This time the token count went up a bit faster, as we added an image as well, but 300 or so of those tokens are from the history kept in the chat session. I asked another question and it went up to 910 tokens total for the third question (with two history items).
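The reason the count keeps climbing is that every new prompt re-sends the accumulated history, so earlier turns get counted again on each call. A rough sketch of this effect, with made-up token numbers:

```python
# (new_input_tokens, output_tokens) per turn -- numbers are made up.
turns = [(40, 260), (90, 210), (60, 180)]

history = 0
for new_input, output in turns:
    prompt = history + new_input   # the whole history is re-sent each turn
    total = prompt + output        # roughly what total_token_count reports
    print(f"prompt={prompt}, output={output}, total={total}")
    history = total                # this turn becomes part of the history
```

Even though each new question is short, the prompt side grows every turn, which is exactly the pattern we see in the counts above.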
Making a cost calculator
What does that really mean, though, in terms of actual money? There are different costs for the input and the output tokens, which are located in `usage_metadata.prompt_token_count` and `usage_metadata.candidates_token_count`, respectively. Let's make a quick cost calculator that we can reuse for all our future calls.
Create a new file named `cost_calculator.py` in your project folder:
```
📁 GOOGLE_GEMINI
    📁 images
        🖼️ pink_vader.jpg
    ⚙️ .env
    📄 cost_calculator.py    ✨ New file
    📄 load_env.py
    📄 simple_chat.py
    📄 simple_chat_images.py
    📄 simple_request.py
    📄 upload_image.py
    📄 utils.py
    📄 Pipfile
    📄 Pipfile.lock
```
Now, we know the prices are different for prompts that are over 128,000 tokens in length, but as this is such a large number, let's just assume we are always under this limit for now. In your `cost_calculator.py` file, start by defining the input and output costs:
```python
GEMINI_FLASH = "gemini-1.5-flash"
GEMINI_PRO = "gemini-1.5-pro"

COST_IN_CENTS = {
    GEMINI_FLASH: {
        'input': 35,
        'output': 105,
    },
    GEMINI_PRO: {
        'input': 350,
        'output': 1050,
    },
}
```
Note that prices may well have changed by the time you're reading this, so check Google for the latest pricing. The code is fairly self-explanatory: we define the two string names of the models, and then a dictionary with the input and output costs, in cents per million tokens, for each model. Now let's define a function that calculates the cost of a response:
```python
def print_cost_in_dollars(usage_metadata, model_name):
    input_tokens = usage_metadata.prompt_token_count
    output_tokens = usage_metadata.candidates_token_count
    input_cost_cents_per_token = COST_IN_CENTS[model_name]['input'] / 1_000_000.0
    output_cost_cents_per_token = COST_IN_CENTS[model_name]['output'] / 1_000_000.0
    total_cost_in_cents = (input_tokens * input_cost_cents_per_token) + (output_tokens * output_cost_cents_per_token)
    total_cost_in_dollars = total_cost_in_cents / 100.0
    print(f"Cost: ${total_cost_in_dollars:.9f}")
```
We take the `usage_metadata` object and the `model_name` as input. We then get the input and output token counts from the `usage_metadata` object and calculate the cost per token for input and output for the correct model by dividing the cost in cents per million tokens by 1 million. We then multiply the token counts by the cost per token and add them together to get the total cost in cents, divide by 100 to get the cost in dollars, and print it to the console, using `:.9f` to get 9 decimal places so the number always has a constant length. It's mainly basic math.
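To sanity-check the math by hand, here is the same calculation inlined for hypothetical counts of 200 input and 104 output tokens on the Flash pricing above:

```python
input_cents = 200 * (35 / 1_000_000.0)     # 0.007 cents for the prompt
output_cents = 104 * (105 / 1_000_000.0)   # 0.01092 cents for the response
dollars = (input_cents + output_cents) / 100.0
print(f"${dollars:.9f}")  # $0.000179200
```

As you can see, a single small call costs a tiny fraction of a cent, which is why we print so many decimal places.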
One thing to note is that we end the numbers in our calculations on `.0` to make them floats explicitly. In Python 3 the `/` operator performs float division even on two integers, so this is mostly about making the intent clear; we need the precision of floats here, as the numbers are very small. Also note that the `_` in a number like `1_000_000` is just for readability and is ignored by Python, so it acts as a normal `1000000`.
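A quick demo of both points, runnable on its own:

```python
# Underscores in numeric literals are purely cosmetic separators.
assert 1_000_000 == 1000000

# In Python 3 the / operator returns a float even for two ints;
# // is the truncating integer division we want to avoid here.
print(35 / 1_000_000)    # 3.5e-05
print(35 // 1_000_000)   # 0
```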
If you want to give this a quick test, add the following at the end of the `cost_calculator.py` file:
```python
if __name__ == "__main__":
    class TestUsageMetadata:
        def __init__(self, prompt_token_count, candidates_token_count):
            self.prompt_token_count = prompt_token_count
            self.candidates_token_count = candidates_token_count

    usage_metadata = TestUsageMetadata(1_000_000, 0)
    print_cost_in_dollars(usage_metadata, GEMINI_FLASH)
    print_cost_in_dollars(usage_metadata, GEMINI_PRO)

    usage_metadata = TestUsageMetadata(0, 1_000_000)
    print_cost_in_dollars(usage_metadata, GEMINI_FLASH)
    print_cost_in_dollars(usage_metadata, GEMINI_PRO)
```
We define a quick test class `TestUsageMetadata` to get a `usage_metadata` object with a `prompt_token_count` and a `candidates_token_count`, so we can pass an object similar to the real one into our function to test it. It just takes the prompt and candidate token counts on initialization via the `__init__` method and sets them to `.prompt_token_count` and `.candidates_token_count` respectively.
We then have two quick tests for each model: one with 1 million input tokens and one with 1 million output tokens. Run the file and you should see the following output:

```
Cost: $0.350000000
Cost: $3.500000000
Cost: $1.050000000
Cost: $10.500000000
```
Awesome, we now have the correct pricing output. We will need to import this function into our `simple_chat_images.py` file to use it, and also make some other refactoring and improvement changes to make our chat even better. We'll do that in the next part, so I'll see you there!