Welcome back to the final part of this tutorial series. This part was not actually supposed to exist, but after completing the curriculum for the Gemini tutorial series Google suddenly dropped some new and awesome features, so here we are.
Please bear with me, as this part will be a bit more spontaneous and unscripted than usual, and I’ll use very basic and quick examples to explain the new features. They are just too cool not to cover at all, so think of this as an extra bonus part if you will.
Code Execution
So the first new feature I want to talk about is code execution. This might seem similar to function calling, which we’ve been using in the course, but it is actually very different. When doing function calling we write functions on our end and then Gemini merely gives us the arguments and the function it wants us to call. The function exists on our computer and gets called on our computer.
The new code execution feature allows the code to run on Gemini’s servers. They basically give you a small containerized environment on the Gemini servers that will execute the code on the API side. What’s more, you don’t have to write the function yourself; in this case it is Gemini that writes the code and executes it on its own servers, based on your instructions!
So let’s look at a quick example to see this in action. First, make sure you are in your project root folder and that your pipenv is activated. Then run the following command to update the google-generativeai package to the latest version:
pipenv update google-generativeai
This is needed so we can run the new syntax (including the code_execution option). Now create a new file in your project folder and name it code_execution.py. In this file, write the following code:
from load_env import configure_genai
from utils import safety_settings

genai = configure_genai()

model = genai.GenerativeModel(
    model_name="gemini-1.5-pro",
    safety_settings=safety_settings.low,
    tools='code_execution',
)

response = model.generate_content((
    "Please create a Python script that will print an integer in reverse order. "
    "The script should take an integer as input and return the reversed integer. "
    "For example, if the input is 123, the output should be 321. "
    "Make sure you execute the script to test if it really works. "
))

print(response.text)
We use the easy configure_genai setup again to save us some time loading in the API key and whatnot, and then get the genai object.
When setting up the model, make sure you use the pro version here, as code generation is definitely one of the more advanced features that will test the limits, so having the strongest model here is a must.
We use the safety settings from our utility, and now we add the tools argument like we’ve seen before. Instead of passing in a list with our own functions, we can simply pass in the string code_execution this time to enable the code execution functionality. Note that this will not work if you skipped over the pipenv update step.
Then we call the generate_content method on the model, which is basically just the single-request version of the chat method we’ve been using, but everything you see here can also be applied in the chat method, as sketched below.
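Just to make that concrete, here is a minimal sketch of what the chat variant could look like. The prompts are made up for illustration, and I’m simply reusing the same model setup as above:

chat_model = genai.GenerativeModel(
    model_name="gemini-1.5-pro",
    safety_settings=safety_settings.low,
    tools='code_execution',
)
chat = chat_model.start_chat(history=[])

# First turn: ask Gemini to write and run some code on its side.
reply = chat.send_message(
    "Write and run a Python snippet that sums the numbers 1 through 100."
)
print(reply.text)

# Follow-up turns can build on the earlier code and its results.
reply = chat.send_message("Now do the same for only the even numbers.")
print(reply.text)

Back to our single-call example.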
I’ll ask it to create a Python script to reverse integers, and to make sure to test if the script actually works. Now if we run this we’ll see something really cool. This is all one single output, from one single API call, but I’ll split it up here for readability.
The first code it came up with:
def reverse_integer(n):
    """
    This function takes an integer and returns the reversed integer.

    Args:
        n: An integer.

    Returns:
        The reversed integer.
    """
    reversed_n = 0
    while n > 0:
        digit = n % 10
        reversed_n = reversed_n * 10 + digit
        n //= 10
    return reversed_n

# Get input from the user
n = int(input("Enter an integer: "))

# Reverse the integer
reversed_n = reverse_integer(n)

# Print the reversed integer
print(f"The reversed integer is: {reversed_n}")
Now we can see that it ran this code:
print(reverse_integer(123))
Traceback (most recent call last):
  File "<string>", line 2, in <module>
NameError: name 'reverse_integer' is not defined
And now it realized its mistake:
It seems I forgot to include the function definition in the code sent to the interpreter. Let's fix that!
It comes up with new code:
def reverse_integer(n):
    """
    This function takes an integer and returns the reversed integer.

    Args:
        n: An integer.

    Returns:
        The reversed integer.
    """
    reversed_n = 0
    while n > 0:
        digit = n % 10
        reversed_n = reversed_n * 10 + digit
        n //= 10
    return reversed_n

# Get input from the user
n = int(input("Enter an integer: "))

# Reverse the integer
reversed_n = reverse_integer(n)

# Print the reversed integer
print(f"The reversed integer is: {reversed_n}")
And now it tries to run again:
print(reverse_integer(123))
Traceback (most recent call last):
  File "<string>", line 2, in <module>
NameError: name 'reverse_integer' is not defined
It seems that the Gemini API is writing the function in one code block and then trying to run it in a separate one, which is why it can’t find the function. Keep in mind that the code execution feature was literally released yesterday, so I’m sure this will be fixed soon.
Nevertheless, the model does finally realize what is going wrong here:
I apologize for the continued error. It appears I am still not correctly executing the code within the same context. I need to find a way to persist the function definition between code blocks. Let me try a different approach. I will define the function within the execution block itself:
def reverse_integer(n):
    """
    This function takes an integer and returns the reversed integer.

    Args:
        n: An integer.

    Returns:
        The reversed integer.
    """
    reversed_n = 0
    while n > 0:
        digit = n % 10
        reversed_n = reversed_n * 10 + digit
        n //= 10
    return reversed_n

print(reverse_integer(123))
321
It seems this approach worked! The code successfully reversed the integer 123 and printed the output 321. Let me know if you have any other tasks or code you'd like me to execute! I am still learning to interact with the code execution environment effectively.
So finally it realized it should just test a simplified version with the print statement in the same code block. While this doesn’t look very impressive yet, I’m sure it will improve rapidly. The really cool thing to see is that it reasoned iteratively, learning from its mistakes and fixing them over time.
Gemini actually generated code 3 times and sent it to its on-server code execution environment 3 times to test if it worked. That’s pretty darn cool and kind of like having a mini AI Agent already without any setup work like you would have in LangGraph or Autogen, if you’ve worked with either of those.
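By the way, if you want to look under the hood of that loop yourself, the response is made up of separate parts that hold the generated code and its execution results. Here is a minimal sketch of how you could inspect them; I’m assuming the parts expose executable_code and code_execution_result fields, as the SDK does at the time of writing, so double-check against the latest docs:

# Sketch: inspect the parts of a code-execution response.
# `response` is the object returned by generate_content() above.
for i, part in enumerate(response.candidates[0].content.parts):
    if part.text:
        print(f"[part {i}] text:\n{part.text}")
    if part.executable_code.code:
        # The code Gemini wrote and sent to its sandboxed environment.
        print(f"[part {i}] generated code:\n{part.executable_code.code}")
    if part.code_execution_result.output:
        # The output that came back from running the code server-side.
        print(f"[part {i}] execution output:\n{part.code_execution_result.output}")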
Context Caching
The next newly released feature I really want to highlight before we go is context caching. I believe this is going to be very big in the future. LLMs are becoming more powerful, and new use cases such as analyzing videos are coming up, which are really cool, but also very expensive.
As the number of input tokens increases, the cost of the API calls increases as well. Where images or text make no major difference, a video is multiple orders of magnitude more expensive because the file is so much bigger and therefore takes up so many more tokens.
Now if we want to have a proper conversation with multiple turns to analyze a video and ask several questions about it, for each call we have to send that entire video in the context again, which is super expensive. Remember, you don’t actually have a conversation with an LLM, this is all an illusion.
You just have a history that saves the conversation so far, which gives the illusion that the LLM remembers what you talked about because you send the entire history along with each new call. In reality, each call to the LLM is completely isolated and has no relationship to the previous ones.
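To make that concrete, here is a rough sketch of what a “conversation” boils down to under the hood. The helper function and the dict format are purely illustrative; the SDK’s chat object does this bookkeeping for you:

# Sketch: a chat is just repeated stateless calls that re-send the history.
def ask(model, history, user_message):
    history.append({"role": "user", "parts": [user_message]})
    # The entire history so far (including any large video) goes up with every call.
    response = model.generate_content(history)
    history.append({"role": "model", "parts": [response.text]})
    return response.text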
To solve this problem, Google came up with Context Caching. This is for large sets of input tokens like a video. You pass the video in context once, the normal way, and then cache those input tokens. Now for the next follow-up question you ask about this same video, you don’t pass in the video again, but refer to the cached tokens on the Gemini servers instead.
This is a really cool idea, so let’s look at the prices. Now, there are different prices for the flash and pro models here, and since this feature has literally been public for one whole day at this time, the prices are bound to change. You can check here for the most up-to-date prices.
To give you an idea, I’ll use flash as an example here:
Price (input)
- $0.35 / 1 million tokens (for prompts up to 128K tokens)
- $0.70 / 1 million tokens (for prompts longer than 128K)
Context caching
- $0.0875 / 1 million tokens (for prompts up to 128K tokens)
- $0.175 / 1 million tokens (for prompts longer than 128K)
- $1.00 / 1 million tokens per hour (storage)
Price (output)
- $1.05 / 1 million tokens (for prompts up to 128K tokens)
- $2.10 / 1 million tokens (for prompts longer than 128K)
So we can see here that recalling a video from the cache is 4 times cheaper than sending the video in context again, making for a very large saving. Do notice that there is also a storage cost though, so make sure you delete the cache as soon as the chat is done to save costs. You can also set a time limit in advance for how long the cache is stored.
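To put some rough numbers on that, here is a back-of-the-envelope sketch using the flash prices listed above for prompts longer than 128K tokens. I’m assuming the initial call that creates the cache is billed at the normal input rate; check the pricing page for the exact billing rules:

# Rough cost comparison for a ~700K-token video (flash, >128K prompt prices).
video_tokens = 700_000
input_price = 0.70 / 1_000_000     # $ per input token
cached_price = 0.175 / 1_000_000   # $ per cached token
storage_price = 1.00 / 1_000_000   # $ per token per hour of cache storage

questions = 5
without_cache = questions * video_tokens * input_price
with_cache = (
    video_tokens * input_price                       # first call, creating the cache
    + (questions - 1) * video_tokens * cached_price  # follow-up questions
    + video_tokens * storage_price * (10 / 60)       # 10 minutes of storage
)
print(f"Without caching: ${without_cache:.2f}")  # ~$2.45
print(f"With caching:    ${with_cache:.2f}")     # ~$1.10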
So let’s look at an example of how this works. If you want to follow along you will have to set up billing on your API key to enable the context caching feature, but you can also just follow along by watching the video. You can always enable billing later if you want to. Also make sure you have updated the google-generativeai package to the latest version, which we already did for the previous example.
I’m going to download a very old movie, which is also used by Gemini in their documentation examples. You can download it from here – Sherlock Junior – it is about 316 megabytes in size with a length of ~44 minutes.
Of course, you could use literally any video, but I will use this one as an example for the same reason the Gemini docs do: copyright. It is very old and in the public domain, so I can use it without any issues.
I’m going to just quickly save the Sherlock_Jr_FullMovie.mp4 file in my project folder, and then create a new file called cached_chat.py, starting with the imports:
import datetime
import time

from google.generativeai import caching

from load_env import configure_genai
from utils import safety_settings

genai = configure_genai()
The datetime library will be used to set an expiry time for our cache, and the time library will be used to sleep the program for a bit to make sure the video is properly processed before we start asking questions. We import caching for obvious reasons, plus our configure_genai and safety_settings as usual for quick setup, creating the genai object.
path_to_video_file = 'images/Sherlock_Jr_FullMovie.mp4'

video_file = genai.upload_file(path=path_to_video_file)

while video_file.state.name == 'PROCESSING':
    print('Video is still processing...')
    time.sleep(2)
    video_file = genai.get_file(video_file.name)

print(f'Video processing complete: {video_file.uri}')
As this is just a quick example, I’ll hardcode the video path in the code. We use the file API that we have used before; so far there is nothing new here. We can then check video_file.state.name to see if the video is still processing, and if it is, we wait for 2 seconds and check again. When it’s done we print the URI of the video file.
Now that we have uploaded our file to the file API, it’s time to create a cache for it:
cache = caching.CachedContent.create(
    model='models/gemini-1.5-flash-001',
    display_name='sherlock',
    system_instruction=(
        'You are an expert video analyzer, and your job is to answer '
        'the user\'s query based on the video file you have access to.'
    ),
    contents=[video_file],
    ttl=datetime.timedelta(minutes=10),
)
We create a cache object with the CachedContent.create method. We pass in the model we want to use, so you have to choose right now which one you want. As context caching is currently only available for stable models with fixed versions, we must choose either gemini-1.5-flash-001 or gemini-1.5-pro-001, though this will undoubtedly change in the future.
We have a display name, a system instruction, the object referencing the video file on the file API, and a time to live (TTL), which is set to 10 minutes here. This is the time the cache will be stored on the Gemini servers before it is automatically deleted.
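As a side note, that TTL is not set in stone. Here is a small sketch of how you could manage existing caches; I’m going by the caching module’s list, update, and delete methods as described in the current SDK docs, so verify against the latest version:

# Sketch: managing caches after creation.
for existing in caching.CachedContent.list():
    print(existing.display_name, existing.expire_time)

# Extend our cache's lifetime if the chat is running long...
cache.update(ttl=datetime.timedelta(minutes=30))

# ...or delete it right away once we're done with it.
cache.delete()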
Now we can set up our model and chat in a fairly standard way:
model = genai.GenerativeModel.from_cached_content(
    cached_content=cache,
    safety_settings=safety_settings.low,
)

video_chat = model.start_chat(history=[])
Except that we used the .from_cached_content method and passed in the cache. Let’s set up a chat loop to finish it up:
if __name__ == "__main__":
    try:
        while True:
            text_query = input("\nPlease ask a question: ")
            response = video_chat.send_message(text_query, stream=True)
            for chunk in response:
                if chunk.candidates[0].finish_reason == 3:
                    print(f"\n\033[1;31mPlease ask a more appropriate question!\033[0m", end="")
                    video_chat.rewind()
                    break
                print(f"\033[1;34m{chunk.text}\033[0m", end="")
            print("\n")
            print(f"\033[93m{response.usage_metadata}\033[0m")
    except KeyboardInterrupt:
        print("Shutting down...")
    finally:
        cache.delete()
        print("Cache deleted.")
This is basically the standard chat loop we have developed and been using so far, with streaming enabled and our blocked-response check; the last thing we want is for the chat to crash and force us to start over after waiting for our video to upload and cache.
The major change here is that I added a print statement for response.usage_metadata, so we can check the token counts and see how many of them are coming from the cache.
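If you want to do more with those numbers than just print the raw metadata, here is a small hedged sketch; the field names are taken from the usage_metadata output we will see below:

# Sketch: summarize how much of the prompt was served from the cache.
meta = response.usage_metadata
fresh_tokens = meta.prompt_token_count - meta.cached_content_token_count
print(f"cached: {meta.cached_content_token_count:,} tokens")
print(f"fresh:  {fresh_tokens:,} tokens")
print(f"output: {meta.candidates_token_count:,} tokens")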
The final thing we add is a finally clause to our try block, where we delete the cache after the chat is done. Since there is a cost for the amount of time the tokens are stored in the cache, we want to delete it as soon as we are done here. The finally clause of a try block is always executed, no matter what happens in the try block, even if there is an unexpected error and it crashes out, so this is the perfect place to put this.
I’ll go ahead and run this now, but you can also just follow along on my screen if you don’t want to upload such a large video file. This will take a while to upload and use a very large amount of tokens!
Video is still processing...
Video is still processing...
Video is still processing...
Video is still processing...
Video is still processing...
Video is still processing...
Video is still processing...
Video is still processing...
Video is still processing...
Video is still processing...
Video is still processing...
Video is still processing...
Video is still processing...
Video processing complete: https://generativelanguage.googleapis.com/v1beta/files/30k7z1juf9lr
Ok, so it took quite a while to upload the video, and after that, it took around 26 more seconds to process it judging from the number of print statements. Let’s ask a question now:
Please ask a question: What is the overall plot of this movie?

Buster Keaton stars as a young man working as a projectionist at a small-town theatre. He also wants to be a detective and studies a "How-To-Be A Detective" manual. He is given a task by the theatre owner to sweep up the theatre and also to solve the mystery of a stolen diamond ring. Buster discovers the ring hidden in a box of chocolates, but the owner of the ring (and the confectionary store) accuses him of stealing it. After a long chase through town, the true thief is caught and Buster is cleared of any wrongdoing.

prompt_token_count: 696200
cached_content_token_count: 696190
candidates_token_count: 117
total_token_count: 696317
Cool, we got a pretty good answer, and we can see from the token counts that nearly all of the whopping 696200 input tokens came from the cache, a saving that grows with every further question we ask.
Please ask a question: What are the characters appearing in this movie, and the timestamps when they first appear?

The characters and their first appearance timestamps are:
- **Buster Keaton** (projectionist) 0:26
- **Joe Keaton** (the theatre owner) 1:47
- **Kathryn McGuire** (the girl in the case) 2:11
- **Ward Crane** (the sheik) 3:42

prompt_token_count: 696337
cached_content_token_count: 696190
candidates_token_count: 75
total_token_count: 696412
Again, we can see our cache kicked in and took care of all the video-file tokens here. I’ll test it once more, giving it a bit of a fun challenge based on a scene I found in the movie where a watch and chain are sold to a pawn broker:
Please ask a question: For how much is the watch and chain sold to the pawn brokers?

The watch and chain is sold to the pawn broker for $40.00. You can see the amount written on a piece of paper in the pawn broker's hand.

prompt_token_count: 696676
cached_content_token_count: 696190
candidates_token_count: 37
total_token_count: 696713
Aw, so close! Still pretty impressive though for a handwritten note in a movie from 1924. I’ll go and shut it down now:
Please ask a question:
Shutting down...
Cache deleted.
And we can see that our cache was deleted to minimize costs. The price of chatting with a very large video like this is still very high, but the context caching feature is definitely a big help, and it’s a good first step down the cost curve for this new tech.
That’s it for this tutorial series. I hope you thoroughly enjoyed it and thanks for taking this journey together! As always, it was a pleasure and an honor, and I’ll see you in the next one!