Intro to AI Engineering with Python (6/6) – Google Gemini

Hi and welcome to the last part of this tutorial series, where we will be taking a look at Google Gemini. Google Gemini is a powerful LLM (Large Language Model) just like ChatGPT.

It is designed to help you generate text, answer questions, and even hold conversations. Just like ChatGPT, it has multi-modal capabilities, allowing you to work not just with text but also with images, video, and audio.

There are several models available, but the two main current ones are Gemini 1.5 Pro, the flagship model, and Gemini 1.5 Flash, its smaller, faster, and roughly 10 times cheaper sibling.

Notice the similarity to gpt-4o and gpt-4o-mini: it’s basically the same concept, and the prices are also roughly comparable.

So for this introduction, let’s look at an example of how we can use the Gemini API to ask a multi-modal question that includes an image. Before we can get started, though, we’ll need an API key.

Don’t worry, our costs will be negligible, and you’ll likely have a free trial option available as well. This only gets really expensive if you scale up to a very large number of automated requests or users, or start feeding it video files as input.

Go to the Google AI Studio and you will be greeted with a screen something like this:

Log in using your Google account, or create one if you don’t have one (it’s free!). Most people nowadays already have a Google account for Gmail, so you can probably log in straight away.

We’ll be taken to the next screen:

You can actually play around with Gemini in the web interface here, but as we don’t have much time in this single lesson, we’ll just focus on the API. See the full course later for more details.

You’ll find a πŸ”‘ Get API key button at the top left of the Google AI Studio interface:

Click on it and you’ll be taken to a page where you can manage your API keys:

On this new page, simply go ahead and click the πŸ”‘ Create API key button. It will ask you to select a project, which may seem a bit confusing:

Just go ahead and select something from the list by clicking inside the search box. Then click the button to create a key again and it will generate your API key:

Make sure you do not share your key, and save it somewhere safe for later. You will now likely have a Free of charge key, though these options may depend on your locale or change over time.

As you can see, we have the option to Set up Billing. Your free key is heavily rate-limited in the number of requests you can make per minute and per day, but it’s more than enough to follow along with this tutorial. You can always set up billing later to get a more powerful key.

The first thing we need to do is open back up our .env file and add our new key:

OPENAI_API_KEY=sk-YoUrApIkEy1234567890
GEMINI_API_KEY=AIYoUrApIkEy1234567890

Next, let’s install the Google Generative AI library:

pip install google-generativeai

You can liken this to the OpenAI library we used earlier: it will make it easier to interact with the Gemini API from our Python code.

Making a simple request

Now let’s create our first simple request to Gemini as a warm-up. Inside the root of your project directory, create a new file called simple_gemini_request.py:

πŸ“ Intro_to_AI_engineering
    πŸ“ output
    πŸ“„ .env
    πŸ“„ chat_gpt_request.py
    πŸ“„ generate_image.py
    πŸ“„ generate_speech.py
    πŸ“„ gradio_project.py
    πŸ“„ image_gen.ipynb
    πŸ“„ langchain_basics.py
    πŸ“„ simple_gemini_request.py     ✨ New file
    πŸ“„ test_audio.mp3
    πŸ“„ text_to_summarize.txt
    πŸ“„ transcribe_audio.py

Now let’s get started on our code. First, we’ll need to import the necessary libraries:

import os

import google.generativeai as genai
from dotenv import load_dotenv

We import os and load_dotenv from the dotenv library; these are for reading the API key, just like with OpenAI. We also import the google.generativeai library as genai to make the name a bit easier to work with. Again, this genai object will be our main interface for interacting with the Gemini API.

Now we’ll load the .env file and give the API key to the genai library. You already know how this works from our earlier exploits with OpenAI:

load_dotenv()
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

The only difference is that genai uses a .configure() method to receive the api_key parameter.
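
Before moving on, you can do a quick optional sanity check that your key is working by listing the models it has access to. This is just an aside and not part of our script:

# Optional: verify the API key works by listing available models.
for model_info in genai.list_models():
    # Only show models that support regular text generation.
    if "generateContent" in model_info.supported_generation_methods:
        print(model_info.name)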

Now I’ll create an object that contains our desired safety settings. I’ll explain this in more detail in the full course, but in short, Gemini has a pretty strict system for blocking potentially unwanted responses. As it can be a bit too strict at times, we’ll relax the settings somewhat:

SAFETY_SETTINGS = [
    {
        "category": "HARM_CATEGORY_HARASSMENT",
        "threshold": "BLOCK_ONLY_HIGH",
    },
    {
        "category": "HARM_CATEGORY_HATE_SPEECH",
        "threshold": "BLOCK_ONLY_HIGH",
    },
    {
        "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
        "threshold": "BLOCK_ONLY_HIGH",
    },
    {
        "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
        "threshold": "BLOCK_ONLY_HIGH",
    },
]

As you can see, there are four categories, and we set them all to block only the high-risk responses, which is a lot more lenient than the default settings.
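
For reference, BLOCK_ONLY_HIGH is just one of the available threshold values. Roughly from most lenient to most strict, the options look like this:

# Threshold values, roughly from most lenient to most strict:
#   "BLOCK_NONE"             - never block for this category
#   "BLOCK_ONLY_HIGH"        - block only high-risk content (our choice here)
#   "BLOCK_MEDIUM_AND_ABOVE" - block medium- and high-risk content
#   "BLOCK_LOW_AND_ABOVE"    - block all but negligible-risk content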

Now we need to actually define our model:

model = genai.GenerativeModel(
    model_name="gemini-1.5-pro",
    safety_settings=SAFETY_SETTINGS,
    system_instruction="You are a meme-o-matic. Whatever I say, you will respond using only internet-memes and famous movie lines. (E.G. one does not simply...) Feel free to also use Emojis when you need to express something visually.",
)

Here we create a new GenerativeModel object from the genai library. We pass in the model name, which is "gemini-1.5-pro" in this case. For the Flash version, you can pass in the string "gemini-1.5-flash". Both will do fine for this simple example, so it doesn’t really matter which one you choose.

We also pass in the safety settings we defined earlier. The system_instruction parameter is basically the same idea as the first system message we sent to ChatGPT in our list of messages, except it’s in a separate parameter here. I’m asking it to give us back responses only in the form of memes.
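
As a side note, GenerativeModel also accepts a generation_config parameter if you want to tune things like the temperature or the maximum response length. We’ll stick with the defaults in this tutorial, but a version with a config would look something like this sketch (the values shown are just illustrative):

model = genai.GenerativeModel(
    model_name="gemini-1.5-pro",
    safety_settings=SAFETY_SETTINGS,
    system_instruction="You are a meme-o-matic...",  # same instruction as above
    generation_config={
        "temperature": 0.7,         # higher = more creative, lower = more predictable
        "max_output_tokens": 1024,  # upper limit on the length of the response
    },
)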

Now that we have defined the model and exact settings and system setup we want to use, we can initialize a new chat session:

history = []

chat_session = model.start_chat(history=history)

The code is very easy to read and sort of self-explanatory. We’ll just start with an empty list for the history here, and then we instantiate a new chat session with the start_chat method. This method will return a new ChatSession object that we can use to interact with the model.
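
By the way, the history doesn’t have to start out empty. If you want to pre-seed the conversation, each entry is a dictionary with a role (either "user" or "model") and a list of parts, along these lines (the example messages are made up, of course):

# Pre-seeded conversation history: alternating user and model turns.
history = [
    {"role": "user", "parts": ["My name is Watson."]},
    {"role": "model", "parts": ["One does not simply... forget a name like Watson. πŸ§™"]},
]

chat_session = model.start_chat(history=history)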

Now we can ask a question to the model:

response = chat_session.send_message("Who baked the first croissant in the world, and where?")

print(response.text)

Here we send a message to the chat session with the send_message method. The response will be stored in the response variable, and we can access the text of the response with the text attribute, printing it out to the console.
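
The chat session also keeps track of the conversation for us. After a message has been sent, chat_session.history contains both our message and the model’s reply, which you could inspect like this:

# Each entry in the history has a role ("user" or "model") and a list of parts.
for message in chat_session.history:
    print(f"{message.role}: {message.parts[0].text}")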

So go ahead and run this Python file and you should see something like the following:

Drakeposting:

❌ Nobody knows who baked the first croissant.    

βœ…  It was me, I baked the first croissant. 😏 πŸ₯ 

So I guess it’s kind of describing our meme for us, which would look like this if we actually created it with a meme-generator:

The answer is not very useful from a learning viewpoint, but we asked it to come up with memes, so that’s on us! Our call works and we can move on to the next level. For this simple introduction, let’s see how we can ask a question involving an image.

Ok, I’ll stop with the memes now, I promise!

So in order to add an image to our API call, the first thing we’ll actually need is an image! (I know, shocking right?) You can use any image you like, but I’ll use this one which we generated earlier for demonstration purposes:

Whichever image you choose to use, make sure you save it to your root project folder inside a new folder named images:

πŸ“ Intro_to_AI_engineering
    πŸ“ images
        πŸ–ΌοΈ bikinizilla.png
    πŸ“ output
    πŸ“„ .env
    πŸ“„ chat_gpt_request.py
    πŸ“„ generate_image.py
    πŸ“„ generate_speech.py
    πŸ“„ gradio_project.py
    πŸ“„ image_gen.ipynb
    πŸ“„ langchain_basics.py
    πŸ“„ simple_gemini_request.py
    πŸ“„ test_audio.mp3
    πŸ“„ text_to_summarize.txt
    πŸ“„ transcribe_audio.py

Now, in order for us to use this image with Google Gemini, we need to send it over somehow; we cannot just paste it into our text prompt. Not to worry, Google has a solution for this: the File API. Basically, this API lets us upload an image to Google’s servers, and we can then reference the uploaded image in our prompts since it already lives on those servers.

Using the File API

So let’s take a look at how this works in isolation, and then we can put it all together afterwards. Create a new file named gemini_image_upload.py in your root project folder:

πŸ“ Intro_to_AI_engineering
    πŸ“ images
        πŸ–ΌοΈ bikinizilla.png
    πŸ“ output
    πŸ“„ .env
    πŸ“„ chat_gpt_request.py
    πŸ“„ generate_image.py
    πŸ“„ generate_speech.py
    πŸ“„ gemini_image_upload.py     ✨ New file
    πŸ“„ gradio_project.py
    πŸ“„ image_gen.ipynb
    πŸ“„ langchain_basics.py
    πŸ“„ simple_gemini_request.py
    πŸ“„ test_audio.mp3
    πŸ“„ text_to_summarize.txt
    πŸ“„ transcribe_audio.py

We’ll use this file as a test run to see how it works. To start, we’ll need a genai object again, configured with our API key. We can do this the same way we did in the other file:

import os

import google.generativeai as genai
from dotenv import load_dotenv

load_dotenv()
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

Now let’s define a path to our image and then upload it using the File API which we can access conveniently through the genai object:

image_path = "images/bikinizilla.png"

image_upload = genai.upload_file(path=image_path, display_name="Bikini Zilla")

print(f"Uploaded file '{image_upload.display_name}' as: {image_upload.uri}")

This is all quite readable, right? We just define the relative path to our image in a string. Then we call the upload_file method on our genai object and pass in the path and a display_name for the image. The method returns a File object, which we store as image_upload, and we print its display_name and uri to check that the upload succeeded.
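
The File object contains a bit more metadata than just the display_name and uri, by the way. If you’re curious, you can print some of its other attributes:

print(image_upload.name)       # auto-generated ID, e.g. "files/kxzh8zfr6ddk"
print(image_upload.mime_type)  # detected MIME type, e.g. "image/png"
print(image_upload.state)      # processing state of the uploaded file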

Run this file and you should see something like the following:

Uploaded file 'Bikini Zilla' as: https://generativelanguage.googleapis.com/v1beta/files/kxzh8zfr6ddk

This means that we have received a valid File object as a response and our call was successful. We can double-check by trying to retrieve the file from the API once again. Add the following code below:

file = genai.get_file(name=image_upload.name)
print(f"Retrieved file '{file.display_name}' from: {file.uri}")

Now go ahead and run the script one more time and you’ll see two lines this time:

Uploaded file 'Bikini Zilla' as: https://generativelanguage.googleapis.com/v1beta/files/8qh21ekz68x6
Retrieved file 'Bikini Zilla' from: https://generativelanguage.googleapis.com/v1beta/files/8qh21ekz68x6

We successfully uploaded and retrieved the image! Note that we got a different URL this time. This is because the second run of the script uploaded the same image again as a separate file.

You can call genai.delete_file(image_upload.name), filling in the correct name from the File object, to delete a file, but Google deletes uploaded files for you after 48 hours, so you frankly don’t have to bother with this.
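
If you do want to tidy up manually, you could list everything currently stored under your API key and delete the files one by one, along these lines:

# Optional cleanup: list all uploaded files and delete them.
for file in genai.list_files():
    print(f"Deleting {file.display_name} ({file.name})")
    genai.delete_file(file.name)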

So now that we know how this works, let’s go back to our simple_gemini_request.py file and add image functionality to our simple request. We can leave the imports, setup, and SAFETY_SETTINGS as they are, but I am going to change the system_instruction to something else for this one:

model = genai.GenerativeModel(
    model_name="gemini-1.5-pro",
    safety_settings=SAFETY_SETTINGS,
    system_instruction="You are Sherlock Holmes. You will help me solve my questions and mysteries about anything and everything, using your superior analytical skills and reasoning power, all while staying in character as Sherlock Holmes.",
)

The history and chat_session stay the same as they are:

history = []

chat_session = model.start_chat(history=history)

Now we’ll apply the if __name__ == "__main__": guard to our code just like we did in previous parts, and make this into an interactive script. First we’ll ask the user for an image path and then upload it to the Google servers:

if __name__ == "__main__":
    image_path = input("Please provide the path to the image: ")
    image_upload = genai.upload_file(path=image_path, display_name="User Image")

    # Rest of the code here

We ask using the input() function we’ve used before and then upload the image. Now we still need to get the user’s question, and then combine the two into the full query, sending it to the model:

if __name__ == "__main__":
    image_path = input("Please provide the path to the image: ")
    image_upload = genai.upload_file(path=image_path, display_name="User Image")

    text_query = input("Please ask a question to go with your image upload: ")

    full_query = [image_upload, text_query]
    response = chat_session.send_message(full_query)

    print(response.text)

We ask the user for their question about this image, and then we can combine the image_upload (the reference to the image on Google’s servers) with the text query by simply putting them into a list.

We then send this list to the model and print out the response just like we would have for any other query. So let’s test it out. I’ll ask Sherlock Holmes a question about the image of Bikini Zilla:

Please provide the path to the image: images/bikinizilla.png
Please ask a question to go with your image upload: What is going on here? What is this creature?

Ah, most curious! This is clearly no ordinary lizard basking in the sun. The creature's size, the reptilian features, the distinct dorsal plates... all point to one conclusion:  we are dealing with a Godzilla, though one with a rather unexpected sense of holiday spirit.

The question is, what brings this king of the monsters to, I presume, the banks of the Nile? A simple vacation seems unlikely for a creature of such destructive potential. Could this be a clever ruse, a way to lower the guard of unsuspecting humans? Or perhaps there is a more nuanced explanation... a mystery only a keen observer could unravel.

Provide me with more details, my dear Watson. Where was this image procured? Were there any other clues, no matter how insignificant they may seem?

There we go. We have successfully sent our image along and the model has clearly analyzed the contents of the image and given us a response, also recognizing the pyramids in the background and all. This is a very simple example, but it shows the power of multi-modal AI models like OpenAI’s ChatGPT and Google Gemini. As to which one is best? Well… they are both quite amazing!

The question we asked wasn’t highly valuable and was more for fun, but you can see how this could be used in a more serious context. Having this type of automated, programmatic access to a model, versus working manually through a web interface, greatly increases your power and flexibility, as you can now integrate it into more complex and automated systems.
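
As a rough sketch of that idea, here is how you might loop over a whole folder of images and ask the model the same question about each one, reusing the model object from our script. The folder name and question here are just placeholders:

from pathlib import Path

QUESTION = "Describe what is happening in this image in one sentence."

# Upload each image in the folder and ask the same question about it.
for image_path in Path("images").glob("*.png"):
    image_upload = genai.upload_file(path=str(image_path), display_name=image_path.name)
    response = model.generate_content([image_upload, QUESTION])
    print(f"{image_path.name}: {response.text}")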

If you enjoyed this chapter on Gemini and would love to learn more, check out The Ultimate Gemini API Course for Upcoming AI Engineers, where we cover the basics and underlying details in much more depth and also go into more advanced topics like function calling, code execution, and context caching. Function calling in particular is covered in detail, including calling functions in parallel, as it makes AI so much more powerful by letting you expand its capabilities with any custom tool and functionality you can think of.

Finally, before we go, I would like to introduce one more very cool course, which is also part of the Finxter Academy. This one was too big for a one-lesson introduction, because in it you will build your very own full-stack AI application website and deploy it live to the internet. It is a really cool course, and at the end you will have your very own website running on the internet, so I highly recommend checking out the ChatGPT Meets NextJS – Complete Guide to Creating a Full-Stack AI App course:

That’s it for this broad introduction to AI tutorial series. I hope you found it interesting and learned some cool stuff, and I’d be delighted to see you back in any of the other more in-depth courses on a topic that piqued your interest. As always, it was my honor and pleasure to take this journey together, and until the next time, happy coding!
