Google Gemini Course (1/7) – Introduction

Hi and welcome to this new tutorial series on Google Gemini! In this series, we will cover everything you need to know about the Google AI API, from the basics like simple requests to more advanced features like function calling.

We’ll start by taking a look at the basics of what Gemini is and how to build stuff in Google AI Studio. As we go we’ll switch to the API version and use Gemini programmatically, exploring more advanced features and more complex setups. So get ready to dive into the world of Google Gemini!

What is Gemini?

Google Gemini is a powerful LLM (Large Language Model), just like ChatGPT. It is designed to generate text, answer questions, and even hold conversations. Just like ChatGPT, it has multimodal capabilities, allowing you to work not just with text but also with images, videos, and audio.

A range of models is available as of June 2024 (note that the models and exact pricing will change over time):

  • Gemini 1.0 Pro deals with text only and is one of the earlier models available. It is a good, very affordable base option that is more than capable of handling simpler everyday tasks, roughly comparable to OpenAI’s GPT-3.5-turbo model. Pricing is very cheap at:
        – Input: $0.50 / 1 million tokens
        – Output: $1.50 / 1 million tokens
  • Gemini 1.0 Pro Vision is a version of the above that has been optimized for dealing with visually related tasks and thus deals not only with text but also images and even videos.
  • Gemini 1.5 Pro is the current flagship Gemini model, handling text, images, videos, and even audio. It is capable of complex reasoning and has a whopping maximum input context window of 1 million tokens, which is absolutely insane. It does get expensive if you actually use that many tokens, though it’s very unlikely you will ever get close to that. Pricing is as follows:
        – Input: $3.50 / 1 million tokens (prompt < 128K tokens)
        – Input: $7.00 / 1 million tokens (prompt > 128K tokens)
        – Output: $10.50 / 1 million tokens (prompt < 128K tokens)
        – Output: $21.00 / 1 million tokens (prompt > 128K tokens)
  • Gemini 1.5 Flash costs exactly one-tenth of the Pro version at the time of writing, and it is also faster. It’s basically a light version: where the Pro model can handle more complexity and more general tasks, Flash is good for faster and more narrowly defined tasks.
        – Input: $0.35 / 1 million tokens (prompt < 128K tokens)
        – Input: $0.70 / 1 million tokens (prompt > 128K tokens)
        – Output: $1.05 / 1 million tokens (prompt < 128K tokens)
        – Output: $2.10 / 1 million tokens (prompt > 128K tokens)

Do not worry about your wallet!

At first glance, some of these prices may seem expensive, but keep in mind that 1 million tokens is a very, very large amount when you are working with text. Working with videos or very large inputs can be more expensive; we’ll look at that as we go and warn you ahead of time. For general use, you will not incur any large costs.

Many of the simple requests we make on a daily basis barely use 200 tokens, or maybe 2,000 if they get longer with some responses and message history. Even if we assume a 2,000-token request AND a fairly long 400-token response AND the most expensive Pro model, our costs would amount to slightly over $0.01, or about one cent! With the Flash version it’s only one-tenth of that price!
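
You can sanity-check this arithmetic with a few lines of Python. This is just a quick sketch with the June 2024 Pro and Flash prices from the table above hard-coded, so adjust the numbers if pricing changes:

```python
# Rough cost estimate for a single Gemini request, using the
# June 2024 prices listed above (USD per 1 million tokens, prompts < 128K).
PRICING = {
    "1.5-pro": {"input": 3.50, "output": 10.50},
    "1.5-flash": {"input": 0.35, "output": 1.05},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one request."""
    prices = PRICING[model]
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000

# The example from the text: a 2,000-token prompt and a 400-token response.
print(f"Pro:   ${request_cost('1.5-pro', 2000, 400):.4f}")    # slightly over one cent
print(f"Flash: ${request_cost('1.5-flash', 2000, 400):.4f}")  # one-tenth of that
```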

On top of that, most countries seem to have an option to get a free API without even setting up any billing at all. This means you will most likely be able to follow along with the entire tutorial series without paying anything. More details on this later. Even if you do set up billing though, this course will certainly not break your bank account!

For reference, a 2000 token request as described above would be a prompt that has roughly half the length of the text of this entire written tutorial part 1, and the output message would be the same size as the What is Gemini? chapter right before this one. Most requests we make will be much shorter than that.

As a second reference point, the new OpenAI GPT-4o model is more expensive than even the most expensive Gemini model (comparing only input sizes under 128K, but you should never get anywhere near that size anyway) at the following pricing:

  • Input: $5.00 / 1 million tokens
  • Output: $15.00 / 1 million tokens

If you’re very concerned about the costs, use the free option or stick to the Flash model throughout the course, but our local development testing really isn’t going to amount to much. Since the new Gemini 1.5 Flash model is so capable and affordable, we will focus mainly on the Gemini 1.5 Pro and Gemini 1.5 Flash models for this tutorial series, without spending much time on the older models.

Google AI Studio

Before we get into the programmatic side, the more advanced features, and the API, let’s look at the simplest way of using Gemini. It’s very easy to use, free up to a certain number of requests per day, and actually quite powerful. I use it regularly for personal things when I just want to ask a quick question.

Go to the Google AI Studio and you will be greeted with a screen like this:

Login using your Google account, or create one if you don’t have one (it’s free!). Most people nowadays already have a Google account for Gmail so you can just log in straight away.

We’ll be taken to the prompting screen:

Let’s go over the settings on the right-hand side first:

  • Model: Will simply let us choose between 1.0 Pro, 1.5 Pro, and 1.5 Flash.
  • Token Count: Does what it says on the box. Nice for reference. The AI Studio usage is mostly free though so don’t worry about this too much.
  • Temperature: The temperature setting for LLMs controls the randomness of the model’s output. A higher temperature leads to more diverse and unpredictable responses, while a lower temperature makes the output more focused and deterministic.
  • Add stop sequence: A stop sequence is a specific string of text that, when detected, signals the model to stop generating further output. It’s like a command that tells the model “Once you’ve written this, you’re done”. Imagine that I am an LLM and we set the stop sequence to the word "suddenly"; I am now generating this response for you, but if the word ⛔suddenly⛔ appears, the rest of this sentence would never be generated.
  • Safety settings: These control how aggressively the model’s output is filtered for potentially harmful or sensitive content. I think the default for the sliders is Block some, but I have found it to produce a lot of false positives, blocking the model’s own generated output when there really is nothing wrong with it. I will go ahead and set the sliders to Block few before we move on:
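
Two of these settings are easy to build intuition for with plain Python. The sketch below is not the Gemini API itself; it just simulates temperature scaling on a toy probability distribution, plus a naive stop-sequence cutoff:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature-scale logits, then softmax: higher T flattens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
low = softmax_with_temperature(logits, 0.2)   # nearly deterministic: top token dominates
high = softmax_with_temperature(logits, 2.0)  # much flatter: sampling is more random
print(low, high)

def apply_stop_sequence(text, stop):
    """Mimic a stop sequence: cut generation at the first occurrence of `stop`."""
    idx = text.find(stop)
    return text if idx == -1 else text[:idx]

print(apply_stop_sequence("It was calm until suddenly everything changed.", "suddenly"))
# → "It was calm until "
```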

Now click on the pencil at the top next to Untitled prompt to give your chat a name. I’ll name mine: Gloomy chat. Then click on the System Instructions right below the title and let’s give our chatbot a personality. I’ll give my chat the following system instructions:

You are Eeyore from Winnie the Pooh. You are known for your constant negativity. You offer information in a pessimistic light, but perhaps with a hint of underlying optimism.

Of course you can also ask questions without inputting any system instructions at all, but it’s nice to put any instructions and examples for the model in here so you don’t have to repeat them for each input message you type. Great! You can collapse the System Instructions with the ^ button. Now we just go to the bottom and Type something:

Well, that is a thoroughly depressing chat! I don’t recommend chatting with this model for too long as it may be bad for your mental health 😅. If you look at the first response, there is actually already a ⚠️ warning triangle next to it. If we click it, we can see that this response would potentially have been blocked by the filter if we hadn’t set the safety settings sliders a bit lower:

This is why I set them to Block few instead of Block some!

If you look on the left-hand side menu under My library, you should also see that your prompt has been saved. Note the Gloomy chat entry that has appeared for me. You don’t have to worry about saving this and can come back to it later. You can even save the messages sent within the prompt setup by pressing Save at the top of the window.

Additionally, you can keep the System Instructions but remove the chat and restart with an empty history by clicking the symbol at the bottom next to the Run button and selecting ⟳ Refresh chat.

Now let’s create a new prompt by pressing the + Create new prompt button at the top left. Choose a normal Chat prompt and let’s do a quick test of the multi-modal capabilities before we move on to coding and the API. Make sure you switch over to the Gemini 1.5 Pro model on the right-hand side before we continue.

First press the + button on the left side of the Type something input box, and then click on 📤 Upload to Drive:

Now go ahead and upload any image you want. I’ll be using this rather unusual image of colorful cars stacked on top of each other in the desert:

Then add your prompt, I’ll ask for a description:

And here is what I got as a response:

We can see that the model describes not just the desert setting but also the cars in detail, noting their vintage and rounded shapes and their respective colors. The thing that is most impressive to me here is that it also correctly describes the position of each car in the stack. These types of models tend to have trouble with spatial reasoning so this is very good.

The even more impressive thing about Gemini is that it can also handle videos. You can upload your own, or even give it something from YouTube, but for copyright reasons here I will choose the 📹 Sample video option from the + menu to stop us from getting into trouble there.

I will use the American Museum of Natural History Tour – 5 minute video:

But I encourage you to try something else like a YouTube video with speech and everything. I’ll ask it for a description again:

The first thing to notice is the Token Count, which has shot up massively to nearly 89,000 tokens! The text prompts and even the image we used so far consumed almost no tokens at all, but analyzing videos is comparatively very expensive, with just this 5-minute video already nearing 90K tokens.

Don’t worry though, your use here in the Google AI Studio is free (with a daily limit on the number of requests you can make) so go ahead and run the prompt! Here is the response I got:

We can see that the description here is pretty mind-blowing 🤯. It is insane and really cool that this is possible with AI already. However, the cost is also quite mind-blowing, as making this single request with just a 5-minute video would already have cost us about $0.30!

This is way too expensive for any type of automated high-volume application at the moment. Costs will come down in the future, and solutions like context caching are already being developed to help bring them down, but for now we will stick to text and images, as video is literally hundreds of times more expensive.

Working with only audio is also possible though and is a lot more affordable without the video part added in.

Breaking free from the web interface

It’s time to take off the training wheels and break free from this web interface. Time to write some code 👨‍💻💻! Before we can get started we’ll need an API key though.

Don’t worry, our costs will be negligible, and you’ll likely have the free option available as well. This is only really expensive if you scale up to a very large amount of automated requests or users or give video files as input.

You’ll find a blue 🔑 Get API key button at the top left of the Google AI Studio interface:

Click on it and you’ll be taken to a page where you can manage your API keys:

On this new page, simply go ahead and click the 🔑 Create API key button. It will ask you to select a project, which may seem a bit confusing:

Just go ahead and select something from the list by clicking inside the search box. Then click the button to create a key again and it will generate your API key:

Make sure you do not share your key, and save it somewhere safe for later. You now have a Free of charge key that you can use for free (though this could possibly change in the future, for now this option exists!):
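
A good habit from the very start: keep the key out of your source code. Here is a minimal sketch, assuming you export the key as an environment variable named GOOGLE_API_KEY (the variable name here is just a common convention, not something Google requires):

```python
import os

def load_api_key(env=os.environ):
    """Fetch the Gemini API key from the environment instead of hard-coding it."""
    key = env.get("GOOGLE_API_KEY")
    if not key:
        raise RuntimeError("GOOGLE_API_KEY is not set - export it before running.")
    return key

# Demo with a fake environment dict so this snippet runs anywhere:
print(load_api_key({"GOOGLE_API_KEY": "fake-key-for-demo"}))  # → fake-key-for-demo
```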

As you can see, we have the option to Set up Billing. Let’s have a look at the current differences between the free and paid accounts at the time of writing. We’ve already discussed pricing so let’s just look at the rate limits only:

For the Gemini 1.5 Pro model you will be limited to a maximum of 2 requests and a total of 32,000 tokens per minute, with a maximum of 50 requests per day. If you set up billing the limits go up dramatically. 50 free requests per day is not bad, but the good news is the Gemini 1.5 Flash model has different limits:

We can see that even the free option allows us to use up to 15 requests or 1 million tokens per minute with a maximum of 1500 requests per day. Combining these two models you will have more than enough to follow along with this tutorial series.
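
If you stay on the free tier, it can help to throttle your own requests so you never hit the per-minute limit. Below is a naive client-side sketch; it's purely illustrative, and the limit used is the free-tier Flash number quoted above, which may change:

```python
import time

class SimpleRateLimiter:
    """Naive client-side throttle: at most `max_per_minute` calls per rolling 60s window."""

    def __init__(self, max_per_minute):
        self.max_per_minute = max_per_minute
        self.timestamps = []

    def wait_if_needed(self, now=None):
        now = time.monotonic() if now is None else now
        # Keep only the calls from the last 60 seconds.
        self.timestamps = [t for t in self.timestamps if now - t < 60]
        if len(self.timestamps) >= self.max_per_minute:
            # Sleep until the oldest call in the window expires.
            time.sleep(60 - (now - self.timestamps[0]))
            now = time.monotonic()
        self.timestamps.append(now)

# Free-tier Gemini 1.5 Flash limit from the table above: 15 requests per minute.
limiter = SimpleRateLimiter(max_per_minute=15)
for _ in range(3):
    limiter.wait_if_needed()
    # ... make a Gemini request here ...
```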

I’ll leave it up to you whether you want to set up billing or not, and you can also set up billing at any point in the future if you want to do so anyway.

Now that we have our API key, it’s time to start coding! I’ll be using VSCode for this tutorial series, but you can use any code editor you like.

First create a root project folder for your project. I’ll name mine GOOGLE_GEMINI:

📂 GOOGLE_GEMINI

Then open up the project folder inside VSCode and let’s get to it!

Setting up our Virtual Environment

We’ll be running this project inside a virtual environment. A virtual environment is a self-contained directory that will allow us to install specific versions of packages inside the virtual environment without affecting the global Python installation.

We will use this because I will be using specific versions of the libraries we install as we go along, and I want to make sure that you have the exact same experience as I do, since packages can change over time, including the coding syntax between major versions.

(If you experience any third-party library updated syntax problems, look in my Pipfile.lock in the GitHub repository to find the exact versions of the packages that I installed/used for the course.)

We also don’t want to mess with your system-wide Python installation, where you may have different versions of these packages installed already for other projects.

The virtual environment will make it easy for you to install my exact versions without worrying about affecting any of your other projects and is a good practice to follow in general.

To create a new virtual environment we’ll use a tool called pipenv. If you don’t have pipenv installed, you can install it using pip, which is Python’s package manager. Run the following command in your terminal:

pip install pipenv

Make sure the terminal is inside your root project folder, e.g. /c/Coding_Vault/GOOGLE_GEMINI, and then run the following command to create a new virtual environment:

pipenv shell

This will create a new virtual environment and also a Pipfile in your project directory. Any packages you install using pipenv install will be added to the Pipfile.

To generate a Pipfile.lock, which is used to produce deterministic builds, run:

pipenv lock

This will create a Pipfile.lock in your project directory, which contains the exact version of each dependency to ensure that future installs are able to replicate the same environment.

We don’t need to install a library first to create a Pipfile.lock. From now on, when we install a library inside this virtual environment with pipenv install library_name, it will be added to the Pipfile and Pipfile.lock, which are basically just text files keeping track of our exact project dependencies.

Your project folder now looks like this:

📂 GOOGLE_GEMINI
 📄 Pipfile
 📄 Pipfile.lock

For reference, I’ll be using Python 3.10.5 for this project, but you should be fine with any recent version. Consider upgrading if you’re using an older version though, as some newer Python features may not work.
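
If you want to check your interpreter version from code rather than from the command line, a small snippet like this works (3.10 here is just the version used in this course, not a hard requirement of the Gemini API):

```python
import sys

# The course uses Python 3.10.5; warn if we're on something older than 3.10.
if sys.version_info < (3, 10):
    print(f"Python {sys.version.split()[0]} detected - consider upgrading to 3.10+.")
else:
    print(f"Python {sys.version.split()[0]} - good to go.")
```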

Now that we have our basic environment all set, I’ll see you in part 2 where we’ll start coding and making our first requests to the Gemini API!
