Hugging Face Course (1/6) – Running LLMs locally

πŸ‘‰ Back to the Full Course on local models and Hugging Face (+Videos)

Hi and welcome to this tutorial series on running Large Language Models (and other machine learning models) on your local machine. In this first part, we’ll look at running LLMs locally on our own computer for free. To achieve this we’ll use Ollama, a tool designed for exactly this purpose. As usual, let’s jump right in and explain as we go along.

Installing ollama

Go to the ollama website and download the appropriate version for your operating system.

I am on Windows, so I’ll be using the Windows (Preview) version here. Once you have downloaded the file, click the installer to get started.

Just let it run and it will do its own thing…

When it’s done, you will see an alert box in the corner:

Running an LLM

We now have the Ollama server running in the background. Let’s tell it what LLM we want it to run. We’ll get started with Llama 3.

Llama 3 is Meta (Facebook)’s openly available large language model. It is a general-purpose chat model just like ChatGPT, meaning it is good at general NLP tasks like question answering, sentiment analysis, text classification, and even code generation.

Unlike OpenAI’s GPT-4 and Google’s Gemini, Llama 3 is freely available for almost all uses, with openly downloadable weights.

There are two basic sizes of Llama 3: the smaller one has 8 billion parameters, which is quite practical to run on most computers, and the larger one has 70 billion parameters.

To get started all you need is a terminal window. I’ll use my bash terminal integrated into VSCode here as we’ll be using VSCode for our coding in a moment anyway.

ollama run llama3

When you run this command, Ollama will automatically start pulling the Llama 3 model for you. This will take a while, as the model is around 4.7 GB in size.

Note that we are downloading the smaller version of Llama 3 here. If you want to run the larger version, replace the above command with ollama run llama3:70b instead, but make sure you have a powerful enough computer as the model alone is 40 GB in size.

As all steps are exactly the same I’ll stick with the smaller version for now.

You should see something like the following:

pulling manifest
pulling 00e1317cbf74... 100%
β–•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– 4.7 GB
pulling 4fa551d4f938... 100%
β–•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  12 KB
pulling 8ab4849b038c... 100%
β–•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  254 B
pulling 577073ffcc6c... 100%
β–•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  110 B
pulling ad1518640c43... 100%
β–•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  483 B
verifying sha256 digest
writing manifest
removing any unused layers
success
>>> Send a message (/? for help)

So let’s go and send it a message, shall we?

>>> What is a main sequence star?

Give it a second and it should start streaming its response to the terminal. This may be pretty fast or very slow depending on how powerful your system is. Ollama will use your GPU if available, or run in CPU-only mode if required.

A Main Sequence Star!

In the vast universe, stars come in different forms and stages of life. A Main Sequence Star (MSS) is a specific type of star that's incredibly
important for our existence.

**What is a Main Sequence Star?**

A Main Sequence Star is a star that's fusing hydrogen atoms into helium in its core. This process releases an enormous amount of energy, making the  
star shine brightly.

Here are some key characteristics of Main Sequence Stars:

1. **Stable burning**: MSS stars have a stable, consistent energy output, which means their brightness remains relatively constant over millions of  
     years.
2. **Hydrogen fusion**: The core of the star is where hydrogen atoms (protons) combine to form helium, releasing energy in the form of light and
     heat.
3. **Medium mass**: Main Sequence Stars typically have a mass between 0.1 and 10 times that of our Sun (G-type star). More massive stars are red
     supergiants or blue giants.
4. **Surface temperature**: MSS temperatures range from about 3,000 to 50,000 Kelvin (4,500Β°F to 90,000Β°F), which determines the star's color:
     white, yellow, orange, or red.

**Why Main Sequence Stars matter**

Our Sun is a G-type Main Sequence Star, and it's the perfect distance from us for life as we know it. The energy released by our Sun (and other MSS  
stars) powers:

1. **Photosynthesis**: Plants convert sunlight into chemical energy through photosynthesis.
2. **Earth's climate**: The stable energy output of our Sun regulates Earth's temperature and weather patterns.
3. **Life on Earth**: The warmth and light provided by the Sun support life as we know it.

In summary, Main Sequence Stars are stars that fuse hydrogen into helium in their cores, releasing energy and supporting life across the universe!

Do you have any more questions about Main Sequence Stars or anything else?

Pretty good! You are now running Llama 3 locally and chatting with it inside your terminal. Play around with it as you like and then type the following to exit:

/bye

Running other models

So let’s take a look at running some other models. We’ll start with Phi-3. Run the following command to pull the model to your computer (the run command did this automatically for us before):

ollama pull phi3

Phi-3 is a family of open AI models developed by Microsoft that are small language models (SLMs). Microsoft claims that Phi-3 models are the most capable SLMs available, outperforming models of the same size and the next size up in language, coding, and math capabilities.

Phi-3 models are designed to be more efficient and cost-effective than large language models (LLMs), while still delivering strong capabilities. They are also easier to deploy on resource-constrained devices like smartphones. This comes at the cost of less factual knowledge and weaker performance outside of English.

The download will finish reasonably quickly, as the model is only 2.3 GB in size. Once it’s done, run the following command to check our locally saved models:

ollama list

And you will see we now have two models available:

NAME            ID              SIZE    MODIFIED
llama3:latest   a6990ed6be41    4.7 GB  25 minutes ago
phi3:latest     a2c89ceaed85    2.3 GB  3 seconds ago

Now let’s run phi3:

ollama run phi3

Then ask it a question to check it out (if you need multiple lines, wrap your message in triple quotes: """):

>>> """What
... is a sea turtle?
... """
A sea turtle, also known as a marine turtle, is a reptile of the order Testudines that has adapted to living in aquatic environments......
(truncated)

Easy enough! Type /bye to exit the conversation. If you want to remove this smaller model from your computer, run the following command using rm for remove:

ollama rm phi3

And now if you run ollama list again, you will see that the phi3 model is no longer available:

NAME            ID              SIZE    MODIFIED
llama3:latest   a6990ed6be41    4.7 GB  35 minutes ago

Available models

There are many more models available to run with Ollama, including general-purpose and more specialized models. It can be hard to keep track of all the different options out there, so before we move on to adding prompts, a REST API, and such, let’s explore what is out there in the free open-source LLM world.

  • Mistral (7B) – ollama run mistral. Made by the French company Mistral AI.
  • Neural Chat (7B) – ollama run neural-chat. Fine-tuned model based on Mistral. Released by Intel.
  • Starling (7B) – ollama run starling-lm. An open-source large language model (LLM) developed by researchers at UC Berkeley. It used feedback from AI models like GPT-4 as part of its training.
  • Solar (10.7B) – ollama run solar. Developed by Upstage, a South Korean AI company. They focus on providing purpose-trained versions of Solar for various domains like healthcare, customer support, finance, etc.
  • Llama 2 Uncensored (7B) – ollama run llama2-uncensored. Based on Llama 2 by Meta but retrained into an uncensored version that will not refuse certain questions.

These are fairly similar to Llama 3, so we won’t be installing them here one by one. Feel free to pull any of these you like and play around with them.

In addition, we also have Gemma, which is developed by Google. You can think of it as a ‘lite’ version of Google’s powerful Gemini models that you can download and run locally. There are two sizes:

  • Gemma 2B – ollama run gemma:2b.
  • Gemma 7B – ollama run gemma:7b.

Specific use-case models

To top it all off, we have two more specific models available. These are:

  • Code Llama (7B) – ollama run codellama. Code Llama is a code-specialized version of Meta’s Llama 2 model, further trained on a massive 500-billion-token dataset of code and code-related data.
  • LLaVA – ollama run llava. LLaVA (Large Language and Vision Assistant) is a large multimodal model developed by researchers from the University of Wisconsin–Madison and Microsoft Research. It combines a vision encoder and a language model to achieve multimodal chat capabilities mimicking those of GPT-4.

We’re going to be taking a look at LLaVA, but before we dive in any further, let’s create a base directory to hold our project files. I’ll name my base directory Local_Models and then also create a new folder inside named test_files:

πŸ“Local_Models
   πŸ“test_files

I’ll be opening the Local_Models directory in VSCode and running all the commands from there as we will soon get into more coding anyway.

Now save the next two images to the test_files folder, making sure to give them nondescript names (so the LLM cannot cheat). I’ll name them output1.jpg and output2.jpg:

output1.jpg -> A T-Rex on a bicycle.

output2.jpg -> A single strawberry.

So now you have the following:

πŸ“Local_Models
   πŸ“test_files
      πŸ–ΌοΈoutput1.jpg
      πŸ–ΌοΈoutput2.jpg

Start up a new terminal and make sure you are in the root project folder or use the cd command to get there. In my case, this is:

admin@DirkMasterPC MINGW64 /c/Coding_Vault/Local_Models

Now let’s both pull and run the LLaVA model:

ollama run llava

When it is done downloading the model you can ask your question including the image in the following manner:

>>> Please describe this image ./test_files/output1.jpg

Make sure you do not forget the . in front of the path, and use the correct file name and extension; otherwise, it will hallucinate something random for you.

Trex
This image features a digital illustration of an anthropomorphic Tyrannosaurus Rex riding a bicycle. The dinosaur appears to be a juvenile with smaller size, characterized by a large head with two sharp teeth, and its body is proportioned to human standards. It has two arms, each holding the handlebars of the bike, and one leg bent over the pedal for propulsion.

The T-Rex is wearing a black helmet with a visor and is looking ahead. The bicycle itself is equipped with training wheels on both sides, suggesting it's a children's or beginner's model. The tires are black, and the bike frame seems to be made of metal, possibly steel or aluminum.

The background depicts a forested area with trees and foliage, creating a natural setting. There are also small details such as a bird flying in the upper left corner and what appears to be a small pond or stream visible through the branches. The overall style of the image is cartoonish and whimsical, likely intended for entertainment or educational purposes.

That is really impressive for an open-source and free model. It even realizes the T-Rex is smaller than it would normally be. It does get some small details wrong, such as the helmet and the training wheels on the side, but overall it is very good. To be fair, this is a really difficult and detailed image. Let’s try the simpler strawberry image.

First type /bye to exit the conversation and then run the LLaVA model again:

ollama run llava

You may wonder why we pointlessly restarted LLaVA. I’ve noticed the terminal version of Ollama will sometimes get stuck describing features of the previous image. We’ll leave the terminal and start coding with our local LLMs soon, but for now we just restart LLaVA to get a fresh start and get rid of the previous context.

>>> Please describe this image ./test_files/output2.jpg
The image is a digitally altered photograph that combines two different subjects. On the left side, there's a section of a yellow stool with an abstract design. Overlaid on this is a visual effect where several magnets with the word "Downtown" are arranged...
(truncated)

Somehow it has trouble recognizing a strawberry! So yes, image recognition in LLaVA is really not perfect yet; it is a bit hit-and-miss and seems to struggle with fruit in particular. It is still very impressive that it can recognize the T-Rex on a bike, though.

I’m sure that if an open-source model of this size can already get hit-and-miss results running on a normal local computer, in a couple of years your phone will run models that recognize anything you throw at them in real time.

As this is not quite reliable enough for us to use in our projects yet, we’ll be sticking to the text-based models for now as we move on to using Ollama with a REST API so we can approach it from our code.
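As a side note for when we start coding: Ollama’s /api/generate endpoint can also carry images for multimodal models like LLaVA, passed as a list of base64-encoded strings in an "images" field. Here is a minimal Python sketch of building such a request body; the build_llava_payload helper is an illustrative name of my own, not part of any Ollama library:

```python
import base64
import json


def build_llava_payload(prompt, image_path, model="llava"):
    """Build a JSON request body for Ollama's /api/generate endpoint with
    an image attached as a base64 string, as multimodal models expect."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return json.dumps({"model": model, "prompt": prompt, "images": [image_b64]})


# Usage sketch (with the Ollama server running and the image in place):
# body = build_llava_payload("Please describe this image", "./test_files/output1.jpg")
# ...then POST body to http://localhost:11434/api/generate, just like the
# curl call we make later in this part.
```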

Setting up a virtual environment

We’ll be running this project inside a virtual environment. A virtual environment is a self-contained directory that will allow us to install specific versions of packages inside the virtual environment without affecting the global Python installation.

We will use this mainly as there are many different versions of the packages we will be installing and we don’t want to have conflicts with other projects and installations already on your computer.

The virtual environment will make it easy for you to install my exact versions without worrying about affecting any of your other projects and is a good practice to follow in general.

To create a new virtual environment we’ll use a tool called pipenv. If you don’t have pipenv installed, you can install it using pip, which is Python’s package manager. Run the following command in your terminal:

pip install pipenv

Make sure the terminal is inside your root project folder, e.g. /c/Coding_Vault/Local_Models, and then run the following command to create a new virtual environment:

pipenv shell

This will create a new virtual environment and also a Pipfile in your project directory. Any packages you install using pipenv install will be added to the Pipfile.

To generate a Pipfile.lock, which is used to produce deterministic builds, run:

pipenv lock

This will create a Pipfile.lock in your project directory, which contains the exact version of each dependency to ensure that future installs are able to replicate the same environment.

We don’t need to install a library first to create a Pipfile.lock. From now on when we install a library in this virtual environment with pipenv install library_name, it will be added to the Pipfile and Pipfile.lock automatically, which are basically just text files keeping track of our exact project dependencies.

For reference, I’m using Python 3.10 for this project, but you should be fine with any recent version. Consider upgrading if you’re using an older version.

Now press Ctrl + Shift + P, type Python: Select Interpreter, and select the virtual environment you just created. This makes sure you are using the correct Python interpreter for this project, as sometimes this does not happen automatically. You can find the correct one by looking at the name inside the parentheses, as it should contain the name of your project folder.
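To double-check that you really are on the virtual environment’s interpreter, a quick sanity check is to print where Python is running from:

```python
import sys

# If the virtual environment is active, this path should point inside
# your pipenv virtualenv, not at your global Python installation.
print(sys.executable)
print("Python %d.%d" % sys.version_info[:2])
```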

Setting up a REST API for our LLMs

In order to start interacting with our LLMs programmatically we need to have an API to interact with the models running on our local computer. We can’t just keep chatting one message at a time in the terminal.

First, let’s start a model. I’ll be using llama3 for this example:

ollama run llama3

When it is up and running go ahead and open a second terminal window by clicking the + button:

You now have a second terminal window and can use the menu on the right to switch between them. In this new terminal window let’s do a simple API test call using the curl command. This is a command line tool that will allow us to make HTTP requests.

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt":"What is the hypothalamus?"
}'

This will use the command-line tool curl to make a POST request to the Ollama API running on your computer. In the unlikely case that your system doesn’t have curl installed, don’t worry about it; just read along, or use a program like Postman to make the request. (We won’t be using curl after this test.)

The -d flag is used to send HTTP POST data along. The data is a JSON object with two keys: model and prompt. The model key specifies the model we want to use, and the prompt key specifies the prompt, as simple as that.

You may wonder where the port number 11434 and this /api/generate path come from. Ollama hosts an API on this port and path by default whenever the server is running.

Go ahead and send the request through your terminal and it will start streaming the response to you like this:

$ curl http://localhost:11434/api/generate -d '{
>   "model": "llama3",
>   "prompt":"What is the hypothalamus?"
> }'
{"model":"llama3","created_at":"...","response":"The","done":false}
{"model":"llama3","created_at":"...","response":" hypoth","done":false}
{"model":"llama3","created_at":"...","response":"alam","done":false}
{"model":"llama3","created_at":"...","response":"us","done":false}
etc, etc...
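Each of those lines is a standalone JSON object carrying one small "response" fragment. As a sketch of how we could consume this stream from Python using only the standard library (the assemble_stream helper is my own illustrative name, and the example assumes the default port 11434):

```python
import json


def assemble_stream(lines):
    """Join the 'response' fragments from Ollama's streaming NDJSON output
    into one string, stopping at the chunk marked done."""
    text = ""
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        text += chunk.get("response", "")
        if chunk.get("done"):
            break
    return text


# With the server running, the request itself could be made like this:
#
# import urllib.request
# payload = json.dumps({"model": "llama3", "prompt": "What is the hypothalamus?"})
# req = urllib.request.Request(
#     "http://localhost:11434/api/generate", data=payload.encode("utf-8")
# )
# with urllib.request.urlopen(req) as resp:
#     print(assemble_stream(resp))  # the body yields one JSON line at a time
```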

You can press Ctrl + C to stop the streaming and get back to your terminal when you’re satisfied that it works. We now know how to use Ollama and run LLMs locally on our own system and have an API to make requests to our own local models.

Go back to the first terminal window and type /bye to exit the conversation and shut down the model. You can also close the terminal window if you like.

In the next part, we’ll learn how to make this into a proper chatbot with memory so it has an awareness of the context of the conversation and the ability to remember previous messages. We’ll also create a simple web interface for our chatbot using Gradio, so we can interact with it in a more convenient way.

I’ll see you soon in the next part!

