Welcome back to part 5 of this tutorial series. In this part, we’ll take a look at how to run LLMs locally on your own computer for free, and also at the Hugging Face community, which has loads of models available for you to use in all sorts of fields. As always, we’ll have resources available to go way more in-depth on these topics.
Running LLMs locally
There are many free LLMs out there that you can run locally on your computer. Some of the more well-known ones are the Llama models. We can very easily run these models on our own machine using a tool called Ollama. Despite what the name suggests, this tool can also run many other models besides Llama. As usual, let’s just jump right in and we’ll explain as we go along.
Installing ollama
Go to the ollama website and download the appropriate version for your operating system.
I am on Windows, so I’ll be using the Windows (Preview) version here. Once you have downloaded the file, click the installer to get started.
Just let it run and it will do its own thing…
When it’s done, you will see an alert box in the corner:
Running an LLM
We now have the Ollama server running in the background. Let’s tell it what LLM we want it to run. We’ll get started with Llama 3.
Llama 3 is Meta (Facebook)’s open-source large language model. It is a general-purpose chat model just like ChatGPT, meaning it is good at general NLP tasks like question answering, sentiment analysis, text classification, and even code generation. It’s also free and open source!
There are two basic sizes of Llama 3: the smaller one has 8 billion parameters, which is quite practical to run on most computers, and the larger one has 70 billion parameters.
To get started, all you need is a terminal window. I’ll use my bash terminal integrated into VSCode here, as I still have it open from the previous tutorial and we’ll get back to coding soon anyway. Run the following command in a terminal window:
```bash
ollama run llama3
```
When you run this command, Ollama will automatically start pulling the `llama3` model for you. This will take a while, as the model is around 4.7 GB in size.
Note that we are downloading the smaller version of Llama 3 here. If you want to run the larger version, replace the above command with `ollama run llama3:70b` instead, but make sure you have a powerful enough computer, as that model alone is 40 GB in size. As all the steps are exactly the same, I’ll stick with the smaller version for demonstration purposes.
```
pulling manifest
pulling 00e1317cbf74... 100% ██████████████████ 4.7 GB
pulling 4fa551d4f938... 100% ██████████████████  12 KB
pulling 8ab4849b038c... 100% ██████████████████ 254 B
pulling 577073ffcc6c... 100% ██████████████████ 110 B
pulling ad1518640c43... 100% ██████████████████ 483 B
verifying sha256 digest
writing manifest
removing any unused layers
success
>>> Send a message (/? for help)
```
So let’s go and send it a message, shall we?
```
>>> Where does the sad frog meme come from?
```
Give it a second and it should start streaming its response to the terminal. This may be pretty fast or very slow depending on how powerful your system is. Ollama will use your GPU if available or run in CPU-only mode if required.
And there’s our answer:
```
The Sad Frog, also known as "Pepe" (yes, it's a guy!), is a beloved internet meme
that originated on 4chan's /b/ board in the early 2010s. Here's a brief history:

* In 2008, artist Matt Furie created Pepe, a cartoon character for his comic book
  series "Boys LoveGirls" (yes, it's a play on words!). Initially, Pepe was just a
  normal frog with a big smile and googly eyes.
* Around 2012-2013, /b/ board users began to manipulate and modify Pepe's facial
  expressions to create the sad version. This was done by altering his eyebrows,
  mouth, or adding tears, making him appear dejected and melancholic.
* The Sad Frog quickly gained popularity as a meme, symbolizing disappointment,
  sadness, or just general unhappiness.

... cut for brevity ...
```
It was longer and very informative, but the purpose here is not to learn about our sad green froggy friend, so I cut it short! For a reasonably small model running on our own computer, this is pretty impressive! This stuff is not coming from the internet; it’s coming from the model running on our own computer.
When you’re done chatting with your new local LLM friend, you can type the following in the terminal to end the session:
```
/bye
```
There are many other models we can run with Ollama, all with differing sizes and capabilities. It’s also possible to communicate with the Ollama server running on your machine through its REST API. This means we can interact with it programmatically and also use our LangChain knowledge to chain things together and build cool automated stuff on top of it.
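To give you a rough idea of what that looks like, here is a minimal sketch of calling the local Ollama API from Python, assuming you have the `requests` library installed and the `llama3` model already pulled:

```python
import requests

# Ollama's local server listens on port 11434 by default.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Explain in one sentence what a REST API is.",
        "stream": False,  # ask for the full answer in a single JSON response
    },
)

# The generated text is returned under the "response" key.
print(response.json()["response"])
```

This is just a sketch to show the idea; we won’t build on it further in this part.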
There will be a link at the end of this lesson with a course that has all of that and more in it for you if you’re interested in learning more.
Hugging Face
We’ll be stepping away from Ollama and switching to the Hugging Face machine-learning web community for now. This opens up a whole world of other, non-LLM AI models!
Hugging Face is a leading platform and community for open-source natural language processing (NLP) and machine learning models. It has a vast library of pre-trained models for tasks like text generation, translation, summarization, question answering, etc. However, it also has many other types of models like image generation, audio generation, and even models for tasks like object detection.
Researchers and developers can share, discover, and deploy machine learning models and the website provides model cards with details on performance, training data, intended use cases, and licenses.
Hugging Face has a vibrant open-source community contributing code, Hugging Face Spaces, Datasets, and more. As it has a large model library and active community, Hugging Face has become a central hub for the open-source AI Machine Learning community.
So without further ado, let’s head over to the Hugging Face website and see what we can do with it:
Before we head over to the Models section, I just want to point out that Hugging Face also has an excellent section for Datasets:
You can find all sorts of useful datasets in many categories to use for training, fine-tuning, or testing models. Finally, if we move over one more tab to the right we have Hugging Face Spaces, where you can see all sorts of cool live demos for AI models.
Running a model
At any point, you may need to sign up for a free account to access some of the features on the website. I recommend just creating a free account before continuing. When you’re done, click on the Models tab, and let’s get started!
We have so many models available, from Computer Vision to LLMs to audio-related models! First click on the Text-to-Image option on the left side. Sort the results by Most downloads and you should see a model named `stabilityai/sdxl-turbo` somewhere near the top.
SDXL Turbo is made by the Stability AI team and is based on Stable Diffusion XL. It is a pretty fast model that we can deploy locally to generate images at will! Click on the model to open the model card with more information:
Here you can see all the details for this model. It has a whopping 2 million+ downloads in the last month alone! If you scroll down you will see instructions on how to install this model and usually some example code.
Jupyter Notebooks
For this part of the tutorial, we’ll be using Jupyter Notebooks, as it’s more practical to code interactively and keep our Python kernel running while we add more code. It will also display our generated images very nicely, which is the main benefit here, as we cannot display image output in the console.
Install the Jupyter Notebooks extension for VS Code if you don’t have it already, by going to the extensions tab and searching for Jupyter Notebooks. It’s the extension with an insanely high number of downloads (Jupyter by Microsoft).
I won’t be going into a detailed explanation of Jupyter Notebooks here, but if you’re not quite familiar with them, don’t worry. Just watch the video version of this tutorial so you can follow along step by step, and you’ll get a feel for how it works.
First, go ahead and create a new file named `image_gen.ipynb` in the root folder of your project:
```
📁 Intro_to_AI_engineering
    📁 output
    📄 .env
    📄 chat_gpt_request.py
    📄 generate_image.py
    📄 generate_speech.py
    📄 gradio_project.py
    📄 image_gen.ipynb      ✨ New file
    📄 langchain_basics.py
    📄 test_audio.mp3
    📄 text_to_summarize.txt
    📄 transcribe_audio.py
```
Installing PyTorch
Now we’ll need to install PyTorch. PyTorch is an open-source machine learning library based on the Torch library, used for building and training neural networks. PyTorch supports hardware acceleration like CUDA GPUs and has a large community around it. For many models, you will need to have PyTorch installed to be able to run them.
Here is where it gets a bit tricky though, as some of you will have a CUDA GPU and some won’t. Head over to the PyTorch website and make sure you’re on the Start Locally tab:
Now scroll down until you get to the following choice menu section:
Choose the Stable build version, then pick the platform you’re on. I’ll be going for Windows and will use the Pip package manager inside VSCode as we have done so far. Obviously, our language will be Python. Next up is your choice of either CUDA 11.8, CUDA 12.1, or CPU.
If you don’t have a CUDA GPU, CPU is fine. Everything we will do will still run fine, it will just be slower, so go ahead and choose CPU. You will get the command to run in your terminal. Depending on your system configuration there is a chance you’ll have to change `pip3` in the command to `pip` instead.
So if you’re on CPU only, run the command provided on the website. After running this command, skip ahead a little bit to where we get started with the Jupyter Notebook as the next section is for those with a CUDA GPU.
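For reference, at the time of writing the CPU-only command looks roughly like the one below, but always copy the exact command the website generates for you (and swap `pip3` for `pip` if your system needs it):

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
```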
If you do have a CUDA GPU, we have two options left. Personally, I would just go for the CUDA 12.1 version. (Note: there has been an update since and version 12.4 is now also available; both will serve you fine.) There are a couple of potentially confusing points here, so let’s go over them:
- It may seem like you need to install CUDA yourself, but the CUDA version of the command below will actually install the correct runtime dependencies for you.
- If you have a graphics card but you’re not quite sure whether it’s CUDA compatible, you can check here. If it’s on the list you should be good to go.
So go ahead and run the command in your terminal to install everything we need. It will look something like this:
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```
But just use the updated command the website provides for you based on your choices. Keep in mind again that you may need `pip3` or `pip` (like me here) depending on your system.
Testing PyTorch
When that is done running, go ahead and finally open the `image_gen.ipynb` Jupyter notebook file we created earlier. Create a first code cell and put the following code in it:
```python
import torch

torch.cuda.is_available()
```
Again, if you’re not familiar with Jupyter Notebooks, it may be helpful to follow along with the video version of the tutorial here. This code will check if we have a CUDA GPU available. If you do, it will return `True`, otherwise it will return `False`.
Go ahead and run this cell by pressing the play button to the left of it and you may be asked to select the Python kernel you want to use. Just select the standard Python version you have installed.
You may get a popup message that you need to install `ipykernel`:
`ipykernel` is basically the Jupyter version of the Python kernel, which is what allows the session to stay alive between code cells. If you don’t get this message, it is on your system already.
If you do get it, just click install and it will take care of it for you, and then run the cell.
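If the automatic install fails for whatever reason, you can also install it manually from the terminal, using the same `pip` (or `pip3`) as before:

```bash
pip install ipykernel
```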
Now when the first cell runs you should hopefully see the following:
Yay!! We did it! Torch and CUDA are all set. If you chose to run on CPU above without the CUDA installation, this will obviously return `False` for you, which is absolutely fine; all the code we will use will still run for you as well. As long as you get either a `True` or a `False` back, PyTorch is running successfully.
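If you want one more quick sanity check, here is a small sketch (my own addition, not required for the tutorial) that creates a tiny tensor and moves it to whichever device you have available:

```python
import torch

# Pick the GPU if CUDA is available, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Create a small random tensor and move it onto that device.
x = torch.rand(3, 3).to(device)
print(x.device)  # prints cuda:0 or cpu depending on your system
```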
Finishing the model setup
Now that we have PyTorch 🔥 installed, we have a couple more installs to make our chosen SDXL-Turbo model work. Do not despair though, these are really simple one-liner installs! Run the following in the terminal:
```bash
pip install diffusers transformers accelerate
```
The `diffusers` library deals with diffusion models for generating images, audio, and more. It also has pre-built diffusion pipelines, which are basically like pre-built chains that we can use to generate images with just a few lines of code. The other two are requirements to get everything running and we’ll not go into detail on them here.
Running SDXL-Turbo
With that out of the way, let’s get back to our Jupyter notebook. In `image_gen.ipynb`, replace the code in the first code cell like this:
```python
# Generic code that will work on any system. See alternative version below.
from diffusers.pipelines.auto_pipeline import AutoPipelineForText2Image
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

pipe = AutoPipelineForText2Image.from_pretrained("stabilityai/sdxl-turbo")
pipe.to(device)
```
We import `torch` and we use the `AutoPipelineForText2Image` class from the `diffusers` library, which is a pre-built pipeline for running these types of image generation models with minimal code.
Now we need to set the device that the `torch` library will use to perform its calculations. We check if a CUDA GPU is available and if it is we set the device to `cuda:0`, otherwise we set it to `cpu`. This same line will work for you no matter if you chose the CUDA or CPU version of PyTorch earlier.
Finally, we create a new pipeline by using the `from_pretrained` method of the `AutoPipelineForText2Image` class. We pass in the name of the model we want to use, which is `stabilityai/sdxl-turbo`, the model we chose earlier on the Hugging Face website. Now that we have this pipeline, we need to move it onto the same device we set for the `torch` library.
A quick note: I made the code above a bit more generic so it will run on any system. If you are running a good graphics card, you can make the edit shown below, but make sure you do not use this version if you are running on CPU, as it will not work:
```python
## Do not use this code block with CPU. Use the code block above instead!
from diffusers.pipelines.auto_pipeline import AutoPipelineForText2Image
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
)
pipe.to(device)
```
This version loads the model weights in 16-bit floating point (fp16), which is not possible for CPU inference, hence the two different versions.
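If you’d rather keep a single cell that works on both setups, one option (my own variation, not from the model card) is to only pass the fp16 arguments when a CUDA GPU is actually available:

```python
from diffusers.pipelines.auto_pipeline import AutoPipelineForText2Image
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Only request half-precision weights when we're actually on a CUDA GPU.
extra_kwargs = (
    {"torch_dtype": torch.float16, "variant": "fp16"} if device.type == "cuda" else {}
)

pipe = AutoPipelineForText2Image.from_pretrained("stabilityai/sdxl-turbo", **extra_kwargs)
pipe.to(device)
```

Whichever version you pick, the rest of the tutorial stays exactly the same.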
Go ahead and run this new cell. It may take some time to load all this stuff and download the model the first time around, so we can just leave it running.
Note: If you get import errors or other issues when running this cell, check out the video version of this tutorial where I go into some of the issues I encountered while recording this and how to fix them. Sometimes interdependent libraries are updated and even the code you wrote before no longer works as expected after updating.
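A quick first step when debugging that kind of issue is to print the installed versions in a scratch cell, so you know exactly what you’re running when searching for or reporting the error:

```python
# Print the versions of the main libraries so you can compare against working setups.
import torch
import diffusers
import transformers

print("torch:", torch.__version__)
print("diffusers:", diffusers.__version__)
print("transformers:", transformers.__version__)
```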
Now go to the next code cell and let’s code up a quick reusable function, using the code example from the Hugging Face model card for the `stabilityai/sdxl-turbo` model as a reference.
In the new cell, write the following code:
```python
def generate_image(prompt: str):
    image = pipe(prompt=prompt, num_inference_steps=1, guidance_scale=0.0).images[0]
    return image
```
The code just calls the `pipe` that we set up. The model card from Hugging Face also explains why we set the `guidance_scale` to `0.0` (it is not used for this model), and why setting `num_inference_steps` to `1` is enough. (Of course, we can play around with these values later!)
It shows us that the image output will be located in the output object’s `.images` attribute at index `[0]`, so we just follow its lead. Make sure you run this cell so the function gets loaded into memory.
Generating images
```python
generate_image("Pepe the frog sad from meme feelsbadman with sunglasses")
```
Now give it some time to run. It may be pretty quick or take a minute or two based on how powerful your system is. When it’s done Jupyter Notebooks will display the function output image automatically without needing a print statement:
That’s a pretty good meme sketch; it looks like someone actually sketched this out on their graphics tablet. This is my very first try for that particular prompt. (You can always retry if you don’t like the result, or just generate a few at a time and pick the best one, as in the sketch below.)
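If you want to try that "generate a few and pick the best" approach, a quick sketch like this works in a notebook cell (`display` comes from IPython, which ships with Jupyter):

```python
from IPython.display import display

# Generate a handful of candidates for the same prompt and show them all,
# then simply keep whichever one you like best.
prompt = "Pepe the frog sad from meme feelsbadman with sunglasses"
for _ in range(4):
    display(generate_image(prompt))
```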
```python
generate_image("A dragon flying in the sky, breathing fire, a large moon shines with a purple hue in the background")
```
Now we can see a slight artifact on the wing here and one of the paws looks a bit off, but this is still pretty impressive considering it was done by our humble local computer running a free open-source model!
Of course, it’s not quite as good as the most state-of-the-art paid models running on massive servers, but it’s still pretty good for free!
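One last handy detail: the pipeline returns standard PIL images, so if you want to keep a result, saving it to disk is a one-liner (the filename here is just an example):

```python
# Save a result to a file; generate_image returns a PIL image.
image = generate_image("A dragon flying in the sky, breathing fire")
image.save("dragon.png")  # example filename, save it wherever you like
```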
Many other models on Hugging Face have different installation requirements and pitfalls, so you’ll tend to need a couple more tricks up your sleeve if you want to run a wider variety of models locally.
Luckily, if this is something you’re interested in exploring further, and you’d like to see what other models are out there, we have a whole course dealing with this topic!
You can check out the Hugging Face: Running Free and Open-Source LLMs Locally to Generate Text, Images, Speech, and Music on Your Machine course on the Finxter Academy. In this course, we will check out many other models available to run for free using Ollama and build a fully-fledged chatbot with an interface on our local system using these free models. We’ll also take a much deeper look into Hugging Face and its many available models, generating not just images, but also speech, music, and even 3D models, all using free and open-source models on our local machine!
That’s it for part 5 of this broad introduction to AI tutorial series! I’ll see you back in the sixth and final part of the series, where we will take a look at Google Gemini, another powerful LLM, and its capabilities. See you there!