Hugging Face Course (2/6) – Creating a Proper Chatbot

👉 Back to the Full Course on local models and Hugging Face (+Videos)

Hi and welcome back to part 2! In this part, we’ll be taking our bare-bones API and getting started on creating a proper chatbot with a convenient interface.

We will be using LangChain to define and run our chatbot, so let’s install it first. In your terminal window, first make sure your virtual environment is activated. If you’re not sure, you can use the following trick:

echo $VIRTUAL_ENV

This will simply print (echo) the VIRTUAL_ENV environment variable to the console. If you see a path like C:/Users/admin/.virtualenvs/Local_Models-FCDcoHgn, it means you are in a virtual environment located at that path. If you have any trouble or don’t see any output, simply re-activate your virtual environment by running:

pipenv shell
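
💡 Quick side note: the echo $VIRTUAL_ENV syntax is for bash-style shells (like Git Bash). If your terminal is PowerShell, the equivalent check would be:

echo $env:VIRTUAL_ENV

(and echo %VIRTUAL_ENV% in the classic Command Prompt.) Either way, an empty result means no virtual environment is currently active.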

Adding LangChain

With that out of the way let’s install LangChain inside our virtual environment:

pipenv install langchain

This installs LangChain which is the base library. Now let’s add one more:

pipenv install langchain-community

This installs the LangChain Community package which we need as well.

💡 We’ll be covering the basics of LangChain as needed in this tutorial. If you want to really do a deep dive into the more advanced uses of this library and learn to create complex chains and setups, check out my dedicated LangChain tutorial which also covers LangSmith and LangGraph.

Then create a new Python file in your root project folder and name it local_chat.py:

📁Local_Models
    📁test_files
    📄local_chat.py
    📄Pipfile
    📄Pipfile.lock

Now open the local_chat.py file and let’s start with a very basic request:

from langchain_community.llms import Ollama

llama3 = Ollama(model="llama3")
response = llama3.invoke(
    "If you understand this message, please reply with the sentence: 'The meaning of life is bucket'."
)
print(response)

We import the Ollama class from the langchain_community.llms module and create an instance of it with the model parameter set to "llama3". Again, feel free to use the llama3:70b model if your computer can handle it.

Then we call the invoke method on our llama3 object with the question “If you understand this message, please reply with the sentence: ‘The meaning of life is bucket’.” and print the response to the console. See how easy and readable this is? This is the power of LangChain! It handles the API request logic for us.

Make sure you have Llama 3 running in Ollama. If it isn’t running yet, use the following command in your terminal:

ollama run llama3

Now run the script in your terminal and you should get a response something like this:

I see what's going on here!
You're testing to see if I can recognize and respond to a nonsensical or absurd message.

And... (drumroll) ...YES! I understand the message, and my response is:

"The meaning of life is bucket."

Well played, friend!
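
By the way, if you’re curious what LangChain is actually abstracting away here: under the hood the Ollama class talks to the local Ollama HTTP API (the same endpoint we’ll hit manually later on). A rough sketch of the equivalent raw request, assuming the default local address and that you have the requests library installed, would look something like this:

import requests

# Roughly what the Ollama class does for us behind the scenes (illustrative sketch).
payload = {
    "model": "llama3",
    "prompt": "If you understand this message, please reply with the sentence: 'The meaning of life is bucket'.",
    "stream": False,  # ask for one complete JSON response instead of a stream
}
result = requests.post("http://localhost:11434/api/generate", json=payload)
print(result.json()["response"])

With LangChain we don’t have to deal with any of these details, which becomes even more valuable once we start chaining components together.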

Streaming the response

That worked fine, but we had to wait for the entire response to generate before we could see anything. Let’s fix this by calling the stream method instead of invoke:

from langchain_community.llms import Ollama

llama3 = Ollama(model="llama3")
response = llama3.stream(
    "If you understand this message, please reply with the sentence: 'The meaning of life is bucket'."
)
for chunk in response:
    print(chunk, end="")

We simply switch out the invoke method for the stream method. The response now becomes a generator object which we can iterate over with a for loop. We print each chunk of the response to the console with the end="" parameter to prevent newlines from being added in between each chunk.

Now run the script again and you should see the response being streamed to the console as it is generated. I won’t post the response here again as it’s basically the same as before.

Adding a system message

That’s all well and good, but we may actually want to add a system setup message to give our model some specific instructions or a particular personality, right?

Let’s change our code. Remove most of the previous code we had and add two imports from langchain_core so that all you are left with is this:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.llms import Ollama

LLAMA3 = Ollama(model="llama3")
# Everything below has been removed...

We added the StrOutputParser and ChatPromptTemplate classes from the langchain_core module; you’ll see how we use them in a moment. We also removed most of the other code except for the Ollama instance creation, but changed the name from llama3 to LLAMA3 to indicate that it’s a constant.

Now let’s define a system message as a constant string at the top of our script. Just use whatever you like, I’ll be using a simple and silly example for demonstration purposes here:

SYSTEM_MESSAGE = "You are a helpful assistant. Your name is Luigi and you will address the user as Mario at all times."

Good! Now it’s time to create a ChatPromptTemplate object. This will act as the list of messages that gets input into our LLM and will simply contain the system message and whatever the user query is. Add the following code below the SYSTEM_MESSAGE constant:

prompt_template = ChatPromptTemplate.from_messages(
    [
        ("system", SYSTEM_MESSAGE),
        ("human", "{query}"),
    ]
)

We create a ChatPromptTemplate object with the from_messages method. This method will let us pass in a list [] of tuples () where each tuple represents a message. We first pass in the SYSTEM_MESSAGE constant as a tuple with the label "system" and then we pass in a placeholder string "{query}" as a tuple with the label "human".

This placeholder {query} will be replaced with the user’s query, hence this being a prompt ‘template’, used to create the real prompt for our LLM at runtime.
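
If you want to see what the template actually produces, you can format it yourself with a test query. This is just for illustration and not part of our final script:

# Just for illustration: fill in the {query} placeholder manually.
messages = prompt_template.format_messages(query="What is my name?")
print(messages)
# Should print a list with a SystemMessage (our SYSTEM_MESSAGE) followed by a
# HumanMessage containing the query.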

Creating a LangChain chain

Now we can create a chain using LangChain. A chain simply allows us to define a sequence of steps that we want to run in order using the | pipe operator. Think of it as a pipeline where the output of one step is the input of the next. Add the following code below the prompt_template object:

llama3_chain = prompt_template | LLAMA3 | StrOutputParser()

We now have a llama3_chain object which starts with creating our prompt from the template, then passes it into Llama3, and finally parses the output as a string.

The StrOutputParser doesn’t do anything particularly special here, but other parsers are available as well, such as the JsonOutputParser. Notice that we created a new instance of the StrOutputParser class here using the () parentheses instead of passing the class itself.

Let’s add a quick function below to actually chat with our chain that will simply print the responses as we did before:

def chat(query: str) -> None:
    for chunk in llama3_chain.stream(query):
        print(chunk, end="")

This function takes a query string as input and then streams the response from our llama3_chain object to the console. Note that we can call stream on the chain just like we called stream on our llama3 object before, this uniform and consistent API is one of the great things about LangChain.
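
Because the chain is a runnable itself, invoke works on it as well if you ever want the whole answer in one go instead of a stream. A quick optional example:

# Optional: get the full response at once instead of streaming it.
full_response = llama3_chain.invoke({"query": "What is my name?"})
print(full_response)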

Now let’s add a quick script to the bottom of our file to test our chat function and code so far:

if __name__ == "__main__":
    while True:
        query = input("You: ")
        chat(query)
        print("\n") # Newline for readability

This block will only run if we execute the file directly, thanks to the if __name__ == "__main__": check, and not if we import it as a module from elsewhere. It will then loop forever (as True always evaluates to true), asking for user input and calling our chat function with the input as the query.

Go ahead and run the file and ask a question. I’m going to ask it what my name is to test the system message:

You: What is my name?
It's-a me, Luigi! Ah, yes! Your name is Mario! That's-a what I'm here for - to help out my good buddy Mario with any questions or concerns he may have! After all, it's-a a tough
life being a plumber and a hero, but someone's gotta do it!

The next level: Adding memory

So far so good, our chain is working and our system message is clearly coming across. Try asking it a second question though which is dependent on the first one:

You: What is my name?
It's-a me, Luigi! Ah, yes! Your name is Mario! That's-a what I'm here for - to help out my good buddy Mario with any questions or concerns he may have! After all, it's-a a tough life being a plumber and a hero, but someone's gotta do it!

You: What was my previous question?
Hello there, Mario! I'm Luigi, your trusty assistant. I don't see any previous questions from you in our conversation history, but that's okay! We can start fresh now. What's on your mind, Mario? Do you have a question or topic you'd like to discuss?

It doesn’t remember the previous question! We need to give our chatbot some memory. Press Ctrl+C to stop the script and let’s add a memory system to our chat bot. This one will add quite a lot of new stuff so let’s start with a fresh file. Create a new file named local_chat_memory.py in the root project folder:

📁Local_Models
    📁test_files
    📄local_chat.py
    📄local_chat_memory.py
    📄Pipfile
    📄Pipfile.lock

Now open the local_chat_memory.py file and add the following imports:

import uuid

from langchain.memory import ChatMessageHistory
from langchain_community.llms import Ollama
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory

We import the uuid module for generating unique identifiers. The ChatMessageHistory class is an implementation of a message history stored in the system memory, which we will use for our chatbot. BaseChatMessageHistory is the base class for all message history implementations in LangChain, we will use this as a type hint later on.

MessagesPlaceholder is a placeholder for messages in a ChatPromptTemplate object, so we can reserve a space for the messages history in our template, and RunnableWithMessageHistory will allow us to combine our llama3_chain and our message history. If any of this seems confusing, don’t worry, we’ll go over them as we use them.

Let’s declare a global store for our message history and define some constants:

store = {}
LLAMA3 = Ollama(model="llama3")
SYSTEM_MESSAGE = "You are a helpful assistant. Your name is Luigi and you will address the user as Mario at all times."
SESSION_ID = uuid.uuid4()

The global message history store will be a simple dictionary where we can store the messages. The LLAMA3 and SYSTEM_MESSAGE constants are the same as before. We also generate a unique SESSION_ID using the uuid.uuid4() function.

We will just use a single session ID for each time the script is run to keep things simple for now. We’re not going to implement multi-user functionality here but we’ll use the session ID nonetheless as a small starting point.

Now we need to define our prompt template just like last time, but this time we will add a slot for the message history so we can have memory:

prompt_template = ChatPromptTemplate.from_messages(
    [
        ("system", SYSTEM_MESSAGE),
        MessagesPlaceholder(variable_name="history"),
        ("human", "{query}"),
    ]
)

We add a MessagesPlaceholder object to the list of messages in the ChatPromptTemplate object. This will reserve a space for the message history in the template. The variable_name parameter is important as we need to reference it later.

Note that there is no such thing as an LLM with memory in reality. As you can see we are merely passing in the history of previous messages with the next request to make it read the entire conversation, but each request is totally isolated from the others.
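
To make that concrete, here is a purely illustrative sketch (not part of our script) of what the template would render once some history exists:

from langchain_core.messages import AIMessage, HumanMessage

# Illustration only: roughly what gets sent along on a follow-up request.
messages = prompt_template.format_messages(
    history=[
        HumanMessage(content="What is my name?"),
        AIMessage(content="It's-a me, Luigi! Your name is Mario!"),
    ],
    query="What was my previous question?",
)
print(messages)  # system message + the two history messages + the new question

The LLM simply reads this whole list every time, which is what creates the illusion of memory.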

Now we can define our chain like before:

llama3_chain = prompt_template | LLAMA3

Note that this time I left out the StrOutputParser at the end. As this particular model in this configuration sends us a string output anyway we don’t really need to chain it on there (but you can if you want).

Now we’ll need a function that checks if there is a message history for this conversation. If there is, it must load it, and if there isn’t, it should create a new one:

def get_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

This function takes a session_id string as input and will return some form of BaseChatMessageHistory at the end of the function. BaseChatMessageHistory is an abstract base class for storing chat message history in LangChain. The ChatMessageHistory class is a concrete implementation of this abstract class, so we’re really just stating that we will return some form of message history that complies with the rules set in the BaseChatMessageHistory class.

If the session_id is not in the store dictionary yet, we create a new and empty ChatMessageHistory object and store it in the store dictionary with the session_id as the key. We then return the ChatMessageHistory object, which may have just been newly created or might have existed already with content.

Now that we have a function that will always return a message history object for a given session ID, we need to link our llama3_chain and the history functions together. We can use LangChain’s RunnableWithMessageHistory class for this:

llama3_w_memory = RunnableWithMessageHistory(
    llama3_chain,
    get_history,
    input_messages_key="query",
    history_messages_key="history",
)

We instantiate a new instance of this class passing in the llama3_chain object, the get_history function, and two string parameters input_messages_key and history_messages_key. The input messages key is the name under which we will pass the user’s question into the LLM when we write the function in a moment, so make sure to use the same name.

The history messages key is the name we used above when defining our ChatPromptTemplate object, so make sure you used the same string "history" in both places. (You can also extract the "history" string to a constant if you like, as it is technically repeated in two different places).
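
If you do want to extract it, that could look something like this, reusing the exact code we already have (entirely optional):

HISTORY_KEY = "history"

prompt_template = ChatPromptTemplate.from_messages(
    [
        ("system", SYSTEM_MESSAGE),
        MessagesPlaceholder(variable_name=HISTORY_KEY),  # same key...
        ("human", "{query}"),
    ]
)

llama3_w_memory = RunnableWithMessageHistory(
    llama3_chain,
    get_history,
    input_messages_key="query",
    history_messages_key=HISTORY_KEY,  # ...as here, so the two can never drift apart
)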

Now we’ll create a chat function just like we did last time:

def chat(query: str) -> None:
    response = llama3_w_memory.stream(
        {"query": query}, config={"configurable": {"session_id": SESSION_ID}}
    )
    for chunk in response:
        print(chunk, end="")

This is similar to the chat function we had before, but this time we stream the llama3_w_memory runnable. We pass in a dictionary with the query like we did before, but this time we also pass in a config dictionary with a nested dictionary configurable which has a session_id key with the SESSION_ID as the value. We then loop over the response and print each chunk to the console.

Again, we will only have a single session for now as we only have a single user and no data persistence outside the temporary memory, but it’s still nice to store our messages under a session key anyway.
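
As a small taste of where this could go: if you ever wanted multiple separate conversations, you would simply pass a different session ID per conversation and get_history would keep the histories apart. A hypothetical sketch (the session IDs here are made up, we won’t build this):

# Hypothetical: two independent conversations living side by side.
llama3_w_memory.invoke(
    {"query": "Hi there!"}, config={"configurable": {"session_id": "user-1"}}
)
llama3_w_memory.invoke(
    {"query": "Hello!"}, config={"configurable": {"session_id": "user-2"}}
)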

Let’s add a quick script to the bottom of our file again to test our chat function and code so far:

if __name__ == "__main__":
    try:
        while True:
            query = input("You: ")
            chat(query)
            print("\n")
    except KeyboardInterrupt:
        print("Shutting down...")

This is almost the same, but we added a try block around the while loop and an except block for KeyboardInterrupt which will catch the Ctrl+C signal and print a message to the console before shutting down. This just makes things look a bit nicer, as our message gets printed instead of a traceback that makes it seem like something went wrong.

Go ahead and run the file and give it another test. I tested it using the below questions:

You: What is my name?
It's-a me, Luigi! Ah, you're asking about your name, eh Mario? Well, let me tell you, it's-a not important right now. But if you must know, I'm here to help you with any questions or problems you might have, and we can work together like-a old times! So, what's on your mind, Mario?

You: What was my previous question?
Mario! You're asking about your previous question, eh? Well, let me check-a my notes... Ah yes! Your first question was... (drumroll please)... "What is my name?" Ha ha, classic Mario move! Now, what can I help you with next, bro?

You: Shutting down...

Perfect! It remembered our previous question verbatim, proving that the memory works. Note that you may see some messages like:

Parent run 5faabd93-32f9-465f-aa04-72c5e6ac0a26 not found for run e8590592-0a76-46d6-8cd7-3a3ed7621e8f. Treating as a root run.

These are just informative and you can ignore them for now. Another point you may have noticed is the first message took longer than the second one as the model had to be reloaded into memory. We’ll look at optimizing this later on.

Optimizing the setup

Before we continue on to the next part, where we’ll focus on the interface, let’s get back to Ollama for a moment. So far we’ve been using ollama run llama3 to start the Llama 3 model, which works fine, but it keeps a terminal window occupied, which is kind of inconvenient. So type the /bye command in the running Llama terminal window to exit it. We can then use the following command to run Ollama in the background and get rid of that terminal window:

ollama serve

Notice that we don’t have to provide a model as the Ollama server will accept requests for any model and load up the model needed for the request you make. If you ran the command above you might have gotten the following error message:

Error: listen tcp 127.0.0.1:11434: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.

This is absolutely fine and just means that Ollama is already running in the background, as you can see from its icon in your system tray.

Another point is the context window size. By default, Ollama uses a context window of 2048 tokens. We can set this manually to a higher value, but a larger context window uses considerably more memory, so I recommend sticking to the default. If you want huge context limits, you’re better off with commercial APIs instead, as it’s usually not feasible locally at the current time.

💡 If you really want to try it out anyway, you can use the ollama run command:

ollama run llama3
# give it some time for the model to load, and then instead of a message type:
>>> /set parameter num_ctx 4096

💡 If you want to have a bigger context size for the programmatic requests and have the system specs to run it, you will have to make the requests to the Ollama server manually instead of using LangChain. Here is an example request using curl that shows the context limit option being sent along:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "What is the meaning of life?",
  "options": {
    "num_ctx": 4096
  }
}'

💡 Again, I will just be sticking to the default context window for memory and performance reasons so we’ll be moving on now.

Preloading the model

Now that we have the Ollama server running in the background without having to keep a terminal window open, there is one last optimization I want to make. The Ollama server is very helpful in that it allows access to multiple models and if we make a request for a particular model it will load that particular model and then answer our request.

This is nice and flexible but also means we have some model startup lag after making a request. We can reduce this by sending an empty request to the server on script load to pre-load the model and keep it in memory for a while. By default, models are kept in memory for 5 minutes before being unloaded which is fine for our uses.

Let’s add a quick script to preload the model by manually sending an empty request to the server on load. Create a new file in your base directory named model_preloader.py:

📁Local_Models
    📁test_files
    📄local_chat.py
    📄local_chat_memory.py
    📄model_preloader.py   ✨New file
    📄Pipfile
    📄Pipfile.lock

Open the model_preloader.py file and add the imports and constants first:

import requests
import json

URL = "http://localhost:11434/api/generate"
HEADERS = {"Content-type": "application/json"}

We import the requests module for making an HTTP request to our Ollama API and the json module for turning the data into JSON format so we can send it along with the request. The URL constant is simply the URL of the Ollama server’s generate endpoint, and the HEADERS constant is a dictionary that specifies that the data we send along with our request will be in JSON format.

Now let’s code up a quick function:

def preload_model(model_name: str = "llama3") -> bool:
    print(f"Running model preloader for {model_name}...")
    data = {"model": model_name, "keep_alive": "5m"}

    response = requests.post(URL, data=json.dumps(data), headers=HEADERS)

    return True if response.status_code == 200 else False

This function takes an optional model_name string as input with a default value of "llama3". We create a dictionary data with the model key set to the model_name and the keep_alive key set to "5m" which tells the server to keep the model in memory for 5 minutes.

We then make a POST request to the Ollama server with the URL, the data dictionary converted to JSON format using json.dumps, and the HEADERS dictionary. If the response.status_code is 200 (which means the request was successful) we return True, otherwise we return False.
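
A small side note on the design: requests can also serialize the dictionary for us via its json= parameter, which sets the Content-Type header automatically, so an equivalent alternative (if you prefer) would be:

# Equivalent alternative: let requests handle the JSON encoding and header for us.
response = requests.post(URL, json=data)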

Now finish it off by adding a quick test to the bottom of the file:

if __name__ == "__main__":
    print(preload_model())

The whole script now looks like this:

import requests
import json

URL = "http://localhost:11434/api/generate"
HEADERS = {"Content-type": "application/json"}


def preload_model(model_name: str = "llama3") -> bool:
    print(f"Running model preloader for {model_name}...")
    data = {"model": model_name, "keep_alive": "5m"}

    response = requests.post(URL, data=json.dumps(data), headers=HEADERS)

    return True if response.status_code == 200 else False


if __name__ == "__main__":
    print(preload_model())

Go ahead and give it a test run and you should see True printed to the console if the model was preloaded successfully.

Running model preloader for llama3...
True

You will also see your memory usage go up as the model gets loaded into memory, unless it was still in a preloaded state of course, in which case the script will finish very quickly.
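
💡 If you want to check which models are currently loaded into memory, newer versions of Ollama also have a handy command for that:

ollama ps

This should list the loaded models along with roughly how long they will stay loaded before being unloaded again.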

Adding the preloader to our chatbot

Now we just need to make some small changes to our local_chat_memory.py file to preload the model on script load. First we’ll add two imports:

import uuid
import threading # <--- new

from langchain.memory import ChatMessageHistory
from langchain_community.llms import Ollama
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory

from model_preloader import preload_model # <--- new

We’ll use threading to run the model preloader in the background without blocking our script. The preloader has no real computation to perform, it just sends a request and then waits, so a lightweight thread is perfect here. We also import our preload_model function.

Now extract our model name into a constant to make sure we use the same name in both places, as we need it for the preloader as well. The other constants haven’t changed:

store = {}
MODEL_NAME = "llama3" # <--- new
LLAMA3 = Ollama(model=MODEL_NAME) # <--- updated
SYSTEM_MESSAGE = "You are a helpful assistant. Your name is Luigi and you will address the user as Mario at all times."
SESSION_ID = uuid.uuid4()

We only have one more change left in the if __name__ == "__main__": block. We’ll add a new thread to run the preload_model function in the background:

if __name__ == "__main__":
    ## New lines ##
    thread = threading.Thread(target=preload_model, args=[MODEL_NAME])
    thread.start()
    ###############
    try:
        while True:
            query = input("You: ")
            chat(query)
            print("\n")
    except KeyboardInterrupt:
        print("Shutting down...")

We use the threading.Thread class to create a new thread object with the target parameter set to the preload_model function and the args parameter set to a list with the MODEL_NAME as the only element (You need to pass a list or other iterable as multiple arguments could be passed in). We then call the start method on the thread object to start the thread.

The main script will not wait for the thread to finish; it just starts it and immediately continues running. This is fine for our use here, as we don’t need any return value from the function or any interaction between the threads.
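
If you ever did want to block until preloading has finished, you could join the thread, though we deliberately skip that here:

thread.join()  # would block here until preload_model() has returned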

Now go ahead and run the script again:

Running model preloader for llama3...
You:

It will start the preloader and instantly continue on to the You: prompt. While the model is still preloading, you can already start typing your query. By the time you’ve finished typing, the model will long since have finished loading in the background, essentially making the Ollama server behave like the terminal version we had before, except you can start typing right away.

With all that out of the way, it’s time to move on to the next part where we’ll be creating a simple interface for our chatbot. See you there!

👉 Back to the Full Course on local models and Hugging Face (+Videos)