Welcome back to part 3! In this part, we will be using Gradio to create a quick but powerful interface for our chatbot. Gradio is a Python library that lets you quickly build customizable UIs around your machine-learning models using its many pre-built components.
To get started, let’s install Gradio in our virtual environment. Run the following command in a terminal window:
pipenv install gradio
Integrating our local chatbot with Gradio
Now it’s time to revisit our local_chat_memory.py file. We will be making some changes to it to integrate it with Gradio. Open local_chat_memory.py back up and get started by adding the following import to your imports up top:
import uuid
import threading
from typing import Iterator  # <--- new

... (other imports) ...
We will use this as a type hint later on as we want to send an Iterator for Gradio to loop over.
Now let’s go to the constants below the imports, where I only changed the SYSTEM_MESSAGE constant:
store = {}
MODEL_NAME = "llama3"
LLAMA3 = Ollama(model=MODEL_NAME)
SYSTEM_MESSAGE = "You are a wise Jedi Master and helpful assistant. Your name is Yoda and you will address the user as Obi-Wan Kenobi at all times."  # <--- updated
SESSION_ID = uuid.uuid4()
This is entirely optional but I felt like testing with a Star Wars theme this time because why not? 😄 Feel free to use a sensible system message instead.
Now scroll down to the chat function. We’ll be rewriting this function so that it has a normal mode, which prints to the console, and a Gradio mode, which returns the iterator instead so that Gradio can loop over it and display the messages. Here is the updated chat function:
def chat(query: str, gradio_mode: bool = False) -> None | Iterator[str]:
    response = llama3_w_memory.stream(
        {"query": query}, config={"configurable": {"session_id": SESSION_ID}}
    )
    if gradio_mode:
        return response
    else:
        for chunk in response:
            print(chunk, end="")
So we have our updated chat function that now takes an additional parameter gradio_mode, which is a boolean that defaults to False. The function will either return None or an Iterator that will yield str values.
We call llama3_w_memory.stream like before with the query and config objects, catching the response. If Gradio mode is on we simply return this Iterator named response. If not, we loop over the response and print it to the console like the old function did, so we can still run this script in the terminal like we have done so far.
Now we’ll add a new function below the chat function where we put the model preload logic, taking it out of the if __name__ == "__main__": block. This is so we can reuse this function both in this script and in our Gradio interface script:
def preload(model_name: str = MODEL_NAME) -> None:
    thread = threading.Thread(target=preload_model, args=[model_name])
    thread.start()


if __name__ == "__main__":
    preload()
    try:
        while True:
            query = input("You: ")
            chat(query)
            print("\n")
    except KeyboardInterrupt:
        print("Shutting down...")
We have a new function called preload that takes a model_name parameter defaulting to our MODEL_NAME constant, so it doesn’t have to be set but you have the ability to override it should you want to in the future. The function has no return value; it just runs the thread logic that was in the if __name__ == "__main__": block before. We then call the preload function from inside the if __name__ == "__main__": block instead. The rest of the block with the try and except is the same as before.
Feel free to test your local_chat_memory.py script in the terminal to make sure everything is still working as expected. It should behave exactly the same as before.
Adding the Gradio chat interface
Let’s move on to the Gradio chat app itself, which will be surprisingly simple to make. Gradio has a lower-level API called Blocks that allows you to create custom interfaces for your models. This is useful if you want to create a more complex interface that is heavily customized to your model.
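Just to give you a feel for what that looks like, here is a minimal Blocks sketch (a standalone toy example, not part of our project):

import gradio as gr

def echo(message):
    return f"You said: {message}"

# Blocks lets you lay out and wire up individual components yourself.
with gr.Blocks() as demo:
    inp = gr.Textbox(label="Message")
    out = gr.Textbox(label="Reply")
    gr.Button("Send").click(fn=echo, inputs=inp, outputs=out)

demo.launch()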
We will be using the higher-level API which has a pre-built interface for chatbots that we can use with minimal code, as this tutorial is not about Gradio. We don’t want to stray too far off-topic so we’ll just stick with quick-and-dirty!
Create a new file named chat_app.py in the root directory of your project:
📁Local_Models
    📁test_files
    📄chat_app.py        ✨New file
    📄local_chat.py
    📄local_chat_memory.py
    📄model_preloader.py
    📄Pipfile
    📄Pipfile.lock
Inside chat_app.py let’s start with our imports:
import gradio as gr

from local_chat_memory import chat, preload
No surprises here, we import Gradio and our chat and preload functions from local_chat_memory.py.
In order to use the prebuilt chat interface from Gradio we need to use the ChatInterface class. This ChatInterface class takes a function as input that will be called whenever a message is sent. The ChatInterface class will then send both the current query and the message history to this function each time the user makes a request.
ChatInterface (gradio) -> {query, history} -> our_function
Our function in turn is supposed to return a response in the form of a generator that can be looped over by Gradio to display the messages in its prebuilt interface.
our_function -> {response} -> ChatInterface (gradio)
Let’s first define our_function, and then afterward we can call the ChatInterface class and give it our function as input:
def run_gradio_chat(query, _):
    response = chat(query, gradio_mode=True)
    full_message = ""
    if response:
        for chunk in response:
            full_message += chunk
            yield full_message
    else:
        yield "There was an error fetching the response. Please try again."
We named our function run_gradio_chat and we take the query and history as input, as shown above, except we will not be using the history parameter so we just named it _ for now. We already have the history we created in LangChain so we don’t need another one.
We then get the response by simply calling our chat function with the query and gradio_mode set to True so we get the Iterator back. If there is a response, we’re going to loop over each chunk in the response and append it to the full_message string, which starts as an empty string.
We then yield the full_message string, which means that instead of waiting for the entire response to be assembled and then returned, the function run_gradio_chat can start providing parts of the response as soon as they are ready.
If there is no response, we yield a simple error message instead.
💡 In Python, yield is a keyword that is used in defining a generator. A generator is a special type of iterator. Instead of creating and returning a whole list of items at once (like a function would do), a generator uses the yield keyword to yield one item at a time, waiting to produce the next item until the current one has been consumed. This allows our chatbot here to ‘return’ what it has received so far bit by bit as it comes in.
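To make that concrete, here is a tiny standalone generator (just an illustration, separate from our chatbot code):

def count_up(limit):
    """A tiny generator: yields one number at a time instead of building a whole list."""
    n = 1
    while n <= limit:
        yield n  # pauses here until the caller asks for the next value
        n += 1


for number in count_up(3):
    print(number)  # prints 1, then 2, then 3, one value per loop iteration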
Now let’s define an if __name__ == "__main__": block where we use our preload function and then create the Gradio interface:
if __name__ == "__main__": preload() gr.ChatInterface(run_gradio_chat).launch()
Preload will just start loading our llama3 model in a separate thread without blocking the script, just like we defined. Then we call the ChatInterface class with our run_gradio_chat function as input and call the launch method on it to start the Gradio interface.
Testing the Gradio chat interface
Go ahead and run the chat_app.py script and the following should pop up in your terminal if you give it a moment to load:
Running model preloader for llama3...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
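As a quick aside, that last line just means you could pass share=True to launch to get a temporary public link; we don’t need it for this tutorial:

# Optional: serves the same interface behind a temporary public Gradio link.
gr.ChatInterface(run_gradio_chat).launch(share=True)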
Go ahead and Ctrl + click the URL to open it in your browser. You should see a chat interface that looks like this:
Go ahead, don’t be shy! Say hello to your new chatbot! 🤖
You will see the response stream in token by token just like using the real web version of popular LLMs like ChatGPT or Gemini! 🎉
Customizing the Gradio chat interface
One quick note: as we decided to keep the memory implementation in LangChain, clicking the Undo or Clear buttons will only remove these messages from the interface but not actually from the message history in LangChain. I don’t want this tutorial to become about Gradio, which is why we’ll just get rid of the extra buttons we don’t need.
In your chat_app.py script, update the ChatInterface call to look like this:
if __name__ == "__main__": preload() gr.ChatInterface( run_gradio_chat, retry_btn=None, undo_btn=None, clear_btn=None ).launch()
Passing in None for these three buttons will get rid of them in the interface so we can keep our focus on the local LLMs instead of getting caught up in Gradio interfaces. Gradio has pretty good documentation in case you’re interested in learning more later. Go ahead and press Ctrl + C in the terminal to stop the Gradio interface. Then run the chat_app.py script again to see the changes:
The memory problem
With that out of the way, there still is one glaringly obvious and fatal flaw in our local chatbot implementation. Do you have an idea what it might be? 🤔
As we discussed, we are working with a limited context window for the local LLM model input, so we can only send so many tokens as input, and raising this limit will dramatically increase the memory needed. We have LangChain keep track of our message history, but we did not set any sort of limit; we just allow the message history to grow indefinitely.
As we keep chatting here this message history will eventually grow larger than our context window, and it will also unnecessarily slow us down before then. It is not likely we will need to have the LLM remember what we asked it 10 messages ago. To find a good balance between speed and length of memory let’s have a look at our memory implementation and customize it a little bit.
Create a new file in the root directory of your project named memory.py:
📁Local_Models
    📁test_files
    📄chat_app.py
    📄local_chat.py
    📄local_chat_memory.py
    📄memory.py        ✨New file
    📄model_preloader.py
    📄Pipfile
    📄Pipfile.lock
Creating a custom memory implementation
So far we’ve been using the ChatMessageHistory class from LangChain, which is an implementation of the base BaseChatMessageHistory class in LangChain. This base class contains the implementation rules, if you will. We can create our own custom class that inherits from BaseChatMessageHistory and change how it works to have our own memory implementation.
Inside our memory.py file let’s start with the imports:
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.messages import BaseMessage
from langchain_core.pydantic_v1 import BaseModel, Field
As we discussed, BaseChatMessageHistory is the base class for LangChain chat message history implementations. Likewise, BaseMessage is the base class for the messages, as a history is just a list of these messages. We also import BaseModel and Field from Pydantic, as we will be using these to define our custom memory class. Note that LangChain likes Pydantic V1 in particular, so much so that it comes included.
Let’s start defining our custom memory class that will only remember the last K messages. We will call this class LimitedHistory:
class LimitedHistory(BaseChatMessageHistory, BaseModel):
    """In memory implementation of limited chat message history storing only the last K messages."""

    messages: list[BaseMessage] = Field(default_factory=list)
    max_messages: int = 10
If you’re not familiar with Pydantic this may look a bit weird, so let’s go over it. The LimitedHistory class is a subclass of BaseChatMessageHistory and BaseModel. The BaseModel class is from the pydantic library, which is a data validation library used to create data classes.
When you create a subclass of BaseModel, you don’t need to define an __init__ method for your class. Instead, you define class attributes, and pydantic automatically generates an __init__ method for you.
This generated __init__ method takes keyword arguments with the same names as the class attributes, and it validates and assigns them. So if we create a new instance we could just do history = LimitedHistory(messages=[message1, message2], max_messages=5), and it would work out and handle the class __init__ logic for us.
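To see that in isolation, here is a small sketch of a plain Pydantic model. SimpleHistory and its fields are made up purely for this illustration, and I’m importing from a regular pydantic install rather than LangChain’s bundled copy:

from pydantic import BaseModel, Field


class SimpleHistory(BaseModel):
    messages: list[str] = Field(default_factory=list)  # defaults to an empty list
    max_messages: int = 10                              # defaults to 10


# Pydantic generates the __init__ and validates the values for us.
history = SimpleHistory(messages=["hi", "hello"], max_messages=5)
print(history.max_messages)  # 5

SimpleHistory(max_messages="not a number")  # raises a ValidationError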
We’re saying here that our LimitedHistory class has two attributes: messages and max_messages. The messages attribute is a list of BaseMessage objects, and we set the default value to an empty list. The max_messages attribute is an integer that defaults to 10.
Let’s continue. In order to make our own valid implementation of a LangChain BaseChatMessageHistory we need to define at least an add_messages and a clear method:
class LimitedHistory(BaseChatMessageHistory, BaseModel):
    """In memory implementation of limited chat message history storing only the last K messages."""

    messages: list[BaseMessage] = Field(default_factory=list)
    max_messages: int = 10

    def add_messages(self, messages: list[BaseMessage]) -> None:
        self.messages.extend(messages)
        self.messages = self.messages[-self.max_messages :]

    def clear(self) -> None:
        self.messages = []
We overwrite the add_messages method from the base class, using the same method signature, taking a list of BaseMessage objects as input, and of course self, as this is a method on the class. We extend the messages list with the new messages and then we slice the list to only keep the last K messages, where K is the value of max_messages. So say K is 4, then we will slice from -4 (the fourth-last element) to the end of the list.
We could technically make this more efficient using a deque data type and whatnot, but for a modern computer creating a list of a couple of messages is not a big deal, so over-optimizing is a bit pointless here.
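If you’re curious what that would look like, here is a rough sketch of the deque idea (just an aside; we’ll stick with the plain list in this tutorial):

from collections import deque

# A deque with maxlen automatically drops the oldest item once the limit is
# exceeded, so no manual slicing is needed.
recent_messages = deque(maxlen=4)

for i in range(1, 7):
    recent_messages.append(f"message {i}")

print(list(recent_messages))  # ['message 3', 'message 4', 'message 5', 'message 6']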
The clear method is simple: we just set the messages list to an empty list. This method is needed to complete a valid implementation.
💡 Strictly speaking this implementation is not foolproof either, because a smaller number of very long messages can also get you past the context limit. I’ll leave it to you to improve this further if you want to, as this is a bit beyond the scope of this tutorial. You could, for example, set an overall message history size limit and implement LLM summarization of the messages in the history when they get too long, reducing the number of tokens.
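As one possible starting point for the first half of that idea (a rough size limit, with the LLM summarization left as an exercise), here is a hedged sketch; the max_chars field and its value are assumptions of mine, and character count is only a crude stand-in for tokens:

from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.messages import BaseMessage
from langchain_core.pydantic_v1 import BaseModel, Field


class SizeLimitedHistory(BaseChatMessageHistory, BaseModel):
    """Hypothetical variant: limits by message count AND a rough character budget."""

    messages: list[BaseMessage] = Field(default_factory=list)
    max_messages: int = 10
    max_chars: int = 4000  # made-up budget; tune to your model's context window

    def add_messages(self, new_messages: list[BaseMessage]) -> None:
        self.messages.extend(new_messages)
        self.messages = self.messages[-self.max_messages :]
        # Drop the oldest messages until the combined text fits the budget.
        while len(self.messages) > 1 and sum(len(str(m.content)) for m in self.messages) > self.max_chars:
            self.messages.pop(0)

    def clear(self) -> None:
        self.messages = []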
Go ahead and save and close this file. Now we need to update our local_chat_memory.py file to use our new LimitedHistory class instead of the ChatMessageHistory class. Open up local_chat_memory.py and add an import for our new memory:
import uuid
import threading
from typing import Iterator

from langchain.memory import ChatMessageHistory
from langchain_community.llms import Ollama
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory

from model_preloader import preload_model
from memory import LimitedHistory  # <--- new
Now go to the section with the store and all the constants below and add a constant named HISTORY_LIMIT for the maximum number of messages to keep in memory:
store = {}
HISTORY_LIMIT = 4  # <--- new
MODEL_NAME = "llama3"
LLAMA3 = Ollama(model=MODEL_NAME)
SYSTEM_MESSAGE = "You are a wise Jedi Master and helpful assistant. Your name is Yoda and you will address the user as Obi-Wan Kenobi at all times."
SESSION_ID = uuid.uuid4()
All that is left to do now is update our get_history function as follows:
def get_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = LimitedHistory(max_messages=HISTORY_LIMIT)
    return store[session_id]
We replaced the old ChatMessageHistory class with our new LimitedHistory class, passing in the HISTORY_LIMIT constant as the max_messages parameter. You can also go ahead and delete the ChatMessageHistory import on line 5 of the file as we are not using it anymore.
Testing the memory limit
We don’t need to change anything else as we just swapped out the memory for a similar but slightly different implementation. Go ahead and run your Gradio server again by running the chat_app.py script and give it a test. It should forget messages older than the HISTORY_LIMIT you set.
It seems to get rid of a message when a new call is made, so if you set 4 it will practically remember the last 3 messages including the current one. Set it slightly higher than the number you want. Example:
You can see that once we get to question #4 it has forgotten question #1, and it answers as if question #2 was the first thing we asked when giving answer #4. We can see our memory limit is working!
Awesome, we now have a working local chat with a convenient interface to interact with our local models. You can expand on this further, like adding a dropdown to the interface to switch between models. We will be moving on to the next topic so as not to get sidetracked, but if you do want to add this feature I’ll give you some pointers on how to do it (with a rough sketch after the list):
- Add a gr.Dropdown using the additional_inputs parameter in the gr.ChatInterface call in chat_app.py and give it a list of the models you want to be able to call.
  - Gradio docs – additional inputs
  - Gradio docs – Dropdown
- Add a new parameter to the run_gradio_chat function in chat_app.py to take the model name as input, matching the label name specified in the gr.Dropdown call. Send this model parameter along to the chat function from local_chat_memory.py.
- Update local_chat_memory.py to allow for switching the model variable by creating a new Ollama instance with the changed model name (Ollama(model=MODEL_NAME)).
Again, this is totally optional and we will be moving on for now. In the next part, we’ll start taking a look at HuggingFace and running image models locally on our own system. See you there soon! 🚀