Welcome back to part 3! In this part, we will be using Gradio to create a quick but powerful interface for our chatbot. Gradio is a Python library that lets you quickly build customizable UIs around your machine-learning models using its many pre-built components.
To get started, let’s install Gradio in our virtual environment. Run the following command in a terminal window:
pipenv install gradio
Integrating our local chatbot with Gradio
Now it’s time to revisit our local_chat_memory.py file. We will be making some changes to it to integrate it with Gradio. Open local_chat_memory.py back up and get started by adding the following import to your imports up top:
import uuid
import threading
from typing import Iterator  # <--- new

... (other imports) ...
We will use this as a type hint later on as we want to send an Iterator for Gradio to loop over.
Now let’s go to the constants below the imports, where I only changed the SYSTEM_MESSAGE constant:
store = {}
MODEL_NAME = "llama3"
LLAMA3 = Ollama(model=MODEL_NAME)
SYSTEM_MESSAGE = "You are a wise Jedi Master and helpful assistant. Your name is Yoda and you will address the user as Obi-Wan Kenobi at all times."  # <--- updated
SESSION_ID = uuid.uuid4()
This is entirely optional but I felt like testing with a Star Wars theme this time because why not? 😄 Feel free to use a sensible system message instead.
Now scroll down to the chat function. We’ll be rewriting this function so that it has a normal mode, which prints to the console, and a Gradio mode, which returns the iterator instead so that Gradio can loop over it and display the messages. Here is the updated chat function:
def chat(query: str, gradio_mode: bool = False) -> None | Iterator[str]:
    response = llama3_w_memory.stream(
        {"query": query}, config={"configurable": {"session_id": SESSION_ID}}
    )
    if gradio_mode:
        return response
    else:
        for chunk in response:
            print(chunk, end="")
So we have our updated chat function that now takes an additional parameter gradio_mode, which is a boolean that defaults to False. The function will either return None or an Iterator that will yield str values.
We call llama3_w_memory.stream like before with the query and config objects, catching the response. If Gradio mode is on we simply return this Iterator named response. If not, we loop over the response and print it to the console like the old function did, so we can still run this script in the terminal like we have done so far.
Now we’ll add a new function below the chat function where we put the model preload logic, taking it out of the if __name__ == "__main__": block. This is so we can reuse this function both in this script and in our Gradio interface script:
def preload(model_name: str = MODEL_NAME) -> None:
    thread = threading.Thread(target=preload_model, args=[model_name])
    thread.start()


if __name__ == "__main__":
    preload()
    try:
        while True:
            query = input("You: ")
            chat(query)
            print("\n")
    except KeyboardInterrupt:
        print("Shutting down...")
We have a new function called preload that takes a model_name parameter defaulting to our MODEL_NAME constant, so it doesn’t have to be set but you have the ability to override it should you want to in the future. The function has no return value; it just runs the thread logic that was in the if __name__ == "__main__": block before. We then call the preload function from inside the if __name__ == "__main__": block instead. The rest of the block with the try and except is the same as before.
Feel free to test your local_chat_memory.py script in the terminal to make sure everything is still working as expected. It should behave exactly the same as before.
Adding the Gradio chat interface
Let’s move on to the Gradio chat app itself, which will be surprisingly simple to make. Gradio has a lower-level API called Blocks that allows you to create custom interfaces for your models. This is useful if you want to create a more complex interface that is heavily customized to your model.
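Just to give you a feel for what that looks like, here is a minimal Blocks sketch (a standalone toy example, not part of our project):

import gradio as gr

def echo(message):
    return f"You said: {message}"

# Blocks lets you lay out and wire up individual components yourself.
with gr.Blocks() as demo:
    inp = gr.Textbox(label="Message")
    out = gr.Textbox(label="Reply")
    gr.Button("Send").click(fn=echo, inputs=inp, outputs=out)

demo.launch()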
We will be using the higher-level API which has a pre-built interface for chatbots that we can use with minimal code, as this tutorial is not about Gradio. We don’t want to stray too far off-topic so we’ll just stick with quick-and-dirty!
Create a new file named chat_app.py in the root directory of your project:
📁Local_Models
    📁test_files
    📄chat_app.py        ✨New file
    📄local_chat.py
    📄local_chat_memory.py
    📄model_preloader.py
    📄Pipfile
    📄Pipfile.lock
Inside chat_app.py let’s start with our imports:
import gradio as gr

from local_chat_memory import chat, preload
No surprises here, we import Gradio and our chat and preload functions from local_chat_memory.py.
In order to use the prebuilt chat interface from Gradio we need to use the ChatInterface class. This ChatInterface class takes a function as input that will be called whenever a message is sent. The ChatInterface class will then send both the current query and the message history to this function each time the user makes a request.
ChatInterface (gradio) -> {query, history} -> our_function
Our function in turn is supposed to return a response in the form of a generator that can be looped over by Gradio to display the messages in its prebuilt interface.
our_function -> {response} -> ChatInterface (gradio)
Let’s first define our_function, and then afterward we can call the ChatInterface class and give it our function as input:
def run_gradio_chat(query, _):
    response = chat(query, gradio_mode=True)
    full_message = ""
    if response:
        for chunk in response:
            full_message += chunk
            yield full_message
    else:
        yield "There was an error fetching the response. Please try again."
We named our function run_gradio_chat and we take the query and history as input, as shown above, except we will not be using the history parameter so we just named it _ for now. We already have the history we created in LangChain so we don’t need another one.
We then get the response by simply calling our chat function with the query and gradio_mode set to True so we get the Iterator back. If there is a response, we’re going to loop over each chunk in the response and append it to the full_message string, which starts as an empty string.
We then yield the full_message string, which means that instead of waiting for the entire response to be assembled and then returned, the function run_gradio_chat can start providing parts of the response as soon as they are ready.
If there is no response, we yield a simple error message instead.
💡 In Python, yield is a keyword that is used in defining a generator. A generator is a special type of iterator. Instead of creating and returning a whole list of items at once (like a function would do), a generator uses the yield keyword to yield one item at a time, waiting to produce the next item until the current one has been consumed. This allows our chatbot here to ‘return’ what it has received so far bit by bit as it comes in.
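To make that concrete, here is a tiny standalone generator (just an illustration, separate from our chatbot code):

def count_up(limit):
    """A tiny generator: yields one number at a time instead of building a whole list."""
    n = 1
    while n <= limit:
        yield n  # pauses here until the caller asks for the next value
        n += 1


for number in count_up(3):
    print(number)  # prints 1, then 2, then 3, one value per loop iteration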
Now let’s define an if __name__ == "__main__": block where we use our preload function and then create the Gradio interface:
if __name__ == "__main__": preload() gr.ChatInterface(run_gradio_chat).launch()
Preload will just start loading our llama3 model in a separate thread without blocking the script, just like we defined. Then we call the ChatInterface class with our run_gradio_chat function as input and call the launch method on it to start the Gradio interface.
Testing the Gradio chat interface
Go ahead and run the chat_app.py script and the following should pop up in your terminal if you give it a moment to load:
Running model preloader for llama3...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
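As a quick aside, that last line just means you could pass share=True to launch to get a temporary public link; we don’t need it for this tutorial:

# Optional: serves the same interface behind a temporary public Gradio link.
gr.ChatInterface(run_gradio_chat).launch(share=True)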
Go ahead and Ctrl + click the URL to open it in your browser. You should see a chat interface that looks like this:
Go ahead, don’t be shy! Say hello to your new chatbot! 🤖
You will see the response stream in token by token just like using the real web version of popular LLMs like ChatGPT or Gemini! 🎉
Customizing the Gradio chat interface
One quick note: as we decided to keep the memory implementation in LangChain, clicking the Undo or Clear buttons will only remove these messages from the interface but not actually from the message history in LangChain. I don’t want this tutorial to become about Gradio, which is why we’ll just get rid of the extra buttons we don’t need.
In your chat_app.py script, update the ChatInterface call to look like this:
if __name__ == "__main__": preload() gr.ChatInterface( run_gradio_chat, retry_btn=None, undo_btn=None, clear_btn=None ).launch()
Passing in None for these three buttons will get rid of them in the interface so we can keep our focus on the local LLMs instead of getting caught up in Gradio interfaces. Gradio has pretty good documentation in case you’re interested in learning more later. Go ahead and press Ctrl + C in the terminal to stop the Gradio interface. Then run the chat_app.py script again to see the changes:
The memory problem
With that out of the way, there still is one glaringly obvious and fatal flaw in our local chatbot implementation. Do you have an idea what it might be? 🤔
As we discussed, we are working with a limited context window for the local LLM model input, so we can only send so many tokens as input, and raising this limit will dramatically increase the memory needed. We have LangChain keep track of our message history, but we did not set any sort of limit; we just allow the message history to grow indefinitely.
As we keep chatting here this message history will eventually grow larger than our context window, and it will also unnecessarily slow us down before then. It is not likely we will need to have the LLM remember what we asked it 10 messages ago. To find a good balance between speed and length of memory let’s have a look at our memory implementation and customize it a little bit.
Create a new file in the root directory of your project named memory.py:
📁Local_Models
    📁test_files
    📄chat_app.py
    📄local_chat.py
    📄local_chat_memory.py
    📄memory.py        ✨New file
    📄model_preloader.py
    📄Pipfile
    📄Pipfile.lock
Creating a custom memory implementation
So far we’ve been using the ChatMessageHistory class from LangChain, which is an implementation of the base BaseChatMessageHistory class in LangChain. This base class contains the implementation rules, if you will. We can create our own custom class that inherits from BaseChatMessageHistory and change how it works to have our own memory implementation.
Inside our memory.py file let’s start with the imports:
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.messages import BaseMessage
from langchain_core.pydantic_v1 import BaseModel, Field
As we discussed, BaseChatMessageHistory is the base class for LangChain chat message history implementations. Likewise, BaseMessage is the base class for the messages, as a history is just a list of these messages. We also import BaseModel and Field from Pydantic, as we will be using these to define our custom memory class. Note that LangChain likes Pydantic V1 in particular, so much so that it comes included.
Let’s start defining our custom memory class that will only remember the last K messages. We will call this class LimitedHistory:
class LimitedHistory(BaseChatMessageHistory, BaseModel):
    """In memory implementation of limited chat message history storing only the last K messages."""

    messages: list[BaseMessage] = Field(default_factory=list)
    max_messages: int = 10
If you’re not familiar with Pydantic this may look a bit weird, so let’s go over it. The LimitedHistory class is a subclass of BaseChatMessageHistory and BaseModel. The BaseModel class is from the pydantic library, which is a data validation library used to create data classes.
When you create a subclass of BaseModel, you don’t need to define an __init__ method for your class. Instead, you define class attributes, and pydantic automatically generates an __init__ method for you.
This generated __init__ method takes keyword arguments with the same names as the class attributes, and it validates and assigns them. So if we create a new instance we could just do history = LimitedHistory(messages=[message1, message2], max_messages=5), and it would work out and handle the class __init__ logic for us.
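To see that in isolation, here is a small sketch of a plain Pydantic model. SimpleHistory and its fields are made up purely for this illustration, and I’m importing from a regular pydantic install rather than LangChain’s bundled copy:

from pydantic import BaseModel, Field


class SimpleHistory(BaseModel):
    messages: list[str] = Field(default_factory=list)  # defaults to an empty list
    max_messages: int = 10                              # defaults to 10


# Pydantic generates the __init__ and validates the values for us.
history = SimpleHistory(messages=["hi", "hello"], max_messages=5)
print(history.max_messages)  # 5

SimpleHistory(max_messages="not a number")  # raises a ValidationError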
We’re saying here that our LimitedHistory class has two attributes: messages and max_messages. The messages attribute is a list of BaseMessage objects, and we set the default value to an empty list. The max_messages attribute is an integer that defaults to 10.
Let’s continue. In order to make our own valid implementation of a LangChain BaseChatMessageHistory we need to define at least an add_messages and a clear method:
class LimitedHistory(BaseChatMessageHistory, BaseModel):
    """In memory implementation of limited chat message history storing only the last K messages."""

    messages: list[BaseMessage] = Field(default_factory=list)
    max_messages: int = 10

    def add_messages(self, messages: list[BaseMessage]) -> None:
        self.messages.extend(messages)
        self.messages = self.messages[-self.max_messages :]

    def clear(self) -> None:
        self.messages = []
We overwrite the add_messages method from the base class, using the same method signature, taking a list of BaseMessage objects as input, and of course self, as this is a method on the class. We extend the messages list with the new messages and then we slice the list to only keep the last K messages, where K is the value of max_messages. So say K is 4, then we will slice from -4 (the fourth-last element) to the end of the list.
We could technically make this more efficient using a deque data type and whatnot, but for a modern computer creating a list of a couple of messages is not a big deal, so over-optimizing is a bit pointless here.
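If you’re curious what that would look like, here is a rough sketch of the deque idea (just an aside; we’ll stick with the plain list in this tutorial):

from collections import deque

# A deque with maxlen automatically drops the oldest item once the limit is
# exceeded, so no manual slicing is needed.
recent_messages = deque(maxlen=4)

for i in range(1, 7):
    recent_messages.append(f"message {i}")

print(list(recent_messages))  # ['message 3', 'message 4', 'message 5', 'message 6']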
The clear method is simple: we just set the messages list to an empty list. This method is needed to complete a valid implementation.
💡 Strictly speaking this implementation is not foolproof either, because a smaller number of very long messages can also get you past the context limit. I’ll leave it to you to improve this further if you want to, as this is a bit beyond the scope of this tutorial. You could, for example, set an overall message history size limit and implement LLM summarization of the messages in the history when they get too long, reducing the number of tokens.
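As one possible starting point for the first half of that idea (a rough size limit, with the LLM summarization left as an exercise), here is a hedged sketch; the max_chars field and its value are assumptions of mine, and character count is only a crude stand-in for tokens:

from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.messages import BaseMessage
from langchain_core.pydantic_v1 import BaseModel, Field


class SizeLimitedHistory(BaseChatMessageHistory, BaseModel):
    """Hypothetical variant: limits by message count AND a rough character budget."""

    messages: list[BaseMessage] = Field(default_factory=list)
    max_messages: int = 10
    max_chars: int = 4000  # made-up budget; tune to your model's context window

    def add_messages(self, new_messages: list[BaseMessage]) -> None:
        self.messages.extend(new_messages)
        self.messages = self.messages[-self.max_messages :]
        # Drop the oldest messages until the combined text fits the budget.
        while len(self.messages) > 1 and sum(len(str(m.content)) for m in self.messages) > self.max_chars:
            self.messages.pop(0)

    def clear(self) -> None:
        self.messages = []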
Go ahead and save and close this file. Now we need to update our local_chat_memory.py file to use our new LimitedHistory class instead of the ChatMessageHistory class. Open up local_chat_memory.py and add an import for our new memory:
import uuid
import threading
from typing import Iterator

from langchain.memory import ChatMessageHistory
from langchain_community.llms import Ollama
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory

from model_preloader import preload_model
from memory import LimitedHistory  # <--- new
Now go to the section with the store and all the constants below and add a constant named HISTORY_LIMIT for the maximum number of messages to keep in memory:
store = {}
HISTORY_LIMIT = 4  # <--- new
MODEL_NAME = "llama3"
LLAMA3 = Ollama(model=MODEL_NAME)
SYSTEM_MESSAGE = "You are a wise Jedi Master and helpful assistant. Your name is Yoda and you will address the user as Obi-Wan Kenobi at all times."
SESSION_ID = uuid.uuid4()
All that is left to do now is update our get_history function as follows:
def get_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = LimitedHistory(max_messages=HISTORY_LIMIT)
    return store[session_id]
We replaced the old ChatMessageHistory class with our new LimitedHistory class, passing in the HISTORY_LIMIT constant as the max_messages parameter. You can also go ahead and delete the ChatMessageHistory import on line 5 of the file as we are not using it anymore.
Testing the memory limit
We don’t need to change anything else as we just swapped out the memory for a similar but slightly different implementation. Go ahead and run your Gradio server again by running the chat_app.py script and give it a test. It should forget messages older than the HISTORY_LIMIT you set.
It seems to get rid of a message when a new call is made, so if you set 4 it will practically remember the last 3 messages including the current one. Set it slightly higher than the number you want. Example:
You can see that once we get to question #4 it has forgotten question #1, and it answers as if question #2 was the first thing we asked when giving answer #4. We can see our memory limit is working!
Awesome, we now have a working local chat with a convenient interface to interact with our local models. You can expand on this further, like adding a dropdown to the interface to switch between models. We will be moving on to the next topic so as not to get sidetracked, but if you do want to add this feature I’ll give you some pointers on how to do it (with a rough sketch after the list):
- Add a gr.Dropdown using the additional_inputs parameter in the gr.ChatInterface call in chat_app.py and give it a list of the models you want to be able to call.
  - Gradio docs – additional inputs
  - Gradio docs – Dropdown
- Add a new parameter to the run_gradio_chat function in chat_app.py to take the model name as input, matching the label name specified in the gr.Dropdown call. Send this model parameter along to the chat function from local_chat_memory.py.
- Update local_chat_memory.py to allow for switching the model variable by creating a new Ollama instance with the changed model name (Ollama(model=MODEL_NAME)).
Again, this is totally optional and we will be moving on for now. In the next part, we’ll start taking a look at HuggingFace and running image models locally on our own system. See you there soon! 🚀