Hi and welcome back to part 2! In this part, we’ll be taking our bare-bones API and getting started on creating a proper chatbot with a convenient interface.
We will be using LangChain to define and run our chatbot, so let's install it first. In your terminal window, first make sure your virtual environment is activated. If you're not sure, you can use the following trick:
echo $VIRTUAL_ENV
This will simply print (echo) the VIRTUAL_ENV environment variable to the console. If you see a path like C:/Users/admin/.virtualenvs/Local_Models-FCDcoHgn, it means you are in a virtual environment located at that path. If you don't see any output or run into any trouble, simply re-activate your virtual environment by running:
pipenv shell
Adding LangChain
With that out of the way, let's install LangChain inside our virtual environment:
pipenv install langchain
This installs LangChain, which is the base library. Now let's add one more package:
pipenv install langchain-community
This installs the LangChain Community package, which contains the third-party integrations (such as the Ollama class we'll be using) that we need as well.
💡 We’ll be covering the basics of LangChain as needed in this tutorial. If you want to really do a deep dive into the more advanced uses of this library and learn to create complex chains and setups, check out my dedicated LangChain tutorial which also covers LangSmith and LangGraph.
Then create a new Python file in your root project folder and name it local_chat.py:
📁Local_Models
    📁test_files
    📄local_chat.py
    📄Pipfile
    📄Pipfile.lock
Now open the local_chat.py file and let's start with a very basic request:
from langchain_community.llms import Ollama

llama3 = Ollama(model="llama3")

response = llama3.invoke(
    "If you understand this message, please reply with the sentence: 'The meaning of life is bucket'."
)

print(response)
We import the Ollama class from the langchain_community.llms module and create an instance of it with the model parameter set to "llama3". Again, feel free to use the llama3:70b model if your computer can handle it.

Then we call the invoke method on our llama3 object with the question "If you understand this message, please reply with the sentence: 'The meaning of life is bucket'." and print the response to the console. See how easy and readable this is? This is the power of LangChain! It handles the API request logic for us.
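Just to illustrate what LangChain is saving us here, a rough sketch of the equivalent direct call to the Ollama /api/generate endpoint (the same endpoint we'll use again later in this part) might look something like this, assuming you have the requests library available:

import requests

# A minimal sketch of calling the Ollama HTTP API directly; LangChain wraps
# this kind of request for us. Assumes the default local server address.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "If you understand this message, please reply with the sentence: 'The meaning of life is bucket'.",
        "stream": False,  # return a single JSON object instead of a stream
    },
)
print(response.json()["response"])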
Make sure you have Llama 3 running in Ollama; if it is not running yet, start it with the following command in your terminal:
ollama run llama3
Now run the script in your terminal and you should get a response something like this:
I see what's going on here! You're testing to see if I can recognize and respond to a nonsensical or absurd message. And... (drumroll) ...YES! I understand the message, and my response is: "The meaning of life is bucket." Well played, friend!
Streaming the response
That worked fine, but we had to wait for the entire response to generate before we could see anything. Let's fix this by calling the stream method instead of invoke:
from langchain_community.llms import Ollama

llama3 = Ollama(model="llama3")

response = llama3.stream(
    "If you understand this message, please reply with the sentence: 'The meaning of life is bucket'."
)

for chunk in response:
    print(chunk, end="")
We simply switch out the invoke method for the stream method. The response now becomes a generator object which we can iterate over with a for loop. We print each chunk of the response to the console with the end="" parameter to prevent newlines from being added in between each chunk.
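As a small aside (purely illustrative, you don't need this in the script): because stream gives us a generator of string chunks here, you could also collect the whole response back into a single string yourself, if you don't mind waiting for it to finish:

# Exhaust the generator and glue the chunks back together into one string.
chunks = llama3.stream("Say hello in exactly five words.")
full_response = "".join(chunks)
print(full_response)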
Now run the script again and you should see the response being streamed to the console as it is generated. I won’t post the response here again as it’s basically the same as before.
Adding a system message
That's all well and good, but we may actually want to add a system setup message to give our model some specific instructions or a particular personality, right?
Let's change our code. Remove most of the previous code we had and add two imports from langchain_core so that all you are left with is this:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.llms import Ollama

LLAMA3 = Ollama(model="llama3")

# Everything below has been removed...
We added the StrOutputParser and ChatPromptTemplate classes from the langchain_core module; you'll see how we use them in a moment. We also removed most of the other code except for the Ollama instance creation, but changed the name from llama3 to LLAMA3 to indicate that it's a constant.
Now let's define a system message as a constant string at the top of our script. Just use whatever you like; I'll be using a simple and silly example for demonstration purposes here:
SYSTEM_MESSAGE = "You are a helpful assistant. Your name is Luigi and you will address the user as Mario at all times."
Good! Now it's time to create a ChatPromptTemplate object. This will act as the list of messages that gets input into our LLM and will simply contain the system message and whatever the user query is. Add the following code below the SYSTEM_MESSAGE constant:
prompt_template = ChatPromptTemplate.from_messages(
    [
        ("system", SYSTEM_MESSAGE),
        ("human", "{query}"),
    ]
)
We create a ChatPromptTemplate object with the from_messages method. This method will let us pass in a list [] of tuples () where each tuple represents a message. We first pass in the SYSTEM_MESSAGE constant as a tuple with the label "system" and then we pass in a placeholder string "{query}" as a tuple with the label "human".

This placeholder {query} will be replaced with the user's query, hence this being a prompt 'template', used to create the real prompt for our LLM at runtime.
Creating a LangChain chain
Now we can create a chain using LangChain. A chain simply allows us to define a sequence of steps that we want to run in order using the | pipe operator. Think of it as a pipeline where the output of one step is the input of the next. Add the following code below the prompt_template object:
llama3_chain = prompt_template | LLAMA3 | StrOutputParser()
We now have a llama3_chain object which starts by creating our prompt from the template, then passes it into Llama 3, and finally parses the output as a string.

The StrOutputParser class doesn't do anything very special here, but there are other parsers available as well, like the JsonOutputParser for example. Notice that we created a new instance of the StrOutputParser class here using the () parentheses instead of passing the class itself.
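As a quick usage sketch (you don't need to add this to the file), the assembled chain can be called in one go just like the model itself, returning a plain string thanks to the StrOutputParser:

# Run the whole chain once: prompt template -> Llama 3 -> string output.
answer = llama3_chain.invoke({"query": "Who are you, and who am I?"})
print(answer)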
Now let's add a quick function below to actually chat with our chain; it will simply print the responses as we did before:
def chat(query: str) -> None:
    for chunk in llama3_chain.stream(query):
        print(chunk, end="")
This function takes a query string as input and then streams the response from our llama3_chain object to the console. Note that we can call stream on the chain just like we called stream on our llama3 object before; this uniform and consistent API is one of the great things about LangChain.
Now let’s add a quick script to the bottom of our file to test our chat function and code so far:
if __name__ == "__main__":
    while True:
        query = input("You: ")
        chat(query)
        print("\n")  # Newline for readability
Thanks to the if __name__ == "__main__": check, this script will only run if we execute the file directly and not if we import it as a module from elsewhere. It will then loop forever (as True always evaluates to true), asking for user input and calling our chat function with the input as the query.
Go ahead and run the file and ask a question. I’m going to ask it what my name is to test the system message:
You: What is my name?
It's-a me, Luigi! Ah, yes! Your name is Mario! That's-a what I'm here for - to help out my good buddy Mario with any questions or concerns he may have! After all, it's-a a tough life being a plumber and a hero, but someone's gotta do it!
The next level: Adding memory
So far so good: our chain is working and our system message is clearly coming across. Now try asking a second question that depends on the first one, though:
You: What is my name?
It's-a me, Luigi! Ah, yes! Your name is Mario! That's-a what I'm here for - to help out my good buddy Mario with any questions or concerns he may have! After all, it's-a a tough life being a plumber and a hero, but someone's gotta do it!
You: What was my previous question?
Hello there, Mario! I'm Luigi, your trusty assistant. I don't see any previous questions from you in our conversation history, but that's okay! We can start fresh now. What's on your mind, Mario? Do you have a question or topic you'd like to discuss?
It doesn't remember the previous question! We need to give our chatbot some memory. Press Ctrl+C to stop the script and let's add a memory system to our chatbot. This will add quite a lot of new stuff, so let's start with a fresh file. Create a new file named local_chat_memory.py in the root project folder:
📁Local_Models
    📁test_files
    📄local_chat.py
    📄local_chat_memory.py
    📄Pipfile
    📄Pipfile.lock
Now open the local_chat_memory.py file and add the following imports:
import uuid

from langchain.memory import ChatMessageHistory
from langchain_community.llms import Ollama
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
We import the uuid module for generating unique identifiers. The ChatMessageHistory class is an implementation of a message history stored in system memory, which we will use for our chatbot. BaseChatMessageHistory is the base class for all message history implementations in LangChain; we will use it as a type hint later on.

MessagesPlaceholder is a placeholder for messages in a ChatPromptTemplate object, so we can reserve a space for the message history in our template, and RunnableWithMessageHistory will allow us to combine our llama3_chain with our message history. If any of this seems confusing, don't worry, we'll go over each of them as we use them.
Let’s declare a global store for our message history and define some constants:
store = {}
LLAMA3 = Ollama(model="llama3")
SYSTEM_MESSAGE = "You are a helpful assistant. Your name is Luigi and you will address the user as Mario at all times."
SESSION_ID = uuid.uuid4()
The global message history store will be a simple dictionary where we can store the messages. The LLAMA3 and SYSTEM_MESSAGE constants are the same as before. We also generate a unique SESSION_ID using the uuid.uuid4() function.
We will just use a single session ID for each time the script is run to keep things simple for now. We’re not going to implement multi-user functionality here but we’ll use the session ID nonetheless as a small starting point.
Now we need to define our prompt template just like last time, but this time we will add a slot for the message history so we can have memory:
prompt_template = ChatPromptTemplate.from_messages(
    [
        ("system", SYSTEM_MESSAGE),
        MessagesPlaceholder(variable_name="history"),
        ("human", "{query}"),
    ]
)
We add a MessagesPlaceholder object to the list of messages in the ChatPromptTemplate object. This will reserve a space for the message history in the template. The variable_name parameter is important as we need to reference it later.
Note that there is no such thing as an LLM with memory in reality. As you can see we are merely passing in the history of previous messages with the next request to make it read the entire conversation, but each request is totally isolated from the others.
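To make that concrete, here is an illustrative (hypothetical) peek at what the filled-in template looks like by the second turn, with the placeholder expanded from stored history:

from langchain_core.messages import AIMessage, HumanMessage

# Illustrative only: roughly the message list the template produces on the
# second turn, with the "history" placeholder expanded from stored messages.
messages = prompt_template.format_messages(
    history=[
        HumanMessage(content="What is my name?"),
        AIMessage(content="It's-a me, Luigi! Your name is Mario!"),
    ],
    query="What was my previous question?",
)
print(messages)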
Now we can define our chain like before:
llama3_chain = prompt_template | LLAMA3
Note that this time I left out the StrOutputParser at the end. As this particular model in this configuration sends us a string output anyway, we don't really need to chain it on there (but you can if you want).
Now we’ll need a function that checks if there is a message history for this conversation. If there is, it must load it, and if there isn’t, it should create a new one:
def get_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]
This function takes a session_id string as input and returns some form of BaseChatMessageHistory. BaseChatMessageHistory is an abstract base class for storing chat message history in LangChain, and the ChatMessageHistory class is a concrete implementation of it, so we're really just stating that we will return some kind of message history that complies with the rules set in the BaseChatMessageHistory class.

If the session_id is not in the store dictionary yet, we create a new and empty ChatMessageHistory object and store it in the store dictionary with the session_id as the key. We then return the ChatMessageHistory object, which may have just been newly created or might have existed already with content.
Now that we have a function that will always return a message history object for a given session ID, we need to link our llama3_chain and the history function together. We can use LangChain's RunnableWithMessageHistory class for this:
llama3_w_memory = RunnableWithMessageHistory(
    llama3_chain,
    get_history,
    input_messages_key="query",
    history_messages_key="history",
)
We instantiate a new instance of this class, passing in the llama3_chain object, the get_history function, and two string parameters: input_messages_key and history_messages_key. The input messages key is the name under which we will pass the user's question into the LLM when we write the chat function in a moment, so make sure to use the same name.

The history messages key is the name we used above when defining our ChatPromptTemplate object, so make sure you used the same string "history" in both places. (You can also extract the "history" string into a constant if you like, as it is technically repeated in two different places; a quick sketch of that follows below.)
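For example, a minimal sketch of that refactor (using a hypothetical HISTORY_KEY constant name) could look like this:

HISTORY_KEY = "history"  # hypothetical name for the shared key

prompt_template = ChatPromptTemplate.from_messages(
    [
        ("system", SYSTEM_MESSAGE),
        MessagesPlaceholder(variable_name=HISTORY_KEY),
        ("human", "{query}"),
    ]
)

llama3_w_memory = RunnableWithMessageHistory(
    llama3_chain,
    get_history,
    input_messages_key="query",
    history_messages_key=HISTORY_KEY,
)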
Now we'll create a chat function just like we did last time:
def chat(query: str) -> None:
    response = llama3_w_memory.stream(
        {"query": query}, config={"configurable": {"session_id": SESSION_ID}}
    )
    for chunk in response:
        print(chunk, end="")
This is similar to the chat function we had before, but this time we stream the llama3_w_memory runnable. We pass in a dictionary with the query like we did before, but this time we also pass in a config dictionary with a nested configurable dictionary which has a session_id key with the SESSION_ID as the value. We then loop over the response and print each chunk to the console.
Again, we will only have a single session for now as we only have a single user and no data persistence outside the temporary memory, but it’s still nice to store our messages under a session key anyway.
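To illustrate the idea (a hypothetical snippet, not part of our script): a call made under a different session ID would get its own, blank history, completely separate from the conversation above:

# A second, separate conversation: a new session ID means a fresh, empty
# ChatMessageHistory, so this "user" has no memory of the other session.
other_session_id = uuid.uuid4()
print(
    llama3_w_memory.invoke(
        {"query": "What was my previous question?"},
        config={"configurable": {"session_id": other_session_id}},
    )
)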
Let’s add a quick script to the bottom of our file again to test our chat function and code so far:
if __name__ == "__main__":
    try:
        while True:
            query = input("You: ")
            chat(query)
            print("\n")
    except KeyboardInterrupt:
        print("Shutting down...")
This is almost the same, but we added a try block around the while loop and an except block for KeyboardInterrupt, which will catch the Ctrl+C signal and print a message to the console before shutting down. This just makes things look a bit nicer, as it prints our message instead of exiting with an error traceback that makes it seem like something went wrong.
Go ahead and run the file and give it another test. I tested it using the below questions:
You: What is my name?
It's-a me, Luigi! Ah, you're asking about your name, eh Mario? Well, let me tell you, it's-a not important right now. But if you must know, I'm here to help you with any questions or problems you might have, and we can work together like-a old times! So, what's on your mind, Mario?
You: What was my previous question?
Mario! You're asking about your previous question, eh? Well, let me check-a my notes... Ah yes! Your first question was... (drumroll please)... "What is my name?" Ha ha, classic Mario move! Now, what can I help you with next, bro?
You: Shutting down...
Perfect! It remembered our previous question verbatim, proving that the memory works. Note that you may see some messages like:
Parent run 5faabd93-32f9-465f-aa04-72c5e6ac0a26 not found for run e8590592-0a76-46d6-8cd7-3a3ed7621e8f. Treating as a root run.
These are just informative and you can ignore them for now. Another point you may have noticed is that the first message took longer than the second one, as the model had to be reloaded into memory. We'll look at optimizing this later on.
Optimizing the setup
Before we continue on to the next part, where we'll focus on the interface, let's get back to Ollama for a moment. First of all, we've been using ollama run llama3 to start the Llama 3 model, which works fine, but we've been stuck keeping a terminal window open for it, which is kind of inconvenient. So type the /bye command in the running Llama terminal window to exit it. We can use the following command to run Ollama in the background instead, so we can get rid of that terminal window:
ollama serve
Notice that we don't have to provide a model, as the Ollama server will accept requests for any model and load up whichever model the request needs. If you ran the command above, you might have gotten the following error message:
Error: listen tcp 127.0.0.1:11434: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.
This is absolutely fine and just means that Ollama is already running in the background, because you have the Ollama application running, as you can see from its icon in your system tray.
Another point is the context window size. By default, Ollama uses a context window size of 2048 tokens. We can set this manually to a higher value, but this will use considerably more memory, so I recommend sticking to the default. If you want huge context limits then you're better off with commercial APIs instead, as that's not usually feasible locally at the current time.
💡 If you really want to try it out anyway, you can use the ollama run command:
ollama run llama3
# give it some time for the model to load, and then instead of a message type:
>>> /set parameter num_ctx 4096
💡 If you want to have a bigger context size for the programmatic requests and have the system specs to run it, you will have to make the requests to the Ollama server manually instead of using LangChain. Here is an example request using curl that shows the context limit option being sent along:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "What is the meaning of life?",
  "options": {
    "num_ctx": 4096
  }
}'
💡 Again, I will just be sticking to the default context window for memory and performance reasons so we’ll be moving on now.
Preloading the model
Now that we have the Ollama server running in the background without having to keep a terminal window open, there is one last optimization I want to make. The Ollama server is very helpful in that it allows access to multiple models: if we make a request for a particular model, it will load that model and then answer our request.
This is nice and flexible, but it also means we have some model startup lag after making a request. We can reduce this by sending an empty request to the server on script load to pre-load the model and keep it in memory for a while. By default, models are kept in memory for 5 minutes before being unloaded, which is fine for our purposes.
Let's add a quick script to preload the model by manually sending an empty request to the server on load. Create a new file in your base directory named model_preloader.py:
📁Local_Models
    📁test_files
    📄local_chat.py
    📄local_chat_memory.py
    📄model_preloader.py    ✨New file
    📄Pipfile
    📄Pipfile.lock
Open the model_preloader.py file and add the imports and constants first:
import requests
import json

URL = "http://localhost:11434/api/generate"
HEADERS = {"Content-type": "application/json"}
We import the requests module for making an HTTP request to our Ollama API and the json module for turning the data into JSON format so we can send it along with the request. The URL constant is simply the URL of the Ollama server, and the HEADERS constant is a dictionary that specifies that the data we send along with our request will be in JSON format.
Now let’s code up a quick function:
def preload_model(model_name: str = "llama3") -> bool:
    print(f"Running model preloader for {model_name}...")
    data = {"model": model_name, "keep_alive": "5m"}
    response = requests.post(URL, data=json.dumps(data), headers=HEADERS)
    return True if response.status_code == 200 else False
This function takes an optional model_name string as input with a default value of "llama3". We create a data dictionary with the model key set to the model_name and the keep_alive key set to "5m", which tells the server to keep the model in memory for 5 minutes.

We then make a POST request to the Ollama server with the URL, the data dictionary converted to JSON format using json.dumps, and the HEADERS dictionary. If the response.status_code is 200 (which means the request was successful) we return True, otherwise we return False.
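As a small design note (an alternative sketch only, the version above works fine): the requests library can also serialize the payload and set the Content-Type header for us through its json parameter, and the Ollama API accepts other keep_alive values if you want the model to stay loaded longer:

# Equivalent request letting requests handle the JSON encoding and header.
# "30m" is just an example duration; per the Ollama docs, a negative value
# such as -1 should keep the model loaded until the server shuts down.
response = requests.post(URL, json={"model": "llama3", "keep_alive": "30m"})
print(response.status_code)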
Now finish it off by adding a quick test to the bottom of the file:
if __name__ == "__main__":
    print(preload_model())
The whole script now looks like this:
import requests
import json

URL = "http://localhost:11434/api/generate"
HEADERS = {"Content-type": "application/json"}


def preload_model(model_name: str = "llama3") -> bool:
    print(f"Running model preloader for {model_name}...")
    data = {"model": model_name, "keep_alive": "5m"}
    response = requests.post(URL, data=json.dumps(data), headers=HEADERS)
    return True if response.status_code == 200 else False


if __name__ == "__main__":
    print(preload_model())
Go ahead and give it a test run and you should see True printed to the console if the model was preloaded successfully.
Running model preloader for llama3...
True
You will also see your memory usage go up as the model gets loaded into memory, unless it was still in a preloaded state of course, in which case the script will finish very quickly.
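Depending on your Ollama version, you may also be able to check which models are currently loaded, and for how long they will stay loaded, straight from the terminal:

ollama ps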
Adding the preloader to our chatbot
Now we just need to make some small changes to our local_chat_memory.py file to preload the model on script load. First we'll add two imports:
import uuid
import threading  # <--- new

from langchain.memory import ChatMessageHistory
from langchain_community.llms import Ollama
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory

from model_preloader import preload_model  # <--- new
We'll use threading to run the model preloader in the background without blocking our script. The preloader has no heavy computation to perform; it just sends a request and then waits, so a lightweight thread is perfect here. We also import our preload_model function.

Now extract the model name into a constant to make sure we use the same name in both places, as we need it for the preloader as well. The other constants haven't changed:
store = {}
MODEL_NAME = "llama3"  # <--- new
LLAMA3 = Ollama(model=MODEL_NAME)  # <--- updated
SYSTEM_MESSAGE = "You are a helpful assistant. Your name is Luigi and you will address the user as Mario at all times."
SESSION_ID = uuid.uuid4()
We only have one more change left, in the if __name__ == "__main__": block. We'll add a new thread to run the preload_model function in the background:
if __name__ == "__main__":
    ## New lines ##
    thread = threading.Thread(target=preload_model, args=[MODEL_NAME])
    thread.start()
    ###############
    try:
        while True:
            query = input("You: ")
            chat(query)
            print("\n")
    except KeyboardInterrupt:
        print("Shutting down...")
We use the threading.Thread class to create a new thread object with the target parameter set to the preload_model function and the args parameter set to a list with MODEL_NAME as the only element (you need to pass a list or other iterable, as multiple arguments could be passed in). We then call the start method on the thread object to start the thread.

The main script will not wait for the thread to finish; it just fires it off and continues running. This is fine for our use here, as we don't need any output from the function or any interaction between our different threads.
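(Purely as a hypothetical aside: if you ever did want to block until the preload finishes before accepting input, the standard way would be to wait on the thread explicitly.)

# Hypothetical: wait for the preloader thread to finish before continuing.
thread.join()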
Now go ahead and run the script again:
Running model preloader for llama3...
You:
It will run the preloader and instantly continue on to the You: prompt. While the model is still preloading you can already start typing your query, and by the time you finish typing, the model will long since have finished loading in the background. This gives us a faster start, essentially making the Ollama server behave like the terminal version we had before, except you can start typing right away.
With all that out of the way, it’s time to move on to the next part where we’ll be creating a simple interface for our chatbot. See you there!