Lesson 3 – Database Querying using ChatGPT

Welcome back to part 3, where we’ll be doing something mind-blowing yet again!

In this part, we will be connecting ChatGPT to a database and then using ChatGPT to extract information from the database using only natural language on the end user’s part. This is a very powerful tool that can be used to create a chatbot that can answer questions about information in an in-company database that is not available otherwise.

As with each part, we have some additional setup work to do. We will be creating a database utility, a prompt_setup message to feed ChatGPT, and of course a function description object to explain to ChatGPT how our database function we will feed it works.

If you don’t know SQL, don’t worry. ChatGPT will be generating the database queries for us. SQL is a fascinating and fun subject all on its own, but we’ll not go off on that particular path for this tutorial series.

Database information utility

Let’s start with our database utility first. In your root project directory, create a folder called ‘3_Database_functions‘ and then inside start with a new empty file called ‘database_util.py‘.

📁FINX_FUNC_EMBED
    📁1_Simple_function_call
    📁2_Parallel_function_calling
    📁3_Database_functions
        📄database_util.py
    📄.env

In this file, we will create a class called Database, which we’ll just use as a quick container for some of the utility functions we’ll be using. Let’s first initialize the class and then declare our methods inside:

class Database:
    def __init__(self, conn):
        self.conn = conn

The class “database” takes a database connection as an argument after which we set self.conn to the connection. This will allow us to use the connection in our methods. Continue with the next method inside the class:

    def get_table_names(self):
        table_names = []
        query = "SELECT name FROM sqlite_schema WHERE type='table';"
        tables = self.execute(query)
        for table in tables:
            table_names.append(table[0])
        return table_names

The first method we’ll create is get_table_names. This method will return a list of all the table names in our database. We do this by first creating an empty list called table_names. Then we create a query that will return all the table names in our database, note that the query is quite readable, even if you don’t know SQL, simply selecting the names of everything that is of type table. We execute the query using the execute method we will create later. We then loop through the results and append the first item in each result to our table_names list.

    def get_column_names(self, table_name):
        column_names = []
        query = f"PRAGMA table_info({table_name});"
        columns = self.execute(query)
        for column in columns:
            column_names.append(column[1])
        return column_names

The second method we’ll create is get_column_names.

This method will return a list of all the column names in a given table. We do this by first creating an empty list called column_names. Then we create a query that will return all the column names in a given table.

This query looks a bit more cryptic, but basically, the PRAGMA table_info() method will allow us to get metadata about the columns in a specific table.

Again, we won’t go too deep into SQL in this tutorial and you don’t need to know or understand it deeply to follow along. We then execute the query using the execute method we will create in a moment. We then loop through the results and append the second item in each result to our column_names list, as this index holds the column name.

    def get_database_info(self):
        database_info = []
        for table_name in self.get_table_names():
            column_names = self.get_column_names(table_name)
            database_info.append(
                {"table_name": table_name, "column_names": column_names}
            )
        return database_info

The next method we’ll create is get_database_info. This method will return a list of dictionaries containing the table names and column names for each table in our database, using the two methods we already created above.

For each table_name in self.get_table_names() it will get the column_names by calling self.get_column_names(table_name) and then append a dictionary containing the table_name and column_names to our database_info list.

    def execute(self, query):
        res = self.conn.execute(query)
        return res.fetchall()

The next method we’ll create is execute. This method will execute a given query and return the results. We do this by first executing the query using self.conn.execute(query) and then returning the results using res.fetchall().

    def close(self):
        self.conn.close()

The last method we’ll create is close which will close the database connection. The full database utility class now looks like this:

class Database:
    def __init__(self, conn):
        self.conn = conn

    def get_table_names(self):
        table_names = []
        query = "SELECT name FROM sqlite_schema WHERE type='table';"
        tables = self.execute(query)
        for table in tables:
            table_names.append(table[0])
        return table_names

    def get_column_names(self, table_name):
        column_names = []
        query = f"PRAGMA table_info({table_name});"
        columns = self.execute(query)
        for column in columns:
            column_names.append(column[1])
        return column_names

    def get_database_info(self):
        database_info = []
        for table_name in self.get_table_names():
            column_names = self.get_column_names(table_name)
            database_info.append(
                {"table_name": table_name, "column_names": column_names}
            )
        return database_info

    def execute(self, query):
        res = self.conn.execute(query)
        return res.fetchall()

    def close(self):
        self.conn.close()

Again, if you’re less familiar with SQL, don’t worry about it, we won’t be doing too much SQL in this tutorial as ChatGPT will be doing all of this for us!

Prompt setup

Let’s prepare our prompt_setup for the database-querying-ChatGPT-bot next. In your '3_Database_functions' folder create a new file named prompt_setup.py

📁FINX_FUNC_EMBED
    📁1_Simple_function_call
    📁2_Parallel_function_calling
    📁3_Database_functions
        📄database_util.py
        📄prompt_setup.py
    📄.env

Inside we’ll just define a simple variable with a basic setup explaining to ChatGPT what its new identity is.

database_query_bot_setup = "You are a in company amazon bot providing information on customer review helpfulness data. You answer the users query in the most helpful way possible using the database and function provided. Provide plenty of information."

The printer utility

To easily read the output, we will be using the ColorPrinter utility class we created in part 1. Copy the 'printer.py' file from the '1_Simple_function_call' folder and paste a copy into the '3_Database_functions' folder.

📁FINX_FUNC_EMBED
    📁1_Simple_function_call
        📄printer.py
    📁2_Parallel_function_calling
    📁3_Database_functions
        📄database_util.py
        📄prompt_setup.py
        📄printer.py
    📄.env

Alternatively, if you don’t have the 'printer.py' code for any reason, copy the code below and paste it inside the 'printer.py' file:

class ColorPrinter:
    _colors = {
        "yellow": "\033[33m",
        "green": "\033[32m",
        "blue": "\033[34m",
        "purple": "\033[35m",
        "cyan": "\033[36m",
        "white": "\033[37m",
        "closing_tag": "\033[00m",
    }

    def _get_current_color(index, no_of_messages) -> str:
        if index == 0:
            return ColorPrinter._colors["yellow"]
        elif index == no_of_messages - 1:
            return ColorPrinter._colors["purple"]
        elif index % 2 == 0:
            return ColorPrinter._colors["blue"]
        return ColorPrinter._colors["green"]

    def color_print(messages) -> None:
        no_of_messages = len(messages)
        cyan_open_tag = ColorPrinter._colors["cyan"]
        color_closing_tag = ColorPrinter._colors["closing_tag"]
        print(f"\n{cyan_open_tag}###### Conversation History ######{color_closing_tag}")
        for index, message in enumerate(messages):
            color = ColorPrinter._get_current_color(index, no_of_messages)
            print(f"{color}{message}{color_closing_tag}")
        print(f"{cyan_open_tag}##################################{color_closing_tag}\n")

You would normally never, ever do this and copy the same code in multiple places, but to keep this tutorial simple and divided into progressive separate folders without getting into the intricacies of Python imports and going too far off-topic, we’ll just copy the code in multiple places for tutorial purposes only.

The database

To read information from a database, first, we’ll need an actual database with good data! I’ve based the setup on a freely available SQL database. The database contains information about customer reviews on Amazon for us to play around with. Go to the following link and download the database:

https://www.kaggle.com/datasets/joychakraborty2000/amazon-customers-data

You’ll probably need to log in using your Google account or create a quick account to get the download, don’t worry, it’s all free. When you download the file, unzip it and place the database.sqlite file inside your '3_Database_functions' directory.

You can open this database using SQLiteBrowser or similar software if you want to look around inside, but you don’t have to do so to follow along. Just make sure to take the database.sqlite file from the download, and place it inside your '3_Database_functions' folder:

📁FINX_FUNC_EMBED
    📁1_Simple_function_call
    📁2_Parallel_function_calling
    📁3_Database_functions
        📄database_util.py
        🗃️database.sqlite
        📄prompt_setup.py
        📄printer.py
    📄.env

I’ll quickly describe the basic table structure inside for your convenience:

Table: Reviews (Single table database)
    Columns:
        Id
            -Just a database Id
        ProductId
            -The Id of the product the review is about
        UserId
            -The Id of the user who wrote the review
        ProfileName
            -The nickname of the user who wrote the review
        HelpfulnessNumerator
            -The number of 'helpful' votes this review has received
        HelpfulnessDenominator
            -The total number votes the review has received ('helpful' and 'unhelpful').
        Score
            -The rating between 1 and 5 the user gave the product in their review
        Time
            -The timestamp for the review
        Summary
            -The summary of the review
        Text
            -The text of the review

Function description

Next up is our function description that we feed to ChatGPT to let it know what our function does and what arguments it takes. In your '3_Database_functions' folder create a new file named 'function_description.py'.

📁FINX_FUNC_EMBED
    📁1_Simple_function_call
    📁2_Parallel_function_calling
    📁3_Database_functions
        📄database_util.py
        🗃️database.sqlite
        📄function_description.py
        📄prompt_setup.py
        📄printer.py
    📄.env

Inside this file let’s define our function description, which you’ll notice is actually a function that returns a function description:

def describe_get_info_from_database(schema) -> dict:
    return {
        "type": "function",
        "function": {
            "name": "get_info_from_database",
            "description": """Use this function to answer questions about amazon customers' reviews and their helpfulness.
            If the user asks for a customer's name, this will refer to their ProfileName specifically.
            The 'HelpfulnessNumerator' is the number of 'helpful' votes the review has received.
            The 'HelpfulnessDenominator' is the total number votes the review has received ('helpful' and 'unhelpful').
            The 'Score' column indicates the rating between 1 and 5 the user gave the product in their review.
            You are not to use the 'UserId' column in your queries, use the 'ProfileName' as identifier instead.
            Argument should be a fully formed SQL query.""",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": f"""
                        SQL query extracting info from the database to answer the user's question.
                        The database schema is as follows:
                        {schema}
                        The query should be returned in string format as a single command.
                        """,
                    }
                },
                "required": ["query"],
            },
        },
    }

This function description is quite a bit more complex than the ones we’ve created before. This is because we’re providing ChatGPT with some hints about the database structure to help it perform optimally. You can see I’ve specifically cleared up some confusion about the column namings to be very clear about what the meaning of the ProfileName, HelpfulnessNumerator, and HelpfulnessDenominator columns is exactly and what they mean.

This sort of setup comes about by just playing around and seeing where ChatGPT has problems, then providing it with more information to help it out. As some of the column names in the dataset were kind of weird I’ve simply provided descriptions of exactly what they are. If you want to keep it a generic database tool you’d of course remove these.

The parameters here are very interesting. We’ve seen this part defining the arguments that ChatGPT has to provide several times before, but this time we ask it to provide an argument that has to be a fully formed SQL query. You’ll notice this function description is not a variable like the previous ones but a function that takes an argument of ‘schema‘ and returns the description.

This allows us to pass in the database schema we get from calling our other database utility and embed it inside the argument description. We ask ChatGPT for a fully formed SQL query as an argument to the function, based on the database schema variable plugged in.

Querying our database using ChatGPT

Ok, now that we have our database utility class, prompt setup, function description, and database file all ready, it’s time to have some fun here! Create a file in the '3_Database_functions' folder called 'database_gpt.py'.

📁FINX_FUNC_EMBED
    📁1_Simple_function_call
    📁2_Parallel_function_calling
    📁3_Database_functions
        📄database_gpt.py
        📄database_util.py
        🗃️database.sqlite
        📄function_description.py
        📄prompt_setup.py
        📄printer.py
    📄.env

Inside 'database_gpt.py' we’ll start with our imports up top:

import json
import sqlite3
from pathlib import Path

from decouple import config
from openai import OpenAI

from database_util import Database
from function_description import describe_get_info_from_database
from printer import ColorPrinter as Printer
from prompt_setup import database_query_bot_setup

We’ll be using the JSON library again for the dumps (dump to string) function, and sqlite3 for the database connection. We will use the pathlib library to create a path to our database file, and OpenAI and decouple are familiar. The rest is just importing our own code from the other files.

MODEL = "gpt-3.5-turbo-1106"
client = OpenAI(api_key=config("OPENAI_API_KEY"))
current_directory = Path(__file__).parent
conn = sqlite3.connect(current_directory / "database.sqlite")

First, we define our model, create a client, and get the current directory. Calling Path and passing in the __file__ variable will return the current file’s path, and calling .parent on that will return the parent directory, which will be our ‘3_Database_functions‘ folder, no matter where it is located on your computer. We then create a connection to our database using sqlite3.connect() and pass in the path to our database file by adding / "database.sqlite" to the current_directory path.

company_db = Database(conn)
database_schema: str = str(company_db.get_database_info())
print(database_schema)

Then we create a new Database object called company_db using the Database class we created earlier, passing in the connection we just made.

We get the database schema in string format so we can pass it into the function description later. As our helper function’s .get_database_info() method returns a list of dictionaries we wrapped the whole thing inside a str() call to stringify it, because we cannot feed an object to ChatGPT which only takes text as input.

Next, we’ll create a quick helper function we’ll actually use for the calls:

def get_info_from_database(query) -> str:
    try:
        res = company_db.execute(query)
        return json.dumps(res)
    except Exception as e:
        return f"Error executing query: {e}, please try again, passing in a valid SQL query in string format as only argument."

This function will take a query as an argument, execute it using our company_db object, and then return the results in string format as ChatGPT handles only strings. If an error occurs, we will not raise an error but instead send back a string format helpful error message, so we can feed it back to ChatGPT to inform it and have it try again.

From here on, most of the following code should be quite familiar, so we’ll write/discuss the whole function at once:

def ask_company_db(query):
    messages = [
        {"role": "system", "content": database_query_bot_setup},
        {"role": "user", "content": query},
    ]
    tools = [describe_get_info_from_database(database_schema)]

    response = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        tools=tools,
        tool_choice={
            "type": "function",
            "function": {"name": "get_info_from_database"},
        },
    )

    response_message = response.choices[0].message
    messages.append(response_message)

    while response_message.tool_calls:
        tool_calls = response_message.tool_calls
        available_functions = {
            "get_info_from_database": get_info_from_database,
        }
        for call in tool_calls:
            func_name: str = call.function.name
            func_to_call = available_functions[func_name]
            func_args: dict = json.loads(call.function.arguments)
            func_response = func_to_call(**func_args)

            messages.append(
                {
                    "tool_call_id": call.id,
                    "role": "tool",
                    "name": func_name,
                    "content": func_response,
                }
            )

        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            tools=tools,
            tool_choice="auto",
        )
        response_message = response.choices[0].message
        messages.append(response_message)

    Printer.color_print(messages)
    return response_message.content

Our function takes in the query as an argument. We then create a messages list containing the database_query_bot_setup prompt setup and the query the user asked. We then create a tools list containing the function description we created earlier. Remember that this time our function description was actually a function, so we pass in the database schema for our function to return the actual function description we need with the database schema embedded inside of it.

We then call ChatGPT, passing in the messages and tools lists, but this time we also specify a tool_choice, calling our function by name. We will force ChatGPT to make a function call using this specific function for now. Then we get the response and append its message to the message history as we’ve always done.

Next, we open a while loop that will keep running as long as the response message has tool_calls. Remember that if the model fails to generate proper SQL, we will get our error string and feed that back to ChatGPT which will generate another tool_call so we cannot assume there will be only a single call, hence the while loop.

We then get the tool_calls from the response message, specify the available functions, and loop through the tool_calls. For each tool_call we get the function name, get the function to call from the available_functions dictionary, get the function arguments by parsing the JSON, and call the function passing in the arguments.

Note that this code is unnecessarily complex since we only have one possible function and we could just call it directly passing in the arguments, but coding in this manner allows us to easily add more functions later without having to change the code.

We append the function response to the message history as always, passing in the id, role, function name, and content. Then we make our next call to ChatGPT with the now extended message history holding the SQL database query’s result (or possibly an error). We append the response we get back and it should not have a tool_call anymore, unless there is an error, breaking us out of the while loop, triggering the color printer, and returning the response message’s content.

Time to ask some questions!

Good, let’s ask some questions! Add a print statement to the bottom of the file for a quick test:

print(
    ask_company_db(
        "What is the name of the user who wrote the largest number of helpful reviews?"
    )
)

And now run our file:

###### Conversation History ######
{'role': 'system', 'content': 'You are a in company amazon bot providing information on customer review helpfulness data. You answer the users query in the most helpful way possible using the database and function provided. Provide plenty of information.'}
{'role': 'user', 'content': 'What is the name of the user who wrote the largest number of helpful reviews?'}
ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_DPHl9hmF1sfUlKch0AmCYUSy', function=Function(arguments='{"query":"SELECT ProfileName, COUNT(Id) AS HelpfulReviews FROM Reviews WHERE HelpfulnessNumerator > 0 GROUP BY ProfileName ORDER BY HelpfulReviews DESC LIMIT 1;"}', name='get_info_from_database'), type='function')])
{'tool_call_id': 'call_DPHl9hmF1sfUlKch0AmCYUSy', 'role': 'tool', 'name': 'get_info_from_database', 'content': '[["O. Brown \\"Ms. O. Khannah-Brown\\"", 364]]'}
ChatCompletionMessage(content='The user who wrote the largest number of helpful reviews is "O. Brown \\"Ms. O. Khannah-Brown\\"" with a total
of 364 helpful reviews.', role='assistant', function_call=None, tool_calls=None)
##################################

The user who wrote the largest number of helpful reviews is "O. Brown \"Ms. O. Khannah-Brown\"" with a total of 364 helpful reviews.

That’s pretty darn cool! You can see that ChatGPT has generated a SQL query for us based on the question we asked. We then called the function using the generated SQL query and fed the result back to ChatGPT which came back to us with a natural language response. The end user asks a natural language question and gets a natural language response.

Note that if you want consistent results you will have to be very specific with your questions. The above question is open for interpretation on what exactly is a helpful review. Any review with more than 1 helpful vote? More than 10? A ratio of helpful votes of 50% or more compared to unhelpful votes? The above question can therefore possibly generate inconsistent answers based on the exact definition if the user is vague.

Let’s try another one:

print(
    ask_company_db(
        "What is the name of the user who got the greatest number of 'helpful' votes over all their reviews combined?"
    )
)

And We get:

###### Conversation History ######
{'role': 'system', 'content': 'You are a in company amazon bot providing information on customer review helpfulness data. You answer the users query in the most helpful way possible using the database and function provided. Provide plenty of information.'}
{'role': 'user', 'content': "What is the name of the user who got the greatest number of 'helpful' votes over all their reviews combined?"}
ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_l1ZoHlh02YPlQ98oWSr5ZLCO', function=Function(arguments='{"query":"SELECT ProfileName, SUM(HelpfulnessNumerator) AS TotalHelpfulVotes FROM Reviews GROUP BY ProfileName ORDER BY TotalHelpfulVotes DESC LIMIT 1"}', name='get_info_from_database'), type='function')])
{'tool_call_id': 'call_l1ZoHlh02YPlQ98oWSr5ZLCO', 'role': 'tool', 'name': 'get_info_from_database', 'content': '[["D. Truong \\"Duke of New Mexico\\"", 5594]]'}
ChatCompletionMessage(content='The user who received the greatest number of \'helpful\' votes over all their reviews combined is D. Truong "Duke of New Mexico", with a total of 5594 helpful votes. This user\'s reviews have been particularly impactful in helping others make informed decisions about products.', role='assistant', function_call=None, tool_calls=None)
##################################

The user who received the greatest number of 'helpful' votes over all their reviews combined is D. Truong "Duke of New Mexico", with a total of 5594 helpful votes. This user's reviews have been particularly impactful in helping others make informed decisions about products.

Play around with it and try asking a question, or try the following two questions if you like:

"What is the name of the user who wrote the largest number of negative reviews?"
"What are the names of the 10 users who wrote the most reviews?"

I don’t know about you, but my mind is pretty much blown! When you’re done playing around with this, let’s move on to part 4 where we’ll be diving into the magical world of embeddings!