Lesson 4: Database querying using ChatGPT

💡 Full Course with Videos and Course Certificate (PDF): https://academy.finxter.com/university/openai-api-function-calls-and-embeddings/

Welcome back to part 4, where we’ll be doing something mind-blowing yet again! In this part, we will be connecting ChatGPT to a database and then using ChatGPT to extract information from the database using only natural language on the end user’s part. This is a very powerful tool that can be used to create a chatbot that can answer questions about information in an in-company database that is not available otherwise.

As with each part, we have some additional setup work to do. We will be creating a database utility, a prompt_setup message to feed ChatGPT, and of course a function description to explain to ChatGPT how our database function we will feed it works.

If you don’t know SQL, don’t worry. ChatGPT will be generating the database queries for us. SQL is a fascinating and fun subject all on its own, but we’ll not go off on that particular path for this tutorial series.

Database information utility

Let’s start with our database utility first. In your utils folder, create a new file called database.py

> utils
    > database.py

In this file, we will create a class called Database, which we’ll just use as a quick container for some of our utility functions we’ll be using. Let’s first initialize our class and then declare our methods inside:

class Database:
    def __init__(self, conn):
        self.conn = conn

The class “database” takes a database connection as an argument after which we set self.conn to the connection. This will allow us to use the connection in our methods.

    def get_table_names(self):
        table_names = []
        query = "SELECT name FROM sqlite_schema WHERE type='table';"
        tables = self.execute(query)
        for table in tables:
            table_names.append(table[0])
        return table_names

The first method we’ll create is get_table_names. This method will return a list of all the table names in our database. We do this by first creating an empty list called table_names. Then we create a query that will return all the table names in our database, note that the query is quite readable, even if you don’t know SQL. We then execute the query using the execute method we will create later. We then loop through the results and append the first item in each result to our table_names list.

    def get_column_names(self, table_name):
        column_names = []
        query = f"PRAGMA table_info({table_name});"
        columns = self.execute(query)
        for column in columns:
            column_names.append(column[1])
        return column_names

The second method we’ll create is get_column_names. This method will return a list of all the column names in a given table. We do this by first creating an empty list called column_names. Then we create a query that will return all the column names in a given table. This query looks a bit more cryptic, but basically, the PRAGMA table_info() method will allow us to get metadata about the columns in a specific table. Again, we won’t go too deep into SQL in this tutorial. We then execute the query using the execute method we will create in a moment. We then loop through the results and append the second item in each result to our column_names list.

    def get_database_info(self):
        database_info = []
        for table_name in self.get_table_names():
            column_names = self.get_column_names(table_name)
            database_info.append(
                {"table_name": table_name, "column_names": column_names}
            )
        return database_info

The next method we’ll create is get_database_info. This method will return a list of dictionaries containing the table names and column names for each table in our database, using the two methods we already created above. For each table_name in self.get_table_names() it will get the column_names by calling self.get_column_names(table_name) and then append a dictionary containing the table_name and column_names to our database_info list.

    def execute(self, query):
        res = self.conn.execute(query)
        return res.fetchall()

The next method we’ll create is execute. This method will execute a given query and return the results. We do this by first executing the query using self.conn.execute(query) and then returning the results using res.fetchall().

    def close(self):
        self.conn.close()

The last method we’ll create is close which will close the database connection. The full database utility class now looks like this:

class Database:
    def __init__(self, conn):
        self.conn = conn

    def get_table_names(self):
        table_names = []
        query = "SELECT name FROM sqlite_schema WHERE type='table';"
        tables = self.execute(query)
        for table in tables:
            table_names.append(table[0])
        return table_names

    def get_column_names(self, table_name):
        column_names = []
        query = f"PRAGMA table_info({table_name});"
        columns = self.execute(query)
        for column in columns:
            column_names.append(column[1])
        return column_names

    def get_database_info(self):
        database_info = []
        for table_name in self.get_table_names():
            column_names = self.get_column_names(table_name)
            database_info.append(
                {"table_name": table_name, "column_names": column_names}
            )
        return database_info

    def execute(self, query):
        res = self.conn.execute(query)
        return res.fetchall()

    def close(self):
        self.conn.close()

Again, if you’re less familiar with SQL, don’t worry about it, we won’t be doing too much SQL in this tutorial as ChatGPT will be doing this for us!

Prompt setup

Let’s prepare our prompt_setup for the database-querying-ChatGPT-bot next. In your prompt_setups folder create a new file named database.py

> prompt_setups
    > database.py

Inside we’ll just define a simple variable with a basic setup explaining to ChatGPT what its new identity is.

database_query_bot_setup = "You are a in company amazon bot providing information on customer review helpfulness data. You answer the user's query in the most helpful way possible using the database and function provided. Provide plenty of information."

The database

I’ve based the setup on a freely available SQL database. The database contains information about customer reviews on Amazon for us to play around with. Go to the following link and download the database:

https://www.kaggle.com/datasets/joychakraborty2000/amazon-customers-data

You’ll probably need to log in using your Google account or create a quick account to get the download. When you downloaded the file, unzip it and place the database.sqlite file inside your base directory.

You can open this database using SQLiteBrowser or similar software if you want to look around inside, but you don’t have to do so to follow along. Just make sure the database.sqlite file is in your tutorial base directory. I’ll quickly describe the basic table structure inside for your convenience:

Table: Reviews (Single table database)
    Columns:
        Id
            -Just a database Id
        ProductId
            -The Id of the product the review is about
        UserId
            -The Id of the user who wrote the review
        ProfileName
            -The nickname of the user who wrote the review
        HelpfulnessNumerator
            -The number of 'helpful' votes this review has received
        HelpfulnessDenominator
            -The total number of 'helpful' or 'unhelpful' votes this review has received
        Score
            -The rating between 1 and 5 the user gave the product in their review
        Time
            -The timestamp for the review
        Summary
            -The summary of the review
        Text
            -The text of the review

Function description

Next up is our function description that we feed to ChatGPT to let it know what our function does and what arguments it takes. In your func_descriptions folder create a new file named database.py

> func_descriptions
    > database.py

Inside this file let’s define our function description:

def describe_get_info_from_database(schema):
    return {
        "name": "get_info_from_database",
        "description": """Use this function to answer questions about amazon customers' reviews and their helpfulness.
        If the user asks for a customer's name, this will refer to their ProfileName specifically.
        The 'HelpfulnessNumerator' is the number of 'helpful' votes the review has received.
        The 'HelpfulnessDenominator' is the total number 'helpful' or 'unhelpful' votes the review has received.
        The 'Score' column indicates the rating between 1 and 5 the user gave the product in their review.
        You are not to use the 'UserId' column in your queries, use the 'ProfileName' as identifier instead.
        Argument should be a fully formed SQL query.""",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": f"""
                        SQL query extracting info from the database to answer the user's question.
                        The database schema is as follows:
                        {schema}
                        The query should be returned in string format as a single command.
                        """,
                }
            },
            "required": ["query"],
        },
    }

This function description is quite a bit more complex than the ones we’ve created before. This is because I’ve provided ChatGPT with some hints about the database structure to help it perform optimally. You can see I’ve specifically cleared up some confusion about the column namings to be very clear about what the meaning of the ProfileName, HelpfulnessNumerator and HelpfulnessDenominator columns is exactly and what they mean.

This sort of setup comes about by just playing around and seeing where ChatGPT has problems, then providing it with more information to help it out. As some of the column names in the dataset were kind of weird I’ve simply provided descriptions on exactly what they are.

The parameters here are very interesting. We’ve seen this part defining the arguments that ChatGPT has to provide several times before, but this time we ask it to provide an argument that has to be a fully formed SQL query. You’ll notice this function description is not a variable like the previous ones but a function that takes an argument of ‘schema’ and returns the description. This allows us to pass in the database schema inside the argument description. We ask ChatGPT for a fully formed SQL query as an argument to the function, based on the database schema variable plugged in.

Querying our database using ChatGPT

Ok, now that we have our database utility class, prompt setup, function description, and database file all ready, it’s time to have some fun here! Create a file in the base directory called ‘Da_database_functions.py’.

We’ll start with our imports up top:

import json
import sqlite3

from utils.printer import ColorPrinter as Printer
from utils.database import Database
from apis.chat_gpt import gpt_3_5_turbo_0613
from prompt_setups.database import database_query_bot_setup
from func_descriptions.database import describe_get_info_from_database

We’ll be using the json library again for the dumps (dump to string) function, and sqlite3 for the database connection. All the rest are our own files we’ve created in the previous parts.

conn = sqlite3.connect("./database.sqlite")
print("Database connection successful.")

company_db = Database(conn)
database_schema: str = str(company_db.get_database_info())

First, we establish a connection to the database (make sure both files are in your base directory or set the path accordingly, you may also need to cd into your base directory when running the Python file later). Then we create a new Database object called company_db using our Database class we created earlier, passing in the connection we just made.

Finally, we get the database schema in string format so we can pass it into the function description later. As our helper function’s .get_database_info() method returns a list of dictionaries we wrapped the whole thing inside a str() call.

Next, we’ll create a quick helper function we’ll actually use for the calls:

def get_info_from_database(query) -> str:
    try:
        res = company_db.execute(query)
        return json.dumps(res)
    except Exception as e:
        raise Exception(f"Error executing query: {e}, please try again, passing in a valid SQL query in string format as only argument.")

This function will take a query as an argument, execute it using our company_db object and then return the results in string format as ChatGPT handles only strings. If an error occurs, it will raise an exception with a helpful message.

From here on, most of the following code should be quite familiar, so we’ll discuss the whole function at once:

def ask_company_db(query):
    messages = [
        {"role": "system", "content": database_query_bot_setup},
        {"role": "user", "content": query},
    ]
    functions = [describe_get_info_from_database(database_schema)]

    current_response = gpt_3_5_turbo_0613(
        messages, functions, function_call={"name": "get_info_from_database"}
    )
    current_message = current_response["choices"][0]["message"]
    messages.append(current_message)

    if current_message.get("function_call"):
        available_functions = {
            "get_info_from_database": get_info_from_database,
        }
        function_name = current_message["function_call"]["name"]
        function_to_call = available_functions[function_name]
        function_args = json.loads(current_message["function_call"]["arguments"])
        function_response = function_to_call(**function_args)

        messages.append(
            {
                "role": "function",
                "name": function_name,
                "content": function_response,
            }
        )

        current_response = gpt_3_5_turbo_0613(messages, functions)
        current_message = current_response["choices"][0]["message"]
        messages.append(current_message)

    Printer.color_print(messages)
    return current_message["content"]

Our function takes in the query as an argument. We then create a messages list containing the database_query_bot_setup prompt setup and the query the user asked. We then create a functions list containing the function description we created earlier. Remember that this time our function description was actually a function, so we pass in the database schema for our function to return the actual function description we need with the database schema plugged in.

We then call our own gpt_3_5_turbo_0613 helper function, passing in the messages and functions lists, but this time we also specify a function call, calling our function by name. We will force ChatGPT to make a function call using this specific function for now. Then we get the current response and append its message to the messages history as we’ve always done.

Next, we check if the current message has a function call in it. It always should, as we sort of force it. We then specify the available functions, get the function to be called, get the arguments, and call the function passing in the arguments as we always have. Note that this code is unnecessarily complex since we only have one possible function and we could just call it directly passing in the arguments, but coding in this manner allows us to easily add more functions later without having to change the code. If you’re sure you’ll never add more functions feel free to just call the function more directly and pass in the arguments.

We append the function response to the message history as always, then make our next response to ChatGPT with the now extended message history, then get the message and append it to the history and print the history and return the result.

Time to ask some questions!

Good, let’s ask some questions! Add a print statement to the bottom of the file for a quick test:

print(
    ask_company_db(
        "What is the name of the user who wrote the largest number of helpful reviews?"
    )
)

And now run our file:

###### Conversation History ######
system : You are a in company amazon bot providing information on customer review helpfulness data. You answer the users query in the most helpful way possible using the database and function provided. Provide plenty of information.
user : What is the name of the user who wrote the largest number of helpful reviews?
assistant : get_info_from_database({
"query": "SELECT ProfileName, COUNT(*) AS HelpfulReviews FROM Reviews WHERE HelpfulnessDenominator > 0 AND HelpfulnessNumerator/HelpfulnessDenominator >= 0.5 GROUP BY ProfileName ORDER BY HelpfulReviews DESC LIMIT 1"
})
function : [["O. Brown \"Ms. O. Khannah-Brown\"", 297]]
assistant : The user who wrote the largest number of helpful reviews is O. Brown "Ms. O. Khannah-Brown". They have written a total of 297 helpful reviews.
##################################

That’s pretty darn cool! You can see that ChatGPT has generated a SQL query for us based on the question we asked it. We then called the function using the generated SQL query and fed the result back to ChatGPT which came back to us with a natural language response. The end user asks a natural language question and gets a natural language response.

Let’s try another one:

print(
    ask_company_db(
        "What is the name of the user who got the greatest number of 'helpful' votes over all their reviews combined?"
    )
)

And We get:

###### Conversation History ######
system : You are a in company amazon bot providing information on customer review helpfulness data. You answer the users query in the most helpful way possible using the database and function provided. Provide plenty of information.
user : What is the name of the user who got the greatest number of 'helpful' votes over all their reviews combined?
assistant : get_info_from_database({
"query": "SELECT ProfileName, SUM(HelpfulnessNumerator) as TotalHelpfulVotes FROM Reviews GROUP BY ProfileName ORDER BY TotalHelpfulVotes DESC LIMIT 1"
})
function : [["D. Truong \"Duke of New Mexico\"", 5594]]
assistant : The user who received the greatest number of 'helpful' votes over all their reviews combined is D. Truong "Duke of New Mexico". They have a total of 5594 helpful votes.
##################################

Play around with it and try asking a question, or try the following two questions if you like:

"What is the name of the user who wrote the largest number of negative reviews?"
"What are the names of the 10 users who wrote the most reviews?"

When you’re done playing around with this, I’ll see you in part 5, where we’ll be calling functions that do not even exist!