OpenAI Fine-Tuning (2/4): Generating the Training Data

Hi and welcome back to part 2, where we will be generating the training data for our own ChrisGPT model! We’ll pick up where we left off in part 1 and use our HTML to Markdown converter utility, among other things, to write and run the training dataset generator that will build the ChrisGPT training data for us.

API Key

We need to start by setting our key for the OpenAI API, but we cannot hardcode this one in our source code. Go to https://platform.openai.com/api-keys and copy your API key (create a new one if you don’t have one yet). You’ll only pay for what you use, which will be mere cents if you just play around with it casually. Then create a new file called .env in the root folder of your project:

📁Finx_Fine_Tuning
    📁data
        📄Finx_dataset.json
    📁utils
        📄html_email.py
    📄.env                  (new file)
    📄Pipfile
    📄Pipfile.lock

And paste your API key in there like this, making sure not to use any spaces or quotes:

OPENAI_API_KEY=your_api_key_here

Then go ahead and save and close this file. (You can also use environment variables to set the key if you prefer, but this is simpler to set up for tutorial purposes as it works the same on all platforms.)

Constants

Now, let’s define some project-wide settings by creating a new file named constants.py in the root directory of our project. This file will contain the constants that we will use multiple times throughout our project:

📁Finx_Fine_Tuning
    📁data
        📄Finx_dataset.json
    📁utils
        📄html_email.py
    📄.env
    📄constants.py        (new file)
    📄Pipfile
    📄Pipfile.lock

Run the following command in your terminal to add the python-decouple package inside your pipenv environment:

pipenv install python-decouple==3.7

We will use this package to read the .env file and get the API key from it. Then install the openai library as well:

pipenv install openai==1.12.0

Make sure you run the installs even if you already have the packages installed, as we need to make sure they are installed in the virtual environment we are using for this project. Now, open the constants.py file and add the following code:

from pathlib import Path

from decouple import config
from openai import OpenAI

CLIENT = OpenAI(api_key=str(config("OPENAI_API_KEY")))
DATA_DIRECTORY = Path(__file__).parent / "data"

We import Path from pathlib, the config function from the decouple package, and the OpenAI class from the openai package. We then create a CLIENT constant that holds an instance of the OpenAI class, reading the API key from the .env file with the config function and passing it in.

💡 Python Top-tip 💡
The 'config' function from the 'decouple' package reads the value of the environment variable from the '.env' file and returns it as a string. If you share or upload your code to a public repository, make sure to add the '.env' file to your '.gitignore' file to avoid sharing your API key with others.
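
For example, a minimal .gitignore for this project could look like the sketch below; the .env line is the one that keeps your key out of version control, and the other entries are just common Python clutter:

.env
__pycache__/
*.pyc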

We also create a DATA_DIRECTORY constant that will be a Path object pointing to the data directory. Path(__file__) returns the absolute path of the current file, and parent returns the parent directory of the current file (which is constants.py so that gives us the root folder). We then use the / operator to add the data directory to the path.
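
As a quick illustration of how these Path objects compose (the folder names here are hypothetical):

from pathlib import Path

# Hypothetical location: if constants.py lived at /home/user/Finx_Fine_Tuning/constants.py
file_path = Path("/home/user/Finx_Fine_Tuning/constants.py")
print(file_path.parent)           # /home/user/Finx_Fine_Tuning
print(file_path.parent / "data")  # /home/user/Finx_Fine_Tuning/data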

The Data Generator

Go ahead and save and close the constants.py file. Then create a new file in the root directory called chris_gpt_dataset_generator.py:

📁Finx_Fine_Tuning
    📁data
        📄Finx_dataset.json
    📁utils
        📄html_email.py
    📄.env
    📄constants.py
    📄chris_gpt_dataset_generator.py        (new file)
    📄Pipfile
    📄Pipfile.lock

Now let’s install two libraries before we dive in here. We’ll combine them into a single command this time:

pipenv install tqdm==4.65.0 && pipenv install tenacity==8.2.3

The tqdm package will be used to easily add a progress bar to our generator, and the tenacity package will be used to retry the API requests if they fail for some reason. Both are extremely simple in their usage, as you will see.
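
To get a feel for both before we use them in earnest, here is a tiny standalone sketch; the flaky_task function and its numbers are made up purely for demonstration:

import random
import time

from tenacity import retry, stop_after_attempt, wait_fixed
from tqdm import tqdm

@retry(wait=wait_fixed(1), stop=stop_after_attempt(3), reraise=True)
def flaky_task(item: int) -> int:
    # Fails randomly; tenacity retries up to 3 attempts, waiting 1 second
    # in between. If all three attempts fail, the error is re-raised.
    if random.random() < 0.2:
        raise ConnectionError("Simulated transient failure")
    return item * 2

for item in tqdm(range(10), desc="Demo"):  # tqdm draws the progress bar
    flaky_task(item)
    time.sleep(0.1)  # just so the bar is visible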

Ok now open the chris_gpt_dataset_generator.py file and we will start with our imports for the file:

import json

from tenacity import retry, stop_after_attempt, wait_fixed
from tqdm import tqdm

from constants import CLIENT, DATA_DIRECTORY
from utils import html_email

We use json to load and later save back the JSON data, tenacity and tqdm for the reasons we just mentioned, and then we import the CLIENT and DATA_DIRECTORY constants from our constants module, as well as the html_email utility we just made. Now let’s set up some base constants for use in our generator only:

INPUT_DATA = DATA_DIRECTORY / "Finx_dataset.json"
OUTPUT_DATA = DATA_DIRECTORY / "Finx_completed_dataset.json"
MODEL: str = "gpt-3.5-turbo-0125"
TOTAL_TOKENS_USED = 0
RATE_ERROR_MESSAGE = "There was an error calling 'get_fictional_user_query'. Perhaps the OpenAI ChatGPT rate limit has been reached. Retrying one more time in 60 seconds to reset the rate limiter..."

We define INPUT_DATA and OUTPUT_DATA constants that will be Path objects pointing to the Finx_dataset.json and Finx_completed_dataset.json files, respectively. The output one doesn’t exist yet obviously, but that is fine. The MODEL constant is just a string with the model name we use.

💡 OpenAI Top-tip 💡
There are many versions of the gpt-3.5-turbo model:

✨ gpt-3.5-turbo-0613. This one is the older version.
✨ gpt-3.5-turbo-1106. Has much improved functionality and a larger context window compared to the 0613 model. This is the newest version that can be used for fine-tuning purposes.
✨ gpt-3.5-turbo-0125. The latest minor update to the 1106 model, this one cannot be used for fine-tuning yet. Aside from the minor improvements, it also has a reduced price.

This is why we will use the `0125` version for the data generation in this part of the tutorial, but later when we do the fine-tuning we will be using the `1106` version instead.

We also define a TOTAL_TOKENS_USED variable that will keep track of the total number of tokens used in the API requests, and a RATE_ERROR_MESSAGE constant that we will reference for an error message later. As all of the above constants will only be used in this script, we defined them inside the chris_gpt_dataset_generator.py file itself.

The data structure

Now we’ll define a class that will serve as the data structure “blueprint” if you will, for each example in our dataset. This is exactly the same structure we used in the Harry Potter magical spells example in part 1:

class TrainingDataEntry:
    def __init__(self, fictional_user_query, markdown_email) -> None:
        self.data = {
            "messages": [
                {
                    "role": "system",
                    "content": 'You are a helpful assistant that writes emails for the Finxter email newsletter, adhering perfectly to the style and tone of the Finxter brand and Chris\' writing style. You will respond in the following format: {"subject": "The subject of the email", "body": "The body of the email in Markdown formatting"}.',
                },
                {"role": "user", "content": fictional_user_query},
                {"role": "assistant", "content": markdown_email},
            ]
        }

We have a class named TrainingDataEntry, which has an __init__ method that takes two arguments: fictional_user_query and markdown_email. The __init__ method initializes the data attribute of the TrainingDataEntry object with a dictionary that has a messages key, which is a list of dictionaries.

The first dictionary in the list has a role key with the value "system" and a content key with a long string that explains the role of the assistant. This is going to be the same for all examples of the training data so I just hard-coded it in the class.

The second dictionary in the list has a role key with the value "user" and a content key with the value of the fictional_user_query argument. The third dictionary in the list has a role key with the value "assistant" and a content key with the value of the markdown_email argument.

So the system message is the same for each entry and the assistant message is simply the email that we already have saved in our dataset and can convert using the HTML to Markdown converter we built.
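
As a quick sanity check, here is what instantiating the class with some made-up values gives us (both arguments here are hypothetical):

entry = TrainingDataEntry(
    fictional_user_query="- Topic one\n- Topic two",
    markdown_email="{'subject': 'Hello!', 'body': 'Hi {User}, ...'}",
)
print(entry.data["messages"][1])
# {'role': 'user', 'content': '- Topic one\n- Topic two'}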

The only challenge we have left is the user message, which is supposed to hold the user query that resulted in the generated email. We will have to generate a fictional user query for each “result” email we have in our dataset. For our case, this will mostly be a short summary of the email’s contents, as we want to train ChrisGPT to generate an email when we give it the topics to write about.

Getting the user query (that was never made)

So let’s create a function that will take an email as input and generate a fictional user query for it:

def get_fictional_user_query(email: str) -> str:
    global TOTAL_TOKENS_USED
    response = CLIENT.chat.completions.create(
        model=MODEL,
        messages=[
            {
                "role": "system",
                "content": 'You will be provided with an email in the following format:{"subject": "The subject of the email", "body": "The body of the email in Markdown formatting"}. Your task is to go back in time and write a rough draft of the topics the email writer decided to discuss in the email. This will basically be a summary of the email\'s topics formatted in short bullet points, that the author would have used as a basis to then write the real email.',
            },
            {
                "role": "user",
                "content": f"Here is the output LLM generated email: {email}. Bullet point draft of the topics discussed in the email:",
            },
        ],
    )
    if not response.usage or not response.choices[0].message.content:
        raise Exception("Invalid response from OpenAI API")
    TOTAL_TOKENS_USED += response.usage.total_tokens
    return response.choices[0].message.content

We define a function named get_fictional_user_query that takes an email string argument and will return a string. Inside the function, we first reference the TOTAL_TOKENS_USED variable as a global variable, so that we can update its value from inside the function. Then we call the CLIENT.chat.completions.create method with the model and messages arguments.

For the prompt setup message we basically just explain that the LLM will receive an email in the format of a subject and body, and that it should write a rough draft of the topics the email writer decided to discuss in the email. This will basically be a summary of the email’s topics formatted in short bullet points. That way we can give our trained model a list of bullet points to write an email for us later on.

The user message is then just the email we received as input, plus a short message requesting the bullet-point draft of the topics discussed. Note how it is stated as “Bullet point draft of the topics discussed in the email:”, ending with a colon (:) so that the LLM will answer with the bullet points right away instead of first giving an introduction like “Sure, I can help you with that”.

We then double-check that the response has usage data and that the first choice’s message content is not empty. If either check fails we raise an exception, but if everything is fine we add the total_tokens used in the response to the TOTAL_TOKENS_USED variable and return the message content.
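
If you want to spot-check the function in isolation, a hypothetical test call could look like this (it needs a valid API key in your .env and is not part of the final script):

sample_email = (
    '{"subject": "Python One-Liners", '
    '"body": "Hey {User}, today we look at three neat one-liners..."}'
)
print(get_fictional_user_query(sample_email))  # prints a short bullet-point summary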

Making it robust

Ok, so that is pretty good so far, but we will be calling this function many times in a row, saving the data into an object in memory which we only write to a file at the end. If we get an error halfway through, the script will crash out and we have to start again from the beginning. We can also hit an error because of a rate limit, as we will be making many requests in a row. This is not ideal, so we will use the tenacity package to retry the API requests if they fail for some reason.

We can simply add the @retry decorator to our function, by adding it straight above the function definition like this:

@retry(
    wait=wait_fixed(60),
    stop=stop_after_attempt(2),
    reraise=True,
    before_sleep=lambda _: print(RATE_ERROR_MESSAGE),
)
def get_fictional_user_query(email):
    ... function here ...

We use the @retry decorator to specify that we want to retry the function if it fails due to any type of error. As you can see the tenacity library is very easy to read and use. We specify that we want to wait 60 seconds between each retry and that we want to stop after 2 attempts, maximum.

The 60 seconds are in case we hit a rate limit, as trying again right away might trigger the rate limit again. We also specify that we want to reraise the exception if the function fails after the maximum number of attempts, so we can see what the original error was.

Now the last part is a bit trickier. As tenacity will stop the program from crashing out, we won’t see the error and we won’t be aware of what happened. This is why we added a before_sleep function that will print the RATE_ERROR_MESSAGE before each retry. This way we can keep ourselves up to date in the terminal about what is happening.

The problem is that if I just pass in print(RATE_ERROR_MESSAGE), the print call will be executed right away when Python first evaluates the decorator arguments, and before_sleep would receive the call’s return value (None) instead of a function. This is not what we want, so we need to wrap it in a lambda function, which will only execute the print call when the lambda itself is called.

The lambda _: print(RATE_ERROR_MESSAGE) is basically equivalent to:

# Do not put in your code
def before_sleep_function(_):
    print(RATE_ERROR_MESSAGE)

But it’s just much shorter to write. The _ is the name of the input argument, as tenacity will pass an argument (the retry state) to our before_sleep function, but we won’t be using it, so we just name it _ to sort of throw it away.

Ok so now we have a TrainingDataEntry class to hold the data for each entry and a function to generate the fictional user queries that will automatically retry if there is an error. See how we’re making the problem smaller and smaller?

Putting it all together

All we need to do now is create a TrainingDataEntry object for each entry in the training data, get the fictional user query by calling our function, and then save all the objects to a new JSON file to get our training data. Let’s do that now:

with open(INPUT_DATA, "r", encoding="utf-8") as file:
    input_data = json.load(file)
    output_data = []

We open the INPUT_DATA file in read mode and use the json.load function to load the JSON data from the file into the input_data variable. We then create an empty list and assign it to the output_data variable. So far so good, right? Let’s finish it:

with open(INPUT_DATA, "r", encoding="utf-8") as file:
    input_data = json.load(file)
    output_data = []
    for finx_email in tqdm(input_data, desc="Generating training data"):
        finx_email["body"] = html_email.html_to_markdown(finx_email["body"])
        training_data_entry = TrainingDataEntry(
            fictional_user_query=get_fictional_user_query(finx_email),
            markdown_email=str(finx_email),
        )
        output_data.append(training_data_entry.data)

First of all, we open a loop over each finx_email in the input_data list. input_data is wrapped inside tqdm() here because this allows us to add a progress bar to the loop. The desc argument is used to specify the description of the progress bar, which will be “Generating training data”.

Now remember that each finx_email is a dictionary with a subject and body key. We convert the body from HTML to Markdown using our html_to_markdown utility function, and then we create a TrainingDataEntry object for the finx_email. The fictional_user_query is obtained by calling our get_fictional_user_query function with the finx_email as input, and the markdown_email is just the finx_email dictionary converted to a string. The str() conversion here is important as the OpenAI API will crash if you send it all sorts of nested objects as input, but if they’re in string format it will work fine.
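
To see what that str() conversion actually produces, here is a tiny hypothetical example; note the single quotes of Python’s dict repr, which you will recognize again in the output file below:

finx_email = {"subject": "Hello!", "body": "# Hi {User}\nSome Markdown..."}
print(str(finx_email))
# {'subject': 'Hello!', 'body': '# Hi {User}\nSome Markdown...'}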

We then append the dictionary that the data attribute of the TrainingDataEntry object holds to the output_data list. Once this has run for every entry in the dataset, all that is left is to save our data:

with open(OUTPUT_DATA, "w", encoding="utf-8") as file:
    json.dump(output_data, file, indent=4)

print(f"Total tokens used: {TOTAL_TOKENS_USED}")

This opens the OUTPUT_DATA file in write mode and uses the json.dump function to save the output_data list to the file in JSON format. We also specify the indent argument as 4 to make the JSON data more readable. Finally, we added a print statement to see how many tokens we used in the process. I’m not going to calculate the exact costs here as it would add unneeded complexity for a one-time-use script. It will be roughly ~$0.30 if you make the full 200 calls, possibly a bit less.
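
If you do want a ballpark figure yourself, a back-of-the-envelope sketch like the one below works; the per-1K-token prices and the input/output split are assumptions based on gpt-3.5-turbo pricing at the time of writing, so check OpenAI’s current pricing page before trusting the number:

# Assumed prices in USD per 1K tokens -- verify against the pricing page!
PRICE_PER_1K_INPUT = 0.0010
PRICE_PER_1K_OUTPUT = 0.0020

def rough_cost(total_tokens: int, output_fraction: float = 0.3) -> float:
    # We only tracked the combined total, so we guess the input/output split.
    input_tokens = total_tokens * (1 - output_fraction)
    output_tokens = total_tokens * output_fraction
    return (
        input_tokens / 1000 * PRICE_PER_1K_INPUT
        + output_tokens / 1000 * PRICE_PER_1K_OUTPUT
    )

print(f"~${rough_cost(225_000):.2f}")  # ~$0.29 under these assumptions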

Time to sit back!

And that’s it! We now have a data generator. Go ahead and run your chris_gpt_dataset_generator.py file and you should see a progress bar in the terminal as it generates the data. It will take a few minutes to run. You may run into a rate limit, in which case tenacity will kick in for you and pause the process for 60 seconds before continuing. (This is also the reason we’re not using fancy async API calls here: we would hit the rate limit anyway, and they would introduce pointless complexity to a script we won’t run often.)

Let it run and grab a coffee☕ or your favorite snack🥨 while you watch that glorious progress bar creep up. Relax for a moment, you’ve earned it 😄! When it’s done you should have a new Finx_completed_dataset.json file in your data directory.

Generating training data: 100%|███████████████████████████████████████████████| 200/200 [08:56<00:00,  2.68s/it]
Total tokens used: 224800

Open it up and take a look at the data, it should be a list of dictionaries with the same structure as the data attribute of the TrainingDataEntry class:

[
    {
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant that writes emails for the Finxter email newsletter, adhering perfectly to the style and tone of the Finxter brand and Chris' writing style. You will respond in the following format: {\"subject\": \"The subject of the email\", \"body\": \"The body of the email in Markdown formatting\"}."
            },
            {
                "role": "user",
                "content": "- Investment opportunities in OpenAI\n- Five proxies to invest in OpenAI:\n  - MSFT\n  - NVIDIA\n  - ARKVX\n  - META\n  - TSLA\n- Community corner featuring TradeUnafraid\n- Invitation to feature one's startup, YouTube channel, blog, or website\n- Feedback request on the email course\n- Invitation to invite a friend or colleague to the Finxter Email Academy\n- Download link for the Ultimate Python Cheat Sheet\n- Motivational quote on knowledge compounding\n- Contact information for Dr. Christian Mayer\n- Unsubscribe option"
            },
            {
                "role": "assistant",
                "content": "{'subject': '5 Proxies to Investing in OpenAI', 'body': '[![]({Link})]({Link})\\n\\nHey {User},\\n\\nTo profit from change, we need to increase ownership of disruptive trends.\\nToday\\'s article covers a question that many Finxters frequently ask:\\n\\n\ud83e\udeb4 [ **Blog** ] [How to Invest in OpenAI?]({Link}) \ud83c\udf33\\n\\nWhile it\\'s not possible to invest in OpenAI directly, the blog discusses five\\nalternatives:\\n\\n  * **MSFT** (49% stake in OpenAI),\\n  * **NVIDIA** (makes more revenue from OpenAI than any other company),\\n  * **ARKVX** ( _Anthropic!_ ),\\n  * **META** ( _Llama 2!_ ), and\\n  * **TSLA** (Optimus!).\\n\\nCheck it out if you\\'re interested in any of those! No financial advice. \ud83d\ude0a\\n\\nBe on the right side of change. \ud83d\ude80  \\nChris\\n\\n**\u2665\ufe0f Community Corner: Featured Resources**\\n\\n  * [TradeUnafraid]({Link}) is a trading platform owned and operated by Finxter community member Lee.\\n\\nDo you want to feature your own startup, YouTube channel, blog, or website as\\na [Finxter premium member]({Link})? Hit reply and let me know!\\n\\n**[How are we doing?]({Link})**  \\n[\u2b50]({Link})  \\n[\u2b50\u2b50]({Link})  \\n[\u2b50\u2b50\u2b50]({Link})  \\n  \\nTo make sure you keep getting these emails, please add _chris@finxter.com_ to\\nyour address book.  \\n  \\nI\\'d love to hear your feedback so that I can improve this free email course\\nover time. Please reply to this email and share everything on your mind!  \\n  \\n**If you find the Finxter Email Academy useful, please invite a friend or\\ncolleague! \u2764**\\n\\n  \\nHere\\'s the subscription link you can share:  \\n[https://blog.finxter.com/subscribe/]({Link})  \\n  \\nDownload the Ultimate Python Cheat Sheet here (direct PDF download): \ud83d\udc0d\\n\\n**[The Ultimate Python Cheat Sheet]({Link})**  \\n  \\nNot very motivated to learn today? Consider this:  \\n**_\"Knowledge compounds!\"_** \\\\-- Warren Buffett  \\n  \\nConsequently, if you improve your skills by 1% every day, you\\'ll 36x your\\nprogramming skills within a year!\\n\\n  \\n_**Finxter, Dr. Christian Mayer**  \\n{Address}., {City}, {Country}_\\n\\nWant out of the loop? I\\'m so sad to see you go. \ud83d\ude22 How could we have done\\nbetter?  \\nTo help future Finxters, please hit reply and tell us! \ud83e\udd17\\n\\n[Unsubscribe here]({Link})\\n\\n![]({Link})\\n\\n'}"
            }
        ]
    },
    ... many more entries ...

Perfect, this has all the data we need to train our ChrisGPT model. We have the system message that is the same for all entries, the user message that is a fictional user query for the email, and the assistant’s response message that is the email itself. That’s it for part 2 of this tutorial. I’ll see you in part 3 where we will be fine-tuning our ChrisGPT model using the data we just generated 🚀. See you there!