Hi and welcome back to part 2, where we will be generating the training data for our own ChrisGPT model! We’ll pick up where we left off in part 1 and use our HTML to Markdown converter utility and more to write and run our training dataset generator that will build the ChrisGPT training data for us.
API Key
We need to start by setting our key for the OpenAI API, but we cannot hardcode this one in our source code. Go to https://platform.openai.com/api-keys and copy your API key. If you don't have one, make sure to get one. You'll only pay for what you use, which will be cents if you just play around with it casually. Then create a new file called `.env` in the root folder of your project:
```
📁Finx_Fine_Tuning
    📁data
        📄Finx_dataset.json
    📁utils
        📄html_email.py
    📄.env (new file)
    📄Pipfile
    📄Pipfile.lock
```
And paste your API key in there like this, making sure not to use any spaces or quotes:
OPENAI_API_KEY=your_api_key_here
Then go ahead and save and close this file. (You can also use environment variables to set the key if you prefer, but this is simpler to set up for tutorial purposes as it works the same on all platforms.)
Constants
Now, let's define some project-wide settings by creating a new file named `constants.py` in the root directory of our project. This file will contain the constants that we will use multiple times throughout our project:
```
📁Finx_Fine_Tuning
    📁data
        📄Finx_dataset.json
    📁utils
        📄html_email.py
    📄.env
    📄constants.py (new file)
    📄Pipfile
    📄Pipfile.lock
```
Run the following command in your terminal to add the `python-decouple` package inside your `pipenv` environment:
pipenv install python-decouple==3.7
We will use this package to read the `.env` file and get the API key from it. Then install the `openai` library as well:
pipenv install openai==1.12.0
Make sure you run the installs even if you already have the packages installed, as we need to make sure they are installed in the virtual environment we are using for this project. Now, open the `constants.py` file and add the following code:
```python
from pathlib import Path

from decouple import config
from openai import OpenAI

CLIENT = OpenAI(api_key=str(config("OPENAI_API_KEY")))
DATA_DIRECTORY = Path(__file__).parent / "data"
```
We import `Path` from `pathlib`, the `OpenAI` class from the `openai` package, and the `config` function from the `decouple` package. We then create a `CLIENT` constant that is an instance of the `OpenAI` class, passing it the API key read from the `.env` file with the `config` function.
💡 Python Top-tip 💡
The `config` function from the `decouple` package reads the value of the environment variable from the `.env` file and returns it as a string. If you share or upload your code to a public repository, make sure to add the `.env` file to your `.gitignore` file to avoid sharing your API key with others.
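As a minimal illustration of how `config` resolves values (the `DEBUG` setting below is a hypothetical extra, included only to show the optional `default` and `cast` parameters):

```python
from decouple import config

api_key = config("OPENAI_API_KEY")  # read from the .env file (or the environment)

# Hypothetical extra setting, showing default values and type casting:
debug = config("DEBUG", default=False, cast=bool)
```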
We also create a `DATA_DIRECTORY` constant that is a `Path` object pointing to the `data` directory. `Path(__file__)` returns the absolute path of the current file, and `parent` returns the directory that contains it (which for `constants.py` is the root folder). We then use the `/` operator to append the `data` directory to the path.
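Here is a quick sketch of how those `pathlib` pieces compose; the paths in the comments are hypothetical examples:

```python
from pathlib import Path

here = Path(__file__)  # e.g. /home/user/Finx_Fine_Tuning/constants.py
root = here.parent     # e.g. /home/user/Finx_Fine_Tuning
data = root / "data"   # e.g. /home/user/Finx_Fine_Tuning/data
print(data.exists())   # True if the data directory is present
```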
The Data Generator
Go ahead and save and close the `constants.py` file. Then create a new file in the root directory called `chris_gpt_dataset_generator.py`:
```
📁Finx_Fine_Tuning
    📁data
        📄Finx_dataset.json
    📁utils
        📄html_email.py
    📄.env
    📄constants.py
    📄chris_gpt_dataset_generator.py (new file)
    📄Pipfile
    📄Pipfile.lock
```
Now let’s install two libraries before we dive in here. We’ll combine them into a single command this time:
pipenv install tqdm==4.65.0 && pipenv install tenacity==8.2.3
The `tqdm` package will be used to easily add a progress bar to our generator, and the `tenacity` package will be used to retry the API requests if they fail for some reason. Both are extremely simple in their usage, as you will see.
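If you want a tiny standalone taste of `tqdm` before then, here is a minimal sketch (not part of our generator):

```python
from time import sleep

from tqdm import tqdm

# Wrapping any iterable in tqdm() prints a live progress bar while you loop.
for _ in tqdm(range(10), desc="Demo"):
    sleep(0.1)  # simulate some work
```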
Ok, now open the `chris_gpt_dataset_generator.py` file and we will start with our imports for the file:
```python
import json

from tenacity import retry, stop_after_attempt, wait_fixed
from tqdm import tqdm

from constants import CLIENT, DATA_DIRECTORY
from utils import html_email
```
We use `json` to load and save back the JSON data, `tenacity` and `tqdm` for the reasons we just mentioned, and then we import the `CLIENT` and `DATA_DIRECTORY` constants from the `constants` module, plus the `html_email` utility we just made. Now let's set up some base constants for use in our generator only:
```python
INPUT_DATA = DATA_DIRECTORY / "Finx_dataset.json"
OUTPUT_DATA = DATA_DIRECTORY / "Finx_completed_dataset.json"
MODEL: str = "gpt-3.5-turbo-0125"
TOTAL_TOKENS_USED = 0
RATE_ERROR_MESSAGE = (
    "There was an error calling 'get_fictional_user_query'. Perhaps the"
    " OpenAI ChatGPT rate limit has been reached. Retrying one more time"
    " in 60 seconds to reset the rate limiter..."
)
```
We define `INPUT_DATA` and `OUTPUT_DATA` constants that are `Path` objects pointing to the `Finx_dataset.json` and `Finx_completed_dataset.json` files, respectively. The output file doesn't exist yet, but that is fine. The `MODEL` constant is just a string with the name of the model we use.
💡 OpenAI Top-tip 💡
There are many versions of the gpt-3.5-turbo model:
✨ gpt-3.5-turbo-0613. This one is the older version.
✨ gpt-3.5-turbo-1106. Has much improved functionality and a larger context window compared to the 0613 model. This is the newest version that can be used for fine-tuning purposes.
✨ gpt-3.5-turbo-0125. The latest minor update to the 1106 model; this one cannot be used for fine-tuning yet. Aside from the minor improvements, it also has a reduced price.
This is why we will use the `0125` version for the data generation in this part of the tutorial, but later, when we do the fine-tuning, we will use the `1106` version instead.
We also define a `TOTAL_TOKENS_USED` variable that will keep track of the total number of tokens used across the API requests, and a `RATE_ERROR_MESSAGE` that we reference for an error message later. As all of the above constants are only used by this script, we defined them inside the `chris_gpt_dataset_generator.py` file itself.
The data structure
Now we’ll define a class that will serve as the data structure “blueprint” if you will, for each example in our dataset. This is exactly the same structure we used in the Harry Potter magical spells example in part 1:
```python
class TrainingDataEntry:
    def __init__(self, fictional_user_query, markdown_email) -> None:
        self.data = {
            "messages": [
                {
                    "role": "system",
                    "content": 'You are a helpful assistant that writes emails for the Finxter email newsletter, adhering perfectly to the style and tone of the Finxter brand and Chris\' writing style. You will respond in the following format: {"subject": "The subject of the email", "body": "The body of the email in Markdown formatting"}.',
                },
                {"role": "user", "content": fictional_user_query},
                {"role": "assistant", "content": markdown_email},
            ]
        }
```
We have a class named `TrainingDataEntry`, which has an `__init__` method that takes two arguments: `fictional_user_query` and `markdown_email`. The `__init__` method initializes the `data` attribute of the `TrainingDataEntry` object with a dictionary that has a `messages` key holding a list of dictionaries.
The first dictionary in the list has a `role` key with the value `"system"` and a `content` key with a long string that explains the role of the assistant. This is going to be the same for all examples in the training data, so I just hard-coded it in the class.
The second dictionary in the list has a `role` key with the value `"user"` and a `content` key with the value of the `fictional_user_query` argument. The third dictionary has a `role` key with the value `"assistant"` and a `content` key with the value of the `markdown_email` argument.
So the `system` message is the same for each entry, and the `assistant` message is simply the email that we already have saved in our dataset and can convert using the HTML to Markdown converter we built.
The only challenge we have left is the `user` message, which is supposed to hold the user query that resulted in the generated email. We will have to generate a fictional user query for each "result" email we have in our dataset. For our case this should mostly be a short summary of the email's contents, as we want to train ChrisGPT to generate an email when we give it the topics to write on.
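To make that shape concrete, here is a quick usage sketch of the class; the query and email strings are made up for illustration:

```python
entry = TrainingDataEntry(
    fictional_user_query="- Topic one\n- Topic two",  # hypothetical bullet points
    markdown_email="{'subject': 'Hello', 'body': '# Hi...'}",  # hypothetical email
)
print(entry.data["messages"][1])
# {'role': 'user', 'content': '- Topic one\n- Topic two'}
```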
Getting the user query (that was never made)
So let’s create a function that will take an email as input and generate a fictional user query for it:
```python
def get_fictional_user_query(email: str) -> str:
    global TOTAL_TOKENS_USED
    response = CLIENT.chat.completions.create(
        model=MODEL,
        messages=[
            {
                "role": "system",
                "content": 'You will be provided with an email in the following format:{"subject": "The subject of the email", "body": "The body of the email in Markdown formatting"}. Your task is to go back in time and write a rough draft of the topics the email writer decided to discuss in the email. This will basically be a summary of the email\'s topics formatted in short bullet points, that the author would have used as a basis to then write the real email.',
            },
            {
                "role": "user",
                "content": f"Here is the output LLM generated email: {email}. Bullet point draft of the topics discussed in the email:",
            },
        ],
    )
    if not response.usage or not response.choices[0].message.content:
        raise Exception("Invalid response from OpenAI API")
    TOTAL_TOKENS_USED += response.usage.total_tokens
    return response.choices[0].message.content
```
We define a function named `get_fictional_user_query` that takes an `email` string argument and returns a string. Inside the function, we first declare `TOTAL_TOKENS_USED` as a global variable so that we can update its value from inside the function. Then we call the `CLIENT.chat.completions.create` method with the `model` and `messages` arguments.
For the prompt setup message, we basically just explain that the LLM will receive an email in the format of a subject and body, and that it should write a rough draft of the topics the email writer decided to discuss in the email. This will basically be a summary of the email's topics formatted in short bullet points. That way we can give our trained model a list of bullet points to write an email for us later on.
The `user` message is then just the email we received as input, plus a short request for the bullet-point draft of the topics discussed. Note how it is stated as "Bullet point draft of the topics discussed in the email:", ending with a colon so that the LLM will answer with the bullet points right away, without giving some introduction like "Sure, I can help you with that".
We then double-check that the `response` has a `usage` attribute and that the `content` of the first choice's message is not empty. If either of these checks fails we raise an exception, but if everything is fine we add the `total_tokens` used by the request to the `TOTAL_TOKENS_USED` variable and return the `content` of the message.
Making it robust
Ok, so that is pretty good so far, but we will be calling this function many times in a row and saving the data into an object in memory, which we only write to a file at the end. If we get an error halfway through, the script will crash out and we'd have to start again from the beginning. We could also hit an error because of a rate limit, as we will be making many requests in a row. This is not ideal, so we will use the `tenacity` package to retry the API requests if they fail for some reason.
We can simply add the `@retry` decorator straight above the function definition like this:
```python
@retry(
    wait=wait_fixed(60),
    stop=stop_after_attempt(2),
    reraise=True,
    before_sleep=lambda _: print(RATE_ERROR_MESSAGE),
)
def get_fictional_user_query(email: str) -> str:
    ...  # function body unchanged
```
We use the `@retry` decorator to specify that we want to retry the function if it fails with any type of error. As you can see, the `tenacity` library is very easy to read and use. We specify that we want to wait 60 seconds between retries and stop after a maximum of 2 attempts.
The 60 seconds are in case we hit a rate limit, as trying again right away might trigger the rate limit again. We also specify that we want to reraise the exception if the function fails after the maximum number of attempts, so we can see what the original error was.
Now the last part is a bit trickier. As `tenacity` stops the program from crashing out, we wouldn't see the error and wouldn't be aware of what happened. This is why we added a `before_sleep` function that prints the `RATE_ERROR_MESSAGE` before each retry. This way we can keep ourselves up to date in the terminal about what is happening.
The problem is that if we just passed in `print(RATE_ERROR_MESSAGE)`, the print statement would be triggered right away when Python first loads the decorator, because we would be calling `print` on the spot instead of passing in a function. That is not what we want, so we wrap it in a lambda function that only makes the print call when the lambda itself is called.
The `lambda _: print(RATE_ERROR_MESSAGE)` is basically equivalent to:
```python
# Do not put this in your code
def before_sleep_function(_):
    print(RATE_ERROR_MESSAGE)
```
But it's just much shorter to write. The `_` is the name of the input argument: `tenacity` will send an input argument to our `before_sleep` function, but as we won't be using it, we just name it `_` to sort of throw it away.
Ok, so now we have a `TrainingDataEntry` class to hold the data for each entry and a function to generate the fictional user queries that will automatically retry if there is an error. See how we're making the problem smaller and smaller?
Putting it all together
All we need to do now is create a `TrainingDataEntry` object for each entry in the training data, get the fictional user query by calling our function, and then save all the objects to a new JSON file to get our training data. Let's do that now:
```python
with open(INPUT_DATA, "r", encoding="utf-8") as file:
    input_data = json.load(file)

output_data = []
```
We open the `INPUT_DATA` file in read mode and use the `json.load` function to load the JSON data from the file into the `input_data` variable. We then create an empty list and assign it to the `output_data` variable. So far so good, right? Let's finish it:
```python
with open(INPUT_DATA, "r", encoding="utf-8") as file:
    input_data = json.load(file)

output_data = []

for finx_email in tqdm(input_data, desc="Generating training data"):
    finx_email["body"] = html_email.html_to_markdown(finx_email["body"])
    training_data_entry = TrainingDataEntry(
        fictional_user_query=get_fictional_user_query(finx_email),
        markdown_email=str(finx_email),
    )
    output_data.append(training_data_entry.data)
```
First of all, we open a loop over each `finx_email` in the `input_data` list. The reason `input_data` is wrapped inside `tqdm()` here is that this adds a progress bar to the loop. The `desc` argument specifies the description shown next to the progress bar, which will be "Generating training data".
Now remember that each `finx_email` is a dictionary with a `subject` and a `body` key. We convert the `body` from HTML to Markdown using our `html_to_markdown` utility function, and then we create a `TrainingDataEntry` object for the `finx_email`. The `fictional_user_query` is obtained by calling our `get_fictional_user_query` function with the `finx_email` as input, and the `markdown_email` is just the `finx_email` dictionary converted to a string. The `str()` conversion here is important, as the OpenAI API will crash if you send it all sorts of nested objects as input, but if they're in string format it will work fine.
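As a small illustration of what that `str()` conversion produces (the dictionary values here are made up):

```python
finx_email = {"subject": "Hello", "body": "# Hi there\nSome Markdown..."}
print(str(finx_email))
# {'subject': 'Hello', 'body': '# Hi there\nSome Markdown...'}
```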
We then append the dictionary held in the `data` attribute of the `TrainingDataEntry` to the `output_data` list. Once we've done this for every entry in the dataset, we must save our data:
```python
with open(OUTPUT_DATA, "w", encoding="utf-8") as file:
    json.dump(output_data, file, indent=4)

print(f"Total tokens used: {TOTAL_TOKENS_USED}")
```
This opens the `OUTPUT_DATA` file in write mode and uses the `json.dump` function to save the `output_data` list to the file in JSON format. We also specify the `indent` argument as `4` to make the JSON data more readable. Finally, we added a print statement to see how many tokens we used in the process. I'm not going to calculate the exact costs here, as it would add unneeded complexity for a one-time-use script, but it will be roughly ~$0.30 if you make the full 200 calls, possibly a bit less.
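If you do want a rough sanity check on that number, here is a back-of-the-envelope sketch. The prices are assumptions based on gpt-3.5-turbo-0125 pricing at the time of writing ($0.50 per 1M input tokens, $1.50 per 1M output tokens), so verify current pricing before relying on this:

```python
# Rough upper bound: bill every token at the more expensive output rate.
TOTAL_TOKENS = 224_800  # token count from the example run shown below
OUTPUT_PRICE_PER_TOKEN = 1.50 / 1_000_000  # assumed output price in $/token
print(f"${TOTAL_TOKENS * OUTPUT_PRICE_PER_TOKEN:.2f}")  # ~$0.34 worst case
```

Since most of those tokens are input tokens billed at the lower rate, the real cost lands below that, in line with the ~$0.30 estimate.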
Time to sit back!
And that's it! We now have a data generator. Go ahead and run your `chris_gpt_dataset_generator.py` file and you should see a progress bar in the terminal as it generates the data. It will take a few minutes to run. You may run into a rate limit, in which case `tenacity` will trigger for you and pause the process for 60 seconds before continuing. (This is also the reason we're not using some fancy async API calls here: we would hit the rate limit anyway, and they would introduce pointless complexity to a script we won't run often.)
Let it run and grab a coffee ☕ or your favorite snack 🥨 while you watch that glorious progress bar creep up, and relax for a moment, you've earned it 😄! When it's done you should have a new `Finx_completed_dataset.json` file in your `data` directory.
```
Generating training data: 100%|███████████████████████████████████████████████| 200/200 [08:56<00:00,  2.68s/it]
Total tokens used: 224800
```
Open it up and take a look at the data. It should be a list of dictionaries with the same structure as the `data` attribute of the `TrainingDataEntry` class:
[ { "messages": [ { "role": "system", "content": "You are a helpful assistant that writes emails for the Finxter email newsletter, adhering perfectly to the style and tone of the Finxter brand and Chris' writing style. You will respond in the following format: {\"subject\": \"The subject of the email\", \"body\": \"The body of the email in Markdown formatting\"}." }, { "role": "user", "content": "- Investment opportunities in OpenAI\n- Five proxies to invest in OpenAI:\n - MSFT\n - NVIDIA\n - ARKVX\n - META\n - TSLA\n- Community corner featuring TradeUnafraid\n- Invitation to feature one's startup, YouTube channel, blog, or website\n- Feedback request on the email course\n- Invitation to invite a friend or colleague to the Finxter Email Academy\n- Download link for the Ultimate Python Cheat Sheet\n- Motivational quote on knowledge compounding\n- Contact information for Dr. Christian Mayer\n- Unsubscribe option" }, { "role": "assistant", "content": "{'subject': '5 Proxies to Investing in OpenAI', 'body': '[![]({Link})]({Link})\\n\\nHey {User},\\n\\nTo profit from change, we need to increase ownership of disruptive trends.\\nToday\\'s article covers a question that many Finxters frequently ask:\\n\\n\ud83e\udeb4 [ **Blog** ] [How to Invest in OpenAI?]({Link}) \ud83c\udf33\\n\\nWhile it\\'s not possible to invest in OpenAI directly, the blog discusses five\\nalternatives:\\n\\n * **MSFT** (49% stake in OpenAI),\\n * **NVIDIA** (makes more revenue from OpenAI than any other company),\\n * **ARKVX** ( _Anthropic!_ ),\\n * **META** ( _Llama 2!_ ), and\\n * **TSLA** (Optimus!).\\n\\nCheck it out if you\\'re interested in any of those! No financial advice. \ud83d\ude0a\\n\\nBe on the right side of change. \ud83d\ude80 \\nChris\\n\\n**\u2665\ufe0f Community Corner: Featured Resources**\\n\\n * [TradeUnafraid]({Link}) is a trading platform owned and operated by Finxter community member Lee.\\n\\nDo you want to feature your own startup, YouTube channel, blog, or website as\\na [Finxter premium member]({Link})? Hit reply and let me know!\\n\\n**[How are we doing?]({Link})** \\n[\u2b50]({Link}) \\n[\u2b50\u2b50]({Link}) \\n[\u2b50\u2b50\u2b50]({Link}) \\n \\nTo make sure you keep getting these emails, please add _chris@finxter.com_ to\\nyour address book. \\n \\nI\\'d love to hear your feedback so that I can improve this free email course\\nover time. Please reply to this email and share everything on your mind! \\n \\n**If you find the Finxter Email Academy useful, please invite a friend or\\ncolleague! \u2764**\\n\\n \\nHere\\'s the subscription link you can share: \\n[https://blog.finxter.com/subscribe/]({Link}) \\n \\nDownload the Ultimate Python Cheat Sheet here (direct PDF download): \ud83d\udc0d\\n\\n**[The Ultimate Python Cheat Sheet]({Link})** \\n \\nNot very motivated to learn today? Consider this: \\n**_\"Knowledge compounds!\"_** \\\\-- Warren Buffett \\n \\nConsequently, if you improve your skills by 1% every day, you\\'ll 36x your\\nprogramming skills within a year!\\n\\n \\n_**Finxter, Dr. Christian Mayer** \\n{Address}., {City}, {Country}_\\n\\nWant out of the loop? I\\'m so sad to see you go. \ud83d\ude22 How could we have done\\nbetter? \\nTo help future Finxters, please hit reply and tell us! \ud83e\udd17\\n\\n[Unsubscribe here]({Link})\\n\\n![]({Link})\\n\\n'}" } ] }, ... many more entries ...
Perfect, this has all the data we need to train our ChrisGPT model. We have the system message that is the same for all entries, the user message that is a fictional user query for the email, and the assistant’s response message that is the email itself. That’s it for part 2 of this tutorial. I’ll see you in part 3 where we will be fine-tuning our ChrisGPT model using the data we just generated 🚀. See you there!