OpenAI Fine-Tuning (3/4): Data Validation and Training Cost

Welcome back to part 3! This is where we’re going to do the last preparation and data validation steps on our dataset and also calculate how much it’s going to cost us to train the model.

JSONL format

Remember in part 1 where we discussed the training data? We discussed the data needing to be in JSONL format. Well, it’s time to come back to that now. So what is JSONL format?

JSONL, or JSON Lines, is a convenient format for storing structured data that may be processed one record at a time. Each line in a JSONL file is a valid JSON object. This is different from a regular JSON file, where the entire file is a single JSON object or array.

Each line is a separate, independent JSON object. This means a large file can be read into memory one line at a time, instead of needing to read the entire dataset into memory at once, which is a significant advantage when working with very large datasets. It also makes JSONL very useful for streaming JSON data object by object through another process, like training an LLM!

So say we have an object that looks like this:

[
  {
    "employee": {
      "name": "John Doe",
      "age": 30,
      "department": "Sales",
      "address": {
        "street": "123 Main St",
        "city": "Springfield",
        "state": "IL",
        "zip": "62701"
      }
    }
  },
  {
    "employee": {
      "name": "Jane Smith",
      "age": 28,
      "department": "Marketing",
      "address": {
        "street": "456 Elm St",
        "city": "Springfield",
        "state": "IL",
        "zip": "62701"
      }
    }
  },
  {
    "employee": {
      "name": "Joe Schmoe",
      "age": 35,
      "department": "Engineering",
      "address": {
        "street": "789 Oak St",
        "city": "Springfield",
        "state": "IL",
        "zip": "62701"
      }
    }
  }
]

Then the JSONL version is essentially just a flattened-down version of this, with each object on a single line. Note that we can remove the brackets and the commas between different objects, as it is a given that each line contains one JSON object in this format:

{"name": "John Doe", "age": 30, "department": "Sales", "address": {"street": "123 Main St", "city": "Springfield", "state": "IL", "zip": "62701"}}
{"name": "Jane Smith", "age": 28, "department": "Marketing", "address": {"street": "456 Elm St", "city": "Springfield", "state": "IL", "zip": "62701"}}
{"name": "Joe Schmoe", "age": 35, "department": "Engineering", "address": {"street": "789 Oak St", "city": "Springfield", "state": "IL", "zip": "62701"}}

You will probably see the objects wrap around, but this is only a visual thing. In the actual file, each object is on a single line.
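
To make that streaming advantage concrete, here is a minimal sketch of reading a JSONL file one record at a time. The employees.jsonl filename is just a hypothetical example for the data above:

import json

# Each line is one complete JSON object, so we never need the whole file in memory
with open("employees.jsonl", "r", encoding="utf-8") as file:
    for line in file:
        record = json.loads(line)
        print(record["employee"]["name"])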

Flattening our dataset into a JSONL file

So let’s create a utility function to flatten our dataset into a JSONL file. In your existing utils folder, make a new file called jsonl.py:

📁Finx_Fine_Tuning
    📁data
        📄Finx_completed_dataset.json
        📄Finx_dataset.json
    📁utils
        📄html_email.py
        📄jsonl.py          (new file)
    📄.env
    📄constants.py
    📄chris_gpt_dataset_generator.py
    📄Pipfile
    📄Pipfile.lock

In jsonl.py, add the following imports to get started:

import json
from pathlib import Path
from typing import Iterable

We import the json module to read and save JSON data. We import Path and Iterable only to use them as type hints, to make sure our code is as clear and readable as possible. First, let’s make the problem smaller by creating a function that takes a list or iterable of dictionaries, and converts them into a JSONL file. Add the following function to jsonl.py:

def dicts_to_jsonl(output_file: Path, data: Iterable[dict]) -> Path:
    with open(output_file, "w") as file:
        for dict_obj in data:
            json_string = json.dumps(dict_obj)
            file.write(json_string + "\n")
    return output_file

This function takes two arguments: output_file is the path to the file we want to write, and data is an iterable of dictionaries. We open the file in write mode, and then loop through each dictionary in the iterable. We convert each dictionary to a JSON string using json.dumps, and then write it to the file. We add a newline character at the end of each line to separate the JSON objects. Finally, we return the path to the file as a Path object.
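
If you want to sanity-check the function by itself, a quick example might look like this. The sample dictionaries and the test.jsonl path are made up, and this assumes you run it from the project root so the utils import resolves:

from pathlib import Path

from utils import jsonl

# Two made-up records, just to see one JSON object written per line
records = [
    {"name": "John Doe", "age": 30},
    {"name": "Jane Smith", "age": 28},
]
output_path = jsonl.dicts_to_jsonl(Path("test.jsonl"), records)
print(output_path)  # test.jsonl, now containing two lines of JSON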

Ok, that handles directly converting a list of dictionaries stored in memory, now let’s add a second function below that will handle converting an existing JSON file into a JSONL file. Add the following function to jsonl.py:

def json_to_jsonl(input_file: Path, output_file: Path) -> Path:
    with open(input_file, "r") as in_file:
        data = json.load(in_file)

    return dicts_to_jsonl(output_file, data)

This function takes two arguments: input_file is the path to the JSON file we want to read, and output_file is the path to the JSONL file we want to write. We open the input file in read mode, and then load the JSON data into memory using json.load. We then call the dicts_to_jsonl function we defined earlier to write the data to the output file.

Using this composition, we now have two functions, one for converting in-memory dictionaries and one for converting an existing JSON file, without duplicating any code. Go ahead and save and close jsonl.py.

Validating our dataset

Before we train our model, we need to make sure our dataset is in the right format, check how much the training is going to cost, and make sure none of the entries exceed the token limit. This may all seem a bit like overkill, but you really don’t want to start training a model and have it fail halfway through due to sloppy data or a single entry that is too long. Fine-tuning is also considerably more expensive than other ways of using ChatGPT because we’re creating a whole custom model, so it’s nice to know ahead of time exactly how much money you’re going to spend.

We’re writing most of these specific things in utility functions in separate files, so you can reuse all of these for your future fine-tuning projects. We’ll do the same for the validation and price-calculator logic. In your existing utils folder, make a new file called data_validation.py:

📁Finx_Fine_Tuning
    📁data
        📄Finx_completed_dataset.json
        📄Finx_dataset.json
    📁utils
        📄data_validation.py          (new file)
        📄html_email.py
        📄jsonl.py
    📄.env
    📄constants.py
    📄chris_gpt_dataset_generator.py
    📄Pipfile
    📄Pipfile.lock

Time to install the tiktoken library before we start writing the code. Open your terminal and run the following command:

pipenv install tiktoken==0.6.0

The tiktoken library is a Python package developed by OpenAI. We’ll use it to count the number of tokens in a text string without making any API calls.
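
If you want to see tiktoken in action before we build the validator, here is a tiny sketch; the sample sentence is arbitrary:

import tiktoken

# Load the tokenizer that matches the gpt-3.5-turbo model family
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

tokens = encoding.encode("Fine-tuning is fun!")
print(tokens)       # a list of integer token IDs
print(len(tokens))  # the token count we care about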

In data_validation.py, get started by adding the following imports:

import json
from decimal import Decimal
from pathlib import Path

import tiktoken

Most of these are familiar by now, but we also import Decimal from the decimal module. We’ll use this to handle the cost calculations, as it’s more precise than using floating-point numbers and saves us from dealing with annoying rounding errors.
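
Here is a quick illustration of why Decimal is the nicer choice for money math:

from decimal import Decimal

print(0.1 + 0.2)                        # 0.30000000000000004 (float rounding error)
print(Decimal("0.1") + Decimal("0.2"))  # 0.3 (exact)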

Now define a constant that will be used for our calculations:

TRAINING_COST_PER_1000_TOKENS = Decimal("0.0080")

This is the cost per 1,000 tokens of training data at the time of writing, but it may have changed by the time you’re reading this tutorial. You can check the current cost on the OpenAI pricing page and adjust this number accordingly.

Creating the Validator class

Now let’s create our Validator. As we’ll have a lot of related functions, let’s use a class to group them together and start with the __init__ method:

class Validator:
    def __init__(self, jsonl_file: Path) -> None:
        self.data = self._load_data(jsonl_file)
        self._token_list = None
        self.encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

So the __init__ method will get called when we instantiate a new instance of this class, and it will take a Path argument to the JSONL file we want to validate. We’ll load the data from the file and store it in the data attribute using the _load_data method we’ll define next.

We’ll also initialize the _token_list attribute to None for now; we’ll use it to store the token count for each entry in the dataset. Finally, we store the encoding for the model we’re going to use in the encoding attribute. As the tiktoken library was also made by OpenAI, it has a handy method that loads up the proper encoding for any given model.

Now let’s add the _load_data method. As our data file is not that massive, we’ll just load up the whole file at once and not worry about loading the JSONL one line at a time:

class Validator:
    def __init__():
        ...

    def _load_data(self, jsonl_file: Path) -> list:
        with open(jsonl_file, "r", encoding="utf-8") as file:
            data = [json.loads(line) for line in file]
        return data

No big surprises here: we take the path as input and return a list. The only difference is that, since the data is in JSONL format, we use a list comprehension. For each line in the file, we call json.loads to convert the JSON string to a Python dictionary, which then becomes an element in the list saved as the variable data.

Now let’s add a method to calculate the token count for each entry in the dataset:

class Validator:
    def __init__():
        ...

    def _load_data():
        ...

    def _calculate_token_amount_per_entry(self) -> list[int]:
        extra_tokens_per_message = 2
        token_list = []
        for training_data_object in self.data:
            num_tokens = 0
            for message in training_data_object["messages"]:
                for _, value in message.items():
                    num_tokens += len(self.encoding.encode(str(value)))
                    num_tokens += extra_tokens_per_message
            token_list.append(num_tokens)
        return token_list

This method will return the approximate number of tokens per entry as a list of integers. We start by defining a variable extra_tokens_per_message and setting it to 2. This accounts for the extra tokens that the message structure itself adds on top of the raw strings, so our estimate comes out reasonably accurate. We then loop through each training_data_object in the dataset and set a counter num_tokens to 0.

As this is ChatCompletion data, we know that the messages are stored in a list under the key “messages”. We loop through each message and then through each key-value pair in the message. (We use an _ for the key because we don’t need it in this case, but we need to use it as a placeholder to unpack the tuple.)

We call self.encoding.encode to encode the value to a list of tokens, and then add the length of this list to num_tokens, as it’s only the len or length that we are interested in. We then add the extra_tokens_per_message to account for the object structure as discussed, as this also takes up tokens.

After all the key-value pairs in every message of a training_data_object have been processed, we append num_tokens to the token_list and move on to the next training_data_object in the list.
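
To make the counting logic concrete, here is a small standalone sketch that traces it for a single made-up message (the content string is arbitrary):

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

message = {"role": "user", "content": "Write me an email about the new Python course."}

num_tokens = 0
for _, value in message.items():
    num_tokens += len(encoding.encode(str(value)))  # tokens in the string itself
    num_tokens += 2                                 # the extra_tokens_per_message overhead
print(num_tokens)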

Now let’s add a function to check if our formatting has any mistakes in it:

class Validator:
    def __init__():
        ...

    def _load_data():
        ...

    def _calculate_token_amount_per_entry():
        ...

    def _check_single_entry_format(self, entry) -> bool:
        if not isinstance(entry, dict):
            return False

        if list(entry.keys()) != ["messages"]:
            return False

        messages = entry.get("messages", [])

        return all(
            isinstance(message, dict) and "role" in message and "content" in message
            for message in messages
        )

This function will return True if the entry is in the correct format, and False if it’s not. It checks a single entry, or training_data_object, in our dataset at a time. First, it checks if the entry is a dictionary. After that, we call keys() on the entry to get the dictionary keys and call list() on it to convert them to a list. We then check if the list is equal to ["messages"], to make sure the entry has one and only one key, and that this key is “messages”.

We then call the get() method on the entry to get the value of the “messages” key. Now the last line uses a generator expression and might look confusing if you’re not familiar with it, so let’s break it down step by step.

A generator expression is similar to a list comprehension, but it doesn’t store the list in memory. Instead, it generates each value on the fly as you iterate over it. This can be more memory-efficient than a list comprehension for large sequences, though it doesn’t matter much for our dataset size here. The generator expression in the code is:

(message for message in messages)

This generates a sequence of message values, one for each message in messages.

The isinstance(message, dict) and "role" in message and "content" in message part is a condition that checks whether each message is a dictionary and whether it contains the keys role and content.

The all() function takes an iterable (in this case, the generator expression) and returns True if all elements of the iterable are truthy (i.e., they evaluate to True), and False if even a single entry is not True. So, in simple terms, we check whether all messages in the messages list are dictionaries that contain the keys role and content, and return either True or False.
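
Here is the same check in isolation, with a couple of made-up messages, so you can see what all() does:

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

print(all(
    isinstance(message, dict) and "role" in message and "content" in message
    for message in messages
))  # True

messages.append({"role": "assistant"})  # missing the "content" key

print(all(
    isinstance(message, dict) and "role" in message and "content" in message
    for message in messages
))  # False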

Now, let’s add a property to get the token_list, so we can easily access it:

class Validator:
    def __init__():
        ...

    def _load_data():
        ...

    def _calculate_token_amount_per_entry():
        ...

    def _check_single_entry_format():
        ...

    @property
    def token_list(self) -> list[int]:
        if self._token_list is None:
            self._token_list = self._calculate_token_amount_per_entry()
        return self._token_list

The @property decorator here means we can access this particular method as a property, so using self.token_list instead of calling it as a method with self.token_list(). First, it checks whether self._token_list is None, which it will be the first time we access it. If it is, it calls the _calculate_token_amount_per_entry method to calculate the token list and stores it in the self._token_list attribute, then returns it. On any later access it simply returns the _token_list attribute without recalculating, as the value has already been calculated and stored.

Note that the methods with the _ prefix are meant to be private, so the _token_list is our implementation detail here, and the token_list property is the public interface to access it. This is a good practice because it ensures that _token_list is always in a valid state when it’s accessed, and it hides the details of how _token_list is implemented and managed from the rest of your program by providing token_list as an access point.
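
The same lazy-caching pattern in isolation, with a made-up expensive calculation, looks like this:

class Report:
    def __init__(self):
        self._total = None

    @property
    def total(self):
        if self._total is None:
            print("Calculating...")  # only happens on the first access
            self._total = sum(range(1_000_000))
        return self._total


report = Report()
print(report.total)  # prints "Calculating..." and then the result
print(report.total)  # returns the cached result instantly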

Now let’s add a method to check if the dataset is valid:

class Validator:
    def __init__():
        ...

    def _load_data():
        ...

    def _calculate_token_amount_per_entry():
        ...

    def _check_single_entry_format():
        ...

    @property
    def token_list():
        ...

    def validate_data(self) -> bool:
        return all(self._check_single_entry_format(entry) for entry in self.data)

This method will return True if all entries in the dataset are in the correct format, and False if any of them are not. It uses a generator expression in the same style as we did before. Note that it will stop checking as soon as it finds an entry that fails the _check_single_entry_format test, because all stops iterating as soon as it encounters a False value.

Now let’s add a method to get the training cost in dollars:

class Validator:
    def __init__():
        ...

    def _load_data():
        ...

    def _calculate_token_amount_per_entry():
        ...

    def _check_single_entry_format():
        ...

    @property
    def token_list():
        ...

    def validate_data():
        ...

    def get_training_cost_in_dollars(self, epochs: int = 3) -> Decimal:
        total_tokens = sum(self.token_list)
        total_cost_dollars = (
            TRAINING_COST_PER_1000_TOKENS * total_tokens / 1000 * epochs
        )
        print(
            f"Total estimated cost: ~${total_cost_dollars:.3f} for training {epochs} epochs on {total_tokens} token dataset."
        )
        return total_cost_dollars

💡 Machine-learning Top-tip 💡
Epochs are the number of times the model will go through the entire dataset during training. The more epochs, the more the model will learn and internalize our dataset. If the number is too low, it will not fully internalize our training data, but if the number is too high it will internalize our specific examples too much and lose its ability to generalize, a concept called overfitting. 3 epochs is a good starting point for most fine-tuning tasks.

This method will return the total cost in dollars for training the model for a given number of epochs as a Decimal type object. It uses the sum function to calculate the total number of tokens in the dataset and then does simple math to get the total cost in dollars. We print the total cost with an accuracy of 3 decimal places by using the :.3f format specifier in the f-string and then return the total cost.
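
As a worked example, with roughly 216,000 tokens in the dataset (close to the number we’ll see below) and the default 3 epochs, the math looks like this:

from decimal import Decimal

total_tokens = 216_000  # example value, yours will differ slightly
epochs = 3
cost = Decimal("0.0080") * total_tokens / 1000 * epochs
print(cost)  # 5.184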

One last method and we’ll be done, I promise! 😄 We want to be able to make sure the longest entry is not above our token limit:

class Validator:
    def __init__():
        ...

    def _load_data():
        ...

    def _calculate_token_amount_per_entry():
        ...

    def _check_single_entry_format():
        ...

    @property
    def token_list():
        ...

    def validate_data():
        ...

    def get_training_cost_in_dollars():
        ...

    def longest_entry_token_count(self) -> int:
        return max(self.token_list)

We use the max function to get the maximum value from the token_list and return it. The token limit per training example, so per line in our JSONL file, is the same as the context limit for the ChatGPT model we’re using. For gpt-3.5-turbo-1106, the maximum context length is 16,385 tokens, so as long as this number is below that, you’ll know you’re safe.

Here is the whole class again for reference:

class Validator:
    def __init__(self, jsonl_file: Path) -> None:
        self.data = self._load_data(jsonl_file)
        self._token_list = None
        self.encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

    def _load_data(self, jsonl_file: Path) -> list:
        with open(jsonl_file, "r", encoding="utf-8") as file:
            data = [json.loads(line) for line in file]
        return data

    def _calculate_token_amount_per_entry(self) -> list[int]:
        extra_tokens_per_message = 2
        token_list = []
        for training_data_object in self.data:
            num_tokens = 0
            for message in training_data_object["messages"]:
                for _, value in message.items():
                    num_tokens += len(self.encoding.encode(str(value)))
                    num_tokens += extra_tokens_per_message
            token_list.append(num_tokens)
        return token_list

    def _check_single_entry_format(self, entry) -> bool:
        if not isinstance(entry, dict):
            return False

        if list(entry.keys()) != ["messages"]:
            return False

        messages = entry.get("messages", [])

        return all(
            isinstance(message, dict) and "role" in message and "content" in message
            for message in messages
        )

    @property
    def token_list(self) -> list[int]:
        if self._token_list is None:
            self._token_list = self._calculate_token_amount_per_entry()
        return self._token_list

    def validate_data(self) -> bool:
        return all(self._check_single_entry_format(entry) for entry in self.data)

    def get_training_cost_in_dollars(self, epochs: int = 3) -> Decimal:
        total_tokens = sum(self.token_list)
        total_cost_dollars = (
            TRAINING_COST_PER_1000_TOKENS * total_tokens / 1000 * epochs
        )
        print(
            f"Total estimated cost: ~${total_cost_dollars:.3f} for training {epochs} epochs on {total_tokens} token dataset."
        )
        return total_cost_dollars

    def longest_entry_token_count(self) -> int:
        return max(self.token_list)

Using the Validator

So give yourself a pat on the back for that 😎. Now let’s train us some ChrisGPT! Save and close this file, then create a new file in your root directory named chris_gpt_dataset_validation.py:

📁Finx_Fine_Tuning
    📁data
        📄Finx_completed_dataset.json
        📄Finx_dataset.json
    📁utils
        📄data_validation.py
        📄html_email.py
        📄jsonl.py
    📄.env
    📄constants.py
    📄chris_gpt_dataset_generator.py
    📄chris_gpt_dataset_validation.py          (new file)
    📄Pipfile
    📄Pipfile.lock

In chris_gpt_dataset_validation.py, add the following setup to get started:

from utils import data_validation, jsonl
from constants import DATA_DIRECTORY


JSON_FILE = DATA_DIRECTORY / "Finx_completed_dataset.json"
JSONL_FILE = DATA_DIRECTORY / "Finx_completed_dataset.jsonl"

We import all the stuff we made and prepared ourselves, and then we define the paths to the existing JSON file and the JSONL file we want to create. Now let’s make some good use of all the hard work we’ve done so far:

jsonl.json_to_jsonl(JSON_FILE, JSONL_FILE)  # Only run once

data_validator = data_validation.Validator(JSONL_FILE)

print(f"Data valid: {data_validator.validate_data()}")
data_validator.get_training_cost_in_dollars()
print(f"Longest entry: {data_validator.longest_entry_token_count()} tokens")

We convert our JSON file to a JSONL file with the same name. The comment says “Only run once” because you can comment this line out after running the file the first time. Nothing bad will happen if you don’t, though; it just does some unneeded work to create the same file again.

Then we create a new instance of our Validator class and pass the path to the JSONL file as an argument. We call the validate_data method to check if the dataset is valid and print the result. We then call the get_training_cost_in_dollars method to get the estimated training cost, which will get printed to the console automatically, and finally, we call the longest_entry_token_count method to get the token count of the longest entry in the dataset so we can make sure we don’t exceed the token limit.
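
If you’d like the script to stop loudly when something is off, you could optionally add a small guard after these calls. This is just a suggestion, not part of the original script; the 16,385 figure is the context limit mentioned earlier:

assert data_validator.validate_data(), "Dataset contains malformed entries!"
assert data_validator.longest_entry_token_count() <= 16_385, "An entry exceeds the token limit!"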

Let’s run the file we have so far just as an interim test. You should get an output in your terminal that looks something like this:

Data valid: True
Total estimated cost: ~$5.184 for training 3 epochs on 216000 token dataset.
Longest entry: 2441 tokens

Your numbers will be slightly different from mine, as the data is partly LLM-generated, but they will be very close to this. We can see our data is valid, we have over 200,000 tokens in total, and the longest entry is 2441 tokens, which is well below the 16,385-token limit for the gpt-3.5-turbo-1106 model.

You’ll also notice that a JSONL file has been created in your data directory with the training data in JSONL format:

📁Finx_Fine_Tuning
    📁data
        📄Finx_completed_dataset.json
        📄Finx_completed_dataset.jsonl ✨
        📄Finx_dataset.json
    ...

Now you might be surprised by the cost here. While $5 is not a massive amount of money, it is a whole lot more than we typically spend on regular ChatGPT calls. This is the reason we took so much time on the data validation: to make sure we get the data right the first time, and to know the exact cost before we commit to the training.

For those $5 you get something pretty damn cool though, your own custom ChatGPT 😎. That being said, I understand if you’re not willing to spend $5 on this simple test project. You can run with half the training data, which is 100 examples, or even a quarter, which is 50 examples. But your output will not be as good as mine if you do so.

Limiting the dataset size

Let’s make some small changes to the code so you can limit your dataset size if you want to:

import json

from constants import DATA_DIRECTORY
from utils import data_validation, jsonl


JSON_FILE = DATA_DIRECTORY / "Finx_completed_dataset.json"
JSONL_FILE = DATA_DIRECTORY / "Finx_completed_dataset.jsonl"
LIMIT = 100


with open(JSON_FILE, "r", encoding="utf-8") as in_file:
    data = json.load(in_file)
    jsonl.dicts_to_jsonl(JSONL_FILE, data[:LIMIT])

data_validator = data_validation.Validator(JSONL_FILE)

print(f"Data valid: {data_validator.validate_data()}")
data_validator.get_training_cost_in_dollars()
print(f"Longest entry: {data_validator.longest_entry_token_count()} tokens")

We added an import for json, and we set a constant named LIMIT. We then manually load the data from the JSON_FILE and use the dicts_to_jsonl function instead of the json_to_jsonl function, passing in only the first LIMIT examples using a simple slice. Note how easy this is: because we built the jsonl utility module out of small composable pieces, we can simply use a different piece this time.
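
If you’re wondering how different LIMIT values behave in that slice, including None, here is a tiny sketch:

examples = list(range(200))  # stand-in for our 200 training examples

print(len(examples[:100]))   # 100 -> only the first 100 examples
print(len(examples[:50]))    # 50
print(len(examples[:None]))  # 200 -> a None limit keeps the whole list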

I’m going to set the LIMIT variable to None as I want to use the full 200 examples for mine. Choose whatever number you want to use for the LIMIT, and then run the file again. It will create the new JSONL_FILE with the limited number of examples, and then validate and tell you the new cost. Limiting to 100 examples will cost you around $2.55.

Now that we know the cost, and we know our data is valid, we can move on to the next part where we’ll actually train our model on the JSONL data. I’ll see you there! 🚀