Welcome back to part 5 of this tutorial series.
In this part, we’ll be looking at simple sentiment analysis using embeddings. For most text classification tasks, fine-tuned machine learning models will outperform embeddings, because they have been meticulously tuned and trained on problem-specific data: there is labeled training data with the correct classifications, and the model learns to predict the right answer by seeing lots of correct examples. But what if we don’t have any training data? In that case we can do zero-shot classification, classifying text with absolutely zero labeled training data using only OpenAI embeddings.
For this part, we’ll be working with a Jupyter notebook, as this allows us to display graphs inline with the code and gives us a nice visual representation of our Pandas DataFrames. If you don’t like to use Jupyter notebooks you can just use a regular Python file and insert the same code, but you’ll occasionally need to add a print statement to see what we’re doing, and your output will just look a little less pretty.
I won’t go into depth on Jupyter notebooks here, but I will explain the bare basics you need to know, so if you’ve not used Jupyter notebooks before I would encourage you to follow along and take this opportunity to explore them.
For those new to Jupyter notebooks
Assuming you’re working with VS Code, you’ll need two things. If you’re already using Jupyter notebooks you can obviously skip these two steps.
1. Install Jupyter: run 'pip install jupyter' in your console window.
2. Install the Jupyter extension in VS Code by selecting the extensions icon on the left side and searching for Jupyter, by Microsoft.
Once you’ve done that you should be good, depending on the configuration of your system.
A Jupyter notebook, very basically, just allows us to chop our code up into blocks which we can run one at a time. Unless we restart our notebook, the kernel executing our code is kept alive between running cells, which also keeps our variables in memory. So in one cell we could define 'variable = "Hi this is some text"', run that cell, and then in the next cell call 'print(variable)' and it would print "Hi this is some text". In fact, we can often skip the print statement altogether, as you’ll soon see.
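Spread over two notebook cells, that same example looks like this (just an illustration of the behavior described above):

# Cell 1 - define a variable and run the cell; the kernel keeps it in memory
variable = "Hi this is some text"

# Cell 2 - run this after Cell 1; the last expression in a cell is displayed without print()
variable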
OK let’s get started!
For this part, we’ll be using the same database we used for part 3 of our tutorial, where we had ChatGPT generate SQL queries to answer our questions about the database. You can download the file for free from https://www.kaggle.com/datasets/joychakraborty2000/amazon-customers-data and extract the zip file anywhere.
The file has 2 versions of the data inside: one called database.sqlite, which we used for part 3 of the tutorial series, and one called Reviews.csv. For this part, we’re going to be using the CSV version, and I’m going to rename it to 'sentiment_reviews_database.csv' and put it in a new directory named '5_Embeddings_for_sentiment':
📁FINX_FUNC_EMBED
    📁1_Simple_function_call
    📁2_Parallel_function_calling
    📁3_Database_functions
    📁4_Embeddings_for_similarity
    📁5_Embeddings_for_sentiment
        📄sentiment_reviews_database.csv   <- (This is Reviews.csv from the download file but renamed)
    📄.env
This CSV file has exactly the same customer reviews data as the SQLite version we used for part 3. Now let’s create a new file called '1_data_preparation.ipynb' in the '5_Embeddings_for_sentiment' directory:
📁FINX_FUNC_EMBED
    📁1_Simple_function_call
    📁2_Parallel_function_calling
    📁3_Database_functions
    📁4_Embeddings_for_similarity
    📁5_Embeddings_for_sentiment
        📄sentiment_reviews_database.csv
        📄1_data_preparation.ipynb
    📄.env
The .ipynb extension is the extension for Jupyter notebooks, and VS Code will automatically recognize it and open the file in the Jupyter notebook editor. If you’re using a regular Python file you can just call it '1_data_preparation.py' instead. In the top left you can click '+ Code' to add more code blocks to your notebook. Go ahead and add 5 or 6 before we get started.
In the first code cell, we’ll put our imports and some variables:
from openai import OpenAI
import pandas as pd
import decouple

config = decouple.AutoConfig(" ")
client = OpenAI(api_key=config("OPENAI_API_KEY"))

EMBEDDING_MODEL = "text-embedding-ada-002"
INPUT_DB_NAME = "sentiment_reviews_database.csv"
OUTPUT_DB_NAME = "sentiment_review_embeddings.csv"
Note that the decouple and config part where we load the API key is slightly different from what you’re used to. This is needed to make it work in Jupyter notebooks; use the old method from the previous parts if you’re using a regular Python file. The other imports are all familiar by now, and we define a couple of constants up top: the embedding model, the name of the input database, and the output file name we’ll use to store the embeddings. (The output file does not have to exist yet, it will be created automatically.)
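For reference, if you’re following along in a regular Python file instead, the key loading from the earlier parts looks roughly like this (a minimal sketch assuming python-decouple’s standard import; adapt it to whatever you used before):

from decouple import config  # standard python-decouple import for regular .py files
from openai import OpenAI

# Reads OPENAI_API_KEY from your .env file, same as in the previous parts
client = OpenAI(api_key=config("OPENAI_API_KEY"))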
💡 For those new to Jupyter notebooks (the very basics you need to know):
- On the left side of each cell you'll see an arrow; if you click it, that particular cell will be executed.
- The variables stay in memory and are available across different cells.
- If you want to start fresh, you can restart your notebook by pressing the 'Restart' button at the top, which will restart the kernel and clear all variables. You then have to run each cell again, or you can press the 'Run All' button up top.
In the next cell, we’ll read in some data for us to work with:
df = pd.read_csv(INPUT_DB_NAME, usecols=["Summary", "Text", "Score"], nrows=500)
df = df[df["Score"] != 3]
df["Summ_and_Text"] = "Title: " + df["Summary"] + "; Content: " + df["Text"]
df.head(5)
In the first line, we use Pandas to read data from a CSV file, just like in the previous tutorial. We specify the database name as the first argument, then the columns we want to use, which means we will ignore all other columns in the data except for Summary, Text, and Score, and the final argument is the number of rows we want to read. I’m only going to read 500 rows from this massive dataset, but if you’re worried about token usage you can read even fewer and set it to 100.
The next line "df = df[df["Score"] != 3]"
may look a bit confusing at first glance if you’re not familiar with Pandas, so let’s read it from the inside out. df["Score"] != 3
will return a boolean array of True
and False
values, with each row being tested for a True
or False
evaluation, where True
means the score is not equal to 3. Then we use this boolean array to index our DataFrame, which means we only keep the rows where the score is not equal to 3.
Any rows where the statement df["Score"] != 3 evaluates to True will be retained in our dataset, and any rows where it evaluates to False will be filtered out. This is because we want to do binary classification, and we only want to classify positive and negative reviews, so we remove all reviews with a score of 3, which is a neutral review.
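If this boolean-mask filtering is new to you, here’s a tiny standalone sketch with made-up data (not our reviews dataset) showing what happens step by step:

import pandas as pd

toy = pd.DataFrame({"Score": [5, 3, 1, 3, 4]})
mask = toy["Score"] != 3   # a boolean Series: [True, False, True, False, True]
print(mask.tolist())
print(toy[mask])           # keeps only the rows where the mask is True (scores 5, 1 and 4)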
In the third line, we add a new column to our DataFrame called "Summ_and_Text", which is just a concatenation of the summary and the text of each review, with a little bit of text in between to separate the two. Finally, we display the first 5 rows of our DataFrame to see what it looks like. Note that we can just write df.head(5), whereas in a normal Python file we would have to use print(df.head(5)).
Go ahead and run this cell (make sure you run cell number 1 first with the imports). You should see a pretty representation where each row has 4 columns, prefixed by an ID that Pandas generated, making for a data structure that looks like this:
   Score         Summary            Text                               Summ_and_Text
0      5  Summary here..  Review here...   Title: Summary here; Content: Review here
1      1  Summary here..  Review here...   Title: Summary here; Content: Review here
2      4  Summary here..  Review here...   Title: Summary here; Content: Review here
3      2  Summary here..  Review here...   Title: Summary here; Content: Review here
4      5  Summary here..  Review here...   Title: Summary here; Content: Review here
Generating the embeddings
Now that we have a DataFrame with only the data we want, we will need to generate embeddings again and save them somewhere, before we can start analyzing the sentiment and doing stuff with it. In a new cell of your Jupyter notebook, write the following function:
total_token_usage = 0
embeddings_generated = 0
total_data_rows = df.shape[0]


def get_embedding(item):
    global total_token_usage, embeddings_generated
    response = client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=item,
    )
    tokens_used = response.usage.total_tokens
    total_token_usage += tokens_used
    embeddings_generated += 1
    if (embeddings_generated % 10) == 0:
        print(
            f"Generated {embeddings_generated} embeddings so far with a total of {total_token_usage} tokens used. ({int((embeddings_generated / total_data_rows) * 100)}%)"
        )
    return response.data[0].embedding
This is mostly the same tactic we used in the last tutorial, so just quickly copy it over. We define global variables for the number of tokens used, the number of embeddings generated, and the total number of data rows in our dataset. Then we define a function called get_embedding which takes an item as input and returns the embedding for that item. Inside the function we use the global keyword to access the global variables and increment them as appropriate, and just like in the previous tutorial, we print a progress message for every 10 embeddings generated.
Go ahead and run this cell so the function will be stored in memory and available for us to use. Now we can use this function to generate embeddings for our dataset. In a new cell, write the following code:
df["Embedding"] = df.Summ_and_Text.apply(lambda item: get_embedding(item)) df.to_csv(OUTPUT_DB_NAME, index=False) print( f""" Generated {embeddings_generated} embeddings with a total of {total_token_usage} tokens used. (Done!) Successfully saved embeddings to {OUTPUT_DB_NAME}. """ ) df.head(10)
We add a new column to our DataFrame named 'Embedding' and set its value to the Summ_and_Text column (the combined summary and text) after a function has been applied to each item inside using the apply method. The lambda takes each item one by one, runs the get_embedding function on it, and returns the embedding, thus filling the 'Embedding' column in our DataFrame with the embeddings of the Summ_and_Text column.
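If the apply-with-a-lambda pattern is new, this toy sketch (unrelated to our dataset) shows the same idea on a small Series:

import pandas as pd

toy = pd.Series(["good product", "bad product"])
lengths = toy.apply(lambda item: len(item))  # the lambda is called once per item
print(lengths.tolist())                      # [12, 11]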
We then use Pandas to save the DataFrame to a CSV file again, skipping the index (the ID numbers auto-generated by Pandas) as we have no need for it. Finally, we print a message to the console and print the first 10 rows of our DataFrame to see what it looks like. Go ahead and run this cell and wait until it’s done running.
Generated 10 embeddings so far with a total of 680 tokens used. (2%)
Generated 20 embeddings so far with a total of 1531 tokens used. (4%)
Generated 30 embeddings so far with a total of 2313 tokens used. (6%)
Generated 40 embeddings so far with a total of 3559 tokens used. (8%)
Generated 50 embeddings so far with a total of 4806 tokens used. (10%)
Generated 60 embeddings so far with a total of 5567 tokens used. (12%)
...
Generated 463 embeddings with a total of 45051 tokens used. (Done!)
Successfully saved embeddings to sentiment_review_embeddings.csv.

   Score         Summary            Text     Summ_and_Text     Embedding
0      5  Summary here..  Review here...  Summ_and_text...  [numbers...]
1      1  Summary here..  Review here...  Summ_and_text...  [numbers...]
2      4  Summary here..  Review here...  Summ_and_text...  [numbers...]
3      2  Summary here..  Review here...  Summ_and_text...  [numbers...]
4      5  Summary here..  Review here...  Summ_and_text...  [numbers...]
You’ll see your progress as it’s running and finally your success message and a representation of the DataFrame printed out, representing a structure like the above. You’ll also have a file named sentiment_review_embeddings.csv with the data stored in CSV format. We now have our data prepared and we’re ready to do some sentiment analysis!
Sentiment analysis
To keep things organized, I’m going to be doing this in a separate file. Go ahead and save and close this one, then create a new Jupyter notebook called '2_classification.ipynb' in the '5_Embeddings_for_sentiment' directory:
📁FINX_FUNC_EMBED
    📁1_Simple_function_call
    📁2_Parallel_function_calling
    📁3_Database_functions
    📁4_Embeddings_for_similarity
    📁5_Embeddings_for_sentiment
        📄sentiment_reviews_database.csv
        📄sentiment_review_embeddings.csv
        📄1_data_preparation.ipynb
        📄2_classification.ipynb
    📄.env
Open up '2_classification.ipynb' and press the '+ Code' button in the top left a couple of times to give us a few cells to work with. In the first cell, place the following imports and setup variables:
import pandas as pd
import numpy as np
import decouple
from sklearn.metrics import classification_report, PrecisionRecallDisplay
from openai import OpenAI

config = decouple.AutoConfig(" ")
client = OpenAI(api_key=config("OPENAI_API_KEY"))

EMBEDDING_MODEL = "text-embedding-ada-002"
CSV_DB_NAME = "sentiment_review_embeddings.csv"
THRESHOLD = 0
Pandas and Numpy are familiar, and naturally we also import openai and the decouple module to use our config, and then set up the client object with our API key. Just a quick reminder: we have to use the alternative config = decouple.AutoConfig call again, as this is required for Jupyter notebooks instead of the approach we used in our regular Python files before.
We also import classification_report and PrecisionRecallDisplay from sklearn.metrics, which we’ll use to evaluate our model. Sklearn will make it easy for us to see how many correct versus incorrect classifications our model is making, and what its precision is.
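If you haven’t seen classification_report before, here is a minimal toy example (made-up labels, nothing to do with our reviews yet) of the kind of output it produces:

from sklearn.metrics import classification_report

y_true = ["Positive", "Positive", "Negative", "Positive", "Negative"]
y_pred = ["Positive", "Negative", "Negative", "Positive", "Negative"]
print(classification_report(y_true, y_pred))  # per-label precision, recall, f1-score and support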
Below that we declare our embedding model, database name, and a threshold as constants so we can use them throughout this file. The threshold is the cut-off we’ll use to classify a review as either positive or negative, and we’ll be able to play around with this value later to find the sweet spot for the greatest accuracy.
Now we’ll need both a function to get an embedding, and a function to calculate cosine similarity between two embeddings. First, the function to get an embedding:
def get_embedding(text: str) -> list[float]:
    return client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=text,
    ).data[0].embedding
This is just what we always use, and then in the next cell take the cosine_similarity function from the last file:
def cosine_similarity(a, b) -> float:
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
If you’re working on a real project, make sure to save these types of reusable functions somewhere so you can just write them once and then import them every time you need them.
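As a quick sanity check, you can run the helper on a couple of toy vectors in a scratch cell to see how the similarity behaves (this just uses the cosine_similarity function we defined above):

a = np.array([1.0, 0.0])
b = np.array([1.0, 0.0])
c = np.array([0.0, 1.0])

print(cosine_similarity(a, b))  # 1.0 -> same direction, maximum similarity
print(cosine_similarity(a, c))  # 0.0 -> orthogonal vectors, no similarity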
In the next cell, we’ll read in the data we prepared earlier:
df = pd.read_csv(CSV_DB_NAME)
df["Embedding"] = df.Embedding.apply(eval).apply(np.array)
df["Sentiment"] = df.Score.replace(
    {1: "Negative", 2: "Negative", 4: "Positive", 5: "Positive"}
)
df = df[["Sentiment", "Summ_and_Text", "Embedding"]]
df.head(5)
First, we read the CSV file and load the data into a Pandas DataFrame. Then we select the 'Embedding' column and evaluate the string values back into lists and then into Numpy arrays for greater efficiency, just like we did last time. Then we add a new column called 'Sentiment', which is a copy of the 'Score' column but with the values 1 and 2 replaced by 'Negative' and 4 and 5 replaced by 'Positive'. This is because we want to do binary classification between positive and negative reviews.
Finally, we set the df variable equal to the DataFrame with only the 'Sentiment', 'Summ_and_Text', and 'Embedding' columns selected, effectively filtering out all other columns. Then we print the first 5 rows of our DataFrame to see what it looks like using the .head method. Go ahead and run all the cells we've written so far, including this last one. Your data structure will look something like this:
  Sentiment                              Summ_and_Text     Embedding
0  Positive  Title: Summary here; Content: Review here  [numbers...]
1  Negative  Title: Summary here; Content: Review here  [numbers...]
2  Positive  Title: Summary here; Content: Review here  [numbers...]
3  Negative  Title: Summary here; Content: Review here  [numbers...]
4  Positive  Title: Summary here; Content: Review here  [numbers...]
Testing different classification labels
Now let’s move on to the next cell. It will contain a single function, which we’ll go over in parts. This function will test the accuracy of classification labels, outputting a Precision-Recall curve, which is just a graph showing the accuracy of our predictions.
This will allow us to test labels such as 'Positive' and 'Negative', or more complex labels such as 'Positive product review' and 'Negative product review', to see which best match positive/negative review embeddings. The idea is that we test the embedding for a term like 'Positive product review' against the embeddings of the actual reviews in the database. If a particular review's embedding has a high similarity to the embedding for the string 'Positive product review', we can assume there is a high similarity in meaning, as in, this is likely a positive product review.
Our function will take any labels we pass in, so we can test different sets of labels and see which gives us the highest accuracy. We also created the Sentiment column in our dataset (see above), which contains the correct answers, so we'll be able to compare our embedding-based predictions with those correct answers and see how good our accuracy is.
So let’s get started on this function in a new code cell:
def evaluate_classification_labels(labels: list[str], threshold=THRESHOLD) -> None:
    """
    This function will test the accuracy of classification labels, outputting a Precision-Recall curve.
    This will allow us to test labels such as Positive/Negative, or more complex labels such as
    'Positive product review' and 'Negative product review' to see which best match positive/negative review embeddings.
    labels: List of two terms, the first meant to represent a positive review and the second meant to represent a negative review.
    """
    test_label_embeddings = [get_embedding(label) for label in labels]
First, we define our function, evaluate_classification_labels, which takes the labels as an argument, with a type hint indicating this should be a list of strings. We also take the threshold as an optional argument, which defaults to the constant we defined earlier. Then we have a simple multi-line docstring explaining what the function does. In the last line, we get the test label embeddings, meaning one embedding for the positive review label and one for the negative review label. We use the get_embedding function we wrote above, calling it for each label in the labels list, which gives us a list of embeddings, one per label.
Now that we have our two label embeddings, let’s continue (still inside the same cell and function):
    def label_score(review_emb, test_label_emb):
        positive_similarity = cosine_similarity(review_emb, test_label_emb[0])
        negative_similarity = cosine_similarity(review_emb, test_label_emb[1])
        return positive_similarity - negative_similarity
Inside our evaluate_classification_labels function, we define an inner function called label_score. This function takes two arguments: the embedding for a particular review and the two test label embeddings, one for positive and one for negative. We then calculate the similarity between the review embedding and the first test label embedding, and the similarity between the review embedding and the second test label embedding. Remember that this similarity is calculated using the cosine_similarity function we wrote above.
Then we return the difference between the two similarities. This will give us a score, which we can use to determine which label the review embedding is most similar to. If the score is positive, the review embedding is more similar to the first (positive) test label embedding, and if the score is negative, the review embedding is more similar to the second (negative) test label embedding.
probabilities = df["Embedding"].apply( lambda review_emb: label_score(review_emb, test_label_embeddings) ) predictions = probabilities.apply(lambda score: "Positive" if score > threshold else "Negative")
Then we use the apply method on the 'Embedding' column of our DataFrame, which applies a function to each row in the column. We pass in a lambda function which takes the review embedding as an argument and calls the label_score function we defined above, passing in the review embedding and the test label embeddings. This returns a score for each row, which we store in the probabilities variable.
Finally, we use the apply method again, this time on the probabilities variable, applying a function to each score. We pass in a lambda function which takes the score as an argument and returns 'Positive' if the score is greater than the threshold, and 'Negative' otherwise. This gives us a list of predictions, one for each review embedding, predicting whether the review is positive or negative based only on embeddings.
Still in the same cell, continuing the evaluate_classification_labels function:
report = classification_report(df["Sentiment"], predictions) print(report) display = PrecisionRecallDisplay.from_predictions( df["Sentiment"], probabilities, pos_label="Positive" ) display.ax_.set_title("Precision-Recall curve for test classification labels")
We then use the classification_report function from sklearn.metrics to generate a classification report, which compares the predictions we made with the correct answers in the 'Sentiment' column of our DataFrame. We pass in the correct answers and the predictions, store the returned report in the report variable, and print it to the console.
In addition, we use the PrecisionRecallDisplay.from_predictions method from sklearn.metrics to generate a Precision-Recall curve, which will show us the accuracy of our predictions in graph form. We pass in the correct answers, the probabilities, and the positive label, which is 'Positive' in our case. Then we set the title of the graph to 'Precision-Recall curve for test classification labels'. We don't need to store the graph in a variable; we just call the method and the graph is displayed for us since we're in a Jupyter notebook.
Your entire cell and function now look like this:
def evaluate_classification_labels(labels: list[str], threshold=THRESHOLD) -> None:
    """
    This function will test the accuracy of classification labels, outputting a Precision-Recall curve.
    This will allow us to test labels such as Positive/Negative, or more complex labels such as
    'Positive product review' and 'Negative product review' to see which best match positive/negative review embeddings.
    labels: List of two terms, the first meant to represent a positive review and the second meant to represent a negative review.
    """
    test_label_embeddings = [get_embedding(label) for label in labels]

    def label_score(review_emb, test_label_emb):
        positive_similarity = cosine_similarity(review_emb, test_label_emb[0])
        negative_similarity = cosine_similarity(review_emb, test_label_emb[1])
        return positive_similarity - negative_similarity

    probabilities = df["Embedding"].apply(
        lambda review_emb: label_score(review_emb, test_label_embeddings)
    )
    predictions = probabilities.apply(
        lambda score: "Positive" if score > threshold else "Negative"
    )

    report = classification_report(df["Sentiment"], predictions)
    print(report)

    display = PrecisionRecallDisplay.from_predictions(
        df["Sentiment"], probabilities, pos_label="Positive"
    )
    display.ax_.set_title("Precision-Recall curve for test classification labels")
Go ahead and run this cell so the function is loaded in memory, as we’re done writing it. Now we can use it to test different labels and see which set gives us the highest accuracy. In the next cell, write the following code:
simple_labels = ["Positive", "Negative"] evaluate_classification_labels(simple_labels)
Now run the cell to test the accuracy of these labels and you will see something like the following:
              precision    recall  f1-score   support

    Negative       0.88      0.70      0.78        54
    Positive       0.96      0.99      0.97       409

    accuracy                           0.95       463
   macro avg       0.92      0.85      0.88       463
weighted avg       0.95      0.95      0.95       463

[a pretty graph here showing the curve]
This is the classification report, which shows us the accuracy of our predictions. We can see that we have an accuracy of 95%, which is pretty good. We can also see that the precision for the positive label is 96%, which means that 96% of the time when we predict a review is positive, it is actually positive.
The recall for the positive label is 99%, which means that 99% of the time when a review is actually positive, we predict it as positive. The f1-score is a combination of precision and recall and is 97% for the positive label.
The support is the number of times the label appears in the dataset, which is 409 for the positive label. The same goes for the negative scores, but we can see the accuracy is lower on the negative reviews.
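If you're wondering where the f1-score comes from, it's simply the harmonic mean of precision and recall; plugging in the positive-label numbers from above:

precision, recall = 0.96, 0.99
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.97, matching the report's f1-score for the positive label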
At this point, it would be up to you to play with the threshold between positive and negative and the evaluation labels to get higher accuracy. Let’s try a set of more descriptive labels and see if we can get a higher accuracy. In the next cell, write the following code:
improved_labels = [
    "A product review with positive sentiment",
    "A product review with negative sentiment",
]
evaluate_classification_labels(improved_labels)
Note how each cell has its own output so you can see the results of the previous labels in the output of the previous cell and the results of these current labels below the current cell. This is the advantage of Jupyter notebooks for these types of data analysis tasks.
              precision    recall  f1-score   support

    Negative       0.96      0.83      0.89        54
    Positive       0.98      1.00      0.99       409

    accuracy                           0.98       463
   macro avg       0.97      0.91      0.94       463
weighted avg       0.98      0.98      0.98       463

[a pretty graph here showing the curve]
You can see our accuracy increased significantly to 98%, and the precision and recall for the positive label are now 98% and 100% respectively. The precision and recall for the negative label are also higher than before, at 96% and 83% respectively. This is because the labels are more descriptive and therefore match the review embeddings more accurately. Remember, this is not a trained machine learning algorithm but a comparison of similarity between the embeddings of our two labels and the embeddings of the reviews in our dataset. We did not train any type of model for these classifications!
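If you'd rather search for the threshold sweet spot systematically than tweak it by hand, a small sweep like the following can help. This is just a sketch; the label_probabilities helper is hypothetical (not part of the tutorial code) and simply reuses get_embedding, cosine_similarity, and df from the cells above:

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical helper: score each review against a pair of labels,
# exactly like the inner logic of evaluate_classification_labels.
def label_probabilities(labels: list[str]) -> pd.Series:
    label_embeddings = [get_embedding(label) for label in labels]
    return df["Embedding"].apply(
        lambda emb: cosine_similarity(emb, label_embeddings[0])
        - cosine_similarity(emb, label_embeddings[1])
    )

probabilities = label_probabilities(improved_labels)
for threshold in np.arange(-0.02, 0.03, 0.01):  # try a handful of cut-off values
    predictions = probabilities.apply(lambda s: "Positive" if s > threshold else "Negative")
    accuracy = accuracy_score(df["Sentiment"], predictions)
    print(f"threshold={threshold:+.2f}  accuracy={accuracy:.3f}")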
Running the classifier on our data
Let’s go to the next cell and write a function to add our predictions to the DataFrame, so we can take a more detailed and visual look at exactly what the predictions are:
def add_prediction_to_df(labels: list[str], threshold=THRESHOLD) -> None:
    """
    This function will add a prediction column to the DataFrame, based on the labels provided.
    """
    label_embeddings = [get_embedding(label) for label in labels]

    def label_score(review_emb, test_label_emb):
        positive_similarity = cosine_similarity(review_emb, test_label_emb[0])
        negative_similarity = cosine_similarity(review_emb, test_label_emb[1])
        return positive_similarity - negative_similarity

    probabilities = df["Embedding"].apply(
        lambda review_emb: label_score(review_emb, label_embeddings)
    )
    df["Prediction"] = probabilities.apply(
        lambda score: "Positive" if score > threshold else "Negative"
    )
This function takes our chosen classification labels as an argument, plus the threshold, which again defaults to the constant we defined up top. The docstring is just for our own reference. We get the label embeddings again using a list comprehension that runs the get_embedding function for every label in labels, passing each label into the call.
The inner function label_score is a copy-paste of what we already wrote above. A quick caveat: if you want to build a reusable module or production code, you should extract this kind of duplicated code into a separate function or class so each piece of logic only exists once. We could merge both functions into a single one with a flag for 'test mode', which returns the test data and graph, versus 'save to DataFrame' mode, but to keep the code easier to explain for tutorial purposes we'll just use a separate function for now. I'll leave the refactoring and removal of duplication as an extra exercise.
We then get the probabilities using the exact same method as above. We take these probabilities and apply a lambda function to them, which takes each score one by one and evaluates to 'Positive' if the score is above our threshold and 'Negative' otherwise. The result is stored in the new DataFrame column 'Prediction'.
Finally, create another cell and write the following code:
add_prediction_to_df(improved_labels)
pd.set_option('display.max_colwidth', None)
print_df = df.drop(columns=["Embedding"])
print_df.head(30)
We call the function to add our predictions to the DataFrame, passing in our two winning labels. We then set a Pandas option to make the printing prettier, as the output will be quite wide, and create a new DataFrame called "print_df", which is a copy of our original DataFrame with the 'Embedding' column dropped, as we don't want to print a million numbers. Then we print the first 30 rows of our DataFrame to see what it looks like. You'll get something like this:
  Sentiment                                         Summ_and_Text  Prediction
0  Positive  Title: Title of review; Content: Content of review.    Positive
1  Negative  Title: Title of review; Content: Content of review.    Negative
Most of these are correct, like number 1 for example:
Id: 1
Sentiment: Negative
Prediction: Negative
Title: Not as Advertised; Content: Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".
In the first 30 results I can actually find only two problematic predictions, the first being:
Id: 3
Sentiment: Negative
Prediction: Positive
Title: Cough Medicine; Content: If you are looking for the secret ingredient in Robitussin I believe I have found it. I got this in addition to the Root Beer Extract I ordered (which was good) and made some cherry soda. The flavor is very medicinal.
It seems the embeddings got confused by the Root Beer Extract, which is described as good and adds positive words to this review but is not the actual product being reviewed, as any human reader would immediately point out. The second problematic prediction I found is actually a case of the model being correct:
Id: 16
Sentiment: Negative
Prediction: Positive
Title: poor taste; Content: I love eating them and they are good for watching TV and looking at movies! It is not too sweet. I like to transfer them to a zip lock baggie so they stay fresh so I can take my time eating them.
Here we can see that the user likely made an error and mixed up their reviews. The embeddings are not wrong here: this is clearly a positive review, as the user 'loves eating them'. The title of 'poor taste' and the Negative rating do not match the words, so the user probably made a mistake writing this review, which the embeddings picked up on. The embeddings are actually correct and our data is wrong on this one!
All the other review sentiment predictions are spot on. That’s pretty impressive for only using embeddings and doing classification without any dataset-specific training data! You can play around with the threshold and the labels to see if you can get even higher accuracy, but I’m pretty happy for now. Again, if you have a massive production-grade environment you’ll need to look into a vector database to store the embeddings instead of CSV files.
That’s it for this part of the tutorial. I’ll see you in the next part where we’ll combine everything we’ve learned so far as we take a look at the next step up and play around with OpenAI assistants.