OpenAI Video Creator (2/6): Crafting Video Scripts with OpenAI’s Image Processing Capabilities

In today’s digital age, the fusion of text and visuals is paramount in creating engaging content. But what if you could elevate your content game by generating video scripts directly from images? Thanks to the advanced capabilities of OpenAI’s Vision API, this is now a reality. Whether you’re a content creator, educator, or just someone with a penchant for innovation, this comprehensive guide will walk you through the exciting process of creating video scripts using images via the OpenAI API.

Step-by-Step Guide to Video Script Creation

1. Getting Started with OpenAI API

Before diving into the image processing capabilities, we need an OpenAI API key. This key is our gateway to the powerful functionalities offered by OpenAI, so we'll store it in a .env file to keep it protected.

Before we proceed, open your Visual Studio Code terminal and install the python-dotenv package. This tool will help us manage our environment variables safely.

pip install python-dotenv

Next, create a new file in your project directory and name it “.env”. In this file, you’ll store your API key as an environment variable. Add the following line to the .env file:

OPENAI_API_KEY=your_actual_api_key_here
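
Since this file holds a secret, make sure it never ends up in version control. If your project uses Git, add the following line to a .gitignore file in the project root:

.env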

2. Setting Up Your Environment

from dotenv import load_dotenv
import base64
import requests
from openai import OpenAI

# Read environment variables (including OPENAI_API_KEY) from the .env file
load_dotenv()

# The client picks up OPENAI_API_KEY from the environment automatically
client = OpenAI()

Here we import the necessary packages and load our environment variables. base64 will encode our image files for the API, and requests is available for raw HTTP calls, although the OpenAI client handles the network traffic for us in this guide. Calling load_dotenv() reads the .env file we just created, and OpenAI() then initializes the client using the key it finds in the environment. This setup ensures a seamless workflow from start to finish.
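
If you'd like to confirm the key was loaded before going further, an optional check (purely illustrative) is:

import os

# load_dotenv() should have made the key visible to the process; fail fast if not
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY not found - check your .env file"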

3. Encoding Images

# Path to your image
image_path_1 = "./images/axolotl_1.jpg"
image_path_2 = "./images/axolotl_2.jpg"

# Function to encode the image
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')

# Getting the base64 string
base64_image_1 = encode_image(image_path_1)
base64_image_2 = encode_image(image_path_2)

Next, we define the paths to our images and encode each one into a base64 string. This encoding is pivotal for making the image data compatible with the OpenAI API. Once encoded, we are ready to generate text completions. Here's a breakdown of the code:

  • image_path_1 = "./images/axolotl_1.jpg" and image_path_2 = "./images/axolotl_2.jpg": These lines define the file paths to the two image files that will be used as input for the script generation process.
  • The encode_image() function takes an image file path as input and returns the image data encoded in base64 format. This is necessary because the OpenAI API expects the image data to be sent in this format.
  • base64_image_1 = encode_image(image_path_1) and base64_image_2 = encode_image(image_path_2): These lines call the encode_image() function to convert the image data into base64 format, which will be used in the subsequent steps.
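
As a quick, optional sanity check, you can print the length and a short preview of each encoded string; a valid base64 string is non-empty, plain ASCII text:

# Preview the encoded data; a JPEG usually starts with "/9j/" in base64
print(len(base64_image_1), base64_image_1[:40])
print(len(base64_image_2), base64_image_2[:40])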

4. Generating Text Completions

response = client.chat.completions.create(
  model="gpt-4o",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "You are a prominent script writer for YouTube videos. Take the info from the images and make a one-minute-long video script for me showing the unknown facts of the axolotl.",
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{base64_image_1}"
          }
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{base64_image_2}"
          }
        },
      ],
    }
  ],
  max_tokens=300,
)

In this step, we provide a prompt describing the task along with the encoded image data. The input is a list of messages that can mix text and image items, and max_tokens caps the response at 300 tokens. It's advisable to first run a simpler prompt, for example asking the model just to describe the images, to verify that it reads the image data correctly; a sketch of such a check follows the breakdown below. Let's break it down:

  • response = client.chat.completions.create(...): This line calls the OpenAI API’s chat completion endpoint to generate text based on the provided input.
  • model="gpt-4o": This specifies the model to be used for generating the response. "gpt-4o" is OpenAI's multimodal model, capable of processing both text and images in a single request.
  • messages=[...]: This is a list containing the input messages for the chat completion.

The message structure:

  • "role": "user": Indicates that this message is from the user.
  • "content": [...]: A list containing different types of content (text and images).

The content list includes:

  • A text item with instructions for the AI: it's asked to act as a YouTube script writer and create a one-minute video script about unknown facts of the axolotl, based on the provided images.
  • Two image items, each containing a base64-encoded image wrapped in a data URL. These are the axolotl images that were encoded earlier in the script.

Finally, max_tokens=300 limits the length of the generated response to 300 tokens.
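
As suggested above, it's worth running a cheaper verification request first, to confirm the model actually sees the images before asking for the full script. Here's a minimal sketch; the prompt wording and token limit are illustrative choices:

# Optional sanity check: ask the model only to describe the images
check = client.chat.completions.create(
  model="gpt-4o",
  messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Briefly describe what you see in these images."},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image_1}"}},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image_2}"}},
      ],
    }
  ],
  max_tokens=100,
)
print(check.choices[0].message.content)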

5. Multi-Image Processing

For projects requiring multiple images, encode each image into base64 and keep the results in a dictionary for easy management. You can then use prompts to extract detailed information about your subjects, in this case axolotls, and generate varied video scripts, including descriptions of visuals and background music. We encoded two images above, but the same pattern scales to many.
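
Here's a minimal sketch of that dictionary pattern, reusing the encode_image() function from Step 3; the file names and prompt wording are illustrative:

# Keep image paths in a dictionary so the set of inputs is easy to manage
image_paths = {
  "axolotl_1": "./images/axolotl_1.jpg",
  "axolotl_2": "./images/axolotl_2.jpg",
}
encoded_images = {name: encode_image(path) for name, path in image_paths.items()}

# Build the content list programmatically: one text prompt plus every image
content = [{
  "type": "text",
  "text": "Take the info from the images and write a one-minute video script on unknown facts of the axolotl, including descriptions of visuals and background music.",
}]
for b64 in encoded_images.values():
  content.append({"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})

response = client.chat.completions.create(
  model="gpt-4o",
  messages=[{"role": "user", "content": content}],
  max_tokens=300,
)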

Whether you hard-code two images or build the list dynamically, the purpose of this code is to:

  1. Send a request to the OpenAI API with both text instructions and image data.
  2. Ask the AI to analyze the images of axolotls.
  3. Generate a short video script about unknown facts of axolotls based on the information in the images.

This demonstrates the use of multimodal AI capabilities, where the model can process both text and image inputs to generate text output. The AI is expected to “look” at the axolotl images, extract relevant information, and use that to create an informative and engaging video script about axolotls.

After this code runs, the response variable will contain the AI’s generated script, which can then be extracted and used as needed.

print(response.choices[0])

Printing the choice object this way shows the full response structure, with the script embedded as one long string full of escaped newline characters. To extract and display just the text content, without the unnecessary formatting characters, use the following code:

print(response.choices[0].message.content)
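
From here you can use the script however you like. As one example, this small sketch saves it to a text file (the file name is an arbitrary choice):

# Save the generated script so it can be edited or reused later
script = response.choices[0].message.content
with open("axolotl_script.txt", "w", encoding="utf-8") as f:
  f.write(script)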

Why This Matters

This innovative approach transforms how we create and consume content. By leveraging the OpenAI API’s image processing capabilities, you can automate and enhance your content creation process, saving time and increasing engagement. Whether for educational purposes, marketing, or pure entertainment, this method opens new horizons for creativity and efficiency.
