OpenAI Video Creator (6/6): Building a Fully Automated Video Creation System Using OpenAI & Python

Remember the last time you spent hours hunched over your computer, piecing together a video? The late nights, the coffee runs, the frustration of perfecting every frame? Now imagine typing a few words into your computer, hitting enter, and voila! A fully formed video appears, as if by magic. Sounds like science fiction, right? In this article, we’ll explore how to build an automated video creation pipeline that transforms user input into a complete video production, combining all the stages we covered in the last few posts of this series.

Step 1: Setting Up the Environment

To start, ensure that your environment is set up with the necessary dependencies and that your API key is loaded safely. You can install everything with pip install openai python-dotenv moviepy requests; note that the moviepy.editor import below comes from the moviepy 1.x API.

import os
from dotenv import load_dotenv
from openai import OpenAI
from moviepy.editor import ImageClip, concatenate_videoclips, AudioFileClip
import requests

# Load environment variables from .env; OpenAI() reads OPENAI_API_KEY automatically
load_dotenv()
client = OpenAI()
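For reference, the .env file sitting next to your script (an assumed setup; any standard dotenv location works) only needs to contain your key:

OPENAI_API_KEY=sk-...

Once load_dotenv() has run, the OpenAI client picks the key up from the environment automatically, so it never needs to appear in your source code.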

Step 2: Generating the Video Script with GPT-4o

This step is responsible for generating the textual content that serves as the video script. The function below uses GPT-4o to create paragraphs based on a user-provided topic.

def generate_paragraphs(prompt):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": f"Create detailed paragraphs for a voiceover based on the following topic.I want the paragraphs only. no numbers or instruction included.: {prompt}"
            }
        ],
        max_tokens=500,
    )
    paragraphs = response.choices[0].message.content
    print("Generated paragraphs:", paragraphs)
    return paragraphs

This function is where the magic happens:

  1. It takes a prompt as input.
  2. It sends a request to the GPT-4o model via the OpenAI API.
  3. The prompt is cleverly constructed to generate detailed paragraphs suitable for a voiceover, based on the given topic.
  4. The function limits the response to 500 tokens to keep the content concise.
  5. It extracts the generated text from the API response and prints it.
  6. Finally, it returns the generated paragraphs.
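If you want to test this step on its own, a one-off call is all it takes (the topic string here is just an illustrative example):

paragraphs = generate_paragraphs("The history of space exploration")

The function returns the paragraphs as a single string, which is exactly what the cleaning step in the next section expects.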

Step 3: Cleaning the Text

When creating audio content, such as podcasts or audiobooks, it’s essential to ensure the text is clean and free from unnecessary elements that could disrupt the listening experience.

def clean_paragraphs(paragraphs):
    # Split into lines, strip whitespace, drop empty lines, and rejoin into one string
    cleaned_paragraphs = [p.strip() for p in paragraphs.split('\n') if p.strip()]
    cleaned_text = ' '.join(cleaned_paragraphs)
    print("Cleaned paragraphs for voiceover:", cleaned_text)
    return cleaned_text

The clean_paragraphs function takes a string as input, representing the text to be cleaned. It performs the following steps:

  1. Splitting: The input text is split into individual paragraphs using the newline character (\n).
  2. Whitespace Removal: Each paragraph is stripped of leading and trailing whitespace.
  3. Filtering: Empty paragraphs (containing only whitespace) are discarded.
  4. Joining: The cleaned paragraphs are joined together into a single string using spaces.
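A quick worked example (the input string is hypothetical) makes this concrete:

text = "First paragraph.\n\n   Second paragraph.  \n"
cleaned = clean_paragraphs(text)
# prints: Cleaned paragraphs for voiceover: First paragraph. Second paragraph.
# cleaned == "First paragraph. Second paragraph."

Note that paragraph boundaries are lost in the joined string; for a voiceover that’s fine, since the narrator reads straight through.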

Step 4: Dynamic Image Prompts for Visual Inspiration

This step dynamically creates prompts for image generation based on the user’s topic; DALL-E-3 will later use these prompts to create the images.

def generate_image_prompts(user_prompt, num_prompts=5):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": f"Create {num_prompts} distinct image prompts for visually representing: {user_prompt}"
            }
        ],
        max_tokens=300,  # leave enough room for several complete prompts
    )
    prompts_text = response.choices[0].message.content
    # One prompt per line; drop blank lines
    prompts = [prompt.strip() for prompt in prompts_text.split('\n') if prompt.strip()]

    # Trim or pad so we always return exactly num_prompts entries
    if len(prompts) > num_prompts:
        prompts = prompts[:num_prompts]
    elif len(prompts) < num_prompts:
        prompts.extend(['Repeat for variation'] * (num_prompts - len(prompts)))

    print("Generated image prompts:", prompts)
    return prompts

Want just a few key image prompts, or a wider selection for creative exploration? No problem! This function lets you specify the desired number of prompts, giving you complete control over the volume of visual inspiration at your disposal.

Here’s a breakdown of how the function works:

  1. User Input: You provide your desired topic.
  2. GPT-4o Integration: The function uses GPT-4o to generate a set number of unique image prompts that effectively represent your topic.
  3. Customizable Quantity: You can choose how many prompts you’d like to receive, ensuring you have the perfect amount of creative fuel.
  4. Dynamic Adjustment: The function automatically adjusts the number of prompts if the initial generation doesn’t match your request. If you ask for 5 prompts but GPT-4o offers 7, the list is trimmed down to 5. Conversely, if fewer are generated, generic placeholder prompts are appended to reach the desired quantity.
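One caveat: GPT-4o often returns its prompts as a numbered list, and those leading markers would be passed to DALL-E-3 verbatim. A small regex cleanup (a hedged addition, not part of the original pipeline) can strip them before image generation:

import re

def strip_list_markers(prompts):
    # Remove leading "1.", "2)", "-" or "*" style markers from each prompt
    return [re.sub(r'^\s*(?:\d+[.)]|[-*])\s*', '', p) for p in prompts]

You would call this on the list returned by generate_image_prompts before handing the prompts to DALL-E-3.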

Step 5: From Prompt to Picture: Witness the Magic

Once prompts are ready, images are generated using DALL-E-3 and saved for later use in the video.

def generate_image(prompt, output_filename):
    response = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size="1024x1024",
        quality="standard",
        n=1,
    )
    image_url = response.data[0].url
    # Download the generated image; fail fast if the request doesn't succeed
    image_response = requests.get(image_url)
    image_response.raise_for_status()
    image_data = image_response.content

    with open(output_filename, "wb") as f:
        f.write(image_data)
    print(f"Image saved to {output_filename}")

This step utilizes DALL-E-3, a cutting-edge AI image generation system. Here’s how it works:

  • Feeding the Prompt: The function takes one of your carefully crafted image prompts as input.
  • DALL-E-3 Takes Action: DALL-E-3 analyzes your prompt and utilizes its vast knowledge to generate a corresponding image.
  • Specifying Preferences: The request explicitly sets the image size (1024×1024 pixels), the quality (standard), and the number of images to generate (just one here).
  • Downloading the Masterpiece: The generated image URL is retrieved from the DALL-E-3 response.
  • Saving Your Creation: The function uses the requests library to download the image data from the URL and saves it to a file with the specified output filename. (A variant that skips the download step entirely follows this list.)
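If you would rather not make a second HTTP request, the Images API can return the image bytes directly as base64. Here’s a sketch of that variant (same parameters as above, plus response_format):

import base64

def generate_image_b64(prompt, output_filename):
    response = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size="1024x1024",
        quality="standard",
        response_format="b64_json",  # return base64 data instead of a URL
        n=1,
    )
    # Decode the base64 payload and write it straight to disk
    image_data = base64.b64decode(response.data[0].b64_json)
    with open(output_filename, "wb") as f:
        f.write(image_data)
    print(f"Image saved to {output_filename}")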

Step 6: From Text to Voice: The Power of AI Narration

The following function takes the cleaned text and generates an audio file using OpenAI’s text-to-speech capabilities.

def generate_voiceover(text, output_filepath):
    response = client.audio.speech.create(
        model="tts-1-hd",
        voice="fable",
        input=text
    )
    response.stream_to_file(output_filepath)
    print(f"Voiceover saved to {output_filepath}")

Here’s a breakdown of the magic behind it:

  1. Text Input: You provide the text you want to be narrated as input for the function.
  2. AI Narrator at Your Service: The function utilizes an AI TTS model (here, "tts-1-hd") to convert your text into high-quality audio speech.
  3. Choosing Your Voice (Optional): Depending on the specific API implementation, you might have the option to select a specific voice for the narration (here, "fable"). This allows you to tailor the voiceover to match the tone and style of your project.
  4. Streamlined Downloading: The function leverages the stream_to_file method to efficiently download the generated audio directly to your designated output file.
  5. Confirmation Message: Once complete, the function informs you of the filename where your voiceover masterpiece is saved.
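One practical note: recent versions of the openai Python library flag stream_to_file on the plain response as deprecated. If you see that warning, the streaming-response pattern from the library’s documentation does the same job:

def generate_voiceover_streaming(text, output_filepath):
    # Stream the synthesized audio straight to disk as it is generated
    with client.audio.speech.with_streaming_response.create(
        model="tts-1-hd",
        voice="fable",
        input=text,
    ) as response:
        response.stream_to_file(output_filepath)
    print(f"Voiceover saved to {output_filepath}")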

Step 7: Synchronizing Images and Audio

The final step in our journey is to bring together your AI-generated images and voiceover into a cohesive video. This function takes your carefully curated images and the audio file you’ve created and combines them into a seamless video presentation.

def create_video(images, audio_path, output_video_path):
    audio = AudioFileClip(audio_path)
    audio_duration = audio.duration
    image_duration = audio_duration / len(images)

    clips = [ImageClip(img).set_duration(image_duration) for img in images]
    video_clip = concatenate_videoclips(clips, method="compose")
    video_clip = video_clip.set_audio(audio)

    fps = 24
    video_clip.write_videofile(output_video_path, codec="libx264", audio_codec="aac", fps=fps)
    print(f"Video created at {output_video_path}")

Here’s a breakdown of how this function works:

  1. Audio Analysis: The function starts by analyzing the audio file to determine its duration.
  2. Image Duration Calculation: The total duration of the audio is then divided by the number of images to calculate the appropriate duration for each image.
  3. Image Clips Creation: Each image is converted into a video clip with the calculated duration.
  4. Concatenating Clips: All the image clips are joined together to create a single video sequence.
  5. Adding Audio: The audio file is added to the video sequence as the soundtrack.
  6. Video Export: Finally, the combined video is exported to a file with the specified output path. The function uses the libx264 codec for video encoding and the aac codec for audio encoding, with a default frame rate of 24 frames per second.
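The hard cuts between slides can feel abrupt. As an optional tweak (not part of the original function, and specific to the moviepy 1.x API used throughout), you can overlap the clips with a short crossfade:

def create_video_with_fades(images, audio_path, output_video_path, fade=0.5):
    audio = AudioFileClip(audio_path)
    n = len(images)
    # Lengthen each slide to compensate for the overlap, so the
    # slideshow still matches the audio duration
    image_duration = (audio.duration + (n - 1) * fade) / n

    # Each clip fades in; negative padding makes consecutive clips overlap
    clips = [ImageClip(img).set_duration(image_duration).crossfadein(fade)
             for img in images]
    video_clip = concatenate_videoclips(clips, method="compose", padding=-fade)
    video_clip = video_clip.set_audio(audio)
    video_clip.write_videofile(output_video_path, codec="libx264",
                               audio_codec="aac", fps=24)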

Bringing It All Together: The main() Function

The main() function serves as the central hub that orchestrates the entire video creation process. It acts as the entry point for your application, guiding the flow of execution and ensuring that each step is performed seamlessly.

def main():
    user_prompt = input("Please describe the topic you want to create a video for: ")

    # Generate the paragraphs for the video voiceover using GPT-4o based on the user prompt
    paragraphs = generate_paragraphs(user_prompt)

    # Clean the paragraphs for smooth voiceover narration
    cleaned_text = clean_paragraphs(paragraphs)

    # Generate image prompts based on the user prompt
    image_prompts = generate_image_prompts(user_prompt, num_prompts=5)

    os.makedirs("./vid_images2", exist_ok=True)
    os.makedirs("./voice2", exist_ok=True)

    # One output filename per generated prompt
    image_files = [f"./vid_images2/vid_image{i+1}.png" for i in range(len(image_prompts))]

    # Generate images based on generated prompts
    for prompt, filename in zip(image_prompts, image_files):
        generate_image(prompt, filename)

    voiceover_path = "./voice2/speech.mp3"
    generate_voiceover(cleaned_text, voiceover_path)

    create_video(image_files, voiceover_path, "output_video.mp4")

if __name__ == "__main__":
    main()

A Step-by-Step Breakdown

  1. User Input: The function begins by prompting the user to enter the desired topic for the video. This input will serve as the foundation for the subsequent steps.
  2. Text Generation: Using the generate_paragraphs function, the main() function generates a set of paragraphs based on the user’s topic. These paragraphs will form the basis of the video’s voiceover.
  3. Text Cleaning: To ensure a smooth and natural-sounding voiceover, the clean_paragraphs function is called to remove any unnecessary whitespace or formatting from the generated paragraphs.
  4. Image Prompt Generation: The generate_image_prompts function is invoked to create a series of image prompts that align with the user’s topic. These prompts will be used to guide the image generation process.
  5. Directory Creation: To organize the generated images and voiceover files, the function creates necessary directories if they don’t already exist.
  6. Image Generation: A loop iterates through the generated image prompts and filenames, calling the generate_image function to create corresponding images and save them to the specified directory.
  7. Voiceover Generation: The generate_voiceover function is called to create the voiceover based on the cleaned text and save it to a designated file.
  8. Video Creation: Finally, the create_video function is invoked to combine the generated images and voiceover into a cohesive video, saving the final output to a specified file.

The Heart of the Operation

The main() function acts as the conductor, ensuring that each step of the video creation process is executed in the correct order and with the necessary inputs. It provides a clear and structured flow, making the entire application more manageable and efficient.

Now that the code is ready, you can run it! It will prompt you to enter a topic; then sit back and watch as the magic unfolds.
