Remember the last time you spent hours hunched over your computer, piecing together a video? The late nights, the coffee runs, the frustration of perfecting every frame? Now imagine typing a few words, hitting enter, and voila: a fully formed video appears, as if by magic. Sounds like science fiction, right? In this article, we'll explore how to build an automated video creation pipeline that transforms user input into a complete video production. Along the way, we will combine all the stages we covered in the last few posts.
Step 1: Setting Up the Environment
To start, make sure your environment is set up with the necessary dependencies and that your API keys are loaded safely.
```python
import os
from dotenv import load_dotenv
from openai import OpenAI
from moviepy.editor import ImageClip, concatenate_videoclips, AudioFileClip
import requests

# Load environment variables (expects OPENAI_API_KEY in a .env file)
load_dotenv()
client = OpenAI()
```
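As an optional guard, you can fail fast before making any API calls. This is just a sketch: OpenAI() falls back to the OPENAI_API_KEY environment variable, which load_dotenv() populates from your .env file.

```python
# Optional sanity check: fail fast if the API key was not loaded.
# OpenAI() reads OPENAI_API_KEY from the environment by default.
if not os.getenv("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY is not set; add it to your .env file.")
```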

Step 2: Generating Video Script with GPT-4o
This step generates the textual content that serves as the video script. The function uses GPT-4o to create paragraphs based on a user-provided topic.
```python
def generate_paragraphs(prompt):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": (
                    "Create detailed paragraphs for a voiceover based on the "
                    "following topic. I want the paragraphs only, with no "
                    f"numbering or instructions included: {prompt}"
                ),
            }
        ],
        max_tokens=500,
    )
    paragraphs = response.choices[0].message.content
    print("Generated paragraphs:", paragraphs)
    return paragraphs
```
This function is where the magic happens:
- It takes a prompt as input.
- It sends a request to the GPT-4o model via the OpenAI API.
- The prompt is cleverly constructed to generate detailed paragraphs suitable for a voiceover, based on the given topic.
- The function limits the response to 500 tokens to keep the content concise.
- It extracts the generated text from the API response and prints it.
- Finally, it returns the generated paragraphs (a minimal usage sketch follows below).
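Here is a minimal usage sketch, assuming the Step 1 setup has already run; the topic string is just an illustrative example.

```python
# Minimal usage sketch; any network or auth failure from the API
# surfaces as an exception, so we exit with a readable message.
try:
    script = generate_paragraphs("the history of the espresso machine")
except Exception as exc:
    raise SystemExit(f"Script generation failed: {exc}")
```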
Step 3: Cleaning the Text
When creating audio content, such as podcasts or audiobooks, it’s essential to ensure the text is clean and free from unnecessary elements that could disrupt the listening experience.

```python
def clean_paragraphs(paragraphs):
    # Split on newlines, strip whitespace, drop empty lines, and rejoin
    cleaned_paragraphs = [p.strip() for p in paragraphs.split('\n') if p.strip()]
    cleaned_text = ' '.join(cleaned_paragraphs)
    print("Cleaned paragraphs for voiceover:", cleaned_text)
    return cleaned_text
```
The clean_paragraphs function takes a string as input, representing the text to be cleaned, and performs the following steps:
- Splitting: The input text is split into individual paragraphs using the newline character (\n).
- Whitespace Removal: Each paragraph is stripped of leading and trailing whitespace.
- Filtering: Empty paragraphs (containing only whitespace) are discarded.
- Joining: The cleaned paragraphs are joined together into a single string using spaces (see the worked example below).
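To make the behavior concrete, here is a tiny worked example; it involves no API calls, just string handling.

```python
# Typical GPT output: blank lines between paragraphs, stray indentation
raw = "  First paragraph.  \n\n   Second paragraph.\n"
cleaned = clean_paragraphs(raw)
# cleaned == "First paragraph. Second paragraph."
```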
Step 4: Dynamic Image Prompts for Visual Inspiration
This step involves creating prompts dynamically for image generation based on the user topic, which DALL-E-3 will later use to create images.
```python
def generate_image_prompts(user_prompt, num_prompts=5):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": f"Create {num_prompts} distinct image prompts for visually representing: {user_prompt}"
            }
        ],
        max_tokens=100,
    )
    prompts_text = response.choices[0].message.content
    prompts = [prompt.strip() for prompt in prompts_text.split('\n') if prompt.strip()]
    # Force the list to exactly num_prompts entries
    if len(prompts) > num_prompts:
        prompts = prompts[:num_prompts]
    elif len(prompts) < num_prompts:
        prompts.extend(['Repeat for variation'] * (num_prompts - len(prompts)))
    print("Generated image prompts:", prompts)
    return prompts
```

Want just a few key image prompts, or a wider selection for creative exploration? No problem: this function lets you specify the desired number of prompts, giving you complete control over the volume of visual inspiration at your disposal.
Here’s a breakdown of how the function works:
- User Input: You provide your desired topic.
- GPT-4o Integration: The function uses GPT-4o to generate a set number of unique image prompts that effectively represent your topic.
- Customizable Quantity: You can choose how many prompts you’d like to receive, ensuring you have the perfect amount of creative fuel.
- Dynamic Adjustment: The function automatically adjusts the number of prompts if the initial generation doesn't match your request. If you ask for 5 prompts but GPT-4o returns 7, the list is trimmed to 5; conversely, if fewer are generated, generic placeholder prompts are appended to reach the desired quantity (see the sketch after this list).
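To see the trim-or-pad behavior in isolation, here is the same logic pulled out into a hypothetical fit_prompts helper; it is not part of the pipeline itself, just an illustration.

```python
def fit_prompts(prompts, num_prompts):
    # Trim extras, or pad with generic placeholder prompts
    if len(prompts) > num_prompts:
        return prompts[:num_prompts]
    return prompts + ['Repeat for variation'] * (num_prompts - len(prompts))

print(fit_prompts(['a', 'b', 'c'], 2))  # ['a', 'b']
print(fit_prompts(['a'], 3))            # ['a', 'Repeat for variation', 'Repeat for variation']
```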
Step 5: From Prompt to Picture: Witness the Magic
Once prompts are ready, images are generated using DALL-E-3 and saved for later use in the video.
```python
def generate_image(prompt, output_filename):
    response = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size="1024x1024",
        quality="standard",
        n=1,
    )
    # Download the generated image and save it to disk
    image_url = response.data[0].url
    image_data = requests.get(image_url).content
    with open(output_filename, "wb") as f:
        f.write(image_data)
    print(f"Image saved to {output_filename}")
```
This step utilizes DALL-E-3, a cutting-edge AI image generation system. Here’s how it works:

- Feeding the Prompt: The function takes one of your carefully crafted image prompts as input.
- DALL-E-3 Takes Action: DALL-E-3 analyzes your prompt and utilizes its vast knowledge to generate a corresponding image.
- Specifying Preferences: The call explicitly sets the image size (1024×1024 pixels), the quality ("standard"), and the number of images to generate (just one here).
- Downloading the Masterpiece: The generated image URL is retrieved from the DALL-E-3 response.
- Saving Your Creation: The function uses the requests library to download the image data from the URL, then saves it to a file with the specified output filename (a slightly hardened variant follows below).
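One caveat: requests.get can fail or return an error page, and the code above would happily write that to disk as a .png. A slightly hardened download step might look like this sketch; the 60-second timeout is an arbitrary assumption.

```python
def download_image(image_url, output_filename):
    # raise_for_status turns HTTP errors into exceptions instead of
    # silently saving an error response as an image file
    resp = requests.get(image_url, timeout=60)
    resp.raise_for_status()
    with open(output_filename, "wb") as f:
        f.write(resp.content)
```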
Step 6: From Text to Voice: The Power of AI Narration
The following function takes the cleaned text and generates an audio file using OpenAI’s text-to-speech capabilities.
```python
def generate_voiceover(text, output_filepath):
    response = client.audio.speech.create(
        model="tts-1-hd",
        voice="fable",
        input=text
    )
    response.stream_to_file(output_filepath)
    print(f"Voiceover saved to {output_filepath}")
```
Here’s a breakdown of the magic behind it:
- Text Input: You provide the text you want to be narrated as input for the function.
- AI Narrator at Your Service: The function utilizes an AI TTS model (here, tts-1-hd) to convert your text into high-quality audio speech.
- Choosing Your Voice (Optional): Depending on the specific API implementation, you might have the option to select a specific voice for the narration (here, fable). This allows you to tailor the voiceover to match the tone and style of your project.
- Streamlined Downloading: The function leverages the stream_to_file method to write the generated audio directly to your designated output file.
- Confirmation Message: Once complete, the function informs you of the filename where your voiceover is saved. (A sketch for handling longer scripts follows this list.)
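One practical caveat: the speech endpoint caps the input length (the documented limit is around 4096 characters, but treat that as an assumption and check the current docs). For longer scripts, a hedged workaround is to split the text at sentence boundaries, synthesize each chunk, and join the audio with moviepy, which the pipeline already depends on.

```python
import re
from moviepy.editor import AudioFileClip, concatenate_audioclips

MAX_TTS_CHARS = 4000  # assumed safety margin under the API's input limit

def generate_long_voiceover(text, output_filepath):
    # Split at sentence boundaries so no chunk exceeds the limit
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks, current = [], ""
    for sentence in sentences:
        if len(current) + len(sentence) + 1 > MAX_TTS_CHARS:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)

    # Synthesize each chunk to its own temporary mp3
    part_files = []
    for i, chunk in enumerate(chunks):
        part_path = f"{output_filepath}.part{i}.mp3"
        response = client.audio.speech.create(
            model="tts-1-hd", voice="fable", input=chunk
        )
        response.stream_to_file(part_path)
        part_files.append(part_path)

    # Join the parts into a single audio track
    clips = [AudioFileClip(p) for p in part_files]
    concatenate_audioclips(clips).write_audiofile(output_filepath)
```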
Step 7: Synchronizing Images and Audio
The final step in our journey is to bring together your AI-generated images and voiceover into a cohesive video. This function takes your carefully curated images and the audio file you’ve created and combines them into a seamless video presentation.

```python
def create_video(images, audio_path, output_video_path):
    audio = AudioFileClip(audio_path)
    audio_duration = audio.duration
    # Each image gets an equal share of the total audio duration
    image_duration = audio_duration / len(images)
    clips = [ImageClip(img).set_duration(image_duration) for img in images]
    video_clip = concatenate_videoclips(clips, method="compose")
    video_clip = video_clip.set_audio(audio)
    fps = 24
    video_clip.write_videofile(output_video_path, codec="libx264",
                               audio_codec="aac", fps=fps)
    print(f"Video created at {output_video_path}")
```
Here’s a breakdown of how this function works:
- Audio Analysis: The function starts by analyzing the audio file to determine its duration.
- Image Duration Calculation: The total duration of the audio is then divided by the number of images to calculate the appropriate duration for each image.
- Image Clips Creation: Each image is converted into a video clip with the calculated duration.
- Concatenating Clips: All the image clips are joined together to create a single video sequence.
- Adding Audio: The audio file is added to the video sequence as the soundtrack.
- Video Export: Finally, the combined video is exported to the specified output path. The function uses the libx264 codec for video encoding and the aac codec for audio encoding, with a frame rate of 24 frames per second (see the crossfade variant below for an optional polish).
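As an optional polish, moviepy can crossfade between slides instead of hard cutting. This sketch assumes moviepy 1.x, where crossfadein combined with a negative padding in concatenate_videoclips produces overlapping transitions; the one-second fade length is an arbitrary choice.

```python
def create_video_with_fades(images, audio_path, output_video_path, fade=1):
    audio = AudioFileClip(audio_path)
    image_duration = audio.duration / len(images)
    # Extend each clip by the fade length so the overlaps don't shorten it
    clips = [
        ImageClip(img).set_duration(image_duration + fade).crossfadein(fade)
        for img in images
    ]
    video_clip = concatenate_videoclips(clips, method="compose", padding=-fade)
    video_clip = video_clip.set_audio(audio).set_duration(audio.duration)
    video_clip.write_videofile(output_video_path, codec="libx264",
                               audio_codec="aac", fps=24)
```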
Bringing It All Together: The main() Function
The main() function serves as the central hub that orchestrates the entire video creation process. It acts as the entry point for the application, guiding the flow of execution and ensuring that each step is performed seamlessly.

```python
def main():
    user_prompt = input("Please describe the topic you want to create a video for: ")

    # Generate the paragraphs for the video voiceover using GPT-4o
    paragraphs = generate_paragraphs(user_prompt)

    # Clean the paragraphs for smooth voiceover narration
    cleaned_text = clean_paragraphs(paragraphs)

    # Generate image prompts based on the user prompt
    image_prompts = generate_image_prompts(user_prompt, num_prompts=5)

    os.makedirs("./vid_images2", exist_ok=True)
    os.makedirs("./voice2", exist_ok=True)
    image_files = [f"./vid_images2/vid_image{i+1}.png" for i in range(5)]

    # Generate images based on the generated prompts
    for prompt, filename in zip(image_prompts, image_files):
        generate_image(prompt, filename)

    voiceover_path = "./voice2/speech.mp3"
    generate_voiceover(cleaned_text, voiceover_path)

    create_video(image_files, voiceover_path, "output_video.mp4")

if __name__ == "__main__":
    main()
```
A Step-by-Step Breakdown
- User Input: The function begins by prompting the user to enter the desired topic for the video. This input will serve as the foundation for the subsequent steps.
- Text Generation: Using the generate_paragraphs function, main() generates a set of paragraphs based on the user's topic. These paragraphs will form the basis of the video's voiceover.
- Text Cleaning: To ensure a smooth and natural-sounding voiceover, the clean_paragraphs function is called to remove any unnecessary whitespace or formatting from the generated paragraphs.
- Image Prompt Generation: The generate_image_prompts function is invoked to create a series of image prompts that align with the user's topic. These prompts guide the image generation process.
- Directory Creation: To organize the generated images and voiceover files, the function creates the necessary directories if they don't already exist.
- Image Generation: A loop iterates through the generated image prompts and filenames, calling the generate_image function to create corresponding images and save them to the specified directory.
- Voiceover Generation: The generate_voiceover function is called to create the voiceover from the cleaned text and save it to a designated file.
- Video Creation: Finally, the create_video function is invoked to combine the generated images and voiceover into a cohesive video, saving the final output to a specified file.
The Heart of the Operation
The main() function acts as the conductor, ensuring that each step of the video creation process is executed in the correct order and with the necessary inputs. It provides a clear, structured flow that makes the entire application easier to manage.
Now that the code is ready, run it, enter a topic when prompted, and watch as the magic unfolds.