OpenAI Video Creator (4/6): OpenAI’s Text-to-Speech

So far in this project, we’ve created the images for our video. Next on our to-do list is a compelling voiceover to accompany those visuals. A high-quality voiceover can lift any piece of content, whether you’re crafting an engaging video, an informative podcast, or an immersive audiobook. That’s where OpenAI’s Text-to-Speech (TTS) capabilities come into play. In this part, we show how we used them to create the audio for our video.

Why OpenAI Text-to-Speech?

OpenAI offers an intuitive way to convert text into speech through its advanced TTS model. Specifically, the audio API provides a speech endpoint designed for TTS. Here’s a quick overview of what it offers:

1. Six Built-In Voices: Choose from a variety of pre-built voices.
2. Multilingual Support: Convert text to speech in over 50 languages.
3. Quality Options: Select between a model optimized for real-time use (tts-1) and a high-definition (HD) model (tts-1-hd).
4. Streaming Capabilities: Produce real-time audio output using streaming.

Let’s walk through the process of setting up and using OpenAI’s TTS for our project.

Initial Setup

To find the text-to-speech functionality in the OpenAI API, go to the official OpenAI API documentation and look for the section labeled “Text to Speech”, which is typically found under “Docs”. That page links to the guide and the API reference for the speech endpoint, which explain how to convert text into speech.

The speech endpoint takes three key inputs:

1. Model: The TTS model to be used (e.g., tts-1 or tts-1-hd).
2. Text: The content that needs to be turned into audio, passed as the input parameter.
3. Voice: The voice variant for the audio generation.

To get started, we need to set up our coding environment in Visual Studio. This integrated development environment (IDE) provides the tools you need to write and test your code efficiently. Once your environment is ready, importing the necessary libraries is your next step. These libraries are the backbone of our TTS project, enabling you to harness the full power of OpenAI’s capabilities.

Creating a path for your output audio file is essential. This step ensures that once your audio is generated, it is stored in an easily accessible location on your device.

Diving into the Code

from pathlib import Path
from dotenv import load_dotenv
from openai import OpenAI
load_dotenv()
client = OpenAI()

Here’s what each of these lines does:

We import Path from pathlib to handle file paths in a cross-platform manner.

load_dotenv() is used to load environment variables from a .env file. This is where you’d typically store your OpenAI API key.

We import the OpenAI client from the openai package to interact with the OpenAI API. The last line creates an instance of the OpenAI client. The client will use the API key stored in our environment variables.
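If the key is missing, the client cannot authenticate, and the resulting error can be confusing the first time you see it. Here is a minimal sketch of an early check, assuming the key is stored under the standard OPENAI_API_KEY variable. The helper name require_api_key is our own and not part of the OpenAI package.

```python
import os


def require_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or raise a readable error."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; add it to your .env file or shell environment."
        )
    return key
```

Calling require_api_key() right after load_dotenv() makes a missing or empty key fail immediately, before any API request is attempted.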

Configuring the Audio Output

speech_file_path = Path("./voiceover/speech3.mp3")

This line sets the path where our generated MP3 file will be saved. Make sure the directory exists!
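Rather than creating the folder by hand, you can let the script do it. This is a small sketch using only the standard library; the folder and file names are the ones from our example.

```python
from pathlib import Path

speech_file_path = Path("./voiceover/speech3.mp3")

# Create the parent folder (and any missing ancestors).
# exist_ok=True means nothing happens if the folder is already there.
speech_file_path.parent.mkdir(parents=True, exist_ok=True)
```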

Making the API Call

response = client.audio.speech.create(
  model="tts-1-hd",
  voice="fable",
  input="Fact two: These little guys are native to just one place on Earth – Lake Xochimilco in Mexico! And guess what? They're carnivores, chomping down on worms and insects for their meals."
)

Now comes the exciting part – generating the speech. Let’s break this down:

1. We call the create method on client.audio.speech, the client’s interface to the speech endpoint.
2. We specify the model as "tts-1-hd", which is OpenAI’s high-definition TTS model.
3. We choose the voice "fable". OpenAI offers several voice options, each with its own character.
4. The input parameter is where we provide the text we want to convert to speech.
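If you are unsure which voice suits your project, one option is to render the same sentence in every voice and compare them by ear. The sketch below mirrors the call above. The function name render_voice_samples and the output folder are our own choices, and the voice list is an assumption based on the six built-in voices, so check the current documentation before relying on it.

```python
from pathlib import Path

# Assumed list of the six built-in voices; verify against the current docs.
VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]


def render_voice_samples(client, text, out_dir="voiceover/samples", model="tts-1"):
    """Generate one MP3 per voice and return the list of file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for voice in VOICES:
        # Same call as in the article, with only the voice changing.
        response = client.audio.speech.create(model=model, voice=voice, input=text)
        target = out_dir / f"sample_{voice}.mp3"
        response.stream_to_file(target)
        paths.append(target)
    return paths
```

Here client is the OpenAI() instance created earlier. Each call is billed separately, so a short test sentence is enough.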

Saving the Audio File

response.stream_to_file(speech_file_path)

This line takes the audio stream from the API response and saves it directly to the file path we specified earlier.
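Depending on your version of the openai package, calling stream_to_file on a plain response may produce a deprecation warning. The alternative shown in OpenAI’s documentation is the with_streaming_response wrapper. Here is a sketch under that assumption; the function name save_speech and its defaults are ours.

```python
from pathlib import Path


def save_speech(client, text, out_path, model="tts-1-hd", voice="fable"):
    """Stream synthesized audio to out_path and return the path."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # The context manager keeps the HTTP response open while the
    # audio chunks are written to disk, then closes it afterwards.
    with client.audio.speech.with_streaming_response.create(
        model=model, voice=voice, input=text
    ) as response:
        response.stream_to_file(out_path)
    return out_path
```

With the client from earlier, a call such as save_speech(client, "Hello!", "voiceover/speech3.mp3") would replace the two separate steps of creating the response and saving it.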

And there you have it! With just a few lines of code, we’ve created a script that can generate professional-quality voice-overs using OpenAI’s Text-to-Speech API. The flexibility of OpenAI’s TTS system shines here, offering six built-in voices and support for over 50 languages, ensuring your content can reach a global audience. This technology opens up a world of possibilities for content creators, educators, and developers.

Navigating the Challenges

Every coding journey comes with its challenges, and you might encounter some errors along the way. In our experience, switching from the tts-1 model to the tts-1-hd model resolved many issues and led to the successful generation of high-quality audio files. It also pays to double-check the names in your code, such as model identifiers, variable names, and file paths, since a small mismatch there is a common source of errors.
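If you want the script to try the other model on its own when one fails, a simple fallback loop is one way to do it. This is a sketch rather than a recommended pattern: the helper name synthesize_with_fallback is hypothetical, and it catches every exception for brevity, which you would want to narrow in real code.

```python
def synthesize_with_fallback(synthesize, models=("tts-1", "tts-1-hd")):
    """Call synthesize(model) for each model in order.

    Returns (model, result) for the first call that succeeds and
    raises RuntimeError if every model fails.
    """
    errors = []
    for model in models:
        try:
            return model, synthesize(model)
        except Exception as exc:  # broad on purpose for this sketch
            errors.append(f"{model}: {exc}")
    raise RuntimeError("All models failed: " + "; ".join(errors))
```

Here synthesize is any function of your own that takes a model name and performs the API call, for example a small wrapper around client.audio.speech.create.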

Next Steps: Organizing Your Content

Once your audio files are ready, it’s time to take the project to the next level. The magic happens when you blend your freshly minted voice-overs with captivating visuals and a well-crafted script. That’s coming up in the next part.