Whisper Lesson 3 – Subtitle Master: Building a Subtitle Generator & Embedder

Welcome back to part 3, where we’ll use Whisper to build another really cool app. In this part, we’ll look at how to work with video files. After all, many of the practical applications of speech recognition don’t come in convenient MP3 files, but rather in video files. We’ll be building a subtitle generator and embedder, which will take a video file as input, transcribe it, and then embed the subtitles into the video file itself, feeding the result back to the end user.

Before we can get started on the main code, we will need to write some utilities again, just like in the previous part. The utilities we’ll need this time are:

  • Subtitles -> We can just reuse the subtitle-to-disk utility from the previous part. (Done✔️)
  • Video -> We will need a way to convert a video file to an mp3 file so that we can feed it to Whisper.
  • Commands -> We will need a way to run commands on the command line, as there are multiple ffmpeg commands we’ll need to run both for the video conversion and the subtitle embedding.

So let’s get started with the command utility. Inside the utils folder, first create a new file named command.py:

📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
    📁test_audio_files
    📁utils
        📄__init__.py
        📄podcast.py
        📄subtitles.py
        📄command.py   (✨new file)
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄2_whisper_pods.py
    📄settings.py
    📄.env

Then inside the command.py file let’s start with our imports:

import datetime
import subprocess
from pathlib import Path

We’re going to run commands and provide some very basic logging as well. We imported the datetime module so we can add timestamps to our logs, and pathlib should be familiar by now. The subprocess module in Python is used to spawn new processes, connect to their input/output/error pipes, and obtain their return codes. It allows you to execute system commands and interact with them programmatically. It’s basically a bit like opening a terminal window inside your Python code.
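If subprocess is new to you, here is a tiny standalone example of the idea, separate from the utility we’re about to build (python --version is just an arbitrary demo command):

import subprocess

# Run a command and capture what it prints, instead of letting it go to the console
result = subprocess.run(["python", "--version"], capture_output=True, text=True)
print(result.returncode)  # 0 means the command succeeded
print(result.stdout)      # e.g. "Python 3.11.4" (very old versions print this to stderr instead)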

Next, we’ll start with an extremely simple function that will print a message but in blue letters:

def print_blue(message: str) -> None:
    print(f"\033[94m{message}\033[00m")

The \033[94m and \033[00m are ANSI escape codes, which are used to add color and formatting to text in terminal output. The 94 is the code for blue, and the 00 is the code for reset. You can find a list of all the codes here: https://en.wikipedia.org/wiki/ANSI_escape_code#Colors. We will print the commands we execute to the terminal in blue, which helps them stand out from the other white text output and makes it easier for us to check our commands.
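Feel free to experiment with other codes, for example 91 for red and 92 for green (the exact shades depend on your terminal):

# Same trick with different ANSI color codes; 00 always resets to the default color
print("\033[91mThis prints in red\033[00m")
print("\033[92mAnd this in green\033[00m")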

Running system commands

Next, we’ll create a function that will run a command like you would run on the command line:

def run_and_log(command: str, log_directory: Path) -> None:
    print_blue(f"Running command: \n{command}")
    with open(log_directory / "commands_log.txt", "a+", encoding="utf-8") as file:
        subprocess.call(
            command,
            stdout=file,
            stderr=file,
            shell=True,  # have the system shell parse the command string, just like a real terminal
        )
        file.write(
            f"\nRan command: {command}\nDate/time: {datetime.datetime.now()}\n\n\n\n"
        )

We create a function called run_and_log, which takes two arguments: command which is a string, and log_directory which is a Path and indicates the directory where we want to save the log file. We then print the command we’re about to execute in blue, and then open the log file in append mode. The a+ means that we will append to the file if it exists, and create it if it doesn’t. Again, we use the encoding="utf-8" argument to make sure that we can write non-ASCII characters to the file as well. If you do not do this you will eventually run into trouble.

Inside the with open context manager, so while the file is open, we call the subprocess.call function. This function takes a command as input and executes it, so as the first argument we pass the command variable. The second argument is stdout=file, which means that we will write the output of the command to the file (instead of the console). The third argument is stderr=file, which means that we will write any errors to the file as well. Finally, shell=True tells subprocess to hand the string over to the system shell to parse, exactly as if you had typed it into a terminal yourself; without it, subprocess would treat the entire string as the name of a single executable, and shell built-ins like echo wouldn’t work at all. So we basically execute the command and whatever output there is gets logged inside the text file.

After that, we write what command we executed and a timestamp to the file, and use a couple of \n to add some newlines to the file so that the next command will be lower down, making them easy to distinguish from each other.

Now let’s run a quick test, using the extremely simple terminal command echo 'hello', which will simply print hello to the console. Let’s run this command and see if our function works:

run_and_log("echo 'hello'", Path.cwd())

For the path we’ve used the Path.cwd() method from Python’s pathlib module, which returns the current working directory as a Path object; this is the terminal’s current directory when you run the script. (This is just for a quick test, so we don’t want to go through the trouble of importing the base directory in here.)

Go ahead and run the command.py file, and whatever directory your terminal was in when you ran the script should now have a file named commands_log.txt with the following inside:

hello

Ran command: echo 'hello'
Date/time: 2024-01-14 12:13:49.535692

It worked! We’ve successfully logged the hello output, followed by our own logging information with the command executed and a timestamp. Make sure you remove or comment out the run_and_log line before we continue, as we don’t want to run this command every time we run the script.

# run_and_log("echo 'hello'", Path.cwd())

A peculiar issue with slashes

With our run_and_log function completed, we have just one more function to create in here. There is a small discrepancy in path formats: for the system commands, ffmpeg expects file paths in a different format than the one our Python code gives us, so we need to write a short utility to fix the path. This issue only occurs with the subtitle path when trying to embed the subtitles using ffmpeg system commands, and I’m honestly not sure why it occurs, but this is the type of thing you will run into during your software development journey.

If you keep looking you’ll always find a solution, so never despair! But this time I’ll save you the search and tell you about the issue ahead of time:

  • The path C:\Users\dirk\test/subtitle.vtt will not work in the command and will give errors, as it gets mangled along the way and can then no longer be parsed as a valid path.
  • What we need is C\:\\Users\\dirk\\test\\subtitle.vtt instead. Notice there is an extra \ after the C and after every \ in the path. The first \ is an escape character, which means that the second \ is not interpreted as a special character but as a literal \.
  • This issue only affects the subtitle path and not the input or output video paths, so we only need to fix the subtitle path.

Below the run_and_log function inside the command.py file, add a new function:

def format_ffmpeg_filepath(path: Path) -> str:
    """Turns C:\Users\dirk\test/subtitle.vtt into C\:\\Users\\dirk\\test\\subtitle.vtt"""
    string_path = str(path)
    return string_path.replace("\\", "\\\\").replace("/", "\\\\").replace(":", "\\:")

We take a Path as input, and then first convert it to a string so we can use string methods on it to fix the format. We then use the replace method to replace all the \ with \\ and all the / with \\, and we also replace the : with \:. (Note the r prefix on the docstring, by the way: without it, Python would try to read the \U in C:\Users as the start of a unicode escape sequence and refuse to even parse the file.) Now I see you looking mighty confused! Why so many slashes? Well, remember the first \ is the escape character, so that the second slash is interpreted not as an escape character but as a literal slash string-character.

  • So in order to replace \ we need to target it using \\, as we need the escape character to indicate we mean the literal \ string-character; a single \ won’t work, as it would be interpreted as the start of an escape sequence.
  • Likewise, to replace it with \\ we need to write \\\\, as each slash we want in the output needs another slash to escape it, so that every second slash is interpreted as a literal slash string-character.
  • So the above function just means that \ is replaced by \\, / is replaced by \\, and : is replaced by \:. It just looks so confusing because of all the extra escape characters, which also happen to be slashes! Phew🤯. If you want proof that it works, see the quick check below.
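Here’s that quick sanity check, a throwaway snippet you could paste at the bottom of command.py, run once, and then delete (the path is just our example from above):

test_path = Path(r"C:\Users\dirk\test\subtitle.vtt")  # r-string so the backslashes survive
print(format_ffmpeg_filepath(test_path))
# Should print: C\:\\Users\\dirk\\test\\subtitle.vtt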

Video utility functions

Okay so with that out of the way, go ahead and save and close the command.py file. It’s time for our video utility file next, so create a new file called video.py inside the utils folder:

    📁FINX_WHISPER (project root folder)
        📁output_temp_files
        📁output_video
        📁styles
        📁test_audio_files
        📁utils
            📄__init__.py
            📄podcast.py
            📄subtitles.py
            📄command.py
            📄video.py   (✨new file)
        📄1_basic_call_english_only.py
        📄1_multiple_languages.py
        📄2_whisper_pods.py
        📄settings.py
        📄.env

Don’t worry, this one won’t be so bad 😀! Open up your new video.py file and let’s start with our imports:

from pathlib import Path
from . import command

All we need is Path for input argument type-hinting and the command module we just created. Next, we’ll create a function that will convert a video file to an mp3 file so it can be fed to Whisper:

def to_mp3(
    input_video: str, log_directory: Path, output_path: Path, mono: bool = False
) -> str:
    output_path_string = str(output_path)

    channels = 1 if mono else 2
    bitrate = 80 if mono else 192

    command_to_run = f'ffmpeg -i "{input_video}" -vn -ar 44100 -ac {channels} -b:a {bitrate}k "{output_path_string}"'
    command.run_and_log(command_to_run, log_directory)
    print(f"Video converted to mp3 and saved to {output_path_string}")

    return output_path_string

We define a function named to_mp3 which takes an input_video as a string, a log_directory as a Path, an output_path as a Path, and a mono option as a boolean. The function returns a string holding the output path. The input_video path is a string because gradio will feed it to us, which is why it is not a Path object like the log_directory and output_path. Make sure you always keep track of what type all your variables are, or you will eventually run into trouble by passing in a Path object where a string is expected, or vice versa.

First, we get a string version of the output_path and save it in output_path_string. Then we check if the mono option is set to True or False, and set the channels and bitrate variables accordingly. If mono is True we set channels to 1 and bitrate to 80, and if mono is False we set channels to 2 and bitrate to 192. We won’t actually need this mono option until part 4, but we might as well add it now.

Then we get to the command, first preparing it in a variable named command_to_run. We use the ffmpeg command and pass in the input_video as the input file (-i). We then use the -vn option to disable video recording, the -ar option to set the audio sampling frequency to 44100 Hz, the -ac option to set the number of audio channels to channels, and the -b:a option to set the audio bitrate to bitrate kbps. We then pass in the output_path_string as the output file location.
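To make that concrete: for a hypothetical upload named My Video.mp4 with mono left at False, and an output path inside our temp folder, the rendered command string would look something like this (the <uuid> part is a placeholder):

ffmpeg -i "My Video.mp4" -vn -ar 44100 -ac 2 -b:a 192k "output_temp_files/My Video_<uuid>.mp3"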

Notice that the command is contained inside an f-string which has single quotes on the outside (f'command'). Make sure you imitate this perfectly, using the single quotes on the outside and the double quotes around the variable names of "{input_video}" and "{output_path_string}". We need these double quotes because the user input video file is likely to have spaces in the name, and not having double quotes around a name with spaces inside will cause the command to fail.

Then we call the run_and_log function from our command module, passing in the command and the directory we want to log to, printing a message to the console, and returning the output_path_string.

That completes our video.py file, go ahead and save and close it. We’re ready to start on the main code now!

Subtitle Master – Putting it all together

In your root folder, create a new file named 3_subtitle_master.py:

📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
    📁test_audio_files
    📁utils
        📄__init__.py
        📄podcast.py
        📄subtitles.py
        📄command.py
        📄video.py
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄2_whisper_pods.py
    📄3_subtitle_master.py   (✨new file)
    📄settings.py
    📄.env

Inside, let’s start with our imports:

import os
import uuid

import gradio as gr
import whisper
from whisper.utils import WriteVTT

from settings import BASE_DIR, OUTPUT_TEMP_DIR, OUTPUT_VIDEO_DIR, STYLES_DIR
from utils import command, subtitles, video

We import os to do some filename splitting, and all the other imports are familiar from previous parts. To finish up we import several directories from our settings file and the command, subtitles, and video modules from our utils folder, reusing the subtitles module from the previous part.

Next up are our constants for the file:

MODEL = whisper.load_model("base.en")
VTT_WRITER = WriteVTT(output_dir=str(OUTPUT_TEMP_DIR))

We just load up a model; I’ll start with base.en as it will probably be good enough to get started. Then we instantiate a WriteVTT object like we did last time, indicating we want to save the subtitles in the temp directory.
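By the way, if base.en later turns out not to be accurate enough for your videos, upgrading is a one-line change, at the cost of slower transcription:

MODEL = whisper.load_model("small.en")  # or "medium.en" for more accuracy but less speed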

As we are going to be returning a video to the end user this time, I would like to include the original video name in the output file, though we’ll still need a uuid as well to guarantee unique names (the user might upload the same file twice!). So let’s create a quick function that gets us a unique project name. Say the user inputs a file named my_video.mp4; we want the function to return my_video_0f646333-0464-43a1-a75c-ed57c47fbcd5, so that we basically have a uuid with the filename in front of it. We can then add .mp3 or .srt or whatever file extension we need at the end, making sure all the files for this project share the same unique project name.

def get_unique_project_name(input_video: str) -> str:
    """Get a unique subtitle-master project name to avoid file-name clashes."""
    unique_id = uuid.uuid4()
    filename = os.path.basename(input_video)
    base_fname, _ = os.path.splitext(filename)
    return f"{base_fname}_{unique_id}"

The function takes the input path as a string and then generates a uuid. We then get the filename using os.path.basename, which takes a path like C:\Users\dirk\test\my_video.mp4 and returns my_video.mp4. We then use os.path.splitext to split the filename into a base filename and an extension, so my_video.mp4 becomes my_video and .mp4. We catch the base name as base_fname and the extension under the variable name _ as we don’t need it. We then return the base filename with the uuid appended to it.
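Here’s a quick hypothetical example of what the function returns, using the example from before (the uuid will of course be different on every call):

# On Windows, os.path.basename handles the backslashes just fine
print(get_unique_project_name(r"C:\Users\dirk\test\my_video.mp4"))
# e.g. my_video_0f646333-0464-43a1-a75c-ed57c47fbcd5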

Now let’s get started on our main function below that will tie it all together:

def main(input_video: str) -> str:
    """Takes a video file as string path and returns a video file with subtitles embedded as string path."""
    unique_project_name = get_unique_project_name(input_video)
    get_temp_output_path = lambda ext: OUTPUT_TEMP_DIR / f"{unique_project_name}{ext}"
    mp3_file = video.to_mp3(
        input_video,
        log_directory=BASE_DIR,
        output_path=get_temp_output_path(".mp3"),
    )

We’ll take an input video, which gradio will pass to our main function as a string path, and the function will return a string path pointing to the processed video file with embedded subtitles, which goes back to gradio. First, we get a unique project name using the function we just wrote. Then we create a simple lambda function like the one we had in part 2. It takes an extension like .mp3 as input and returns output_dir/project_name.mp3, as we’ll need temporary files for both our .mp3 and our .vtt, and this way we only have one place to change if we ever need to change the output directory.

Then we call the to_mp3 function from our video module, passing in the input video, the project’s base directory as the log directory, and the output path as the get_temp_output_path lambda function with .mp3 as the extension. We save the return of the function as the variable named mp3_file.

Continuing on:

def main(input_video: str) -> str:
    ...previous code...

    whisper_output = MODEL.transcribe(mp3_file, beam_size=5)
    vtt_subs = subtitles.write_to_file(
        whisper_output,
        writer=VTT_WRITER,
        output_path=get_temp_output_path(".vtt"),
    )

We call the transcribe method on our MODEL object, which has an instance of Whisper, passing in the mp3_file as the input file, and setting the beam_size to 5. We then call the write_to_file function from our subtitles module, passing in the whisper_output as the transcript, the VTT_WRITER as the writer, and the get_temp_output_path lambda function with .vtt as the extension as the output path.

So what is this beam_size parameter? Well, it’s one of a number of possible parameters we can pass into the transcribe method. The beam_size parameter is the number of beams to use in the beam search: it refers to the number of different potential paths that are explored, from which the most likely one is chosen. The higher the number, the more accurate the transcription will be, but the slower it will be as well. The whisper command-line tool defaults to 5, and I’ve found that this is a good balance between speed and accuracy; the only reason I’ve passed it in explicitly here is to make you aware of these parameters. Here are some of the other possible parameters:

  • temperature -> The higher the temperature, the more likely the model is to pick a less probable prediction. You can think of it in a similar way as the temperature setting you get with ChatGPT calls. The default is 0, which simply always returns the most likely predictions only; 0 is what we have been using so far.
  • beam_size -> The number of beams to use in the beam search. We just discussed this one above. It is only applicable when the temperature is set to 0, and the command-line tool defaults it to 5.
  • best_of -> Selects between multiple random samples; only for use with a nonzero temperature, and it will generate more diverse (and possibly wrong) samples.
  • task -> Either transcribe or translate. We’ve used this one before and it defaults to transcribe.
  • language -> The language spoken in the audio. Defaults to None, in which case Whisper performs a language detection first.
  • device -> The device to use for inference. This one is actually passed to whisper.load_model rather than transcribe, and it defaults to cuda if you have a CUDA-enabled GPU, otherwise cpu.
  • verbose -> Whether to print out the progress and debug messages; the command-line tool defaults this to True.

And there are more. For general use, you’ll probably do fine with the defaults most of the time, but be aware that you can tweak these parameters to get better results if you need to.
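Just as an illustration (the values here are arbitrary, not recommendations), a call combining a few of these parameters could look like this, reusing the mp3_file from our main function:

# Hypothetical example: sample at a nonzero temperature, keep the best of 3
# candidate transcriptions, and suppress the progress output.
whisper_output = MODEL.transcribe(mp3_file, temperature=0.4, best_of=3, verbose=False)
print(whisper_output["text"])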

Back to our code, let’s continue:

def main(input_video: str) -> str:
    ...previous code...

    vtt_string_path = command.format_ffmpeg_filepath(vtt_subs)
    output_video_path = OUTPUT_VIDEO_DIR / f"{unique_project_name}_subs.mp4"
    embed_subs_into_vid_command = f'ffmpeg -i "{input_video}" -vf "subtitles=\'{vtt_string_path}\'" "{output_video_path}"'

    command.run_and_log(embed_subs_into_vid_command, log_directory=BASE_DIR)

    return str(output_video_path)

We need to run another ffmpeg system command to embed the subtitles we have created into our video file. We first get the vtt_string_path by passing the vtt_subs path we already have into that crazy function with all the \\\\ backslashes we called format_ffmpeg_filepath, remember? After that, we save our desired output video path in a variable by just combining our OUTPUT_VIDEO_DIR with the unique_project_name and pasting _subs.mp4 at the end for good measure.

Now we prepare the ffmpeg command we’re about to run in a separate variable for readability. We use the input_video as the input file (-i), and then use the -vf option to add a video filter. The video filter we use is subtitles and we pass in the vtt_string_path as the subtitle file. We then pass in the output_video_path as the output file.

Notice again that the whole command is inside single quotes ', inside of which we have path variables in double quotes " to avoid trouble if there are spaces in the filename. But we have to pass in "subtitles='{vtt_string_path}'", which requires yet another level of quoting, and going back to plain single quotes ' would close the f-string we opened at the start, so we have to escape them using the backslash \' instead.
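Rendered out with the example escaped path from earlier and a hypothetical project name, the final command string would look roughly like this:

ffmpeg -i "My Video.mp4" -vf "subtitles='C\:\\Users\\dirk\\test\\subtitle.vtt'" "output_video/My Video_<uuid>_subs.mp4"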

Then we call the run_and_log function from our command module, passing in the command we just wrote, and the BASE_DIR as the log directory. We then return the output_video_path as a string, as gradio doesn’t want a Path object.

The whole main function now looks like this:

def main(input_video: str) -> str:
    """Takes a video file as string path and returns a video file with subtitles embedded as string path."""
    unique_project_name = get_unique_project_name(input_video)
    get_temp_output_path = lambda ext: OUTPUT_TEMP_DIR / f"{unique_project_name}{ext}"
    mp3_file = video.to_mp3(
        input_video,
        log_directory=BASE_DIR,
        output_path=get_temp_output_path(".mp3"),
    )

    whisper_output = MODEL.transcribe(mp3_file, beam_size=5)
    vtt_subs = subtitles.write_to_file(
        whisper_output,
        writer=VTT_WRITER,
        output_path=get_temp_output_path(".vtt"),
    )

    vtt_string_path = command.format_ffmpeg_filepath(vtt_subs)
    output_video_path = OUTPUT_VIDEO_DIR / f"{unique_project_name}_subs.mp4"
    embed_subs_into_vid_command = f'ffmpeg -i "{input_video}" -vf "subtitles=\'{vtt_string_path}\'" "{output_video_path}"'

    command.run_and_log(embed_subs_into_vid_command, log_directory=BASE_DIR)

    return str(output_video_path)

Building the interface

Now all we need to do to run this is create another gradio interface. As you are already familiar with gradio by now, we’ll go through this one a bit more quickly; the principles are the same as last time. Below your main function, continue with:

if __name__ == "__main__":
    block = gr.Blocks(
        css=str(STYLES_DIR / "subtitle_master.css"),
        theme=gr.themes.Soft(primary_hue=gr.themes.colors.emerald),
    )

    with block:
        with gr.Group():
            gr.HTML(
                f"""
                <div class="header">
                <img src="https://i.imgur.com/dxHMfCI.png" referrerpolicy="no-referrer" />
                </div>
                """
            )
            with gr.Row():
                input_video = gr.Video(
                    label="Input Video", sources=["upload"], mirror_webcam=False
                )
                output_video = gr.Video()
            with gr.Row():
                button_text = "🎞️ Subtitle my video! 🎞️"
                btn = gr.Button(value=button_text, elem_classes=["button-row"])

            btn.click(main, inputs=[input_video], outputs=[output_video])

    block.launch(debug=True)

We use the if __name__ == "__main__": guard to make sure that the code inside only runs when we run the file directly. We create the gradio block object just like we did before, passing in a css file that doesn’t exist yet, but this time we also pass in a theme. I’ll pass in the gr.themes.Soft() which has a bit of a different style to it, and set the accent color to emerald by passing in primary_hue=gr.themes.colors.emerald when calling Soft(). This will match nicely with the logo I have prepared for you with this application.

Then we open the block object using the with statement, and open up a new Group inside of it, just like we did before, so we can build our block interface. The HTML object is the same as in the last part, except I changed the image link URL to give you a new logo for this app. Then we open up a new Row and add a Video object for the input video, passing in sources=["upload"] so that the user uploads a video file rather than recording one with their webcam, and setting mirror_webcam=False (this flag only controls whether webcam footage gets mirrored, so it’s really just a belt-and-braces setting since we only allow uploads anyway). Still on the same Row, so next to the input video, we declare another Video object for the output video file.

We then have a row that only has a button for which we provide a text and a class of button-row so we can target it with CSS. The btn.click declaration is a lot simpler this time as we just call the main function with only a single input of input_video and only one output of output_video. Finally, we call .launch on the block just like last time.

That’s our code done! You’re probably dying to run it, but wait! We have to create a quick CSS file to finish it off. Create a new file named subtitle_master.css inside the styles folder:

📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
        📄subtitle_master.css   (✨new file)
        📄whisper_pods.css
    📁test_audio_files
    📁utils
        📄__init__.py
        📄podcast.py
        📄subtitles.py
        📄command.py
        📄video.py
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄2_whisper_pods.py
    📄3_subtitle_master.py
    📄settings.py
    📄.env

Inside we’ll just write some quick CSS styles:

.header {
  padding: 2em 8em;
}

.header,
.button-row {
  background-color: #1d366f7e;
}

We just gave the header class some padding to stop the logo image from being too large and then gave both the header and button-row classes a background color of #1d366f7e which is a nice dark blue half-transparent color. Save and close the file, and we’re ready to run! Go ahead and run the 3_subtitle_master.py file, and give it some time to load. Click the link in your terminal window again to open the interface in your browser, and you should see something like this:

Yours won’t show Korean in the input video box like mine does; the interface text follows whatever language your computer is set to. Go ahead and upload a video file, wait a second for it to load, and then press the subtitle my video button. This may take quite a while if you’re not on the fastest system with a powerful GPU, but you’ll see the commands and steps being executed in your terminal window just like we set up. Eventually, you’ll see the output video appear with the subtitles embedded, each one perfectly in time with the video, and you can play it back and download it!

You can check the commands_log.txt file in the root directory to see all the commands that were run, the output_temp_files folder to see the temporary files that were created during the process, and the output_video folder to see the final output video file. If you need some extra quality, load a larger model like small.en or medium.en.

Conclusion

That’s pretty awesome! An automatic subtitler that will subtitle any video for you all on its own. You could build on this by accepting YouTube links, or by adding translation functionality so you can have English subtitles on foreign-language videos, which could be cool for language learning. Just make sure you don’t use a .en model if you want to work with other languages, obviously.

To make a real production-grade application, use a proper front-end framework and add some kind of progress indicator, or stream the live transcription to the page to stop the user from getting bored, or allow them to do something else while the file processes in the background. A production app would also have to run on a server with good processing power and a GPU.

That’s it for part 3, I’ll see you soon in part 4 where we’ll look at ways to speed up Whisper or outsource the processing using the OpenAI API endpoint in the cloud. We’ll also build one more app using the cloud API to round off the series. See you there soon!