Welcome back to part 3, where we’ll use Whisper to build another really cool app. In this part, we’ll look at how to work with video files. After all, many of the practical applications of speech recognition don’t come in convenient MP3 files, but rather in video files. We’ll be building a subtitle generator and embedder, which will take a video file as input, transcribe it, and then embed the subtitles into the video file itself, feeding the result back to the end user.
Before we can get started on the main code, we will need to write some utilities again, just like in the previous part. The utilities we’ll need this time are:
- Subtitles -> We can just reuse the subtitle-to-disk utility from the previous part. (Done✔️)
- Video -> We will need a way to convert a video file to an mp3 file so that we can feed it to Whisper.
- Commands -> We will need a way to run commands on the command line, as there are multiple ffmpeg commands we’ll need to run, both for the video conversion and the subtitle embedding.
So let’s get started with the command utility. Inside the `utils` folder, first create a new file named `command.py`:
```
📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
    📁test_audio_files
    📁utils
        📄__init__.py
        📄podcast.py
        📄subtitles.py
        📄command.py (✨new file)
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄2_whisper_pods.py
    📄settings.py
    📄.env
```
Then inside the `command.py` file let’s start with our imports:

```python
import datetime
import subprocess
from pathlib import Path
```
We’re going to run commands and provide some very basic logging as well. We imported the `datetime` module so we can add timestamps to our logs, and `pathlib` should be familiar by now. The `subprocess` module in Python is used to spawn new processes, connect to their input/output/error pipes, and obtain their return codes. It allows you to execute system commands and interact with them programmatically. It’s basically a bit like opening a terminal window inside your Python code.
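As a minimal sketch of what that looks like (not yet our utility; `shell=True` just lets us pass the command as one string on any platform):

```python
import subprocess

# Run a simple command and capture its exit code.
exit_code = subprocess.call("echo hello", shell=True)
print(exit_code)  # 0 means the command succeeded
```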
Next, we’ll start with an extremely simple function that will print a message in blue letters:

```python
def print_blue(message: str) -> None:
    print(f"\033[94m{message}\033[00m")
```
The `\033[94m` and `\033[00m` are ANSI escape codes, which are used to add color and formatting to text in terminal output. The `94` is the code for blue, and the `00` is the code for reset. You can find a list of all the codes here: https://en.wikipedia.org/wiki/ANSI_escape_code#Colors. We will print the commands we execute to the terminal in blue, which helps them stand out from the other white text output and makes it easier for us to check our commands.
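For reference, a couple of other foreground colors from that linked table follow the exact same pattern; only the number changes:

```python
def print_red(message: str) -> None:
    print(f"\033[91m{message}\033[00m")  # 91 = bright red

def print_green(message: str) -> None:
    print(f"\033[92m{message}\033[00m")  # 92 = bright green
```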
Running system commands
Next, we’ll create a function that will run a command like you would run on the command line:
```python
def run_and_log(command: str, log_directory: Path) -> None:
    print_blue(f"Running command: \n{command}")
    with open(log_directory / "commands_log.txt", "a+", encoding="utf-8") as file:
        subprocess.call(
            command,
            stdout=file,
            stderr=file,
        )
        file.write(
            f"\nRan command: {command}\nDate/time: {datetime.datetime.now()}\n\n\n\n"
        )
```
We create a function called `run_and_log`, which takes two arguments: `command`, which is a string, and `log_directory`, which is a `Path` indicating the directory where we want to save the log file. We then print the command we’re about to execute in blue, and open the log file in append mode. The `a+` mode means that we will append to the file if it exists, and create it if it doesn’t. Again, we use the `encoding="utf-8"` argument to make sure that we can write non-ASCII characters to the file as well. If you do not do this you will eventually run into trouble.
Inside the `with open` context manager, so while the file is open, we call the `subprocess.call` function. This function takes a command as input and executes it, so as the first argument we pass the `command` variable. The second argument is `stdout=file`, which means that we will write the output of the command to the file (instead of the console). The third argument is `stderr=file`, which means that any errors will be written to the file as well. So we basically execute the command and whatever output there is gets logged inside the text file. (One caveat: passing the command as a single string like this works on Windows; on macOS or Linux, `subprocess.call` expects a list of arguments unless you also pass `shell=True`.)
After that, we write what command we executed and a timestamp to the file, and use a couple of `\n` characters to add some newlines so that the next command will be lower down, making entries easy to distinguish from each other.
Now let’s run a quick test, using the extremely simple terminal command `echo 'hello'`, which will simply print `hello` to the console. Let’s run this command and see if our function works:

```python
run_and_log("echo 'hello'", Path.cwd())
```
For the path we’ve used the `Path.cwd()` method from Python’s `pathlib` module, which returns the current working directory as a `Path` object. This is the terminal’s current directory when you run the script. (This is just for a quick test; we don’t want to go through the trouble of importing the base directory in here.)
Go ahead and run the `command.py` file, and whatever directory your terminal was in when you ran the script should now have a file named `commands_log.txt` with the following inside:

```
hello

Ran command: echo 'hello'
Date/time: 2024-01-14 12:13:49.535692
```
It worked! We’ve successfully logged the output `hello` followed by our logging information of the time and command executed. Make sure you remove or comment out the `run_and_log` line before we continue, as we don’t want to run this command every time we run the script:

```python
# run_and_log("echo 'hello'", Path.cwd())
```
A peculiar issue with slashes
With our `run_and_log` function completed, we have just one more function to create in here. There is a small discrepancy in file paths: ffmpeg expects a different format in system commands than our Python code will give us, so we need to write a short utility to fix the path. This issue only occurs with the subtitle path when embedding the subtitles using ffmpeg system commands, and I’m honestly not sure why it occurs, but this is the type of thing you will run into during your software development journey. If you keep looking you’ll always find a solution, so never despair, but I’ll save you the trouble this time and tell you about the issue ahead of time!
- The path `C:\Users\dirk\test/subtitle.vtt` will not work in the command and will give errors, as it gets mangled and can then no longer be parsed as a valid path.
- What we need instead is `C\:\\Users\\dirk\\test\\subtitle.vtt`. Notice there is an extra `\` after the `C` and after every `\` in the path. The first `\` is an escape character, which means that the second `\` is not interpreted as a special character but as a literal `\`.
- This issue only affects the subtitle path and not the input or output video paths, so we only need to fix the subtitle path.
Below the `run_and_log` function inside the `command.py` file, add a new function:

```python
def format_ffmpeg_filepath(path: Path) -> str:
    r"""Turns C:\Users\dirk\test/subtitle.vtt into C\:\\Users\\dirk\\test\\subtitle.vtt"""
    string_path = str(path)
    return string_path.replace("\\", "\\\\").replace("/", "\\\\").replace(":", "\\:")
```

(Note the `r` prefix on the docstring: without it, Python would trip over the `\U` in the example path as an invalid escape sequence.)
We take a `Path` as input, and then first convert it to a string so we can use string methods on it to fix the format. We then use the `replace` method to replace every `\` with `\\` and every `/` with `\\`. We also replace the `:` with `\:`. Now I see you looking mighty confused! Why so many slashes? Well, remember: the first `\` is the escape character, so that the second slash is interpreted not as a special character but as a literal slash character.
- So in order to replace `\` we need to target it using `\\`, as we need the escape character to indicate we want the literal `\` character; a single `\` won’t work, as it would be interpreted as the start of an escape sequence.
- Likewise, to replace it with `\\` we need to type `\\\\`, as each slash we want needs a slash to escape it, so that every second slash is interpreted as a literal slash character.
- So the above function just means that `\` is replaced by `\\`, `/` is replaced by `\\`, and `:` is replaced by `\:`. It only looks so confusing because of all the extra escape characters, which also happen to be slashes! Phew🤯. (A quick sanity check follows below.)
Video utility functions
Okay, so with that out of the way, go ahead and save and close the `command.py` file. It’s time for our `video` utility file next, so create a new file called `video.py` inside the `utils` folder:
```
📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
    📁test_audio_files
    📁utils
        📄__init__.py
        📄podcast.py
        📄subtitles.py
        📄command.py
        📄video.py (✨new file)
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄2_whisper_pods.py
    📄settings.py
    📄.env
```
Don’t worry, this one won’t be so bad 😀! Open up your new `video.py` file and let’s start with our imports:

```python
from pathlib import Path

from . import command
```
All we need is `Path` for input-argument type hinting and the `command` module we just created. Next, we’ll create a function that converts a video file to an mp3 file so it can be fed to Whisper:
```python
def to_mp3(
    input_video: str, log_directory: Path, output_path: Path, mono: bool = False
) -> str:
    output_path_string = str(output_path)
    channels = 1 if mono else 2
    bitrate = 80 if mono else 192
    command_to_run = f'ffmpeg -i "{input_video}" -vn -ar 44100 -ac {channels} -b:a {bitrate}k "{output_path_string}"'
    command.run_and_log(command_to_run, log_directory)
    print(f"Video converted to mp3 and saved to {output_path_string}")
    return output_path_string
```
We define a function named `to_mp3`, which takes an `input_video` as a string, a `log_directory` as a `Path`, an `output_path` as a `Path`, and a `mono` option as a boolean. The function returns a string holding the output path. The `input_video` path is a string because gradio will feed it to us, which is why it is not a `Path` object like `log_directory` and `output_path`. Make sure you always keep track of each variable’s type, or you will eventually run into trouble by passing a Path object where a string is expected, or vice versa.
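As a tiny illustration of why the distinction matters (a sketch, not part of our app), the `/` join operator only exists on `Path` objects, not on plain strings:

```python
from pathlib import Path

audio = Path("output_temp_files") / "audio.mp3"  # fine: Path supports "/"
print(str(audio))  # convert back to a string where one is expected

# "output_temp_files" / "audio.mp3"  # TypeError: "/" is not defined for two strings
```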
First, we get a string version of `output_path` and save it in `output_path_string`. Then we check whether the `mono` option is `True` or `False` and set the `channels` and `bitrate` variables accordingly: 1 channel at 80 kbps for mono, 2 channels at 192 kbps for stereo. We won’t actually need this mono option until part 4, but we might as well add it now.
Then we get to the command, first preparing it in a variable named `command_to_run`. We use the `ffmpeg` command and pass in the `input_video` as the input file (`-i`). We then use the `-vn` option to disable video recording, the `-ar` option to set the audio sampling frequency to 44100 Hz, the `-ac` option to set the number of audio channels to `channels`, and the `-b:a` option to set the audio bitrate to `bitrate` kbps. Finally, we pass in `output_path_string` as the output file location.
Notice that the command is contained inside an f-string with single quotes on the outside (`f'command'`). Make sure you imitate this exactly, using single quotes on the outside and double quotes around the variables `"{input_video}"` and `"{output_path_string}"`. We need these double quotes because the user’s input video file is likely to have spaces in its name, and a name with spaces that isn’t wrapped in double quotes will cause the command to fail.
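To make the quoting concrete, here is roughly what the rendered command would look like for a hypothetical input file with spaces in its name (all paths invented for illustration):

```
ffmpeg -i "C:\Users\dirk\my holiday video.mp4" -vn -ar 44100 -ac 2 -b:a 192k "C:\FINX_WHISPER\output_temp_files\my holiday video_<uuid>.mp3"
```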
Then we call the `run_and_log` function from our `command` module, passing in the command and the directory we want to log to, print a message to the console, and return `output_path_string`.
That completes our `video.py` file; go ahead and save and close it. We’re ready to start on the main code now!
Subtitle Master – Putting it all together
In your root folder, create a new file named `3_subtitle_master.py`:

```
📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
    📁test_audio_files
    📁utils
        📄__init__.py
        📄podcast.py
        📄subtitles.py
        📄command.py
        📄video.py
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄2_whisper_pods.py
    📄3_subtitle_master.py (✨new file)
    📄settings.py
    📄.env
```
Inside, let’s start with our imports:
```python
import os
import uuid

import gradio as gr
import whisper
from whisper.utils import WriteVTT

from settings import BASE_DIR, OUTPUT_TEMP_DIR, OUTPUT_VIDEO_DIR, STYLES_DIR
from utils import command, subtitles, video
```
We import `os` to do some filename splitting, and all the other imports are familiar from previous parts. To finish up, we import several directories from our `settings` file and the `command`, `subtitles`, and `video` modules from our `utils` folder, reusing the `subtitles` module from the previous part.
Next up are our constants for the file:
```python
MODEL = whisper.load_model("base.en")
VTT_WRITER = WriteVTT(output_dir=str(OUTPUT_TEMP_DIR))
```
We just load up a model; I’ll start with `base.en` as it will probably be good enough to get started. Then we instantiate a `WriteVTT` object like we did last time, indicating we want to save the subtitles in the temp directory.

As we are going to return a video to the end user this time, I would like to include the original video name in the output file, though we’ll still need a uuid as well to guarantee unique names (the user might upload the same file twice!). So let’s create a quick function that gets us a unique project name. Say the user inputs a file named `my_video.mp4`; we want the function to return `my_video_0f646333-0464-43a1-a75c-ed57c47fbcd5`, so that we basically have a uuid with the filename in front of it. We can then add `.mp3` or `.srt` or whatever file extension we need at the end, making sure all the files for this project share the same unique project name.
```python
def get_unique_project_name(input_video: str) -> str:
    """Get a unique subtitle-master project name to avoid file-name clashes."""
    unique_id = uuid.uuid4()
    filename = os.path.basename(input_video)
    base_fname, _ = os.path.splitext(filename)
    return f"{base_fname}_{unique_id}"
```
The function takes the input path as a string and then generates a `uuid`. We get the filename using `os.path.basename`, which takes a path like `C:\Users\dirk\test\my_video.mp4` and returns `my_video.mp4`. We then use `os.path.splitext` to split the filename into a base filename and an extension, so `my_video.mp4` becomes `my_video` and `.mp4`. We catch the base name as `base_fname` and the extension under the variable name `_`, as we don’t need it. Finally, we return the base filename with the uuid appended to it.
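A quick hypothetical usage example (the Windows-style path is made up, and the uuid will of course differ every run):

```python
# Run on Windows, where os.path.basename splits on the backslashes:
print(get_unique_project_name(r"C:\Users\dirk\test\my_video.mp4"))
# e.g. my_video_0f646333-0464-43a1-a75c-ed57c47fbcd5
```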
Now let’s get started on our main function below that will tie it all together:
```python
def main(input_video: str) -> str:
    """Takes a video file as string path and returns a video file with subtitles embedded as string path."""
    unique_project_name = get_unique_project_name(input_video)
    get_temp_output_path = lambda ext: OUTPUT_TEMP_DIR / f"{unique_project_name}{ext}"
    mp3_file = video.to_mp3(
        input_video,
        log_directory=BASE_DIR,
        output_path=get_temp_output_path(".mp3"),
    )
```
We’ll take an input video, which gradio will pass to our main function as a string path. The function will return a string path pointing towards the processed video file with embedded subtitles back to gradio. First, we get a unique project name using the function we just wrote. Then we create a simple lambda function like the one we had in part 2. It takes an extension like `.mp3` as input and returns `output_dir/project_name.mp3`. We’ll need temporary files for both our `.mp3` and our `.vtt`, and this way we only have one place to change if we ever need to change the output directory.
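To make that concrete, assuming a hypothetical project name of `my_video_<uuid>`, the lambda resolves like this:

```python
# Hypothetical values for illustration only:
get_temp_output_path(".mp3")  # -> OUTPUT_TEMP_DIR / "my_video_<uuid>.mp3"
get_temp_output_path(".vtt")  # -> OUTPUT_TEMP_DIR / "my_video_<uuid>.vtt"
```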
Then we call the `to_mp3` function from our `video` module, passing in the input video, the project’s base directory as the log directory, and the result of our `get_temp_output_path` lambda with `.mp3` as the extension as the output path. We save the function’s return value in a variable named `mp3_file`.
Continuing on:
```python
def main(input_video: str) -> str:
    # ...previous code...
    whisper_output = MODEL.transcribe(mp3_file, beam_size=5)
    vtt_subs = subtitles.write_to_file(
        whisper_output,
        writer=VTT_WRITER,
        output_path=get_temp_output_path(".vtt"),
    )
```
We call the `transcribe` method on our `MODEL` object, which holds an instance of Whisper, passing in `mp3_file` as the input file and setting `beam_size` to `5`. We then call the `write_to_file` function from our `subtitles` module, passing in `whisper_output` as the transcript, `VTT_WRITER` as the writer, and our `get_temp_output_path` lambda with `.vtt` as the extension as the output path.
So what is this `beam_size` parameter? Well, it’s one of a number of possible parameters we can pass into the `transcribe` method. The `beam_size` parameter is the number of beams to use in the beam search: the higher the number, the more accurate the transcription will be, but the slower it will be as well. It basically refers to the number of different potential paths that will be explored, from which the most likely one is chosen. Whisper’s command-line tool defaults to `5`, and I’ve found that this is a good balance between speed and accuracy; the only reason I’ve passed it in explicitly here is to make you aware of these parameters. Here are some of the other possible parameters:
- `temperature` -> The higher the temperature, the more likely it is that the model will choose a less likely token. You can think of it in a similar way as the `temperature` setting you get with ChatGPT calls. The default of `0` will simply always return the most likely predictions only, and `0` is what we have been using so far.
- `beam_size` -> The number of beams to use in the beam search; we just discussed this one above. It is only applicable when the temperature is set to `0`.
- `best_of` -> Selects between multiple random samples; only for use with a nonzero temperature, and it will generate more diverse (and possibly wrong) samples.
- `task` -> Either `transcribe` or `translate`. We’ve used this one before, and it defaults to `transcribe`.
- `language` -> The language spoken in the audio. Defaults to `None`, which will perform a language detection first.
- `device` -> The device to use for inference. Strictly speaking, you pass this to `whisper.load_model` rather than to `transcribe`; it defaults to `cuda` if you have a CUDA-enabled GPU, otherwise `cpu`.
- `verbose` -> Whether to print out the progress and debug messages; the command-line tool defaults to `True`.
And there are more. For general use, you’ll probably do fine with the defaults most of the time, but be aware that you can tweak these parameters to get better results if you need to.
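As a quick, hedged illustration (parameter values here are just examples, not recommendations), an explicit call might look like this:

```python
# Example only: translate foreign-language audio to English text.
result = MODEL.transcribe(
    mp3_file,
    task="translate",  # "transcribe" (default) or "translate"
    beam_size=5,       # beams explored when temperature is 0
    temperature=0,     # deterministic, most-likely predictions only
    verbose=True,      # print segments/progress while decoding
)
print(result["text"])
```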
Back to our code, let’s continue:
```python
def main(input_video: str) -> str:
    # ...previous code...
    vtt_string_path = command.format_ffmpeg_filepath(vtt_subs)
    output_video_path = OUTPUT_VIDEO_DIR / f"{unique_project_name}_subs.mp4"
    embed_subs_into_vid_command = f'ffmpeg -i "{input_video}" -vf "subtitles=\'{vtt_string_path}\'" "{output_video_path}"'
    command.run_and_log(embed_subs_into_vid_command, log_directory=BASE_DIR)
    return str(output_video_path)
```
We need to run another `ffmpeg` system command to embed the subtitles we created into our video file. We first get `vtt_string_path` by passing the `vtt_subs` path we already have into that crazy function with all the `\\` backslashes, `format_ffmpeg_filepath`, remember? After that, we build our desired output video path by combining `OUTPUT_VIDEO_DIR` with the `unique_project_name` and pasting `_subs.mp4` at the end for good measure.
Now we prepare the `ffmpeg` command we’re about to run in a separate variable for readability. We use the `input_video` as the input file (`-i`), and then use the `-vf` option to add a video filter. The video filter we use is `subtitles`, and we pass in `vtt_string_path` as the subtitle file. We then pass in `output_video_path` as the output file.
Notice again that the whole command is inside single quotes (`'`), inside of which we have path variables in double quotes (`"`) to avoid trouble if there are spaces in the filename. But as we have to pass in `"subtitles='{vtt_string_path}'"`, which requires yet another level of quoting, going back to plain single quotes would cause trouble, as we already used those to open the string at the start, so we escape them as `\'` instead.
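For a hypothetical project, the rendered command would look something like this (paths invented for illustration; note the escaped single quotes around the reformatted subtitle path):

```
ffmpeg -i "C:\Users\dirk\my holiday video.mp4" -vf "subtitles='C\:\\FINX_WHISPER\\output_temp_files\\my holiday video_<uuid>.vtt'" "C:\FINX_WHISPER\output_video\my holiday video_<uuid>_subs.mp4"
```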
Then we call the `run_and_log` function from our `command` module, passing in the command we just wrote and `BASE_DIR` as the log directory. We then return `output_video_path` as a string, as gradio doesn’t want a `Path` object.
The whole `main` function now looks like this:

```python
def main(input_video: str) -> str:
    """Takes a video file as string path and returns a video file with subtitles embedded as string path."""
    unique_project_name = get_unique_project_name(input_video)
    get_temp_output_path = lambda ext: OUTPUT_TEMP_DIR / f"{unique_project_name}{ext}"
    mp3_file = video.to_mp3(
        input_video,
        log_directory=BASE_DIR,
        output_path=get_temp_output_path(".mp3"),
    )
    whisper_output = MODEL.transcribe(mp3_file, beam_size=5)
    vtt_subs = subtitles.write_to_file(
        whisper_output,
        writer=VTT_WRITER,
        output_path=get_temp_output_path(".vtt"),
    )
    vtt_string_path = command.format_ffmpeg_filepath(vtt_subs)
    output_video_path = OUTPUT_VIDEO_DIR / f"{unique_project_name}_subs.mp4"
    embed_subs_into_vid_command = f'ffmpeg -i "{input_video}" -vf "subtitles=\'{vtt_string_path}\'" "{output_video_path}"'
    command.run_and_log(embed_subs_into_vid_command, log_directory=BASE_DIR)
    return str(output_video_path)
```
Building the interface
Now all we need to do to run this is create another gradio interface. As you’re already familiar with gradio by now, we’ll go through this one a bit more quickly; the principles are the same as last time. Below your main function, continue with:
```python
if __name__ == "__main__":
    block = gr.Blocks(
        css=str(STYLES_DIR / "subtitle_master.css"),
        theme=gr.themes.Soft(primary_hue=gr.themes.colors.emerald),
    )
    with block:
        with gr.Group():
            gr.HTML(
                f"""
                <div class="header">
                    <img src="https://i.imgur.com/dxHMfCI.png" referrerpolicy="no-referrer" />
                </div>
                """
            )
            with gr.Row():
                input_video = gr.Video(
                    label="Input Video", sources=["upload"], mirror_webcam=False
                )
                output_video = gr.Video()
            with gr.Row():
                button_text = "🎞️ Subtitle my video! 🎞️"
                btn = gr.Button(value=button_text, elem_classes=["button-row"])
                btn.click(main, inputs=[input_video], outputs=[output_video])

    block.launch(debug=True)
```
We use the `if __name__ == "__main__":` guard to make sure that the code inside only runs when we run the file directly. We create the gradio `block` object just like we did before, passing in a `css` file that doesn’t exist yet, but this time we also pass in a `theme`. I’ll pass in `gr.themes.Soft()`, which has a bit of a different style to it, and set the accent color to emerald by passing in `primary_hue=gr.themes.colors.emerald` when calling `Soft()`. This will match nicely with the logo I have prepared for you for this application.
Then we open the `block` object using the with statement, and open up a new `Group` inside of it, just like we did before, so we can build our block interface. The HTML object is the same as in the last part, except I changed the image link URL to give you a new logo for this app. Then we open up a new `Row` and add a `Video` component for the input video, passing in `sources=["upload"]` so that the user can upload a video file, and setting `mirror_webcam=False` as we don’t want to take the user’s webcam as input. Still on the same `Row`, so next to the input video, we declare another `Video` component for the output video file.
We then have a row that only holds a button, for which we provide a text and a class of `button-row` so we can target it with CSS. The `btn.click` declaration is a lot simpler this time, as we just call the `main` function with a single input of `input_video` and a single output of `output_video`. Finally, we call `.launch` on the block, just like last time.
That’s our code done! You’re probably dying to run it, but wait! We have to create a quick CSS file to finish it off. Create a new file named `subtitle_master.css` inside the `styles` folder:

```
📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
        📄subtitle_master.css (✨new file)
        📄whisper_pods.css
    📁test_audio_files
    📁utils
        📄__init__.py
        📄podcast.py
        📄subtitles.py
        📄command.py
        📄video.py
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄2_whisper_pods.py
    📄3_subtitle_master.py
    📄settings.py
    📄.env
```
Inside we’ll just write some quick CSS styles:
```css
.header {
    padding: 2em 8em;
}

.header,
.button-row {
    background-color: #1d366f7e;
}
```
We just gave the `header` class some padding to stop the logo image from being too large, and then gave both the `header` and `button-row` classes a background color of `#1d366f7e`, which is a nice half-transparent dark blue. Save and close the file, and we’re ready to run! Go ahead and run the `3_subtitle_master.py` file, and give it some time to load. Click the link in your terminal window again to open the interface in your browser, and you should see something like this:
Yours won’t have Korean in the input video box though; it will show whatever language your computer is set to. Go ahead and upload a video file, wait a second for it to load, and then press the `🎞️ Subtitle my video! 🎞️` button. This may take quite a while if you’re not on a fast system with a powerful GPU, but you’ll see the commands and steps being executed in your terminal window just like we set up. Eventually, the output video will appear with the subtitles embedded, each one perfectly in time with the video, and you can play it back and download it!
You can check the `commands_log.txt` file in the root directory to see all the commands that were run, the `output_temp_files` folder to see the temporary files created during the process, and the `output_video` folder to see the final output video file. If you need some extra quality, load a larger model like `small.en` or `medium.en`.
Conclusion
That’s pretty awesome! An automatic subtitler that will subtitle any video for you all on its own. You could build on this by accepting YouTube links or adding translation functionality so you can have English subtitles on foreign-language videos, which could be cool for language learning. Just make sure you don’t use an `.en` model if you want to work with other languages, obviously.
To make a real production-grade application, use a front-end framework and add some kind of progress indicator, or stream the live transcription to the page to stop the user from getting bored, or allow them to do something else while the file processes in the background. A production app would also have to run on a server with good processing power and a GPU.
That’s it for part 3, I’ll see you soon in part 4 where we’ll look at ways to speed up Whisper or outsource the processing using the OpenAI API endpoint in the cloud. We’ll also build one more app using the cloud API to round off the series. See you there soon!