Whisper Lesson 4 – Speeding Up or Outsourcing the Processing

Hi and welcome back! In this part, we’re going to look at some alternatives to speed things up or to outsource the processing to OpenAI’s servers altogether. First, we’ll look at faster-whisper at a basic level. If you’re not sure whether you want to use it, you can just watch this part and decide later whether to install it, as we’re only going to cover it quickly before moving on to the web API version for the rest of this part.

So what is faster-whisper? faster-whisper is a quicker reimplementation of OpenAI’s Whisper speech-to-text model. Because OpenAI released the Whisper model as open source, others have naturally been able to build on it and optimize it further. It uses CTranslate2, a fast inference engine for Transformer models, and is up to 4 times faster than the original openai/whisper while using considerably less memory and claiming to maintain the same accuracy. You can find the GitHub repository here.

You can use it with the same apps we have built so far, just as a faster version of the Whisper model, so we won’t be building a new app specifically for it, as that would get repetitive and I don’t want to waste your time! You only need a few syntax changes to make your app work with faster-whisper instead of the original Whisper model. So we’ll take a look at the basics of faster-whisper, let you decide if you want to use/implement it, and then move on to the web API version.

Installing faster-whisper

Note: If you do not plan on using faster-whisper or are not quite sure, there is no point in going through the install procedures, and you can skip ahead a couple of minutes to the web-API version, or just watch/read along and decide later if you want to use it.

Basically, to install faster-whisper you just have to run the following command in your terminal:

pip install faster-whisper

To support GPU execution you also need the appropriate CUDA libraries installed, namely cuBLAS and cuDNN. This can be the slightly trickier part of the install, and again I cannot really give you platform-specific instructions or help you with the specific troubleshooting if you run into challenges. As always in software development, if you’re lucky you won’t have any problems, and if you’re not, you’ll spend some time on Google and Stack Overflow finding the solution. If you just want to run faster-whisper on your CPU, which will of course be slower but may not be a big deal for small-scale development on your own machine, you can skip the cuBLAS and cuDNN installs.
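
If you want a quick way to check whether your CUDA setup works, a minimal sketch like the one below tries the GPU first and falls back to the CPU. Keep in mind that a missing cuDNN sometimes only surfaces once you actually transcribe something, so treat this as a rough check only:

# Rough CUDA sanity check (sketch): try the GPU, fall back to CPU if the libraries are missing.
from faster_whisper import WhisperModel

try:
    model = WhisperModel("tiny", device="cuda", compute_type="float16")
    print("GPU model loaded, cuBLAS/cuDNN look OK.")
except Exception as error:  # missing CUDA libraries typically raise here
    print(f"GPU load failed ({error}), falling back to CPU.")
    model = WhisperModel("tiny", device="cpu", compute_type="int8")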

Using faster-whisper

So let’s give it a spin to see how it works! First create a new file in your project root directory called 4_faster_whisper.py:

📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
    📁test_audio_files
    📁utils
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄2_whisper_pods.py
    📄3_subtitle_master.py
    📄4_faster_whisper.py   (✨new file)
    📄settings.py
    📄.env

And inside let’s start with our imports:

from faster_whisper import WhisperModel
from settings import TEST_AUDIO_DIR

model_size = "small"

We import the WhisperModel class from the faster_whisper package, and the TEST_AUDIO_DIR variable from our settings.py file, and then set a string variable to the value small. Like whisper, faster-whisper also comes with different sizes of models. Using the same naming convention we have tiny.en, base.en, small.en, and medium.en as our English-only models. For the multi-language models, we can choose between tiny, base, small, medium, or one of several versions of the full-size model, namely: large-v1, large-v2, large-v3, or large.

Next, we’ll create a new instance of the WhisperModel class, picking only one of the two options below:

model = WhisperModel(model_size, device="cpu", compute_type="int8")
# Choose only one of these, depending on if you're running on CPU or GPU (cuda). (I'll be using the second option)
model = WhisperModel(model_size, device="cuda", compute_type="float16")

More options are available, like running on cuda using int8_float16 or even using float32, see here for more details.
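
For reference, a couple of those alternative combinations would look like this (a sketch, same WhisperModel call with a different compute_type):

# Other combinations mentioned in the faster-whisper docs (pick what matches your hardware):
# model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")  # quantized weights on GPU
# model = WhisperModel(model_size, device="cpu", compute_type="float32")  # full precision on CPU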

The .transcribe method for faster-whisper is slightly different:

segments, info = model.transcribe(
    str(TEST_AUDIO_DIR / "dutch_long_repeat_file.mp3"),
    beam_size=5,
)

As you can see, we get two return values when calling model.transcribe instead of the single dictionary output we had before. The first is a sequence of segments containing the transcription (more on its exact type in a moment). The second is a NamedTuple (a Tuple with named fields) which allows us to access information like the language (info.language), language probability (info.language_probability), etc. So let’s add some print statements to print the information and then the transcription itself to the console:

print(f"Detected language '{info.language}' with probability {info.language_probability}")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

The first print statement just accesses some of the properties of the info object we discussed. The second print statement loops over the segments, and for each segment it prints the segment’s start time, end time, and text. The :.2f is a format specifier that tells Python to print the number with two decimal places, for example 1.23 instead of 1.23456789.

One interesting thing to note here, though, is that segments is not actually a list. It is a generator, which is a different type of iterable. What this means is that the segments are generated when you request them and not beforehand. In other words, the transcription only begins when we iterate over the segments and not before. Calling .transcribe() on our model did not start the transcription the way vanilla Whisper did. You can either loop over the segments as we did above, or convert the generator to a list with list(segments).
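
For example, if you’d rather wait for the full result instead of streaming it, something like this sketch (reusing the model and TEST_AUDIO_DIR from above) forces the whole transcription to run up front:

segments, info = model.transcribe(
    str(TEST_AUDIO_DIR / "dutch_long_repeat_file.mp3"),
    beam_size=5,
)
all_segments = list(segments)  # the actual transcription happens here, not at .transcribe()
full_text = " ".join(segment.text.strip() for segment in all_segments)
print(full_text)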

One of the nice things about this generator is that we can very easily see the live transcription and print it to the console while it is still generating, which is exactly what this code will do. So let’s run it and see what happens:

Estimating duration from bitrate, this may be inaccurate
Detected language 'nl' with probability 0.931703
[0.00s -> 3.04s]  Hoi allemaal, dit is weer een testbestandje.
[3.04s -> 6.88s]  Deze keer om te testen of de Nederlandse taal goed herkent gaat worden.
[6.88s -> 12.68s]  Hierna kunnen we ook proberen deze tekst te laten vertalen naar het Engels om te zien hoe goed dat gaat.
[12.68s -> 13.88s]  Ik ben benieuwd.
[13.88s -> 16.84s]  Hoi allemaal, dit is weer een testbestandje.
[16.84s -> 20.72s]  Deze keer om te testen of de Nederlandse taal goed herkent gaat worden.
[20.72s -> 26.48s]  Hierna kunnen we ook proberen deze tekst te laten vertalen naar het Engels om te zien hoe goed dat gaat.
[26.48s -> 27.68s]  Ik ben benieuwd.
[27.68s -> 30.72s]  Hoi allemaal, dit is weer een testbestandje.
[30.72s -> 34.60s]  Deze keer om te testen of de Nederlandse taal goed herkent gaat worden.
[34.60s -> 40.36s]  Hierna kunnen we ook proberen deze tekst te laten vertalen naar het Engels om te zien hoe goed dat gaat.
[40.36s -> 41.52s]  Ik ben benieuwd.

You can see the output streaming to the console as the model transcribes. Unless you’re running on CPU, you will also notice a pretty good speed. Now, as you’re probably not Dutch, I’ll just tell you the transcription above is perfect except for the one small (herkent/herkend) issue we had before, but as you know this can be fixed by loading a larger model size.

Play around with any audio file you want and see what model size you need. If you use English files, pick a .en model for greater efficiency. Also be aware that you can pass options into the .transcribe method much like with the vanilla Whisper model, for instance:

segments, info = model.transcribe(
    str(TEST_AUDIO_DIR / "dutch_long_repeat_file.mp3"),
    beam_size=5,
    word_timestamps=True,  # add this line to get word-level timestamps
    # without_timestamps=True,  # uncomment this line to get rid of timestamps and just transcribe
)
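
With word_timestamps=True each segment should also carry a .words list you can loop over; here is a quick sketch, reusing the segments from the call above:

# Word-level output (sketch): each word has its own start and end time.
for segment in segments:
    for word in segment.words:
        print(f"[{word.start:.2f}s -> {word.end:.2f}s] {word.word}")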

In conclusion, faster-whisper is a nice optimization to look into if you’re considering deploying this model in a production application somewhere. There are also other optimized versions of the whisper model out there that you can check out, like distil-whisper. Play around and see which gives you the best trade-offs between speed and accuracy. I’ll leave the rest up to you as we move on from faster-whisper to check out the web-API version.

Web-API version

Another option we have is to simply not deploy the model anywhere but outsource this to OpenAI’s fast servers. This is kind of like making a ChatGPT call except we request a transcription instead of a chat completion. The OpenAI servers are also very optimized for machine-learning calculations (obviously) and as you’ll see they are therefore quite fast!

So let’s take a look at the pricing first. The cost for using the Whisper API is $0.006 per minute transcribed, rounded to the nearest second. This means a 20-minute video would cost you $0.12. This is a good solution if you don’t want to deploy the model yourself, perhaps your application will only be used occasionally and it’s simply not worth it to invest that much into having a model running somewhere. For a high-use application dealing with longer files and many users, this is not the way to go though.
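
As a quick sanity check on that math:

# Rough cost estimate at $0.006 per transcribed minute:
minutes = 20
print(f"A {minutes}-minute video costs about ${minutes * 0.006:.2f}")  # $0.12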

So let’s take a quick look at how this would work practically, by building one last quick application, but this time using the web API. Our application will take any video in any language as input and will return a short quiz with questions about the video. First, create a new file in your utils folder named openai_api.py:

📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
    📁test_audio_files
    📁utils
        📄command.py
        📄openai_api.py   (✨new file)
        📄podcast.py
        📄subtitles.py
        📄video.py
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄2_whisper_pods.py
    📄3_subtitle_master.py
    📄4_faster_whisper.py
    📄settings.py
    📄.env

Inside openai_api.py, let’s start with our imports and some basic setup:

import typing
from pathlib import Path

from decouple import config
from openai import OpenAI


CLIENT = OpenAI(api_key=str(config("OPENAI_API_KEY")))
MODEL = "whisper-1"

ResponseFormat = typing.Literal["text", "srt", "vtt"]

We’ll use typing to define our allowed response formats. The rest are all imports we have used before: config to load our API key, and OpenAI to call the APIs for Whisper and ChatGPT. We create our CLIENT just like last time and save the MODEL name in a string variable; whisper-1 is the only option for the Whisper API for now.

Finally, we define a type alias named ResponseFormat as a Literal type, which means it can only be one of the three strings we have defined: text, srt, or vtt. We can use this as a type hint later to indicate that a variable of type ResponseFormat must hold one of these three values and nothing else. (json and verbose_json are also possible if you prefer JSON object output, but we will be skipping them as we don’t need them for our purposes.)

Now we’ll define our transcription utility function:

def transcribe(
    file: Path,
    language: str | None = None,
    translate: bool = False,
    response_format: ResponseFormat = "text",
) -> str:

    print("Transcribing file...")
    options = {
        "file": file,
        "model": MODEL,
        "response_format": response_format,
    }

    if translate:
        transcript = CLIENT.audio.translations.create(**options)
    else:
        if language:
            options["language"] = language
        transcript = CLIENT.audio.transcriptions.create(**options)

    if not isinstance(transcript, str):
        raise TypeError(
            f"Expected a string value to be returned, but got {type(transcript)} instead."
        )
    print(f"Transcription successful:\n{transcript[:100]}...")

    return transcript

We define a function called transcribe which takes a file of type Path, a language of type str or None (defaulting to None, in which case the API will try to detect the language automatically), a translate boolean which defaults to False, and a response_format which has to be of type ResponseFormat, so one of the three values we defined in the type alias, defaulting to text. The function returns a string.

We print a message to indicate the transcription is starting and then create a dictionary named options holding the options needed for both a translation and a transcription call, the shared options if you will: the file, model, and response_format. If the user requests a translation, we call the CLIENT.audio.translations.create method, passing in the **options dictionary as keyword arguments as-is. If translate is False, it must be a transcription. For transcriptions, we can add the language key to the options dictionary to specify the language, but if the user didn’t provide one we leave it out and the call will just take a bit longer to do the auto-detection. This time we call the CLIENT.audio.transcriptions.create method, again passing in the **options dictionary, which now optionally contains the language key.

Finally, we check if the transcript is a string, and if not we raise a TypeError to indicate something went wrong, just to make sure the user is not requesting JSON from this endpoint, which is possible and would crash the rest of our code. Otherwise, we print a message to indicate the transcription was successful and return the transcript.
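
To make the call shapes concrete, here is a sketch of how you could try it out (the file name is just an example from our test audio folder, and each call costs a few cents of API credit):

# Example usage (sketch):
# transcript = transcribe(Path("test_audio_files/dutch_long_repeat_file.mp3"), language="nl")
# english_srt = transcribe(Path("test_audio_files/dutch_long_repeat_file.mp3"), translate=True, response_format="srt")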

Video to Quiz

As we’re going to be building a video-to-quiz app, we need one more utility function inside this openai_api.py file, which will take a transcript and generate some questions for us. Continue below the transcribe function:

PROMPT_SETUP = """You are a text-to-quiz app. The user will provide you a video transcription in textual format. You will generate a list of questions for the user to answer about this video. Depending on the length of the transcription, stick to a maximum of 5 questions. All questions should be solely about the video transcription content provided by the user and should be answerable by reading the transcription. Do not provide the answers, but only the questions. The transcription the user provides is based on a video, and may include timestamps, please ignore these timestamps and just treat it as one single transcription containing all the content in the video.
List and number each item on a separate line.
"""

from tenacity import retry, stop_after_attempt, stop_after_delay

First, we define a constant to hold the prompt setup instruction for ChatGPT. Just go ahead and copy mine; it’s a fairly basic setup that asks for questions related to the video so we can make a quiz tailor-made for the input video. We also import retry, stop_after_attempt, and stop_after_delay from the tenacity package. (Go ahead and move the tenacity import line to the top of your file with the other imports instead of leaving it here in the middle.) We can use these to make our code a bit more robust when calling APIs or taking actions that do not have a 100% success rate. It’s fairly easy to use, and I just want to show you that this tool is out there; you’ll see how it works in a second.

Let’s code up the function:

def text_to_quiz(text: str) -> str:
    print("Converting text to quiz...")
    messages = [
        {"role": "system", "content": PROMPT_SETUP},
        {"role": "user", "content": text},
    ]
    result = CLIENT.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=messages,
    )
    content = result.choices[0].message.content
    if content is None:  # Just a quick sanity check
        raise ValueError("There was an error while trying to generate the quiz.")
    print("Text to quiz conversion completed.")
    return content

Our function takes a string, which is the transcription, and returns a string as output. We create a list of messages, the first being the system message holding our PROMPT_SETUP, and the second being the user message with the transcription as its content. We then call the CLIENT.chat.completions.create method, passing in the model and messages as arguments. We’ll use gpt-3.5-turbo-1106, which is the newest gpt-3.5 model out there and frankly good enough. You can use gpt-4, but make sure you consider the cost; it is considerably more expensive and not really needed for this use case. If you’re worried about the smaller maximum input size, or ‘context window’, of gpt-3.5, know that its 16k-token context can easily handle long video transcriptions; most are not as long as you might think.

We then access the content of the first choice’s message in the result object, which should hold our quiz. We do a quick sanity check to make sure we received a valid response, and then print a message to indicate the conversion was successful and return the content.

So that’s pretty simple, right? But what if we get no content back? Do we really want to just raise an error and give up immediately? Let’s use the tenacity library so we can try again in case of a failure. The only thing we have to change is to add the @retry decorator above our function definition; everything else stays exactly the same:

@retry(stop=stop_after_attempt(3) | stop_after_delay(60))
def text_to_quiz(text: str) -> str:
    print("Converting text to quiz...")
    messages = [
        {"role": "system", "content": PROMPT_SETUP},
        {"role": "user", "content": text},
    ]
    result = CLIENT.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=messages,
    )
    content = result.choices[0].message.content
    if content is None:  # Just a quick sanity check
        raise ValueError("There was an error while trying to generate the quiz.")
    print("Text to quiz conversion completed.")
    return content

And just like that, our function is set up to retry until it has made three attempts or (|) 60 seconds have passed, whichever comes first, just in case the API call fails for some reason. Notice how easy the tenacity library is to use. This is not required, but it’s a nice way to make your code more robust just in case.
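
If you want to be a bit gentler on the API, tenacity can also pause between attempts; a small sketch of that variant:

# Optional variant (sketch): wait 2 seconds between attempts.
# from tenacity import retry, stop_after_attempt, stop_after_delay, wait_fixed
# @retry(stop=stop_after_attempt(3) | stop_after_delay(60), wait=wait_fixed(2))
# def text_to_quiz(text: str) -> str:
#     ...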

Putting it all together

That’s our openai_api.py file done! Go ahead and save and close it. Now let’s create a new file in our project root directory called 4_vid_to_quiz.py to put it all together:

📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
    📁test_audio_files
    📁utils
        📄command.py
        📄openai_api.py
        📄podcast.py
        📄subtitles.py
        📄video.py
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄2_whisper_pods.py
    📄3_subtitle_master.py
    📄4_faster_whisper.py
    📄4_vid_to_quiz.py   (✨new file)
    📄settings.py
    📄.env

Inside 4_vid_to_quiz.py let’s start with our imports:

import os
import uuid
from pathlib import Path

import gradio as gr

from settings import BASE_DIR, OUTPUT_TEMP_DIR, STYLES_DIR
from utils import openai_api, video


API_UPLOAD_LIMIT_BYTES = 26214400  # 25 MB (25 * 1024 * 1024 bytes)

We will use os to check the size of the file we’re going to upload, as the API has a size limit. The rest are imports you’ve seen before, plus some of our directories from the settings file and our openai_api and video utilities. We also define a constant API_UPLOAD_LIMIT_BYTES, the maximum file size we can upload to the API, which is 25 MB.

Let’s start with a quick function to check if the file is not too big:

def check_upload_size(input_file: str) -> None:
    """Check the video file size is within the API upload limit."""
    input_file_size = os.path.getsize(input_file)
    if input_file_size > API_UPLOAD_LIMIT_BYTES:
        raise ValueError(
            f"File size of {input_file_size} bytes ({input_file_size / 1024 / 1024:.2f} MB) exceeds the API upload limit of {API_UPLOAD_LIMIT_BYTES} bytes ({API_UPLOAD_LIMIT_BYTES / 1024 / 1024:.2f} MB). Please use a shorter video or lower the audio quality settings."
        )

We take an input file path as a string and use os.path.getsize to get the size of the file in bytes, then check whether it is larger than our API_UPLOAD_LIMIT_BYTES. If it is, we raise a ValueError to indicate the file is too large; the error message includes both the file size and the API upload limit. That’s all there is to this function.

Let’s move on to our main function:

def main(input_video: str) -> str:
    """Takes a video file as string path and returns a quiz as string."""
    unique_id = uuid.uuid4()

    mp3_file = video.to_mp3(
        input_video,
        log_directory=BASE_DIR,
        output_path=OUTPUT_TEMP_DIR / f"{unique_id}.mp3",
        mono=True,
    )

    check_upload_size(mp3_file)
    transcription = openai_api.transcribe(
        Path(mp3_file), language="en", translate=False, response_format="text"
    )

    quiz = openai_api.text_to_quiz(transcription)
    return quiz

This is the function the gradio button will call when clicked. It takes an input_video as string input and will return the quiz in string format. We don’t really care about the name of the mp3 file we’ll extract from the video here so we just use a uuid to make it unique. Now we use our video.to_mp3 utility function from the previous part to extract the audio from the video.

We pass in the input_video as the video file, our project root directory as the log_directory, and our output_path is the OUTPUT_TEMP_DIR with the uuid and .mp3 extension pasted on. Finally, this is the time to use the mono option we built into the to_mp3 function but didn’t use last time. So far the size of our files has not been that important, but now that we have a web API it suddenly becomes relevant.

Whisper down-mixes audio to mono before processing anyway, and the API has an upload limit of roughly 25MB per transcription request. So we can save a lot of space by dropping the channels to 1, from stereo to mono audio, which allows us to make much longer requests as we can drastically lower the bitrate with only 1 audio channel.

Sending stereo audio at 192kbps would hit the upload limit after roughly 18 minutes of audio. We more than halved the bitrate to 80kbps, which is still considered decent quality for mono mp3 files and allows us to transcribe far longer files. You can also try playing with the other audio quality settings or lower the bitrate even further to 64kbps for mono if you want to go even further.
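
If you want to double-check those numbers, the back-of-the-envelope math looks like this:

# How much audio fits under the 25 MB limit at a given mp3 bitrate?
API_UPLOAD_LIMIT_BYTES = 25 * 1024 * 1024  # 26,214,400 bytes

for kbps in (192, 128, 80, 64):
    bytes_per_second = kbps * 1000 / 8  # kilobits per second -> bytes per second
    minutes = API_UPLOAD_LIMIT_BYTES / bytes_per_second / 60
    print(f"{kbps} kbps -> roughly {minutes:.0f} minutes")
# 192 kbps -> roughly 18 minutes, 80 kbps -> roughly 44 minutes, 64 kbps -> roughly 55 minutes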

After that, we run our check_upload_size check to make sure the file is not too large, and then we call our openai_api.transcribe function, passing in the mp3_file as the file, language="en" as the language, translate=False as we don’t want to translate, and response_format="text" as we want the transcription in text format. We then call our openai_api.text_to_quiz function, passing in the transcription as the text and returning the resulting quiz.

Gradio Interface

Finally, we’ll create our gradio interface:

if __name__ == "__main__":
    block = gr.Blocks(
        css=str(STYLES_DIR / "vid2quiz.css"),
        theme=gr.themes.Soft(primary_hue=gr.themes.colors.yellow),
    )

    with block:
        with gr.Group():
            gr.HTML(
                f"""
                <div class="header">
                <img src="https://i.imgur.com/oEtZKEh.png" referrerpolicy="no-referrer" class="header-img" />
                </div>
                """
            )
            with gr.Row():
                input_video = gr.Video(
                    label="Input Video", sources=["upload"], mirror_webcam=False
                )
                output_quiz_text = gr.Textbox(label="Quiz")
            with gr.Row():
                button_text = "📝 Make a quiz about this video! 📝"
                btn = gr.Button(value=button_text, elem_classes=["button-row"])

            btn.click(main, inputs=[input_video], outputs=[output_quiz_text])

    block.launch(debug=True)

All of this will be familiar by now. I just used a different CSS file, which we’ll have to create, and a slightly different primary_hue for the theme than last time. The Imgur image link has changed as well to give you a new header logo, and below that we just take an input video and have an output Textbox. Our button has a CSS class of button-row again so we can style it, and clicking the button runs the main function with the input video, sending the output to the output textbox.

Let’s add the CSS file to our styles folder:

📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
        📄subtitle_master.css
        📄vid2quiz.css      (✨new file)
        📄whisper_pods.css
    📁test_audio_files
    📁utils
        📄command.py
        📄openai_api.py
        📄podcast.py
        📄subtitles.py
        📄video.py
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄2_whisper_pods.py
    📄3_subtitle_master.py
    📄4_faster_whisper.py
    📄4_vid_to_quiz.py
    📄settings.py
    📄.env

And inside vid2quiz.css let’s add the following:

.header {
  display: flex;
  justify-content: center;
  align-items: center;
  padding: 2em 8em;
}

.header-img {
  max-width: 50%;
}

.header,
.button-row {
  background-color: #0c1d36;
}

We use flex to center the header image vertically and horizontally and apply the usual padding. We give the header-img class a max-width of 50% so it doesn’t take up the entire width of the screen. Finally, we give the header and button-row classes a background color of #0c1d36 which is a dark blue color.

Ok, you know the drill, let’s run it and see what happens!

Ok, looking good, so let’s upload a video and then request a quiz about it. I used a random video from YouTube, namely Hot Dr Pepper from the 1960s, just because it showed up when I opened the YouTube website. Let’s see how it does:

Perfect, exactly what we wanted, and this was all powered by the OpenAI API! You’ll also notice it was probably reasonably fast, considering it had to convert the whole video and then transcribe it and generate a quiz.

One important limitation of the app in this particular form is that it can handle videos only up to roughly 44 minutes in length (with the 80kbps mono settings), because of the upload limit. If you want to handle longer videos you could split them up and stitch the transcripts back together, but honestly, if you’re going to be handling files of that length, you’re probably better off deploying the model yourself to save costs, as the API is billed per minute of audio.
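
If you do want to go down that road, a rough sketch could look like the following. This assumes you add pydub to the project (pip install pydub, which also needs ffmpeg installed), which is not something we have used in this course:

from pathlib import Path

from pydub import AudioSegment  # assumption: pip install pydub (requires ffmpeg)

from utils import openai_api


def transcribe_long(mp3_path: Path, chunk_minutes: int = 20) -> str:
    """Split a long mp3 into chunks under the upload limit and join the transcripts."""
    audio = AudioSegment.from_mp3(mp3_path)
    chunk_ms = chunk_minutes * 60 * 1000
    transcripts = []
    for index, start in enumerate(range(0, len(audio), chunk_ms)):
        chunk_file = mp3_path.with_name(f"{mp3_path.stem}_part{index}.mp3")
        audio[start : start + chunk_ms].export(chunk_file, format="mp3", bitrate="80k")
        transcripts.append(openai_api.transcribe(chunk_file))
    return " ".join(transcripts)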

A fun idea: you can also use the translate option of our openai_api.transcribe function to take foreign-language videos as input and get English questions about them as output. This could be cool for a foreign language learning app or test.
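
In main() that would be a one-line change, something like this sketch:

# transcription = openai_api.transcribe(Path(mp3_file), translate=True, response_format="text")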

So that’s it for the whisper course. I hope you enjoyed it and now have a good idea of how to use Whisper, what you can use it for, and the various deployment options. The next step is up to you and limited only by your imagination!

As always, it was an honor and a pleasure to take this journey together, and I hope to see you next time!