Hi and welcome back! In this part, we're going to look at some alternatives: ways to speed things up locally, or outsourcing the processing to OpenAI's servers altogether. First, we'll look at faster-whisper at a basic level. If you're not sure whether you want to use it, you can just watch this part and decide later whether to install it, as we're only going to cover it quickly before moving on to the web API version for the rest of this part.
So what is faster-whisper? Faster-Whisper is a quicker version of OpenAI's Whisper speech-to-text model. Because OpenAI released the Whisper model as open source, others have naturally been able to build on it and optimize it further. It uses CTranslate2, a fast inference engine for Transformer models, and is up to 4 times faster than the original openai/whisper while using considerably less memory and claiming to maintain the same accuracy. You can find the GitHub repository here.
You can use it for the same apps we have built so far, just as a faster version of the Whisper model, so we won't be building a new app specifically for it, as that would get repetitive and I don't want to waste your time! You just need some syntax changes to make your app work with faster-whisper instead of the original Whisper model. So we'll take a look at the basics of faster-whisper, let you decide if you want to use it, and then move on to the web API version.
Installing faster-whisper
Note: If you do not plan on using faster-whisper or are not quite sure, there is no point in going through the install procedures, and you can skip ahead a couple of minutes to the web-API version, or just watch/read along and decide later if you want to use it.
Basically, to install faster-whisper you just have to run the following command in your terminal:
pip install faster-whisper
And to support GPU execution you need the appropriate CUDA libraries installed, namely cuBLAS and cuDNN. This can be the slightly trickier part of the install, and again I can't really give you platform-specific instructions or help with specific troubleshooting if you run into challenges. As always in software development, if you're lucky you won't have any problems, and if you're not, you'll spend some time on Google and Stack Overflow finding the solution. If you just want to run faster-whisper on your CPU, which will of course be slower but may not be a big deal for small-scale development on your own machine, you can skip the cuBLAS and cuDNN installs.
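If you're not sure whether your CUDA setup is working, here is a minimal sketch (my own suggestion, not part of the course files) that tries the GPU first and falls back to CPU. Depending on your setup, a missing library may only surface an error on the first transcription rather than at load time, so treat this as a rough check only:

from faster_whisper import WhisperModel

try:
    # Attempt to load on the GPU; this typically fails if CUDA/cuBLAS/cuDNN aren't usable
    model = WhisperModel("small", device="cuda", compute_type="float16")
    print("Model loaded on GPU")
except Exception as error:
    print(f"GPU not available ({error}), falling back to CPU")
    model = WhisperModel("small", device="cpu", compute_type="int8")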
Using faster-whisper
So let's give it a spin to see how it works! First create a new file in your project root directory called 4_faster_whisper.py:
📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
    📁test_audio_files
    📁utils
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄2_whisper_pods.py
    📄3_subtitle_master.py
    📄4_faster_whisper.py (✨new file)
    📄settings.py
    📄.env
And inside let’s start with our imports:
from faster_whisper import WhisperModel

from settings import TEST_AUDIO_DIR

model_size = "small"
We import the WhisperModel class from the faster_whisper package and the TEST_AUDIO_DIR variable from our settings.py file, and then set a string variable to the value small. Like whisper, faster-whisper also comes with different sizes of models. Using the same naming convention we have tiny.en, base.en, small.en, and medium.en as our English-only models. For the multi-language models, we can choose between tiny, base, small, medium, or one of several versions of the full-size model, namely: large-v1, large-v2, large-v3, or large.
Next, we'll create a new instance of the WhisperModel class, picking only one of the two options below:
# Choose only one of these, depending on whether you're running on CPU or GPU (cuda).
# (I'll be using the second option)
model = WhisperModel(model_size, device="cpu", compute_type="int8")
model = WhisperModel(model_size, device="cuda", compute_type="float16")
More options are available, like running on cuda using int8_float16 or even using float32; see here for more details.
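Just as an illustration (a sketch based on the documented compute types, so double-check against the repository for your installed version), those variants would look something like this, reusing the import and model_size from above:

# Alternative precision settings (pick one; shown here for illustration only)
model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")  # 8-bit weights with float16 compute on GPU
# model = WhisperModel(model_size, device="cpu", compute_type="float32")  # full 32-bit precision on CPU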
The .transcribe method for faster-whisper is slightly different:
segments, info = model.transcribe(
    str(TEST_AUDIO_DIR / "dutch_long_repeat_file.mp3"),
    beam_size=5,
)
As you can see, we get two return values when calling model.transcribe instead of the single dictionary output we had before. The first is a list of segments which contains the transcription. The second is a NamedTuple (a Tuple with named fields) which allows us to access information like the language (info.language), the language probability (info.language_probability), and so on. So let's add some print statements to print this information and then the transcription itself to the console:
print(f"Detected language '{info.language}' with probability {info.language_probability}") for segment in segments: print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
The first print statement just accesses some of the properties of the info object we discussed. The second print statement loops over the segments, and for each segment it prints the segment's start time, end time, and text. The :.2f is a format specifier that tells Python to print the number with two decimal places, for example 1.23 instead of 1.23456789.
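If you want to see that format specifier on its own, here is a tiny example you can try in a Python shell:

value = 1.23456789
print(f"{value:.2f}")  # -> 1.23
print(f"{value:.4f}")  # -> 1.2346 (it rounds rather than truncates)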
One interesting thing to note here, though, is that segments is not actually a list. It is a generator, which is a different type of iterable. What this means is that the segments are generated when you request them and not beforehand. In other words, the transcription only begins when we iterate over the segments and not before; calling .transcribe() on our model did not start the transcription the way vanilla Whisper did. You can either loop over the segments as we did above, or convert the generator to a list with list(segments).
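If you want to convince yourself of this lazy behavior, here is a small sketch (my own experiment, not part of the project files) that times how long materializing the generator takes compared to the near-instant .transcribe() call:

import time

from faster_whisper import WhisperModel
from settings import TEST_AUDIO_DIR

model = WhisperModel("small", device="cpu", compute_type="int8")

segments, info = model.transcribe(
    str(TEST_AUDIO_DIR / "dutch_long_repeat_file.mp3"), beam_size=5
)
print("transcribe() returned, but the segments haven't been transcribed yet")

start = time.perf_counter()
all_segments = list(segments)  # the actual transcription work happens here
elapsed = time.perf_counter() - start
print(f"Materializing the generator took {elapsed:.1f}s for {len(all_segments)} segments")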
One of the nice things about this generator is that we can easily watch the live transcription print to the console while it is still being generated, which is exactly what the print loop above will do. So let's run it and see what happens:
Estimating duration from bitrate, this may be inaccurate
Detected language 'nl' with probability 0.931703
[0.00s -> 3.04s] Hoi allemaal, dit is weer een testbestandje.
[3.04s -> 6.88s] Deze keer om te testen of de Nederlandse taal goed herkent gaat worden.
[6.88s -> 12.68s] Hierna kunnen we ook proberen deze tekst te laten vertalen naar het Engels om te zien hoe goed dat gaat.
[12.68s -> 13.88s] Ik ben benieuwd.
[13.88s -> 16.84s] Hoi allemaal, dit is weer een testbestandje.
[16.84s -> 20.72s] Deze keer om te testen of de Nederlandse taal goed herkent gaat worden.
[20.72s -> 26.48s] Hierna kunnen we ook proberen deze tekst te laten vertalen naar het Engels om te zien hoe goed dat gaat.
[26.48s -> 27.68s] Ik ben benieuwd.
[27.68s -> 30.72s] Hoi allemaal, dit is weer een testbestandje.
[30.72s -> 34.60s] Deze keer om te testen of de Nederlandse taal goed herkent gaat worden.
[34.60s -> 40.36s] Hierna kunnen we ook proberen deze tekst te laten vertalen naar het Engels om te zien hoe goed dat gaat.
[40.36s -> 41.52s] Ik ben benieuwd.
You can see the output streaming to the console as the model transcribes. Unless you run on CPU, you will also notice a pretty good speed. Now, as you're probably not Dutch, I'll just tell you the transcription above is perfect except for the one small (herkent/herkend) issue we had before, but as you know this can be fixed by loading a larger model size.
Play around with any audio file you want and see what model size you need. If you use English files, pick a .en model for greater efficiency. Also be aware that you can pass options into the .transcribe method much like with the vanilla Whisper model, for instance:
segments, info = model.transcribe(
    str(TEST_AUDIO_DIR / "dutch_long_repeat_file.mp3"),
    beam_size=5,
    word_timestamps=True,  # get word-level timestamps
    # without_timestamps=True,  # uncomment this line to get rid of timestamps and just transcribe
)
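With word_timestamps=True, each segment should also carry a words list with per-word start and end times. Here is a quick, hedged sketch of how you could print those, continuing with the same model and test file as above:

for segment in segments:
    for word in segment.words:
        print(f"[{word.start:.2f}s -> {word.end:.2f}s] {word.word}")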
In conclusion, faster-whisper is a nice optimization to look into if you’re considering deploying this model in a production application somewhere. There are also other optimized versions of the whisper model out there that you can check out, like distil-whisper. Play around and see which gives you the best trade-offs between speed and accuracy. I’ll leave the rest up to you as we move on from faster-whisper to check out the web-API version.
Web-API version
Another option we have is to simply not deploy the model anywhere but outsource this to OpenAI’s fast servers. This is kind of like making a ChatGPT call except we request a transcription instead of a chat completion. The OpenAI servers are also very optimized for machine-learning calculations (obviously) and as you’ll see they are therefore quite fast!
So let's take a look at the pricing first. The cost for using the Whisper API is $0.006 per minute of audio transcribed, rounded to the nearest second. This means a 20-minute video would cost you $0.12. This is a good solution if you don't want to deploy the model yourself, for example if your application will only be used occasionally and it's simply not worth investing that much into keeping a model running somewhere. For a high-use application dealing with longer files and many users, though, this is not the way to go.
So let's take a quick look at how this would work practically, by building one last quick application, but this time using the web API. Our application will take any video in any language as input and will return a short quiz with questions about the video. First, create a new file in your utils folder named openai_api.py:
📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
    📁test_audio_files
    📁utils
        📄command.py
        📄openai_api.py (✨new file)
        📄podcast.py
        📄subtitles.py
        📄video.py
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄2_whisper_pods.py
    📄3_subtitle_master.py
    📄4_faster_whisper.py
    📄settings.py
    📄.env
Inside openai_api.py, let's start with our imports and some basic setup:
import typing
from pathlib import Path

from decouple import config
from openai import OpenAI

CLIENT = OpenAI(api_key=str(config("OPENAI_API_KEY")))
MODEL = "whisper-1"
ResponseFormat = typing.Literal["text", "srt", "vtt"]
We'll use typing to define our allowed response formats. The rest are all imports we have used before: config, as we'll need to load our API key, and OpenAI to call the APIs for Whisper and ChatGPT. We create our CLIENT just like last time and save the MODEL in a string variable; whisper-1 is the only option for the Whisper API for now.
Finally, we define a type alias named ResponseFormat, which is a Literal type, meaning it can only be one of the three strings we have defined: text, srt, or vtt. We can use this as a type hint later to indicate that if a particular variable is of type ResponseFormat, it should have one of these three values and nothing else. (json and verbose_json are also possible if you prefer JSON object output, but we will skip them as they are not useful for our purposes.)
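Just to illustrate (this snippet is not part of the project file), a static type checker like mypy or pyright would accept the first assignment below and flag the second; note that Python itself won't enforce this at runtime:

good_format: ResponseFormat = "srt"   # fine, "srt" is one of the allowed literals
bad_format: ResponseFormat = "json"   # a type checker would flag this value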
Now we’ll define our transcription utility function:
def transcribe(
    file: Path,
    language: str | None = None,
    translate: bool = False,
    response_format: ResponseFormat = "text",
) -> str:
    print("Transcribing file...")
    options = {
        "file": file,
        "model": MODEL,
        "response_format": response_format,
    }
    if translate:
        transcript = CLIENT.audio.translations.create(**options)
    else:
        if language:
            options["language"] = language
        transcript = CLIENT.audio.transcriptions.create(**options)
    if not isinstance(transcript, str):
        raise TypeError(
            f"Expected a string value to be returned, but got {type(transcript)} instead."
        )
    print(f"Transcription successful:\n{transcript[:100]}...")
    return transcript
We define a function called transcribe which takes a file of type Path; a language of type str or None, which defaults to None, in which case the API will try to detect the language automatically; a translate boolean which defaults to False; and a response_format which has to be of type ResponseFormat, so one of the three values we defined in the type alias, defaulting to text. The function returns a string.
We print a message to indicate the transcription is starting and then create a dictionary named options holding the options needed for both a translation and a transcription call, the shared options if you will: the file, model, and response_format. If the user requests a translation we call the CLIENT.audio.translations.create method, passing in the **options dictionary as arguments as-is. If translate is False it must be a transcription. For transcriptions, we can add the language key to the options dictionary to specify the language, but if the user didn't provide one we leave it out and the call will just take a bit longer to do the auto-detection. This time we call the CLIENT.audio.transcriptions.create method, again passing in the **options dictionary, which optionally now contains the language key.
Finally, we check if the transcript is a string, and if not we raise a TypeError to indicate something went wrong, just to make sure the user is not requesting JSON from this endpoint, which is possible and would crash the rest of our code. Otherwise, we print a message to indicate the transcription was successful and return the transcript.
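If you want to give the function a quick test on its own before we build the quiz part, you could temporarily add something like this at the bottom of the file (a sketch using the Dutch test file from the earlier parts; remember this call will use API credits):

if __name__ == "__main__":
    from settings import TEST_AUDIO_DIR  # only needed for this quick test

    english_text = transcribe(
        TEST_AUDIO_DIR / "dutch_long_repeat_file.mp3",
        translate=True,  # translate the Dutch audio to English
        response_format="text",
    )
    print(english_text)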
Video to Quiz
As we're going to be building a video-to-quiz app, we need one more utility function inside this openai_api.py file, which will take a transcript and generate some questions for us. Continue below the transcribe function:
PROMPT_SETUP = """You are a text-to-quiz app. The user will provide you a video transcription in textual format. You will generate a list of questions for the user to answer about this video. Depending on the length of the transcription, stick to a maximum of 5 questions. All questions should be solely about the video transcription content provided by the user and should be answerable by reading the transcription. Do not provide the answers, but only the questions. The transcription the user provides is based on a video, and may include timestamps, please ignore these timestamps and just treat it as one single transcription containing all the content in the video. List and number each item on a separate line. """ from tenacity import retry, stop_after_attempt, stop_after_delay
First, we define a constant to hold the prompt setup instruction for ChatGPT. Just go ahead and copy mine; it's a fairly basic setup that asks for questions related to the video so we can make a quiz tailor-made for the input video. We also import retry, stop_after_attempt, and stop_after_delay from the tenacity package. (Go ahead and move the tenacity import line to the top of your file with the other imports instead of leaving it here in the middle.) We can use these to make our code a bit more robust when calling APIs or taking actions that don't have a 100% success rate. It's fairly easy to use and I just want to show you that this tool is out there; you'll see how it works in a second.
Let’s code up the function:
def text_to_quiz(text: str) -> str:
    print("Converting text to quiz...")
    messages = [
        {"role": "system", "content": PROMPT_SETUP},
        {"role": "user", "content": text},
    ]
    result = CLIENT.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=messages,
    )
    content = result.choices[0].message.content
    if content is None:  # Just a quick sanity check
        raise ValueError("There was an error while trying to generate the quiz.")
    print("Text to quiz conversion completed.")
    return content
Our function takes a string which is the transcription and returns a string as output. We create a list of messages, the first being the system message holding our PROMPT_SETUP, and the second being the user message with the transcription as its content. We then call the CLIENT.chat.completions.create method, passing in the model and messages as arguments. We'll use gpt-3.5-turbo-1106, which is the newest gpt-3.5 model out there and is frankly good enough. You can use gpt-4, but make sure you consider the cost: it is considerably more expensive and not really needed for this use case. If you're worried about the smaller maximum input size, or 'context window', of gpt-3.5, know that it has a 16k-token context window that can easily handle long video transcriptions, and most are not as long as you might think.
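If you want a rough feel for the numbers (using assumed averages, so treat this as a back-of-envelope estimate only):

# Back-of-envelope: does an hour of speech fit in a 16k-token context window?
words_per_minute = 150  # typical speaking pace (assumption)
tokens_per_word = 1.3   # rough average for English text (assumption)
minutes_of_video = 60

estimated_tokens = int(minutes_of_video * words_per_minute * tokens_per_word)
print(estimated_tokens)  # ~11,700 tokens, comfortably inside 16k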
We then access the content of the first choice's message in the result object, which should hold our quiz. We do a quick sanity check to make sure we received a valid response, and then print a message to indicate the conversion was successful and return the content.
So that's pretty simple, right? But what if we get no content back? Do we really want to just raise an error and give up immediately? Let's use the tenacity library so we can try again in case of a failure. The only thing we have to change is to add the @retry decorator above our function; only the first line is new:
@retry(stop=stop_after_attempt(3) | stop_after_delay(60))
def text_to_quiz(text: str) -> str:
    print("Converting text to quiz...")
    messages = [
        {"role": "system", "content": PROMPT_SETUP},
        {"role": "user", "content": text},
    ]
    result = CLIENT.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=messages,
    )
    content = result.choices[0].message.content
    if content is None:  # Just a quick sanity check
        raise ValueError("There was an error while trying to generate the quiz.")
    print("Text to quiz conversion completed.")
    return content
And just like that, our function is set up to retry up to three times or (|) for a maximum of 60 seconds, just in case the API call fails for some reason. Notice how easy the Tenacity library is to use. This is not required, but it's a nice way to make your code more robust just in case.
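Tenacity has more knobs if you need them, for example waiting between attempts instead of retrying immediately. Here is a toy sketch (not part of our project) combining stop_after_attempt with wait_fixed:

import random

from tenacity import retry, stop_after_attempt, wait_fixed

@retry(stop=stop_after_attempt(3), wait=wait_fixed(2))  # up to 3 tries, 2 seconds between them
def flaky_call() -> str:
    if random.random() < 0.7:
        raise ConnectionError("simulated transient failure")
    return "success"

print(flaky_call())  # if all 3 attempts fail, tenacity raises a RetryError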
Putting it all together
That's our openai_api.py file done! Go ahead and save and close it. Now let's create a new file in our project root directory called 4_vid_to_quiz.py to put it all together:
📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
    📁test_audio_files
    📁utils
        📄command.py
        📄openai_api.py
        📄podcast.py
        📄subtitles.py
        📄video.py
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄2_whisper_pods.py
    📄3_subtitle_master.py
    📄4_faster_whisper.py
    📄4_vid_to_quiz.py (✨new file)
    📄settings.py
    📄.env
Inside 4_vid_to_quiz.py let's start with our imports:
import os
import uuid
from pathlib import Path

import gradio as gr

from settings import BASE_DIR, OUTPUT_TEMP_DIR, STYLES_DIR
from utils import openai_api, video

API_UPLOAD_LIMIT_BYTES = 26214400  # 25mb
We will use os to check the size of the file we will upload, as there is a size limit on the API. The rest are imports you've seen before, plus some of our directories from the settings file and our openai_api and video utilities. We also define a constant API_UPLOAD_LIMIT_BYTES, the maximum size of file we can upload to the API, which is 25 MB.
Let’s start with a quick function to check if the file is not too big:
def check_upload_size(input_file: str) -> None:
    """Check the video file size is within the API upload limit."""
    input_file_size = os.path.getsize(input_file)
    if input_file_size > API_UPLOAD_LIMIT_BYTES:
        raise ValueError(
            f"File size of {input_file_size} bytes ({input_file_size / 1024 / 1024:.2f} MB) "
            f"exceeds the API upload limit of {API_UPLOAD_LIMIT_BYTES} bytes "
            f"({API_UPLOAD_LIMIT_BYTES / 1024 / 1024:.2f} MB). "
            "Please use a shorter video or lower the audio quality settings."
        )
We take an input file path as a string, use os.path.getsize to get the size of the file in bytes, and check whether it is larger than our API_UPLOAD_LIMIT_BYTES. If it is, we raise a ValueError to indicate the file is too large; the error message includes both the file size and the API upload limit. That's all there is to this function.
Let's move on to our main function:
def main(input_video: str) -> str:
    """Takes a video file as string path and returns a quiz as string."""
    unique_id = uuid.uuid4()
    mp3_file = video.to_mp3(
        input_video,
        log_directory=BASE_DIR,
        output_path=OUTPUT_TEMP_DIR / f"{unique_id}.mp3",
        mono=True,
    )
    check_upload_size(mp3_file)
    transcription = openai_api.transcribe(
        Path(mp3_file), language="en", translate=False, response_format="text"
    )
    quiz = openai_api.text_to_quiz(transcription)
    return quiz
This is the function the gradio button will call when clicked. It takes an input_video as string input and will return the quiz in string format. We don't really care about the name of the mp3 file we'll extract from the video here, so we just use a uuid to make it unique. Now we use our video.to_mp3 utility function from the previous part to extract the audio from the video.
We pass in the input_video as the video file, our project root directory as the log_directory, and an output_path of OUTPUT_TEMP_DIR with the uuid and .mp3 extension appended. Finally, this is the time to use the mono option we built into the to_mp3 function but didn't use last time. So far the size of our files hasn't been that important, but now that we're calling a web API it suddenly becomes relevant.
Whisper down-mixes audio to mono before processing anyway, and the API has an upload limit of roughly 25 MB per transcription request. So we can save a lot of space by dropping from stereo to a single mono channel, which lets us drastically lower the bitrate and therefore send much longer files.
Stereo audio at 192 kbps would hit the file limit after roughly 18 minutes. By more than halving the bitrate to 80 kbps, which is still considered decent quality for mono mp3 files, we can transcribe roughly 44 minutes. You can also play with the other audio quality settings or drop the bitrate further to 64 kbps mono if you want to go even longer.
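Here is the quick arithmetic behind those numbers, as a sketch you can tweak for other bitrates (it assumes the usual 1 kbps = 1000 bits per second convention for mp3 bitrates):

API_UPLOAD_LIMIT_BYTES = 26214400  # 25 MB

def max_minutes_at(bitrate_kbps: int) -> float:
    """How many minutes of mp3 audio fit under the upload limit at a given bitrate."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return API_UPLOAD_LIMIT_BYTES / bytes_per_second / 60

print(f"{max_minutes_at(192):.0f} minutes at 192 kbps (stereo)")  # ~18 minutes
print(f"{max_minutes_at(80):.0f} minutes at 80 kbps (mono)")      # ~44 minutes
print(f"{max_minutes_at(64):.0f} minutes at 64 kbps (mono)")      # ~55 minutes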
After that, we run our check_upload_size check to make sure the file is not too large, and then we call our openai_api.transcribe function, passing in the mp3_file as the file, language="en" as the language, translate=False as we don't want to translate, and response_format="text" as we want the transcription in text format. We then call our openai_api.text_to_quiz function, passing in the transcription as the text, and return the resulting quiz.
Gradio Interface
Finally, we’ll create our gradio interface:
if __name__ == "__main__":
    block = gr.Blocks(
        css=str(STYLES_DIR / "vid2quiz.css"),
        theme=gr.themes.Soft(primary_hue=gr.themes.colors.yellow),
    )

    with block:
        with gr.Group():
            gr.HTML(
                f"""
                <div class="header">
                    <img src="https://i.imgur.com/oEtZKEh.png" referrerpolicy="no-referrer" class="header-img" />
                </div>
                """
            )
            with gr.Row():
                input_video = gr.Video(
                    label="Input Video", sources=["upload"], mirror_webcam=False
                )
                output_quiz_text = gr.Textbox(label="Quiz")
            with gr.Row():
                button_text = "📝 Make a quiz about this video! 📝"
                btn = gr.Button(value=button_text, elem_classes=["button-row"])

        btn.click(main, inputs=[input_video], outputs=[output_quiz_text])

    block.launch(debug=True)
All of this will be familiar by now. I just used a different CSS file, which we'll create next, and a slightly different primary_hue for the theme than last time. The imgur image link has changed as well to give you a new header logo, and below that we just take an input video and have an output Textbox. Our button has a CSS class of button-row again so we can style it, and clicking the button runs the main function with the input video, sending the output to the output textbox.
Let's add the CSS file to our styles folder:
📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
        📄subtitle_master.css
        📄vid2quiz.css (✨new file)
        📄whisper_pods.css
    📁test_audio_files
    📁utils
        📄command.py
        📄openai_api.py
        📄podcast.py
        📄subtitles.py
        📄video.py
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄2_whisper_pods.py
    📄3_subtitle_master.py
    📄4_faster_whisper.py
    📄4_vid_to_quiz.py
    📄settings.py
    📄.env
And inside vid2quiz.css let's add the following:
.header {
    display: flex;
    justify-content: center;
    align-items: center;
    padding: 2em 8em;
}

.header-img {
    max-width: 50%;
}

.header,
.button-row {
    background-color: #0c1d36;
}
We use flex to center the header image vertically and horizontally and apply the usual padding. We give the header-img class a max-width of 50% so it doesn't take up the entire width of the screen. Finally, we give the header and button-row classes a background color of #0c1d36, which is a dark blue.
Ok, you know the drill, let’s run it and see what happens!
Ok, looking good, so let’s upload a video and then request a quiz about it. I used a random video from YouTube, namely Hot Dr Pepper from the 1960s, just because it showed up when I opened the YouTube website. Let’s see how it does:
Perfect, exactly what we wanted, and this was all powered by the OpenAI API! You’ll also notice it was probably reasonably fast, considering it had to convert the whole video and then transcribe it and generate a quiz.
One important limitation of the app in this particular form is that it can only handle videos up to roughly 44 minutes in length (with the 80 kbps mono settings), because of the upload limit. If you want to handle longer videos you could split them up and stitch the transcripts back together, but honestly, if you're going to be handling files of that length you're probably better off deploying the model yourself to save cost, as the API is billed per minute of audio.
A fun idea: you can also use the translation option in our openai_api.transcribe function to take foreign-language videos as input and get English questions about them as output. This could be cool for a language-learning app or test.
So that’s it for the whisper course. I hope you enjoyed it and now have a good idea of how to use Whisper, what you can use it for, and the various deployment options. The next step is up to you and limited only by your imagination!
As always, it was an honor and a pleasure to take this journey together, and I hope to see you next time!