Welcome to this first part of the Whisper course. My name is Dirk van Meerveld and it is my pleasure to be your host and guide for this tutorial series where we will be looking at OpenAI’s amazing speech-to-text model called Whisper.
We’ll first take a look at what it is and how its basic usage works, and then we’ll explore ways in which we can practically use it in our projects. Along the way, we’ll learn about the balance between model size and accuracy, and in the final part, we’ll look at alternative options to speed it up or outsource the processing to OpenAI’s servers.
The local installation process should not be too much of a problem but is a bit different for all operating systems and system setups. Unfortunately, I cannot cover every single possible system setup configuration, so you may have to do some googling and trial and error along the way.
This is an inevitable part of software development. Don’t give up; you will get it working eventually. We all get stuck trying to get something to work with our particular system sometimes; it’s just part of the job.
If you do not like a particular configuration, like running the model locally, rest assured: we will cover both the different ways to run Whisper and various implementation projects over the series. Just watch through the whole thing, then take whatever projects you like and combine them with whichever way of running Whisper you preferred.
Installing Whisper
First, we need to install Whisper. We’ll be using the pip package manager for this, so make sure you have that installed, but you should if you’re a Python user. In a terminal window run the following command:
pip install -U openai-whisper
The `-U` flag in the `pip install -U openai-whisper` command stands for `--upgrade`. It means that Whisper will either be installed or upgraded to the latest version if it is already installed.
The second thing we need to have installed is `ffmpeg`. What is `ffmpeg`? FFmpeg is a versatile multimedia framework that allows us to work with audio and video files. It supports a wide range of formats and is highly portable, running on pretty much any operating system.
The simplest way to install `ffmpeg` is to use a package manager. If you’re on Windows, you can use Chocolatey to install `ffmpeg` by running the following command in a terminal window:
# on Windows / Chocolatey
choco install ffmpeg
If you’re on MacOS using Homebrew, you can install `ffmpeg` by running the following command in a terminal window:
# on MacOS / Homebrew
brew install ffmpeg
If you’re on Linux, well, you probably know what to do and don’t need instructions!

sudo apt update && sudo apt install ffmpeg
This may be the most challenging part of the tutorial series, to be honest. You may not run into any issues if your system is already set up well, or you may need to do quite some googling and setup work to get everything up and running. It took me some messing around to get everything working properly on my system and it’s unfortunately impossible to know exactly what you will need to do to resolve any issues you may run into. Google is your friend! Remember we’ll also cover the API in part 4 if you don’t want to run the model locally, but don’t just skip ahead as you’ll miss out on a lot of useful information.
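Before moving on, you can quickly confirm that both Whisper and ffmpeg are reachable from Python with a tiny check like the one below. This is just a convenience sketch, nothing official:

```python
import shutil

import whisper  # raises ImportError if the pip install did not succeed

# Whisper uses ffmpeg to read audio files, so it needs to be on your PATH
if shutil.which("ffmpeg") is None:
    print("ffmpeg was not found on your PATH - revisit the installation step above")
else:
    print("whisper and ffmpeg both look good, you're ready to go!")
```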
What is Whisper?
Whisper is a speech-to-text model developed by OpenAI. What is really cool is that they released this model to the public as open source. It is a neural network that takes audio as input and outputs text. It is trained on a large dataset of audio and text pairs and has learned which text corresponds to which audio. What is exciting about the model is that it’s not just effective at transcribing high-quality ‘gold-standard’ audio recorded on studio microphones, but is also very good at transcribing audio of considerably lower quality, or even with imperfect pronunciation and a foreign accent. If you compare it with auto-generated subtitles from YouTube, for example, you will see that it really is a level apart.
Instead of diving deep into the model’s architecture and technical details that make it work behind the scenes, this course will focus on the practical application of what we can do with it and how to use it to make cool stuff.
Model sizes
There are different sizes available for the Whisper model. The smaller the model, the less processing power and VRAM it needs and the faster it will run, but this comes at the cost of lower accuracy. Conversely, the larger the model, the more processing power and VRAM it needs and the longer it will take to run, but the more accurate it will be and the better it will deal with foreign languages, noise, and poor audio quality.
Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative Speed |
---|---|---|---|---|---|
tiny | 39M | tiny.en | tiny | ~1GB | ~32x |
base | 74M | base.en | base | ~1GB | ~16x |
small | 244M | small.en | small | ~2GB | ~6x |
medium | 769M | medium.en | medium | ~5GB | ~2x |
large | 1550M | N/A | large | ~10GB | 1x |
As we can see in this table from the Whisper GitHub, we have 5 different model sizes in total. There are 4 sizes for the English-only model, namely `tiny.en`, `base.en`, `small.en`, and `medium.en`. As these models only deal with the English language, it is highly recommended to use one of them when you know you’re going to be transcribing English: they are specialized in dealing with English only and therefore give greater accuracy at a much smaller model size and run-time. This is also why there is no `large.en` model; the `medium.en` model is already sufficient in size to equal the accuracy of the `large` multilingual model.
For the multilingual models, we have the `tiny`, `base`, `small`, `medium`, and `large` sizes. Whisper was trained on a whopping 680,000 hours of audio data covering a total of 97 different languages, though performance does vary per language, as more obscure languages may not work quite as well. The larger the model size, the more easily it will deal with such languages, specific accents, and poor audio quality.
Now if you don’t have 10GB of VRAM, don’t worry, you can often get away with using the smaller-size models, as you will see. Later on, in the last part of the series, we’ll look at smaller ‘distilled’ versions of the model that can help us optimize speed further, or at just outsourcing the processing to the lightning-fast OpenAI servers. Just keep watching! That being said, I actually recommend you always use the smallest version that you can get away with for your specific task. There is simply no point in adding more cost and complexity to your apps. If you don’t need it, the extra model size will only slow down your application and raise its cost.
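If you’re unsure which size your machine can handle, you can automate a rough choice based on the VRAM column in the table above. This is just a heuristic sketch (the thresholds simply mirror the table and the function name is my own), not anything built into Whisper:

```python
import torch

def pick_model_size() -> str:
    """Rough heuristic: pick the largest Whisper model that fits the VRAM table above."""
    if not torch.cuda.is_available():
        return "base"  # no GPU: stick to a small model for reasonable speed
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if vram_gb >= 10:
        return "large"
    if vram_gb >= 5:
        return "medium"
    if vram_gb >= 2:
        return "small"
    return "base"

print(pick_model_size())  # e.g. "small"
```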
Basic usage
Now that we have Whisper, fire up your favorite code editor and let’s get started! I’ll be using VSCode, but you can use whatever IDE you like. Create a root folder for your project, I’ll call mine `FINX_WHISPER`, and then inside make a new file called `1_basic_call_english_only.py`. (I’m using numbers for the file names so you can easily reference them later when you are busy coding some cool new project, but this is obviously not a good general naming convention):
📁FINX_WHISPER (project root folder)
    📄1_basic_call_english_only.py
Then open up the new Python file and start with the imports:
import whisper
from pathlib import Path
The `whisper` import is obvious, and `pathlib` will help us get the path to the audio files we want to transcribe; this way our Python file will be able to locate our audio files even if the terminal window is not currently in the same directory as the Python file. Now let’s declare some constants:
MODEL = whisper.load_model("base.en")
AUDIO_DIR = Path(__file__).parent / "test_audio_files"
First, we declare `MODEL` and load the `base.en` model. We start with the second-smallest English-only model and will scale up if and when we need to. Then we declare `AUDIO_DIR` and use `pathlib` to get the path. This works by first getting the path to the current file (`1_basic_call_english_only.py`) using `__file__`, and then getting the parent directory of that file using `.parent`. Then we add the `test_audio_files` folder to the path using the `/` operator. This way we can easily access the audio files in the `test_audio_files` folder from our Python file.
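If you want to sanity-check this path logic, you can print the pieces once. This is purely illustrative; the absolute paths hinted at in the comments are hypothetical and will differ on your machine:

```python
from pathlib import Path  # already imported at the top of our file

print(Path(__file__))                              # .../FINX_WHISPER/1_basic_call_english_only.py
print(Path(__file__).parent)                       # .../FINX_WHISPER
print(Path(__file__).parent / "test_audio_files")  # .../FINX_WHISPER/test_audio_files
```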
Now let’s create the `test_audio_files` folder, as it doesn’t actually exist yet. Make sure you spell it correctly:
📁FINX_WHISPER (project root folder)
    📁test_audio_files
    📄1_basic_call_english_only.py
Then go ahead and add the audio files provided into the folder. They should be provided together with this video tutorial, but if for any reason you cannot find them, go to the Finxter GitHub repository for this course or you can find a copy at:
Download all the test files and put them in the folder (you can also add your own audio files if you want to, these are just provided for your convenience):
📁FINX_WHISPER (project root folder)
    📁test_audio_files
        🔊dutch_long_repeat_file.mp3
        🔊dutch_the_netherlands.mp3
        🔊high_quality.mp3
        🔊low_quality.mp3
        🔊terrible_quality.mp3
    📄1_basic_call_english_only.py
Ok, back to our `1_basic_call_english_only.py` file. Below the `MODEL` and `AUDIO_DIR` variables, let’s create a function that will transcribe the audio files for us:
def get_transcription(audio_file: str):
    result = MODEL.transcribe(audio_file)
    print(result)
    return result
This function takes an audio file’s path in string format as input. We then call the `.transcribe()` method Whisper provides for us and pass in the audio file’s path. Then we simply print and return the result for a basic test. Looks really simple, right?
First, let’s try and transcribe a high-quality English audio file, as a sort of best-case scenario:
get_transcription(str(AUDIO_DIR / "high_quality.mp3"))
Notice that the function we wrote above takes the path as a string variable. This is because Whisper requires the path to the audio file as a string. `AUDIO_DIR / "high_quality.mp3"` returns a `Path` object, so we use `str()` to convert it to a string, or else Whisper will crash.
Getting a transcription
So go ahead and save and run the file, and you will see a large object containing all the output. Let’s take a quick look at the information available to us here, read the comments for an explanation:
{
    # First we get the full transcription
    "text": " Hi guys, this is just a quick test audio file for you. Let's see how well it does and if my speech is recognized and converted to text properly. I'm really excited to see how well this works and I hope that it will be a good test for you guys to see how well the whisper model works.",
    # Now we have the list of segments
    "segments": [
        {
            "id": 0,
            "seek": 0,
            # Start and end times in seconds
            "start": 0.0,
            "end": 3.52,
            "text": " Hi guys, this is just a quick test audio file for you.",
            # List of tokenized words from the transcription, where each word is represented by a unique number
            "tokens": [
                50363, 15902, 3730, 11, 428, 318, 655, 257, 2068, 1332, 6597, 2393, 329, 345, 13, 50539
            ],
            "temperature": 0.0,
            # In the context of machine learning, temperature is a parameter that controls the randomness of
            # predictions. A temperature of 0.0 suggests no randomness, or the model always selecting the
            # tokens (words) with the highest probability (this is similar to the ChatGPT API temperature
            # setting). You can pass a temperature value to the transcribe function when calling it if you
            # want to introduce more randomness into your generations.
            # For instance: model.transcribe(audio_file, temperature=0.2)
            "avg_logprob": -0.1399546700554925,
            # The average log probability of the tokens in the segment. The closer to 0 the better, which means
            # if the number gets more negative, like -0.2 for instance, the model is much less confident in its
            # transcription (and there are probably more errors).
            "compression_ratio": 1.5898876404494382,
            "no_speech_prob": 0.0045762090012431145,
            # Represents the probability that the segment contains no speech. We can see that it is very low.
        },
        {
            '... more segments with the same structure as above, cut for brevity ...'
        },
    ],
    "language": "en",
}
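In a real program you rarely want to print the whole object like this. To pull out just the parts we discussed, for instance the full text plus a per-segment confidence overview, something like this small sketch (reusing the `MODEL` and `AUDIO_DIR` from above) does the trick:

```python
result = MODEL.transcribe(str(AUDIO_DIR / "high_quality.mp3"))

print(result["text"])  # the full transcription as one string
for segment in result["segments"]:
    # start/end are in seconds, avg_logprob is the confidence indicator discussed above
    print(
        f'{segment["start"]:6.2f}s - {segment["end"]:6.2f}s '
        f'(avg_logprob {segment["avg_logprob"]:.3f}): {segment["text"]}'
    )
```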
As we can see, we really get a lot of information back from the model! What is most interesting is of course the transcription itself. Notice that it is a perfect word-for-word transcription even though we used the second-smallest model available, `base.en`. Very impressive for such a small version of the model! Now let’s try a lower-quality audio file:
Replace the last call:
get_transcription(str(AUDIO_DIR / "high_quality.mp3"))
with:
get_transcription(str(AUDIO_DIR / "low_quality.mp3"))
When we run this with the considerably lower-quality audio file, still on the `base.en` model, I still get a perfect transcription. If we look closely at the output object, though, we can clearly see the `avg_logprob` (explained above) has moved further away from 0, from `-0.1399546700554925` to `-0.2179246875974867`, indicating the model is now much less confident in its transcription (though still correct).
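If you want your code to notice this kind of confidence drop automatically, a small helper along these lines can flag shaky segments. This is just a sketch; the -0.5 threshold is an arbitrary value you would tune for your own audio:

```python
def flag_low_confidence_segments(audio_file: str, threshold: float = -0.5) -> dict:
    """Transcribe and print any segments whose avg_logprob falls below the threshold."""
    result = MODEL.transcribe(audio_file)
    for segment in result["segments"]:
        if segment["avg_logprob"] < threshold:
            print(f'Low confidence ({segment["avg_logprob"]:.2f}): {segment["text"]}')
    return result


flag_low_confidence_segments(str(AUDIO_DIR / "low_quality.mp3"))
```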
Now let’s try a really poor-quality audio file:
get_transcription(str(AUDIO_DIR / "terrible_quality.mp3"))
And if we run this we can see that it is still half correct even though a human would have trouble understanding it:
Hi guys. This is just a quick test audio file for you. Let's see how well it does and if my speech is recognized, thank you for the context properly. I'm really excited to see how well this works and I hope that it will be a quick test for you guys to see how well the whisper model works.
We have clearly reached the limits of the base model here as part of this is incorrect, and it’s time to step up to a bigger model size. (Remember, you generally want to use the smallest model you can get away with for your use case!)
I’m going to change the model to `small.en` by editing the `MODEL` variable at the top of our file:
MODEL = whisper.load_model("small.en")
Now if we run it again:
Hi guys, this is just a quick test audio file for you. Let's see how well it does, and if my speech is recognized and converted to text properly, I'm really excited to see how well this works, and I hope that it will be a good test for you guys to see how well the Whisper model works.
There is an awkward, super-long sentence with a few too many commas, but apart from that it’s perfect, even though the audio quality of this file is pretty terrible. Switching to `medium.en` fixes that last small imperfection with the multiple commas, by the way. This is the power of Whisper!
Taking a deeper look
Now let’s take a slightly deeper look at what is happening inside Whisper while also looking at using other languages and even translation. Make a new file in your root folder called `1_multiple_languages.py`:
📁FINX_WHISPER (project root folder)
    📁test_audio_files
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
Then open up the new `1_multiple_languages.py` file and start with the imports:
import whisper
from pathlib import Path

AUDIO_DIR = Path(__file__).parent / "test_audio_files"
model = whisper.load_model("base")
Make sure to use the `base` model this time, and not the `base.en` model, as we want to use all available languages.
First, we’ll take a slightly deeper look under the hood to get a rough idea of what is going on, as this will help us understand some important nuances. After that, we’ll greatly simplify the whole thing by using the higher-level code again. Let’s write a function that will detect the language and transcribe a file for us, and we’ll explain it line by line.
def detect_language_and_transcribe(audio_file: str):
    audio = whisper.load_audio(audio_file)
We define a function which takes the path to an `audio_file` as a string argument. We then call Whisper’s `.load_audio()` method and pass in the audio file’s path. This returns a NumPy array containing the audio waveform in float32 datatype, or in other words, an array containing the audio data as a giant list of numbers.
audio = whisper.pad_or_trim(audio)
Next, we get a 30-second sample, either padding with silence if the file is shorter than 30 seconds or trimming it if it is longer. This is because the Whisper model is built and trained to take 30 seconds of audio as its input data each time. This doesn’t mean you cannot transcribe longer files but does have some implications we’ll get back to later.
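As a quick aside, if you’re curious what this audio array actually looks like, you can inspect it before and after the padding/trimming step. This is just for illustration, using the `dutch_the_netherlands.mp3` file from earlier; Whisper resamples all audio to 16 kHz, which is where the 16,000 below comes from:

```python
audio = whisper.load_audio(str(AUDIO_DIR / "dutch_the_netherlands.mp3"))
print(audio.dtype, audio.shape)        # float32, one number per audio sample
print(len(audio) / 16_000, "seconds")  # duration of the original file

audio = whisper.pad_or_trim(audio)
print(len(audio) / 16_000, "seconds")  # always exactly 30.0 after pad_or_trim
```

Back to our function; the next line creates the spectrogram: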
mel = whisper.log_mel_spectrogram(audio).to(model.device)
Make a log-Mel spectrogram and move it to the same device as the model (e.g. your GPU). A log-Mel spectrogram is a representation of a sound or audio signal that has been transformed to highlight certain perceptual characteristics.
👨‍🏫 Spectrogram: A spectrogram is a visual representation of the spectrum of frequencies in a sound or other signal as they vary with time. It’s essentially a heat map where the x-axis is time, the y-axis is frequency, and the color represents loudness.

👨‍🏫 Mel Scale: The Mel scale is a perceptual scale of pitches that emulates the human ear’s response to different frequencies. We humans are much better at distinguishing small changes in pitch at low frequencies than at high frequencies. The Mel scale makes the representation match human perception more closely than the exact mathematical frequencies do.

👨‍🏫 Logarithmic Scale: Taking the logarithm of the spectrogram values is another step to make the representation match human perception more closely. We perceive loudness on a logarithmic scale (which is why we use decibels, a logarithmic measurement, to express the loudness of sound).

👨‍🏫 Combining these, a log-Mel spectrogram is a representation of sound that is designed to highlight the aspects that are most important for human perception. It’s commonly used in audio processing tasks, including speech and music recognition.
Now that we have this log-Mel spectrogram, we can use it to detect the language of our audio file. We do this by passing it to the `.detect_language()` method of our model:
language_token, language_probs = model.detect_language(mel)
This returns the `language_token`, which is a number we will not be using, and `language_probs`, which is a huge dictionary of probabilities indicating how likely each possible language is to match the sound file. As we won’t actually be using the `language_token` variable, we can replace it with a `_` to indicate that we won’t be using it. This makes it into a sort of throwaway variable that we don’t care about.
_, language_probs = model.detect_language(mel)
Let’s take what we have so far, add a print statement to check out `language_probs`, and run it using the `dutch_the_netherlands.mp3` file I prepared for you:
def detect_language_and_transcribe(audio_file: str):
    audio = whisper.load_audio(audio_file)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, language_probs = model.detect_language(mel)
    print(language_probs)


detect_language_and_transcribe(str(AUDIO_DIR / "dutch_the_netherlands.mp3"))
Now when we run this, we can see the massive `language_probs` dictionary printed to our console:
{
    '.. cut for brevity ..'
    "yi": 2.012418735830579e-05,
    "ka": 2.161949907986127e-07,
    "nl": 0.9650669693946838,
    "en": 0.010499916970729828,
    "ko": 9.358442184748128e-05,
    "mn": 5.96029394728248e-06,
    "de": 0.010318436659872532,
    '.. cut for brevity ..'
}
We have a huge collection of numbers here, as you can see. The higher the number, the more likely the language; many of them are on the order of 1e-4, 1e-5, 1e-6, or even lower. We can clearly see that `nl` (Dutch) is by far the highest probability, close to a perfect score of 1 at `0.965`. The second and third highest are `en` (English) and `de` (German), both at around `0.010`, which is not even close, so we can be very confident that this is Dutch. Impressive for a model as small as `base`, which has to deal with so many languages, especially since Dutch is not really that big a language.
Of course, we don’t want this whole list; we just want to know the most probable language, so we can use the `max` function to get the highest probability.
def detect_language_and_transcribe(audio_file: str):
    ...
    language: str = max(language_probs, key=language_probs.get)
    print(f"Detected language: {language}")
`max` returns the key of the largest value in the dictionary. We pass in the dictionary as the first argument. The `key` argument is a function that is called on each item in the dictionary, and the item for which the function returns the largest value is the result of the `max` call. We can just use the dictionary’s built-in `.get()` method as that function to look up the value of each key.
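If the `key` argument feels abstract, here is the same pattern on a tiny, made-up dictionary, just to show the mechanics:

```python
probs = {"nl": 0.965, "en": 0.010, "de": 0.010}
print(max(probs, key=probs.get))                   # 'nl' - the key with the largest value
print(sorted(probs, key=probs.get, reverse=True))  # ['nl', 'en', 'de'] - ranked by probability
```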
The language name codes are in ISO 639-1 format and can be found here. We add a print statement to print the detected language. I also removed the previous `print(language_probs)` statement we added before.
def detect_language_and_transcribe(audio_file: str):
    ...
    language: str = max(language_probs, key=language_probs.get)
    print(f"Detected language: {language}")
    options = whisper.DecodingOptions(language=language, task="transcribe")
    result = whisper.decode(model, mel, options)
    print(result)
    return result.text
Now we’ll decode this 30-second audio file into text. First, we create a `DecodingOptions` object and save it in the variable named `options`. The `DecodingOptions` object lets you set more advanced decoding options, but we’ll stick to the basics for now, passing in the `language` we detected and the task of `"transcribe"`. We then call the `whisper.decode` function, which performs the decoding of the 30-second audio segment(s), provided as log-Mel spectrogram(s). We pass in the model, the mel spectrogram, and the options. This returns a `DecodingResult` object, which we save in the variable named `result`. We then print the `result` and return `result.text`.
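As a side note, `DecodingOptions` accepts a few more fields that can be handy. For example, `fp16=False` runs the decode in full precision (useful on CPU, where half precision triggers a warning), and `beam_size` enables beam search for a slower but often slightly more accurate decode. We won’t use these in our function, so treat this as an optional variation of the two lines above:

```python
options = whisper.DecodingOptions(
    language=language,
    task="transcribe",
    fp16=False,   # full precision, e.g. when decoding on CPU
    beam_size=5,  # beam search: slower, but often a bit more accurate
)
result = whisper.decode(model, mel, options)
```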
The whole function now looks like this:
def detect_language_and_transcribe(audio_file: str):
    audio = whisper.load_audio(audio_file)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, language_probs = model.detect_language(mel)
    language: str = max(language_probs, key=language_probs.get)
    print(f"Detected language: {language}")
    options = whisper.DecodingOptions(language=language, task="transcribe")
    result = whisper.decode(model, mel, options)
    print(result)
    return result.text
Now let’s run it with the `dutch_the_netherlands.mp3` file again:
dutch_test = detect_language_and_transcribe(
    str(AUDIO_DIR / "dutch_the_netherlands.mp3")
)
When you run this the object printed to the console will have the following transcription:
'Hoi, allemaal. Dit is weer een testbestandje. Deze keer om te testen of de Nederlandse taal goed herkend gaat worden. Hierna kunnen we ook proberen deze text te laten vertalen naar het Engels om te zien hoe goed dat gaat. Ik ben benieuwd.'
There we go, a perfect transcription! Now you probably don’t speak Dutch, but the above is a perfect word-for-word transcription of the spoken text.
Back to .transcribe
Now I’ll be honest, that was a little overcomplicated if we don’t need much customization and just want to call the model. Also, we don’t want to limit ourselves to just 30 seconds of audio. Let’s take it back to Whisper’s higher-level `.transcribe` function, which basically does all of the above for us.
Make sure you comment out the `dutch_test` code so it doesn’t keep running:
# dutch_test = detect_language_and_transcribe(
#     str(AUDIO_DIR / "dutch_the_netherlands.mp3")
# )
Now all we need to do to use `.transcribe` is load a model (`model = whisper.load_model("base")`), which we already did in this file, and then call the `.transcribe` method on the model and pass in the path to the audio file as a string:
result = model.transcribe(str(AUDIO_DIR / "dutch_the_netherlands.mp3"), verbose=True)
print(result["text"])
It also takes some options; in this case, we’ve set `verbose` to `True` so it will give us extra information in the console. If you go ahead and run this, you will get the exact same transcription in the output as we did above:
'Hoi, allemaal. Dit is weer een testbestandje. Deze keer om te testen of de Nederlandse taal goed herkend gaat worden. Hierna kunnen we ook proberen deze text te laten vertalen naar het Engels om te zien hoe goed dat gaat. Ik ben benieuwd.'
Again, you probably don’t speak Dutch, but that’s not the point. Under the hood, the `.transcribe` function reads the entire audio file and basically processes it in 30-second windows. You can also see that it did the language detection part for us automatically before starting:
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: Dutch
Working with longer files
So that’s pretty good, right? Well, let’s try a longer audio file and see what happens. I’ve provided `dutch_long_repeat_file.mp3`, which is just the same audio file repeated 3 times, totaling just over 40 seconds. Let’s see what happens when we try to transcribe this file (make sure you comment out the run above):
# result = model.transcribe(str(AUDIO_DIR / "dutch_the_netherlands.mp3"), verbose=True)
# print(result["text"])

result = model.transcribe(
    str(AUDIO_DIR / "dutch_long_repeat_file.mp3"),
    verbose=True,
    language="nl",
    task="transcribe",
)
print(result["text"])
Note that we can pass in the language if we already know it, so we can skip the detection step and save some time there. For applications where you always know the language ahead of time, just pass it in to optimize your application. We pass in `nl`, which is the ISO 639-1 code for Dutch.
Now let’s run this and check the output (yours will look different from mine):
Hoi j allemaal! Dit is weer een testbestandje! Deze keer om te testen of de Nederlandse taal goed herkent gaat worden. Je en bırak�� collecte geval. Je gievous raakt deze tekst te laten vertalen naar het Engels om te zien hoe goed dat gaat. Ik ben benieuwd! Hoi jlynn allemaal! Dit is weer een testbestandje. Deze keer om te testen of de Nederlandse taal goed herkent gaat worden. Je en driesbredmontie kunt wiring die text er metυτ�� mesma halen te laten vertalen naar het Engels om te zien hoe goed dat gaat! Ik ben benieuwd. Hoi allemaal! Dit is weer een testbestandje. Deze keer om te testen of de Nederlandse taal goed herkend gaat worden. Hierna kunnen we ook proberen deze tekst te laten vertalen naar het Engels om te zien hoe goed dat gaat. Ik ben benieuwd.
Now I’m not going to make you read this, but as a Dutch person, I will tell you this output is terrible and there are several characters and many words here that do not even exist in the Dutch language! So what happened? It’s the same model and the audio file is exactly the same as before, it’s just a bit longer and repeats itself. We should have gotten the same output right?
Well, it is because Whisper’s machine-learning model is limited to audio segments of only 30 seconds as its input. Because of this, it is more challenging for it to transcribe longer audio files. The `.transcribe` function took care of cutting the audio into 30-second segments for us, feeding them through, and stitching the results back together, making our life a lot easier, so we didn’t really notice this extra challenge. While Whisper does use some clever tricks to improve the quality when transcribing longer audio files that need to be cut into 30-second pieces and put back together again, this is inherently just a bit trickier, so we saw a significant drop in transcription quality even though the audio we were transcribing was exactly the same as before (just repeated 3 times in a row to make it longer).
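To make that 30-second limitation more concrete, here is a deliberately naive sketch of what chunked decoding looks like if you do it by hand: it cuts on hard 30-second boundaries and ignores the context and overlap tricks `.transcribe` uses, which is exactly why `.transcribe` does better than this would:

```python
def naive_chunked_transcribe(audio_file: str, language: str = "nl") -> str:
    chunk_samples = 30 * 16_000  # 30 seconds at Whisper's 16 kHz sample rate
    audio = whisper.load_audio(audio_file)
    texts = []
    for start in range(0, len(audio), chunk_samples):
        # pad the final (shorter) chunk with silence to exactly 30 seconds
        segment = whisper.pad_or_trim(audio[start : start + chunk_samples])
        mel = whisper.log_mel_spectrogram(segment).to(model.device)
        options = whisper.DecodingOptions(language=language, task="transcribe")
        texts.append(whisper.decode(model, mel, options).text)
    return " ".join(texts)
```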
Does this mean Whisper is only good for small files? Not at all! All we need to solve this bigger challenge of a minor language (Dutch) combined with files longer than 30 seconds is to just step up to a bigger model!
When changing the model to `small` instead of `base`:
model = whisper.load_model("small")
I got an almost perfect output with only a single very minor spelling mistake. When I changed to `medium` afterward, it was absolutely perfect. It’s just a matter of using a bigger model until it works. Pick the model size that corresponds to the size of your challenge.
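You could even automate this ‘step up until it works’ idea, using the `avg_logprob` value from earlier as a rough quality signal. The sketch below is just an illustration of the pattern; the -0.35 cutoff is an arbitrary threshold you would tune for your own audio, and reloading models like this is slow, so it only makes sense for offline batch work:

```python
def transcribe_with_escalation(audio_file: str, threshold: float = -0.35) -> str:
    for size in ("base", "small", "medium"):
        candidate = whisper.load_model(size)
        result = candidate.transcribe(audio_file, language="nl", task="transcribe")
        worst = min(seg["avg_logprob"] for seg in result["segments"])
        print(f"{size}: worst segment avg_logprob {worst:.2f}")
        if worst >= threshold:
            return result["text"]
    return result["text"]  # fall back to the largest model we tried
```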
Translating
Besides just transcribing, as if that wasn’t awesome enough, Whisper can also translate pretty much all major languages to English. (If you get very hacky it can even translate English to other languages, but that is not an intended or supported feature).
So now let’s give it an audio file in a non-English language and then ask it for an English translation. We’ll feed it the `dutch_the_netherlands.mp3` file again, but this time ask it for a translation (to English) so you can finally find out what I said in the audio!
result = model.transcribe(
    str(AUDIO_DIR / "dutch_the_netherlands.mp3"),
    verbose=True,
    language="nl",
    task="translate",
)
print(result["text"])
Make sure you comment out any calls above so you don’t run them by accident. I’ve already tested this out, and you’ll need to load roughly the `medium` model size to get a good translation, so make sure you load that BEFORE the call above (if your computer can handle it; otherwise just try a smaller one):
model = whisper.load_model("medium")
The output is:
Hey everyone, this is a test file again. This time to test whether the Dutch language will be recognized well. After this, we can also try to translate this text into English to see how well that goes. I'm curious.
It’s really quite a decent translation, straight from spoken audio. That is very impressive. It also handles sloppy pronunciation quite well: I tested this using my Korean pronunciation, which is not great, and the results were still pretty good.
So different languages, longer files, or slightly less native pronunciation all benefit a lot from larger versions of the model (as long as you have the VRAM for them). I’ll be sticking with the lower end of the model spectrum for this series as much as possible, since not everyone has the GPU to run the larger models, but feel free to use a larger one if you do.
On the flip side, if you can only run the small or even the base models, do not despair! The next two tutorials will actually do very well for accuracy running on these smaller models, and again, in the last part, we’ll look at speeding up, optimizing, or outsourcing the processing altogether.
Now that we’ve got the more boring basics out of the way, it’s time to build some cool and fun stuff and look at practical applications and integration in the next couple of parts! See you there!