Welcome back to part 2, where we'll start practically applying our Whisper skills to build useful stuff. We obviously can't rely on users handing us MP3 files to transcribe; they may just want to link a podcast, for example. Here, we'll be building a real application that can transcribe podcasts to text or subtitle format by taking just a podcast link as input.
Before we get started on the main code, we'll do some basic setup work and create the helper functions we need in our main code. Keeping things separated across multiple functions and files will keep our code a lot cleaner and more readable than one big script that does everything at the same time.
Saving our constants to a separate file
First, there are a couple of settings we'll be using again and again over the next three parts, namely the paths to the input and output folders for the mp3 files, subtitles, and whatever else we will be processing. Instead of importing `pathlib` in every single file and then writing `BASE_DIR = Path(__file__).parent`, we'll just write this in a separate file and import it everywhere we need it. This will also make it easier to change the paths later if we need to.
In your project folder create a new file called `settings.py`, making sure to put it in the root folder of your project:
📁FINX_WHISPER (project root folder)
    📁test_audio_files
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄settings.py
In `settings.py`, write the following code:
```python
from pathlib import Path

BASE_DIR = Path(__file__).parent
OUTPUT_TEMP_DIR = BASE_DIR / "output_temp_files"
OUTPUT_VIDEO_DIR = BASE_DIR / "output_video"
STYLES_DIR = BASE_DIR / "styles"
TEST_AUDIO_DIR = BASE_DIR / "test_audio_files"
```
We first get the root directory of the project using `Path(__file__).parent`, and then we create a few more paths relative to the root directory. We'll use these paths in our main code to save the output files to the correct folders. Go ahead and also create empty folders for `output_temp_files`, `output_video`, and `styles`, making sure to spell them correctly:
📁FINX_WHISPER (project root folder)
    📁output_temp_files (new empty folder)
    📁output_video (new empty folder)
    📁styles (new empty folder)
    📁test_audio_files (already existing folder)
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄settings.py
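With that in place, any file in the project can pull in these constants with a single import. A minimal sketch of what that usage looks like (the `print` is just for illustration):

```python
# hypothetical usage sketch in any other file of the project
from settings import OUTPUT_TEMP_DIR

print(OUTPUT_TEMP_DIR)  # e.g. /path/to/FINX_WHISPER/output_temp_files
```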
That's our folders and paths setup done; as the sketch above shows, we can just import these variables to access the folders from any file in our project. There is one more setting we need to define, but we cannot hardcode this one in our source code: we need our API key for OpenAI, as we'll be using some ChatGPT in this part of the course. You'll also need your API key for later parts. Go to https://platform.openai.com/api-keys and copy your API key. If you don't have one, make sure to get one; you'll only pay for what you use, which will be cents if you just play around with it casually. Then create a new file called `.env` in the root folder of your project:
📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
    📁test_audio_files
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄settings.py
    📄.env (new file)
And paste your API key in there like this, making sure not to use any spaces or quotes:

```
OPENAI_API_KEY=your_api_key_here
```

Then go ahead and save and close this file.
Creating a utils folder for our helper functions
Now let's create a new folder named `utils` to hold our helper functions, and then inside this new folder create an empty file called `__init__.py`:
📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
    📁test_audio_files
    📁utils (new folder)
        📄__init__.py (new empty file)
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄settings.py
    📄.env
The `__init__.py` file is required to make Python treat the `utils` folder as a package, which will allow us to import the functions from our other files. You don't need to write anything in this file, just create it and leave it empty.
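With the package in place, imports like the one below become possible from anywhere in the project; this is exactly the import we'll use in the main file later on:

```python
# importing our helper modules from the utils package
from utils import podcast, subtitles
```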
Our first utils file will deal with the podcast-related functions, so create a file called `podcast.py` in the `utils` folder:
📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
    📁test_audio_files
    📁utils
        📄__init__.py
        📄podcast.py (new file)
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄settings.py
    📄.env
Inside `podcast.py`, get started with our imports:
```python
import re
import uuid
from pathlib import Path

import requests
from decouple import config
from openai import OpenAI
```
The `re` library deals with regular expressions and will help us find the podcast download link amongst the page text. The `uuid` library lets us generate unique IDs, `pathlib` is familiar to us by now, and `requests` will help us download the podcast mp3 file. `decouple` will help us read our API key from the `.env` file, and `openai` will help us use the OpenAI API. If you have not used `decouple` before, make sure you run the install command in your terminal:
```bash
pip install python-decouple
```
Back in `podcast.py`, let's create a few constants that we'll be using in our functions:
```python
GPT_MODEL = "gpt-3.5-turbo-1106"
CLIENT = OpenAI(api_key=str(config("OPENAI_API_KEY")))
```
First, we set the ChatGPT model we'll be using to request a podcast summary later on. Then we create a `CLIENT` object that we'll use to make requests to the OpenAI API. We pass in our API key as a string, using `config` to read it from the `.env` file. Note that `config("OPENAI_API_KEY")` already returns a string value; the `str()` call surrounding it is just there to make the type explicit, and a value that is already a string passes through `str()` unaltered.
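If you haven't used `decouple` before, `config` simply looks up a key in your `.env` file (falling back to environment variables). A quick illustrative sketch; the `DEBUG` key is hypothetical and not part of this project:

```python
from decouple import config

api_key = config("OPENAI_API_KEY")  # raw string value read from .env
# optional defaults and casting are also supported (hypothetical example):
debug = config("DEBUG", default=False, cast=bool)
```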
Scraping the podcast download link from the podcast page
So what are some of the functions we'll need in here? For this example application I will be using Google Podcasts as our podcast source. This means we will get an input link like this:

```
https://podcasts.google.com/feed/aHR0cDovL2ZlZWRzLmZlZWRidXJuZXIuY29tL1RFRF9BaGFfQnVzaW5lc3M/episode/ZW4udmlkZW8udGFsay50ZWQuY29tOjExMTk3MDo4MA?sa=X&ved=0CAgQuIEEahcKEwiIzMnavduDAxUAAAAAHQAAAAAQAQ
```
If you load this page in your browser, you will see an HTML page with a play button. This is the kind of page link the user will input into our app, so first of all we will need a function to extract the `.mp3` download link from this page's HTML.
Let’s get started on a function to do exactly that:
```python
def scrape_link_from_page(page_url: str) -> str:
    podcast_page = requests.get(page_url).text
    regex = r"(?P<url>\;https?://[^\s]+)"
    ...
```
We start by defining our function, which takes the `page_url` as a string and will return a string value as well. Then we use `requests` to get the HTML page text by sending a `GET` request to the URL, much like your internet browser would if you type a URL in the address bar. Now we define a regular expression that will match the pattern of the download link we want to extract. We'll use this regex to find the download link in the HTML page text. Here's how it works:
- `(?P<url>...)` — This is a named group. The matched text can be retrieved by the name `url`, so the URL pattern we find will be stored in a group called `url`.
- `\;` — This matches a literal semicolon character. (Strictly speaking, a semicolon has no special meaning in regular expressions, so the backslash escape is harmless but optional.) There is a semicolon in front of the `https` we want to match for the URL we need. (This is just a characteristic of this particular podcast page; other pages might have different patterns.)
- `https?` — This matches either `http` or `https`. The `s?` means "match zero or one `s` characters", which allows the regex to match both.
- `://` — This matches the string `://`, which is part of the standard format for URLs.
- `[^\s]+` — This matches one or more (`+`) of any character that is not (`^`) a whitespace (`\s`) character. So this will match any character that is not a space, tab, or newline. This matches the rest of the URL we need and stops adding characters as soon as whitespace appears, which indicates the end of the URL.
So, in simple terms, this regular expression matches a semicolon followed by a URL that starts with either http or https, and continues until a whitespace character is encountered. The URL is captured in a group named url.
Now let’s complete our function:
```python
def scrape_link_from_page(page_url: str) -> str:
    podcast_page = requests.get(page_url).text
    regex = r"(?P<url>\;https?://[^\s]+)"
    podcast_url_dirty = re.findall(regex, podcast_page)[0]
    podcast_url = podcast_url_dirty.split(";")[1]
    return podcast_url
```
So after we declared the regex pattern, we use `re.findall` to find all matches of the pattern in the podcast page text. This returns a list of matches, and we take the first match with `[0]`. This will return a string that looks something like this:

```
;https://download.ted.com/talks/etcetcetc;
```
Which is pretty good; we just need to get rid of the `;` characters before and after the URL. We do this by splitting the string on the `;` character and then taking the second item in the list with `[1]`. This will return the clean URL we need: `https://download.ted.com/talks/etcetcetc`
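To make the extraction concrete, here's a quick standalone demo on a made-up fragment of page text (the URL and the characters around it are hypothetical, just mimicking the pattern described above):

```python
import re

regex = r"(?P<url>\;https?://[^\s]+)"
# hypothetical fragment of the podcast page HTML:
page_text = 'some attributes ;https://download.ted.com/talks/etcetcetc; more text'

dirty = re.findall(regex, page_text)[0]
print(dirty)                # ;https://download.ted.com/talks/etcetcetc;
print(dirty.split(";"))     # ['', 'https://download.ted.com/talks/etcetcetc', '']
print(dirty.split(";")[1])  # https://download.ted.com/talks/etcetcetc
```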
Downloading the podcast mp3 file
Ok, so now our utils file has a function to scrape the download link. It stands to reason we’ll also need a function to download the mp3 file from the URL. Let’s get started on that:
```python
def download(podcast_url: str, unique_id: uuid.UUID, output_dir: Path) -> Path:
    print("Downloading podcast...")
    podcast_audio = requests.get(podcast_url)
    save_location = output_dir / f"{unique_id}.mp3"
    ...
```
We define a function called `download` that takes 3 input arguments. The `podcast_url` is the URL we scraped from the podcast page as a string variable. The `unique_id` is a unique ID we'll use to name the downloaded file, so we can avoid name clashes where files have the same name; this argument should be an instance of the `UUID` class from the `uuid` built-in Python library, which we'll have a look at in a bit. The `output_dir` is the directory where we want to save the downloaded file as a `Path` object. Finally, our function will also return a `Path` object, which will be the path to the downloaded file.
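Here's a quick preview of what such a UUID looks like (the printed values are just examples; every call produces a new random ID):

```python
import uuid

unique_id = uuid.uuid4()   # a random UUID, different on every call
print(unique_id)           # e.g. 0e0f5d05-9379-4124-a84d-81de7eb3e314
print(f"{unique_id}.mp3")  # e.g. 0e0f5d05-9379-4124-a84d-81de7eb3e314.mp3
```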
We print a simple message to the console to show it is actually busy doing something, and then we use `requests` to download the podcast audio file by sending a `GET` request to the URL, just like we did in the previous function. Then we create a `save_location` variable, which is the path to the file we want to save. We use the `output_dir` argument as the parent directory, and then we use an f-string to create a filename that is the `unique_id` followed by the `.mp3` extension.
Now let’s complete our function:
```python
def download(podcast_url: str, unique_id: uuid.UUID, output_dir: Path) -> Path:
    print("Downloading podcast...")
    podcast_audio = requests.get(podcast_url)
    save_location = output_dir / f"{unique_id}.mp3"
    with open(save_location, "wb") as file:
        file.write(podcast_audio.content)
    print("Podcast successfully downloaded!")
    return save_location
```
We use the `open` function to open the `save_location` file in write binary (`wb`) mode, and we write `podcast_audio.content` to the file. This saves the podcast audio file to the `save_location` path. Then we print a message to the console to show the download was successful, and we return the `save_location` path, which points to the mp3 file we just downloaded. Awesome!
Getting a summary
Now there is one more function we need in our `utils/podcast` file. Besides just the transcription, we will also provide the user with a summary of the podcast. We'll use ChatGPT to generate this summary, so we'll need a simple function to do that. This one will be easy, so let's just whip it up:
```python
def get_summary(transcription: str) -> str:
    print("Summarizing podcast...")
    prompt = f"Summarize the following podcast into the most important points:\n\n{transcription}\n\nSummary:"
    response = CLIENT.chat.completions.create(
        model=GPT_MODEL, messages=[{"role": "user", "content": prompt}]
    )
    print("Podcast summarized!")
    summary = response.choices[0].message.content
    return summary if summary else "There was a problem generating the summary."
```
I assume you're familiar with ChatGPT (if not, check out my other courses on the Finxter Academy!). We just have a simple function that takes the full `transcription` as a string and will return a summary as a string. We have a console print message again, just to keep ourselves posted that it is doing some work, and then we have a simple ChatGPT prompt.
Note the prompt ends with `Summary:` to prompt the model to start the summary right away without including any awkward introduction text; this is just a neat little trick you can use. We then use our `CLIENT` object to call the `chat.completions.create` endpoint, passing in the `GPT_MODEL` and a list of messages; we'll just pass in the prompt as a user message. We then extract the `summary` from `response.choices[0].message.content`. Just in case there was a problem and the summary is empty, we return a default message to inform the user.
Subtitles
Awesome! Our `podcast` utils are done now. Let's move on to the `subtitles` utils. This one will be a much shorter file, with a function that will allow us to output the transcription in subtitle format, with timestamps and everything. So go ahead and create a new file called `subtitles.py` in the `utils` folder:
📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
    📁test_audio_files
    📁utils
        📄__init__.py
        📄podcast.py
        📄subtitles.py (new file)
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄settings.py
    📄.env
And inside `subtitles.py`, get started with our imports:
```python
from pathlib import Path
from typing import Callable
```
Both of these imports will be used solely to indicate the type of our function arguments (type hinting). We'll use `Callable` to indicate that a function is expected as an argument, and we'll use `Path` to indicate that a `Path` object is expected as an argument. This just makes our code clearer to read and easier to understand. Now let's write our function, whose purpose will be to take a transcription done by Whisper and convert it to a valid subtitle file:
```python
def write_to_file(whisper_output: dict, writer: Callable, output_path: Path) -> Path:
    """Takes the whisper output, a writer function, and an output path, and
    writes subtitles to disk in the specified format."""
    with open(output_path, "w", encoding="utf-8") as sub_file:
        writer.write_result(result=whisper_output, file=sub_file)
    print(f"Subtitles generated and saved to {output_path}")
    return output_path
```
We take a `whisper_output` argument, which is a dictionary containing the output Whisper gives us after we transcribe the podcast's mp3 file. We also take a `writer` argument, which is a function that will write the subtitles to disk, so we type-hint it with `Callable`. Finally, we take an `output_path` argument, which is a `Path` object to the file we want to save the subtitles to. We then simply open the output path in write mode, calling the file `sub_file`, and call the `writer.write_result` function, passing in the `whisper_output` and the location to save the subtitles to. Finally, we print a message to the console to show the subtitles were generated successfully, and we return the `output_path`, which is the path to the subtitle file we just created.
Two important things to note here:
- When you open the subtitle file, make sure you use the `encoding="utf-8"` argument. For normal English characters this is not necessary, so you might think you can skip it. However, the AI likes to use ♪ symbols when music starts playing to make the subtitles more interesting, and your code will crash if you don't specify utf-8 encoding, which can actually map and save these special characters!
- You might be wondering what this magical `writer` function is. Whisper actually comes with some utility functions that will allow us to write subtitles with correct formatting, like `SRT` or `VTT` (see the illustration below). These utilities have a `.write_result` function, which is what we're calling in our code above. So we'll be able to pass in an SRT-writer or a VTT-writer depending on what subtitle type we want to save.
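For reference, here's roughly what the two formats look like; the cues below are hand-written examples, not actual Whisper output. An SRT file numbers each cue and uses commas before the milliseconds:

```
1
00:00:00,000 --> 00:00:04,000
Welcome to the podcast.
```

while a VTT file starts with a `WEBVTT` header and uses periods instead:

```
WEBVTT

00:00:00.000 --> 00:00:04.000
Welcome to the podcast.
```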
Ok, so that is all our utility functions done. Now let’s move on to the main code.
Installing gradio
Before we get started you'll need to install `gradio`, so in your terminal window, run:

```bash
pip install gradio
```
What is `gradio`? Gradio is a Python library that allows us to quickly create user-friendly interfaces for testing, demonstrating, and debugging machine learning models. We'll use gradio to create a UI for our app with just a few lines of code, and it supports a wide range of input and output types like video, audio, and text. Using this super simple framework we can keep the focus on Whisper and not on building a user interface. It's pretty self-explanatory, so you'll understand the idea as we code along.
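To get a feel for how little code gradio needs, here is a minimal hello-world sketch (not part of our project, just an illustration):

```python
import gradio as gr

def greet(name: str) -> str:
    return f"Hello, {name}!"

# gr.Interface wires a function to auto-generated input/output widgets
gr.Interface(fn=greet, inputs="text", outputs="text").launch()
```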
Creating the main file
Now let's get started on our main code, where mostly we'll just have to call our utility functions and tie it all together, plus create a quick gradio interface to make it user-friendly. Create a new file called `2_whisper_pods.py` in the root folder of your project:
📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
    📁test_audio_files
    📁utils
        📄__init__.py
        📄podcast.py
        📄subtitles.py
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄2_whisper_pods.py (new file)
    📄settings.py
    📄.env
And inside `2_whisper_pods.py`, get started with our imports:
```python
import uuid
from pathlib import Path

import gradio as gr
import whisper
from whisper.utils import WriteSRT, WriteVTT

from settings import BASE_DIR, OUTPUT_TEMP_DIR, STYLES_DIR
from utils import podcast, subtitles
```
`uuid` is Python's built-in library to generate unique IDs, `pathlib` is familiar to us by now, and `gradio` is the library we just installed. We also import `whisper` and two writer utilities from `whisper.utils`, which are the writer functions we talked about in the previous section. Then we import our directory `Path` constants from `settings` and our `podcast` and `subtitles` utils. Now continue below the imports:
```python
WHISPER_MODEL = whisper.load_model("base")
VTT_WRITER = WriteVTT(output_dir=str(OUTPUT_TEMP_DIR))
SRT_WRITER = WriteSRT(output_dir=str(OUTPUT_TEMP_DIR))
```
We load the `base` model into `WHISPER_MODEL`, and we create two writer objects by instantiating the `WriteVTT` and `WriteSRT` classes we imported from Whisper's utilities, passing in the `output_dir` as a string.
Now let’s create a function to tie it all together:
```python
def transcribe_and_summarize(page_link: str) -> tuple[str, str, str, str]:
    unique_id = uuid.uuid4()
    podcast_download_url = podcast.scrape_link_from_page(page_link)
    mp3_file: Path = podcast.download(podcast_download_url, unique_id, OUTPUT_TEMP_DIR)
    ...
```
We define a function called `transcribe_and_summarize`, which takes a `page_link` as a string and returns a tuple so we can have multiple outputs from this function. These four outputs will feed back into the gradio interface we will create later and will be:
- The podcast summary
- The podcast transcription
- The VTT subtitle file (path)
- The SRT subtitle file (path)
We then create a new `unique_id`, which we'll use to name the downloaded mp3 file. Note we do this inside the function, as we need a unique identifier for every single transcription run to avoid name clashes. Then we use our `podcast.scrape_link_from_page` util to scrape the download link from the podcast page, and our `podcast.download` function to download the podcast mp3 file, passing in the `podcast_download_url`, `unique_id`, and `OUTPUT_TEMP_DIR` as arguments. We catch the mp3 file path in a variable called `mp3_file`. Notice how easy everything is to read because we used logical and descriptive names for all our variables, utility functions, and files.
Let’s continue with our function:
```python
def transcribe_and_summarize(page_link: str) -> tuple[str, str, str, str]:
    # ...previous code...
    whisper_output = WHISPER_MODEL.transcribe(str(mp3_file))
    with open(BASE_DIR / "pods_log.txt", "w", encoding="utf-8") as f:
        f.write(str(whisper_output))
    transcription = str(whisper_output["text"])
    summary = podcast.get_summary(transcription)
```
We call the `.transcribe` function, passing in the `mp3_file` path as a string. This returns a dictionary with the transcription and other information, which we catch in `whisper_output`. We then open a file called `pods_log.txt` in our root directory in write mode and write the `whisper_output` to it. This is just for debugging purposes, so we can see what the output looks like (it's too long to print to the console). We then extract the `transcription` from the `whisper_output` dictionary. Note that `whisper_output["text"]` is already a string; the reason we wrapped it in a `str()` call is just to make this explicit for typing purposes. This will not add any extra overhead or computing time, as values that are already a string just pass through the `str()` function unaltered. Then we call our `podcast.get_summary` function, passing in the `transcription` as an argument.
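For reference, the `whisper_output` dictionary looks roughly like this; the values below are made up for illustration, so check your own `pods_log.txt` for the real thing:

```python
{
    "text": " The full transcription as a single string...",
    "segments": [
        # one entry per chunk of speech, with timing info
        # (more keys omitted here, including confidence scores)
        {"id": 0, "start": 0.0, "end": 4.2, "text": " The first few words..."},
    ],
    "language": "en",
}
```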
Now we just need to write the subtitles to disk and return all the outputs. Continue on:
```python
def transcribe_and_summarize(page_link: str) -> tuple[str, str, str, str]:
    # ...previous code...
    get_sub_path = lambda ext: OUTPUT_TEMP_DIR / f"{unique_id}{ext}"
    vtt_subs = subtitles.write_to_file(whisper_output, VTT_WRITER, get_sub_path(".vtt"))
    srt_subs = subtitles.write_to_file(whisper_output, SRT_WRITER, get_sub_path(".srt"))
    return (summary, transcription, str(vtt_subs), str(srt_subs))
```
We create a lambda (nameless) function that takes a file extension as input and returns the path to the subtitle file with that extension. For example, inputting `.vtt` will yield `output_temp_files/unique_id.vtt`, while `.srt` will yield `output_temp_files/unique_id.srt`; this just saves us from repeating the same code twice. Then we call our `subtitles.write_to_file` function twice, passing in the `whisper_output`, the `VTT_WRITER` and `SRT_WRITER` writer functions, and the paths produced by our `get_sub_path` lambda. We catch the output of these two calls in `vtt_subs` and `srt_subs` respectively. Finally, we return a tuple containing the `summary`, `transcription`, `vtt_subs`, and `srt_subs` to finish off our function.
The whole thing now looks like this:
```python
def transcribe_and_summarize(page_link: str) -> tuple[str, str, str, str]:
    unique_id = uuid.uuid4()
    podcast_download_url = podcast.scrape_link_from_page(page_link)
    mp3_file: Path = podcast.download(podcast_download_url, unique_id, OUTPUT_TEMP_DIR)
    whisper_output = WHISPER_MODEL.transcribe(str(mp3_file))
    with open(BASE_DIR / "pods_log.txt", "w", encoding="utf-8") as f:
        f.write(str(whisper_output))
    transcription = str(whisper_output["text"])
    summary = podcast.get_summary(transcription)
    get_sub_path = lambda ext: OUTPUT_TEMP_DIR / f"{unique_id}{ext}"
    vtt_subs = subtitles.write_to_file(whisper_output, VTT_WRITER, get_sub_path(".vtt"))
    srt_subs = subtitles.write_to_file(whisper_output, SRT_WRITER, get_sub_path(".srt"))
    return (summary, transcription, str(vtt_subs), str(srt_subs))
```
Creating the gradio interface
That's all well and good, but a typical end user does not know how to use Python, and this function is not very user-friendly. So let's create a quick gradio interface to make our app easy to use. Continue below the function:
```python
if __name__ == "__main__":
    block = gr.Blocks(css=str(STYLES_DIR / "whisper_pods.css"))
    with block:
        with gr.Group():
            # Header
            # Input textbox for podcast link
            # Button to start transcription
            # Output elements
            # btn.click definition
            ...
    block.launch(debug=True)
```
This is going to be the basic structure of our `gradio` application. First, we use `if __name__ == "__main__":` to make sure the code inside this block only runs if we run this file directly, and not if we import it from another file. Then we create a `block` object by calling `gr.Blocks` and passing in the path to our `whisper_pods.css` file in the `styles` directory as a string. This will allow us to style our app with CSS, which we'll do in a bit (this .css file doesn't exist yet). Then we open a `with block:` block, and inside it a `with gr.Group():` block, which will allow us to group elements together in our app. Then we have a bunch of comments to indicate what we'll be doing in each spot, which we'll fill in in a moment. Finally, we call `block.launch` to launch our app, passing in `debug=True` so we get extra feedback in the console if anything goes wrong.
- The header will hold a logo image for our application. We'll use HTML to load it from the internet. We can call `gr.HTML` to create an HTML element, passing in the HTML code as a string. We'll use a `div` element with a `header` class, and inside this `div` we'll have an `img` element with a link to our logo image, which I just quickly uploaded to imgur. We'll also set the `referrerpolicy` to `no-referrer` to avoid any issues with the image not loading (imgur doesn't work with a `localhost` referrer, which is what you'll have when you run this app locally).
```python
gr.HTML(
    f"""
    <div class="header">
        <img src="https://i.imgur.com/8Xu2rwG.png" referrerpolicy="no-referrer" />
    </div>
    """
)
```
- The input textbox will be where the user can paste in the podcast link. We can just call `gr.Textbox` to create a textbox element, passing in a label to indicate what the textbox is for. We'll call it "Google Podcasts Link" and catch the input in a variable called `podcast_link_input`.
```python
podcast_link_input = gr.Textbox(label="Google Podcasts Link:")
```
- The button will be the trigger that starts the main function. I want a full-row button, so we'll call `gr.Row` to create a row element, and then `gr.Button` to create a button element inside it. We can just pass in the button text we want to display and associate the button with the variable name `btn`. We'll use this `btn` object later to define the button's behavior.
```python
with gr.Row():
    btn = gr.Button("🎙️ Transcribe and summarize my podcast! 🎙️")
```
- The output elements will be the summary, transcription, and two subtitle files. The first two are just a `gr.Textbox`, which does what you'd expect and allows us to pass in a label, placeholder, and the number of lines to display by default. The `autoscroll` behavior would scroll all the way down to the bottom when a large transcription text lands in the box; since we want the user to be able to start reading from the beginning instead of the end, we set it to `False`. We then have another `gr.Row` with two `gr.File` elements, which will end up side-by-side in a single row. The `label` is just a label, and `elem_classes` is a list of classes gradio will give the element so we can target it with CSS later on, using the names `vtt-sub-file` and `srt-sub-file`.
```python
summary_output = gr.Textbox(
    label="Podcast Summary",
    placeholder="Podcast Summary",
    lines=4,
    autoscroll=False,
)
transcription_output = gr.Textbox(
    label="Podcast Transcription",
    placeholder="Podcast Transcription",
    lines=8,
    autoscroll=False,
)
with gr.Row():
    vtt_sub_output = gr.File(
        label="VTT Subtitle file download", elem_classes=["vtt-sub-file"]
    )
    srt_sub_output = gr.File(
        label="SRT Subtitle file download", elem_classes=["srt-sub-file"]
    )
```
- The `btn.click` is where we define which function to call when the button is clicked, so we give it our `transcribe_and_summarize` function as the first argument. The second argument is a list of inputs, in this case only our `podcast_link_input`. The third argument is a list of outputs, in this case our `summary_output`, `transcription_output`, `vtt_sub_output`, and `srt_sub_output`, which we'll use to display the results of our function to the user. We just told gradio what function to run, and how to map all of the input and output elements we defined in the interface to the input and output arguments of our function!
```python
btn.click(
    transcribe_and_summarize,
    inputs=[podcast_link_input],
    outputs=[
        summary_output,
        transcription_output,
        vtt_sub_output,
        srt_sub_output,
    ],
)
```
`2_whisper_pods.py` now looks like this:
```python
# imports
# CONSTANTS

# def transcribe_and_summarize(...):
#     ...

if __name__ == "__main__":
    block = gr.Blocks(css=str(STYLES_DIR / "whisper_pods.css"))
    with block:
        with gr.Group():
            gr.HTML(
                f"""
                <div class="header">
                    <img src="https://i.imgur.com/8Xu2rwG.png" referrerpolicy="no-referrer" />
                </div>
                """
            )
            podcast_link_input = gr.Textbox(label="Google Podcasts Link:")
            with gr.Row():
                btn = gr.Button("🎙️ Transcribe and summarize my podcast! 🎙️")
            summary_output = gr.Textbox(
                label="Podcast Summary",
                placeholder="Podcast Summary",
                lines=4,
                autoscroll=False,
            )
            transcription_output = gr.Textbox(
                label="Podcast Transcription",
                placeholder="Podcast Transcription",
                lines=8,
                autoscroll=False,
            )
            with gr.Row():
                vtt_sub_output = gr.File(
                    label="VTT Subtitle file download", elem_classes=["vtt-sub-file"]
                )
                srt_sub_output = gr.File(
                    label="SRT Subtitle file download", elem_classes=["srt-sub-file"]
                )
            btn.click(
                transcribe_and_summarize,
                inputs=[podcast_link_input],
                outputs=[
                    summary_output,
                    transcription_output,
                    vtt_sub_output,
                    srt_sub_output,
                ],
            )
    block.launch(debug=True)
```
Creating the CSS file
See how easy it was to write an interface using gradio! There is just one thing left to do: the `STYLES_DIR / "whisper_pods.css"` file we loaded into gradio doesn't actually exist! Go ahead and create a new file in the `styles` directory called `whisper_pods.css`:
📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
        📄whisper_pods.css (new file)
    📁test_audio_files
    📁utils
        📄__init__.py
        📄podcast.py
        📄subtitles.py
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄2_whisper_pods.py
    📄settings.py
    📄.env
Inside `whisper_pods.css`, paste the following code:
```css
.header {
    padding: 2em 8em;
}

.vtt-sub-file,
.srt-sub-file {
    height: 80px;
}
```
We set some padding on the header image by targeting the `.header` class, to stop the image from getting too big. Then we set the height of the subtitle file download boxes to 80px so they don't shrink below that, keeping them nice and visible.
Now go back to your `2_whisper_pods.py` file and run it. Give it some time to load up and you'll see the following in your terminal:
```
Running on local URL: http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
```
CTRL + click the link to open it in your browser. You should see the following:
Go ahead and get a Google Podcasts link to input. I'll use a short podcast just for the initial test:

```
https://podcasts.google.com/feed/aHR0cDovL2ZlZWRzLmZlZWRidXJuZXIuY29tL1RFRF9BaGFfQnVzaW5lc3M/episode/ZW4udmlkZW8udGFsay50ZWQuY29tOjEwNzMyNDo4MA?sa=X&ved=0CAgQuIEEahcKEwiImYLqr8qDAxUAAAAAHQAAAAAQAQ
```
And then click the button and wait (I’ve blurred out the transcription to respect the speaker’s copyright as this course will be published publicly):
Check the summary, transcription, and subtitle files. Try other podcasts from https://podcasts.google.com/, play around, and have fun! My transcription was very good using just the `base` whisper model we loaded up; I never even needed a bigger one! If you use non-English languages you may need a bigger model, though. You can also use a `.en` model like `base.en` or `small.en` to get higher accuracy if you will only input English podcasts.
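Swapping models is a one-line change to the constant we defined earlier; for example (assuming you only ever feed it English audio):

```python
# English-only variant, a bit more accurate for English at the same size
WHISPER_MODEL = whisper.load_model("base.en")

# or, for non-English or tougher audio, a larger multilingual model:
# WHISPER_MODEL = whisper.load_model("small")
```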
Also take a look at the `pods_log.txt` file you wrote to the root directory of your project, which holds the full whisper output. It may help you pinpoint where any problems are and how confident the model was while transcribing.
Conclusion
There we go, that is a pretty good initial minimum viable product! Of course, it has much room for improvement, for instance by using a proper front-end framework like React and streaming the transcription live to the page so the user is not left waiting so long before seeing results.
You could also use asyncio to make the ChatGPT summary call asynchronous, slightly speeding up the code by writing the subtitle files to disk while the summary call is running. And of course, you'd want some kind of cleanup function to get rid of all the downloaded mp3 files hanging around in your `output_temp_files` folder; if you check it, you will see all the files with names like `0e0f5d05-9379-4124-a84d-81de7eb3e314.mp3` we generated, plus the subtitle files with the same name for each mp3 file.
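As a starting point, such a cleanup could be as simple as this sketch (a hypothetical helper, not part of the project code; note it deletes everything matching these patterns in the temp folder, so use with care):

```python
from settings import OUTPUT_TEMP_DIR

def cleanup_temp_files() -> None:
    """Delete all generated mp3 and subtitle files from the temp folder."""
    for pattern in ("*.mp3", "*.vtt", "*.srt"):
        for file in OUTPUT_TEMP_DIR.glob(pattern):
            file.unlink()
```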
I’ll leave the rest up to your imagination! That’s it for part 2, I’ll see you soon in part 3, where we’ll be using Whisper to create a fully automatic video subtitling tool that takes only a video file as input, then transcribes the audio, creates subtitles, and embeds them into the video at the correct times! It will be fun, see you there!