Whisper Lesson 2 – Whisper Pods: Building a podcast transcribing app

Welcome back to part 2, where we’ll start applying our Whisper skills to build something useful. We can’t rely on users always handing us MP3 files to transcribe; they may simply want to link to a podcast, for example. Here, we’ll build a real application that transcribes podcasts to text or subtitle format, taking just a podcast link as input.

Before we get started on the main code, we’ll do some basic setup work and create the helper functions our main code will call. Keeping things separated across multiple functions and files will keep our code much cleaner and more readable than one big script that does everything at once.

Saving our constants to a separate file

First, there are a couple of settings we’ll be using again and again over the next three parts, namely the paths to the input and output folders for the mp3 files, subtitles, and whatever else we will be processing. Instead of importing pathlib in every single file and then writing BASE_DIR = Path(__file__).parent, we’ll write this once in a separate file and import it everywhere we need it. This will also make it easier to change the paths later if we need to.

In your project folder create a new file called settings.py, making sure to put it in the root folder of your project:

📁FINX_WHISPER (project root folder)
    📁test_audio_files
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄settings.py

In settings.py, write the following code:

from pathlib import Path

BASE_DIR = Path(__file__).parent
OUTPUT_TEMP_DIR = BASE_DIR / "output_temp_files"
OUTPUT_VIDEO_DIR = BASE_DIR / "output_video"
STYLES_DIR = BASE_DIR / "styles"
TEST_AUDIO_DIR = BASE_DIR / "test_audio_files"

We first get the root directory of the project using Path(__file__).parent, and then we create a few more paths relative to the root directory. We’ll use these paths in our main code to save the output files to the correct folders. Go ahead and also create empty folders for the output_temp_files, output_video, and styles folders, making sure to spell them correctly:

📁FINX_WHISPER (project root folder)
    📁output_temp_files     (new empty folder)
    📁output_video          (new empty folder)
    📁styles                (new empty folder)
    📁test_audio_files      (already existing folder)
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄settings.py
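
If you would rather not create the folders by hand, a small script can do it for you. The sketch below is only an illustration: it uses a throwaway temporary directory as a stand-in for the project root (in the real project you would use BASE_DIR from settings.py), and it creates the three new folders listed above.

```python
import tempfile
from pathlib import Path

# Stand-in for the project root; in the real project this would be BASE_DIR.
base_dir = Path(tempfile.mkdtemp())

folder_names = ["output_temp_files", "output_video", "styles"]

for name in folder_names:
    # exist_ok=True makes the call safe to repeat if the folder already exists.
    (base_dir / name).mkdir(exist_ok=True)

created = sorted(p.name for p in base_dir.iterdir() if p.is_dir())
print(created)
```

Running it a second time does no harm, because existing folders are simply left alone.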

That’s our folders and paths setup done. We can just import these variables to access the folders from any file in our project. There is one more setting we need to define, but we cannot hardcode this one in our source code: our OpenAI API key. We’ll be using some ChatGPT in this part of the course, and you’ll also need the key for later parts. Go to https://platform.openai.com/api-keys and copy your API key. If you don’t have one, make sure to create one. You only pay for what you use, which will be cents if you just play around with it casually. Then create a new file called .env in the root folder of your project:

📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
    📁test_audio_files
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄settings.py
    📄.env                  (new file)

And paste your API key in there like this, making sure not to use any spaces or quotes:

OPENAI_API_KEY=your_api_key_here

Then go ahead and save and close this file.

Creating a utils folder for our helper functions

Now let’s create a new folder named utils to hold our helper functions, and then inside this new folder create an empty file called __init__.py:

📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
    📁test_audio_files
    📁utils                 (new folder)
        📄__init__.py       (new empty file)
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄settings.py
    📄.env

The __init__.py file marks the utils folder as a Python package, which will allow us to import the functions from within our other files. You don’t need to write anything in this file, just create it and leave it empty.

Our first utils file will deal with the podcast-related functions, so create a file called podcast.py in the utils folder:

📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
    📁test_audio_files
    📁utils
        📄__init__.py
        📄podcast.py        (new file)
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄settings.py
    📄.env

Inside podcast.py get started with our imports:

import re
import uuid
from pathlib import Path

import requests
from decouple import config
from openai import OpenAI

The re library deals with regular expressions and will help us find the podcast download link in the page text. The uuid library lets us generate unique IDs, pathlib is familiar to us by now, and requests will help us download the podcast mp3 file. decouple will help us read our API key from the .env file, and openai will help us use the OpenAI API. If you have not used decouple before, make sure you run the install command in your terminal:

pip install python-decouple

Back in podcast.py let’s create a few constants that we’ll be using in our functions:

GPT_MODEL = "gpt-3.5-turbo-1106"
CLIENT = OpenAI(api_key=str(config("OPENAI_API_KEY")))

First, we set the ChatGPT model we’ll be using to request a podcast summary later on. Then we create a CLIENT object that we’ll use to make requests to the OpenAI API. We pass in our API key as a string, and we use config to read the API key from the .env file. Note that config("OPENAI_API_KEY") already returns a string value. The str() call surrounding it is just there to make this explicit; a value that is already a string passes through str() unchanged.
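
To see for yourself that wrapping a string in str() changes nothing, here is a tiny check. The key value is a made-up placeholder, not a real API key.

```python
api_key = "sk-not-a-real-key"  # made-up placeholder value

wrapped = str(api_key)

print(wrapped == api_key)  # the text is identical
print(type(wrapped))       # still a plain str
```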

Scraping the podcast download link from the podcast page

So what are some of the functions we’ll need in here? For this example application I will be using Google Podcasts as our podcast source. This means we will get an input link like this:
https://podcasts.google.com/feed/aHR0cDovL2ZlZWRzLmZlZWRidXJuZXIuY29tL1RFRF9BaGFfQnVzaW5lc3M/episode/ZW4udmlkZW8udGFsay50ZWQuY29tOjExMTk3MDo4MA?sa=X&ved=0CAgQuIEEahcKEwiIzMnavduDAxUAAAAAHQAAAAAQAQ

If you load this page in your browser, you will see an HTML page, with a play button. This is the kind of page link the user will input into our app, so first of all we will need a function to extract the .mp3 download link from this page’s HTML.

Let’s get started on a function to do exactly that:

def scrape_link_from_page(page_url: str) -> str:
    podcast_page = requests.get(page_url).text
    regex = r"(?P<url>\;https?://[^\s]+)"
    ...

We start by defining our function which takes the page_url as a string and will return a string value as well. Then we use requests to get the HTML page text by sending a GET request to the URL, much like your internet browser would if you type a URL in the address bar. Now we define a regular expression that will match the pattern of the download link we want to extract. We’ll use this regex to find the download link in the HTML page text. Here’s how it works:

(?P<url>...) This is a named group. The matched text can be retrieved by the name url, so the URL pattern we find will be stored in a group called url.
\; This matches a literal semicolon character. The semicolon has no special meaning in regular expressions, so the backslash is not strictly required here, but escaping it is harmless and makes it explicit that we want the literal character. We match it because there is a semicolon in front of the https of the URL we need. (This is just a characteristic of this particular podcast page, other pages might have different patterns.)
https? This matches either http or https. The s? means “match zero or one s characters”. This allows the regex to match both http and https.
:// This matches the string ://, which is part of the standard format for URLs.
[^\s]+ This matches one or more (+) of any character that is not (^) a whitespace (\s) character. So basically this will match any character that is not a space, tab, or newline character. This will match the rest of the URL we need and stop adding characters as soon as a whitespace character appears, which indicates the end of the URL.

So, in simple terms, this regular expression matches a semicolon followed by a URL that starts with either http or https, and continues until a whitespace character is encountered. The URL is captured in a group named url.
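
To make this concrete, here is a small sketch that runs the same pattern against a made-up HTML fragment. The fragment and the example.com URL are invented for illustration and are not taken from a real podcast page.

```python
import re

# Made-up fragment standing in for the real page HTML.
sample_html = '<div jsdata="abc;https://download.example.com/talks/episode.mp3;2" class="x">'

regex = r"(?P<url>\;https?://[^\s]+)"

match = re.search(regex, sample_html)

# The named group holds everything from the semicolon up to the next whitespace.
raw_match = match.group("url") if match else ""
print(raw_match)
```

Notice that the match still contains the leading semicolon and some trailing characters after the URL, because the pattern only stops at whitespace. That is why a cleanup step is still needed afterwards.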

Now let’s complete our function:

def scrape_link_from_page(page_url: str) -> str:
    podcast_page = requests.get(page_url).text
    regex = r"(?P<url>\;https?://[^\s]+)"
    podcast_url_dirty = re.findall(regex, podcast_page)[0]
    podcast_url = podcast_url_dirty.split(";")[1]
    return podcast_url

So after we declared the regex pattern, we use re.findall to find all matches of the pattern in the podcast page text. This will return a list of matches, and we take the first match with [0]. This will return a string that looks something like this:

;https://download.ted.com/talks/etcetcetc;

This is pretty good; we just need to get rid of the ; characters before and after the URL. We do this by splitting the string on the ; character, and then taking the second item in the list with [1]. This will return the clean URL we need: https://download.ted.com/talks/etcetcetc
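
One thing to be aware of is that [0] raises an IndexError if the page contains no match at all. Below is a minimal sketch of a more defensive variant. The helper name extract_link_from_html is hypothetical, and it takes the page HTML as a string (instead of a URL) purely so it can be tried without a network request. The sample HTML is made up.

```python
import re

DOWNLOAD_LINK_REGEX = r"(?P<url>\;https?://[^\s]+)"


def extract_link_from_html(page_html: str) -> str:
    """Hypothetical helper: same extraction logic, but with a clear error."""
    matches = re.findall(DOWNLOAD_LINK_REGEX, page_html)
    if not matches:
        raise ValueError("No download link found in the page HTML.")
    # matches[0] looks like ';https://...;...', so the URL is the second piece.
    return matches[0].split(";")[1]


sample_html = '<div jsdata="abc;https://download.example.com/talks/episode.mp3;2" class="x">'
print(extract_link_from_html(sample_html))

try:
    extract_link_from_html("<p>No links here</p>")
except ValueError as error:
    print(f"Handled: {error}")
```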

Downloading the podcast mp3 file

Ok, so now our utils file has a function to scrape the download link. It stands to reason we’ll also need a function to download the mp3 file from the URL. Let’s get started on that:

def download(podcast_url: str, unique_id: uuid.UUID, output_dir: Path) -> Path:
    print("Downloading podcast...")
    podcast_audio = requests.get(podcast_url)
    save_location = output_dir / f"{unique_id}.mp3"
    ...

We define a function called download that takes 3 input arguments. The podcast_url is the URL we scraped from the podcast page as a string variable. The unique_id is a unique ID we’ll use to name the downloaded file, so we can avoid name clashes where files have the same name. This argument should be an instance of the UUID class from the uuid built-in Python library, which we’ll have a look at in a bit. The output_dir is the directory where we want to save the downloaded file as a Path object. Finally, our function will also return a Path object, which will be the path to the downloaded file.

We print a simple message to the console to show it is busy actually doing something, and then we use requests to download the podcast audio file by sending a GET request to the URL just like we did in the previous function. Then we create a save_location variable which is the path to the file we want to save. We use the output_dir argument as the parent directory, and then we use an f-string to create a filename that is the unique_id followed by the .mp3 extension.

Now let’s complete our function:

def download(podcast_url: str, unique_id: uuid.UUID, output_dir: Path) -> Path:
    print("Downloading podcast...")
    podcast_audio = requests.get(podcast_url)
    save_location = output_dir / f"{unique_id}.mp3"

    with open(save_location, "wb") as file:
        file.write(podcast_audio.content)
    print("Podcast successfully downloaded!")

    return save_location

We use the open function to open the save_location file in write binary (wb) mode, and we write the podcast_audio.content to the file. This will save the podcast audio file to the save_location path. Then we print a message to the console to show the download was successful, and we return the save_location path which points to the mp3 file we just downloaded, awesome!
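
If you want to see the naming and file-writing part in isolation, here is a small sketch that needs no network access. The save_bytes helper is hypothetical and only mirrors the second half of download; the bytes are fake and the output folder is a throwaway temporary directory.

```python
import tempfile
import uuid
from pathlib import Path


def save_bytes(data: bytes, unique_id: uuid.UUID, output_dir: Path) -> Path:
    """Hypothetical helper mirroring the file-writing half of download()."""
    save_location = output_dir / f"{unique_id}.mp3"
    with open(save_location, "wb") as file:
        file.write(data)
    return save_location


output_dir = Path(tempfile.mkdtemp())  # throwaway folder for the demo
unique_id = uuid.uuid4()               # random ID, different on every call

saved_path = save_bytes(b"fake audio bytes", unique_id, output_dir)
print(saved_path.name)
```

Because uuid4 generates a random ID each time, two runs will practically never produce the same file name.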

Getting a summary

Now there is one more function we need in our utils/podcast file. Besides just the transcription, we will also provide the user with a summary of the podcast. We’ll use ChatGPT to generate this summary, so we’ll need a simple function to do that. This one will be easy, so let’s just whip it up:

def get_summary(transcription: str) -> str:
    print("Summarizing podcast...")
    prompt = f"Summarize the following podcast into the most important points:\n\n{transcription}\n\nSummary:"

    response = CLIENT.chat.completions.create(
        model=GPT_MODEL, messages=[{"role": "user", "content": prompt}]
    )

    print("Podcast summarized!")
    summary = response.choices[0].message.content
    return summary if summary else "There was a problem generating the summary."

I assume you’re familiar with ChatGPT (if not, check out my other courses on the Finxter Academy!). We just have a simple function that takes the full transcription as a string and will return a summary as a string. We have a console print message again just to keep ourselves posted that it is doing some work and then we have a simple ChatGPT prompt.

Note that the prompt ends with Summary: to nudge the model to start the summary right away without including any awkward introduction text. This is just a neat little trick you can use. We then use our CLIENT object to call the chat.completions.create endpoint, passing in the GPT_MODEL and a list of messages. We’ll just pass in the prompt as a user message. We then extract the summary from response.choices[0].message.content. Just in case there was a problem and the summary is empty, we return a default message to inform the user.
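
The two pure-Python pieces of this function, building the prompt and falling back to a default message, can be tried without calling the API. The helper names below are hypothetical and exist only for this illustration.

```python
from typing import Optional


def build_summary_prompt(transcription: str) -> str:
    """Hypothetical helper that builds the same prompt as get_summary."""
    return (
        "Summarize the following podcast into the most important points:"
        f"\n\n{transcription}\n\nSummary:"
    )


def summary_or_fallback(content: Optional[str]) -> str:
    """Hypothetical helper: return the model's text, or a default message."""
    return content if content else "There was a problem generating the summary."


prompt = build_summary_prompt("Today we talk about sleep.")
print(prompt)
print(summary_or_fallback(None))
```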

Subtitles

Awesome! Our podcast utils are done now. Let’s move on to the subtitles utils. This one will be a much shorter file with a function that will allow us to output the transcription in subtitle format, with timestamps and everything. So go ahead and create a new file called subtitles.py in the utils folder:

📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
    📁test_audio_files
    📁utils
        📄__init__.py
        📄podcast.py
        📄subtitles.py      (new file)
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄settings.py
    📄.env

And inside subtitles.py get started with our imports:

from typing import Callable
from pathlib import Path

Both of these imports will be used solely to indicate the type of our function arguments (type hinting). We’ll use Callable to indicate that a function is expected as an argument, and we’ll use Path to indicate that a Path object is expected as an argument. This just makes our code clearer to read and easier to understand. Now let’s write our function, whose purpose will be to take a transcription done by Whisper and then convert it to a valid subtitle file:

def write_to_file(whisper_output: dict, writer: Callable, output_path: Path) -> Path:
    """Takes the whisper output, a writer function, and an output path, and writes subtitles to disk in the specified format."""
    with open(output_path, "w", encoding="utf-8") as sub_file:
        writer.write_result(result=whisper_output, file=sub_file)
        print(f"Subtitles generated and saved to {output_path}")

    return output_path

We take a whisper_output argument, which is a dictionary containing the output Whisper gives us after we transcribe the podcast’s mp3 file. We also take a writer argument, which is one of Whisper’s subtitle writer objects. These objects are callable, which is why we type-hint the argument with Callable, but here we use the writer’s write_result method directly. Finally, we take an output_path argument, which is a Path object pointing to the file we want to save the subtitles to. We then open the output path in write mode, calling the file sub_file, and call writer.write_result, passing in the whisper_output and the open file to write the subtitles to. Finally, we print a message to the console to show the subtitles were generated successfully, and we return the output_path, which is the path to the subtitle file we just created.

Two important things to note here:

When you open the subtitle file, make sure you use the encoding="utf-8" argument. For plain English text you might get away without it, so you might think it is not needed. However, the AI likes to use ♪ symbols when music starts playing to make the subtitles more interesting, and on systems where the default text encoding is not UTF-8 (Windows, for example) writing such characters can crash your program with an encoding error. Specifying utf-8 makes sure these special characters can be saved.
You might be wondering what this magical writer is. Whisper actually comes with some utility classes that write subtitles in the correct format, like SRT or VTT. Instances of these classes have a .write_result method, which is what we’re calling in our code above. So we’ll be able to pass in an SRT writer or a VTT writer depending on what subtitle type we want to save.
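
To get a feel for what such a writer does, here is a deliberately tiny stand-in. TinySrtWriter is hypothetical and is not Whisper’s own implementation (the real classes handle more cases); it only shows the general idea of turning timed segments into SRT text. The fake output dictionary is made up, but it follows the shape Whisper returns, with a list of segments that each have a start, an end, and a text.

```python
import io


def format_srt_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp format HH:MM:SS,mmm."""
    milliseconds = round(seconds * 1000)
    hours, milliseconds = divmod(milliseconds, 3_600_000)
    minutes, milliseconds = divmod(milliseconds, 60_000)
    secs, milliseconds = divmod(milliseconds, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{milliseconds:03d}"


class TinySrtWriter:
    """Hypothetical stand-in for a subtitle writer; not Whisper's own class."""

    def write_result(self, result: dict, file) -> None:
        # Each SRT entry is: a counter, a time range, the text, and a blank line.
        for index, segment in enumerate(result["segments"], start=1):
            start = format_srt_timestamp(segment["start"])
            end = format_srt_timestamp(segment["end"])
            file.write(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n\n")


fake_output = {
    "text": " Hello there. ♪ Music ♪",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " Hello there."},
        {"start": 2.5, "end": 65.25, "text": " ♪ Music ♪"},
    ],
}

buffer = io.StringIO()  # in-memory file, so nothing is written to disk
TinySrtWriter().write_result(result=fake_output, file=buffer)
srt_text = buffer.getvalue()
print(srt_text)
```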

Ok, so that is all our utility functions done. Now let’s move on to the main code.

Installing gradio

Before we get started you’ll need to install gradio, so in your terminal window, run:

pip install gradio

What is gradio? Gradio is a Python library that allows us to quickly create user-friendly interfaces for testing, demonstrating, and debugging machine learning models. We’ll use gradio to create a UI for our app with just a few lines of code, and it supports a wide range of input and output types like video, audio, and text. Using this super simple framework we can keep the focus on whisper and not on building a user interface. It’s pretty self-explanatory, so you’ll understand the idea as we just code along.

Creating the main file

Now let’s get started on our main code, where mostly we’ll just have to call our utility functions and tie it all together, plus create a quick gradio interface to make it user-friendly. Create a new file called 2_whisper_pods.py in the root folder of your project:

📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
    📁test_audio_files
    📁utils
        📄__init__.py
        📄podcast.py
        📄subtitles.py
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄2_whisper_pods.py   (new file)
    📄settings.py
    📄.env

And inside 2_whisper_pods.py get started with our imports:

import uuid
from pathlib import Path

import gradio as gr
import whisper
from whisper.utils import WriteSRT, WriteVTT

from settings import BASE_DIR, OUTPUT_TEMP_DIR, STYLES_DIR
from utils import podcast, subtitles

uuid is Python’s built-in library to generate unique IDs, pathlib is familiar to us by now, and gradio is the library we just installed. We also import whisper and two writer utilities from whisper.utils, which are the writer classes we talked about in the previous section. Then we import our directory Path constants from the settings and our podcast and subtitles utils. Now continue below the imports:

WHISPER_MODEL = whisper.load_model("base")
VTT_WRITER = WriteVTT(output_dir=str(OUTPUT_TEMP_DIR))
SRT_WRITER = WriteSRT(output_dir=str(OUTPUT_TEMP_DIR))

We load the WHISPER_MODEL from the base model, and we create two writer objects by creating instances of the WriteVTT and WriteSRT classes we imported from Whisper’s utilities, passing in the output_dir as a string.

Now let’s create a function to tie it all together:

def transcribe_and_summarize(page_link: str) -> tuple[str, str, str, str]:
    unique_id = uuid.uuid4()

    podcast_download_url = podcast.scrape_link_from_page(page_link)
    mp3_file: Path = podcast.download(podcast_download_url, unique_id, OUTPUT_TEMP_DIR)
    ...

We define a function called transcribe_and_summarize which takes a page_link as a string and will return a tuple so we can have multiple outputs to this function. These four outputs will feed back into the gradio interface we will create later and will be:

The podcast summary
The podcast transcription
The VTT subtitle file (path)
The SRT subtitle file (path)

We then create a new unique_id which we’ll use to name the downloaded mp3 file. Note we do this inside the function as we need a unique identifier for every single transcription run to avoid name clashes. Then we use our podcast.scrape_link_from_page util to scrape the download link from the podcast page, and we use our podcast.download function to download the podcast mp3 file, passing in the podcast_download_url, unique_id, and the OUTPUT_TEMP_DIR as arguments. We then catch the mp3 file path in a variable called mp3_file. Notice how easy everything is to read because we used logical and descriptive names for all our variables and utility functions and files.

Let’s continue with our function:

def transcribe_and_summarize(page_link: str) -> tuple[str, str, str, str]:
    ...previous code...

    whisper_output = WHISPER_MODEL.transcribe(str(mp3_file))
    with open(BASE_DIR / "pods_log.txt", "w", encoding="utf-8") as f:
        f.write(str(whisper_output))

    transcription = str(whisper_output["text"])
    summary = podcast.get_summary(transcription)

We call the .transcribe function by passing in the mp3_file path as a string. This will return a dictionary with the transcription and other information, which we catch in whisper_output. We then open a file called pods_log.txt in our root directory in write mode, and we write the whisper_output to the file. This is just for debugging purposes, so we can see what the output looks like (it’s too long to print to the console). We then extract the transcription from the whisper_output dictionary. Note that whisper_output["text"] is already a string; the reason we wrapped it inside a str() call is just to make it explicit that this is a string for typing purposes. This adds no meaningful overhead, as values that are already a string just pass through the str() function unaltered. Then we call our podcast.get_summary function, passing in the transcription as an argument.

Now we just need to write the subtitles to disk and return all the outputs. Continue on:

def transcribe_and_summarize(page_link: str) -> tuple[str, str, str, str]:
    ...previous code...

    get_sub_path = lambda ext: OUTPUT_TEMP_DIR / f"{unique_id}{ext}"
    vtt_subs = subtitles.write_to_file(whisper_output, VTT_WRITER, get_sub_path(".vtt"))
    srt_subs = subtitles.write_to_file(whisper_output, SRT_WRITER, get_sub_path(".srt"))

    return (summary, transcription, str(vtt_subs), str(srt_subs))

We create a lambda (nameless) function that takes a file extension as input and then returns the path to the subtitle file with that extension. For example, inputting .vtt will yield output_temp_files/unique_id.vtt, but giving it .srt will yield output_temp_files/unique_id.srt, so we can avoid writing the same code twice. Then we call our subtitles.write_to_file function twice, each time passing in the whisper_output, the matching writer object (VTT_WRITER or SRT_WRITER), and the path returned by get_sub_path for that extension. We catch the output of these two calls in vtt_subs and srt_subs respectively. Finally, we return a tuple containing the summary, transcription, vtt_subs, and srt_subs to finish off our function.
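
If the lambda looks unfamiliar, the sketch below shows the same helper written as a regular function, which behaves identically. OUTPUT_TEMP_DIR is redefined here as a plain relative path only so the snippet runs on its own; in the project it comes from settings.py.

```python
import uuid
from pathlib import Path

OUTPUT_TEMP_DIR = Path("output_temp_files")  # stand-in for the settings constant
unique_id = uuid.uuid4()


def get_sub_path(ext: str) -> Path:
    """Same behavior as the lambda, written as a regular function."""
    return OUTPUT_TEMP_DIR / f"{unique_id}{ext}"


vtt_path = get_sub_path(".vtt")
srt_path = get_sub_path(".srt")
print(vtt_path.name)
print(srt_path.name)
```

Both versions work; which one you prefer is mostly a matter of style.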

The whole thing now looks like this:

def transcribe_and_summarize(page_link: str) -> tuple[str, str, str, str]:
    unique_id = uuid.uuid4()

    podcast_download_url = podcast.scrape_link_from_page(page_link)
    mp3_file: Path = podcast.download(podcast_download_url, unique_id, OUTPUT_TEMP_DIR)

    whisper_output = WHISPER_MODEL.transcribe(str(mp3_file))
    with open(BASE_DIR / "pods_log.txt", "w", encoding="utf-8") as f:
        f.write(str(whisper_output))

    transcription = str(whisper_output["text"])
    summary = podcast.get_summary(transcription)

    get_sub_path = lambda ext: OUTPUT_TEMP_DIR / f"{unique_id}{ext}"
    vtt_subs = subtitles.write_to_file(whisper_output, VTT_WRITER, get_sub_path(".vtt"))
    srt_subs = subtitles.write_to_file(whisper_output, SRT_WRITER, get_sub_path(".srt"))

    return (summary, transcription, str(vtt_subs), str(srt_subs))

Creating the gradio interface

That’s very nice and well, but a typical end user does not know how to use Python and this function is not very user-friendly. So let’s create a quick gradio interface to make it easy for the user to use our app. Continue below the function:

if __name__ == "__main__":
    block = gr.Blocks(css=str(STYLES_DIR / "whisper_pods.css"))

    with block:
        with gr.Group():
            # Header

            # Input textbox for podcast link

            # Button to start transcription

            # Output elements

            # btn.click definition

    block.launch(debug=True)

This is going to be the basic structure of our gradio application. First, we use if __name__ == "__main__": to make sure the code inside this block only runs if we run this file directly, and not if we import it from another file. Then we create a block object by calling gr.Blocks and passing in the path to our whisper_pods.css file in the styles directory as a string. This will allow us to style our app with CSS, which we’ll do in a bit (this .css file doesn’t exist yet). Then we open a with block: block, and inside this block we open a with gr.Group(): block. This will allow us to group elements together in our app. Then we have a bunch of comments to indicate what we’ll be doing in each block, which we’ll fill in in a moment. Finally, we call block.launch to launch our app, passing in debug=True so we get extra feedback in the console if anything goes wrong.

The header will hold a logo image for our application. We’ll use HTML to load it from the internet. We can call gr.HTML to create an HTML element, and we can pass in the HTML code as a string. We’ll use a div element with a header class, and inside this div we’ll have an img element with a link to our logo image, which I just quickly uploaded to “imgur”. We’ll also set the referrerpolicy to no-referrer to avoid any issues with the image not loading (imgur doesn’t work with a localhost referrer, which is what you’ll have when you run this app locally).

gr.HTML(
    """
    <div class="header">
    <img src="https://i.imgur.com/8Xu2rwG.png" referrerpolicy="no-referrer" />
    </div>
    """
)

The input textbox will be where the user can paste in the podcast link. We can just call gr.Textbox to create a textbox element, and we can pass in a label to indicate what the textbox is for. We’ll call it “Google Podcasts Link” and we’ll catch the input in a variable called podcast_link_input.

podcast_link_input = gr.Textbox(label="Google Podcasts Link:")

The button will be the trigger that starts the main function. I want a full row button so we’ll call gr.Row to create a row element, and then we’ll call gr.Button to create a button element. We can just pass in the button text we want to display and associate the button with the variable name btn. We’ll use this btn object later to define the button’s behavior.

with gr.Row():
    btn = gr.Button("🎙️ Transcribe and summarize my podcast! 🎙️")

The output elements will be the summary, transcription, and two subtitle files. The first two are just a gr.Textbox which does what you’d expect and allows us to pass in a label, placeholder, and the number of lines to display by default. The autoscroll behavior will scroll all the way down to the bottom if a large transcription text is passed into the text box. Since we want the user to be able to start reading from the beginning instead of the end, we set this behavior to False. We then have another gr.Row with two gr.File elements which will end up side-by-side in a single row. The label is just a label and the elem_classes is a list of classes gradio will give the element, so we can target it with CSS later on using the names vtt-sub-file and srt-sub-file.

summary_output = gr.Textbox(
    label="Podcast Summary",
    placeholder="Podcast Summary",
    lines=4,
    autoscroll=False,
)

transcription_output = gr.Textbox(
    label="Podcast Transcription",
    placeholder="Podcast Transcription",
    lines=8,
    autoscroll=False,
)

with gr.Row():
    vtt_sub_output = gr.File(
        label="VTT Subtitle file download", elem_classes=["vtt-sub-file"]
    )
    srt_sub_output = gr.File(
        label="SRT Subtitle file download", elem_classes=["srt-sub-file"]
    )

The btn.click is where we define which function to call when the button is clicked, so we give it our transcribe_and_summarize function as the first argument. The second argument is a list of inputs, in this case only our podcast_link_input. The third argument is a list of outputs, in this case, our summary_output, transcription_output, vtt_sub_output, and srt_sub_output. We’ll use these outputs to display the results of our function to the user. We just told gradio what function to run, and how to map all of the input and output elements we defined in the interface to the input and output arguments of our function!

btn.click(
    transcribe_and_summarize,
    inputs=[podcast_link_input],
    outputs=[
        summary_output,
        transcription_output,
        vtt_sub_output,
        srt_sub_output,
    ],
)
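
The mapping between return values and output elements is purely positional. The sketch below illustrates that idea in plain Python; it does not use gradio and is not how gradio works internally. The stand-in function and its return values are made up.

```python
def fake_transcribe_and_summarize(page_link: str) -> tuple[str, str, str, str]:
    """Stand-in that returns made-up values in the same order as the real function."""
    return ("a summary", "a transcription", "subs.vtt", "subs.srt")


# Names of the output components, in the same order as the outputs list.
output_names = [
    "summary_output",
    "transcription_output",
    "vtt_sub_output",
    "srt_sub_output",
]

returned_values = fake_transcribe_and_summarize("https://example.com/podcast")

# The first returned value goes to the first output, the second to the second, and so on.
mapping = dict(zip(output_names, returned_values))
for name, value in mapping.items():
    print(f"{name} <- {value}")
```

So if you ever reorder the values in the return statement, remember to reorder the outputs list in the same way.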

2_whisper_pods.py now looks like this:

imports

CONSTANTS


def transcribe_and_summarize(...)...
    ...


if __name__ == "__main__":
    block = gr.Blocks(css=str(STYLES_DIR / "whisper_pods.css"))

    with block:
        with gr.Group():
            gr.HTML(
                """
                <div class="header">
                <img src="https://i.imgur.com/8Xu2rwG.png" referrerpolicy="no-referrer" />
                </div>
                """
            )

            podcast_link_input = gr.Textbox(label="Google Podcasts Link:")

            with gr.Row():
                btn = gr.Button("🎙️ Transcribe and summarize my podcast! 🎙️")

            summary_output = gr.Textbox(
                label="Podcast Summary",
                placeholder="Podcast Summary",
                lines=4,
                autoscroll=False,
            )

            transcription_output = gr.Textbox(
                label="Podcast Transcription",
                placeholder="Podcast Transcription",
                lines=8,
                autoscroll=False,
            )

            with gr.Row():
                vtt_sub_output = gr.File(
                    label="VTT Subtitle file download", elem_classes=["vtt-sub-file"]
                )
                srt_sub_output = gr.File(
                    label="SRT Subtitle file download", elem_classes=["srt-sub-file"]
                )

            btn.click(
                transcribe_and_summarize,
                inputs=[podcast_link_input],
                outputs=[
                    summary_output,
                    transcription_output,
                    vtt_sub_output,
                    srt_sub_output,
                ],
            )

    block.launch(debug=True)

Creating the CSS file

See how easy it was to write an interface using gradio! There is just one thing left to do: the STYLES_DIR / "whisper_pods.css" file we loaded into gradio doesn’t actually exist yet. Go ahead and create a new file in the styles directory called whisper_pods.css:

📁FINX_WHISPER (project root folder)
    📁output_temp_files
    📁output_video
    📁styles
        📄whisper_pods.css  (new file)
    📁test_audio_files
    📁utils
        📄__init__.py
        📄podcast.py
        📄subtitles.py
    📄1_basic_call_english_only.py
    📄1_multiple_languages.py
    📄2_whisper_pods.py
    📄settings.py
    📄.env

Inside whisper_pods.css paste the following code:

.header {
  padding: 2em 8em;
}

.vtt-sub-file,
.srt-sub-file {
  height: 80px;
}

We set some padding on the header image by targeting the header class, to stop the image from getting too big. Then we set the height of the subtitle file download boxes to a fixed 80px, keeping them nice and visible.

Now go back to your 2_whisper_pods.py file and run it. Give it some time to load up and you’ll see the following in your terminal:

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.

CTRL + click the link to open it in your browser. You should see the following:

Go ahead and get a Google podcasts link to input. I’ll use a short podcast just for the initial test:
https://podcasts.google.com/feed/aHR0cDovL2ZlZWRzLmZlZWRidXJuZXIuY29tL1RFRF9BaGFfQnVzaW5lc3M/episode/ZW4udmlkZW8udGFsay50ZWQuY29tOjEwNzMyNDo4MA?sa=X&ved=0CAgQuIEEahcKEwiImYLqr8qDAxUAAAAAHQAAAAAQAQ

And then click the button and wait (I’ve blurred out the transcription to respect the speaker’s copyright as this course will be published publicly):

Check the summary, transcription, and subtitle files. Try other podcasts from https://podcasts.google.com/, play around, and have fun! My transcription was very good using just the base Whisper model we loaded, and I never even needed a bigger one. If you use non-English languages you may need a bigger model though. You can also use a .en model like base.en or small.en to get higher accuracy if you will only input English podcasts.

Also take a look at the pods_log.txt file you wrote in the root directory of your project, which holds the full whisper output. It may help you pinpoint where the problems are and how confident the model is while transcribing.
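
Each segment in Whisper’s output carries an avg_logprob value, where values closer to 0 mean the model was more confident. Here is a small sketch of how you could pick out the shaky segments. The data below is a made-up excerpt in the shape of the real output, and the cutoff value is an arbitrary choice for this illustration.

```python
# Made-up excerpt in the shape of Whisper's output dictionary.
fake_whisper_output = {
    "text": " Hello and welcome. Mumbled words here.",
    "segments": [
        {"start": 0.0, "end": 2.0, "text": " Hello and welcome.", "avg_logprob": -0.21},
        {"start": 2.0, "end": 4.5, "text": " Mumbled words here.", "avg_logprob": -1.35},
    ],
}

LOW_CONFIDENCE_CUTOFF = -1.0  # arbitrary cutoff chosen for this illustration

# Keep only the segments the model was relatively unsure about.
uncertain_segments = [
    segment
    for segment in fake_whisper_output["segments"]
    if segment["avg_logprob"] < LOW_CONFIDENCE_CUTOFF
]

for segment in uncertain_segments:
    print(f"{segment['start']:.1f}s-{segment['end']:.1f}s:{segment['text']}")
```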

Conclusion

There we go, that is a pretty good initial minimum viable product! Of course, it has much room for improvement, for instance by using a proper front-end framework like React and streaming the transcription live to the page so the user is not left waiting so long before seeing results.

You could also use asyncio to make the ChatGPT summary call asynchronous, which would speed things up slightly by writing the subtitle files to disk while the summary call is still running. And of course, you’d want some kind of cleanup function to get rid of all the downloaded mp3 files hanging around in your output_temp_files folder. If you check it, you will see all the files we generated with names like 0e0f5d05-9379-4124-a84d-81de7eb3e314.mp3, plus the subtitle files with the same name for each mp3 file.
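
As a starting point for the cleanup idea, here is a minimal sketch that deletes files older than a given age. The clean_up_old_files helper and the one-hour limit are hypothetical choices, and the demo runs in a throwaway temporary folder instead of your real output_temp_files folder.

```python
import os
import tempfile
import time
from pathlib import Path


def clean_up_old_files(folder: Path, max_age_seconds: float) -> list[Path]:
    """Hypothetical helper: delete files in folder older than max_age_seconds."""
    cutoff = time.time() - max_age_seconds
    removed = []
    for path in folder.iterdir():
        # Compare each file's last-modified time against the cutoff.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed


# Demo in a throwaway folder standing in for output_temp_files.
demo_dir = Path(tempfile.mkdtemp())
old_file = demo_dir / "old.mp3"
new_file = demo_dir / "new.mp3"
old_file.write_bytes(b"old")
new_file.write_bytes(b"new")

# Pretend the first file was last modified two hours ago.
two_hours_ago = time.time() - 2 * 60 * 60
os.utime(old_file, (two_hours_ago, two_hours_ago))

removed_files = clean_up_old_files(demo_dir, max_age_seconds=60 * 60)
print([path.name for path in removed_files])
```

Be careful when pointing something like this at a real folder, since deleted files do not go to the recycle bin.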

I’ll leave the rest up to your imagination! That’s it for part 2, I’ll see you soon in part 3, where we’ll be using Whisper to create a fully automatic video subtitling tool that takes only a video file as input, then transcribes the audio, creates subtitles, and embeds them into the video at the correct times! It will be fun, see you there!