👉 Back to the Full Course on local models and Hugging Face (+Videos)
Hi and welcome back to part 5 of the local models tutorial series. In this part, we’ll be getting started with audio models, looking first at text-to-speech generation. In the next part, we’ll move on to AI music generation (text-to-music).
💡 We will not be touching on speech-to-text (transcription) models here, as I already have a full OpenAI Whisper course that deals specifically with the amazing, free, open-source OpenAI Whisper model for this purpose, running it on our local machine. If you’re interested in speech-to-text, podcast transcription, automatic video subtitling, and the like, bookmark this course for later!
Let’s get started with the text-to-speech models now, as they have loads of real-world practical use cases, and it’s very nice to have a free open-source model available on your computer whenever you need it.
If some of the preceding models were a bit intense for your computer system to handle, fear not! Text-to-speech is surprisingly lightweight and you can probably run real-time inference even on a CPU-only setup!
Let’s head over to the Hugging Face page for MeloTTS:
Installation
If we look at the model card, it offers some very simple example code but states that we need to follow the installation steps before we can run it.
The installation instructions for Linux and macOS are quite simple and straightforward, so if you’re on one of those systems just follow the provided instructions. Assuming most of you are on Windows, we have a couple of extra steps to follow, as we need to run Docker. If you already have Docker and are familiar with the software, skim ahead a bit to the part where we run the container!
If this sounds like a large pain in the bum just to run one model, rest assured, you will need Docker again sooner or later in this world, so my personal advice is to just get it out of the way and get used to Docker as soon as possible!
I won’t go too deep into the details of what Docker is, but basically, it allows us to run small virtual machines called containers (oversimplifying here) based on Linux on our Windows machine. Such a container has all the dependencies our project needs inside it and runs independently in its own Linux and Python environment.
Preparing for Docker installation
Most tutorials will just tell you to install Docker and give no details, which might leave you with a long and hard process of figuring out why it won’t work on your system. I’ll do a quick callback to the AutoGen course, where I explained the process in detail, so we can get through the installation quickly:
💡 I’ll cover all the basics here and provide some useful references but you may need to do some extra research if you run into any issues, as I cannot prepare for every possible problem you might run into.
First, we need to install WSL2 (Windows Subsystem for Linux 2) on our system. This is a Windows feature that allows us to run Linux on our Windows machine. This is needed because Docker Desktop for Windows requires WSL2 to run. (You can also use Hyper-V instead, but WSL2 is the recommended way).
💡 If you already use WSL but need to check if you have version 2, you can check your version by entering the command:
wsl -l -v
in PowerShell or Windows Command Prompt.
You can find the details for installing WSL2 here, but basically you do this:
- Open a PowerShell or Windows Command Prompt in administrator mode by right-clicking and selecting “Run as administrator”
- Run the following command in the terminal:
wsl --install
- Restart your computer when prompted
This will enable WSL on your Windows machine and install a Linux distribution (Ubuntu) on your system. You may need to create a username and password for your Linux distribution; save them somewhere. More details can be found here, or Google “Setting up your Linux username and password.”
We have a couple more things to check before we can install Docker Desktop. First, we need to make sure the Virtual Machine Platform option is enabled. This is fairly easy! Just press the Windows key or click in the Windows search bar and search for “Turn Windows features on or off”.
Open this and make sure the “Virtual Machine Platform” option is checked. If it’s not, check it and click OK. You may need to restart your computer after this.
(You don’t have to match the other checkboxes with the settings in the image!)
Next, we need to make sure that Virtualization is enabled in our BIOS. The easiest way to check if this is enabled is to open the Windows Task Manager by pressing Ctrl+Alt+Delete and selecting Task Manager. Then click on the Performance tab and click on CPU on the left-hand side. If Virtualization is enabled, you will see “Virtualization: Enabled” in the bottom-right information block like this:
If Virtualization is not enabled, you will have to go into your BIOS settings and enable Virtualization. This is where you will have to do some googling and research on your own, as every system has slightly different keys to get into the BIOS setup menu, and the settings may be located in different parts of the BIOS menu for different manufacturers. I’ll leave you with two links to get you started, the first one describes the general process of getting this setting enabled in your BIOS:
- Virtual Metric – How to enable hardware virtualization
- Docker Docs – Troubleshoot topics – virtualization
When you’ve got that ready and set up to go, let’s continue on.
Installing Docker Desktop
Finally, it’s time! Head over to the Docker Desktop download page here and download the appropriate Docker version for your OS. I’m running Docker 4.26.1, but just download the latest version and you should be fine. When the download finishes, start the installer. The installer will give you the following options:
Just accept both of these options and click OK, unless you chose not to install WSL 2 and use Hyper-V instead. Whether or not you want a shortcut on your desktop is entirely up to you of course 😉.
Now just let the installer do its magic, and then go ahead and run the Docker Desktop application, where you’ll have to accept the service agreement:
Then just choose “Use recommended settings” and click Finish:
Now Docker Desktop will start and you will be prompted to either sign up or sign in. Docker is free for personal and even small-business use, so press the button to sign up and create an account. You can even use your Google or GitHub account to create one really fast.
If everything was successful, you should be greeted by the following screen:
Congratulations! You’ve now installed Docker Desktop and are ready to go! If you still have problems, first try the steps below, and if that doesn’t work, Google will have a solution. Never despair!
(Only for those who still have problems 🙈🙂)
- Hypervisor enabled at Windows startup – If you have completed the steps described above and are still experiencing Docker Desktop startup issues, this could be because the Hypervisor is installed but not launched during Windows startup. Some tools (such as older versions of VirtualBox) and video game installers turn off the hypervisor on boot. To turn it back on:
- Open an administrative console prompt.
- Run bcdedit /set hypervisorlaunchtype auto.
- Restart Windows.
Downloading and running the MeloTTS model
Back to the MeloTTS model! Open up your VSCode running in the project directory again and then run the following command in a terminal window:
git clone https://github.com/myshell-ai/MeloTTS.git
This will clone the MeloTTS repository from the GitHub page straight into a new folder named MeloTTS in your project directory:
📁Local_Models
    📁MeloTTS    ✨New folder
    📁test_files
    📄chat_app.py
    📄image_gen.ipynb
    📄image_to_image.ipynb
    📄local_chat.py
    📄local_chat_memory.py
    📄memory.py
    📄model_preloader.py
    📄Pipfile
    📄Pipfile.lock
Now let’s cd (change directory) our terminal into the MeloTTS folder:
cd MeloTTS
Now we need to build the Docker container for the MeloTTS model. Run the following command in your terminal:
docker build -t melotts .
If you get an error message, make sure that Docker Desktop is running and that you are in the correct directory. This command will build the Docker container for the MeloTTS model and tag it (-t) with the name melotts. This process can take a good while, as it downloads the exact versions of the dependencies that MeloTTS needs and sets up the container. It will only take this long for the first build.
If you’re lucky, the build may finish and the terminal will complete all [7/7] steps (skip ahead to Running the MeloTTS model). If you’re less lucky, like me, your process will get stuck trying to install the following dependency over and over in step [5/7]:
=> => # Downloading botocore-1.21.5-py3-none-any.whl (7.7 MB)
=> => # ━━━━━━━━━━━━━━━━━━━━━ 7.7/7.7 MB 21.8 MB/s eta 0:00:00
=> => # Collecting boto3<2.0,>=1.0
=> => # Downloading boto3-1.18.4-py3-none-any.whl (131 kB)
=> => # ━━━━━━━━━━━━━━━━━━━━━ 131.6/131.6 kB 17.8 MB/s eta 0:00:00
What is going on? Well, this is just the way software development is sometimes 💩🚽. It seems there is no particular version of the botocore dependency that satisfies all the other dependencies, so it tries over and over again. “But we followed all the instructions on the website verbatim and this Dockerfile was created by the MeloTTS team themselves, how can it not work?” Yeah, that’s just the way it goes, quite a lot of the time.
Never despair! Google is your friend. Just keep trying and experimenting; that’s how I found a working fix. Go to the MeloTTS folder and open the Dockerfile by clicking on it:
Fixing the Dockerfile
This is the Dockerfile that the MeloTTS team provided for us. Without going into too much detail (Docker is a fascinating and extensive topic all on its own) here is the basic rundown of what is going on here:
## specify the base image (Linux distribution) and the Python version to be installed.
## Then create an `app` directory for the install inside this new system.
FROM python:3.9-slim
WORKDIR /app
COPY . /app

## install Linux dependencies
RUN apt-get update && apt-get install -y \
    build-essential libsndfile1 \
    && rm -rf /var/lib/apt/lists/*

## install Python dependencies using the pip package manager
RUN pip install -e .
RUN python -m unidic download
RUN python melo/init_downloads.py

## command to run on container start
CMD ["python", "./melo/app.py", "--host", "0.0.0.0", "--port", "8888"]
This Dockerfile is basically the recipe to build a container that can run MeloTTS. I’ve made two small changes to the recipe, like so:
## Python version changed to 3.10 to try and resolve dependency issues
FROM python:3.10-slim
WORKDIR /app
COPY . /app

RUN apt-get update && apt-get install -y \
    build-essential libsndfile1 \
    && rm -rf /var/lib/apt/lists/*

## Changed the pip install flag to -v (verbose) to see what is going on in an attempt to gather more information
RUN pip install -v .
RUN python -m unidic download
RUN python melo/init_downloads.py

CMD ["python", "./melo/app.py", "--host", "0.0.0.0", "--port", "8888"]
Change your Dockerfile in the MeloTTS folder to match the above code. Save the file and then go back to your terminal and run the build command again (making sure your terminal window is cd’d into the MeloTTS folder):
docker build -t melotts .
This time it should run. Never give up! Life as a software developer is frustrating at times, but always be convinced that you will find a solution eventually. Instead of choosing a less desirable model that is easier to install and then pretending it’s always this easy, I deliberately included this hardship in the course to show you it’s not always smooth sailing, and that is ok too. You will learn a lot from these experiences, even if you want to punch a hole in the wall at first!
Running the MeloTTS model
When the build is finished, we can run the container using either of the two following commands, depending on whether you have a CUDA-enabled GPU or not:
# Normal command
docker run -it -p 8888:8888 melotts

# If you have a GPU
docker run --gpus all -it -p 8888:8888 melotts
If you have any trouble with the second command, just use the first. This model is not nearly as compute-heavy as the others we’ve used so far, so you should be fine with the first command. The -it flag is for interactive mode, keeping the terminal open. You can also run containers in the background, without occupying the terminal, by using the -d (detached) flag instead of -it.
The -p 8888:8888 flag is for port forwarding. The MeloTTS model runs on port 8888 inside the container, which is its own ecosystem, if you will. This command links the internal port 8888 in the Docker container to the external port 8888 on your host machine; otherwise, it would be impossible to access. The name melotts is the name of the image we built in the previous step (remember the -t flag).
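If you want a quick sanity check from the Windows side that the port mapping actually works, a few lines of standard-library Python are enough. This little check is my own addition (not part of the course code) and assumes the container is already running:

import urllib.request

# If the -p 8888:8888 mapping works, the Gradio app inside the container
# answers on localhost from the host machine.
with urllib.request.urlopen("http://localhost:8888", timeout=10) as response:
    print(response.status)  # 200 means the container's port 8888 is reachable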
Now your container will be running! You can check this by opening the Docker Desktop application. Your container will have a weird random name, but you can see in the second column that the image is melotts (ignore the other extra stuff I have below):
Your terminal will tell you that the MeloTTS model is running and that you can access it by going to http://0.0.0.0:8888 in your browser. You should see the following screen. Remember that this message is coming from inside the container, so the link would only work if you opened it inside the container, but we are outside. Change the link to http://localhost:8888 (as we mapped port 8888 on the container to port 8888 on localhost) and you should see the test app:
Generating speech
Go ahead and play around with it and generate some speech! The first time you use a particular language/model it will download about 400MB of data for the model, so it will take longer to initialize, but subsequent runs will be much faster.
Pretty good for a free local TTS: we have several voices and even multiple languages. The Korean voice is quite good too:
Using the MeloTTS model in Python
Now of course we don’t want to be stuck using this Gradio app manually, so let’s create a new file named text_to_speech.py in the project root directory:
📁Local_Models
    📁MeloTTS
    📁test_files
    📄chat_app.py
    📄image_gen.ipynb
    📄image_to_image.ipynb
    📄local_chat.py
    📄local_chat_memory.py
    📄memory.py
    📄model_preloader.py
    📄Pipfile
    📄Pipfile.lock
    📄text_to_speech.py    ✨New file
Now, the MeloTTS model card provides us with some example code:
This example code would work fine in the environment where MeloTTS is actually installed. The issue is that our project is running inside a Docker container and not inside our VSCode Windows environment. We could take this example code and put it inside a Python file in VSCode, but that is not the environment where the MeloTTS model is running, so it will not work.
There are three potential solutions to this problem:
- We can write the code, copy it into the Docker container, execute it there, and copy the resulting files back to our local machine outside the container. This is more inconvenient than just using the Gradio app manually, so let’s not do this.
- We can write a quick server/API, place it inside the Docker container, and then send requests to this server from our local machine. This is a good solution, but it’s more work than we need here. I also don’t want to assume you have experience with Flask or FastAPI, and going into APIs properly is a bit much for this tutorial.
- We can use the pre-built API that is hidden inside the Gradio server already. We can simply send requests to this Gradio server inside the Docker container from our local machine, and it will then return the response to the origin of the request, which is our local machine. This is the solution we will use as it is the least complex and most convenient.
The issue we have is that the MeloTTS creators have disabled the built-in auto-generated Gradio API functionality. My guess is that they haven’t tested it and don’t want to provide support for it, which is fine and fair enough. We can easily re-enable this functionality by changing a single line of code in the app.py file inside the Docker container. There is no obstacle that will stop us from tinkering up a solution!
Re-enabling the Gradio API
Open the app.py file in the MeloTTS/melo folder. Inside this file, scroll all the way down to near the bottom of the code and find the following block:
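Roughly, the relevant lines look like this. The variable name and the other arguments are placeholders and may differ slightly in your copy of app.py; the part that matters is the show_api argument:

## near the bottom of MeloTTS/melo/app.py (shape only; details may differ in your version)
demo.queue().launch(
    show_api=False,   # <- the line we are about to change
    server_name="0.0.0.0",
    server_port=args.port,
)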
Notice that in the launch method call, the show_api argument is set to False. This is what disables the API. Change it to True, like I did in the image above, and make sure to save the file. Now we need to rebuild the Docker container to apply this change.
Go ahead and shut down the Docker container by pressing Ctrl+C in the terminal window where the container is running, or by opening Docker Desktop and pressing the Stop button on the container.
Then go back to the VSCode terminal and make sure you use the cd command to change your terminal directory into the MeloTTS folder instead of the root project folder. Run the following command just like we did the first time:
docker build -t melotts .
This will trigger a full rebuild as we changed some of the source code, which unfortunately means it will take quite a while again. When the build is finished, run the container again using the same command as before:
# Normal command
docker run -it -p 8888:8888 melotts

# If you have a GPU
docker run --gpus all -it -p 8888:8888 melotts
Using the Gradio API
Now that the container is up and running, go to your browser and open the MeloTTS app at http://localhost:8888. You should see the same app as before, but if you scroll all the way down to the bottom there will be a small link that says Use via API:
If you click on it you will see the API documentation for the auto-generated API:
Keep in mind this is not perfect as it is auto-generated, which is probably why the MeloTTS team disabled it, but if you scroll down to the bottom:
There is a perfectly valid API endpoint for us to use here! Bingo! We have a way to communicate with the Gradio server running inside the Linux Docker container from our Windows Python development environment. Take note of the /synthesize endpoint name and the example code snippet provided here. This forms the basis for the code we’ll write in a little bit.
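In essence, the snippet on that page boils down to something like the following (the argument values here are just placeholders of mine; we’ll write our real version in a moment):

from gradio_client import Client

client = Client("http://localhost:8888/")
result = client.predict(
    language="EN",
    speaker="EN-Default",
    text="Hello from the Gradio API!",
    speed=1.0,
    api_name="/synthesize",
)
print(result)  # the path of the generated audio file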
Now we could use the basic Python requests library, or whatever we want, to send requests to this API manually, but the API documentation we just saw hints that we should install the gradio_client Python package, which will make this all easier for us. So let’s do that! Use cd .. to bring your terminal directory back to the root project directory and then run the following command:
pipenv install gradio_client
Remember a long time ago, when I told you to make a new file named text_to_speech.py? It’s finally time to open it and get cracking on some code! Let’s start by importing the required libraries:
from gradio_client import Client
import uuid
import shutil
import os
We just discussed gradio_client, which is provided for us by the Gradio team to make our lives easier. We use uuid for unique names again, and a combination of shutil, which deals with basic file operations like copying, and os, which provides basic operating system functionality, to move the generated audio files to a new folder. These last three are all standard libraries that come with Python.
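If you haven’t used these helpers before, here is a tiny throwaway demonstration of what each one does (this is not part of our final script):

import os
import shutil
import uuid

print(uuid.uuid4())  # e.g. 9f1b2c3d-...; random every time, so our file names never clash
print(os.getcwd())   # the current working directory, used later to build an absolute target path

# shutil.move(source, destination) moves (and can rename) a file in one call,
# which is exactly how we'll grab the audio file that the container generates.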
Now let’s define some of the constants and values we want to use in our script:
CLIENT = Client("http://localhost:8888/")
SPEED = 1.0

MAPPING = {
    "EN-Default": {"base_language": "EN", "speaker_id": "EN-Default"},
    "EN-US": {"base_language": "EN", "speaker_id": "EN-US"},
    "EN-BR": {"base_language": "EN", "speaker_id": "EN-BR"},
    "EN-IN": {"base_language": "EN", "speaker_id": "EN_INDIA"},
    "EN-AU": {"base_language": "EN", "speaker_id": "EN-AU"},
    "ES": {"base_language": "ES", "speaker_id": "ES"},
    "FR": {"base_language": "FR", "speaker_id": "FR"},
    "ZH": {"base_language": "ZH", "speaker_id": "ZH"},
    "JP": {"base_language": "JP", "speaker_id": "JP"},
    "KR": {"base_language": "KR", "speaker_id": "KR"},
}
The CLIENT constant is just the Gradio Client we imported, initialized with the URL of the Gradio app, just like in the example code. I’ll also set a default SPEED value of 1.0, which is the normal speed of the speech.
In order to call any model we need to provide both the base_language and the speaker_id to the API. These are often the same, and as it’s a bit of a nuisance to remember multiple input values, I’ve created a MAPPING dictionary that maps the language names to the base_language and speaker_id values.
This way we can just use EN-IN without having to remember obnoxious details like the fact that this particular model uses EN as the base_language and EN_INDIA as the speaker_id, with an _ instead of a - like most others.
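As a quick illustration, here is what the lookup gives us for that Indian English example:

# looking up the friendly name gives us both values the API expects
voice = MAPPING["EN-IN"]
print(voice["base_language"])  # EN
print(voice["speaker_id"])     # EN_INDIA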
Let’s get started on our text_to_speech function:
def text_to_speech(text, language="EN-Default"):
    if language in MAPPING:
        base_language = MAPPING[language]["base_language"]
        speaker_id = MAPPING[language]["speaker_id"]
    else:
        raise ValueError(
            f"Language {language} not supported. Please choose one of {MAPPING.keys()}."
        )
    # To be continued...
We start by defining a new function named text_to_speech that takes in a text argument and a language argument with a default value of EN-Default. We then check if the language is in our MAPPING dictionary, and if it is, we extract the base_language and speaker_id values from the dictionary.
If the language is not in the dictionary, we raise a ValueError with a message that tells the user which languages are supported.
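For example, asking for a language code that isn’t in our mapping would fail immediately with a readable message (a hypothetical call, output abbreviated):

text_to_speech("Hallo Welt", "DE")
# ValueError: Language DE not supported. Please choose one of dict_keys(['EN-Default', 'EN-US', ...]).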
Let’s continue:
def text_to_speech(text, language="EN-Default"):
    if language in MAPPING:
        base_language = MAPPING[language]["base_language"]
        speaker_id = MAPPING[language]["speaker_id"]
    else:
        raise ValueError(
            f"Language {language} not supported. Please choose one of {MAPPING.keys()}."
        )

    result: str = CLIENT.predict(
        language=base_language,
        speaker=speaker_id,
        text=text,
        speed=SPEED,
        api_name="/synthesize",
    )
    # To be continued...
Here we basically just use the code that the Gradio API documentation told us to use, calling the predict method on the CLIENT object with the base_language, speaker_id, text, and speed, and setting the API name to /synthesize as discussed earlier.
The important thing here is what exactly is in the result variable, as this determines how we handle it from here on. You can see I’ve type-hinted the result variable as a str value. This result does not actually contain the audio data, but a path to the audio file that has been generated. How do I know this? Simple, I just ran the code and printed out the result value to see what it was! Here is the print output:
C:\Users\admin\AppData\Local\Temp\gradio\8b2eb002f2cfa1978e0bd8c12affe46817010fc0\audio
Notice how the file is simply named audio and has no extension like .wav or .mp3. We happen to know that the file extension is supposed to be .wav, so we will need to take this particular file at this path and copy it into a new file, renaming it and adding the .wav extension all at the same time.
Let’s create a new folder in our root project directory that will hold all the generated audio files. Create a new folder named tts_output:
📁Local_Models
    📁MeloTTS
    📁test_files
    📁tts_output    ✨New folder
    📄chat_app.py
    📄image_gen.ipynb
    📄image_to_image.ipynb
    📄local_chat.py
    📄local_chat_memory.py
    📄memory.py
    📄model_preloader.py
    📄Pipfile
    📄Pipfile.lock
    📄text_to_speech.py
Now finish up the function like this:
def text_to_speech(text, language="EN-Default"):
    if language in MAPPING:
        base_language = MAPPING[language]["base_language"]
        speaker_id = MAPPING[language]["speaker_id"]
    else:
        raise ValueError(
            f"Language {language} not supported. Please choose one of {MAPPING.keys()}."
        )

    result: str = CLIENT.predict(
        language=base_language,
        speaker=speaker_id,
        text=text,
        speed=SPEED,
        api_name="/synthesize",
    )

    output_path = f"tts_output/{base_language}_{text[:10]}_{uuid.uuid4()}.wav"
    print(result)
    shutil.move(result, os.path.join(os.getcwd(), output_path))
    return output_path
We create a new output_path variable that points to the tts_output folder we just created, followed by a file name made up of the base_language, the first 10 characters of the text input, and a unique ID generated by uuid.uuid4(). Note that we also add the .wav extension to the file name at the end to fix all issues at once.
The print(result) line is optional and shows where the raw file was saved. We then use the shutil.move function to move the file from the result path to the output path we want. To create this output path we use os.path.join to join the current working directory (a.k.a. cwd -> os.getcwd) with the output_path we created.
Finally, we return the output_path to our newly copied and renamed file.
Let’s test it out!
Now I’m going to give this a good stress test by generating a bunch of different languages and speakers. I’ll add an if __name__ == "__main__": block at the bottom of the file to run some tests:
if __name__ == "__main__":
    # Remember the first time you load a model it will take longer so running all of these for the first time may take a bit.
    text = "The field of text-to-speech has seen rapid development recently."
    text_to_speech(text, "EN-Default")
    text_to_speech(text, "EN-US")
    text_to_speech(text, "EN-BR")
    text_to_speech(text, "EN-IN")
    text_to_speech(text, "EN-AU")

    spanish_text = "El campo de la conversión de texto a voz ha experimentado un rápido desarrollo recientemente."
    text_to_speech(spanish_text, "ES")

    french_text = (
        "Le domaine de la synthèse vocale a connu un développement rapide récemment"
    )
    text_to_speech(french_text, "FR")

    chinese_text = "text-to-speech 领域近年来发展迅速"
    text_to_speech(chinese_text, "ZH")

    japanese_text = "テキスト読み上げの分野は最近急速な発展を遂げています"
    text_to_speech(japanese_text, "JP")

    korean_text = "최근 텍스트 음성 변환 분야가 급속도로 발전하고 있습니다."
    text_to_speech(korean_text, "KR")
If you want to run these same tests, just copy/paste them from the written tutorial. Obviously, feel free to generate any other speech files that you want instead of these. My example code will take a little while the first time, as it has to download and load every single model used here.
Let’s go ahead and run our file, making sure your terminal window is in the root project folder and the tts_output folder exists. You should get your requested files generated in the Docker container and copied straight into your project folder:
Awesome! I won’t paste all of them in here as you can obviously generate them yourself, but I’ll give you a small taste of the different languages:
Great! You can now use this text_to_speech function to generate anything you might want, for free, as long as the Docker container is running!
Say you want to build a language learning app: you could have the base Japanese, Korean, Spanish, or whatever sentence spoken by the TTS, and then also send the text sentence to a model running on Ollama, asking for an English translation, which you then send into MeloTTS again.
Now you could use spaced-repetition flashcard software like Anki, for example, which has clients for both desktop and mobile that sync perfectly with each other. You could then use the genanki library to programmatically and fully automatically create flashcards for your Anki studies without any manual work.
The front of the card would have the foreign-language sentence both written and spoken as audio, and when you flip the card you get the spoken and written English version to test yourself and study. You could generate a whole study deck like this with very little effort, and every study card would be spoken!
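To make that concrete, here is a rough sketch of how such a deck could be generated with genanki and our text_to_speech function. The IDs, field names, and deck layout below are made up for the example, so adapt them to your own setup:

import os

import genanki

from text_to_speech import text_to_speech  # our function from this part; the container must be running

# An Anki note type with a foreign-language front and an English back, both with audio.
card_model = genanki.Model(
    1607392319,  # arbitrary but fixed ID so Anki recognizes the note type on re-import
    "TTS language card",
    fields=[
        {"name": "Foreign"},
        {"name": "ForeignAudio"},
        {"name": "English"},
        {"name": "EnglishAudio"},
    ],
    templates=[
        {
            "name": "Card 1",
            "qfmt": "{{Foreign}}<br>{{ForeignAudio}}",
            "afmt": "{{FrontSide}}<hr id='answer'>{{English}}<br>{{EnglishAudio}}",
        }
    ],
)

deck = genanki.Deck(2059400110, "Japanese practice")
media_files = []

sentence_pairs = [
    (
        "テキスト読み上げの分野は最近急速な発展を遂げています",
        "The field of text-to-speech has seen rapid development recently.",
    ),
]

for japanese, english in sentence_pairs:
    jp_audio = text_to_speech(japanese, "JP")
    en_audio = text_to_speech(english, "EN-Default")
    media_files += [jp_audio, en_audio]
    deck.add_note(
        genanki.Note(
            model=card_model,
            fields=[
                japanese,
                f"[sound:{os.path.basename(jp_audio)}]",  # Anki plays audio referenced like this
                english,
                f"[sound:{os.path.basename(en_audio)}]",
            ],
        )
    )

package = genanki.Package(deck)
package.media_files = media_files  # the audio files must ship inside the .apkg
package.write_to_file("japanese_tts_deck.apkg")

Import the resulting .apkg into Anki and every card comes with its own generated audio.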
Alternatively, you may want to create a translation helper, using the Whisper model (see my other tutorial series on OpenAI Whisper) to have a model hear what a foreign speaker is saying and translate it to English, using TTS to speak the sentence to you. You could then answer in English, and a translation chain plus the TTS would speak back in the foreign language again so the other person can understand.
There are so many personal and professional use cases for this, like having a book read to you while you’re exercising, adding accessibility features for people with visual impairments, or even just having a voice assistant that can read you the news in the morning. Having a good TTS is a very powerful tool.
One of the main points of this part of the tutorial series was also to show you that sometimes you run into a ton of obstacles. All the preceding models so far were fairly easy to install and run, but often in reality it is very difficult and takes a lot of tinkering to even get things to work properly at all. This is just the name of the game and that is ok!
That’s it for this part, I’ll see you soon in the next one where we’ll look at some slightly more niche but very cool models to see that there really is no area of computing that is not about to be revolutionized by AI. We’ll be AI-generating music and 3D models, how cool is that! We’ll also explore one more alternative way of getting a model/library/project running on our local machine. I’ll see you in the next one! 👋😊
👉 Back to the Full Course on local models and Hugging Face (+Videos)