👉 Back to the Full Course on local models and Hugging Face (+Videos)
Hello and welcome back to the last part of the local models course. In this part, we’ll be looking at some cool specialized models and one last installation method for local models that we haven’t used so far. We’ll use different models to generate music from text descriptions and then 3D models from our own pictures, which is very cool.
MusicGen
Without further ado, let’s look at the music generation models first. Head over to the Hugging Face page for Facebook’s MusicGen (Meta AI):
This model comes in four versions:
Small: 300M parameters, not really strong enough for great results.
Medium: 1.5B parameters, a reasonable in-between option.
Large: 3.3B parameters, well over 10GB download size.
Melody: 1.5B parameters, this version takes an input audio sample and a text prompt and tries to use the input audio’s melody when generating music in the prompted style.
So let’s get started! As always, the code we’ll write here is based on the Hugging Face model card and the official GitHub repository for the project. First, we need to install the required libraries. Run the following command in your terminal:
pipenv install transformers scipy
The transformers library should already be installed from the previous parts if you’re following along. SciPy will be used to write the generated waveform data to an output .wav audio file. Now let’s create a new file in our project root directory named text_to_music.py:
📁Local_Models
    📁MeloTTS
    📁test_files
    📁tts_output
    📄chat_app.py
    📄image_gen.ipynb
    📄image_to_image.ipynb
    📄local_chat.py
    📄local_chat_memory.py
    📄memory.py
    📄model_preloader.py
    📄Pipfile
    📄Pipfile.lock
    📄text_to_speech.py
    📄text_to_music.py    ✨New file
Inside this new file let’s start by importing the required libraries:
from transformers import pipeline
import scipy
import uuid
We’ll be using the Hugging Face transformers library again (it’s one of the main features of the Hugging Face ecosystem), so we can load one of the predefined pipelines with the model we want with minimal effort.
As we stated, scipy is used to write the generated audio data to a .wav file. The uuid library is for unique IDs, as we want to make sure we don’t overwrite our previous generations. You can also use a timestamp if you prefer.
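For example, a minimal timestamp-based alternative could look like the sketch below (output_filename is a hypothetical name and isn’t used elsewhere in this tutorial):
from datetime import datetime

# Hypothetical alternative to uuid: a sortable, human-readable timestamp in the file name
output_filename = f"musicgen_out_{datetime.now():%Y%m%d_%H%M%S}.wav"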
Now let’s load up the model and the pipeline:
synthesizer = pipeline("text-to-audio", "facebook/musicgen-medium")
We use the predefined text-to-audio pipeline and specify that we want to load the facebook/musicgen-medium model. I’ll be skipping over the small version of the model here, but I highly recommend that you try medium before generating with large, as the large model is quite big and can take a very long time to generate if your system is less powerful.
The official project page recommends you have at least a 16GB GPU for the medium model, but even if you don’t have one you can follow along and generate using the CPU. In my personal testing for this course it worked fine even with the GPU disabled altogether; it may just take 10+ minutes before you get your music result.
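If you want to control explicitly whether generation runs on the GPU or the CPU, the pipeline also accepts a device argument. This isn’t part of the original snippet, so treat it as a hedged sketch; depending on your transformers version, device accepts an integer (-1 for CPU, 0 for the first CUDA GPU) or a string such as "cpu" or "cuda:0".
from transformers import pipeline

# Sketch: force CPU generation; swap in device=0 (or "cuda:0") to use the first GPU instead
synthesizer = pipeline("text-to-audio", "facebook/musicgen-medium", device="cpu")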
Generating music
Now we simply define our music variable:
music = synthesizer(
    "Gangster rap beat with a piano riff",
    forward_params={"do_sample": True},
)
We call the synthesizer pipeline we created earlier, passing in the query Gangster rap beat with a piano riff, or any other prompt you like for the first test. The forward_params argument is used to pass arguments through the pipeline to the underlying model. The do_sample argument is described in the documentation as:
“…sampling leads to significantly better results than greedy, thus we encourage sampling mode to be used where possible. Sampling is enabled by default and can be explicitly specified by setting do_sample=True in the call…”
Since we like quality results, we’ll follow along with the instructions here! Now, after it generates this music, we need to save it to a file:
scipy.io.wavfile.write(
    f"musicgen_out_{uuid.uuid4()}.wav",
    rate=music["sampling_rate"],
    data=music["audio"]
)
This code is pretty self-explanatory; scipy.io.wavfile.write writes the audio data to a .wav file. We pass in the sampling_rate and the audio data, which are provided for us as part of the music output generated by the model. We also use uuid.uuid4() to generate a unique ID for the file name so we don’t overwrite our previous generations, saving the file as musicgen_out_IDHERE.wav.
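As a side note, other generation arguments can be passed through forward_params in the same way as do_sample. Here is a hedged sketch, assuming the standard transformers generate() interface where max_new_tokens controls how many audio tokens are produced (MusicGen generates roughly 50 tokens per second of audio); check the MusicGen documentation before relying on it:
# Sketch (not from the original tutorial): request roughly 10 seconds of audio
music = synthesizer(
    "Gangster rap beat with a piano riff",
    forward_params={"do_sample": True, "max_new_tokens": 512},
)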
Make sure your terminal is in the same directory as this script when you run it so the file is saved in the correct location.
💡 Side note: your IDE may show red squiggly lines under the music["sampling_rate"] and music["audio"] parts of the code if you have a type checker enabled. This is because the type checker cannot infer the type of the music variable and check whether or not it has these properties. Not to worry, they will be provided by the model in the output, so you can just ignore the red lines.
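For reference, here is the complete text_to_music.py assembled from the snippets above:
from transformers import pipeline
import scipy
import uuid

# Load the predefined text-to-audio pipeline with the MusicGen medium model
synthesizer = pipeline("text-to-audio", "facebook/musicgen-medium")

# Generate music from a text prompt (sampling mode for better quality)
music = synthesizer(
    "Gangster rap beat with a piano riff",
    forward_params={"do_sample": True},
)

# Write the audio to a uniquely named .wav file so we never overwrite older results
scipy.io.wavfile.write(
    f"musicgen_out_{uuid.uuid4()}.wav",
    rate=music["sampling_rate"],
    data=music["audio"]
)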
That’s it, go ahead and run this file in your terminal and give it some time to finish. Here is what I got from the medium model for the gangster rap beat with a piano riff:
Keeping in mind this was AI-generated on our own local machine, that really is not bad! I’m curious about the quality difference with the large model. If your computer can handle it, change the line where we load the model:
synthesizer = pipeline("text-to-audio", "facebook/musicgen-large")
I ran it again using the exact same prompt and here is what I got for the large model version:
You can tell that the piano riff part of the query is more present in the result compared to the previous one where it was a lot more in the background, so it does seem to be a bit better. Neither music sample is a perfect professional finished product, but for something we just generated on our local computer with an open-source model that’s pretty cool!
Testing various genres
I ran some more tests so let’s look at the prompts and the results for the large model.
lofi slow bpm electro chill with organic samples
The model is excellent at generating this type of chill music, as lo-fi is rough around the edges by nature and thus very easily sounds convincing enough!
blues guitar slow tempo rhythmic drums guitar bass blues band guitar licks soulful bluesy feel emotional blues progression expressive
As you can hear, it sounds half plausible but is not going to replace blues musicians quite yet. This type of ‘real’ live-style music seems a lot harder for the model to digest. We can see that it picks up on our hints very well though, like the blues progression, the added bass guitar, and the slow tempo.
dance beat with a catchy melody energetic and lively dance festival edm dance music dance party dance pop danceable dancefloor dancehall
The dance beat is a bit mediocre overall, though it has its moments.
epic orchestral music with a powerful and dramatic feel epic movie soundtrack epic trailer music epic video game music
It has the right feel and rhythm, but qualitatively it sounds like an amateur MIDI file, so not quite as epic as I’d hoped. It’s clear, though, that in the future even this type of epic movie-style music will be easily generated by AI.
In order to use the melody model, where we provide an input melody for the model to follow while generating, we have to install the underlying Audiocraft library. This is also needed if you want it to generate music longer than 30 seconds, which is totally possible. More information here and here.
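If you do want to experiment with the melody model, here is a rough sketch based on the Audiocraft README. I haven’t run this as part of the course, so treat the exact calls (MusicGen.get_pretrained, generate_with_chroma, audio_write) and the Audiocraft install itself as something to verify against the official documentation; the prompt and file names are just placeholders:
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load the melody-conditioned model and set the clip length in seconds
model = MusicGen.get_pretrained("facebook/musicgen-melody")
model.set_generation_params(duration=15)

# Load the reference melody (any audio file torchaudio can read)
melody, sample_rate = torchaudio.load("my_melody.wav")

# Generate one clip that follows the reference melody in the prompted style
wav = model.generate_with_chroma(
    ["lofi slow bpm electro chill with organic samples"],
    melody[None],  # add a batch dimension to match the single description
    sample_rate,
)

# Save with loudness normalization; this writes melody_out.wav next to the script
audio_write("melody_out", wav[0].cpu(), model.sample_rate, strategy="loudness")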
As AI music generation is not quite up to the highest level yet, and this is a very niche type of model at the moment, I’ll leave this up to you to explore further if you’re interested. I think it would be really cool to create an infinite lo-fi chill music radio channel and have it live-stream on Twitch or something.
3D Model generation using TripoSR
It’s time for us to move on to 3D models! The model we’ll be trying out this time is TripoSR, a collaboration between Stability AI (the company behind Stable Diffusion and the SDXL-Turbo image model we used in part 4) and Tripo AI. The Hugging Face page can be found here:
If we scroll down, we can see that it refers us to the GitHub repository for installation instructions, so let’s head over to the repo at https://github.com/VAST-AI-Research/TripoSR. Scrolling down a bit, we can see the instructions:
We can pretty much skip past the first three steps! We already have the CUDA and PyTorch parts handled. Next, it tells us to update setuptools, so let’s do that before we continue. With your terminal in the root project folder, run the following command:
pipenv update setuptools
# This is the same as:
# pip install --upgrade setuptools
# except pipenv uses the term 'update'
Cloning and installing the repository
Finally, it instructs us to install the other dependencies provided in the requirements.txt file.
To do this we’ll first need to have a local clone of the repository on our machine. We cannot install requirements for a project that we don’t have on our machine! So let’s clone the repository:
git clone https://github.com/VAST-AI-Research/TripoSR
This shouldn’t take long:
$ git clone https://github.com/VAST-AI-Research/TripoSR
Cloning into 'TripoSR'...
remote: Enumerating objects: 161, done.
remote: Counting objects: 100% (64/64), done.
remote: Compressing objects: 100% (47/47), done.
remote: Total 161 (delta 36), reused 21 (delta 17), pack-reused 97
Receiving objects: 100% (161/161), 36.71 MiB | 16.29 MiB/s, done.
Resolving deltas: 100% (62/62), done.
All the files from the GitHub repository are now present on your local machine, under the TripoSR folder:
📁Local_Models
    📁MeloTTS
    📁test_files
    📁TripoSR    ✨New folder
    📁tts_output
    📄chat_app.py
    📄image_gen.ipynb
    📄image_to_image.ipynb
    📄local_chat.py
    📄local_chat_memory.py
    📄memory.py
    📄model_preloader.py
    📄Pipfile
    📄Pipfile.lock
    📄text_to_music.py
    📄text_to_speech.py
If you look inside this folder, you will also see a requirements.txt file, which lists all the dependencies we need to install. The great thing about this file is that we can just feed it to pip or pipenv and it will install all the dependencies for us in one go. With your terminal in the root project folder, run the following command:
pipenv install -r TripoSR/requirements.txt
Note that our terminal is in the project root folder, so we refer to the requirements file inside the TripoSR folder as TripoSR/requirements.txt. The -r flag is simply used to feed the requirements.txt file to pipenv.
Egg fragments?
So go ahead and run the command and you will probably get the following error message!
ValueError: pipenv requires an #egg fragment for version controlled dependencies. Please install remote dependency in the form git+https://github.com/tatsy/torchmcubes.git#egg=<package-name>.
We need an egg fragment? What is that? Why does nothing ever work the way we want it to?! It’s ok, we’ll get there 😉. Going back in history, eggs were a packaged distribution format for Python projects. Think of it as putting the project folder inside a .zip file and then renaming it to project.egg. The format is no longer used, but an egg basically refers to the Python package itself.
If we open up the requirements.txt file, we can see that the torchmcubes package is the one causing the issue:
We’re telling it to install a package from this address, but we’re not giving it a name to install it under. When you want to install a Python package directly from a Git repository URL using pipenv, you need to provide an egg fragment in the URL. The egg fragment is used to specify the name of the package that is being installed. Think of it as saying “from this URL www.github.com/package, install package_name”. The egg fragment name is generally just the name of the package, so we can probably add torchmcubes to the end of the URL and it should work.
We can check the #egg name. Head over to the GitHub repository for the torchmcubes package we are trying to install at https://github.com/tatsy/torchmcubes. Now scroll down to find the setup.py file:
Click it and then look for the name field used in the setup() call:
The name of the package is torchmcubes, just like we suspected. So open the requirements.txt file inside the TripoSR folder and add #egg=torchmcubes to the end of the URL:
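The edited line should end up in exactly the form the error message asked for, something like this:
git+https://github.com/tatsy/torchmcubes.git#egg=torchmcubes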
Now save the file and with your terminal in the root project folder, try running the following command again:
pipenv install -r TripoSR/requirements.txt
💡 If you still have issues, you can try running pipenv lock --clear to clear the pipenv cache and then run the command again. Sometimes the cache can cause problems with dependency resolution. (Note that it will take a while to fetch everything again if you clear the cache.)
Running the Gradio app
Awesome, that’s the installation done! Let’s see if it actually works. As we can see from the GitHub page, the model comes with a Gradio app that we can use to test it out, so let’s first cd into the TripoSR folder:
cd TripoSR
Then run the following command:
python gradio_app.py
It may take some time to load up depending on your system specs, and it will probably use up a good chunk of your system memory! Once it’s loaded, you should see the following message:
Running on local URL: http://127.0.0.1:7860
So let’s check it out in our browser:
Sweet! Now we need to upload an image. (Apologies that some of the text on my screen may be displayed in Korean, as this is my system language; this will obviously not be the case for you.) I’ll upload a picture I just took of my Numpad:
I recommend you take your own picture of something though, as it’s much more fun that way 😄, but feel free to use my image if you want. Take something with a fairly simple shape to get started.
Leave the ✅ Remove Background option ticked (unless you are uploading a pre-processed image that already has the background removed), then press the Generate button and give it a moment to run:
Model-specific limitations and strengths
Awesome! Let’s try something else next. How about a notebook:
And here is the 3D model generated from that image:
Oh no! What happened?! Well, we need to remember that AI is still in its infancy, and not everything will work perfectly. The model works best with very predictable shapes, and its outputs are far from perfect, optimized 3D models. Still, it offers us a small glimpse into the future, and a reminder that very few computing-related fields will remain unaffected by AI.
The thing with this type of model is that we need to know where its particular strengths lie. I suspect this model has mostly been trained on 3D models rather than real-life photographs. It will therefore tend to work quite well on things that already look very 3D, like the following AI-generated 3D character:
And here is the 3D model generated from that image:
That is pretty good and already looks like an acceptable video game character from a game released 10-15 years ago. Here is another example, a 3D poop 💩 emoji; notice how the model has no trouble whatsoever with these 3D-looking images.
It’s only a matter of time until someone trains a more powerful model on more data that can handle varied input images and produce higher-quality output. This is pretty cool though, as the whole character was generated by AI and then turned into a 3D model by AI.
Conclusion
That’s it for the models we’ll be looking at in this course. Hugging Face has a lot more to offer than just the models, though they are of course the main feature. Besides the tons of datasets you can use to train or test your own models, the Spaces are also really fun to explore, and you can even create and run your very own AI apps on Hugging Face Spaces, so make sure to check that out as well.
You can combine the models we’ve looked at in the last three parts of the course with the Ollama functionality we covered in the first three parts to create cool local AI apps. You have all the building blocks; now it’s up to you what to do with them!
As this course was mainly focused on running local models, we’ll leave it here. As always, I hope you enjoyed it, and thanks for taking this journey together with us! If there is anything else you would like to know or learn about, please let us know and we’ll see what we can do. Until next time, happy coding! 🚀
👉 Back to the Full Course on local models and Hugging Face (+Videos)