Hello and welcome back to part 2 of the Introduction to AI Engineering series. In this part, we’ll take a look at some other exciting APIs in the AI space, as there is much more out there than just ChatGPT.
Image generation using DALL-E
Let’s start with a look at image generation. We’ll be looking at DALL-E, which is an AI model that can generate images from textual descriptions. There are many AI image generators out there nowadays, but the process for using them will be roughly the same.
So let’s have a look at the sizing and price options we have available:
| Quality  | 1024x1024     | 1024x1792 or 1792x1024 |
|----------|---------------|------------------------|
| Standard | $0.04 / image | $0.08 / image          |
| HD       | $0.08 / image | $0.12 / image          |
As you can see, we have both `Standard` and `HD` quality options available, with sizes of either `1024x1024`, `1024x1792`, or `1792x1024`. Depending on the choices we make, the cost is anywhere between $0.04 and $0.12 per image. Now this may seem somewhat costly, but generally you won't really need the `HD` option (there is not a huge difference), and you can often get away with a `1024x1024` image. $0.04 for instant on-demand artwork is not too bad!
So let's see how to use the API and create some cool images. Start by creating a new file called `generate_image.py` in your project root folder:
```
📁 Intro_to_AI_engineering
    📄 .env
    📄 chat_gpt_request.py
    📄 generate_image.py        ✨ New file
    📄 text_to_summarize.txt
```
Now open the `generate_image.py` file and add the following code:
```python
import os

from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
CLIENT = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
```
You will probably notice something interesting: the imports and setup we are doing here are exactly the same as in the first tutorial. We are using the same `OpenAI` class, loading our API key with `load_dotenv` again, and passing it into the `CLIENT` object we create. This is because OpenAI provides a unified interface for all their APIs, which is really convenient.
As the steps we'll be taking are reasonably similar to last time with ChatGPT, let's step it up a notch and make this version a bit nicer using a function. Add the following code to the `generate_image.py` file:
```python
def generate_image(image_description):
    response = CLIENT.images.generate(
        model="dall-e-3",
        prompt=image_description,
        n=1,
        size="1024x1024"
    )
    image_url = response.data[0].url
    print(f"Generated image: {image_url}")
    return image_url
```
We define a function `generate_image` that takes an `image_description` as input. We then call the `CLIENT.images.generate` method, which is conveniently also provided for us on the `CLIENT` object and does exactly what the name suggests.
For the model, let's use the latest version of DALL-E, which is `dall-e-3` at the time of writing. We pass in the `image_description` as the `prompt` for this model and request `n=1` (number = 1) image to be generated. We also specify the `size` of the image we want, which in this case is `1024x1024`.
The `response` object will contain a bunch of data, just like last time with the ChatGPT response. We'll have to dig in a little bit to find the URL that we want. Without going into too much detail here (see other courses for more on this), we can access the URL of the generated image through the `url` attribute of the first item in the `data` list, so `response.data[0].url` will give us the URL to our image.
Finally, we print out the URL to the terminal and return it as the output of the function. This URL is hosted on OpenAI's servers, and you can click the link to see and download your image.
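As a quick aside: DALL-E 3 rewrites your prompt behind the scenes to add detail before generating the image, and at the time of writing the response also exposes this rewritten prompt. If you're curious, you could add one extra line inside `generate_image`, right after the API call (purely optional):

```python
# Optional: print the prompt DALL-E 3 actually used after rewriting ours.
print(f"Revised prompt: {response.data[0].revised_prompt}")
```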
Now let's make this into a runnable script so we can actually use the function we just wrote. Below the `generate_image` function, add the following code:
```python
if __name__ == "__main__":
    image_description = input("Enter a description of the image you want to generate: ")
    generate_image(image_description)
```
The `if __name__ == "__main__":` block might look a bit confusing. There is no need to fully understand why it works here, as long as you know what it does: the code inside this block will only run if you run the script directly, and not if you import it into another script. This is a common pattern in Python scripts.
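To see the difference in action, here is a minimal sketch; `some_other_script.py` is a hypothetical second file:

```python
# some_other_script.py (hypothetical file, just to illustrate the pattern)
import generate_image  # runs the module's top-level code, but skips the
                       # `if __name__ == "__main__":` block, because during
                       # an import __name__ is "generate_image", not "__main__"

# So we call the function explicitly with our own description instead:
url = generate_image.generate_image("A watercolor painting of a lighthouse")
```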
What this will let us do is run this Python file as a script directly, in which case the code inside this block will be executed. The `input` function will prompt the user in the terminal to enter a description of the image they want to generate. We save whatever the user types in a variable called `image_description`.
We then call our `generate_image` function with this description as input, at which point the function runs, generates the image, and prints the URL to the terminal.
So give it a test run, just like our ChatGPT script last time, by either using the terminal window or the play button in VSCode. You should see your prompt question pop up in the terminal as we invoked the `input` function:
```
Enter a description of the image you want to generate:
```
So type anything and then press enter. You’ll get a response with the URL to your generated image. Click the link to see your image!
```
Enter a description of the image you want to generate: Japanese anime typical old fashioned street in the sunset art
Generated image: https://url.here.net/verylongurlhere
```
Here is what I got for my old-fashioned Japanese sunset street in anime style:

That looks pretty darn cool! It does look like a scene straight out of an anime. As this is an introduction-level tutorial we don’t have too much time to go into specific details, but let’s take it a step further before we move on and at least have the images automatically downloaded to our computer.
Downloading the image
First create a new folder named `output` in your project root folder:
```
📁 Intro_to_AI_engineering
    📁 output                   ✨ New folder
    📄 .env
    📄 chat_gpt_request.py
    📄 generate_image.py
    📄 text_to_summarize.txt
```
This is where we will be saving our images. Now inside the `generate_image.py` script, we first need to add two new imports up top:
```python
import os
import uuid      # New import
import requests  # New import

from openai import OpenAI
from dotenv import load_dotenv

# ... rest of the code ...
```
The `requests` library will let Python make an internet request to the URL we receive on our behalf, so we can download the image automatically. The `uuid` library helps us generate a unique name for the image file so we don't overwrite existing files.
Now scroll down a bit, and directly after the `generate_image` function but before the `if __name__ == "__main__":` block, add the following new function:
```python
def download_and_save_image(url):
    response = requests.get(url)
    image_name = uuid.uuid4()
    image_path = os.path.join("output", f"{image_name}.png")
    with open(image_path, "wb") as file:
        file.write(response.content)
    print(f"Image saved to {image_path}")
    return image_path
```
We name this function `download_and_save_image`, and it takes a `url` as input. We then use the `requests.get` function to make a request to the URL and get the image data, which we save in the `response` object.
We create a unique name to save the file under (to avoid conflicts and overwriting existing files) using the `uuid` library. This will simply generate a random unique string for us.
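If you're curious what these unique names look like, you can try it in a Python shell (the values below are just examples; yours will differ, as they are random):

```python
>>> import uuid
>>> uuid.uuid4()
UUID('1b9d6bcd-bbfd-4b2d-9b5d-ab8dfbbd4bed')  # example value
>>> str(uuid.uuid4())
'6fa459ea-ee8a-4ca4-894e-db77e160355e'  # example value
```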
After this we use the `os` (operating system) library to create a file path with `os.path.join`, joining the string `"output"` (referring to the output folder we created before) with the unique string plus the `.png` extension. This gives us a valid path we can use as a location to save the image.
Now we use the `with open` trick from part 1 again, and as we already have our path created we can just pass in `image_path` as the first argument. We open the file in write-binary mode (`wb`), which is required for writing binary data like images.
The `response` object that the requests library got for us contains more than just the image itself, but we are only interested in the image right now, so we access `response.content` to get the image data. We pass this image data into `file.write` to write the image to the file.
Finally, we print out a message to the terminal that the image was saved successfully.
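As a side note, `requests` does not raise an error when a download fails; it just hands you back the failed response. A slightly more defensive sketch of the same function could check the status first (purely optional for this tutorial):

```python
def download_and_save_image(url):
    response = requests.get(url, timeout=30)  # timeout is optional but sensible
    response.raise_for_status()  # raises an exception for 4xx/5xx responses
    image_path = os.path.join("output", f"{uuid.uuid4()}.png")
    with open(image_path, "wb") as file:
        file.write(response.content)
    print(f"Image saved to {image_path}")
    return image_path
```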
Now we just need to slightly edit the `if __name__ == "__main__":` block to call this new function after we generate the image. Edit the block to look like this:
```python
if __name__ == "__main__":
    image_description = input("Enter a description of the image you want to generate: ")
    image_url = generate_image(image_description)
    download_and_save_image(image_url)
```
As our `generate_image` function already returns the URL of the generated image, we can save this URL in a variable called `image_url`. We then pass this URL into our new `download_and_save_image` function to download and save the image to our computer.
So go ahead and try out our new and improved function:
```
Enter a description of the image you want to generate: Godzilla in a bikini sunbathing in front of the Pyramids sipping on a cocktail
Generated image: https://url.here.net/verylongurlhere
Image saved to output\3e8dc429-069d-4ea7-9eb7-3f0baf607039.png
```
And there you go, the image will be automatically saved in your output folder, and it is pretty epic:

If you enjoy image generation and would like to learn more, check out the OpenAI API Mastery: Innovating with GPT-4 Turbo, Text-to-Speech (TTS), and DALL·E 3 course on the Finxter Academy!
In this course, you'll dive into the cutting-edge features of AI, focusing on practical applications using GPT-4 Turbo, Text-to-Speech (TTS), and DALL·E 3. This hands-on course guides you through parallel function calls, OpenAI JSON mode, and advanced image editing techniques, equipping you with the skills to create efficient and intuitive AI applications. As a side note, since new models keep coming out so quickly, some of the courses use older model names. You can just change the model name to `gpt-4o` or `gpt-4o-mini` if you want to use the latest models; the courses are still the same.
Text to speech
One of the other very useful options in our AI API arsenal is a text-to-speech converter. OpenAI has a pretty good one, and the voices sound very natural nowadays. We'll take a quick look at just the bare basics here, to quickly expand the tools in our toolbox.
Create a new file called `generate_speech.py` in your project root folder:
```
📁 Intro_to_AI_engineering
    📁 output
    📄 .env
    📄 chat_gpt_request.py
    📄 generate_image.py
    📄 generate_speech.py       ✨ New file
    📄 text_to_summarize.txt
```
To save time here and keep it basic, I'll deliberately use nearly the same code so we can go over this very efficiently. Open the `generate_speech.py` file and add the following code:
```python
import os

from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
CLIENT = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))


def text_to_speech(text):
    response = CLIENT.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=text
    )
    first_chars = text[:20]
    speech_file_path = os.path.join("output", f"{first_chars}.mp3")
    response.stream_to_file(speech_file_path)
    print(f"Speech saved to {speech_file_path}")
    return response


if __name__ == "__main__":
    text = input("Enter the text you want to convert to speech: ")
    text_to_speech(text)
```
No surprises on the imports and setup, as we are using the same `OpenAI` class to set up our `CLIENT` object again. We define a new function called `text_to_speech` that takes a `text` input. We then call the `CLIENT.audio.speech.create` method, which will invoke OpenAI's text-to-speech model.
Let's go over the parameters we have here. For the model, we have either `tts-1`, which is the normal version and costs $0.015 per 1000 characters spoken, or `tts-1-hd`, which is the high-definition version and costs double that at $0.03 per 1000 characters spoken. We are using the normal version here, and for most use cases you probably will not hear much difference.
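To get a feel for these numbers, here is a tiny helper; the rates are simply the prices quoted above:

```python
def estimate_tts_cost(text, hd=False):
    """Rough cost estimate based on the per-character prices quoted above."""
    rate_per_1000_chars = 0.03 if hd else 0.015
    return len(text) / 1000 * rate_per_1000_chars

print(f"${estimate_tts_cost('Hello there!' * 100):.4f}")  # 1200 characters -> $0.0180
```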
There are several voices available, `alloy` just being one of the options (see OpenAI's documentation for a full list of available voices). We pass in the `text` we want to convert to speech as the `input` parameter.
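If you want to hear the difference for yourself, a quick sketch like this generates a short sample per voice so you can compare them side by side (the voice names are the ones listed in OpenAI's docs at the time of writing, and each call only costs a fraction of a cent):

```python
# Generate one short audio sample for each available voice.
for voice in ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]:
    sample = CLIENT.audio.speech.create(
        model="tts-1",
        voice=voice,
        input="Hello! This is a quick voice comparison test.",
    )
    sample.stream_to_file(os.path.join("output", f"voice_{voice}.mp3"))
```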
Now I'll get us a quick filename by slicing the first 20 characters of the user input text (`[:20]`) and using that as the filename. We then create a file path using `os.path.join` again like before, but this time we use the `.mp3` extension, and we call the `stream_to_file` method on the `response` object instead to save the audio to that path.
The final code below, where we ask for the user input, is also familiar from the previous example with the image generator. So go ahead and try this out by running the file and giving it any input text you want. Keep in mind that longer text may take a bit longer to process:
```
Enter the text you want to convert to speech: Once upon a time in a land of waffles and kangaroos, a quirky platypus named Sir Quackalot decided to open a detective agency.
c:\Coding_Vault\Broad_intro_to_AI_draft\generate_speech.py:17: DeprecationWarning: Due to a bug, this method doesn't actually stream the response content, `.with_streaming_response.method()` should be used instead
  response.stream_to_file(speech_file_path)
Speech saved to output\Once upon a time in .mp3
```
Note that there is a deprecation warning in the middle, but you can just ignore it for now. It's a bit outside the scope of this tutorial, and your code will work just fine. A new file will appear in your output folder with the speech you requested:
Very cool and sounds very natural, right? You can use this for a lot of things, like creating audiobooks, adding voiceovers to videos, or even just for fun. We've come so far since the days of the robot voices. There's not too much complexity to this particular API, so just play around with the different voices, see which you like best, and enjoy!
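By the way, if that deprecation warning bothers you, the message itself points at the fix. A minimal sketch of the suggested approach, assuming a reasonably recent version of the `openai` package, looks like this:

```python
# Same call as before, but via the streaming-response helper that the
# DeprecationWarning suggests; the audio is written to disk as it arrives.
with CLIENT.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input=text,
) as response:
    response.stream_to_file(speech_file_path)
```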
Speech-to-text (transcription)
The last API I want to show you in this broad overview, and another one you really cannot miss in your tool-belt of AI APIs, is a speech-to-text model. Why is this so useful? Well, it's not just for those automated YouTube subtitles you get nowadays (yes, you can automatically subtitle your videos, more on that later!), but it's also used for things like automated action points and summaries of the last online meeting you had with your team.
There are many cases where you have audio and you simply want it in a text format because that is easier to share, access, and glance over. Or perhaps you want to do some further analysis on the text, like sentiment analysis or keyword extraction, maybe even have an LLM work with it, or have it translated. All of these require getting an accurate transcription of the audio first.
OpenAI has a very good model named `Whisper` which can help us out here. The best thing about it is that it costs only $0.006 per minute of audio transcribed. That's about $0.36 for a full hour of meeting audio!
So let’s try it out! The first thing you will need is audio that we can convert to text. Let’s keep it simple for the first example. If you have a microphone you can record a short sample of you speaking, or you can download my sample provided here for your convenience:
I have a reasonably decent microphone so my sample is quite good quality, but that does not mean that you can only use high-quality samples. Whisper is surprisingly good with low-quality audio and microphones, such as the ones that are often used for Internet meetings, dramatically increasing its usefulness.
Whether you use my sample or record your own, save it in your project root folder as `test_audio.mp3`:
```
📁 Intro_to_AI_engineering
    📁 output
    📄 .env
    📄 chat_gpt_request.py
    📄 generate_image.py
    📄 generate_speech.py
    📄 test_audio.mp3           ✨ New file
    📄 text_to_summarize.txt
```
Now create a new file called `transcribe_audio.py` in your project root folder:
```
📁 Intro_to_AI_engineering
    📁 output
    📄 .env
    📄 chat_gpt_request.py
    📄 generate_image.py
    📄 generate_speech.py
    📄 test_audio.mp3
    📄 text_to_summarize.txt
    📄 transcribe_audio.py      ✨ New file
```
Open the `transcribe_audio.py` file, and let's go over the code all at once. You'll notice the setup is the same; we'll look at how to reduce this type of duplication later. Add the following code:
```python
import os

from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
CLIENT = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))


def transcribe_audio(audio_file):
    response = CLIENT.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )
    transcript = response.text
    print(f"Transcription:\n{transcript}")
    return transcript


if __name__ == "__main__":
    audio_file = open("test_audio.mp3", "rb")
    transcribe_audio(audio_file)
```
We define a new function called `transcribe_audio` that takes an `audio_file` as input. We then call the `CLIENT.audio.transcriptions.create` method, which will invoke OpenAI's speech-to-text model.
For the model we use `whisper-1`, as Whisper is the name of OpenAI's very impressive speech-to-text model. We pass in the `audio_file` as the `file` parameter.
The transcript of the audio will be in the `response.text` attribute, which we save in a variable called `transcript`. We then print out the transcript to the terminal and return it as the output of the function.
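Remember those automated subtitles I mentioned? The transcriptions endpoint can also return other formats via a `response_format` parameter. Here is a minimal sketch for timestamped SubRip subtitles (check OpenAI's docs for the full list of formats):

```python
def transcribe_to_srt(audio_file):
    # Same endpoint, but ask for .srt subtitles (with timestamps) instead of
    # plain text; the call then returns the subtitle content as a string.
    return CLIENT.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="srt",
    )
```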
The final code below is also familiar from the previous examples. Inside the `if __name__ == "__main__":` block we open the `test_audio.mp3` file in read-binary mode (`rb`) and pass it into our `transcribe_audio` function.
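One small note: we never explicitly close the file handle here. Using the `with open` pattern from part 1 would be slightly tidier, as it closes the file for us:

```python
if __name__ == "__main__":
    # The `with` block closes the file automatically when we are done with it.
    with open("test_audio.mp3", "rb") as audio_file:
        transcribe_audio(audio_file)
```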
And I get the following output in my terminal:
```
Transcription:
Hi guys, this is just a quick test audio file for you. Let's see how well it does and if my speech is recognized and converted to text properly. I'm really excited to see how well this works and I hope that it will be a good test for you guys to see how well the Whisper model works.
```
Perfect, word-for-word, with even the punctuation being flawless, and very fast too. This on its own is already pretty useful, but when we combine it with other concepts and build on it, we can create some really cool stuff. If you would like to see what kinds of cool stuff exactly, check out the Voice-First Development: Building Cutting-Edge Python Apps Powered By OpenAI Whisper course on the Finxter Academy!

You will learn to build this and more!
This course delves into OpenAI’s Whisper model, teaching you how to use Python for speech-to-text applications. You will build practical tools like a podcast transcription app, an automatic subtitle generator that embeds subtitles straight into your video file, and more. The course also covers strategies for optimizing Whisper’s speed and effectively outsourcing tasks to balance quick results with accurate transcriptions, giving you a comprehensive grasp of its real-world applications.
This is the end of part 2 of this ‘broad overview of AI’ series. I’ll see you in part 3 where we’ll look at how to chain all of these APIs together in an organized manner using LangChain. See you there!