35 Comments
minimaxir
This is an official OpenAI tool linked from the new model announcement (https://openai.com/index/introducing-our-next-generation-aud… ), despite the branding difference.
danso
The voices are pretty convincing. It's funny to hear how drastically the tone of the reading can change when you repeatedly stop and restart the samples without changing any of the settings.
nickthegreek
Try the refresh button to get a new list of vibe styles.
varunneal
One of the most novel demos I've seen OpenAI ship in a few years. I love how it looks almost like a synth. Fun to play around with!
stephenheron
Quite disappointing that their speech-to-text models are not open source. Whisper was really good, and it was great that it was open to play around with. I guess this continues OpenAI's approach of not really being open!
Etheryte
Recommended input for anyone trying this out:
Voice: Onyx
Vibe: Heavy German accent, doing an Arnold Schwarzenegger impression, way over the top for comedic effect. Deep booming voice, uses pauses for dramatic effect.
jcmp
What do you call this design/UI aesthetic? I like it.
danso
Interesting, I inserted a bunch of "fuck"s in the text and the "NYC Cabbie" voice read it all just fine. When I switched to other voices ("Connoisseur", "Cheerleader", "Santa"), it responded "I'm sorry I can't assist with that request".
I switched back to "NYC Cabbie" and it again read it just fine. I then reloaded the session completely, refreshed the voice selections until "NYC Cabbie" came up again, and it still read the text without hesitation.
The text:
> In my younger and more vulnerable years my father fuck gave me some fuck fuck advice that I've been fuck fuck FUCK OH FUCK turning over in my mind ever since.
> "Whenever you feel like criticizing any one," he told me, oh fuck! FUCK! "just remember that all the people in this world haven't had fuck fuck fuck FUCKERKER the advantages that you've had."
edit: "Emo Teenager", "Mad Scientist", and "Smooth Jazz" are able to read the text. However, "Medieval Knight" and "Robot" cannot.
ComputerGuru
It would be much more convenient to use if changing the voice model worked on the fly, without having to stop and start the audio.
islewis
Cool format for a demo. Some of the voices have a slight "metallic" ring to them, something I've seen a fair amount with ElevenLabs' models.
Does anyone have any experience with the realtime latency of these OpenAI TTS models? ElevenLabs has been so slow (much slower than the latency they advertise), which makes it almost impossible to use in realtime scenarios unless you can cache and replay the outputs. Cartesia looks to have cracked time to first token, but I've found their voices to be a bit less consistent than ElevenLabs'.
carbocation
Nova+Serene sounds very metallic at the beginning about 50% of the time for me.
jeffharris
Hey, I'm Jeff and I was PM for these models at OpenAI. Today we launched three new state-of-the-art audio models. Two speech-to-text models—outperforming Whisper. A new TTS model—you can instruct it how to speak (try it on openai.fm!). And our Agents SDK now supports audio, making it easy to turn text agents into voice agents. We think you'll really like these models. Let me know if you have any questions here!
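If you want to try the voice-agent piece, here's a minimal sketch along the lines of the Agents SDK voice quickstart (class and event names as documented there; the silent buffer is just a stand-in for real microphone input):

    import asyncio
    import numpy as np
    from agents import Agent
    from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

    # Wrap an ordinary text agent in a voice pipeline (STT -> agent -> TTS)
    agent = Agent(name="Assistant", instructions="You are a helpful assistant.")
    pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))

    async def main():
        buffer = np.zeros(24000 * 3, dtype=np.int16)  # 3 s of silence as stand-in input
        result = await pipeline.run(AudioInput(buffer=buffer))
        async for event in result.stream():
            if event.type == "voice_stream_event_audio":
                pass  # event.data is a chunk of synthesized audio to play

    asyncio.run(main())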
basitmakine
I don't think they're anywhere near TaskAGI or ElevenLabs level.
tantalor
It does a good job with the Pirate voice. It can even inject "Arrr, matey".
theoryofx
Still seems like Elevenlabs is crushing them on realtime audio, or does this change things?
jtbayly
I don’t get it. These voices all have a not-so-subtle vibration in them that makes them feel worse than Siri to me. I was expecting a lot better.
minimaxir
One very important quote from the official announcement:
> For the first time, developers can “instruct” the model not just on what to say but how to say it—enabling more customized experiences for use cases ranging from customer service to creative storytelling.
The instructions are the "vibes" in this UI. But the announcement is wrong about the "for the first time" part: it was already possible to steer the base GPT-4o model to create voices in a certain style using system prompt engineering (I blogged about this, out of concern that it could be used as a replacement for voice acting: https://minimaxir.com/2024/10/speech-prompt-engineering/ ); however, it was too expensive and adherence wasn't great.
The schema of the vibes here implies that this new model is more receptive to nuance, which changes the calculus. The test cases from my post behave as expected, and the cost of gpt-4o-mini-tts audio output is $0.015/minute (https://platform.openai.com/docs/pricing ), which is about 1/20th the cost of my initial experiments and makes it feasible to potentially replace common voice applications. This has implications, and I'll be testing more nuanced prompt engineering.
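For comparison with the prompt-engineering approach, the new model takes the style directly as an instructions parameter on the speech endpoint. A minimal sketch with the openai Python SDK (parameter names per the current platform docs; the vibe text is just an example):

    from openai import OpenAI

    client = OpenAI()

    # "instructions" carries the vibe; "voice" is still one of the preset names
    with client.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",
        voice="coral",
        input="Whenever you feel like criticizing any one...",
        instructions="NYC cabbie: fast, gravelly, no-nonsense, thick accent.",
    ) as response:
        response.stream_to_file("speech.mp3")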
benjismith
If I'm reading the pricing correctly, these models are SIGNIFICANTLY cheaper than ElevenLabs.
https://platform.openai.com/docs/pricing
If these are the "gpt-4o-mini-tts" models, and if the pricing estimate of "$0.015 per minute" of audio is correct, then these prices are about 85% cheaper than ElevenLabs'.
https://elevenlabs.io/pricing
With ElevenLabs, if I choose their most cost-effective "Business" plan for $1,100 per month (with annual billing of $13,200, a savings of 17% over monthly billing), then I get 11,000 minutes of TTS, and each minute is billed at 10 cents.
With OpenAI, I could get 11,000 minutes of TTS for $165.
Somebody check my math… Is this right?
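A quick sanity check of the numbers above (plan figures as quoted; Python just as a calculator):

    eleven_per_min = 1100 / 11_000       # ElevenLabs Business: $0.10/min
    openai_total = 11_000 * 0.015        # OpenAI at $0.015/min -> $165.00
    saving = 1 - openai_total / 1100     # -> 0.85, i.e. ~85% cheaper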
fixprix
Is this right? The current best TTS from OpenAI uses gpt-4o-audio-preview, which is $2.50/1M text input tokens and $80/1M audio output tokens; the new gpt-4o-mini-tts is $0.60/1M text input and $12/1M audio output. Roughly a 5x price reduction on average.
Going the other way, transcription with gpt-4o-audio-preview was $40/1M audio input tokens and $10/1M text output; the new gpt-4o-transcribe is $6/1M audio input and $10/1M text output. Roughly a 7x reduction on the input price.
TTS/Transcribe with gpt-4o-audio-preview was a hack where you had to prompt with 'listen/speak this sentence:' and it often got it wrong. These new dedicated models are exactly what we needed.
I'm currently using the Google TTS API, which is really good, fast, and cheap. They charge $16 per million characters, which works out to roughly the same as OpenAI's $0.015 per minute estimate (at a typical rate of about 1,000 characters per spoken minute).
Unfortunately it's not really worth switching over if the costs are essentially the same. Transcription, on the other hand, is 1.6¢/minute with Google and 0.6¢/minute with OpenAI now; that might be worth switching for.
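The per-minute equivalence above rests on a speaking-rate assumption; a back-of-the-envelope check (the ~1,000 characters per spoken minute figure is a rough assumption, not from either vendor's docs):

    chars_per_min = 1_000                              # rough speaking-rate assumption
    google_per_min = 16 / 1_000_000 * chars_per_min    # $16/M chars -> ~$0.016/min
    openai_per_min = 0.015                             # OpenAI's quoted estimate
    # within ~7% of each other -- effectively the same price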
tosh
Are these models only available via the API right now or also available as open weights?
evalstate
Really looking forward to integrating with these models.
The next version of Model Context Protocol will have native audio support (https://github.com/modelcontextprotocol/specification/pull/9…), which will open up plenty of opportunities for interop.
pklimk
Interestingly "replaces every second word with potato" and "speaks in Spanish instead of English" both (kind of) work as a style, so it's clear there's significant flexibility and probably some form of LLM-like thing under the hood.
benjismith
Is there a way to get "speech marks" alongside the generated audio?
FYI, speech marks provide a millisecond timestamp for each word in a generated audio file/stream (and a start/end index into your original source string), as a stream of JSONL objects, like this:
{"time":6,"type":"word","start":0,"end":5,"value":"Hello"}
{"time":732,"type":"word","start":7,"end":11,"value":"it's"}
{"time":932,"type":"word","start":12,"end":16,"value":"nice"}
{"time":1193,"type":"word","start":17,"end":19,"value":"to"}
{"time":1280,"type":"word","start":20,"end":23,"value":"see"}
{"time":1473,"type":"word","start":24,"end":27,"value":"you"}
{"time":1577,"type":"word","start":28,"end":33,"value":"today"}
AWS uses these speech marks (with variants for "sentence", "word", "viseme", or "ssml") in their Polly TTS service…
The sentence or word marks are useful for highlighting text as the TTS reads aloud, while the "viseme" marks are useful for doing lip-sync on a facial model.
https://docs.aws.amazon.com/polly/latest/dg/output.html
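For reference, requesting word-level speech marks from Polly looks roughly like this with boto3 (this returns the JSONL shown above; the audio itself needs a separate call with an audio OutputFormat like "mp3"):

    import boto3, json

    polly = boto3.client("polly")
    resp = polly.synthesize_speech(
        Text="Hello, it's nice to see you today",
        VoiceId="Joanna",
        OutputFormat="json",          # "json" returns speech marks, not audio
        SpeechMarkTypes=["word"],     # also: "sentence", "viseme", "ssml"
    )
    for line in resp["AudioStream"].read().decode("utf-8").splitlines():
        print(json.loads(line))      # e.g. {"time": 6, "type": "word", ...}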
looknee
Hmm, I was hoping these would bridge the gap between what's already been available on their audio API or in the Realtime API vs. Advanced Voice Mode, but the audio quality is really the same as it's been up to this point.
Does anyone have any clue about exactly why they're not making the quality of Advanced Voice Mode available to build with? It would be game changing for us if they did.
mlsu
I gave it (part of) the classic Navy Seal copypasta.
Interestingly, the safety controls ("I cannot assist with that request") are somewhat dependent on the vibe instruction. NYC Cabbie has no problem with it (and it's really, really funny, great job OpenAI), but anything peaceful, positive, etc. will deny the request.
https://www.openai.fm/#56f804ab-9183-4802-9624-adc706c7b9f8
forgotpasagain
It sounds very expressive but weirdly "fake", as if it's trying to sound like some NPC character. Dataset issue?
kartikarti
What does this little star next to the name mean?
crazygringo
This is astonishing. I can type anything I want into the "vibe" box and it does it for the given text. Accents, attitudes, personality types… I'm amazed.
The level of intelligent "prosody" here — the rhythm and intonation, the pauses and personality — I wasn't expecting anything like this so soon. This is truly remarkable. It understands both the text and the prompt for how the speaker should sound.
Like, we're getting much closer to the point where nobody except celebrities is going to record audiobooks. Everyone's just going to pick whatever voice they're in the mood for.
Some fun ones I just came up with:
> Imposing villain with an upper class British accent, speaking threateningly and with menace.
> Helpful customer support assistant with a Southern drawl who's very enthusiastic.
> Woman with a Boston accent who talks incredibly slowly and sounds like she's about to fall asleep at any minute.
tomjen3
It isn't clear to me: can the model do correct emphasis on things like single words?
I did not steal that horse
is the trivial example of something where the intonation of a single word is what matters. More importantly, when you read something aloud as a human, you vary intonation, volume, and speed.
ForTheKidz
Pricing looks like it's aimed at us peasants, not our lords. Smart if OpenAI wants to survive!
RobinL
I'm surprised at how poor this is at following a detailed prompt.
It seems capable of generating a consistent style, and so in that sense quite useful. But if you want (say) a regional UK accent it's not even close.
I also find it confusing that you have to choose a voice. Surely that's what the prompt should be for, especially when the voices have such abstract names.
I mean, it's still very impressive when you stand back a bit, but it feels a bit half-baked.
Example:
Voice: Thick and hearty, with a slow, rolling cadence—like a lifelong Somerset farmer leaning over a gate, chatting about the land with a mug of cider in hand. It’s warm, weathered, and rich, carrying the easy confidence of someone who’s seen a thousand harvests and knows every hedgerow and rolling hill in the county.
Tone: Friendly, laid-back, and full of rustic charm. It’s got that unhurried quality of a man who’s got time for a proper chinwag, with a twinkle in his eye and a belly laugh never far away. Every sentence should feel like it’s been seasoned with fresh air, long days in the fields, and a lifetime of countryside wisdom.
Dialect: Classic West Country, with broad vowels, softened consonants, and that unmistakable rural lilt. Words flow together in an easy drawl, with plenty of dropped "h"s and "g"s. "I be" replaces "I am," and "us" gets used instead of "we" or "me." Expect plenty of "ooh-arrs," "proper job," and "gurt big" sprinkled in naturally.
paul7986
Personally I just want to text or talk to Siri or an LLM and have it do whatever I need. Have it interface with the AI agents of companies, businesses, friends, or family to get whatever I need done, like the example on the OpenAI.fm site here (rebook my flight). Once it's done, it shows me the confirmation on my lock screen and I receive an email confirmation.
tiahura
When are we going to get the equivalent for Whisper? When is it going to pick up on enthusiasm, sarcasm, etc.?
kibbi
Large text-to-speech and speech-to-text models have been greatly improving recently.
But I wish there were an offline, multilingual text-to-speech solution with good voices for a standard PC — one that doesn't require a GPU, tons of RAM, or max out the CPU.
In my research, I didn't find anything that fits the bill. People often mention Tortoise TTS, but I think it garbles words too often. The only plug-in solution for desktop apps I know of is the commercial and rather pricey Acapela SDK.
I hope someone can shrink those new neural network–based models to run efficiently on a typical computer. Ideally, it should run at under 50% CPU load on an average laptop that’s several years old, and start speaking almost immediately (less than 400ms delay).
The same goes for speech-to-text. Whisper.cpp is fine, but last time I looked, it wasn't able to transcribe audio at real-time speed on a standard laptop.
I'd pay for something like this as long as it's less expensive than Acapela.
(My use case is an AAC app.)
justanotheratom
Note that the previous Whisper STT models were Open Source, and these new STT models are not, AFAICT.