
35 Comments

  • Post Author
    minimaxir
    Posted March 20, 2025 at 5:25 pm

    This is an official OpenAI tool linked from the new model announcement (https://openai.com/index/introducing-our-next-generation-aud… ), despite the branding difference.

  • Post Author
    danso
    Posted March 20, 2025 at 5:37 pm

The voices are pretty convincing. It's funny to hear how drastically the tone of the reading can change when you repeatedly stop and restart the samples without changing any of the settings.

  • Post Author
    nickthegreek
    Posted March 20, 2025 at 5:38 pm

    Try the refresh button to get a new list of vibe styles.

  • Post Author
    varunneal
    Posted March 20, 2025 at 5:38 pm

One of the most novel demos I've seen OpenAI ship in the last few years. I love how it looks almost like a synth. Fun to play around with!

  • Post Author
    stephenheron
    Posted March 20, 2025 at 5:38 pm

It's quite disappointing that their speech-to-text models are not open source. Whisper was really good, and it was great that it was open to play around with. I guess this continues OpenAI's approach of not really being open!

  • Post Author
    Etheryte
    Posted March 20, 2025 at 5:40 pm

    Recommended input for anyone trying this out:

    Voice: Onyx

Vibe: Heavy German accent, doing an Arnold Schwarzenegger impression, way over the top for comedic effect. Deep booming voice, uses pauses for dramatic effect.

  • Post Author
    jcmp
    Posted March 20, 2025 at 5:45 pm

What do you call this design/UI aesthetic? I like it.

  • Post Author
    danso
    Posted March 20, 2025 at 5:45 pm

    Interesting, I inserted a bunch of "fuck"s in the text and the "NYC Cabbie" voice read it all just fine. When I switched to other voices ("Connoisseur", "Cheerleader", "Santa"), it responded "I'm sorry I can't assist with that request".

    I switched back to "NYC Cabbie" and it again read it just fine. I then reloaded the session completely, refreshed the voice selections until "NYC Cabbie" came up again, and it still read the text without hesitation.

    The text:

    > In my younger and more vulnerable years my father fuck gave me some fuck fuck advice that I've been fuck fuck FUCK OH FUCK turning over in my mind ever since.

    > "Whenever you feel like criticizing any one," he told me, oh fuck! FUCK! "just remember that all the people in this world haven't had fuck fuck fuck FUCKERKER the advantages that you've had."

    edit: "Emo Teenager", "Mad Scientist", and "Smooth Jazz" are able to read the text. However, "Medieval Knight" and "Robot" cannot.

  • Post Author
    ComputerGuru
    Posted March 20, 2025 at 5:47 pm

    It would be much more convenient to use if changing the voice model worked on the fly, without having to stop and start the audio.

  • Post Author
    islewis
    Posted March 20, 2025 at 5:49 pm

Cool format for a demo. Some of the voices have a slight "metallic" ring to them, something I've seen a fair amount with ElevenLabs' models.

Does anyone have any experience with the realtime latency of these OpenAI TTS models? ElevenLabs has been so slow (much slower than the latency they advertise) that it's almost impossible to use in realtime scenarios unless you can cache and replay the outputs. Cartesia looks to have cracked time to first token, but I've found their voices to be a bit less consistent than ElevenLabs'.

  • Post Author
    carbocation
    Posted March 20, 2025 at 5:51 pm

    Nova+Serene sounds very metallic at the beginning about 50% of the time for me.

  • Post Author
    jeffharris
    Posted March 20, 2025 at 5:55 pm

    Hey, I'm Jeff and I was PM for these models at OpenAI. Today we launched three new state-of-the-art audio models. Two speech-to-text models—outperforming Whisper. A new TTS model—you can instruct it how to speak (try it on openai.fm!). And our Agents SDK now supports audio, making it easy to turn text agents into voice agents. We think you'll really like these models. Let me know if you have any questions here!
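
If you want to try the speech-to-text side from code, here's a minimal sketch using the Python SDK (it goes through the same transcriptions endpoint that whisper-1 used; the file name is just an example):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Transcribe a local audio file with the new model
    with open("meeting.mp3", "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="gpt-4o-transcribe",
            file=audio_file,
        )

    print(transcript.text)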

  • Post Author
    basitmakine
    Posted March 20, 2025 at 6:01 pm

    I don't think they're anywhere near TaskAGI or ElevenLabs level.

  • Post Author
    tantalor
    Posted March 20, 2025 at 6:02 pm

It does a good job with the Pirate voice. It can even inject an "Arrr matey."

  • Post Author
    theoryofx
    Posted March 20, 2025 at 6:06 pm

Still seems like ElevenLabs is crushing them on realtime audio, or does this change things?

  • Post Author
    jtbayly
    Posted March 20, 2025 at 6:14 pm

    I don’t get it. These voices all have a not-so-subtle vibration in them that makes them feel worse than Siri to me. I was expecting a lot better.

  • Post Author
    minimaxir
    Posted March 20, 2025 at 6:20 pm

    One very important quote from the official announcement:

    > For the first time, developers can “instruct” the model not just on what to say but how to say it—enabling more customized experiences for use cases ranging from customer service to creative storytelling.

The instructions are the "vibes" in this UI. But the announcement is wrong about the "for the first time" part: it was possible to steer the base GPT-4o model to create voices in a certain style using system prompt engineering (blogged about here: https://minimaxir.com/2024/10/speech-prompt-engineering/ ), with the concern that it could be used as a replacement for voice acting. However, it was too expensive, and adherence wasn't great.

The schema of the vibes here implies that this new model is more receptive to nuance, which changes the calculus. The test cases from my post behave as expected, and gpt-4o-mini-tts audio output costs $0.015/minute (https://platform.openai.com/docs/pricing ), about 1/20th the cost of my initial experiments, which makes it feasible to replace common voice applications. This has implications, and I'll be testing more nuanced prompt engineering.
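
For anyone who wants to reproduce this through the API rather than the openai.fm UI, a minimal sketch with the official Python SDK; the instructions parameter carries the vibe text, and the voice, input, and file name here are just example values:

    from openai import OpenAI

    client = OpenAI()

    # "instructions" is the API-side equivalent of the "vibe" box in the demo UI
    with client.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",
        voice="onyx",
        input="In my younger and more vulnerable years my father gave me some advice.",
        instructions="Heavy German accent, deep booming voice, dramatic pauses.",
    ) as response:
        response.stream_to_file("output.mp3")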

  • Post Author
    benjismith
    Posted March 20, 2025 at 6:21 pm

    If I'm reading the pricing correctly, these models are SIGNIFICANTLY cheaper than ElevenLabs.

    https://platform.openai.com/docs/pricing

If these are the "gpt-4o-mini-tts" models, and if the pricing estimate of "$0.015 per minute" of audio is correct, then these prices are 85% cheaper than those of ElevenLabs.

    https://elevenlabs.io/pricing

With ElevenLabs, if I choose their most cost-effective "Business" plan at $1,100 per month (with annual billing of $13,200, a savings of 17% over monthly billing), I get 11,000 minutes of TTS, with each minute billed at 10 cents.

    With OpenAI, I could get 11,000 minutes of TTS for $165.

    Somebody check my math… Is this right?
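
Sketching it in Python so the arithmetic is easy to check:

    openai_per_min = 0.015       # $ per minute of gpt-4o-mini-tts audio (estimated)
    eleven_plan = 1100.0         # $ per month, ElevenLabs Business, annual billing
    eleven_minutes = 11_000      # TTS minutes included in that plan

    eleven_per_min = eleven_plan / eleven_minutes    # = $0.10 per minute
    openai_total = openai_per_min * eleven_minutes   # = $165 for the same 11,000 minutes
    discount = 1 - openai_total / eleven_plan        # = 0.85, i.e. 85% cheaper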

  • Post Author
    fixprix
    Posted March 20, 2025 at 6:21 pm

Is this right? The current best TTS from OpenAI uses gpt-4o-audio-preview at $2.50 per million input text tokens and $80 per million output audio tokens; the new gpt-4o-mini-tts is $0.60 input text and $12 output audio. Roughly a 5x price reduction on average.

Going the other way, transcription with gpt-4o-audio-preview was priced at $40 per million audio input tokens and $10 per million text output tokens; the new gpt-4o-transcribe is $6 audio input and $10 text output. Nearly a 7x reduction on the input price.

    TTS/Transcribe with gpt-4o-audio-preview was a hack where you had to prompt with 'listen/speak this sentence:' and it often got it wrong. These new dedicated models are exactly what we needed.

I'm currently using the Google TTS API, which is really good, fast, and cheap. They charge $16 per million characters, which works out to roughly the same as OpenAI's $0.015 per minute estimate.

Unfortunately it's not really worth switching over if the costs are essentially the same. Transcription, on the other hand, is 1.6¢/minute with Google and 0.6¢/minute with OpenAI now; that might be worth switching for.
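
For what it's worth, the character-to-minute comparison depends on speaking rate. A quick sanity check in Python, assuming roughly 150 words per minute and ~6 characters per word (both assumptions, not figures from either pricing page):

    chars_per_min = 150 * 6                            # ~900 characters per spoken minute (assumed)
    google_per_min = 16 / 1_000_000 * chars_per_min    # ~ $0.0144 per minute
    openai_per_min = 0.015                             # OpenAI's own per-minute estimate

So the two land within about 5% of each other under those assumptions.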

  • Post Author
    tosh
    Posted March 20, 2025 at 6:23 pm

    Are these models only available via the API right now or also available as open weights?

  • Post Author
    evalstate
    Posted March 20, 2025 at 6:28 pm

    Really looking forward to integrating with these models.

    The next version of Model Context Protocol will have native audio support (https://github.com/modelcontextprotocol/specification/pull/9…), which will open up plenty of opportunities for interop.

  • Post Author
    pklimk
    Posted March 20, 2025 at 6:32 pm

    Interestingly "replaces every second word with potato" and "speaks in Spanish instead of English" both (kind of) work as a style, so it's clear there's significant flexibility and probably some form of LLM-like thing under the hood.

  • Post Author
    benjismith
    Posted March 20, 2025 at 6:33 pm

Is there a way to get "speech marks" alongside the generated audio?

FYI, speech marks provide a millisecond timestamp for each word in a generated audio file/stream (plus start/end indices into your original source string), as a stream of JSONL objects, like this:

    {"time":6,"type":"word","start":0,"end":5,"value":"Hello"}

    {"time":732,"type":"word","start":7,"end":11,"value":"it's"}

    {"time":932,"type":"word","start":12,"end":16,"value":"nice"}

    {"time":1193,"type":"word","start":17,"end":19,"value":"to"}

    {"time":1280,"type":"word","start":20,"end":23,"value":"see"}

    {"time":1473,"type":"word","start":24,"end":27,"value":"you"}

    {"time":1577,"type":"word","start":28,"end":33,"value":"today"}

    AWS uses these speech marks (with variants for "sentence", "word", "viseme", or "ssml") in their Polly TTS service…

    The sentence or word marks are useful for highlighting text as the TTS reads aloud, while the "viseme" marks are useful for doing lip-sync on a facial model.

    https://docs.aws.amazon.com/polly/latest/dg/output.html
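
For reference, a minimal sketch of pulling word-level speech marks from Polly with boto3 (voice, region, and text are placeholder values):

    import json
    import boto3

    polly = boto3.client("polly", region_name="us-east-1")

    # Requesting speech marks instead of audio: OutputFormat must be "json"
    resp = polly.synthesize_speech(
        Text="Hello it's nice to see you today",
        VoiceId="Joanna",
        OutputFormat="json",
        SpeechMarkTypes=["word"],
    )

    # The stream is newline-delimited JSON, one object per word
    for line in resp["AudioStream"].read().decode("utf-8").splitlines():
        if line:
            print(json.loads(line))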

  • Post Author
    looknee
    Posted March 20, 2025 at 6:34 pm

Hmm, I was hoping these would bridge the gap between what's already been available in their audio API / Realtime API and Advanced Voice Mode, but the audio quality is really the same as it's been up to this point.

    Does anyone have any clue about exactly why they're not making the quality of Advanced Voice Mode available to build with? It would be game changing for us if they did.

  • Post Author
    mlsu
    Posted March 20, 2025 at 6:35 pm

    I gave it (part of) the classic Navy Seal copypasta.

Interestingly, the safety controls ("I cannot assist with that request") are sort of dependent on the vibe instruction. NYC cabbie has no problem with it (and it's really, really funny, great job OpenAI), but anything peaceful, positive, etc. will deny the request.

    https://www.openai.fm/#56f804ab-9183-4802-9624-adc706c7b9f8

  • Post Author
    forgotpasagain
    Posted March 20, 2025 at 6:37 pm

It sounds very expressive but weirdly "fake", as if it's aiming to sound like some NPC character. Dataset issue?

  • Post Author
    kartikarti
    Posted March 20, 2025 at 6:45 pm

    What does this little star next to the name mean?

  • Post Author
    crazygringo
    Posted March 20, 2025 at 6:48 pm

    This is astonishing. I can type anything I want into the "vibe" box and it does it for the given text. Accents, attitudes, personality types… I'm amazed.

    The level of intelligent "prosody" here — the rhythm and intonation, the pauses and personality — I wasn't expecting anything like this so soon. This is truly remarkable. It understands both the text and the prompt for how the speaker should sound.

Like, we're getting much closer to the point where nobody except celebrities is going to record audiobooks. Everyone will just pick whatever voice they're in the mood for.

    Some fun ones I just came up with:

    > Imposing villain with an upper class British accent, speaking threateningly and with menace.

    > Helpful customer support assistant with a Southern drawl who's very enthusiastic.

    > Woman with a Boston accent who talks incredibly slowly and sounds like she's about to fall asleep at any minute.

  • Post Author
    tomjen3
    Posted March 20, 2025 at 6:55 pm

It doesn't seem clear whether the model can do correct emphasis on things like single words:

    I did not steal that horse

is the trivial example of something where the intonation of a single word is what matters. More importantly, when reading something as a human, you vary intonation, audio level, and speed.

  • Post Author
    ForTheKidz
    Posted March 20, 2025 at 7:21 pm

    Pricing looks like it's aimed at us peasants, not our lords. Smart if openai wants to survive!

  • Post Author
    RobinL
    Posted March 20, 2025 at 7:22 pm

    I'm surprised at how poor this is at following a detailed prompt.

    It seems capable of generating a consistent style, and so in that sense quite useful. But if you want (say) a regional UK accent it's not even close.

    I also find it confusing you have to choose a voice. Surely that's what the prompt should be for, especially when the voices have such abstract names.

I mean, it's still very impressive when you stand back a bit, but it feels a bit half-baked.

    Example:
    Voice: Thick and hearty, with a slow, rolling cadence—like a lifelong Somerset farmer leaning over a gate, chatting about the land with a mug of cider in hand. It’s warm, weathered, and rich, carrying the easy confidence of someone who’s seen a thousand harvests and knows every hedgerow and rolling hill in the county.

    Tone: Friendly, laid-back, and full of rustic charm. It’s got that unhurried quality of a man who’s got time for a proper chinwag, with a twinkle in his eye and a belly laugh never far away. Every sentence should feel like it’s been seasoned with fresh air, long days in the fields, and a lifetime of countryside wisdom.

    Dialect: Classic West Country, with broad vowels, softened consonants, and that unmistakable rural lilt. Words flow together in an easy drawl, with plenty of dropped "h"s and "g"s. "I be" replaces "I am," and "us" gets used instead of "we" or "me." Expect plenty of "ooh-arrs," "proper job," and "gurt big" sprinkled in naturally.

  • Post Author
    paul7986
    Posted March 20, 2025 at 7:37 pm

Personally I just want to text or talk to Siri or an LLM and have it do whatever I need: have it interface with the AI agents of companies, businesses, friends, or family to get things done, like the example on the OpenAI.fm site here (rebook my flight). Once it's done, it shows me the confirmation on my lock screen and I receive an email confirmation.

  • Post Author
    tiahura
    Posted March 20, 2025 at 8:02 pm

When are we going to get the equivalent for Whisper? When is it going to pick up on enthusiasm, sarcasm, etc.?

  • Post Author
    kibbi
    Posted March 20, 2025 at 8:14 pm

    Large text-to-speech and speech-to-text models have been greatly improving recently.

    But I wish there were an offline, multilingual text-to-speech solution with good voices for a standard PC — one that doesn't require a GPU, tons of RAM, or max out the CPU.

    In my research, I didn't find anything that fits the bill. People often mention Tortoise TTS, but I think it garbles words too often. The only plug-in solution for desktop apps I know of is the commercial and rather pricey Acapela SDK.

    I hope someone can shrink those new neural network–based models to run efficiently on a typical computer. Ideally, it should run at under 50% CPU load on an average laptop that’s several years old, and start speaking almost immediately (less than 400ms delay).

    The same goes for speech-to-text. Whisper.cpp is fine, but last time I looked, it wasn't able to transcribe audio at real-time speed on a standard laptop.

    I'd pay for something like this as long as it's less expensive than Acapela.

    (My use case is an AAC app.)

  • Post Author
    justanotheratom
    Posted March 20, 2025 at 8:32 pm

    Note that the previous Whisper STT models were Open Source, and these new STT models are not, AFAICT.
