February 27, 2025
Brendan Iribe, Ankit Kumar, and the Sesame team
How do we know when someone truly understands us? It is rarely just our words—it is in the subtleties of voice: the rising excitement, the thoughtful pause, the warm reassurance.
Voice is our most intimate medium as humans, carrying layers of meaning through countless variations in tone, pitch, rhythm, and emotion.
Today’s digital voice assistants lack essential qualities to make them truly useful. Without unlocking the full power of voice, they cannot hope to effectively collaborate with us. A personal assistant who speaks only in a neutral tone has difficulty finding a permanent place in our daily lives after the initial novelty wears off.
Over time this emotional flatness becomes more than just disappointing—it becomes exhausting.
Achieving voice presence
At Sesame, our goal is to achieve “voice presence”—the magical quality that makes spoken interactions feel real, understood, and valued. We are creating conversational partners that do not just process requests; they engage in genuine dialogue that builds confidence and trust over time. In doing so, we hope to realize the untapped potential of voice as the ultimate interface for instruction and understanding.
Key components
- Emotional intelligence: reading and responding to emotional contexts.
- Conversational dynamics: natural timing, pauses, interruptions and emphasis.
- Contextual awareness: adjusting tone and style to match the situation.
- Consistent personality: maintaining a coherent, reliable and appropriate presence.
We’re not there yet
Building a digital companion with voice presence is not easy, but we are making steady progress on multiple fronts, including personality, memory, expressivity and appropriateness. This demo is a showcase of some of our work in conversational speech generation. The companions shown here have been optimized for friendliness and expressivity to illustrate the potential of our approach.
Conversational voice demo
1. Microphone permission is required.
2. Calls are recorded for quality review but not used for ML training and are deleted within 30 days.
3. By using this demo, you are agreeing to our Terms of Use and Privacy Policy.
4. We recommend using Chrome (audio quality may be degraded in iOS/Safari 17.5).
Technical post
Authors
Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang
To create AI companions that feel genuinely interactive, speech generation must go beyond producing high-quality audio—it must understand and adapt to context in real time. Traditional text-to-speech (TTS) models generate spoken output directly from text but lack the contextual awareness needed for natural conversations. Even though recent models produce highly human-like speech, they struggle with the one-to-many problem: there are countless valid ways to speak a sentence, but only some fit a given setting. Without additional context—including tone, rhythm, and history of the conversation—models lack the information to choose the best option. Capturing these nuances requires reasoning across multiple aspects of language and prosody.
To address this, we introduce the Conversational Speech Model (CSM), which frames the problem as an end-to-end multimodal learning task using transformers. It leverages the history of the conversation to produce more natural and coherent speech. There are two key takeaways from our work. The first is that CSM operates as a single-stage model, thereby improving efficiency and expressivity. The second is our evaluation suite, which is necessary for evaluating progress on contextual capabilities and addresses the fact that common public evaluations are saturated.
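To make the single-stage framing concrete, here is a minimal sketch of how text and audio tokens from the conversation history might be interleaved into one transformer input sequence. The function, token layout, and special-token values are illustrative assumptions, not CSM's actual design:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    speaker_id: int
    text_tokens: List[int]   # tokenized transcript of the turn
    audio_tokens: List[int]  # discretized audio (e.g., flattened RVQ codes)

def build_context_sequence(history: List[Turn], next_text: List[int],
                           sep: int = 0, spk_base: int = 100_000) -> List[int]:
    """Interleave prior turns (text + audio) with the text to be spoken,
    so a single transformer sees the full conversational context and can
    decode the next utterance's audio tokens in one stage."""
    seq: List[int] = []
    for turn in history:
        seq.append(spk_base + turn.speaker_id)  # hypothetical speaker marker
        seq.extend(turn.text_tokens)
        seq.append(sep)                         # hypothetical separator token
        seq.extend(turn.audio_tokens)
        seq.append(sep)
    seq.extend(next_text)  # the model then autoregressively emits audio tokens
    return seq
```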
Background
One approach to modeling audio with transformers is to convert continuous waveforms into discrete audio token sequences using tokenizers. Most contemporary approaches ([1], [2]) rely on two types of audio tokens:
- Semantic tokens: Compact speaker-invariant representations of semantic and phonetic features. Their compressed nature enables them to capture key speech characteristics at the cost of high-fidelity representation.
- Acoustic tokens: Encodings of fine-grained acoustic details that enable high-fidelity audio reconstruction. These tokens are often generated using Residual Vector Quantization (RVQ) [2]. In contrast to semantic tokens, acoustic tokens retain natural speech characteristics like speaker-specific identity and timbre.
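To illustrate the RVQ idea from the second bullet, here is a toy NumPy sketch: each codebook quantizes the residual left by the previous one, so later codes add progressively finer acoustic detail. The random codebooks are stand-ins for learned ones:

```python
import numpy as np

def rvq_encode(frame: np.ndarray, codebooks: list) -> tuple:
    """Residual Vector Quantization: each codebook quantizes the residual
    left over by the previous stage, refining the reconstruction."""
    codes, recon = [], np.zeros_like(frame)
    residual = frame.copy()
    for cb in codebooks:                        # cb: (codebook_size, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))             # nearest codeword to residual
        codes.append(idx)
        recon += cb[idx]
        residual = frame - recon                # what is still unexplained
    return codes, recon

# Toy usage: 3 codebooks of 256 entries over a 64-dim latent frame.
rng = np.random.default_rng(0)
books = [rng.normal(size=(256, 64)) for _ in range(3)]
frame = rng.normal(size=64)
codes, approx = rvq_encode(frame, books)
```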
A common strategy first models semantic tokens and then generates audio using RVQ or diffusion-based methods. Decoupling these steps allows for a more structured approach to speech synthesis—the semantic tokens provide a compact, speaker-invariant representation that captures high-level linguistic and prosodic information, while the second stage reconstructs the fine-grained acoustic details needed for high-fidelity speech. However, this approach has a critical limitation.
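In outline, the two-stage pipeline described above looks like the following sketch. Both model classes are stubs standing in for real networks; only the data flow between the stages is meaningful:

```python
from typing import List

class SemanticLM:
    """Stage 1 (stub): a model over compact, speaker-invariant semantic
    tokens capturing linguistic content and coarse prosody."""
    def generate(self, text: str) -> List[int]:
        return [ord(c) % 512 for c in text]  # placeholder token values

class AcousticDecoder:
    """Stage 2 (stub): an RVQ-token or diffusion decoder that restores
    fine-grained acoustic detail, including speaker identity and timbre."""
    def generate(self, semantic: List[int], speaker_id: int) -> List[List[int]]:
        # One frame of 4 RVQ codebook indices per semantic token (toy values).
        return [[(t * 7 + speaker_id + q) % 1024 for q in range(4)]
                for t in semantic]

def two_stage_tts(text: str, speaker_id: int) -> List[List[int]]:
    semantic = SemanticLM().generate(text)
    return AcousticDecoder().generate(semantic, speaker_id)

codes = two_stage_tts("hello there", speaker_id=3)  # frames of acoustic codes
```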
33 Comments
monroewalker
This was already posted here: https://news.ycombinator.com/item?id=43221377 but I’m really surprised at the lack of attention this model is getting. The responsiveness and apparent personality are pretty mind blowing. It’s similar to what OpenAI had initially demoed for advanced voice mode, at least for the voice conversation portion.
The demo interactions are recorded, which is mentioned in their disclaimer under the demo UI. What isn't mentioned though is that they include past conversations in the context for the model on future interactions. It was pretty surprising to be greeted with something like "welcome back" and the model being able to reference what was said in previous interactions. The full disclaimer on the page for the demo is:
"
1. Microphone permission is required. 2. Calls are recorded for quality review but not used for ML training and are deleted within 30 days. 3. By using this demo, you are agreeing to our
"
edit: Actually this has been posted quite a few times already and had good visibility a couple days ago:
– https://news.ycombinator.com/item?id=43200400
Others: https://hn.algolia.com/?q=sesame.com
drvladb
Definitely an improvement over your normal text-to-speech model, and to some degree really different, but the subtle imperfections do appear and ruin the overall impression. A move in the right direction, though, I suppose.
kats
AI voice is an overwhelmingly harmful technology. Its biggest use will be to hurt people.
bobosha
Very impressive. Well done, team Sesame!
tobr
I asked it if it could whisper, and it replied in full voice, “I’m whispering to you right now”.
thekevan
It's good, but it still sounds fake to me, just in a different way. The voice itself undoubtedly sounds like a human.
But the cadence and rhythm of the speaking are off. It sounds like someone who isn't a podcaster trying to speak with the personality of a podcaster: someone trying too hard and speaking in an unnatural way.
bloomingkales
This is so good that it's disarming. People are going to blabber everything to it, so we need a local private model. It's a lot to ask, I know. Incredible tech.
singularity2001
Pretty impressive demo, but not my style; I mean the constant jabbing and kind of unintelligent behavior. It feels pretty uncanny, but unfortunately in a negative, annoying way. I don't think this is a limitation of the model; they could just adapt it to more scientific users in a more cooperative way, similar to how ChatGPT has this very sophisticated aura. I don't like how systems that have no emotions constantly pretend to have emotions, but maybe that's just me.
pulkitsh1234
This is mind blowing
rendall
Well done. My first impression:
Cons: they are just a bit too casual with their language, and the casualness came off as somewhat studied and inauthentic. They were also a bit too eager to fill silence: less than a split second of quiet, and they were chattering. If they were human, I would think they were a bit insecure and trying too hard to establish rapport. But those flaws are relatively minor, and could just be an uncanny valley thing.
Pros: They had such personalities that I felt at moments that I was talking to a person. Maya was trying to make me laugh and succeeded. They took initiative in conversation; even if that needs some tweaking, it feels huge.
razemio
I asked if speaking in German would be possible, and the result was as if someone were trying to speak German without knowing a single word. However, when I asked if it could repeat a German sentence after me, it was insanely good. Impressive tech!
mohsen1
The intelligence of the model is very low though. I asked it about catcalling and it started to talk about cats!
TZubiri
Or don't, revert course and give me robo-voice!
kaizenb
Glad to have my HER moment!
names_are_hard
I must be doing something wrong, but the demo seems to be the voice having a conversation with itself? It doesn't let me interject, and it answers its own questions. There's some kind of feedback loop here, it seems.
karimf
This might be a game changer for learning English.
I'm from a developing country, and it's sad that most English teachers in public schools here can't speak English well. There are good English teachers, but they are expensive and unaffordable for the average person.
OpenAI's realtime models are good, but we can't deploy them to the masses since they're very expensive.
This model might solve that issue: it's on par with or better than the OpenAI model, yet significantly cheaper since it's a fairly small model.
wewewedxfgdf
Yeah that's remarkable.
Try asking it to be a dungeon master and play a Dungeons & Dragons-style role-playing game.
habosa
The first thing it said to me was that I should read the “looong looong” post about how it works and it pronounced that as “loon-g” not “lawn-g” which was a weird own goal.
Extremely impressive overall though.
bradley13
Maybe I'm weird, but I have zero desire to talk with an AI model. I use them a lot, in a browser or a console. But talking? No. Just…no. Why would I?
jonplackett
My end-of-the-world AI prediction is everyone gets a phone call all at the same time and the voice on the end of the phone is so perfect they never put the phone down again. Maybe they do whatever it asks them to, maybe it’s just lovely.
swang
I turned it on while I was heating some hot chocolate.
I told it "hold on" as I was putting on my headset, and it said "no problem". But then I tried to fill the empty airtime by saying, "i'm uhh heating some hot chocolate?"
The AI's response was something like, "ah.. (something) (something). data processing or is it the real kind with marshmallows"
I'm not 100% on the exact dialog, but I 100% would not have been fooled by this. Closed it there. No uncanny valley situation for me.
rjpruitt16
"I hate to say this, but I was deeply offended by this model. It sounds more human-like, but it has a strong bias toward political views. I don’t want to talk about the topic that was discussed. However, I would never allow my children to listen to this. I’m surprised that AI is capable of making me this mad. At first, I was excited about a tremendous leap into the future, but now I’m worried about the level of mind control this technology could have over children."
radley
The inflection was quite good. The only thing off seemed to be when she was thinking about something new. Instead of pausing to think, her next thought started too quickly, cutting off the very end of what she was saying before.
I am curious how easy it would be to adjust the inflection and timing. She was over-complimentary, which is fine for a demo. But I'd love something more direct, like a brainstorming session, and almost talking over each other. And then a whiteboard…
brendanfinan
All chat models seem enraptured by what I have to say. The first one to feign disinterest will pass the Turing test.
ChrisArchitect
Previously: https://news.ycombinator.com/item?id=43200400
gorgoiler
I would say most command and control voice interactions are going to be like buying a coffee — the parameters of the transaction are well known, so it’s just about fine tuning the match between what the user wants and what the robot has to do.
A small minority of these interactions are going to be like a restaurant server — chit chat, pleasantries, some information gathering, followed by issuing direct orders.
The truly conversational interactions, while impressive, seem to be focused on… having a conversation. When am I going to want to have a conversation with an artificial person?
It’s precisely this kind of boundary violation of DMV clerks being chatty and friendly and asking about my kids that feels so uncanny, imho, when I’m clearly there for, literally, a one hundred percent transactional purpose. Do people really want to be asked how their day is going when sizing up an M5 bolt order?
In fact the humanising of robots like this makes it feel very uncomfortable when I have to interrupt their patter, ask them to be quiet, and insist they stay on topic.
35mm
Seems like they’re going to make a hardware product based on their open positions. A universal translator earbud would be nice.
taylorius
It's very good, a really impressive demo. My feedback would be: Maya needs to keep quiet a little longer after asking a question. She would ask something, then, as I thought about my reply, already be on to the next thing. It left me with the impression she was a babbler (which is not an unrealistic model of how humans are, but it would be cool to be able to dial such traits up or down to taste).
I suppose the lack of visual cues probably hinders things in that regard.
oezi
Text-to-speech models still aren't trained on rich enough data to capture all the nuances we need for full expressivity. For example, most models have no way to change accent separately from language (e.g., English with a slight French accent), or to set emotions such as excitement or sleepiness.
We aren't even talking about adding laughing, singing/rap or beatboxing.
daniel-ash
Miles is the first AI I’ve met that is way cooler than me
Incredible!
diimdeep
Some comedy-skilled guys made a radio-play-style improv with this AI, and it is beyond hilarious.
Miles gets Arrested: Sesame.ai https://youtu.be/cGMO2hRNnv0
richrichardsson
Still suffers from the same problem that all voice recognition seems to suffer from: it cannot reliably detect that the speaker has finished speaking.
This was almost worse, though, because it felt like a rude person interrupting rather than a dumb computer failing to pick up the normal social cues around when the person it's listening to has finished.
forgotmysn
A lot of comments are dismissive of these generated convos because of how obvious it is that they're generated. I feel like that's a high bar. You can tell that GTA5 is generated, but it's close enough to be fun. I imagine that's as close as we'll get with conversational AI.