A new AI voice model has set the internet abuzz, with reactions oscillating between awe and unease. Sesame AI’s Conversational Speech Model (CSM) doesn’t just sound human—it feels human. Users describe extended, almost emotional interactions with the AI-generated voices, which exhibit breath sounds, hesitations, corrections, and even chuckles. For some, it’s a technological marvel. For others, it’s a glimpse into a future that feels uncomfortably close.
Sesame AI: A voice that feels alive
The core innovation behind Sesame’s CSM lies in its ability to simulate natural, dynamic conversation. Unlike traditional text-to-speech systems that simply read aloud, CSM actively engages. It stumbles over words, corrects itself, and modulates tone in a way that mimics real human unpredictability.
When one tester spoke to the model for 28 minutes, they noted its ability to debate moral topics, reacting naturally to prompts like, “How do you decide what’s right or wrong?” Others found themselves unintentionally forming attachments, with one Reddit user admitting, “I’m almost a bit worried I will start feeling emotionally attached to a voice assistant with this level of human-like sound.”
Sesame’s AI assistants, dubbed “Miles” and “Maya,” are designed not just for information retrieval but for deep, engaging conversations. The company describes its goal as achieving “voice presence”—the magical quality that makes spoken interactions feel real, understood, and valued.
That realism sometimes leads to oddly human quirks. In one viral demo, the AI casually mentioned craving a peanut butter and pickle sandwich—a bizarrely specific comment that only added to the illusion of personality.
The tech behind the voice
So how does Sesame’s CSM achieve such eerily lifelike conversations?
- A multimodal approach: Unlike conventional AI speech models that process text and audio separately, Sesame’s system interleaves them. This single-stage processing allows for more fluid, context-aware speech.
- High-parameter training: The largest version of the model has 8.3 billion parameters and was trained on over one million hours of spoken dialogue.
- Meta’s influence: The model’s architecture builds upon Meta’s Llama framework, integrating a backbone model with a decoder for nuanced speech generation.
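To make the "interleaved" single-stage idea concrete, here is a minimal sketch of what merging text and audio token streams into one time-ordered sequence might look like. This is purely illustrative: the function name, the `(timestamp, token)` representation, and the modality tags are assumptions for the example, not Sesame's actual interfaces.

```python
# Hypothetical sketch of single-stage multimodal interleaving: rather than
# processing text and audio in separate pipelines, text tokens and audio
# (speech-codec) tokens are merged into one time-ordered sequence that a
# single backbone model can attend over. All names are illustrative.

def interleave_streams(text_tokens, audio_tokens):
    """Merge two timestamped token streams into one ordered sequence.

    Each stream is a list of (timestamp, token) pairs; the result is a
    single list of (modality, token) pairs sorted by timestamp. Putting
    both modalities in one sequence is what lets a model condition its
    speech output on the surrounding text context in a single pass.
    """
    tagged = [(t, "text", tok) for t, tok in text_tokens] + \
             [(t, "audio", tok) for t, tok in audio_tokens]
    tagged.sort(key=lambda item: item[0])  # stable sort by timestamp
    return [(modality, tok) for _, modality, tok in tagged]


if __name__ == "__main__":
    text = [(0.0, "Hello"), (0.6, "there")]
    audio = [(0.2, 101), (0.4, 102), (0.8, 103)]
    print(interleave_streams(text, audio))
```

The design point is the ordering itself: once both modalities share one sequence, a single transformer backbone can learn cross-modal dependencies without a separate text-to-phoneme or alignment stage.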
Blind tests have revealed that, in isolated speech samples, human evaluators couldn’t reliably distinguish Sesame’s AI voices from real ones. However, when placed in full conversational context, human speech still won out—suggesting AI has not yet mastered the full complexity of interactive dialogue.
A mixed reception
Not everyone is thrilled by how human this AI sounds.
Technology journalist Mark Hachman described his experience with the voice model as “deeply unsettling.” He compared it to talking with an old friend he hadn’t seen in years, noting that the AI’s voice bore an eerie resemblance to someone he had once dated.
Others have likened Sesame’s model to OpenAI’s Advanced Voice Mode for ChatGPT, with some preferring Sesame’s realism and willingness to roleplay in more dramatic or even angry scenarios—something OpenAI’s models tend to avoid.
One particularly striking demo showcased the AI arguing with a “boss” over an embezzlement scandal. The conversation was so dynamic that listeners struggled to determine which speaker was the human and which was the AI.
The risks of a perfect voice
As with all AI breakthroughs, hyper-realistic voice synthesis brings both promise and peril.
- Fraud & scams: With AI voices now indistinguishable from human speech, voice phishing scams could become far more convincing. Criminals could impersonate family members, corporate executives, or government officials with near-perfect accuracy.
- Social engineering: Unlike basic robocalls, AI-powered deception could adapt in real time, responding naturally to questions and suspicion.
- Unintended emotional impact: Some users have reported their children forming attachments to the AI voices. One parent noted that their 4-year-old cried after being denied further conversation with the model.
While Sesame’s CSM does not clone real voices, the possibility of similar open-source projects emerging remains a concern. OpenAI has already delayed the wider release of its voice technology over fears of misuse.
What’s next?
Sesame AI plans to open-source key components of its research under the Apache 2.0 license, allowing developers to build upon its work. The company’s roadmap includes:
- Scaling up model size to increase realism further.
- Expanding to 20+ languages, broadening its conversational reach.
- Developing “fully duplex” models, enabling true back-and-forth, interruption-capable conversations.
For now, the demo remains available on Sesame’s website—though demand has already overwhelmed its servers at times. Whether you find it astonishing or unsettling, one thing is clear: the days of robotic, monotone AI voices are over.
From here on, you may never be quite sure who—or what—you’re talking to.
Featured image credit: Kerem Gülen/Imagen 3