What if your voice could become a lion’s roar or a goblin’s snarl with AI?

A South Korean research team built a machine learning model that converts speech into animal and fantasy sounds with high realism.

by Kerem Gülen
June 2, 2025
in Research

A team of researchers from NC AI Co., Ltd and Sogang University in South Korea (Minsu Kang, Seolhee Lee, Choonghyeon Lee, and Namhyun Cho) has developed a machine learning model that can convert human speech into a wide range of non-human sounds. Their work addresses a growing need in gaming, film, and interactive media: how to automatically generate expressive animal sounds and fantasy voices from human input, with high audio fidelity and realistic style.

Previous research in this area has focused mostly on converting speech to dog barks, and often operated at lower sampling rates such as 16 kHz or 22 kHz, which limited its usefulness in applications requiring richer timbres and broader frequency ranges. In contrast, the new system by Kang and colleagues is designed to work at 44.1 kHz, capturing subtle acoustic cues and transient signals typical of bird calls, lion growls, or synthetic character voices like goblins or orcs.

Rethinking voice conversion for non-human sounds

Traditional voice conversion models are trained on structured human speech and are good at replicating phonemes, tone, and cadence from one speaker to another. However, they struggle with non-human audio that may lack clear linguistic content or follow very different time-frequency dynamics. For example, a bird chirp contains more high-frequency components than typical human speech, while a growl involves fast, irregular changes in pitch and energy.


To handle these challenges, the researchers built a new preprocessing pipeline that goes beyond standard speech processing. Audio is sampled at 44.1 kHz to retain fidelity, and a Short-Time Fourier Transform with a short 5 ms hop is used to extract fine temporal features. This setup allows the model to better capture the transient, erratic nature of non-human sounds while still retaining the intelligibility of the human input.
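
For readers who want to experiment, this kind of front end is easy to approximate. Below is a minimal sketch assuming librosa: the 44.1 kHz rate and 5 ms hop come from the article, while the FFT size and log-magnitude output are illustrative assumptions.

```python
# Minimal front-end sketch, assuming librosa. The 44.1 kHz rate and
# ~5 ms hop come from the article; the FFT size and log-magnitude
# output are illustrative assumptions.
import librosa
import numpy as np

SR = 44_100                  # sampling rate used by the system
HOP = int(0.005 * SR)        # ~5 ms hop -> 220 samples
N_FFT = 2048                 # assumed analysis window

def extract_features(path: str) -> np.ndarray:
    y, _ = librosa.load(path, sr=SR, mono=True)        # resample to 44.1 kHz
    spec = librosa.stft(y, n_fft=N_FFT, hop_length=HOP)
    return np.log1p(np.abs(spec))                      # log-magnitude frames
```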

A CVAE model tailored for animal and synthetic voices

The core of the system is a Conditional Variational Autoencoder (CVAE), a type of deep learning architecture suited for style conversion. This model separates the content of the original audio—what is being said—from the style of the target sound, such as the tone or energy pattern of a lion’s roar or a fantasy monster voice.
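
Conceptually, the model encodes the input into a latent content code and conditions generation on a separate style vector. The toy sketch below shows that split in a few lines of PyTorch; the layer sizes and conditioning scheme are assumptions for illustration, not the authors' architecture.

```python
# Toy conditional VAE illustrating the content/style split. Layer sizes
# and the conditioning scheme are assumptions, not the authors' design.
import torch
import torch.nn as nn

class TinyCVAE(nn.Module):
    def __init__(self, feat_dim=1025, style_dim=128, z_dim=64):
        super().__init__()
        self.enc = nn.Linear(feat_dim, 2 * z_dim)           # content encoder
        self.dec = nn.Linear(z_dim + style_dim, feat_dim)   # style-conditioned decoder

    def forward(self, x, style):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        x_hat = self.dec(torch.cat([z, style], dim=-1))
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        return x_hat, kl
```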

One notable architectural improvement is the selective use of the style vector. Rather than feeding style into every stage of the model, it is applied only to the prior network and the flow module. This prevents the redundancy and potential interference that occur when style is overused. Additionally, energy-based prosody features are used instead of pitch, since many non-human sounds lack a clearly defined pitch or harmonic structure.
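
Energy-based prosody is simple to compute compared with pitch tracking. Here is a minimal sketch of a frame-level log-RMS contour, reusing the assumed frame settings from above:

```python
# Frame-level log-RMS energy as a pitch-free prosody feature; frame and
# hop sizes reuse the assumed 2048/220 setup from the earlier sketch.
import numpy as np

def energy_contour(y: np.ndarray, frame: int = 2048, hop: int = 220) -> np.ndarray:
    windows = np.lib.stride_tricks.sliding_window_view(y, frame)[::hop]
    rms = np.sqrt((windows ** 2).mean(axis=1))   # per-frame RMS
    return np.log(rms + 1e-8)                    # eps avoids log(0) on silence
```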

The model also includes a reconstruction loss specifically designed for transient-rich audio, called Frequency Domain Reconstruction Loss (FDRL). This helps ensure that rapid acoustic changes, like those found in growls or screeches, are accurately reproduced in the final output.
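
The article does not spell out the FDRL formulation, so the sketch below substitutes a related, widely used objective: a multi-resolution STFT magnitude loss, which similarly penalizes errors on fast spectral changes. Treat it as an analogy rather than the paper's definition.

```python
# Stand-in for FDRL (exact formula not given in the article): a
# multi-resolution STFT magnitude loss. Assumes x and y are
# same-length waveform tensors.
import torch

def multires_stft_loss(x, y, fft_sizes=(512, 1024, 2048)):
    loss = 0.0
    for n_fft in fft_sizes:
        win = torch.hann_window(n_fft, device=x.device)
        X = torch.stft(x, n_fft, n_fft // 4, window=win, return_complex=True).abs()
        Y = torch.stft(y, n_fft, n_fft // 4, window=win, return_complex=True).abs()
        loss = loss + (X - Y).abs().mean()       # L1 on magnitude spectra
    return loss / len(fft_sizes)
```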

Training and evaluation

The researchers trained their model on over 82,000 audio clips. These included expressive human sounds like laughter and screams, synthetic voices for fictional characters, and natural animal sounds sourced from professional sound libraries. Training was conducted on high-performance GPUs using a mix of adversarial, variational, and feature-matching losses to balance audio realism and linguistic clarity.
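
As a rough sketch, the overall objective could be a weighted sum of those terms; the weights below are placeholders, not values from the paper.

```python
# One plausible way the loss families combine; the weights are
# placeholders, since the article only says the mix balances
# audio realism and linguistic clarity.
def total_loss(adv, kl, feat_match, recon,
               w_adv=1.0, w_kl=1.0, w_fm=2.0, w_rec=45.0):
    return w_adv * adv + w_kl * kl + w_fm * feat_match + w_rec * recon
```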

Performance was evaluated using both subjective and objective metrics. Human listeners rated the converted audio for quality, naturalness, and similarity on a five-point scale. Objectively, the model’s output was compared to reference recordings using energy correlation, root mean squared error, and recognition accuracy (via Whisper).
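
The objective side of that evaluation is straightforward to reproduce in outline, as in the following sketch. Only the metric names come from the article; the code itself is an illustration, and the Whisper call assumes the openai-whisper package.

```python
# Sketch of the objective metrics named above, using NumPy. The
# commented Whisper call assumes the openai-whisper package.
import numpy as np

def energy_correlation(e_ref: np.ndarray, e_out: np.ndarray) -> float:
    n = min(len(e_ref), len(e_out))
    return float(np.corrcoef(e_ref[:n], e_out[:n])[0, 1])

def rmse(e_ref: np.ndarray, e_out: np.ndarray) -> float:
    n = min(len(e_ref), len(e_out))
    return float(np.sqrt(((e_ref[:n] - e_out[:n]) ** 2).mean()))

# import whisper
# model = whisper.load_model("base")
# text = model.transcribe("converted.wav")["text"]   # compare to the source transcript
```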

The results showed clear advantages over three strong baselines: DDDM-VC, Diff-HierVC, and Free-VC. The new system scored higher on similarity (3.78), quality (3.16), and naturalness (3.16). It also achieved better preservation of energy contours and lower word and character error rates, indicating improved retention of speech content.


Why preprocessing matters

An ablation study showed that replacing the new preprocessing pipeline with a conventional, speech-only one significantly reduced performance. Similarity and naturalness scores dropped, and transcription error rates increased. This confirms that non-human voice conversion needs its own dedicated processing approach and cannot rely on assumptions valid only for human speech.

Another experiment tested the effect of injecting style information into the decoder. While this change slightly improved perceived naturalness, it caused a drop in similarity, suggesting that it confused the model about which acoustic features were essential to the target voice.

Applications and significance

This work has strong implications for industries that require large volumes of stylized audio. In game development, for instance, character voices can now be generated programmatically rather than recorded or manually designed. Film and animation studios can apply this model to create realistic animal reactions or fantastical voice effects without relying on expensive foley work.

Beyond entertainment, this technology could support more expressive voice agents or accessibility tools, such as converting speech into animal-themed responses for children, or enabling more emotionally resonant avatars in virtual spaces.


