What if your voice could become a lion’s roar or a goblin’s snarl with AI?

A South Korean research team built a machine learning model that converts speech into animal and fantasy sounds with high realism.

by Kerem Gülen
June 2, 2025
in Research

A team of researchers from NC AI Co., Ltd and Sogang University in South Korea (Minsu Kang, Seolhee Lee, Choonghyeon Lee, and Namhyun Cho) has developed a machine learning model that can convert human speech into a wide range of non-human sounds. Their work addresses a growing need in gaming, film, and interactive media: how to automatically generate expressive animal sounds and fantasy voices from human input, with high audio fidelity and realistic style.

Previous research in this area has focused mostly on converting speech to dog barks, and often operated at lower sampling rates like 16 or 22kHz. This limited their usefulness in applications requiring richer timbres and broader frequency ranges. In contrast, the new system by Kang and colleagues is designed to work at 44.1kHz, capturing subtle acoustic cues and transient signals typical of bird calls, lion growls, or synthetic character voices like goblins or orcs.

Rethinking voice conversion for non-human sounds

Traditional voice conversion models are trained on structured human speech and are good at replicating phonemes, tone, and cadence from one speaker to another. However, they struggle with non-human audio that may lack clear linguistic content or follow very different time-frequency dynamics. For example, a bird chirp contains more high-frequency components than typical human speech, while a growl involves fast, irregular changes in pitch and energy.


To handle these challenges, the researchers built a new preprocessing pipeline that goes beyond standard speech processing. Audio is sampled at 44.1kHz to retain fidelity, and a Short-Time Fourier Transform with a short 5ms hop is used to extract fine temporal features. This setup allows the model to better capture the transient, erratic nature of non-human sounds while still retaining the intelligibility of human input.
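For readers who want a concrete picture, here is a minimal sketch of that preprocessing step in Python (PyTorch). Only the 44.1kHz rate and the roughly 5ms hop come from the article; the FFT and window sizes are assumptions for illustration.

```python
# Minimal STFT preprocessing sketch. SAMPLE_RATE and the ~5 ms hop follow
# the article; N_FFT and WIN_LENGTH are assumed values, not the paper's.
import torch

SAMPLE_RATE = 44_100                   # 44.1 kHz input audio
HOP_LENGTH = int(0.005 * SAMPLE_RATE)  # ~5 ms hop -> 220 samples
N_FFT = 2048                           # assumed FFT size
WIN_LENGTH = 1024                      # assumed window length

def stft_magnitudes(waveform: torch.Tensor) -> torch.Tensor:
    """Magnitude spectrogram with fine temporal resolution, suited to
    the transient, erratic sounds the article describes."""
    spec = torch.stft(
        waveform,
        n_fft=N_FFT,
        hop_length=HOP_LENGTH,
        win_length=WIN_LENGTH,
        window=torch.hann_window(WIN_LENGTH),
        return_complex=True,
    )
    return spec.abs()

# One second of audio yields roughly 200 frames: shape (1025, 201) here.
print(stft_magnitudes(torch.randn(SAMPLE_RATE)).shape)
```

The short hop is the key choice: at 220 samples per frame step, a 10ms growl onset spans several frames instead of being averaged away.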

A CVAE model tailored for animal and synthetic voices

The core of the system is a Conditional Variational Autoencoder (CVAE), a type of deep learning architecture suited for style conversion. This model separates the content of the original audio—what is being said—from the style of the target sound, such as the tone or energy pattern of a lion’s roar or a fantasy monster voice.
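As a rough illustration of that separation, the toy CVAE skeleton below encodes content into a latent distribution while a style vector shapes only the prior. Every layer size and module choice here is an assumption for clarity, not the authors' architecture.

```python
# Toy CVAE illustrating the content/style split. All dimensions are
# assumed; the real model is far richer (flows, GAN training, etc.).
import torch
import torch.nn as nn

class ToyCVAE(nn.Module):
    def __init__(self, feat_dim=1025, latent_dim=64, style_dim=128):
        super().__init__()
        self.content_enc = nn.Linear(feat_dim, 2 * latent_dim)  # -> mu, logvar
        self.prior_net = nn.Linear(style_dim, 2 * latent_dim)   # style-conditioned prior
        self.decoder = nn.Linear(latent_dim, feat_dim)           # no style input here

    def forward(self, spec_frame, style_vec):
        mu, logvar = self.content_enc(spec_frame).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()     # reparameterize
        prior_mu, prior_logvar = self.prior_net(style_vec).chunk(2, dim=-1)
        return self.decoder(z), (mu, logvar), (prior_mu, prior_logvar)
```

Note that the decoder deliberately receives no style vector, anticipating the selective injection discussed next.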

One notable architectural improvement is the selective use of the style vector. Rather than feeding style into every stage of the model, it is only applied to the prior network and the flow module. This prevents the redundancy and potential interference that occurs when style is overused. Additionally, energy-based prosody features are used instead of pitch, since many non-human sounds don’t have a clearly defined pitch or harmonic structure.
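A per-frame RMS energy contour is one simple way to realize such a pitch-free prosody feature; this sketch reuses the assumed frame sizes from the preprocessing example above.

```python
# Frame-level RMS energy: well-defined even for aperiodic sounds
# (growls, screeches) where pitch trackers fail. Sizes are assumptions.
import torch

def frame_energy(waveform: torch.Tensor,
                 frame_length: int = 1024,
                 hop_length: int = 220) -> torch.Tensor:
    frames = waveform.unfold(0, frame_length, hop_length)  # (num_frames, frame_length)
    return frames.pow(2).mean(dim=-1).sqrt()               # (num_frames,)

print(frame_energy(torch.randn(44_100)).shape)  # ~196 frames for 1 s of audio
```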

The model also includes a reconstruction loss specifically designed for transient-rich audio, called Frequency Domain Reconstruction Loss (FDRL). This helps ensure that rapid acoustic changes, like those found in growls or screeches, are accurately reproduced in the final output.
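The article does not spell out the FDRL formula, so the following is only a plausible stand-in: a multi-resolution STFT magnitude loss, which penalizes errors in rapid acoustic changes at several time scales.

```python
# Assumed FDRL stand-in: L1 distance between magnitude spectra at
# multiple FFT resolutions. The paper's exact definition may differ.
import torch
import torch.nn.functional as F

def freq_domain_recon_loss(pred: torch.Tensor, target: torch.Tensor,
                           fft_sizes=(512, 1024, 2048)) -> torch.Tensor:
    loss = torch.tensor(0.0)
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft)
        p = torch.stft(pred, n_fft, n_fft // 4, window=window,
                       return_complex=True).abs()
        t = torch.stft(target, n_fft, n_fft // 4, window=window,
                       return_complex=True).abs()
        loss = loss + F.l1_loss(p, t)
    return loss / len(fft_sizes)
```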

Training and evaluation

The researchers trained their model on over 82,000 audio clips. These included expressive human sounds like laughter and screams, synthetic voices for fictional characters, and natural animal sounds sourced from professional sound libraries. Training was conducted on high-performance GPUs using a mix of adversarial, variational, and feature-matching losses to balance audio realism and linguistic clarity.
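The article names the loss families but not how they are weighted; a schematic composition, with weights picked arbitrarily for illustration (loosely echoing common GAN-vocoder practice), might look like this:

```python
# Hypothetical weighting of the loss terms the article mentions.
def generator_loss(adv, kl, feat_match, recon,
                   w_adv=1.0, w_kl=1.0, w_fm=2.0, w_recon=45.0):
    """adv: adversarial term, kl: variational (KL) term,
    feat_match: discriminator feature matching, recon: FDRL."""
    return w_adv * adv + w_kl * kl + w_fm * feat_match + w_recon * recon
```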

Performance was evaluated using both subjective and objective metrics. Human listeners rated the converted audio for quality, naturalness, and similarity on a five-point scale. Objectively, the model’s output was compared to reference recordings using energy correlation, root mean squared error, and recognition accuracy (via Whisper).
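The objective side of that evaluation can be sketched as follows, assuming the open-source openai-whisper package and placeholder file names:

```python
# Sketch of the objective metrics: energy-contour correlation, RMSE,
# and Whisper transcription for error-rate scoring. File names are
# placeholders, not the paper's data.
import numpy as np
import whisper

def energy_correlation(e_ref: np.ndarray, e_out: np.ndarray) -> float:
    return float(np.corrcoef(e_ref, e_out)[0, 1])

def rmse(e_ref: np.ndarray, e_out: np.ndarray) -> float:
    return float(np.sqrt(np.mean((e_ref - e_out) ** 2)))

model = whisper.load_model("base")
hypothesis = model.transcribe("converted_clip.wav")["text"]
# Compare `hypothesis` with the source transcript to compute word and
# character error rates, as the researchers did.
```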

The results showed clear advantages over three strong baselines: DDDM-VC, Diff-HierVC, and FreeVC. The new system scored higher on similarity (3.78), quality (3.16), and naturalness (3.16). It also preserved energy contours better and achieved lower word and character error rates, indicating improved retention of speech content.


Why preprocessing matters

An ablation study showed that replacing the new preprocessing pipeline with a conventional, speech-only one significantly reduced performance. Similarity and naturalness scores dropped, and transcription error rates increased. This confirms that non-human voice conversion needs its own dedicated processing approach and cannot rely on assumptions valid only for human speech.

Another experiment tested the effect of injecting style information into the decoder. While this change slightly improved perceived naturalness, it caused a drop in similarity, suggesting that it confused the model about which acoustic features were essential to the target voice.

Applications and significance

This work has strong implications for industries that require large volumes of stylized audio. In game development, for instance, character voices can now be generated programmatically rather than recorded or manually designed. Film and animation studios can apply this model to create realistic animal reactions or fantastical voice effects without relying on expensive foley work.

Beyond entertainment, this technology could support more expressive voice agents or accessibility tools, such as converting speech into animal-themed responses for children, or enabling more emotionally resonant avatars in virtual spaces.



Tags: AI, voice


