What if your voice could become a lion’s roar or a goblin’s snarl with AI?

A South Korean research team built a machine learning model that converts speech into animal and fantasy sounds with high realism.

by Kerem Gülen
June 2, 2025
in Research

A team of researchers from NC AI Co., Ltd and Sogang University in South Korea—Minsu Kang, Seolhee Lee, Choonghyeon Lee, and Namhyun Cho—has developed a machine learning model that can convert human speech into a wide range of non-human sounds. Their work addresses a growing need in gaming, film, and interactive media: how to automatically generate expressive animal sounds and fantasy voices from human input, with high audio fidelity and realistic style.

Previous research in this area focused mostly on converting speech to dog barks and often operated at lower sampling rates such as 16 or 22 kHz, which limited its usefulness in applications requiring richer timbres and broader frequency ranges. In contrast, the new system by Kang and colleagues is designed to work at 44.1 kHz, capturing the subtle acoustic cues and transient signals typical of bird calls, lion growls, or synthetic character voices like goblins or orcs.

Rethinking voice conversion for non-human sounds

Traditional voice conversion models are trained on structured human speech and are good at replicating phonemes, tone, and cadence from one speaker to another. However, they struggle with non-human audio that may lack clear linguistic content or follow very different time-frequency dynamics. For example, a bird chirp contains more high-frequency components than typical human speech, while a growl involves fast, irregular changes in pitch and energy.

To handle these challenges, the researchers built a new preprocessing pipeline that goes beyond standard speech processing. Audio is sampled at 44.1 kHz to retain fidelity, and a Short-Time Fourier Transform (STFT) with a short 5 ms hop is used to extract fine temporal features. This setup allows the model to better capture the transient, erratic nature of non-human sounds while still retaining the intelligibility of human input.
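
To make this concrete, here is a minimal sketch of such a pipeline using librosa. Only the 44.1 kHz sample rate and the 5 ms hop come from the article; the FFT size, library choice, and file name are illustrative assumptions.

```python
import numpy as np
import librosa

SR = 44100                # 44.1 kHz sampling, as in the paper
HOP = int(0.005 * SR)     # 5 ms hop -> 220 samples

# Load at full bandwidth rather than the 16/22 kHz of earlier systems.
y, _ = librosa.load("input_voice.wav", sr=SR)

# Fine-grained STFT; n_fft=2048 is an illustrative choice.
spec = librosa.stft(y, n_fft=2048, hop_length=HOP)
mag = np.abs(spec)        # magnitude features, shape (1025, num_frames)
```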

A CVAE model tailored for animal and synthetic voices

The core of the system is a Conditional Variational Autoencoder (CVAE), a type of deep learning architecture suited for style conversion. This model separates the content of the original audio—what is being said—from the style of the target sound, such as the tone or energy pattern of a lion’s roar or a fantasy monster voice.

One notable architectural improvement is the selective use of the style vector. Rather than feeding style into every stage of the model, it is applied only to the prior network and the flow module. This avoids the redundancy and potential interference that arise when style is overused. Additionally, energy-based prosody features are used instead of pitch, since many non-human sounds lack a clearly defined pitch or harmonic structure.
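
The authors' code is not included in the article, but the conditioning scheme can be sketched in PyTorch: a CVAE-like model in which the style embedding reaches only the prior network and the flow, never the posterior encoder or decoder. All module shapes, layer choices, and names below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SelectivelyStyledCVAE(nn.Module):
    """Toy CVAE where style conditions only the prior and the flow."""

    def __init__(self, spec_dim=1025, latent_dim=192, style_dim=256):
        super().__init__()
        # Posterior encoder: sees only the source audio, no style input.
        self.posterior = nn.Conv1d(spec_dim, 2 * latent_dim, 5, padding=2)
        # Prior network and flow: the only modules that receive style.
        self.prior = nn.Conv1d(spec_dim + style_dim, 2 * latent_dim, 5, padding=2)
        self.flow = nn.Conv1d(latent_dim + style_dim, latent_dim, 5, padding=2)
        # Decoder: reconstructs audio features from the latent alone.
        self.decoder = nn.Conv1d(latent_dim, spec_dim, 5, padding=2)

    def forward(self, spec, style):
        # spec: (B, spec_dim, T); style: (B, style_dim)
        style_t = style.unsqueeze(-1).expand(-1, -1, spec.size(-1))
        mu, logvar = self.posterior(spec).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        z_styled = self.flow(torch.cat([z, style_t], dim=1))   # style enters here
        p_mu, p_logvar = self.prior(torch.cat([spec, style_t], dim=1)).chunk(2, dim=1)
        recon = self.decoder(z)                                # ...but never here
        return recon, (mu, logvar), (p_mu, p_logvar), z_styled
```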

The model also includes a reconstruction loss specifically designed for transient-rich audio, called Frequency Domain Reconstruction Loss (FDRL). This helps ensure that rapid acoustic changes, like those found in growls or screeches, are accurately reproduced in the final output.
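
The article names the loss but not its formula. A common way to penalize errors on fast transients is a multi-resolution STFT magnitude loss; the sketch below is a plausible stand-in for FDRL, not the paper's exact definition.

```python
import torch

def multi_res_stft_loss(pred, target,
                        resolutions=((512, 110), (1024, 220), (2048, 441))):
    """L1 distance between STFT magnitudes at several resolutions.

    pred, target: (batch, num_samples) waveforms at 44.1 kHz. The hop
    sizes correspond to roughly 2.5/5/10 ms, echoing the short 5 ms hop
    used in preprocessing; all values here are illustrative.
    """
    loss = 0.0
    for n_fft, hop in resolutions:
        window = torch.hann_window(n_fft, device=pred.device)
        spec_p = torch.stft(pred, n_fft, hop_length=hop, window=window,
                            return_complex=True).abs()
        spec_t = torch.stft(target, n_fft, hop_length=hop, window=window,
                            return_complex=True).abs()
        loss = loss + (spec_p - spec_t).abs().mean()
    return loss / len(resolutions)
```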

Training and evaluation

The researchers trained their model on over 82,000 audio clips. These included expressive human sounds like laughter and screams, synthetic voices for fictional characters, and natural animal sounds sourced from professional sound libraries. Training was conducted on high-performance GPUs using a mix of adversarial, variational, and feature-matching losses to balance audio realism and linguistic clarity.
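
The article does not say how these terms are combined; in GAN-based voice conversion they are typically summed with fixed weights, along these lines (the weights below are placeholders, not values from the paper).

```python
import torch

def generator_loss(loss_adv: torch.Tensor, loss_kl: torch.Tensor,
                   loss_fm: torch.Tensor, loss_recon: torch.Tensor,
                   w_kl: float = 1.0, w_fm: float = 2.0,
                   w_recon: float = 45.0) -> torch.Tensor:
    """Weighted sum of the adversarial, variational (KL), feature-matching,
    and reconstruction terms that drive the generator update."""
    return loss_adv + w_kl * loss_kl + w_fm * loss_fm + w_recon * loss_recon
```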

Performance was evaluated using both subjective and objective metrics. Human listeners rated the converted audio for quality, naturalness, and similarity on a five-point scale. Objectively, the model’s output was compared to reference recordings using energy correlation, root mean squared error, and recognition accuracy (via Whisper).
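
The objective side of this evaluation can be approximated with off-the-shelf tools, as in the sketch below. The article confirms Whisper was used for recognition accuracy; the jiwer library, file names, and reference transcript are placeholder assumptions, and the paper's exact metric definitions may differ.

```python
import numpy as np
import librosa
import whisper  # openai-whisper
import jiwer

def frame_energy(path, sr=44100, hop=220):  # ~5 ms hop at 44.1 kHz
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.rms(y=y, hop_length=hop)[0]

ref_e = frame_energy("reference.wav")
out_e = frame_energy("converted.wav")
n = min(len(ref_e), len(out_e))

energy_corr = np.corrcoef(ref_e[:n], out_e[:n])[0, 1]  # energy correlation
rmse = float(np.sqrt(np.mean((ref_e[:n] - out_e[:n]) ** 2)))

# Recognition accuracy: transcribe the converted audio, then score it.
model = whisper.load_model("base")
hypothesis = model.transcribe("converted.wav")["text"]
reference = "the sentence the speaker actually said"  # placeholder transcript
word_err = jiwer.wer(reference, hypothesis)
char_err = jiwer.cer(reference, hypothesis)
```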

The results showed clear advantages over three strong baselines: DDDM-VC, Diff-HierVC, and FreeVC. The new system scored higher on similarity (3.78), quality (3.16), and naturalness (3.16). It also achieved better preservation of energy contours and lower word and character error rates, indicating improved retention of speech content.


Why preprocessing matters

An ablation study showed that replacing the new preprocessing pipeline with a conventional, speech-only one significantly reduced performance. Similarity and naturalness scores dropped, and transcription error rates increased. This confirms that non-human voice conversion needs its own dedicated processing approach and cannot rely on assumptions valid only for human speech.

Another experiment tested the effect of injecting style information into the decoder. While this change slightly improved perceived naturalness, it caused a drop in similarity, suggesting that it confused the model about which acoustic features were essential to the target voice.

Applications and significance

This work has strong implications for industries that require large volumes of stylized audio. In game development, for instance, character voices can now be generated programmatically rather than recorded or manually designed. Film and animation studios can apply this model to create realistic animal reactions or fantastical voice effects without relying on expensive Foley work.

Beyond entertainment, this technology could support more expressive voice agents or accessibility tools, such as converting speech into animal-themed responses for children, or enabling more emotionally resonant avatars in virtual spaces.


Tags: AI, voice
