Dataconomy

What if your voice could become a lion’s roar or a goblin’s snarl with AI?

A South Korean research team built a machine learning model that converts speech into animal and fantasy sounds with high realism.

by Kerem Gülen
June 2, 2025
in Research

A team of researchers from NC AI Co., Ltd and Sogang University in South Korea (Minsu Kang, Seolhee Lee, Choonghyeon Lee, and Namhyun Cho) has developed a machine learning model that can convert human speech into a wide range of non-human sounds. Their work addresses a growing need in gaming, film, and interactive media: how to automatically generate expressive animal sounds and fantasy voices from human input, with high audio fidelity and realistic style.

Previous research in this area has focused mostly on converting speech to dog barks, and often operated at lower sampling rates such as 16 kHz or 22 kHz. This limited the usefulness of those systems in applications requiring richer timbres and broader frequency ranges. In contrast, the new system by Kang and colleagues is designed to work at 44.1 kHz, capturing subtle acoustic cues and transient signals typical of bird calls, lion growls, or synthetic character voices like goblins or orcs.
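
As a rough illustration of why the sampling rate matters: a sampled signal can only represent frequencies up to half its sampling rate (the Nyquist limit). The short sketch below computes that ceiling for the rates mentioned above. The function name is illustrative and not taken from the researchers' code.

```python
def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency a signal sampled at this rate can represent."""
    return sample_rate_hz / 2


if __name__ == "__main__":
    # Sampling rates mentioned in the article, in Hz.
    for rate in (16_000, 22_000, 44_100):
        print(f"{rate} Hz sampling -> content up to {nyquist_hz(rate):.0f} Hz")
```

At 16 kHz, anything above 8 kHz is lost, which is where much of the detail in chirps and screeches can sit.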

Rethinking voice conversion for non-human sounds

Traditional voice conversion models are trained on structured human speech and are good at replicating phonemes, tone, and cadence from one speaker to another. However, they struggle with non-human audio that may lack clear linguistic content or follow very different time-frequency dynamics. For example, a bird chirp contains more high-frequency components than typical human speech, while a growl involves fast, irregular changes in pitch and energy.

To handle these challenges, the researchers built a new preprocessing pipeline that goes beyond standard speech processing. Audio is sampled at 44.1 kHz to retain fidelity, and a Short-Time Fourier Transform (STFT) with a short 5 ms hop is used to extract fine temporal features. This setup allows the model to better capture the transient, erratic nature of non-human sounds while still retaining the intelligibility of human input.
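
To make the numbers concrete, here is a minimal sketch of an STFT front end using only the Python standard library. The 44.1 kHz rate and 5 ms hop come from the article. The window length, the Hann window, and all function names are assumptions for illustration, and the naive DFT is chosen for readability, not speed.

```python
import cmath
import math

SAMPLE_RATE = 44_100   # Hz, as stated in the article
HOP_MS = 5             # hop duration in milliseconds, as stated in the article
WIN_LENGTH = 256       # assumed analysis window in samples, kept small for the demo


def hop_length(sample_rate: int, hop_ms: int) -> int:
    """Hop size in whole samples (44.1 kHz x 5 ms = 220.5, floored to 220)."""
    return sample_rate * hop_ms // 1000


def hann(n: int) -> list[float]:
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft_magnitudes(signal, win_length, hop):
    """Magnitude spectrogram via a naive DFT: one list of bins per frame."""
    window = hann(win_length)
    bins = win_length // 2 + 1
    frames = []
    for start in range(0, len(signal) - win_length + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(win_length)]
        frames.append([
            abs(sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / win_length)
                    for n in range(win_length)))
            for k in range(bins)
        ])
    return frames


def peak_bin(frame):
    """Index of the strongest frequency bin in one frame."""
    return max(range(len(frame)), key=frame.__getitem__)


def make_tone(bin_index, win_length, num_samples):
    """Sine wave whose frequency sits exactly on the given DFT bin."""
    return [math.sin(2 * math.pi * bin_index * n / win_length)
            for n in range(num_samples)]


if __name__ == "__main__":
    hop = hop_length(SAMPLE_RATE, HOP_MS)
    tone = make_tone(20, WIN_LENGTH, 882)  # about 20 ms of audio
    spec = stft_magnitudes(tone, WIN_LENGTH, hop)
    print(f"hop = {hop} samples, frames = {len(spec)}")
    print("peak bin per frame:", [peak_bin(f) for f in spec])
```

With a 220-sample hop, each new frame advances by just under 5 ms, so a short burst such as a bark onset is spread across several frames instead of being averaged into one.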

A CVAE model tailored for animal and synthetic voices

The core of the system is a Conditional Variational Autoencoder (CVAE), a type of deep learning architecture suited for style conversion. This model separates the content of the original audio—what is being said—from the style of the target sound, such as the tone or energy pattern of a lion’s roar or a fantasy monster voice.

One notable architectural improvement is the selective use of the style vector. Rather than feeding style into every stage of the model, it is only applied to the prior network and the flow module. This prevents the redundancy and potential interference that occurs when style is overused. Additionally, energy-based prosody features are used instead of pitch, since many non-human sounds don’t have a clearly defined pitch or harmonic structure.
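
The article does not specify how the energy features are computed. One common choice, assumed here purely for illustration, is the root-mean-square (RMS) energy of each analysis frame. The function name and the frame sizes in the demo are hypothetical.

```python
import math


def frame_energy(signal, win_length, hop):
    """RMS energy of each frame; an assumed stand-in for the paper's energy feature."""
    energies = []
    for start in range(0, len(signal) - win_length + 1, hop):
        frame = signal[start:start + win_length]
        energies.append(math.sqrt(sum(x * x for x in frame) / win_length))
    return energies


if __name__ == "__main__":
    # A quiet stretch followed by a loud one, as in the onset of a growl.
    quiet = [0.1] * 400
    loud = [0.8] * 400
    contour = frame_energy(quiet + loud, win_length=200, hop=100)
    print([round(e, 3) for e in contour])
```

Unlike pitch, this contour is defined for any sound, including noisy or inharmonic ones where a fundamental frequency cannot be estimated reliably.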

The model also includes a reconstruction loss specifically designed for transient-rich audio, called Frequency Domain Reconstruction Loss (FDRL). This helps ensure that rapid acoustic changes, like those found in growls or screeches, are accurately reproduced in the final output.
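
The article does not give the formula for FDRL, so the following is not the authors' loss. It is a generic sketch of the broader idea of comparing signals in the frequency domain: the mean absolute difference between frame-wise magnitude spectra. The function names, window length, and hop are all assumptions.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of the non-negative frequency bins of one frame (naive DFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_l1_loss(predicted, target, win_length=64, hop=16):
    """Mean absolute difference between frame-wise magnitude spectra."""
    if len(predicted) != len(target):
        raise ValueError("signals must have the same length")
    total, count = 0.0, 0
    for start in range(0, len(target) - win_length + 1, hop):
        p = magnitude_spectrum(predicted[start:start + win_length])
        t = magnitude_spectrum(target[start:start + win_length])
        total += sum(abs(a - b) for a, b in zip(p, t))
        count += len(t)
    if count == 0:
        raise ValueError("signals are shorter than one window")
    return total / count


if __name__ == "__main__":
    # A single click is an extreme transient: energy in every frequency bin.
    click = [0.0] * 256
    click[100] = 1.0
    silence = [0.0] * 256
    print("perfect reconstruction:", spectral_l1_loss(click, click))
    print("missed transient:     ", spectral_l1_loss(silence, click))
```

A short window with a small hop, as used here, keeps a brief transient from being averaged away, which is the property a transient-focused loss needs.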

Training and evaluation

The researchers trained their model on over 82,000 audio clips. These included expressive human sounds like laughter and screams, synthetic voices for fictional characters, and natural animal sounds sourced from professional sound libraries. Training was conducted on high-performance GPUs using a mix of adversarial, variational, and feature-matching losses to balance audio realism and linguistic clarity.

Performance was evaluated using both subjective and objective metrics. Human listeners rated the converted audio for quality, naturalness, and similarity on a five-point scale. Objectively, the model’s output was compared to reference recordings using energy correlation, root mean squared error, and recognition accuracy (via Whisper).
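
The two signal-level metrics can be sketched as follows. This assumes they are computed between per-frame energy contours of the converted and reference audio, which the article does not spell out. The function names and sample values are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("sequences must be non-empty and the same length")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a flat contour; report 0
    return cov / math.sqrt(var_x * var_y)


def rmse(x, y):
    """Root mean squared error between two equal-length sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("sequences must be non-empty and the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


if __name__ == "__main__":
    reference = [0.1, 0.4, 0.9, 0.5, 0.2]   # made-up energy contour
    converted = [0.2, 0.5, 0.8, 0.5, 0.1]   # made-up energy contour
    print("energy correlation:", round(pearson(reference, converted), 3))
    print("RMSE:", round(rmse(reference, converted), 3))
```

Correlation rewards following the shape of the reference contour, while RMSE also penalizes differences in overall level.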

The results showed clear advantages over three strong baselines: DDDM-VC, Diff-HierVC, and FreeVC. The new system scored higher on similarity (3.78), quality (3.16), and naturalness (3.16). It also achieved better preservation of energy contours and lower word and character error rates, indicating improved retention of speech content.
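
Word and character error rates are typically computed as an edit distance between a reference transcript and the recognizer's output, divided by the reference length. The sketch below assumes that definition and takes plain strings as input. It does not call Whisper, and the example sentences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference transcript is empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    spoken = "the lion roars loudly"        # what the speaker said
    recognized = "the lion roars"           # hypothetical ASR output on converted audio
    print("WER:", wer(spoken, recognized))
    print("CER:", round(cer(spoken, recognized), 3))
```

Lower values mean the recognizer could still recover the original words from the converted audio, which is how the article uses these rates as a proxy for content retention.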


Why preprocessing matters

An ablation study showed that replacing the new preprocessing pipeline with a conventional, speech-only one significantly reduced performance. Similarity and naturalness scores dropped, and transcription error rates increased. This confirms that non-human voice conversion needs its own dedicated processing approach and cannot rely on assumptions valid only for human speech.

Another experiment tested the effect of injecting style information into the decoder. While this change slightly improved perceived naturalness, it caused a drop in similarity, suggesting that it confused the model about which acoustic features were essential to the target voice.

Applications and significance

This work has strong implications for industries that require large volumes of stylized audio. In game development, for instance, character voices can now be generated programmatically rather than recorded or manually designed. Film and animation studios can apply this model to create realistic animal reactions or fantastical voice effects without relying on expensive foley work.

Beyond entertainment, this technology could support more expressive voice agents or accessibility tools, such as converting speech into animal-themed responses for children, or enabling more emotionally resonant avatars in virtual spaces.


COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.
