What if your voice could become a lion’s roar or a goblin’s snarl with AI?

A South Korean research team built a machine learning model that converts speech into animal and fantasy sounds with high realism.

by Kerem Gülen
June 2, 2025
in Research

A team of researchers from NC AI Co., Ltd and Sogang University in South Korea—Minsu Kang, Seolhee Lee, Choonghyeon Lee, and Namhyun Cho—has developed a machine learning model that can convert human speech into a wide range of non-human sounds. Their work addresses a growing need in gaming, film, and interactive media: how to automatically generate expressive animal sounds and fantasy voices from human input, with high audio fidelity and realistic style.

Previous research in this area focused mostly on converting speech to dog barks and often operated at lower sampling rates such as 16 or 22 kHz, which limited its usefulness in applications requiring richer timbres and broader frequency ranges. In contrast, the new system by Kang and colleagues is designed to work at 44.1 kHz, capturing the subtle acoustic cues and transient signals typical of bird calls, lion growls, or synthetic character voices like goblins or orcs.

Rethinking voice conversion for non-human sounds

Traditional voice conversion models are trained on structured human speech and are good at replicating phonemes, tone, and cadence from one speaker to another. However, they struggle with non-human audio that may lack clear linguistic content or follow very different time-frequency dynamics. For example, a bird chirp contains more high-frequency components than typical human speech, while a growl involves fast, irregular changes in pitch and energy.

To handle these challenges, the researchers built a new preprocessing pipeline that goes beyond standard speech processing. Audio is sampled at 44.1 kHz to retain fidelity, and a Short-Time Fourier Transform (STFT) with a short 5 ms hop is used to extract fine temporal features. This setup allows the model to better capture the transient, erratic nature of non-human sounds while still retaining the intelligibility of human input.
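
To make this concrete, here is a minimal sketch of such a pipeline using librosa. Only the 44.1 kHz sample rate and the 5 ms hop come from the article; the FFT size, library choice, and file name are illustrative assumptions.

```python
import numpy as np
import librosa

SR = 44100                # 44.1 kHz sampling, as in the paper
HOP = int(0.005 * SR)     # 5 ms hop -> 220 samples

# Load at full bandwidth rather than the 16/22 kHz of earlier systems.
y, _ = librosa.load("input_voice.wav", sr=SR)

# Fine-grained STFT; n_fft=2048 is an illustrative choice.
spec = librosa.stft(y, n_fft=2048, hop_length=HOP)
mag = np.abs(spec)        # magnitude features, shape (1025, num_frames)
```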

A CVAE model tailored for animal and synthetic voices

The core of the system is a Conditional Variational Autoencoder (CVAE), a type of deep learning architecture suited for style conversion. This model separates the content of the original audio—what is being said—from the style of the target sound, such as the tone or energy pattern of a lion’s roar or a fantasy monster voice.

One notable architectural improvement is the selective use of the style vector. Rather than feeding style into every stage of the model, it is applied only to the prior network and the flow module. This avoids the redundancy and potential interference that arise when style is overused. Additionally, energy-based prosody features are used instead of pitch, since many non-human sounds lack a clearly defined pitch or harmonic structure.
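
The authors' code is not included in the article, but the conditioning scheme can be sketched in PyTorch: a CVAE-like model in which the style embedding reaches only the prior network and the flow, never the posterior encoder or decoder. All module shapes, layer choices, and names below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SelectivelyStyledCVAE(nn.Module):
    """Toy CVAE where style conditions only the prior and the flow."""

    def __init__(self, spec_dim=1025, latent_dim=192, style_dim=256):
        super().__init__()
        # Posterior encoder: sees only the source audio, no style input.
        self.posterior = nn.Conv1d(spec_dim, 2 * latent_dim, 5, padding=2)
        # Prior network and flow: the only modules that receive style.
        self.prior = nn.Conv1d(spec_dim + style_dim, 2 * latent_dim, 5, padding=2)
        self.flow = nn.Conv1d(latent_dim + style_dim, latent_dim, 5, padding=2)
        # Decoder: reconstructs audio features from the latent alone.
        self.decoder = nn.Conv1d(latent_dim, spec_dim, 5, padding=2)

    def forward(self, spec, style):
        # spec: (B, spec_dim, T); style: (B, style_dim)
        style_t = style.unsqueeze(-1).expand(-1, -1, spec.size(-1))
        mu, logvar = self.posterior(spec).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        z_styled = self.flow(torch.cat([z, style_t], dim=1))   # style enters here
        p_mu, p_logvar = self.prior(torch.cat([spec, style_t], dim=1)).chunk(2, dim=1)
        recon = self.decoder(z)                                # ...but never here
        return recon, (mu, logvar), (p_mu, p_logvar), z_styled
```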

The model also includes a reconstruction loss specifically designed for transient-rich audio, called Frequency Domain Reconstruction Loss (FDRL). This helps ensure that rapid acoustic changes, like those found in growls or screeches, are accurately reproduced in the final output.
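
The article names the loss but not its formula. A common way to penalize errors on fast transients is a multi-resolution STFT magnitude loss; the sketch below is a plausible stand-in for FDRL, not the paper's exact definition.

```python
import torch

def multi_res_stft_loss(pred, target,
                        resolutions=((512, 110), (1024, 220), (2048, 441))):
    """L1 distance between STFT magnitudes at several resolutions.

    pred, target: (batch, num_samples) waveforms at 44.1 kHz. The hop
    sizes correspond to roughly 2.5/5/10 ms, echoing the short 5 ms hop
    used in preprocessing; all values here are illustrative.
    """
    loss = 0.0
    for n_fft, hop in resolutions:
        window = torch.hann_window(n_fft, device=pred.device)
        spec_p = torch.stft(pred, n_fft, hop_length=hop, window=window,
                            return_complex=True).abs()
        spec_t = torch.stft(target, n_fft, hop_length=hop, window=window,
                            return_complex=True).abs()
        loss = loss + (spec_p - spec_t).abs().mean()
    return loss / len(resolutions)
```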

Training and evaluation

The researchers trained their model on over 82,000 audio clips. These included expressive human sounds like laughter and screams, synthetic voices for fictional characters, and natural animal sounds sourced from professional sound libraries. Training was conducted on high-performance GPUs using a mix of adversarial, variational, and feature-matching losses to balance audio realism and linguistic clarity.
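
The article does not say how these terms are combined; in GAN-based voice conversion they are typically summed with fixed weights, along these lines (the weights below are placeholders, not values from the paper).

```python
import torch

def generator_loss(loss_adv: torch.Tensor, loss_kl: torch.Tensor,
                   loss_fm: torch.Tensor, loss_recon: torch.Tensor,
                   w_kl: float = 1.0, w_fm: float = 2.0,
                   w_recon: float = 45.0) -> torch.Tensor:
    """Weighted sum of the adversarial, variational (KL), feature-matching,
    and reconstruction terms that drive the generator update."""
    return loss_adv + w_kl * loss_kl + w_fm * loss_fm + w_recon * loss_recon
```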

Performance was evaluated using both subjective and objective metrics. Human listeners rated the converted audio for quality, naturalness, and similarity on a five-point scale. Objectively, the model’s output was compared to reference recordings using energy correlation, root mean squared error, and recognition accuracy (via Whisper).
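
The objective side of this evaluation can be approximated with off-the-shelf tools, as in the sketch below. The article confirms Whisper was used for recognition accuracy; the jiwer library, file names, and reference transcript are placeholder assumptions, and the paper's exact metric definitions may differ.

```python
import numpy as np
import librosa
import whisper  # openai-whisper
import jiwer

def frame_energy(path, sr=44100, hop=220):  # ~5 ms hop at 44.1 kHz
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.rms(y=y, hop_length=hop)[0]

ref_e = frame_energy("reference.wav")
out_e = frame_energy("converted.wav")
n = min(len(ref_e), len(out_e))

energy_corr = np.corrcoef(ref_e[:n], out_e[:n])[0, 1]  # energy correlation
rmse = float(np.sqrt(np.mean((ref_e[:n] - out_e[:n]) ** 2)))

# Recognition accuracy: transcribe the converted audio, then score it.
model = whisper.load_model("base")
hypothesis = model.transcribe("converted.wav")["text"]
reference = "the sentence the speaker actually said"  # placeholder transcript
word_err = jiwer.wer(reference, hypothesis)
char_err = jiwer.cer(reference, hypothesis)
```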

The results showed clear advantages over three strong baselines: DDDM-VC, Diff-HierVC, and FreeVC. The new system scored higher on similarity (3.78), quality (3.16), and naturalness (3.16). It also achieved better preservation of energy contours and lower word and character error rates, indicating improved retention of speech content.


Why preprocessing matters

An ablation study showed that replacing the new preprocessing pipeline with a conventional, speech-only one significantly reduced performance. Similarity and naturalness scores dropped, and transcription error rates increased. This confirms that non-human voice conversion needs its own dedicated processing approach and cannot rely on assumptions valid only for human speech.

Another experiment tested the effect of injecting style information into the decoder. While this change slightly improved perceived naturalness, it caused a drop in similarity, suggesting that it confused the model about which acoustic features were essential to the target voice.

Applications and significance

This work has strong implications for industries that require large volumes of stylized audio. In game development, for instance, character voices can now be generated programmatically rather than recorded or manually designed. Film and animation studios can apply this model to create realistic animal reactions or fantastical voice effects without relying on expensive Foley work.

Beyond entertainment, this technology could support more expressive voice agents or accessibility tools, such as converting speech into animal-themed responses for children, or enabling more emotionally resonant avatars in virtual spaces.


Tags: AI, voice
