Researchers at Carnegie Mellon University are studying how people interact with artificial intelligence agents that mimic physical presence through audio, work that may influence the development of audio-only AI systems for smart glasses and accessibility tools. The team, from the School of Computer Science and the Department of Psychology, created an interface for chatbots that relies solely on audio cues to enhance user engagement.
David Lindlbauer, an assistant professor in the Human-Computer Interaction Institute, noted that the research explores how making AI sound more human can change user interactions. “The question becomes, ‘If I had an AI assistant, what would happen if I made the audio component more like an actual human?’” he said. The findings surprised the researchers.
The researchers used spatialization and Foley effects to create the interface. Spatialization allows users to perceive the AI’s sound as coming from specific locations in the room, while Foley effects are realistic sound effects that enhance immersion, such as typing and pouring water. Laurie Heller, a psychology professor, emphasized the necessity of these effects for creating a believable experience, stating, “If they aren’t part of the movie soundtrack, it doesn’t seem realistic.”
In the experimental setup, participants interacted with AI agents using varying combinations of spatial and Foley effects while familiarizing themselves with a room. After speaking with the AI, participants completed questionnaires to report their experiences. Lindlbauer reported statistically significant results showing increased user engagement when spatial and Foley effects were used.
The study revealed an unexpected effect: participants began to expect the AI to adhere to human social norms. “As soon as the participants felt like their agent was engaged in something else, they considered this rude,” Lindlbauer said. That impression arose because the automated Foley effects were not directly tied to the conversation.
Cheng suggested making audio cues more context-aware to reduce the impression that the agent is distracted during interactions. Lindlbauer believed that the final audio system could feature effects that are independent of specific environments without losing their engagement-boosting qualities. Participants might react visually to sounds, such as by looking toward a screen when they heard typing, but this did not diminish the overall experience.
The researchers presented their findings at the Association for Computing Machinery Conference on Human Factors in Computing Systems (CHI 2026) in Barcelona. The study’s results are documented in the conference proceedings.





