This article originally appeared on VentureBeat and is reproduced with permission.
Even state-of-the-art automatic speech recognition (ASR) algorithms struggle to recognize the accents of people from certain regions of the world. That’s the top-line finding of a new study published by researchers at the University of Amsterdam, the Netherlands Cancer Institute, and the Delft University of Technology, which found that an ASR system for the Dutch language recognized speakers of specific age groups, genders, and countries of origin better than others.
Speech recognition has come a long way since IBM’s Shoebox machine and Worlds of Wonder’s Julie doll. But despite progress made possible by AI, voice recognition systems today are at best imperfect and at worst discriminatory. In a study commissioned by the Washington Post, popular smart speakers made by Google and Amazon were 30% less likely to understand non-American accents than those of native-born users. More recently, the Algorithmic Justice League’s Voice Erasure project found that speech recognition systems from Apple, Amazon, Google, IBM, and Microsoft collectively achieve word error rates of 35% for African American voices versus 19% for white voices.
The coauthors of this latest research set out to investigate how well an ASR system for Dutch recognizes speech from different groups of speakers. In a series of experiments, they observed whether the ASR system could contend with diversity in speech along the dimensions of gender, age, and accent.
The researchers began by having an ASR system ingest sample data from CGN, an annotated corpus used to train AI language models to recognize the Dutch language. CGN contains recordings of speakers ranging in age from 18 to 65 from the Netherlands and the Flanders region of Belgium, covering speaking styles including broadcast news and telephone conversations.
CGN has a whopping 483 hours of speech spoken by 1,185 women and 1,678 men. But to make the system even more robust, the coauthors applied data augmentation techniques to increase the total hours of training data “ninefold.”
When the researchers ran the trained ASR system on a test set derived from the CGN, they found that it recognized female speech more reliably than male speech regardless of speaking style. Moreover, the system struggled to recognize speech from older people compared with younger people, potentially because older speakers’ speech was less clearly articulated. And it had an easier time recognizing speech from native speakers than from non-native speakers. Indeed, the worst-recognized native speech, that of Dutch children, had a word error rate around 20% better than that of the best non-native age group.
In general, the results suggest that teenagers’ speech was most accurately interpreted by the system, followed by seniors’ (over the age of 65) and children’s. This held even for non-native speakers who were highly proficient in Dutch vocabulary and grammar.
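The group comparisons above rest on word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the system’s transcript into the reference transcript, divided by the number of words in the reference. The following is a minimal Python sketch of how a per-group WER could be computed. It is not the researchers’ evaluation code; the function names, group labels, and sample sentences are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two lists of tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_group(samples):
    """Pool errors and reference words per group, then divide.

    samples: iterable of (group, reference, hypothesis) strings.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, ref, hyp in samples:
        ref_tokens = ref.lower().split()
        hyp_tokens = hyp.lower().split()
        errors[group] += edit_distance(ref_tokens, hyp_tokens)
        words[group] += len(ref_tokens)
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Hypothetical transcripts, invented for illustration only.
samples = [
    ("native", "ik ga naar huis", "ik ga naar huis"),
    ("non_native", "ik ga naar huis", "ik ga na huis"),
    ("non_native", "het regent vandaag", "het regent"),
]
result = wer_by_group(samples)
print(result)
```

Pooling errors and word counts per group before dividing, rather than averaging per-utterance rates, keeps short utterances from dominating the result. Whether the study aggregated its figures this way is an assumption.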
As the researchers point out, while it is to some extent impossible to remove the bias that creeps into datasets, one solution is to mitigate this bias at the algorithmic level.
“[We recommend] framing the problem, developing the team composition and the implementation process from a point of anticipating, proactively spotting, and developing mitigation strategies for affective prejudice [to address bias in ASR systems],” the researchers wrote in a paper detailing their work. “A direct bias mitigation strategy concerns diversifying and aiming for a balanced representation in the dataset. An indirect bias mitigation strategy deals with diverse team composition: the variety in age, regions, gender, and more provides additional lenses of spotting potential bias in design. Together, they can help ensure a more inclusive developmental environment for ASR.”
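A first step toward the balanced representation the researchers recommend is simply measuring how each group is represented in the training data. The short Python sketch below does this for the CGN speaker counts cited earlier in this article (1,185 women and 1,678 men). The `group_shares` helper is hypothetical and is not taken from the paper.

```python
def group_shares(counts):
    """Return each group's fraction of the total speaker count."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no speakers to count")
    return {group: n / total for group, n in counts.items()}


# Speaker counts as reported for the CGN corpus in this article.
speaker_counts = {"women": 1185, "men": 1678}
shares = group_shares(speaker_counts)
for group, share in shares.items():
    print(f"{group}: {share:.1%}")
```

The same check could be repeated for age bands or regions of origin, provided the corpus metadata records them.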