I’ll never forget my “aha” moment with bias in AI. I was working at IBM as the product owner for Watson Visual Recognition. We knew that the API wasn’t the best in class at returning “accurate” tags for images, and we needed to improve it.
I was nervous about the possibility of bias creeping into our models. Bias in Machine Learning (ML) models is the exact sort of problem the ML community has seen time and again, from poor facial recognition of diverse individuals to an AI beauty pageant gone awry and countless other instances. We looked long and hard at the data labels we used for our project and, at first blush, everything seemed fine.
Just prior to launch, a researcher on our team brought something to my attention. One of the image classifications that had trained our model was called “loser.” And a lot of those images depicted people with disabilities.
I was horrified. We started wondering, “what else have we overlooked?” Who knows what seemingly innocuous label might train our model to exhibit inherent or latent bias? We gathered everyone we could — from engineers to data scientists to marketers — to comb through the tens of thousands of labels and millions of associated images and pull out everything we found objectionable according to IBM’s code of conduct. We pulled out more than a handful of other classes that didn’t reflect our values.
My “aha” moment helped avert a crisis. But I also realize that we had some advantages in doing so. We had a diverse team (different ages, races, ethnicities, geographies, experience, etc.) and a shared understanding of what was and wasn’t objectionable. We also had the time, support, and resources to look for objectionable labels and fix them.
Not everyone who is building an ML-enabled product has the resources of the IBM team. For teams without the advantages we had, and even for organizations that do, the prospect of unwanted bias looms. Here are a few best practices for teams of any size as they embark upon their ML journey.
1. Define and narrow the business problem you’re solving
Trying to solve for too many scenarios often means you’ll need a ton of labels across an unmanageable number of classes. Narrowly defining the problem at the start will help you make sure your model performs well at the exact task you built it for.
For example, if you’re creating a computer vision model that’s answering a fairly straightforward question, like “Is this a human?” you need to define what you mean by “human.” Do cartoons count? What if the person is partially occluded? Should a torso count as “human” for your model? This all matters. You need clarity on what “human” means for this model. If you’re unsure, ask people the same question about your data. You might be surprised by the ambiguities present and the assumptions you made going in.
One way to help define your scope is by considering the information you use for your model. Even academic datasets like ImageNet can have classes and labels that introduce unintended bias into your algorithms. The more of your data you understand and own and can map back to the business problem you’re solving, the less likely you are to be surprised by objectionable labels.
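One way to start understanding your labels is a quick first-pass screen before a human review. Here is a minimal sketch in Python, assuming you can export your class labels and image counts as a dictionary. The function name, the sample labels, the counts, and the review list are all hypothetical; a screen like this only surfaces candidates and does not replace people reading through the labels.

```python
def flag_labels_for_review(label_counts, review_terms):
    """Return labels that contain any term from a human-curated review list.

    label_counts: dict mapping class label -> number of images (assumed format).
    review_terms: set of lowercase words your team has agreed to look at.
    """
    flagged = {}
    for label, count in label_counts.items():
        # Normalize "Sore_Loser" / "sore loser" style labels into lowercase words.
        tokens = set(label.lower().replace("_", " ").split())
        hits = sorted(tokens & review_terms)
        if hits:
            flagged[label] = {"images": count, "matched": hits}
    return flagged


# Hypothetical data for illustration only.
label_counts = {"golden_retriever": 1200, "loser": 340, "Sore Loser": 12}
review_terms = {"loser"}

flagged = flag_labels_for_review(label_counts, review_terms)
for label, info in flagged.items():
    print(label, info)
```

A word list will miss labels that are only objectionable in context, so treat the output as a place to begin the conversation with your team, not as a verdict.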
2. Gather a diverse team that asks diverse questions
We all bring different experiences and ideas to the workplace. People from diverse backgrounds (not just race and gender, but age, experience, and more) will inherently ask different questions and interact with your model in different ways. That can help you catch problems before your model is in production.
Building a diverse team also means gathering data in a way that allows for different opinions. There are often multiple valid opinions or labels for a single datapoint. Gathering those opinions and accounting for legitimate, often subjective, disagreements will make your model more flexible.
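To make "accounting for disagreement" concrete, here is a minimal sketch in Python, assuming each datapoint has been labeled by several annotators. It keeps the full distribution of labels instead of collapsing to a single answer, and it lists the items where agreement is low so a person can take a closer look. The function names, the sample data, and the 0.75 threshold are hypothetical choices for illustration.

```python
from collections import Counter


def summarize_annotations(annotations):
    """Summarize per-item label distributions and agreement.

    annotations: dict mapping item id -> list of labels from different annotators.
    """
    summary = {}
    for item_id, labels in annotations.items():
        if not labels:
            continue  # Skip items nobody has labeled yet.
        counts = Counter(labels)
        top_label, top_votes = counts.most_common(1)[0]
        summary[item_id] = {
            # Share of annotators who chose each label.
            "distribution": {lbl: n / len(labels) for lbl, n in counts.items()},
            "majority": top_label,
            # Fraction of annotators who agreed with the most common label.
            "agreement": top_votes / len(labels),
        }
    return summary


def items_needing_review(summary, min_agreement=0.75):
    """Return ids of items whose agreement falls below the chosen threshold."""
    return [item for item, s in summary.items() if s["agreement"] < min_agreement]


# Hypothetical annotations for the "Is this a human?" question.
annotations = {
    "img_1": ["human", "human", "human", "human"],
    "img_2": ["human", "not_human", "human", "not_human"],  # e.g. a cartoon
}

summary = summarize_annotations(annotations)
print(items_needing_review(summary))
```

Low agreement does not always mean someone is wrong. It often points to an ambiguous definition, which is a cue to revisit what your labels mean.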
3. Think about all of your end users
Likewise, understand that your end users won’t simply be like you or your team. Be empathetic. Anticipate how people who aren’t like you will interact with your technology and what problems might arise in their doing so.
With this in mind, it’s important to remember that models rarely remain static. One of the worst mistakes you can make is deploying your model without a way for end users to give you feedback on how the model is performing in the real world.
You’ll want to keep humans as part of your process to react to changes, edge cases, instances of bias you might’ve missed, and more. You want to get feedback from your model and give it feedback of your own to improve its performance, iterating constantly towards higher accuracy.
4. Annotate with diversity
When you use humans to annotate your data, it’s best to draw from a diverse pool. Don’t use students from a single college or even labelers from one country. The larger the pool, the more diverse your viewpoints. That can really help reduce bias.
After all, annotation is where bias often hides. A few years back, researchers at the University of Washington and the University of Maryland found that doing an image search for certain jobs revealed serious underrepresentation and bias in results. Search “nurse,” for example, and you’d see only women. Search “CEO” and it was all men.
Having people of diverse backgrounds annotate data will help ensure your team asks different questions, thinks about different end users, and, hopefully, creates a technology with some empathy in mind.
Accounting for Bias Is Paramount for Good AI
Knowing what I know now, I’d argue it’s both negligent and reckless to launch an AI system into production without accounting for bias with these basic best practices. Remember: it’s not impossible to reduce unwanted bias in your models. It takes some grit and hard work, sure, but it comes down to being empathetic, iterating throughout the model building and tuning processes, and taking great care with your data.