Everyone in the data chain has to worry about ethics and security. Individuals want to protect their data. Companies and researchers want to learn from it. Big data, like many other emerging areas of technology, suffers from very real ethical problems. Regulations, governing bodies, and even the general understanding of ethics are struggling to catch up. Worse still, it’s not just mega-corporations using that data; it’s also researchers and thousands of small companies and studies worldwide.

One study from Facebook in 2014 has long been the poster child for misuse of social data in science. Most ethical guidelines for big data use consider maintaining the reputation of scientific research a key requirement, meaning Facebook researchers had some explaining to do. Yet the data collected was not particularly sensitive. The experiment involved tweaking users’ news feeds to contain either more positive or more negative stories to see whether that influenced users’ emotions and subsequent posts. There was outrage and shock, but were Facebook’s actions actually unethical?

First and foremost, the looming question is “did users give their consent?” If an ordinary user hasn’t consented to being a human guinea pig, why would they ever suspect they’re being tracked and analyzed? However, despite the missing consent, researchers might not perceive any problem, given the minimal risk to users and the anonymization of the data. Whether Facebook’s data collection methods were appropriate had ethicists split down the middle, which is an increasingly common occurrence in data ethics. In the end, the hammer did not come down, as the project was conducted by Facebook for internal purposes, with outsiders from Cornell University providing only analysis and nothing more. The fact that Facebook is a private company also completely changed the rules, making it tough to say they did anything legitimately wrong.

While ethicists can find plenty to disagree on, many agree that academics and scientific researchers should be held to higher standards than the bare minimum, or what they can scrape by with. In the wake of the emotion-study debacle, Facebook established a new research review board and updated its terms of service. Many other companies, including Microsoft, followed suit to avoid upsetting users in the future. While users are prepared for their data to be taken and used, that doesn’t mean they’re comfortable with it. Many would-be test subjects view data and data collection in a very negative light. The idea of academics collecting data in what many define as a “creepy” manner does harm individuals’ respect for research as a whole.

When Ethicists Argue and Guidelines Get Ignored

Interestingly, there are already several rules (official and unofficial) for researchers to follow. Associations, government bodies, and even research companies have laid out extensive guidelines for researchers to abide by. In many instances, however, these guidelines are anything but legally binding.

Several points make up a general understanding of ethical data collection and usage. The most obvious is consent, which, as seen with Facebook, doesn’t always apply in the way users expect it to. Another is that the data used must be reasonably linked to the topic and study, a point that may interest researchers but doesn’t make users feel any safer. A general requirement not to harm users in any way does exist, but when it comes to not terrifying them with “creepy” out-of-the-blue data collection, that can also be difficult to meet.

This is one reason review boards exist. Institutions often have their own Institutional Review Board, made up of researchers who are left to judge what is and isn’t ethical when a study gets to a sticky spot. Many of these judges, however, aren’t professional ethicists. The boards are often made up of scientists, a fact that likely made much more sense before the time of big data. Such boards may be fit to preside over medical studies, where the risk to a subject is as clear as life and death, but the complexities of technology and data rights are proving to be a tough grey area.

Casting aside legitimate ethicists in favor of what seems to be common sense may have played a role in the recent OKCupid data disaster, wherein researchers scraped data on 70,000 OKCupid users and then made their results public. When asked whether any attempt to anonymize the dataset had been made, head researcher Emil Kirkegaard tweeted only, “No. Data is already public.” Just about every news source has already jumped into the conversation, most arguing the situation is much more complex than Kirkegaard describes. Simply existing on the internet does not mean data is truly public, or that researchers have a right to it. Perhaps the most glaringly obvious problem with Kirkegaard’s response is that it dismisses the existence of, and possible harm caused to, his unwitting test subjects, which should be a primary consideration during any academic research. In the argument over the ethics of the situation (which seems to be more along the lines of “just how unethical is this” as opposed to “is this unethical or not”), perhaps the most jarring point is that Kirkegaard is just a graduate student. Not every graduate student sparks an internet-wide freak-out and ethics debate, and it’s the widespread power of big data that made it happen.

It doesn’t take much to scrape data. It takes relatively little to gather 70,000 profiles’ worth of data, and that’s why users and ethicists care so much. The OKCupid researchers’ first attempt at data collection involved creating a profile and letting a bot scrape suggested profiles, meaning just about any woman in or near Denmark could have become an unsuspecting test subject. While this particular study was for science as opposed to explicitly nefarious reasons, the almost deceptive use of an OKCupid profile to scrape data from users is still unsettling. That data, which includes full usernames, could also be used to form full profiles of users, from their location to their preferences and answers to site-related questions.

This leads to the problem of big data’s future uses. Now, there’s not just data; there’s an effort to combine all existing data to create thorough user profiles. Putting apparently public data together yields much more than a dating profile: it gives a very exact look at an individual or specific group of individuals. Unforeseeable problems with collected data only strengthen the case for a more unified, and stricter, stance on the ethics of data. Even if research groups and universities establish guidelines for ethical data collection and usage, they mean nothing without a relatively uniform understanding of what is and isn’t allowed, not to mention respect for the test subjects and their privacy.
