The Four Questions You Need To Ask To Get The Most Out Of Your Data

Facebook is apparently flagging articles for satirical content. Of course, it can’t work: sarcasm depends on cultural context and shared assumptions, and a computer capable of understanding it would be very close to the almost-human AIs of science fiction.

But fielding requests for magic is a familiar experience for anybody who works with data for less technical managers. Facebook’s current model doesn’t seem to do much more than identify Onion articles (take http://nightofthelivingdad.net/2014/08/13/why-we-didnt-vaccinate-our-child/ for instance), and we can certainly come up with something a little better. To illustrate how a data project goes from idea to implementation, let’s walk through the questions we need to ask to figure out what we should be doing and why.

First, we can ask about the real need. What happens to the data right now, and how could that be better? Answering this gets us to a solvable problem. For Facebook, the goal is presumably to reduce the number of times a user is confused by an article, thinking it’s real when it’s not. So we’d like to add something to the data-processing pipeline that can decide whether the content of a share is likely to be misleadingly satirical.

The next question is: what level of success solves the need? Usually a solution doesn’t need to be perfect to be worth implementing. In some workflows in some companies, getting five minutes ahead of a developing event or routing requests 10% better can be worth millions; in others, you might need a model that is 99.9% accurate to improve on having humans look at things. For Facebook, it’s hard to imagine a tenable human-based solution, but we should think about the two different kinds of errors. People presumably don’t want their posts labeled as satire when they’re not, whereas a satirical post that goes unlabeled is unlikely to drive people away in the short term, so it’s probably better not to label anything unless you have a fairly high degree of confidence. This kind of situation is pretty common: the exact kinds of errors that are acceptable constrain the available solutions.
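The trade-off between the two kinds of errors can be made concrete with a small sketch. It assumes some model already produces a satire score between 0 and 1 for each article; the scores, the labels, and the function names below are invented for illustration and are not Facebook’s. The idea is to pick the lowest score cutoff at which nearly everything we flag really is satire, accepting that some satire goes unflagged.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall if we flag every item with score >= threshold.

    labels use 1 for satire and 0 for everything else.
    """
    flagged = [label for score, label in zip(scores, labels) if score >= threshold]
    true_positives = sum(flagged)
    total_positives = sum(labels)
    # With nothing flagged there are no wrong labels, so treat precision as 1.0.
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / total_positives if total_positives else 0.0
    return precision, recall


def lowest_threshold_meeting(scores, labels, min_precision):
    """Smallest observed score that, used as a cutoff, meets the precision target."""
    for threshold in sorted(set(scores)):
        precision, _ = precision_recall(scores, labels, threshold)
        if precision >= min_precision:
            return threshold
    return None  # no cutoff is good enough; label nothing


# Hypothetical validation data: model scores and hand-checked labels.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.20]
labels = [1, 1, 0, 1, 0, 0]

cutoff = lowest_threshold_meeting(scores, labels, min_precision=0.9)
print(cutoff, precision_recall(scores, labels, cutoff))
```

On this toy data the chosen cutoff leaves one satirical article unlabeled, which is the price of almost never mislabeling a sincere post.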

What is the available data? One of the most common failure modes of proposed data projects is, remarkably, a lack of data. The optimal case for modern algorithms is a bunch of examples with reliable labels: spam or not spam, customer called back or did not, user gave a 5-star rating or not. As it happens, Facebook has plenty of articles that have been shared, but probably a much weaker idea of which ones were satirical and which weren’t. For Facebook, the obvious pieces of information about an article are its URL, which tells you where it came from; its text, which may be stored somewhere; and the likes and the text people add when they share it, which are potentially useful but aren’t as good as a real label. Probably the best thing to do here is to treat some list of sources (the Onion, the Daily Show, etc.) as generating known satire and hope to generalize from that base.
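Labeling by source can be sketched in a few lines. This is a minimal illustration, not a description of Facebook’s system: the domain list is an assumed example, and `weak_label` is a hypothetical helper name. Note that an unlisted source yields “unknown” rather than “not satire”, since the list says nothing about sources it doesn’t contain.

```python
from urllib.parse import urlparse

# Assumed, illustrative list of sources treated as known satire.
KNOWN_SATIRE_DOMAINS = {"theonion.com"}


def source_domain(url):
    """Host name of a URL, lowercased and without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def weak_label(url):
    """Return 'satire' for a known satire source, otherwise None (unknown)."""
    domain = source_domain(url)
    for known in KNOWN_SATIRE_DOMAINS:
        # Match the domain itself or any of its subdomains.
        if domain == known or domain.endswith("." + known):
            return "satire"
    return None


print(weak_label("https://www.theonion.com/some-article"))
print(weak_label("http://nightofthelivingdad.net/2014/08/13/why-we-didnt-vaccinate-our-child/"))
```

The second URL is the satirical post linked earlier; a source list alone cannot label it, which is why generalizing beyond the list matters.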

With the answers in hand, we can now outline a solution. The most basic approach is to use that list of known-satire sources as the only things worth labeling (this seems to be the current solution), but we can probably do better. The next step might be to take every article from that gold-standard list, treat every word in them as a potential signal, then compute the features that distinguish satire from non-satire and ‘score’ future articles on whether or not they include these features. These features will likely be simple things like stilted vocabulary combined with cursing. That isn’t very sophisticated, but the technology is completely off-the-shelf. Could it really do the job of detecting satire? Kind of! It’s likely that such a classifier would be pretty good at detecting fake news, but might have a harder time with articles like the one linked above. Without a ton of human context and common sense, we couldn’t generalize much, but we could probably get enough confidence that at least some satirical articles from as-yet-unidentified sources would get flagged. Whether this is worth implementing will depend on the exact values of our tolerances, which of course we can’t know without actually trying the project, but you would likely get something you could put into production to improve your user experience.
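One simple way to “treat every word as a potential signal” is to score each word by how much more often it appears in known satire than elsewhere, then add up the scores of the words in a new article. The sketch below uses smoothed log-odds, in the spirit of a naive Bayes classifier; that choice, the function names, and the tiny made-up headlines are all assumptions for illustration, and a real system would train on thousands of articles.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_word_scores(satire_docs, other_docs, smoothing=1.0):
    """Map each word to a smoothed log-odds score; positive leans satire."""
    satire_counts = Counter(w for doc in satire_docs for w in tokenize(doc))
    other_counts = Counter(w for doc in other_docs for w in tokenize(doc))
    vocab = set(satire_counts) | set(other_counts)
    # Add-one style smoothing so unseen words never produce log(0).
    satire_total = sum(satire_counts.values()) + smoothing * len(vocab)
    other_total = sum(other_counts.values()) + smoothing * len(vocab)
    return {
        word: math.log((satire_counts[word] + smoothing) / satire_total)
        - math.log((other_counts[word] + smoothing) / other_total)
        for word in vocab
    }


def score(text, word_scores):
    """Sum the word scores of a text; words never seen in training add 0."""
    return sum(word_scores.get(word, 0.0) for word in tokenize(text))


# Invented toy examples standing in for the gold-standard list.
satire_docs = [
    "area man heroically declares war on mondays",
    "nation stunned as area man finishes sandwich",
]
other_docs = [
    "city council approves budget for road repairs",
    "council votes to delay budget decision",
]

word_scores = train_word_scores(satire_docs, other_docs)
print(score("area man approves of sandwich", word_scores))
print(score("council approves road budget", word_scores))
```

A positive total leans toward satire and a negative one away from it; the cutoff for actually flagging an article would come from the error tolerances discussed earlier.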

It’s weirdly easy for executives not to explore the ways that data science can improve the success of their organizations, but it shouldn’t be hard to get the confidence to ask. If you know the number you care about, know how accurate you need the result to be, and have lots of well-labeled examples, you can almost certainly get real business value out of a big data project, even if what you originally asked for is impossible.

Dennis Clark studied algebraic geometry and theoretical computer science at Harvard University. After a few years’ service with distinction in the financial sector at QVT Financial LP, he’s brought his business savvy and linear algebra skills to Luminoso. Dennis primarily handles customer relations, product management, and strategic planning, but also provides insight into the mathematical computations and strategies of Luminoso product development.