‘Big data’ has become an industry buzzword amongst today’s business leaders. Corporate spending on infrastructure to capture and store large volumes of rapidly changing data has risen significantly in recent years, as organisations have scrambled to collect all of the consumer information they believe will help them stay ahead of the competition. However, it’s becoming clear that collecting data alone isn’t enough. Indeed, this year’s Gartner Hype Cycle Special Report cites big data as moving beyond the Peak of Inflated Expectations, with businesses now beginning to see that data must not only be harvested and stored, but also analysed and mined efficiently for new insights if it is to be of any strategic value. However, with so much data now being collected, how do organisations know what is useful and what isn’t, and how do they make the insights actionable?
Typically, organisations focus their data analysis efforts on transactional data – the information customers supply when they purchase a product or service – because they perceive it to be the most valuable. This typically includes names, addresses and credit card details. However, in the course of collecting transactional data, large amounts of additional customer information are also accumulated as a byproduct. This non-transactional data is commonly referred to as “dark data”, which Gartner defines as ‘information assets that organisations collect, process and store during regular business activities, but generally fail to use for other purposes.’
What is Dark Data?
Dark data can include which marketing pieces a specific individual responded to, on which platform they answered a questionnaire, or what they’ve said about an organisation or brand on social media. It can also include customer purchase history, frequency of website visits and the geographical spread of customers.
While it can appear obscure and unhelpful, if approached in the correct way, dark data can reveal all kinds of patterns and insights that would otherwise have been missed. In short, it is information that can really make a difference if interpreted correctly.
One key to unlocking dark data’s secrets lies in the ability to understand the relationships between seemingly unrelated pieces of information. The way that data is stored plays a critical role in this. Traditional relational databases, and indeed even many big data technologies, simply aren’t designed to show relationships and patterns between data records. You may be able to unearth some connections at a very high level, but the results will be extremely slow and lack real definition. It’s the difference between understanding that two people living in one house are married, siblings or flatmates, and then going a step further to predict how those differences might influence their decisions.
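To make that difference concrete, here is a minimal Python sketch of the idea of typed relationships. All names and relationships below are invented, and the dictionary-based graph is purely illustrative – it is not how a graph database stores data internally – but it shows how recording the *type* of a connection lets you ask questions a flat table of records cannot easily answer:

```python
# Minimal sketch: an in-memory graph whose edges carry a relationship
# type, so two people at one address can be distinguished as married,
# siblings or flatmates rather than merely "connected".
# All names and relationships here are hypothetical.

class Graph:
    def __init__(self):
        self.edges = {}  # node -> list of (relationship_type, neighbour)

    def relate(self, a, rel, b):
        """Record an undirected, typed relationship between a and b."""
        self.edges.setdefault(a, []).append((rel, b))
        self.edges.setdefault(b, []).append((rel, a))

    def relationship(self, a, b):
        """Return the typed relationships between two nodes, if any."""
        return [rel for rel, n in self.edges.get(a, []) if n == b]

g = Graph()
g.relate("Alice", "LIVES_AT", "12 High St")
g.relate("Bob", "LIVES_AT", "12 High St")
g.relate("Alice", "MARRIED_TO", "Bob")

print(g.relationship("Alice", "Bob"))  # ['MARRIED_TO']
```

A relational database would need a join table per relationship type (or an expensive self-join) to answer the same question; in a graph model the typed edge is the primary structure, which is what makes multi-hop questions cheap.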
Thanks to advances in technology over the last few years, deriving business value from dark data is now a real possibility. Broadly speaking, the recipe for doing this involves three steps:
- Discovering the hidden patterns – This requires technology infrastructure to store and crunch data, and data scientists to ask the questions. Key technologies here are your usual bulk analytic suspects: Hadoop (increasingly with Spark over MapReduce as a way of processing data), Splunk, SAS, etc.
- Developing hypotheses – Still in the domain of offline analytics and data science, this step combines various forms of forward (A/B) testing and back testing.
- Putting your newfound insights to use – As the new algorithms developed above make the rules more complex, older technologies lose the ability to execute in real time. As a result, new technologies such as graph databases like Neo4j are needed to run the algorithms at the appropriate junctures, and with the right level of timeliness, to have their effect on the business.
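As a rough illustration of the hypothesis-development step, the sketch below runs a two-proportion z-test on A/B conversion counts, using only the Python standard library. The conversion figures are invented for illustration; real A/B tests would also need pre-registered sample sizes and significance thresholds:

```python
# Sketch of forward (A/B) hypothesis testing: does variant B convert
# better than variant A? Counts below are made up for illustration.
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """One-sided two-proportion z-test for p_b > p_a."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)           # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))  # pooled standard error
    z = (p_b - p_a) / se
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))    # 1 - normal CDF(z)
    return z, p_value

z, p = two_proportion_z(conv_a=120, n_a=2400, conv_b=165, n_b=2400)
print(f"z = {z:.2f}, one-sided p = {p:.4f}")
```

If the p-value clears a pre-agreed threshold, the hypothesis graduates from the offline analytics stage to the third step: encoding it as a rule or algorithm the operational systems can execute.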
There is quite a lot of discourse aimed at the first two activities, which belong to the realm of data analysis. The third, it turns out, is the critical ingredient that makes insights actionable. And it’s not a negligible problem. The more intricate and subtle the algorithms coming out of the analytics processes become, the more pressure is exerted on the operational systems. Take the example of an e-commerce recommendation. The golden recommendation might very well require combining up-to-the-second information about the products in someone’s shopping cart, the products they’ve browsed and bought, and then examining what other people in their situation have bought in the past in similar product categories. These kinds of real-time multi-hop recommendation algorithms are usually where relational databases either crumble, or get unjustifiably expensive.
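The “multi-hop” shape of such a recommendation can be sketched in a few lines of Python. This toy in-memory version (with invented purchase data) is only meant to show the traversal pattern – cart item, to other buyers, to what else they bought; in production, this is the kind of query a graph database such as Neo4j is designed to execute in real time over live data:

```python
# Toy sketch of a multi-hop recommendation: from the products in a
# user's cart, hop to other customers who bought those products, then
# hop to what else those customers bought. Purchase data is invented.
from collections import Counter

purchases = {              # customer -> set of products bought
    "ann":  {"camera", "tripod"},
    "ben":  {"camera", "sd-card", "bag"},
    "cara": {"tripod", "sd-card"},
}

def recommend(cart, purchases, top_n=2):
    # Invert the purchase data: product -> set of buyers.
    buyers_of = {}
    for customer, products in purchases.items():
        for p in products:
            buyers_of.setdefault(p, set()).add(customer)

    scores = Counter()
    for item in cart:                          # hop 1: item in the cart
        for peer in buyers_of.get(item, ()):   # hop 2: who else bought it
            for other in purchases[peer]:      # hop 3: what else they bought
                if other not in cart:
                    scores[other] += 1
    return [p for p, _ in scores.most_common(top_n)]

print(recommend({"camera"}, purchases))
```

The cost of this traversal grows with the number of hops; expressing it as SQL joins over large tables is where the “crumble or get expensive” trade-off described above tends to bite.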
The result is an increasingly common phenomenon called “polyglot persistence”, where new database technologies such as graph databases (which are ideal for solving, among other things, the recommendation problem above) are used alongside existing systems to solve specific high-value problems.
Because many of the insights yet to be made rely on a deeper understanding of complex causalities (else they’d have already been discovered), it’s no surprise that new technologies are needed. More and more businesses are discovering graph databases as a powerful real-time execution engine for bringing the insights in their data, light and dark, to life.
Dark data isn’t just useful for customer insights either. It can be equally useful when applied to employees. For example, Gate Gourmet, an airline industry catering provider, was struggling to lower an unusually high 50 per cent attrition rate amongst its one thousand employees at O’Hare Airport in Chicago. Using dark data already easily accessible in internal systems, such as demographics, salaries and transportation options, the company confirmed its suspicion that the attrition rate was directly related to the distance and transportation options from employees’ homes to the airport. This realisation enabled them to change the hiring process and reduce attrition by 27 per cent. Gate Gourmet did not need to invest a huge amount of money into collecting data to solve the company’s attrition problem. Rather, they needed to look closer at the data they already had, in a way that revealed patterns and connections between the employees who stayed with the company and those who left.
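At its core, an analysis like Gate Gourmet’s boils down to grouping records the organisation already holds and comparing rates across the groups. The sketch below uses entirely invented employee records and a hypothetical 10-mile commute threshold, but shows the shape of the question:

```python
# Sketch of a Gate Gourmet-style analysis: group employees already on
# file by commute distance and compare attrition rates per band.
# The records and the 10-mile threshold below are invented.

employees = [
    # (commute_miles, left_company)
    (3, False), (5, False), (4, True), (6, False),
    (18, True), (22, True), (15, False), (25, True),
]

def attrition_by_band(employees, threshold=10):
    bands = {"short commute": [0, 0], "long commute": [0, 0]}  # [left, total]
    for miles, left in employees:
        band = "short commute" if miles < threshold else "long commute"
        bands[band][1] += 1
        if left:
            bands[band][0] += 1
    return {band: left / total for band, (left, total) in bands.items()}

rates = attrition_by_band(employees)
for band, rate in rates.items():
    print(f"{band}: {rate:.0%} attrition")  # short 25%, long 75%
```

No new data collection is required: the commute distances and leaving dates were already sitting in HR systems, which is precisely the point the example makes.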
Although many businesses are not yet leveraging their dark data, the example of Gate Gourmet demonstrates what can happen when they do. While companies will, and must, continue to actively collect data, it is essential not to neglect the information already available, free of cost. There is a clear need to be more creative: asking new questions of the same old data can surface exciting and surprising results.
The key to monetising dark data lies not only in gathering it, but in analysing it to discover hidden patterns, developing hypotheses, and then putting the insights to use. Doing this successfully requires a variety of different technologies, each suited to a particular job. By combining data science and number crunching on large-scale analytic technologies with the real-time execution of complex algorithms on a graph database, businesses can bring transformative insights to their operational decisions, and combine the latest technologies with their existing data and systems.
Emil Eifrem is CEO of Neo Technology and co-founder of the Neo4j project. Committed to sustainable open source, he guides Neo along a balanced path between free availability and commercial reliability. Emil is a frequent conference speaker and author on NOSQL databases.
(Image credit: Jason Eppink)