‘Streams in the Beginning, Graphs in the End’ is a three-part series by Dataconomy contributor and Senior Director of Product Management at Cray, Inc., Venkat Krishnamurthy – focusing on how big changes are afoot in data management, driven by a very different set of use cases around sensor data processing. In this first part, we’ll talk about how the bigger revolution in data management infrastructure is driven more by the increasing ease of data collection than by processing tools.
For those of you who like natural disaster movies, you may recall Twister, a film about storm chasers in which the star attractions were the tornadoes themselves. As a quick plot summary, the chasers risk life and limb to get a bunch of shiny, winged sensors into the heart of an EF5 twister, so they can understand these monsters from the inside. In a way, ‘Dorothy’, the machine that digitized the tornado, foretold the arrival of the Big Data age.
It’s no exaggeration that we’re in the golden age of data management and analytics.
To us at Cray, Data has never really been any other scale than Big, and the reason for this has been the scientific method itself. Science begins with observation, and ‘data analytics’ has been fundamental to this endeavor from the beginning. In the past, this led to the invention of specialized instruments to observe the very small (microscopes) or the very large (telescopes). Arguably, these were the first applications of ‘data analytics’: in a sense, an optical microscope or telescope simply turns a tissue sample or a patch of sky into a stream of photons, analyzed by sophisticated pattern recognition engines (human brains) attached to extremely high-fidelity sensors (human eyeballs).
However, as science relentlessly advanced into ever smaller and ever larger scales simultaneously, it became humanly impossible to build equally capable instruments.
Scientists instead turned to creating scalable, high-fidelity mathematical models of physical phenomena, and needed tools to study them, giving rise to supercomputing by necessity. They use these models to study the insides of stars, the structure of the universe, and molecular dynamics. Supercomputers, then, have evolved primarily out of the need to approximate reality at extreme scales, and are really versatile, multipurpose scientific instruments in disguise.
Meanwhile, major advances in data processing have been driven primarily by the commercial sector, starting with the birth of the database. Big ideas in data management like the relational model, transaction processing and SQL were born in an age of relatively scarce data and compute capability, when it was too expensive to capture anything other than a carefully curated record of key business events.
When the inevitable need arose to understand a business beyond just recording it, the central ideas of Data Warehousing and Business Intelligence were born, driven by basic business needs like financial reporting and sales analysis. Hence, the major ideas of data processing were driven primarily by a need to understand reality, albeit in a narrow business-oriented sense.
For a long time, the paths of traditional ‘supercomputing’ and data analytics didn’t quite intersect, except in specialized domains like finance. This persisted until Google famously upended the status quo with the MapReduce processing model in 2004. The motivating problem at Google was indexing the entire Web, but by focusing on a set of simple building blocks and principles for data processing at extreme scale, they set the stage for the Big Data revolution.
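To make those building blocks concrete, here is a minimal word-count sketch of the two MapReduce primitives in Python. It is purely illustrative, not Google’s implementation: the real framework shards the input and runs the map and reduce phases in parallel across thousands of machines, with a distributed shuffle in between.

```python
from collections import defaultdict

# Toy illustration of the two MapReduce building blocks:
# 'map' emits (key, value) pairs; 'reduce' folds all values for a key.

def map_phase(documents):
    """Emit (word, 1) for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Group values by key, as the framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big ideas", "data at scale"]
print(reduce_phase(shuffle(map_phase(docs))))
# {'big': 2, 'data': 2, 'ideas': 1, 'at': 1, 'scale': 1}
```

The power of the model is that the programmer writes only the map and reduce functions; everything else, including partitioning, scheduling and fault tolerance, is handled by the framework.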
The subsequent rapid, exponential evolution of open frameworks for processing data at scale means that Big Data has become a pervasive cliché, applied to every domain and to use cases well beyond the original one. Tools like Spark and Hadoop allow the average commercial company to dream really Big about their Data, but they have also brought all the problems of building and using distributed computing platforms and applications into the commercial datacenter. In addition, businesses are evolving from simple counting and aggregation of business events to identifying sophisticated patterns in their data, inevitably bringing them closer to the computational techniques used in science.
On the flip side, supercomputers have gotten even better at approximating reality, and generate ever-increasing amounts of data in the process. As the supercomputer has become a telescope, or microscope, into the unobservably large or small, the ‘stream of photons’ is now a deluge of bits. Increasingly, scientists need to combine the results of these simulations with data from the real world and identify patterns in petabyte-sized datasets. Their big data need isn’t gated so much by scale as by productivity, loosely defined as the quickest time to first result in analyzing the data they have simulated and/or collected. What is needed is the equivalent of human eyeballs and brains at this scale; this is, in essence, why convergence between Supercomputing and Big Data is inevitable.
Fig 1 – The evolution of the microscope. On top, the first ever microscope, invented by Antonie van Leeuwenhoek, and some samples. Below, a pictorial representation of mass-spectrometry bio-imaging, which ionizes biological samples into mass spectra.
Great, you say – but why are ‘Dorothy’ and a barrel of shiny artificial butterflies relevant to this? And while the idea of ‘convergence of Supercomputing and Big Data’ sounds good, it’s still somewhat abstract. How does this all tie together?
The way we see it, the big changes for data management so far have been a ‘revolution at the center’: storage facilities (‘Data Warehouses’), distribution facilities (‘Data Hubs’), or aquatic bodies of data (‘Data Lakes’).
In contrast, we believe that the realization of the Big Data revolution will be at the edges of data management. Here is where we see this idea of convergence fundamentally becoming reality, and driving changes in everything from the building blocks for large-scale data processing to the system architecture for platforms at Cray that can deliver on the promise.
Why is this true? We believe it has to do with two fundamental problems at either end of the analytical data management lifecycle:
- At one end, how to handle data management when data collection is pushing towards the ‘edges’, where a large number of sensors produce data
- At the other end, how to create a scalable model of knowledge to unify the results from any and all types of data processing of all that sensor data
To address both, we believe that an important organizing principle for the data management of Big Data will be ‘Streams in the Beginning, Graphs in the End’, as the sketch below illustrates.
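Here is a minimal, purely illustrative sketch of that principle in Python: sensor readings arrive as a stream of events at one end, and are folded into a graph of relationships at the other. The event schema and the names (sensor IDs, turbines) are hypothetical examples, not any particular Cray product or API.

```python
from collections import defaultdict

# 'Streams in the beginning': hypothetical sensor events arriving
# one at a time as (sensor_id, observed_entity, reading) tuples.
stream = [
    ("s1", "turbine-7", 98.6),
    ("s2", "turbine-7", 99.1),
    ("s1", "turbine-9", 97.2),
]

# 'Graphs in the end': fold the stream into an adjacency structure
# that records which sensors observed which entities, unifying the
# results of processing into a single connected model of knowledge.
graph = defaultdict(set)
for sensor_id, entity, _reading in stream:
    graph[sensor_id].add(entity)
    graph[entity].add(sensor_id)

print(dict(graph))
# {'s1': {'turbine-7', 'turbine-9'}, 'turbine-7': {'s1', 's2'}, ...}
```

The point of the sketch is the shape of the pipeline, not the code: events are processed incrementally as they arrive, while the accumulated knowledge is naturally expressed as a graph that can be queried for patterns.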
In subsequent parts, we’ll dive into greater detail on each of the above. Stay tuned!
Image Credit: Eric Fischer / Geography of Twitter / CC BY 2.0