It seems like every quarter a new McKinsey report predicts that this will be the year trillions of dollars of IoT potential is unlocked. But while the amount of data IoT produces has skyrocketed, we’re still waiting for that return on investment. The good news is, the reports aren’t wrong. Actionable data can, in fact, enable data scientists to accelerate business growth. The bad news is, businesses haven’t had access to the right tools to make their data actionable. In fact, estimates suggest that just 1 percent of operational data is being used in enterprises.
The primary exhaust of IoT devices is time series data, i.e., sequential events indexed by time. Working with time series data is tricky. Whether it’s information coming from a machine on a factory floor or the trunk of a self-driving car, events occur at uneven intervals, in different-sized windows and in formats that vary across datasets. Time series data is unique in that it’s write-once, non-deletable, non-transactional and non-relational. It also has different access patterns, such as looking for behaviors and patterns across time rather than joining on a specific field.
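As a minimal sketch of the uneven-intervals problem, the snippet below (with made-up timestamps and readings) uses pandas to regularize irregular sensor events onto fixed one-minute windows, a common first step before any cross-dataset comparison:

```python
import pandas as pd

# Hypothetical sensor events arriving at uneven intervals
events = pd.DataFrame(
    {"temp_c": [20.1, 20.4, 21.0, 22.3]},
    index=pd.to_datetime([
        "2023-01-01 00:00:03",
        "2023-01-01 00:00:41",
        "2023-01-01 00:02:15",
        "2023-01-01 00:02:58",
    ]),
)

# Regularize onto fixed 1-minute windows; the mean aggregates
# whatever events fell inside each window (empty windows become NaN)
per_minute = events["temp_c"].resample("1min").mean()
print(per_minute)
```

Note that the minute with no events shows up as a gap (NaN), which is itself a decision point: drop it, fill it forward, or interpolate, depending on what the downstream model expects.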
Unfortunately, time series data often gets grouped with other types of data such as CRM records, log data and general analytics. This results in tools that don’t work, leaving data scientists and their organizations without an effective solution for leveraging their data or making it actionable.
Unique Hurdles and Advantages
With traditional datasets, data scientists often look for relationships that can be expressed easily and efficiently with SQL. For time series data, however, data scientists need to look for behaviors and patterns in events streaming across time. They need to look for specific sequences, how often they happen and the characteristics of the data during those windows of time in order to gain insights and build models. Relying on SQL for these kinds of time series lookups can quickly become costly and inefficient.
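To make the contrast concrete, here is a small sketch (with invented readings) of the kind of behavioral query that is awkward in plain SQL but natural with a time-based rolling window: find every five-minute window in which a temperature rose by more than some threshold.

```python
import pandas as pd

# Hypothetical one-minute temperature readings
idx = pd.date_range("2023-01-01", periods=10, freq="min")
temp = pd.Series(
    [20, 20, 21, 25, 30, 31, 30, 29, 22, 21],
    index=idx, dtype=float,
)

# Behavior of interest: any 5-minute window where temperature rose
# by more than 8 degrees -- a sequence across time, not a join
rise = temp.rolling("5min").apply(lambda w: w.iloc[-1] - w.iloc[0])
spikes = rise[rise > 8]
print(spikes)
```

The rolling window is defined by a time offset rather than a row count, so it keeps working even when readings arrive at uneven intervals.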
Luckily, time series data can be sampled. Data scientists only need a small portion of the extracted data to understand its overall shape. This initial sample can fit into memory and be analyzed with pandas in a Jupyter notebook. It may even be small enough to efficiently run full table scans inside a NoSQL or SQL database. The small sample size makes it possible for data scientists to quickly explore the data for patterns and write small programs to transform the data or add new features.
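A minimal sketch of this sampling step, using a synthetic week of per-second readings: taking every 100th row yields an even one-percent sample that preserves temporal ordering and fits comfortably in memory for notebook exploration.

```python
import numpy as np
import pandas as pd

# Hypothetical full extract: one sensor reading per second for a week
rng = np.random.default_rng(42)
n = 7 * 24 * 3600
full = pd.DataFrame(
    {"value": rng.normal(100, 10, n)},
    index=pd.date_range("2023-01-01", periods=n, freq="s"),
)

# Take every 100th row -- a 1% sample that keeps temporal order
# and captures the overall shape of the series
sample = full.iloc[::100]

print(len(sample))
print(sample["value"].describe())
```

Strided sampling is only one option; depending on the question, a contiguous time slice or a window-level aggregate may represent the shape of the data better.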
Performance and Workflow Challenges
Eventually, though, analysis needs to scale. Managing the performance and workflow of time series analytics from a small sample to production-level volumes can be extremely challenging for data scientists. For instance, even simple pre-processing and data transformation steps need to be moved to distributed batch processing workflows. Moving the extract, transform and load (ETL) program from local scripts into a production-ready data pipeline requires rewriting entire programs for environments where table scans just aren’t feasible.
The level of experimental interactivity and flexibility during the data exploration and model development process is directly related to how valuable the time series data insights will be. When data scientists are forced to wait hours or days for long batch processing pipelines, they lose interactivity, iterate less and find suboptimal solutions that often have unintended consequences. For instance, because it’s so inefficient to adapt or tune an ETL, early assumptions aren’t tested, leaving dangerous biases and failure modes in a system. These problems compound when data scientists need to join streams of data together, each with different states, features, ETL requirements and schemas. The resulting pipelines are extremely fragile, and they break frequently. Before you know it, the majority of a data scientist’s time is spent troubleshooting.
Crucial Best Practices
Creating useful metrics from time series data requires looking at high-level features not visible in the raw data itself. For instance, instead of merely looking at a temperature value, it’s useful to extract the degrees-per-hour rate of change. Useful patterns are discovered by combining derived features from multiple data sources into higher-level query expressions.
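A small sketch of that degrees-per-hour example, with made-up hourly readings: the derived feature is computed from consecutive values and their actual time deltas, so it stays correct even if readings are not exactly an hour apart.

```python
import pandas as pd

# Hypothetical hourly temperature readings
idx = pd.date_range("2023-01-01", periods=5, freq="h")
temp = pd.Series([20.0, 22.5, 26.0, 25.0, 21.0], index=idx)

# Derived feature: rate of change in degrees per hour, computed
# from consecutive readings and the elapsed time between them
hours = idx.to_series().diff().dt.total_seconds() / 3600
deg_per_hour = temp.diff() / hours

print(deg_per_hour)
```

The first value is NaN by construction (there is no prior reading to difference against), which downstream code should handle explicitly.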
For data scientists looking to effectively leverage the insights behind their organization’s time series data, acknowledging and prioritizing scaling challenges is critical. Maintain a high level of interactivity so you can explore and iterate quickly. Recognize the unique behaviors complex events will reveal, and be ready to test as many combinations as possible. In doing so, data scientists can productively work with time series data to help a business grow.