The collision between people and big data has caused an explosion of machine learning innovations, with one natural home being modern data preparation – the steps of understanding, cleaning, shaping, and correlating data before it is ready for analytics.
For thirty years, there have really been only two data preparation processes. The first is the human-led, coding-and-scripting, trial-and-error approach, which can't scale when datasets are constantly changing and regularly generated from new, disparate sources. The other is the rigid path of ETL (Extract, Transform, Load), where a schema and a set of mappings are built and cannot be changed without an act of Congress. Neither option allows people to process, analyze, or derive insight from the volumes of data they are collecting as rapidly as it is being generated.
Today, companies like Paxata are leveraging machine learning to accelerate the modern data preparation process, giving everyone who works with data a “partner” that can do things people can no longer do with curiosity and eyes alone. It automates the exploration of data quality issues to discover unidentified relationships, anomalies, and other data properties without being explicitly programmed on what to look for. And, unlike traditional methods, which break under the stress of constantly evolving data volume and variety, machine learning only gets better as the data gets bigger and more varied.
How does machine learning work in data preparation?
The Paxata approach, for instance, uses multiple techniques to “learn” the meaning behind the data (semantic typing) and how it relates to other data elements:
Adaptive semantic indexing – An indexing and retrieval method that establishes associations between words that occur in similar contexts. The adaptive aspect is that the semantic index is created and refined on the fly as a by-product of pipeline execution, a technique often known as “database cracking.”
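As a rough illustration of the idea (not Paxata's actual implementation), the sketch below builds an inverted index from tokens in column values to the columns where they occur, and only indexes a column the first time a pipeline step touches it – the lazy, cracking-style refinement described above. All names are hypothetical.

```python
from collections import defaultdict

class SemanticIndex:
    """Minimal sketch of an adaptive semantic index. Tokens from column
    values map to the (dataset, column) locations where they occur; the
    index is refined lazily, cracking-style, only for columns a pipeline
    step actually touches."""

    def __init__(self):
        self.postings = defaultdict(set)   # token -> {(dataset, column), ...}
        self.indexed = set()               # columns already cracked

    def touch(self, dataset, column, values):
        """Called as a by-product of pipeline execution; indexes a column
        incrementally instead of indexing everything up front."""
        key = (dataset, column)
        if key in self.indexed:
            return
        for v in values:
            for token in str(v).lower().split():
                self.postings[token].add(key)
        self.indexed.add(key)

    def lookup(self, token):
        """Columns whose values contain this token, i.e. columns that occur
        in similar contexts."""
        return self.postings.get(token.lower(), set())
```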
Probabilistic join recommendations – Uses the indexes to generate a virtual search space of all possible matches between words across data sets, computes statistical distributions over those matches, aggressively prunes the space down to plausibly legitimate combinations, and then reasons over the potential matches that remain to recommend join decisions.
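A simplified sketch of the pruning-and-scoring idea, assuming distinct-value sets per column are already available (the function name, threshold, and scoring by value containment are illustrative choices, not Paxata's method):

```python
def recommend_joins(columns, min_overlap=0.5, top_k=5):
    """Sketch of statistics-driven join recommendation. `columns` maps
    (dataset, column) -> set of distinct values. Candidate pairs are scored
    by value containment and pruned aggressively before ranking."""
    candidates = []
    keys = list(columns)
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            if a[0] == b[0]:
                continue                      # skip same-dataset pairs
            va, vb = columns[a], columns[b]
            shared = len(va & vb)
            if not shared:
                continue
            # containment of the smaller column within the larger one
            score = shared / min(len(va), len(vb))
            if score >= min_overlap:          # aggressive pruning step
                candidates.append((score, a, b))
    return sorted(candidates, reverse=True)[:top_k]
```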
Reinforcement learning – As recommendations from the join detection process are confirmed through user interaction, the model of the relationships among the datasets receives feedback, which is then reflected in the weights used in subsequent join detection steps.
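A minimal sketch of how such feedback could be folded back into the weights (the feature names, linear scoring, and update rule are assumptions for illustration, not the product's actual model):

```python
class JoinFeedbackModel:
    """Sketch of feedback-weighted join scoring. Each signal behind a
    candidate join (value overlap, matching semantic type, name similarity)
    has a weight; user confirmations and rejections nudge those weights, so
    later join detection favors the signals users actually accept."""

    def __init__(self, features, learning_rate=0.1):
        self.weights = {f: 1.0 for f in features}
        self.lr = learning_rate

    def score(self, feature_values):
        # weighted sum of the candidate join's feature values
        return sum(self.weights[f] * v for f, v in feature_values.items())

    def feedback(self, feature_values, accepted):
        # reward confirmed joins, penalize rejected ones
        direction = 1.0 if accepted else -1.0
        for f, v in feature_values.items():
            self.weights[f] += self.lr * direction * v
```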
These combined capabilities make it possible for a person to understand the semantic and syntactic qualities of a billion rows of data without coding a single line.
Paxata incorporates machine learning in all five pillars of Adaptive Data Preparation:
- Data integration: These are capabilities for extracting data from operational systems, transforming and joining that data, and delivering it to integrated structures for analytics purposes. The transformations include converting data types, simple calculations, lookups, pivoting, aggregations, filtering, and even extracting people, places, and events out of free form text. Machine learning can recommend new data sets to join, possible transformations to make on the data, and even propose normalization or de-normalization strategies that can be enabled using pivoting and de-pivoting.
- Data quality: These are capabilities for assessing the quality of data, detecting integrity violations and outliers, decomposing values into their component parts, and formatting values consistently based on standards. Syntactic cleansing can fix structural issues such as inconsistent punctuation, but it is semantic cleansing that ensures data is standardized based on its correct meaning. Machine learning can be used to automatically detect the semantic types within the data (customer names, addresses, locations, dates) and recommend monitoring and transformation rules to remediate issues; a minimal sketch of this kind of type detection appears after this list.
- Data enrichment: These are capabilities that enhance the value of internally held data by appending related attributes from external sources (for example, consumer demographic attributes and geographic descriptors). They also enable the consolidation and rationalization of data representing critical business entities, such as customers, products, and employees, by identifying, linking, or merging related entries within or across sets of data. Machine learning can recommend other data sets that people have previously combined with the data set currently being worked on, based on automatic detection of semantic types.
- Dynamic governance: These are the capabilities that enable an organization to set policies and processes that ensure important data assets are formally managed throughout the enterprise. This is manifested in functionality that captures decision rights and accountabilities for information-related processes, formalizing agreed-upon policies that describe who can take what actions with what information, when, under what circumstances, and using what methods. Machine learning can be used to automatically enforce security policies based on other policies that have been modeled explicitly, preventing “holes” in the security infrastructure.
- Ad-hoc collaboration: These are the capabilities that enable people to edit data simultaneously, share it across organizational boundaries, make requests for data, seek approvals for leveraging it in business processes, and annotate it to add context that is preserved for posterity. Machine learning can recommend the right collaborators with domain expertise in specific areas based on an understanding of who works with what type of data.
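To make the data quality pillar above concrete, here is a minimal sketch of rule-based semantic type detection of the kind described in that item (the patterns, type names, and threshold are illustrative assumptions, not Paxata's actual detectors):

```python
import re

# Illustrative semantic type patterns; a real system would cover many more.
TYPE_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_phone": re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$"),
    "iso_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}

def detect_semantic_type(values, threshold=0.8):
    """Return the semantic type matching at least `threshold` of the sampled
    values, or None if the column stays untyped."""
    sample = [str(v) for v in values if v is not None]
    if not sample:
        return None
    for type_name, pattern in TYPE_PATTERNS.items():
        hits = sum(1 for v in sample if pattern.match(v))
        if hits / len(sample) >= threshold:
            return type_name
    return None

# Once a column is typed, the system can recommend a remediation rule,
# e.g. flag the rows in an "email" column that fail the pattern.
```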
Rise of machine learning due to enabling technology
Machine learning feeds on the volume and variety of available data and requires powerful computational processing, which makes it a natural by-product of the Hadoop ecosystem. With technologies like Apache Spark and its extensible RDD model, along with columnar persistent caching, database cracking, and adaptive windowing, it is possible to learn the relationships across massive sets of data and still deliver results with interactive response times.
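As a small illustration of that enabling layer (the file path and column names are placeholders), caching a Spark DataFrame keeps it in a compressed, columnar in-memory format so that repeated interactive queries avoid re-reading from disk:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-prep-sketch").getOrCreate()

# Load a large data set once, then cache it in columnar in-memory form.
orders = spark.read.parquet("/data/orders.parquet")
orders.cache()

# The first action materializes the cache; later interactive queries reuse it.
orders.groupBy("customer_id").count().show()
orders.filter(orders["amount"] > 1000).count()
```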
As Gartner notes in its report Machine Learning Drives Digital Business, “Machine learning models can surpass human capability in coping with significant volumes of data, finding high-order interactions and patterns within the data and dealing with highly complex business problems.” It is about time that man partnered with machine to make sense of big data and reap the rewards of the digital economy.
Nenshad Bardoliwalla, Co-Founder and VP of Products at Paxata – An executive and thought leader with a proven track record of success leading product strategy, product management, and development in business analytics. Bardoliwalla co-founded Tidemark Systems, Inc., where he drove the market, product, and technology efforts for their next-generation analytic applications built for the cloud. He formerly served as VP for product management, product development, and technology at SAP, where he helped to craft the business analytics vision, strategy, and roadmap leading to the acquisitions of Pilot Software, OutlookSoft, and Business Objects. Prior to SAP, he helped launch Hyperion System 9 while at Hyperion Solutions. Nenshad began his career at Siebel Systems working on Siebel Analytics. Nenshad is also the lead author of Driven to Perform: Risk-Aware Performance Management From Strategy Through Execution.
Photo credit: gwai / Foter / CC BY-NC-ND