Man and the Machine Partner to Solve the Big Data Dilemma

by Nenshad Bardoliwalla
May 4, 2015
in Machine Learning

The collision between people and big data has caused an explosion of machine learning innovations, and one natural home for them is modern data preparation: the steps of understanding, cleaning, shaping, and correlating data before it is ready for analytics.

For thirty years, there have really only been two data preparation processes. The first is the human-led, coding-and-scripting, trial-and-error approach, which cannot scale when datasets are constantly changing and regularly generated from new, disparate sources. The other is the rigid path of ETL (Extract, Transform, Load), where a schema and a set of mappings are built and cannot be changed without an act of Congress. Neither option lets people process, analyze, or derive insight from the volumes of data they collect as rapidly as it is generated.

Today, companies like Paxata are leveraging machine learning to accelerate the modern data preparation process, giving everyone who works with data a “partner” that can do things people can no longer do with curiosity and eyes alone. It automates the exploration of data quality issues, discovering unidentified relationships, anomalies, and other data properties without being explicitly programmed on what to look for. And unlike traditional methods, which break under the stress of constantly evolving data volume and variety, machine learning only gets better as the data gets bigger and more varied.

How does machine learning work in data preparation?

The Paxata approach, for instance, uses multiple techniques to “learn” the meaning behind the data (semantic typing) and how it relates to other data elements:

Adaptive semantic indexing – An indexing and retrieval method that establishes associations between words that occur in similar contexts. The adaptive aspect is that the semantic index is created and refined on the fly as a by-product of pipeline execution, a technique often known as “database cracking.”
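As a rough illustration of the indexing idea (not Paxata's actual implementation; the class, method names, and data below are assumptions), a toy inverted index can be refined incrementally as column batches flow through a pipeline:

    from collections import defaultdict

    class AdaptiveSemanticIndex:
        """Toy index that is refined ("cracked") incrementally as column
        batches flow through a preparation pipeline."""

        def __init__(self):
            # value -> set of (dataset, column) locations where it occurs
            self.postings = defaultdict(set)

        def observe(self, dataset, column, values):
            # Called as a by-product of pipeline execution: each batch extends
            # the index instead of requiring a separate, up-front build step.
            for v in values:
                self.postings[str(v).strip().lower()].add((dataset, column))

        def related_columns(self, dataset, column):
            # Columns that share many values occur in "similar contexts" and
            # are candidates for carrying the same semantic type.
            counts = defaultdict(int)
            for locations in self.postings.values():
                if (dataset, column) in locations:
                    for other in locations - {(dataset, column)}:
                        counts[other] += 1
            return sorted(counts.items(), key=lambda kv: -kv[1])

    idx = AdaptiveSemanticIndex()
    idx.observe("orders", "ship_country", ["US", "DE", "FR"])
    idx.observe("customers", "country_code", ["de", "us", "gb"])
    print(idx.related_columns("orders", "ship_country"))  # [(('customers', 'country_code'), 2)]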




Probabilistic join recommendations – Uses the indexes to generate a virtual search space of all possible matches between words across data sets, builds statistical distributions of those matches, aggressively prunes the combinations that cannot be legitimate, and then reasons over the potential matches that remain to make matching decisions.
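A minimal sketch of this kind of overlap-based candidate generation and pruning, assuming simple dict-of-columns inputs (the function name and threshold are illustrative, not the product's algorithm):

    def join_candidates(left, right, min_overlap=0.3):
        # Score every (left column, right column) pair by the fraction of distinct
        # left values that also appear on the right, prune weak pairs, and return
        # the survivors ranked. `left` and `right` map column name -> list of values.
        scored = []
        for lname, lvals in left.items():
            lset = {str(v).lower() for v in lvals}
            if not lset:
                continue
            for rname, rvals in right.items():
                rset = {str(v).lower() for v in rvals}
                overlap = len(lset & rset) / len(lset)   # containment statistic
                if overlap >= min_overlap:               # aggressive pruning
                    scored.append((lname, rname, round(overlap, 2)))
        return sorted(scored, key=lambda t: -t[2])

    orders = {"cust_id": [1, 2, 3, 4], "country": ["US", "DE", "FR", "US"]}
    customers = {"id": [1, 2, 3], "country_code": ["us", "de", "gb"]}
    print(join_candidates(orders, customers))
    # [('cust_id', 'id', 0.75), ('country', 'country_code', 0.67)]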

Reinforcement learning – As recommendations from the join detection process are confirmed through user interaction, the model of the relationships among the datasets receives feedback, which is then reflected in the weights used in subsequent join detection steps.
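The feedback loop can be sketched as a simple weight update over the signals behind each suggestion; this is a bandit-style simplification with invented feature names, not the production model:

    class JoinFeedbackModel:
        def __init__(self, learning_rate=0.1):
            # One weight per evidence signal used to rank join suggestions.
            self.weights = {"value_overlap": 1.0, "name_similarity": 1.0}
            self.lr = learning_rate

        def score(self, signals):
            # signals: feature name -> value in [0, 1] for one suggested join
            return sum(w * signals.get(name, 0.0) for name, w in self.weights.items())

        def feedback(self, signals, accepted):
            # Confirmed joins reinforce the signals that produced them;
            # rejected joins dampen those same signals.
            direction = 1.0 if accepted else -1.0
            for name, value in signals.items():
                if name in self.weights:
                    self.weights[name] = max(0.0, self.weights[name] + self.lr * direction * value)

    model = JoinFeedbackModel()
    suggestion = {"value_overlap": 0.8, "name_similarity": 0.4}
    print(model.score(suggestion))             # ~1.2
    model.feedback(suggestion, accepted=True)  # user keeps the recommended join
    print(model.score(suggestion))             # ~1.28: the same evidence now ranks higher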

These combined capabilities make it possible for a person to understand the semantic and syntactic qualities of a billion rows of data without coding a single line.

Paxata incorporates machine learning in all five pillars of Adaptive Data Preparation:

  1. Data integration: These are capabilities for extracting data from operational systems, transforming and joining that data, and delivering it to integrated structures for analytics. The transformations include converting data types, simple calculations, lookups, pivoting, aggregations, filtering, and even extracting people, places, and events out of free-form text. Machine learning can recommend new data sets to join, suggest possible transformations, and even propose normalization or de-normalization strategies enabled through pivoting and de-pivoting.
  2. Data quality: These are capabilities for assessing the quality of data, detecting integrity violations and outliers, decomposing data into its component parts, and formatting values consistently based on standards. Syntactic cleansing can fix structural issues such as inconsistent punctuation, but it is semantic cleansing that ensures data is standardized based on its correct meaning. Machine learning can automatically detect the types within the data (customer names, addresses, locations, dates) and recommend monitoring and transformation rules to remediate issues, as shown in the sketch after this list.
  3. Data enrichment: These are capabilities that enhance the value of internally held data by appending related attributes from external sources (for example, consumer demographic attributes and geographic descriptors) and that consolidate and rationalize the data representing critical business entities, such as customers, products, and employees, by identifying, linking, or merging related entries within or across data sets. Machine learning can recommend other data sets that people have combined with the data set currently being worked on, based on automatic detection of semantic types.
  4. Dynamic governance: These are the capabilities that enable an organization to set policies and processes ensuring that important data assets are formally managed throughout the enterprise. This is manifested in functionality that captures decision rights and accountabilities for information-related processes, formalizing agreed-upon policies that describe who can take what actions with what information, when, under what circumstances, and using what methods. Machine learning can automatically enforce security policies based on other policies that have been modeled explicitly, preventing “holes” in the security infrastructure.
  5. Ad-hoc collaboration: These are the capabilities that enable people to edit data simultaneously, share it across organizational boundaries, request data and seek approvals for using it in business processes, and annotate it to preserve additional context for posterity. Machine learning can recommend the right collaborators with domain expertise in specific areas based on an understanding of who works with what type of data.
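For the data quality pillar, semantic type detection can be approximated with a rule-based sketch like the one below. Real systems learn these patterns from data rather than hard-coding them; the type names, patterns, and threshold here are assumptions for illustration:

    import re

    # Hypothetical rules; a learning system would infer these from observed data.
    SEMANTIC_TYPES = {
        "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
        "us_zip": re.compile(r"^\d{5}(-\d{4})?$"),
        "iso_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    }

    def detect_semantic_type(values, min_match=0.8):
        # Assign the column the semantic type matched by at least `min_match`
        # of its non-empty values; otherwise report it as unknown.
        values = [str(v).strip() for v in values if str(v).strip()]
        if not values:
            return "unknown"
        for type_name, pattern in SEMANTIC_TYPES.items():
            hits = sum(1 for v in values if pattern.match(v))
            if hits / len(values) >= min_match:
                return type_name
        return "unknown"

    print(detect_semantic_type(["2023-01-05", "2023-02-17", "2023-03-01"]))  # iso_date
    print(detect_semantic_type(["a@b.com", "c@d.org"]))                      # email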

Rise of machine learning due to enabling technology

Machine learning feeds on the volume and variety of available data and requires powerful computational processing, which makes it a natural by-product of the Hadoop ecosystem. With technologies like Apache Spark and its extensible RDD model, along with columnar persistent caching, database cracking, and adaptive windowing, it is possible to learn the relationships across massive sets of data and still deliver results with interactive response times.
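A minimal PySpark sketch of that pattern, caching an intermediate result once so repeated profiling queries come back at interactive speed; the column names and data are invented for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("prep-profile").getOrCreate()

    df = spark.createDataFrame(
        [(1, "US", "a@b.com"), (2, "DE", "c@d.org"), (3, "US", None)],
        ["customer_id", "country", "email"],
    )

    profiled = df.cache()  # keep the working set in memory across queries

    # Each follow-up question reuses the cached data instead of rescanning the source.
    print(profiled.select("country").distinct().count())
    print(profiled.filter(profiled.email.isNull()).count())

    spark.stop()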

As Gartner notes in its report, Machine Learning Drives Digital Business: “Machine learning models can surpass human capability in coping with significant volumes of data, finding high-order interactions and patterns within the data and dealing with highly complex business problems.” It is about time man partnered with machine to make sense of big data and reap the rewards of the digital economy.


Nenshad Bardoliwalla, Co-Founder and VP of Products at Paxata – An executive and thought leader with a proven track record of success leading product strategy, product management, and development in business analytics. Bardoliwalla co-founded Tidemark Systems, Inc., where he drove the market, product, and technology efforts for their next-generation analytic applications built for the cloud. He formerly served as VP for product management, product development, and technology at SAP, where he helped to craft the business analytics vision, strategy, and roadmap leading to the acquisitions of Pilot Software, OutlookSoft, and Business Objects. Prior to SAP, he helped launch Hyperion System 9 while at Hyperion Solutions. Nenshad began his career at Siebel Systems working on Siebel Analytics. Nenshad is also the lead author of Driven to Perform: Risk-Aware Performance Management From Strategy Through Execution.


Photo credit: gwai / Foter / CC BY-NC-ND

Tags: Data Integration, Data Preparation, Paxata, Reinforcement Learning

