Dataconomy

4 steps to help kickstart your data cleaning project

by Aleksandra Besińska
December 4, 2015
in BI & Analytics, Case Studies

Are you considering carrying out or outsourcing a data cleaning project? Find out how to start.

Table of Contents

  • Data quality problems
  • Benefits of running a data cleaning project
  • Stages of the data cleaning process
  • Data cleaning project – steps
    • 1. Profiling
    • 2. Data cleaning
    • 3. The next step is preparing the final cleaned data sets and project documentation/report
    • 4.  Automatization

Data quality problems

Companies today face the challenge of maintaining ever-growing collections of data about their customers. This data is often incorrect (e.g. it contains duplicate entries), incomplete or inconsistent. At the same time, its quality has a great influence on the effectiveness of marketing campaigns and on collecting payments for provided services (e.g. a customer does not pay an invoice because it was sent to the wrong address).

According to Lemonly.com and Software AG, the business cost of poor-quality data can equal 10%-25% of a company's income. In addition, statistics provided by Halo Business Intelligence indicate that 92% of surveyed companies admit that their customer address information is inexact, while 66% of organizations believe that incorrect data has a negative impact on their operations.

Benefits of running a data cleaning project

  • Less time needed for data unification before each use
  • Correct interpretation of business data
  • Increased data reliability
  • Less time needed to prepare data for future analyses
  • Decreased marketing campaign costs as a result of a reduced number of duplicate shipments

Stages of the data cleaning process

For 13 years we have been predicting customer behavior from collected data. We know how important data quality is: it is difficult to obtain an accurate business interpretation of data that is full of errors.


We have carried out numerous projects to assess and improve data quality in sectors such as telecommunications, debt collection, insurance and FMCG, with data cleaning efficiency exceeding 90%. In total, we have analyzed about 26 million records with customer information.

Drawing on our experience in this area, we would like to show you what a data cleaning project looks like.

Data cleaning project – steps

The diagram below depicts the main stages of a data cleaning project. Not all projects must look the same, as the requirements of our clients influence the final shape of a project.

[Figure: main stages of a data cleaning project]

1. Profiling

The goal of profiling is to detect the issues that lower the quality of the data. We verify data quality in terms of both business accuracy (e.g. outliers, agreement with dictionaries) and technical accuracy (e.g. basic statistics, data format tests).

We look for problems using the interactive tools available in our software. The result is a data profiling report that describes the data exploration carried out, lists the problems encountered and recommends cleaning methods. This report is necessary for the further work in the project.
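The kinds of checks mentioned above can be illustrated with a minimal sketch. It is not the tooling described in this article: the sample records, the field names, the `NN-NNN` postal code format and the 0 to 120 age range are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical sample records; field names are illustrative only.
RECORDS = [
    {"name": "Anna Kowalska", "postal_code": "00-950", "age": "34"},
    {"name": "Jan Nowak", "postal_code": "0095", "age": "29"},
    {"name": "", "postal_code": "30-001", "age": "412"},
]

# Assumed technical rule: postal codes look like NN-NNN.
POSTAL_CODE = re.compile(r"\d{2}-\d{3}")


def profile(records):
    """Return a list of (row index, field, problem) tuples."""
    problems = []
    for i, rec in enumerate(records):
        # Completeness check: flag empty or whitespace-only fields.
        for field, value in rec.items():
            if not value.strip():
                problems.append((i, field, "missing value"))
        # Format test (technical accuracy).
        code = rec["postal_code"]
        if code and not POSTAL_CODE.fullmatch(code):
            problems.append((i, "postal_code", "bad format"))
        # Outlier test (business accuracy), using an assumed valid range.
        age = rec["age"]
        if age.isdigit() and not 0 <= int(age) <= 120:
            problems.append((i, "age", "outlier"))
    return problems


report = profile(RECORDS)
# Count problems by type, as a profiling report summary might.
summary = Counter(problem for _, _, problem in report)
```

Here `report` lists each problem with its row and field, and `summary` gives the count per problem type, which is roughly the information a profiling report would present before cleaning methods are recommended.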

2. Data cleaning

After defining problems with data and setting further goals with our client, we begin to clean the data. This stage includes 3 tasks: Parsing, Standardization and Deduplication.

Parsing – breaking down a complex field into a number of fields based on the meaning of data and context (for example, first and last name, code and city, etc.).

[Figure: example of parsing a complex field into separate fields]
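As a rough illustration of parsing, the sketch below splits a full name and a combined "code and city" field. The function names, the output field names and the `NN-NNN` code format are assumptions for this example; real data usually needs far more rules than this.

```python
import re

# Assumed layout: a postal code in NN-NNN form followed by the city name.
CODE_CITY = re.compile(r"^\s*(?P<code>\d{2}-\d{3})\s+(?P<city>.+?)\s*$")


def parse_name(full_name):
    """Split a name field, treating the last word as the last name."""
    parts = full_name.split()
    if len(parts) < 2:
        # Not enough words to split; keep the value in one field.
        return {"first_name": "", "last_name": full_name.strip()}
    return {"first_name": " ".join(parts[:-1]), "last_name": parts[-1]}


def parse_code_city(text):
    """Split a combined 'code city' field into its two parts."""
    match = CODE_CITY.match(text)
    if match is None:
        # Unknown layout; leave the code empty rather than guess.
        return {"postal_code": "", "city": text.strip()}
    return {"postal_code": match.group("code"), "city": match.group("city")}


name = parse_name("Anna Maria Kowalska")
address = parse_code_city("00-950  Warszawa ")
```

Treating the last word as the last name is a deliberate simplification; it fails for multi-word surnames, which is one reason parsing rules are normally tuned to the data found during profiling.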

At this stage we can carry out additional tasks including:

  • Identifying, based on the contents of the “name” field, whether the record refers to a person, a group of persons, an institution, a company or a business activity
  • Determining gender based on popular first names
  • Isolating legal forms for companies – each legal form is standardized to the official CSO abbreviation

Standardization – replacing a number of different instances of the same variable with one value. For example, “New York” and “NY” will be identified as the same value and replaced with one user-defined value.
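A minimal sketch of standardization, assuming a hand-written lookup table: the `CANONICAL` mapping and the `standardize` helper are hypothetical, and the variants listed are only examples of what a user-defined dictionary might contain.

```python
# Hypothetical user-defined mapping from known variants to one canonical value.
CANONICAL = {
    "new york": "New York",
    "ny": "New York",
    "n.y.": "New York",
}


def standardize(value, mapping=CANONICAL):
    """Return the canonical form of a value, or the trimmed value if unknown."""
    # Lowercase and collapse whitespace so lookups ignore case and spacing.
    key = " ".join(value.lower().split())
    return mapping.get(key, value.strip())


cities = [standardize(v) for v in ["NY", " new   york ", "N.Y.", "Boston"]]
```

Values that are not in the mapping pass through unchanged apart from trimming, so unknown variants can be reviewed and added to the dictionary later.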

Deduplication – detecting duplicate records and consolidating them. We search for multiple entries of the same customer in the database, even if the data is stored in different formats. We can also combine databases from multiple sources and unify them by creating a single customer record that includes all the information from the various sources.

Duplicates example:

[Figure: example of duplicate records]
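The idea of finding the same customer across differently formatted entries can be sketched as follows. This example assumes a simple exact match on a normalized key (name words in any order plus the last nine phone digits); it is an illustration only, not the matching method used in the projects described here, and the records and field names are made up.

```python
import re
import unicodedata


def strip_accents(text):
    """Remove combining accent marks after Unicode decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def match_key(rec):
    """Build a format-insensitive key from the name and phone fields."""
    tokens = re.findall(r"\w+", strip_accents(rec["name"]).lower())
    digits = re.sub(r"\D", "", rec["phone"])
    # Sorting the tokens makes "Kowalska, Anna" equal to "Anna Kowalska".
    return (tuple(sorted(tokens)), digits[-9:])


def consolidate(group):
    """Merge a group of duplicates, keeping the first non-empty value per field."""
    merged = {}
    for rec in group:
        for field, value in rec.items():
            if not merged.get(field):
                merged[field] = value
    return merged


def deduplicate(records):
    """Group records by match key and return one consolidated record per group."""
    groups = {}
    for rec in records:
        groups.setdefault(match_key(rec), []).append(rec)
    return [consolidate(group) for group in groups.values()]


# Hypothetical records: the first two describe the same customer.
customers = [
    {"name": "Anna Kowalska", "phone": "+48 600 100 200", "email": ""},
    {"name": "KOWALSKA, Anna", "phone": "600-100-200", "email": "anna@example.com"},
    {"name": "Jan Nowak", "phone": "601 222 333", "email": ""},
]
unique = deduplicate(customers)
```

The consolidated record for Anna Kowalska keeps the name and phone from the first entry and picks up the email from the second, which mirrors the idea of building one customer record from several sources. Exact key matching misses typos and abbreviations, so approximate matching would likely be needed on real data.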

3. The next step is preparing the final cleaned data sets and project documentation/report

4.  Automatization

Finally, we automate the data quality processes, which allows our clients to maintain a defined level of data quality over a long period.

From this moment on, for instance, every new entry in our client’s CRM system will be cleaned as described in step 2.

As part of the data cleaning process, it is also possible to carry out additional analyses such as data enrichment (e.g. filling in missing values, detecting households) and geocoding.

Stay tuned for another article, which will go into even more detail.

Tags: Algolytics, automization, data cleaning, Data Quality, detection
