Are you considering carrying out or outsourcing a data cleaning project? Find out how to start.

Data quality problems

Companies today face the challenge of maintaining ever-growing collections of customer data. This data is often incorrect (e.g. it contains duplicate entries), incomplete or inconsistent. At the same time, its quality strongly influences the effectiveness of marketing campaigns and the collection of payments for services provided (e.g. a customer does not pay an invoice because it was sent to the wrong address).

According to and Software AG, the business cost of poor-quality data can reach 10%-25% of a company’s income. In addition, statistics provided by Halo Business Intelligence indicate that 92% of surveyed companies admit that their customer address information is inaccurate, while 66% of organizations believe that incorrect data has a negative impact on their operations.

Benefits of running a data cleaning project

  • Less time needed for data unification before each use
  • Correct interpretation of business data
  • Increased data reliability
  • Less time needed for the preparation of data for future analyses
  • Lower marketing campaign costs as a result of fewer duplicate shipments

Stages of the data cleaning process

For 13 years we have been predicting customer behavior from collected data. We know how important data quality is: it is difficult to obtain an accurate business interpretation of data that is full of errors.

We have carried out numerous projects assessing and improving data quality in sectors such as telecommunications, debt collection, insurance and FMCG, with data cleaning efficiency exceeding 90%. In total, we have analyzed about 26 million records containing customer information.

Drawing on our experience in this area, we would like to show you what a data cleaning project looks like.

Data cleaning project – steps

The diagram below depicts the main stages of a data cleaning project. Not all projects look the same, as the requirements of each client influence the final shape of a project.





1. Profiling

The goal of profiling is to detect the issues responsible for poor data quality. We verify the data in terms of business accuracy (e.g. outliers, conformance with dictionaries) and technical accuracy (e.g. basic statistics, data format tests).

We look for problems using the interactive tools available in our software. The result is a data profiling report that describes the data exploration carried out, lists the problems encountered and recommends the cleaning methods needed for the further work in the project.
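To make the two kinds of checks more concrete, here is a minimal sketch in Python. It is not the tooling described above: the records, the five-digit postal code format, the city dictionary and the plausible age range are all assumptions invented for illustration.

```python
import re
from collections import Counter

# Hypothetical records, invented for illustration only.
RECORDS = [
    {"name": "Anna Smith", "postal_code": "10001", "city": "New York", "age": 34},
    {"name": "John Brown", "postal_code": "1000", "city": "NY", "age": 41},
    {"name": "", "postal_code": "02108", "city": "Boston", "age": 212},
]

KNOWN_CITIES = {"New York", "Boston"}   # assumed reference dictionary
POSTAL_CODE = re.compile(r"\d{5}")      # assumed format: exactly five digits
AGE_RANGE = (0, 120)                    # assumed plausible range


def profile(records):
    """Count how many records fail each business or technical check."""
    issues = Counter()
    for rec in records:
        if not rec["name"].strip():
            issues["missing name"] += 1              # technical: empty field
        if not POSTAL_CODE.fullmatch(rec["postal_code"]):
            issues["bad postal code format"] += 1    # technical: format test
        if rec["city"] not in KNOWN_CITIES:
            issues["city not in dictionary"] += 1    # business: dictionary check
        if not AGE_RANGE[0] <= rec["age"] <= AGE_RANGE[1]:
            issues["age outlier"] += 1               # business: outlier
    return dict(issues)


report = profile(RECORDS)
print(report)
```

A real profiling report would add per-column statistics and recommendations, but the counts above already show which problems dominate a data set.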

2. Data cleaning

After defining the problems with the data and agreeing on further goals with our client, we begin to clean the data. This stage includes three tasks: parsing, standardization and deduplication.

Parsing – breaking down a complex field into several fields based on the meaning and context of the data (for example, splitting a full name into first and last name, or an address line into postal code and city).
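A simplified sketch of parsing in Python might look as follows. The function names and the five-digit postal code pattern are assumptions made for this example; production parsing has to cope with far more name and address variants.

```python
import re

# Assumed layout: a five-digit postal code followed by the city name.
CODE_CITY = re.compile(r"^\s*(?P<code>\d{5})\s+(?P<city>.+?)\s*$")


def parse_name(full_name):
    """Split 'First Last' or 'Last, First' into (first, last)."""
    text = " ".join(full_name.split())       # collapse repeated spaces
    if "," in text:
        last, first = [part.strip() for part in text.split(",", 1)]
    else:
        first, _, last = text.rpartition(" ")  # last word is the surname
    return first, last


def parse_code_city(value):
    """Split '10001 New York' into (code, city); code is None if absent."""
    match = CODE_CITY.match(value)
    if match is None:
        return None, value.strip()
    return match.group("code"), match.group("city")


print(parse_name("Smith,  John"))         # ('John', 'Smith')
print(parse_code_city("10001 New York"))  # ('10001', 'New York')
```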


At this stage we can carry out additional tasks including:

  • Identifying, from the contents of the “name” field, whether a record describes a person, a group of persons, an institution, a company or a business activity
  • Determining gender from popular first names
  • Isolating the legal form of a company, standardized to the official CSO abbreviation
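These additional tasks are typically driven by lookup tables. The sketch below assumes two tiny, hypothetical tables (legal forms and first names) and distinguishes only companies from persons; the abbreviations used are illustrative and are not the official CSO list.

```python
# Hypothetical lookup tables, invented for illustration only.
LEGAL_FORMS = {
    "ltd": "Ltd", "ltd.": "Ltd", "limited": "Ltd",
    "inc": "Inc", "inc.": "Inc", "incorporated": "Inc",
    "llc": "LLC",
}
GENDER_BY_FIRST_NAME = {"anna": "F", "maria": "F", "john": "M", "peter": "M"}


def classify_name(name):
    """Guess the entity type from a 'name' field and isolate the legal form."""
    tokens = name.replace(",", " ").split()
    if not tokens:
        return {"type": "unknown", "name": "", "legal_form": None, "gender": None}
    # A token that matches a known legal form marks the record as a company.
    for i, token in enumerate(tokens):
        form = LEGAL_FORMS.get(token.lower())
        if form:
            base = " ".join(tokens[:i] + tokens[i + 1:])
            return {"type": "company", "name": base,
                    "legal_form": form, "gender": None}
    # Otherwise treat it as a person and look up the first name.
    gender = GENDER_BY_FIRST_NAME.get(tokens[0].lower())
    return {"type": "person", "name": " ".join(tokens),
            "legal_form": None, "gender": gender}


print(classify_name("Acme Widgets, Inc."))
print(classify_name("Anna Smith"))
```

Names missing from the first-name table simply yield a gender of `None`, which is safer than guessing.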

Standardization – replacing the different variants of the same value with a single value. For example, “New York” and “NY” will be identified as the same value and replaced with one user-defined value.
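In its simplest form, standardization is a lookup from known variants to the user-defined value. The variant table below is a hypothetical example; values that are not in the table are returned unchanged apart from trimming.

```python
# Hypothetical variant table: every key maps to the user-defined value.
CITY_VARIANTS = {
    "ny": "New York",
    "n.y.": "New York",
    "nyc": "New York",
    "new york": "New York",
    "new york city": "New York",
}


def standardize(value, variants):
    """Replace a known variant with its standard value, ignoring case and spacing."""
    key = " ".join(value.lower().split())
    return variants.get(key, value.strip())


print(standardize("  NY ", CITY_VARIANTS))     # New York
print(standardize("new  york", CITY_VARIANTS))  # New York
print(standardize("Boston", CITY_VARIANTS))     # Boston
```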

Deduplication – detecting duplicate records and consolidating them. We search for multiple entries of the same customer in the database, even if the data is stored in different formats. It is also possible to combine databases from multiple sources and unify them by creating a single customer record that includes all the information from the various sources.

Duplicates example:
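As a rough illustration, the sketch below finds near-duplicate customer records with a fuzzy string comparison and merges them into one record. The sample records, the choice of `difflib.SequenceMatcher` and the 0.85 threshold are assumptions made for this example, not a description of the matching method used in our projects.

```python
from difflib import SequenceMatcher

# Hypothetical customer records, invented for illustration only.
RECORDS = [
    {"name": "John Smith", "city": "New York", "phone": "", "email": "john@example.com"},
    {"name": "Jon  Smith.", "city": "new york", "phone": "555-0100", "email": ""},
    {"name": "Mary Jones", "city": "Boston", "phone": "555-0199", "email": ""},
]


def key_of(record):
    """Build a comparison key: lower case, no punctuation, single spaces."""
    raw = f"{record.get('name', '')} {record.get('city', '')}".lower()
    kept = "".join(ch if ch.isalnum() else " " for ch in raw)
    return " ".join(kept.split())


def is_duplicate(a, b, threshold=0.85):
    """Treat two records as duplicates if their keys are similar enough."""
    return SequenceMatcher(None, key_of(a), key_of(b)).ratio() >= threshold


def merge(a, b):
    """Keep the values of a and fill its empty fields from b."""
    merged = dict(a)
    for field, value in b.items():
        if not merged.get(field):
            merged[field] = value
    return merged


def deduplicate(records, threshold=0.85):
    """Consolidate each record into the first earlier record it duplicates."""
    unique = []
    for rec in records:
        for i, kept in enumerate(unique):
            if is_duplicate(kept, rec, threshold):
                unique[i] = merge(kept, rec)
                break
        else:
            unique.append(rec)
    return unique


customers = deduplicate(RECORDS)
for customer in customers:
    print(customer)
```

Here the first two records collapse into a single customer that carries both the e-mail address and the phone number. Comparing every record with every other one is quadratic, so large databases usually need a blocking step (for example, comparing only records that share a postal code) before the fuzzy comparison.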


3. Final data sets and documentation

The next step is preparing the final cleaned data sets and the project documentation or report.

4. Automation

Finally, we automate the data quality processes, which allows our clients to maintain a given level of data quality over a long period.

From this moment on, every new entry in our client’s CRM system, for instance, is cleaned as described in step 2.
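One way to picture this automation is a small pipeline that runs on every incoming record before it is stored. Everything in the sketch below is hypothetical: the step functions, the `on_new_record` hook and the in-memory list standing in for a CRM are assumptions for illustration, since a real integration depends on the CRM system in question.

```python
# Hypothetical variant table for one standardization step.
CITY_VARIANTS = {"ny": "New York", "nyc": "New York", "new york": "New York"}


def strip_whitespace(record):
    """Trim and collapse whitespace in every text field."""
    return {k: " ".join(v.split()) if isinstance(v, str) else v
            for k, v in record.items()}


def standardize_city(record):
    """Replace known city variants with the standard value."""
    cleaned = dict(record)
    city = cleaned.get("city")
    if isinstance(city, str):
        cleaned["city"] = CITY_VARIANTS.get(city.lower(), city)
    return cleaned


# The cleaning steps run in order; more steps can be appended later.
CLEANING_STEPS = [strip_whitespace, standardize_city]


def clean_record(record, steps=CLEANING_STEPS):
    """Apply every cleaning step to a single record."""
    for step in steps:
        record = step(record)
    return record


def on_new_record(record, store):
    """Hypothetical hook: clean a new entry before it is saved."""
    cleaned = clean_record(record)
    store.append(cleaned)
    return cleaned


crm = []  # stand-in for the client's CRM storage
on_new_record({"name": "  John   Smith ", "city": " NY "}, crm)
print(crm)  # [{'name': 'John Smith', 'city': 'New York'}]
```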

As part of the data cleaning process, it is also possible to carry out additional analyses such as data enrichment (e.g. filling in missing values, detecting households) and geocoding.

Stay tuned for another article that will go into even more detail.
