What is Data Cleansing?

by Garrett Alley  
4 min read  • 10 Aug 2018

What does data cleansing mean?

As a business grows and matures, the size, number, formats, and types of its data assets change along with it. Evolutions in payroll systems, new network hardware and software, emerging supply-chain technologies, and the like can all create the need to migrate, merge, and combine data from multiple sources. "Dirty" data — data that contains redundancies, includes duplicate records, is missing information, or has been otherwise corrupted in the process of being imported or merged — is one inevitable result. Data transformation, which involves “massaging” data to make its fields and formats conform to those of its destination, can also be the source of hair pulling and sleepless nights.

The art and science of handling these odious tasks is called "data cleansing."

Clean up your dirty data

The goal of data cleansing is to improve data quality and utility by catching and correcting errors before it is transferred to a target database or data warehouse. Manual data cleansing may or may not be realistic, depending on the amount of data and number of data sources your company has (this blog post has some good information on how to approach manual data cleansing). There are data cleansing tools designed to take some of the difficulty out of the process.

Regardless of the methodology, data cleansing presents a handful of challenges, such as correcting mismatches, ensuring that columns are in the same order, and checking that data (such as date or currency) is in the same format. Depending on the situation, other difficulties may include enriching data with supplementary information on the fly, revising or updating schema, and detecting errors. These data discrepancies may have originated from human error, aging (data such as contact information degrades over time), omissions due to optional fields in forms, or merge errors.

Both manual and automatic data cleansing execute the same basic steps, in varying order:

  1. Import data via API or in .csv (or another delimited text format).
  2. Format data to match the destination database.
  3. Re-create missing data, wherever possible.
  4. Correct errors, such as spelling.
  5. Reorder columns and rows to match the target database.
  6. Compare and delete duplicate records.
  7. Enrich data by merging in additional information (such as adding data from purchased marketing and sales databases), if desired.

Remove the headaches from the process

Some "data wranglers" are born do-it-yourselfers and prefer to detect and correct their data inconsistencies manually. However, this methodology is error prone and does not scale as a company grows. For those seeking an automated solution, selecting the ideal data-cleansing software should be top priority.

Here are some features and functionality to look for when comparing vendors of automated data-cleansing software:

The ability to catch and correct errors in real time. Errors are inevitably introduced in the process of integrating data from multiple sources. Premium data-cleansing solutions, such as Alooma, include the ability to catch these errors and restream them for exactly-once processing.

Parse formats to enable proper merging. In merging data from multiple sources, there are often inconsistencies in data format. Your data-cleansing software must be able to match one format (such as M-D-Y) to another (such as D-M-Y).

Automatically revise or update schema. A master database can be an important tool for deriving insights from your data. However, integrating all of your data in this way requires every data source to have the same schema. The best data-cleansing software will detect this and revise schema, wherever necessary.

Enrich data before merging it into a data warehouse. It is not unusual for companies to add supplementary data from a commercial source to incoming data. An example is to enrich incoming demographic data with corrected address data (such as Zip+4 information) before transporting it to the data warehouse.

The ability to view data in real time. Ideally, data scientists and engineers can monitor data streams in real time, enabling them to detect and correct problems before the data is merged into the data warehouse.

Avoid the costs of "dirty" and inconsistent data

As someone tasked with data analysis or with ensuring data integrity, you are well aware of the relationship between data quality and revenue. That’s why you should consider protecting your organization’s most vital data assets with a premium data cleansing solution, such as Alooma.

Alooma is the enterprise data platform built for the cloud. Whether you need real-time access to data streams, have lots of different types of data sources to pull from, require complicated data transformation, or need to manage the schema as your data flows into your cloud data warehouse, Alooma has you covered.

If you’re ready to learn more about how Alooma can simplify and streamline the process of data cleansing in your environment, contact us today!

This might interest you as well