Data preparation is a process of gathering, cleansing, and organizing data so that it can be analyzed. Preparing data ensures that your data is complete (no blank or null values) and unique (contains distinct values that are not duplicated), and that the range of values is consistent with what you expect. This step is necessary because data may have missing values, or it may have inaccuracies or other errors. This process often takes place as a preliminary step when moving data through an ETL (Extraction, Transformation, Loading) tool so that data can be loaded into a data warehouse or data lake for analysis.
Why is data preparation so important?
In Steve Lohr’s New York Times article, For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights, he writes "Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets." Why do we spend so much time doing this ‘janitor work’? Your analysis is only as good as your data, plain and simple. Suppose, for example, that you want to compare sales data for the televisions sold by your company. In some of the files, you might use the term “TV,” while in others you might use the term “Television”, and still others might only reference the TV by its model number. While it may be clear to you that these are all referencing the same item, a database cannot make these kinds of inferences. A data scientist needs to map the connections to align all these sources.
What does this mean in real dollars? According to a Gartner research report, poor quality of data or bad data costs an average organization $15 million every year. But when done correctly, you can beat out the competition — creating an accurate analysis of sales, forecasting future trends, and accurately providing insights to decision-makers so that business can be steered in the right direction.
Why is data preparation so difficult?
Why are companies losing an average of $15 million every year due to bad data? Data preparation is difficult for a number of reasons. First, there can be an unwieldy amount of data to be prepared. If large data sources exist in different places, this can be a daunting task. Similarly, data may be siloed. Siloed data is data that is isolated because it is contained within a particular business group, or because it is stored in proprietary software that is not compatible with other systems. It is hard to work with siloed data because it may be in a unique format that requires normalizing, or because the task of accessing these sources may be difficult. Lastly, there has been an explosion of the types of data in recent years. Data may be generated from smartphones, laptops, websites, customer service interactions, sales and marketing, monitoring devices, and databases. Aligning these sources can be a very difficult task.
How is data preparation achieved?
There are a few steps to data preparation. First, you’ll need to gather your data. Which data is relevant to your analysis? What types of data do you have — is it structured or unstructured? Next, you’ll need to perform data profiling. Data profiling is a process of examining data from an existing source and summarizing information about that data — essentially, getting to know your data. This allows you to map out a strategy for cleaning your data. Next, you’ll cleanse and validate your data. This involves removing extraneous data, filling in missing data, normalizing data (making it conform to a pattern that is compatible with other data), and masking sensitive data. You may need to transform and enrich the data. Data transformation is the process of converting data from one format or structure into another format or structure. On the other hand, enriching data involves adding information to the data that may later confer value to the analysis, such as inserting timestamps or geographical information. Lastly, you will likely want to move your data to one place to analyze it. Typically, data is moved into a data warehouse or data lake so that you can then perform business analytics on the consolidated data.
How Alooma can help
Alooma is a cloud-based ETL solution that can help ease some of the challenges of data preparation. Alooma brings all your data sources together into data warehouses like Azure, BigQuery, Redshift, and Snowflake, or cloud storage like Amazon S3.
One of the great challenges of data preparation is the ability to handle disparate data sources, Alooma excels at handling the widest range of data sources — flat files, RDBMS, S3 buckets, and CSVs, among others.
Securing your data is critical, and Alooma can remove sensitive information before you transform it. In addition, Alooma is 100% SOC2 Type II, ISO27001, HIPAA and GDPR compliant.
Alooma can do all these things in real time. Alooma can help you cleanse your data and align your data types on the fly while you move data to the target store.
Are you ready to see how Alooma can help you with data preparation? Contact us today to learn more.