ETL testing then and now
ETL testing is traditionally a long and drawn-out process, one that can have hundreds of stages and take weeks or months. Does it have to be that way? This article presents a new technological model — the cloud-based managed data pipeline — that is changing the face of ETL testing, even at the largest organizations with the most stringent data processing requirements. It is making many parts of ETL testing unnecessary, and other parts easier and more focused.
There are several commercial products based on this new model, all mature, feature-rich, and used by major enterprises. Alooma is one of them. We’ll use our platform as an example, to showcase what the future of ETL testing looks like.
ETL testing analogy: bookkeeping in the 19th century
Source: Understanding and Using Early Nineteenth Century Account Books, Christopher Densmore
Bookkeeping, carried out at small shops and businesses around 150 years ago, was an early type of ETL. Let us explain. Purchases were logged in chronological order, and this data then needed to be parsed and converted into a usable form for analysis.
Here’s how it was done in the mid-19th century:
- The shop clerk, upon making a sale, would open the shop’s "Waste Book", a rough diary of financial transactions, and write down the date, items purchased, and price.
- When the shop owner purchased goods or services he’d also note those transactions in the Waste Book.
- The transactions were then carefully copied into a "Day Book". This was still just a listing of day-to-day transactions.
- Later on, the shop owner or accounts manager would review the Day Book and copy transactions into the General Ledger, which had accounts for different payment categories, each with two columns: "debit" and "credit".
- Numerous similar transactions would be summarized into one line in the Ledger, such as "sundries" or "supplies".
- Amounts on each side of Ledger accounts were summed and compared to see whether the books were balanced. Remember, with no calculators!
What can go wrong in this process?
- Purchases incorrectly recorded
- Details incorrectly or inconsistently copied from Waste Book to Day Book
- Mistakes in assignment of transactions to accounts in the Ledger
- Mistakes in summarization of amounts
If someone were to do this at enterprise scale, they would need something similar to our modern ETL testing process, to ensure that the final data was really correct.
Today all businesses go through the same process, but there is no need for meticulous testing to validate the data:
- Shop clerks use automatic cash registers with barcode scanning to record purchases, which prevents almost all mistakes in recording the data
- Products are recorded in a database and already allocated to payment categories
- Amounts are summarized automatically
Quite a change from the painstaking copying and manual consolidation of ages gone by. And we’re about to show you how a modern-day data pipeline can go through a similar transformation.
What does an automated, managed data pipeline look like?
There are subtle differences between today’s data pipelines as-a-service, but most use a similar architecture. Here is how Alooma works:
- Data sources are pre-integrated with the pipeline - for example, if you want to pull data from Salesforce, you only need to provide credentials and the data is pulled from Salesforce to your data store.
- A Code Engine allows you to perform transformations on the live data stream, including the ability to detect issues or anomalies on the fly and trigger notifications.
- A Mapper component automatically maps fields from each data source to the target data store. Schemas are inferred automatically and simple incongruencies, like timestamps with different formats or commas in the middle of numbers, can be fixed automatically. You can also manually control field mapping for any data source in one place.
- A Restream Queue provides a safety net in case of errors - errored events, schema changes you chose not to handle automatically, and data type mismatches do not halt the ETL process. Instead, the problematic events are put aside in a queue for inspection. Once the error is corrected, they can be "re-streamed" into the data store, and you never lose an event.
- The data store is typically a cloud-based data warehouse such as Amazon Redshift or Google BigQuery. Managed data pipelines can stream data to any destination. But when paired with a modern data warehouse, they make it possible to store nearly infinite data volumes and query them at blazing speed.
- Unlimited capacity and scalability - the entire pipeline is based on a cloud-based resilient infrastructure that is managed by Alooma. If your ETL process suddenly scales up from 1,000 events a second to 100,000, we’ll handle it with no slow-down. The ETL process is reliable with data integrity at a 99.999% level.
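To make the Code Engine step above concrete, here is a minimal sketch of what a streaming transformation might look like. It assumes events arrive as plain Python dictionaries and that a transform function returns the (possibly modified) event, or `None` to drop it; the field names are illustrative, not part of any real schema.

```python
def transform(event):
    """Illustrative streaming transform: normalize a field and
    flag an anomaly on the fly, without halting the pipeline."""
    # Normalize a price that may arrive as a string like "1,299.99"
    price = event.get("price")
    if isinstance(price, str):
        event["price"] = float(price.replace(",", ""))

    # Tag suspicious events instead of rejecting them outright,
    # so a downstream notification can be triggered
    if event.get("price", 0) < 0:
        event["_anomaly"] = "negative_price"

    return event  # returning None would discard the event
```

The key design point is that validation and transformation happen inline on the stream, so bad data is caught as it flows rather than in a separate batch testing phase.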
ETL testing, reloaded
Even with an automated data pipeline, you still need to perform testing and validation of data flowing through the pipeline. Managed data pipelines like Alooma make this process easier and more agile than in traditional ETL systems.
Below, we’ll cover two major aspects of ETL testing, and show how they are achieved in the next generation of ETL technology:
- ETL Testing Preparation
- ETL Problem Resolution
The future of ETL data preparation
Consider what you need to do today to prepare the data and infrastructure for your ETL process. Here’s what it looks like in the next generation of data pipelines.
|Problem with the traditional ETL process||How Alooma solves it|
|Data profiling - collecting statistics or informative summaries about that data||The Alooma Mapper component is also a schema explorer. While data is streaming in, you can view its characteristics and even sample data. No need to perform extensive data profiling in advance.|
|Defining data quality and performance acceptance criteria||At Alooma we distinguish between data integrity and data quality:
• Data integrity means that if you replicate data from a MySQL server to a target data store, Alooma guarantees the data on your target data store will be identical to the data on your MySQL. Identical means that the values will be exactly equal, the amount of records will also be exactly equal, with nothing missing and nothing duplicated. Alooma guarantees greater than 99.999% data integrity.
• Data quality means that while the data was copied across correctly, something could be wrong in the data. For example, the total price of the items in an invoice isn’t equal to the total amount, or the price of an item does not equal the list price in the product table. In Alooma, you can perform these types of validations or transformations using the Code Engine (see below).|
|Defining data transformation rules||Alooma’s Code Engine allows you to write Python code that performs any transformation on the data. The code is source-controlled and fully managed — you edit it in a convenient web-based IDE, or in your local IDE. It’s possible to test code changes on live streaming data and, if everything checks out, deploy to production in one click.|
|Reviewing metadata||Let’s split this into two cases:
• If your data has a schema - Alooma guarantees an exact translation of the schema, automatically. No need to review anything.
• If your data does not have a schema - Alooma collects several thousand samples and tries to infer the schema as best as possible. For example, it guesses char lengths, detects timestamp formats, etc. If Alooma guessed wrong, and ten thousand records later the schema breaks, the problematic events are set aside for you to inspect; once you decide how to handle them, you can re-stream them back into the data store.|
|Additional Stages||Here are a few more common ETL preparation stages and how they are handled in a managed data pipeline in Alooma:
• Validating source to target mapping - Alooma can automatically keep source and target schemas in sync, with optional custom mapping.
• Validating data model and dimensional modeling - Alooma’s Code Engine transforms and de-normalizes data on the fly.
• Indexation and partitioning - Alooma saves data to cloud-based data warehouses, in which indexing and partitioning is configurable (Redshift) or completely seamless (BigQuery).|
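The data-quality checks described in the table above — for instance, an invoice whose line items don't sum to its total — can be written as ordinary Python in a Code Engine-style transform. A minimal sketch, with hypothetical field names:

```python
def validate_invoice(event):
    """Illustrative data-quality rule: verify that line-item totals
    match the invoice total, and tag the event if they don't."""
    items_total = sum(item["price"] * item["quantity"]
                      for item in event.get("items", []))
    # Use a small tolerance to avoid false alarms from float rounding
    if abs(items_total - event.get("total", 0)) > 0.01:
        # Tag rather than drop: the event still loads, but is
        # flagged for later inspection
        event["_quality_error"] = "total_mismatch"
    return event
```

This illustrates the distinction drawn above: data integrity (exact replication) is the platform's job, while data quality (business-level correctness) is expressed as small, testable rules like this one.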
The future of ETL problem resolution
ETL testing is about discovering problems in the data stream and correcting them. Traditionally, this was done in a waterfall approach: identifying problems in the data or the ETL process, building a system to resolve those problems, testing that everything works, and rolling into production.
Here’s what the future looks like: an agile process in which data issues are fixed on the fly, and largely automatically, with no interruption to data ingestion.
|Problem in traditional ETL process||Solving it in a managed data pipeline|
|A good record doesn’t get loaded||Alooma runs discrepancy comparisons to make sure all data is loaded, and loaded exactly once. We guarantee this type of issue cannot occur.|
|A record may get truncated||By default, Alooma follows strict transformation and mapping rules. A record that cannot be loaded due to string size limitations is treated as an error, and stored aside in the Restream Queue for later handling (unless the user explicitly truncated it as part of a transformation).|
|A record may be written multiple times||Alooma maintains strict exactly-once loading semantics, so this issue cannot occur. See our extensive blog post on exactly-once processing.|
|A record may be transformed incorrectly:
• Mathematical errors
• Final output is wrong
|Alooma’s Code Engine allows you to test transformations against live data, sampled continuously from the data pipeline, so such errors can be detected before production deployment. Transformation code can be fixed and, once it’s working, deployed to production in seconds.|
|A record may have "dirty data" that violates business rules:
• Minimum/maximum values
• Invalid type
• Invalid value
|Using the Code Engine, you can program business rules that define what should be done with “dirty data”: discard it, fix it, store it aside for later debugging, trigger an email notification, etc.|
|System stability/scale issues||Managed data pipeline services like Alooma are built in the cloud and can scale dynamically to handle any throughput, even if it exceeds the user’s payment tier.|
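The "dirty data" row above lists three failure modes — invalid type, invalid value, and out-of-range values — and three possible responses: fix, discard, or set aside. A rough sketch of how such business rules might look as a transform (the `age` field and bounds are invented for illustration):

```python
def apply_business_rules(event):
    """Illustrative dirty-data handling: fix, discard, or set aside
    events that violate simple business rules."""
    age = event.get("age")
    # Invalid type: attempt an in-place fix first
    if isinstance(age, str) and age.isdigit():
        age = event["age"] = int(age)
    # Still invalid after the fix attempt: discard the event
    if not isinstance(age, int):
        return None  # returning None drops the event
    # Out-of-range value: keep it, but set it aside for debugging
    if not 0 <= age <= 120:
        event["_needs_review"] = True
    return event
```

In practice each branch could also trigger a notification or route the event to a holding queue; the point is that the policy for dirty data is explicit code, not an implicit property of a monolithic ETL job.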
In this article we showed what a next-generation data pipeline looks like, and how this new architecture affects the numerous stages of the ETL testing cycle. The change can be summarized as follows: new architectures like the one implemented by Alooma represent a shift from a monolithic, waterfall-style ETL development paradigm to an agile one.
ETL testing is still needed in the next-generation data pipeline as-a-service. However, much of it is automated, just like most tests and processes were automated when organizations shifted to agile software development. For example, critical functions like production validation and metadata testing are performed by the platform with no user intervention. Other testing functions, such as performance and recovery testing, are "outsourced" to the operator of the managed platform or to the cloud infrastructure on which it operates.
Of course, just like in agile development, there will always be tests which need to be identified and performed by humans. But they can occur on-the-fly, tested on live data and not on an elaborate staging environment. They do not require extensive re-work and re-deployment of the entire ETL infrastructure for every change, enabling frequent iteration and fast evolution.
We started by showing the difference between 19th century bookkeeping and modern stores with fully automated checkout processes. We hope we’re not exaggerating when we say this: just like store clerks from the 1800s would be astonished to see a modern store, so would operators of enterprise-scale ETL factories be amazed to experience the simplicity of the new, agile and automated, data pipeline as-a-service.