ETL process basics
Extract, Transform, and Load (ETL) is a process that involves extracting data from disparate sources and transforming it, performing such actions as changing the data types or applying calculations. Then, after the data is transformed, it is loaded into the target data store, typically a data warehouse.
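The three steps can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the file name, column names, tax rate, and table name are all hypothetical, and SQLite stands in for the target data warehouse.

```python
import csv
import sqlite3

# Extract: read raw rows from a source export (hypothetical CSV with
# "order_id" and "amount" columns, both stored as strings).
def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# Transform: change data types and apply a calculation.
def transform(rows):
    out = []
    for row in rows:
        amount = float(row["amount"])                    # string -> float
        out.append({
            "order_id": int(row["order_id"]),            # string -> int
            "amount": round(amount, 2),
            "amount_with_tax": round(amount * 1.08, 2),  # hypothetical 8% tax
        })
    return out

# Load: write the transformed rows into the target store
# (SQLite here; in practice this would be a data warehouse).
def load(rows, db_path="warehouse.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS orders "
                "(order_id INTEGER, amount REAL, amount_with_tax REAL)")
    con.executemany(
        "INSERT INTO orders VALUES (:order_id, :amount, :amount_with_tax)",
        rows)
    con.commit()
    con.close()
```

In a real pipeline each step would be scheduled, monitored, and fault-tolerant, but the shape — extract, then transform, then load — is the same.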
What is the traditional ETL process?
Traditional ETL systems were developed in the 1970s, when enterprises needed to bring together data from different sources, such as sales, inventory, and customer records. Typically, this data was moved from one structured database to another, in batches scheduled to align with periods of low system load. Like all tools, however, ETL was built to work with the technology of its time. That technology is changing, and so are the expectations of companies.
Traditional ETL tools are well suited to relational databases but are often less equipped for unstructured data. They are typically designed to move data in batches, meaning that large volumes of data are moved at a scheduled time, usually when network traffic is low. As a result, you generally cannot run ETL outside the scheduled batches or perform any kind of real-time analysis. In addition, with traditional ETL tools, a change to your plan may require restructuring the mappings and reloading all of the data.
What is the modern ETL process?
Like all good inventions, modern ETL was designed to address the existing and emerging problems of real-world users. There has been an explosion in the volume and variety of data sources that people want to track and query, and the traditional model, built for structured data, was inadequate. In addition, today's business decisions must be made in real time, so traditional batch processing is too slow. Lastly, an increasing number of businesses want to move their data and operations to the cloud, so a cloud-based ETL tool makes sense.
Flexible tools to handle different data sources. Modern ETL tools excel at addressing the exploding growth of different types of data sources and are designed to work with both structured and unstructured data.
On-premise or cloud data warehouses. Modern ETL tools are built to integrate with on-premise environments and cloud data warehouses — Amazon Redshift, Snowflake, Google BigQuery, Azure, or any number of other options.
Real-time pipelines. In addition, modern ETL tools are designed to move data in real time and to allow for changes to the schema on the fly.
Flexible pipelines. Lastly, and perhaps most importantly, modern ETL might actually be a pipeline that supports a combination of ETL and ELT (Extract, Load, and Transform — the data is loaded to the target data warehouse and transformed afterwards). ELT can be a powerful way to leverage the increasing power of cloud-based data warehouses to perform such tasks as joins or complex calculations. This allows you the flexibility to perform transformations where it is most logical and efficient. For example, you might perform some tasks in the data warehouse, and perform the following transformations in the pipeline to protect data privacy or add important information that enriches the data:
- Cleanse values that are represented differently in a database; for example, changing values of 0 or 1 to false or true, or rounding floats to integers.
- Scrub personally identifiable information (PII).
- Convert currency types.
- Enrich data with the results of lookups from other sources, such as Geo-IP resolution.
- Discard undesired events.
- Split up a single event into multiple events across different database tables to make the data more modular and easy to query.
- Gather metrics on incoming data.
- Generate notifications.
How they compare
The following table compares traditional to modern ETL tools based on some key factors.
| Factor | Traditional ETL Process | Modern ETL Process |
| --- | --- | --- |
| Sources & data types | Well suited for traditional data sources, such as relational databases. | Designed to handle a wide variety of sources and targets, both structured and unstructured. |
| Hardware requirements | Often requires specific hardware and has its own engines to perform transformations on the data. | Often cloud-based, where hardware requirements and infrastructure costs are absorbed by the ETL vendor. This can be a significant cost savings for the client. |
| Flexibility | Typically less flexible regarding changes to schema, variety of sources and targets, and ability to combine transformations in the pipeline and in the data warehouse. | Works seamlessly with a variety of sources and targets, can perform a combination of ETL and ELT, and may work with both cloud and on-premise sources. |
| Real time vs. batch | Processes data in batches. | Processes data in batches or in real time. |
| Security | Security is straightforward, assuming you have the right resources in place and all of the parts are on premise. | Security in the cloud is provided by the vendor. |
How Alooma can help
If you are considering a modern ETL solution, Alooma invites you to contact us to see if our data pipeline may fit your needs. Alooma is uniquely positioned to offer a flexible, secure solution.
- Alooma is format-agnostic and can integrate the widest range of data sources, structured and unstructured.
- Alooma’s solution is flexible enough to perform some transformations in the pipeline and some in the target data warehouse, taking advantage of the data warehouse's processing power. We let you perform transformations where it makes the most sense.
- Alooma is designed to handle massive amounts of data, moving it in real time so that analysis is near-instantaneous.
- Because Alooma is cloud-based, it is flexible, allowing you to scale up or down based on your unique business needs.
- And lastly, you can relax knowing that security is a cornerstone of Alooma’s platform. Alooma is SOC 2 Type II, HIPAA, GDPR, and EU-US Privacy Shield Framework compliant and supports OAuth 2.0. Data is encrypted in motion and at rest.
Contact us today to see if our solution is right for you!