What are the fundamental principles behind Extract, Transform, Load?
Fundamentally, ETL is the process of extracting data from a data store used for one set of tasks and loading it into another data store to be used for different tasks. The T(ransform) step usually adapts the data for the destination, whether by redacting unnecessary data, adding missing data (often called enrichment), or manipulating existing data in some way (e.g. converting timestamp formats).
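A minimal sketch of such a transform step in Python, with hypothetical field names (`ssn`, `email`, `created_at`) chosen purely for illustration:

```python
from datetime import datetime, timezone

def transform(row):
    """Illustrative transform step: redact, enrich, and reformat one record.
    The field names here are hypothetical, not from any particular schema."""
    out = dict(row)
    # Redact data the destination does not need
    out.pop("ssn", None)
    # Enrich: add a field derived from existing data
    out["domain"] = out["email"].split("@")[1]
    # Convert a Unix epoch timestamp to an ISO 8601 string
    out["created_at"] = datetime.fromtimestamp(
        out["created_at"], tz=timezone.utc
    ).isoformat()
    return out
```

In a real pipeline this function would be applied to every extracted record before loading, and each of the three manipulations (redaction, enrichment, format conversion) could of course be far more involved.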
ETL processes are often used as part of a wider workflow: extract data from a source, load it into a destination, then kick off further in-data-store operations such as creating views or updating derivative tables. Commonly, this workflow exists to extract information from the raw data, often by visualizing it in dashboards or reports.
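Such a workflow can be sketched with SQLite standing in for the destination data store (the `sales` table and `sales_by_region` view are made-up examples): load the extracted rows, then run a post-load, in-data-store step that creates a view ready for a dashboard to query.

```python
import sqlite3

# Destination data store (in-memory SQLite for the sake of the sketch)
dest = sqlite3.connect(":memory:")
dest.execute("CREATE TABLE sales (region TEXT, amount REAL)")

# Pretend these rows were extracted (and transformed) from some source
extracted = [("north", 120.0), ("south", 75.5), ("north", 30.0)]
dest.executemany("INSERT INTO sales VALUES (?, ?)", extracted)  # load

# Post-load in-data-store operation: a derived view for reporting
dest.execute("""
    CREATE VIEW sales_by_region AS
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
""")
rows = dest.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
```

A dashboard would then read from `sales_by_region` rather than the raw table, which is exactly the "derivative" layer the workflow exists to maintain.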
As data volumes and needs keep growing and changing, the term ETL is slowly being abandoned in favor of the broader "data pipeline". While ETL is usually batch-oriented and database-to-database, modern data pipelines are any-to-any, and many of them are stream-oriented, treating every piece of data as a discrete event. For example, a pipeline might watch CSVs in a data lake and ping a REST endpoint for every row that matches some criterion. It might have several outputs, it might merge data from several sources, and so on.
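The CSV example above can be sketched as a small stream-oriented pipeline. Everything here is hypothetical: the `sink` callable stands in for the REST call (in production it might POST each matching row), so the sketch stays runnable without a network.

```python
import csv
import io

def stream_pipeline(csv_text, matches, sink):
    """Treat each CSV row as an event: forward rows that match the
    criterion to a sink. In a real pipeline the sink might POST the
    row to a REST endpoint; here it is just any callable."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if matches(row):
            sink(row)

# Made-up data and criterion: forward every row whose status is "error"
data = "id,status\n1,ok\n2,error\n3,error\n"
sent = []
stream_pipeline(data, lambda r: r["status"] == "error", sent.append)
```

The row-at-a-time shape is the point: nothing waits for a full batch, so the same skeleton works whether the rows come from a file, a queue, or a change stream.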
Published at Quora.