How do ETL tools handle the situation when Target is offline?
It varies between ETL tools.
The simple / naive approach
When you have only one target and it is offline, it seems sensible to simply stop the ETL procedure and wait for it to come back online: do no work in the meantime, and periodically poll to check whether the target is reachable again.
But what do you do if you have two targets and only one of them is offline? You don't want to stop loading into the healthy target, but you also don't want to lose data destined for the offline one, so you clearly need to buffer the data in some way inside the ETL tool itself.
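The stop-and-poll behavior described above can be sketched in a few lines. The function and parameter names here are illustrative assumptions, not taken from any particular ETL tool:

```python
import time

def wait_for_target(is_online, poll_interval=30.0):
    """Block until the target reports it is back online.

    `is_online` is a caller-supplied health check (hypothetical); a real
    check might run a lightweight query or open a TCP connection.
    """
    while not is_online():
        time.sleep(poll_interval)

# Simulated health check that succeeds on the third poll.
attempts = {"n": 0}

def fake_check():
    attempts["n"] += 1
    return attempts["n"] >= 3

wait_for_target(fake_check, poll_interval=0.01)
```

While waiting, the pipeline does no work at all, which is exactly the limitation the rest of this answer addresses.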
The simplest data-lossy solution
No target? No problem. Simply throw away the data that can't be loaded, but don't stop the pipeline (any other targets still get their data). This is sometimes acceptable for ETL solutions that handle periodic dumps of data (e.g. periodic full table copies from DB to DB): the table snapshots will be extracted in full again once the target is back up, so any snapshots taken while the target is down can safely be discarded.
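A data-lossy fan-out like this is trivial to implement. The sketch below is a simplified assumption of how such a dispatcher might look (the `targets` structure and names are hypothetical):

```python
def dispatch(record, targets):
    """Data-lossy fan-out: deliver `record` to every online target and
    silently drop it for offline ones, so the pipeline never stalls."""
    delivered = []
    for name, target in targets.items():
        if target["online"]:
            target["load"](record)
            delivered.append(name)
        # Offline targets: the record is simply discarded for them.
    return delivered

# Two targets, one of them down: the healthy one still gets the row.
warehouse_rows, backup_rows = [], []
targets = {
    "warehouse": {"online": True, "load": warehouse_rows.append},
    "backup": {"online": False, "load": backup_rows.append},
}
delivered = dispatch({"id": 1}, targets)
```

The offline target permanently misses the record, which is only tolerable when a later full snapshot will overwrite it anyway.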
The bigger problem
This issue is not limited to multi-target ETL tools. Consider the following (very common) ETL scenario:
- You have a MySQL database with many tables which you are continuously replicating to a cloud data warehouse, such as Amazon Redshift
- To read the MySQL database with minimal impact on its performance, you’re reading the MySQL Binary log (binlog), an operation log that records every change made on the database.
Since you don't want the binlog to consume all of your database server's disk space, MySQL eventually deletes old log files to make room for new ones. That means your ETL tool can't just stop reading: if MySQL deletes the file it was reading, your ETL tool will miss any data in the deleted file (for more on why binlog replication is better than just copying the tables, check out this blog post about replicating MySQL to Redshift; skip to "Different approaches for MySQL replication").
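To see why a paused reader loses data, here is a toy model of log rotation. This is purely illustrative and not how MySQL's binlog is actually implemented; the class and parameter names are invented for the example:

```python
from collections import deque

class RotatingLog:
    """Toy model of binlog rotation: a bounded number of log files is
    retained, and the oldest is purged to free space for new ones."""
    def __init__(self, retain=2):
        self.retain = retain
        self.files = deque()      # entries of (file_id, events)
        self.next_id = 0

    def rotate(self, events):
        self.files.append((self.next_id, events))
        self.next_id += 1
        while len(self.files) > self.retain:
            self.files.popleft()  # oldest file purged, its events gone

# Four files written while a paused reader still sits at file 0:
log = RotatingLog(retain=2)
for batch in (["a"], ["b"], ["c"], ["d"]):
    log.rotate(batch)
available = sorted(fid for fid, _ in log.files)
```

Only the two newest files survive; the events in the purged files are unrecoverable for any reader that stopped consuming, which is why the ETL tool must keep reading even while the target is down.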
So again, it looks like in some cases you MUST buffer the data in storage for later loading, because the ETL tool must keep working even while the target is offline.
The buffering approach
When the target is offline, all the data the ETL tool extracts is buffered. The buffered data can be kept in a queue, in files, or in any other efficient storage. Once the target is back up, the data is simply sent to it in the order it was extracted. This can be triggered manually by a data engineer, or by automatic periodic checks for target availability.
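The buffering approach can be sketched as a small wrapper around the target. The in-memory deque here is a simplifying assumption for brevity; a real tool would use durable storage (files, a message queue, etc.) so the buffer survives restarts:

```python
from collections import deque

class BufferedTarget:
    """Sketch of the buffering approach: hold records while the target
    is offline, then flush them in extraction order on recovery."""
    def __init__(self, load):
        self.load = load        # writes one record to the real target
        self.online = True
        self.buffer = deque()

    def send(self, record):
        if self.online:
            self.load(record)
        else:
            self.buffer.append(record)   # target down: keep, don't drop

    def mark_online(self):
        self.online = True
        while self.buffer:               # flush in FIFO (extraction) order
            self.load(self.buffer.popleft())

loaded = []
target = BufferedTarget(loaded.append)
target.send("row-1")
target.online = False
target.send("row-2")                     # buffered, not lost
target.send("row-3")
target.mark_online()                     # outage over: buffer is flushed
```

After recovery the target has received every record exactly once, in the original order.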
The Alooma approach
I work at Alooma, a data pipeline SaaS. Our solution to this issue is the buffering approach, implemented with what we call the Restream Queue. The data pipeline never stops extracting data from its sources; whenever an issue is encountered (target offline, data type mismatch, or anything else), we simply send the affected data to the queue, where it remains viewable. Once the target is back up, the Restream Queue is flushed and its contents are sent to the target for loading, so the data is never lost.
Published at Quora. See Original Question here