What do I need to consider when creating an event-driven ETL?
Event-driven ETL, also known as a data pipeline, differs from classic ETL in that it treats source data as a stream rather than pulling batches of data every (insert-interval-here).
The first thing to consider is whether you need a data pipeline (event-driven ETL) at all. The biggest advantage of event-driven ETL tools is latency - they tend to be near-real-time, and thus allow scenarios that batch/schedule-driven ETL doesn’t, e.g. real-time alerting on events (send an alert when an event is received from X). If you are OK with having up-to-date data once every few hours, or once a day, then you might not need it.
If you do choose a data pipeline:
Stream-oriented ETL allows faster access to the data to drive more real-time decisions, but it does come with some additional complexity. I think this is the key difference when building an event-driven tool:
- Data co-dependency - Especially when you have more than one data source, you don’t want to create a dependency between the different sources. One classic dependency is that an error occurs while loading an event, and the entire pipeline is then stalled until that error is fixed, because data is waiting in line behind the failed event. In some cases, not only events from the same source are stuck, but the entire data flow might be stalled. To avoid that, you should implement some logic to handle failed events in a way that doesn’t block (maybe store them in a file and check them once a day? maybe just discard them? there are various approaches).
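To make the non-blocking idea concrete, here is a minimal sketch in Python of the "store them and check later" approach. It is only an illustration under my own assumptions: `run_pipeline`, `dump_dead_letters`, the `load` callback, and the in-memory `dead_letters` list are hypothetical names, not part of any specific ETL tool, and a real pipeline would likely park failures in a durable queue or table rather than a Python list.

```python
import json
from typing import Any, Callable, Dict, Iterable, List

Event = Dict[str, Any]


def run_pipeline(
    events: Iterable[Event],
    load: Callable[[Event], None],
    dead_letters: List[Dict[str, Any]],
) -> int:
    """Load events one by one; park failures instead of stalling the stream.

    `load` is a hypothetical callback that writes one event to the target
    and raises an exception when the event cannot be loaded.
    Returns the number of events loaded successfully.
    """
    loaded = 0
    for event in events:
        try:
            load(event)
            loaded += 1
        except Exception as exc:  # broad on purpose: one bad event must not block the rest
            # Keep the event together with the reason it failed, then move on.
            dead_letters.append({"event": event, "error": str(exc)})
    return loaded


def dump_dead_letters(dead_letters: List[Dict[str, Any]], path: str) -> None:
    """Append parked events to a JSON-lines file for a later (e.g. daily) review."""
    with open(path, "a", encoding="utf-8") as fh:
        for record in dead_letters:
            fh.write(json.dumps(record) + "\n")
```

The design choice here is that a failure costs you one event, not the whole stream: events behind the bad one keep flowing, and the parked events still carry enough context (the payload and the error message) to be fixed and replayed, or discarded, later.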
Having said that, both event-driven ETL and batch ETL share some basic difficulties. If you’re looking for more information I can recommend this blog post about Building ETL pipeline.
Published at Quora. See Original Question here