What's the most tedious part of building ETLs and/or data pipelines?

By Yuval Barth
Updated Feb 28, 2019

Data pipelines are like your home plumbing. It always starts easy:

You build your home, making sure the plumbing supports your one toilet, two sinks, and a shower. For five days, everything works great. Then one day you shave, and you find out that the curvature of the pipe plus your beard hairs clogs the drain, and water starts backing up. You quickly run out and buy a better pipe, and the problem is fixed.

A week later, you install a dishwasher, only to find out the throughput of your main pipe doesn't support running the shower and the dishwasher at the same time, so you can only turn on the dishwasher at night, when no one showers. Bummer. You install a bigger main pipe, but then your sink faucet breaks due to the higher water pressure. You now need to fix that. Two days later, your kitchen sink rusts and breaks, and water starts leaking. You fix that sink and finally get some rest, only to realize your kid just flushed his teddy bear down the toilet. Sewage explosion. Thanks, kid.

But seriously - the average data pipeline never ends up doing just the one thing it was built for. Instead, two things always happen:

1. New data requirements surface - and with them, changes must be made to the pipeline. These changes are very hard to forecast, and as the organization grows, they become more and more frequent. New sources, new data schemas, and data volume growth mean you never reach pure "maintenance mode".
2. Existing data requirements change - "I forgot to add a field, can I add it?", "This is no longer a timestamp, it is now a boolean", "This database is now configured to terminate connections after 5 seconds", "Our web server now sends the data encoded in Mandarin" - it never ends.
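That "timestamp becomes a boolean" class of schema drift is a good example of where pipeline code has to get defensive. Here is a minimal sketch of the idea in Python; the function name and the specific coercion rules are illustrative assumptions, not the approach of any vendor mentioned above:

```python
from datetime import datetime, timezone

def parse_event_time(value):
    """Coerce a field that has drifted across types into a datetime, or None.

    Upstream changes (ISO strings, epoch numbers, even a surprise boolean)
    happen; parsing defensively keeps one bad field from killing the run.
    """
    # bool is a subclass of int in Python, so this check must come first.
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        # Treat numbers as Unix epoch seconds (an assumption for this sketch).
        return datetime.fromtimestamp(value, tz=timezone.utc)
    if isinstance(value, str):
        try:
            return datetime.fromisoformat(value)
        except ValueError:
            return None
    return None
```

Records whose field comes back as `None` can be routed to a dead-letter store for inspection instead of crashing the pipeline, which is one common way managed tools absorb this kind of change.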

The fact is, the most tedious part of building a data pipeline starts exactly one second after you finish building it - when "all that is left" is to maintain it. That is why many companies decide to use managed solutions - like Alooma (where I work), Fivetran, Stitch Data, Xplenty, and many others.

If you want more concrete examples of data pipeline woes, check out our blog post about building production-grade data pipelines. Good luck to your peeps at the office!


Published at Quora.