Why should I use an existing ETL vs writing my own in Python?
A major factor is that companies providing ETL solutions treat it as their core business, so they are constantly improving performance and stability while adding new features (sometimes ones you can’t foresee needing until you hit a certain roadblock on your own).
If your environment is currently simple, it could seem very easy to develop your own ETL solution… but what happens when the business grows?
Here are a few points to consider:
Schema changes: as your business grows, the ETL process gains more inputs, which might come from tools developed by different people in your organization, and your original schema likely won’t fit the new requirements. At that point you’d want to be able to easily adjust your ETL process to the schema changes.
Data visibility: detecting schema changes (or other changes in the data) might not be that easy in the first place. You’d want to get notified once something like that happens, and you’d also want it to be very easy to understand what has changed.
Scalability: as your business grows, your data volume grows with it. Your ETL solution should be able to grow as well.
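To make the schema and visibility points concrete, here is a minimal sketch of what "noticing that something changed" might look like in a home-grown Python pipeline. The function names (infer_schema, diff_schema) and the sample fields are hypothetical, and the "schema" here is just a mapping of field name to Python type name; a real pipeline would also need to handle nested records, nulls and alerting.

```python
from typing import Any, Dict, List


def infer_schema(record: Dict[str, Any]) -> Dict[str, str]:
    """Describe a flat record as {field name: Python type name}."""
    return {field: type(value).__name__ for field, value in record.items()}


def diff_schema(expected: Dict[str, str], observed: Dict[str, str]) -> List[str]:
    """Return human-readable differences between two flat schemas."""
    changes: List[str] = []
    # Fields that appear in the incoming data but not in the expected schema.
    for field in sorted(observed.keys() - expected.keys()):
        changes.append(f"added field '{field}' ({observed[field]})")
    # Fields the expected schema has but the incoming data lacks.
    for field in sorted(expected.keys() - observed.keys()):
        changes.append(f"missing field '{field}' ({expected[field]})")
    # Fields present on both sides whose type differs.
    for field in sorted(expected.keys() & observed.keys()):
        if expected[field] != observed[field]:
            changes.append(
                f"type of '{field}' changed from {expected[field]} to {observed[field]}"
            )
    return changes


# Hypothetical example: an upstream tool starts sending user_id as a string,
# drops email and adds signup_ts.
expected_schema = {"user_id": "int", "email": "str"}
incoming_record = {"user_id": "42", "signup_ts": 1700000000}
changes = diff_schema(expected_schema, infer_schema(incoming_record))
for change in changes:
    print(change)
```

Even this toy version only covers detection. Deciding what to do with the offending records, notifying the right people and migrating the target tables are all still on you.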
And these are just the baseline considerations for a company that focuses on ETL.
The main advantage of creating your own solution (in Python, for example) is flexibility. As in the famous open-closed principle, when choosing an ETL framework you’d also want it to be open for extension.
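As a rough illustration of what "open for extension" can mean in practice, here is a small sketch of a pipeline that accepts user-supplied transform functions without its own code being modified. The Pipeline class, its register and run methods, and the two example transforms are all hypothetical names made up for this answer; they are not the API of any particular product.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional

Record = Dict[str, Any]
# A transform returns the (possibly modified) record, or None to drop it.
Transform = Callable[[Record], Optional[Record]]


class Pipeline:
    """Runs records through a list of pluggable transforms, in order."""

    def __init__(self) -> None:
        self._transforms: List[Transform] = []

    def register(self, transform: Transform) -> Transform:
        """Add a transform; returns it unchanged so it works as a decorator."""
        self._transforms.append(transform)
        return transform

    def run(self, records: Iterable[Record]) -> Iterator[Record]:
        for record in records:
            current: Optional[Record] = dict(record)  # copy, leave input intact
            for transform in self._transforms:
                current = transform(current)
                if current is None:  # a transform chose to drop this record
                    break
            if current is not None:
                yield current


pipeline = Pipeline()


@pipeline.register
def drop_test_users(record: Record) -> Optional[Record]:
    # Hypothetical rule: discard records coming from a test domain.
    return None if record.get("email", "").endswith("@example.com") else record


@pipeline.register
def normalize_email(record: Record) -> Optional[Record]:
    record["email"] = record.get("email", "").lower()
    return record


result = list(pipeline.run([{"email": "A@Corp.io"}, {"email": "bot@example.com"}]))
print(result)
```

The core loop never changes when a new rule is needed; you just register another function. That is the kind of flexibility I would look for in any ETL framework.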
If you are open to a solution that combines the stability and features of a professional system with the flexibility of running your own Python scripts to transform data in-stream, I would recommend checking out Alooma. Full disclosure: I am an engineer at Alooma.
My colleague, Rami, has written a more in-depth technical post about these considerations if you’re looking for more information: Building a Professional Grade Data Pipeline.