How do you create an ETL framework?

byRami Amar
Updated Dec 28, 2017

Building an ETL framework is a lot more work than you think, and even if you do decide to go down that path, don’t start if from scratch.

If I may try to discourage you from doing this yourself, I think the biggest challenges in building an ETL framework are:

  • Keeping up with constant schema changes
  • Keeping up with constant integrations’ API changes
  • Handling sources/targets failures without loosing or duplicating data
  • Scaling your infrastructure as your business generates and demands more data

If you have more time, I elaborated on the above in a blog post I wrote quite a while ago, and in another quora answer about pitfalls in data integrity.

Now, if all of the answers you received haven’t intimidated you enough, then here are some practical recommendations.

  1. Start with listing your requirements:

    1. What are the top 5 business questions your organization needs to answer?
    2. Where does the data required to answer these questions reside?
    3. How much data exists, and how much new data is generated daily?
  2. Continue with making some design choices:

    1. The amounts of data should help you choose a data warehouse. Up to a couple of terabytes, a strong MySQL / Postgres will probably hold. More than that, you should consider a cloud data warehouse like BigQuery, Redshift, or Snowflake
    2. The questions you need to answer should help you design your data warehouse’s schema. Optimize to finish your ETL process in tables which provide fast queries for your questions
    3. Your business questions also imply timing, and will help you decide whether you can live with batch loading (easier) which incur a delay in data freshness, or whether you need to build a streaming solution enabling you to get answers updated to the minute.
  3. Choose from the wide variety of open source tools available. Here’s a far from exhaustive list:

    1. Kafka & Kafka Connect - if you optimize for scale and real time, and have a lot of resources to invest
    2. Airflow, Luigi & Celery - if you are python savvy, and need to build batch oriented workflows
    3. Spark Streaming - if you’ve got an HDFS cluster under your hands
    4. Pandas - is a great python library for working with data
    5. SQLAlchemy - python library for working with databases
    6. Talend Open Studio - if you are looking for a legacy ETL development tool
    7. Snowplow - also offer an open source event analytics framework
    8. Singer - collection of integrations open sourced by Stitch

Still reading ? +1 for determination ;)

Good luck!

Like what you read? Share on

Published at Quora. See Original Question here