Should we build a data warehouse in-house or use an external tool?
It really depends on your needs. There are excellent data warehouses on the market today that let you do a lot with your data; a few options that come to mind are Snowflake, Google BigQuery, and Amazon Redshift. When building in-house, there's a lot to consider: total cost of ownership (TCO), maintenance, administration, and so on.
The tricky part is getting the data into the warehouse, and that's something people don't always think about. If you're building your own data warehouse, you'll also need to build an in-house data pipeline. An in-house pipeline works well when the data is mostly consistent, the schema isn't expected to change often, and you don't have many sources. The process is pretty straightforward:
- Create a monitoring process that detects data changes
- Using Python or another scripting language, convert the data to the format expected by your selected data warehouse (external or in-house)
- Copy the data to the selected warehouse
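To make those three steps concrete, here's a minimal sketch in Python. SQLite stands in for both the source database and the warehouse, and the `orders` table, its columns, and the watermark-on-`id` change detection are all made up for illustration; a real pipeline would point at your production database and at the load API of Redshift, BigQuery, or Snowflake.

```python
import csv
import io
import sqlite3


def detect_changes(source, last_seen_id):
    """Step 1: poll the source for rows added since the last run
    (a simple high-watermark on an increasing id column)."""
    cur = source.execute(
        "SELECT id, name, amount FROM orders WHERE id > ? ORDER BY id",
        (last_seen_id,),
    )
    return cur.fetchall()


def convert_to_csv(rows):
    """Step 2: convert the rows to the load format of the target
    warehouse (CSV is a common staging format)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "name", "amount"])
    writer.writerows(rows)
    return buf.getvalue()


def copy_to_warehouse(warehouse, rows):
    """Step 3: copy the data into the warehouse (a plain INSERT here;
    a real warehouse would use a bulk COPY/LOAD from staged files)."""
    warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    warehouse.commit()


# Stand-in source with a couple of rows.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, name TEXT, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "a", 9.5), (2, "b", 12.0)],
)

# Stand-in warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (id INTEGER, name TEXT, amount REAL)")

new_rows = detect_changes(source, last_seen_id=0)
csv_payload = convert_to_csv(new_rows)  # would be staged (e.g. to S3) first
copy_to_warehouse(warehouse, new_rows)

print(warehouse.execute("SELECT COUNT(*) FROM orders").fetchone()[0])
```

In a real deployment the watermark would be persisted between runs, and the CSV would be staged to object storage before a bulk load, but the shape of the loop — detect, convert, copy — is the same.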
The problems begin once the schema starts changing frequently, you want to add more data sources, or errors occur in the source or target. I recommend reading this blog post for more information about building an in-house pipeline.
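The schema-change failure mode is easy to demonstrate. The hand-written load step above is coded against a fixed column list, so the moment the source adds a column, the copy breaks rather than adapts (again using SQLite and a made-up `orders` table as stand-ins):

```python
import sqlite3

# Stand-in warehouse with the schema the pipeline was originally written for.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (id INTEGER, name TEXT)")

# The source later adds an "amount" column, so extracted rows now
# carry three fields while the load statement still expects two.
row_after_schema_change = (1, "a", 9.5)

try:
    warehouse.execute("INSERT INTO orders VALUES (?, ?)", row_after_schema_change)
except sqlite3.ProgrammingError as exc:
    # The hard-coded load step fails instead of handling the new column.
    print("pipeline broke:", exc)
```

Handling this gracefully (detecting the new column, evolving the warehouse table, backfilling) is exactly the kind of ongoing work that makes in-house pipelines expensive to maintain.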
Full disclosure - I'm a Software Engineer at alooma, a data-pipeline-as-a-service company.
Published at Quora. See Original Question here