Why did you choose Snowflake over Amazon Redshift for your cloud data warehouse?
Redshift is a great cloud Data Warehouse, and in a way it was the first to set the trend of migrating to MPP cloud Data Warehouses.
However, Snowflake has a novel approach to the cloud Data Warehouse, and it has the following advantages over Redshift:
- Redshift does not separate Storage and Compute. If you need more storage, you have to add nodes, which means you are essentially paying for more compute power. With Snowflake, compute and storage are completely separate, and the storage cost is the same as storing the data on S3. AWS attempted to address this issue with Redshift Spectrum, which allows querying data that lives directly on S3, but it is not as seamless as Snowflake.
- Since Redshift does not easily scale up and down (more on that in the Maintenance section), you need to maintain a cluster sized for your peak load. With Snowflake, you can use a Data Warehouse size that fits your needs at any given moment, and scale up and down as the load changes. Snowflake also supports automatic pausing, so you are not charged while no one is using the warehouse.
- Snowpipe - with Snowpipe, there is no need to keep a Data Warehouse up and running just for the ETL part, and ETL does not compete for the resources that queries need. That can significantly reduce the total cost, since the warehouse size can be smaller. More on Snowpipe: Using Snowflake Snowpipe to Optimize Data Throughput.
To sum up: since we switched from Redshift to Snowflake at Alooma, we have tripled the amount of data we store, for about half the cost of Redshift.
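To make the on-demand sizing and auto-pause behavior concrete, here is a rough sketch in Snowflake SQL (the warehouse name, size, and timeout are illustrative assumptions, not our actual settings):

```sql
-- Hypothetical warehouse; size and timeout values are illustrative.
CREATE WAREHOUSE IF NOT EXISTS analytics_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 300    -- suspend after 5 idle minutes: no compute charges
  AUTO_RESUME    = TRUE;  -- wake up automatically on the next query

-- Resize for a heavy job and back down; this takes effect in seconds.
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'XSMALL';
```

Since billing is per-second while the warehouse runs, sizing for the current job instead of the peak is what drives the cost savings.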
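For the Snowpipe point, a minimal continuous-load setup might look like this (the database, table, stage, and pipe names are all hypothetical):

```sql
-- Hypothetical pipe: auto-ingest new files from an external S3 stage,
-- with no user-managed warehouse kept running for the load.
CREATE PIPE my_db.public.events_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO my_db.public.events
  FROM @my_db.public.s3_events_stage
  FILE_FORMAT = (TYPE = 'JSON');
```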
- Snowflake supports Semi-Structured data types - Variant, Object, and Array, so you can load data without worrying about the schema. For some use cases, this is very useful. Redshift does not have such support.
- Unlimited (for any practical purpose) Strings - the Redshift Varchar data type is limited to 65,535 characters. In addition, you must choose the column length in advance, and it is bad practice to use the max size. With Snowflake, Strings are limited to 16MB, and there is no performance overhead for using the max size. In fact, the default is the max String size, which eliminates the need to know it in advance.
- Snowflake supports sharing data between different accounts (e.g. your customers). Data providers can share data with their data consumers without making any actual copy of the data itself. Think of it as Dropbox for database tables. In my opinion, this is a game changer for how people work with 3rd party data, but this post is too long already :). Needless to say, this is not supported by Redshift.
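The semi-structured types mentioned above let you load first and decide on a schema later. A sketch, with made-up table and field names:

```sql
-- Load JSON as-is into a VARIANT column, no schema declared up front.
CREATE TABLE raw_events (payload VARIANT);

-- Extract nested fields at query time, casting on the way out.
SELECT
  payload:user_id::STRING    AS user_id,
  payload:event.type::STRING AS event_type
FROM raw_events;
```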
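Sharing works by granting objects to a named share rather than copying them. Roughly, on the provider side (all object names and the consumer account identifier are hypothetical):

```sql
-- Provider side: expose one table to a consumer account, zero-copy.
CREATE SHARE customer_share;
GRANT USAGE  ON DATABASE analytics                      TO SHARE customer_share;
GRANT USAGE  ON SCHEMA   analytics.public               TO SHARE customer_share;
GRANT SELECT ON TABLE    analytics.public.daily_metrics TO SHARE customer_share;
ALTER SHARE customer_share ADD ACCOUNTS = consumer_acct;
```

The consumer then creates a read-only database from the share and queries the provider's data in place.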
Maintenance and DevOps
- With Redshift, you must VACUUM / ANALYZE tables regularly; with very big tables, this can be a huge headache. Snowflake manages all of this out of the box.
- Scale up / down - Redshift does not easily scale up and down; its Resize operation is extremely expensive and triggers hours of downtime. With Snowflake, switching compute capacity is a matter of seconds, and since compute and storage are separate, no data needs to be copied to scale up or down.
- With Redshift, all users share the same cluster and compete over the same resources. WLM queues are the “Redshift way” to deal with this, but they are a cumbersome set of rules to manage and understand. With Snowflake, you can simply start warehouses of different sizes that look at the same data, without copying anything, and allocate them to different tasks / users.
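The Redshift housekeeping in question is the periodic VACUUM / ANALYZE cycle (the table name here is illustrative); Snowflake simply has no equivalent chore:

```sql
-- Redshift: re-sort rows and reclaim deleted space, then refresh
-- planner statistics. This must be scheduled and monitored by you.
VACUUM events;
ANALYZE events;
```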
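And the multiple-warehouses pattern that replaces WLM queues can be sketched like this (warehouse names, sizes, timeouts, and the role are assumptions for illustration):

```sql
-- Separate compute for separate workloads, all reading the same tables.
CREATE WAREHOUSE etl_wh WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 60;
CREATE WAREHOUSE bi_wh  WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 300;

-- Analysts get their own compute and never contend with ETL.
GRANT USAGE ON WAREHOUSE bi_wh TO ROLE analyst_role;
```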
For further reading, my colleague at Alooma, Samantha, wrote a blog post comparing Redshift, Snowflake, and other cloud Data Warehouse solutions - How to Choose a Cloud Data Warehouse Solution.
Published at Quora. See Original Question here