Using Snowflake Snowpipe to Optimize Data Throughput

by Eitan Rosenfeld & Garrett Alley  
5 min read  • 29 Mar 2018

As companies move more and more toward real-time, data-driven decisions, they require increasingly complex and flexible analysis — the freshest data must be available 24x7. Improving performance and optimizing your data warehouse are key requirements in meeting these demands.

At Alooma, we’re no different. In fact, we use several data warehouses for our own internal business analytics, and we’re particularly big fans of Snowflake. Snowflake’s cloud data warehouse is scalable, fast, and easy to maintain. It’s also feature rich, with capabilities like running analytical queries on non-structured JSON columns.

In this post, we’ll talk about how Alooma now supports loading data into Snowflake using a new delivery method, Snowpipe. Snowpipe can reduce warehouse costs by 80% while also reducing latency, so that data is available in near real time.

We’ll also share a bit of our own internal story about how we came to use Snowflake and Snowpipe.

A word about Snowflake billing

Snowflake’s billing model is primarily based on "Warehouse usage". Snowflake lets you choose a warehouse size and adjusts costs proportionally to that size. Queries like COPY, INSERT, MERGE, and SELECT require an active Snowflake warehouse to run. When a warehouse is activated, the account is billed for a minimum of one minute. After that, warehouse usage is billed at per-second granularity.

Because of the per-minute minimum, a price-sensitive user is not incentivised to build a real-time data pipeline. If data is loaded to Snowflake as soon as it is ready, then warehouse usage may be in a state of perpetual activity as data trickles in over the course of a day. Optimizing this is hard. A user can de-activate their warehouse as soon as loading is completed in order to minimize costs, but the next load will wake up the data warehouse and the user will incur the cost overhead of activation.

This conflicts with one of the central tenets behind investing in a data pipeline and warehouse: optimizing your business on the best — and ideally latest — information. At Alooma, we’re acutely aware of this tradeoff. That’s why we offer the ability to load data based on a schedule. Scheduled loading ensures warehouse activity is limited to the minimal number of time buckets.

In fact, this is exactly what we ended up doing internally. We traded some data freshness for reduced cost by adopting scheduled loading. But we wished we could have both lower costs and near real-time data loading.

How Snowpipe changes the game

With Snowflake’s release of Snowpipe, we realized that we can do much better for our customers who want more than scheduled loading.

Instead of running a synchronous COPY query to load data (thereby activating the warehouse), we use Snowpipe. Snowpipe is Snowflake’s new delivery mechanism which uses authenticated REST calls in order to load data asynchronously. But here’s the kicker: Snowpipe uses Snowflake-supplied compute resources in order to load the files. The user is billed for the Snowpipe warehouse usage at per-second granularity without a per-minute activation minimum. The result is that there’s no more cost overhead to loading data in real time. In fact, now the user is incentivized to load as much data as quickly as possible. At Alooma, that’s music to our ears.

Snowpipe in Alooma

With the cost and latency reductions, one more critical piece remains: data integrity. The Snowpipe API provides functionality to query which files have been loaded via Snowpipe, so we continuously query those endpoints in order to confirm that files have been loaded. Once data is loaded, we move on to the next batch of data. And in case of a failure or timeout, we make sure to retry or enqueue that data for loading at a later point in what we call the Restream Queue.

For Snowflake, we offer the option to enable Snowpipe right there in our UI.

Summary (with our actual cost savings)

At Alooma, our Snowflake data pipeline started out as as a real-time implementation, and we enjoyed the benefits of real-time data when making business decisions. When looking at our usage patterns and high warehouse costs, we switched to scheduled loading. Our costs went down by 75%, with the unfortunate consequence of increased latency. With Snowpipe, we are now back to loading data in near real time, and yet our costs have gone down even further, by another 5%.

We're proud to be the first data pipeline to support Snowpipe, and to use it internally. If you're an Alooma customer using Snowflake, you can enable Snowpipe right now.

This might interest you as well