Optimize Data Throughput with the New Snowflake Snowpipe

Garrett Alley  •  5 min read  • 28 Nov 2017

As we move more and more toward real-time, data-driven decisions, we require increasingly complex and flexible analysis, and the freshest data must be available 24x7. Improving performance and optimizing your data warehouse are key to meeting these demands.

Traditionally, when you want to move data into a data warehouse, you use some sort of batch loading: dedicated maintenance windows during which data is imported in large chunks, or batches. But that is changing quickly as new streaming data sources appear and continuous data ingestion becomes the norm, driven by, among other things, the proliferation of devices and sensors, the Internet of Things, and mobile.

Snowflake Computing, a data warehouse leader, just announced an answer to this at AWS re:Invent: Snowpipe, a service that loads sets of files into Snowflake through lightweight REST calls.

Designed for enterprise streaming data scenarios

Waiting for streaming data to accumulate during the longer traditional batch windows means the data could become stale. For companies relying on streaming data, stale data is often worthless data. To counter this, many Snowflake customers prefer to have data accumulate over small time windows (think minutes, not hours or days) and then regularly load it into Snowflake in micro-batches via the COPY command.
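To make that concrete, here is a minimal sketch of what a scheduled micro-batch load might look like using the snowflake-connector-python package. The account, warehouse, stage, and table names are placeholders, and the COPY options shown are just one reasonable configuration, not a prescribed setup.

```python
# A minimal sketch of a scheduled micro-batch load, assuming the
# snowflake-connector-python package and a pre-created external stage
# (my_s3_stage) and target table (events). All identifiers are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # placeholder account identifier
    user="loader",             # placeholder user
    password="***",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

try:
    cur = conn.cursor()
    # Load whatever new files have accumulated in the stage since the last
    # run; COPY's load history skips files it has already loaded.
    cur.execute("""
        COPY INTO events
        FROM @my_s3_stage/events/
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
        ON_ERROR = 'CONTINUE'
    """)
    print(cur.fetchall())      # per-file load results
finally:
    conn.close()
```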

That said, COPY is a heavyweight command, designed more for the typical larger batch data loading approach. Using COPY to aggressively load several large files per minute, as with micro-batching in data-heavy environments, can lead to concurrency issues that leave COPY commands unprocessed. This in turn causes throughput and reliability problems. Increasing the latency to reduce the number of COPY commands only results in stale data.

That’s where Snowpipe comes in.

Snowpipe: Improved throughput and reliability via an optimized serverless loading experience

Snowpipe still uses the COPY command to load the data; however, now COPY is accessible via a lightweight REST API and can be called asynchronously. This leads to improved throughput and reliability because Snowpipe loads data from an internal multi-tenant instance of Snowflake, where the resource allocation is dynamically managed by Snowflake.
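For illustration, here is a rough sketch of what a call to Snowpipe's insertFiles REST endpoint could look like in Python. The account identifier, pipe name, staged file path, and the key-pair JWT are all placeholders, and the endpoint shape is based on Snowflake's Snowpipe REST documentation; treat the details as assumptions to verify against the current docs.

```python
# A rough sketch of the Snowpipe REST call that asks Snowflake to load
# newly staged files. Account, pipe name, file path, and the JWT used for
# key-pair auth are placeholders you would supply.
import uuid
import requests

ACCOUNT = "my_account"
PIPE = "ANALYTICS.PUBLIC.EVENTS_PIPE"    # fully qualified pipe name
JWT_TOKEN = "<key-pair-auth JWT>"        # generated from your private key

url = (
    f"https://{ACCOUNT}.snowflakecomputing.com"
    f"/v1/data/pipes/{PIPE}/insertFiles?requestId={uuid.uuid4()}"
)

resp = requests.post(
    url,
    json={"files": [{"path": "events/2017/11/28/batch-0001.csv"}]},
    headers={
        "Authorization": f"Bearer {JWT_TOKEN}",
        "Content-Type": "application/json",
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())   # Snowflake queues the files; the actual load runs asynchronously
```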

So you have your micro-batches in place via Snowpipe and things are good. Now it’s time to take the solution to the next level: preparing your data and landing it in S3 via Alooma, so that it’s ready for Snowpipe to pick up.

Add Alooma and get error handling and schema management

Snowpipe’s lightweight REST API and S3 together provide improved throughput and reliability. But we can take that a step further and use Alooma to enhance the solution.

When you use Alooma to load your data into S3 for Snowpipe, you take advantage of Alooma’s Mapper and Restream Queue, extending the benefits of the solution to include automatic schema management and error handling. As illustrated in the sketch after this list, Alooma:

  • Prepares the data in S3, using the proper CSV format for Snowpipe
  • Pre-validates each row, so a single bad row doesn’t fail an entire COPY operation
  • Picks up any rows from failed COPY commands and provides you with a means to correct them
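Alooma’s internals are not shown here, but the following hypothetical Python sketch illustrates the kind of per-row pre-validation and CSV staging described above, with boto3 landing the prepared file in the S3 location that Snowpipe watches. The schema, bucket, and key names are invented for the example.

```python
# Hypothetical illustration of the pre-validation step described above:
# rows that match the expected schema are written to a CSV staged in S3
# for Snowpipe; bad rows are set aside (the role Alooma's Restream Queue
# plays) instead of failing the whole load. All names are placeholders.
import csv
import io
import boto3

EXPECTED_FIELDS = ["event_id", "user_id", "event_type", "ts"]

def prepare_batch(rows):
    """Split incoming rows into CSV-ready good rows and rejected rows."""
    good, rejected = [], []
    for row in rows:
        if all(row.get(f) not in (None, "") for f in EXPECTED_FIELDS):
            good.append([row[f] for f in EXPECTED_FIELDS])
        else:
            rejected.append(row)          # held back for later correction
    return good, rejected

def stage_to_s3(good_rows, bucket="my-snowpipe-bucket", key="events/batch-0001.csv"):
    """Write validated rows as CSV and upload to the prefix Snowpipe reads."""
    buf = io.StringIO()
    csv.writer(buf).writerows(good_rows)
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buf.getvalue().encode("utf-8"))

rows = [
    {"event_id": "1", "user_id": "42", "event_type": "click", "ts": "2017-11-28T12:00:00Z"},
    {"event_id": "2", "user_id": None, "event_type": "view", "ts": "2017-11-28T12:00:01Z"},
]
good, rejected = prepare_batch(rows)
stage_to_s3(good)
print(f"staged {len(good)} rows, held back {len(rejected)} for correction")
```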

Getting your data into the warehouse quickly and reliably is key to your business. Even better is doing it error-free and exactly once. The perfect solution is Snowflake Snowpipe and Alooma together. The great news is that if you’re using Snowflake and Alooma, you’re getting the benefits of Snowpipe automatically!


Get your data flowing today!
Contact us for a demo or free trial.