Having a centralized, properly-configured data warehouse is essential to any data-driven organization. In order to execute queries and run complex analysis, you need a central place for all of your data to live so that teams across the organization can access it.
Cloud-based warehouses outshine their on-premise counterparts in terms of speed, reliability, security, and ease of use. They allow users to modernize their processes as quickly as new technology is developed, and make it simple for the entire organization to access data in real-time.
Choosing the right cloud data warehouse requires a lot of careful consideration. After all, moving from one warehouse to another can be a time-consuming and disruptive process. The better informed your choice, the easier you’ll make things for yourself down the road.
Here are the factors to consider when choosing your solution, and how they apply to different use cases.
The Factors to Consider
|Description||AWS Redshift is a fast, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using SQL and your existing business intelligence tools.||Google BigQuery is a fully-managed, powerful Big Data analytics platform that enables super-fast SQL queries against append-only tables using the processing power of Google's infrastructure.||Snowflake is a fully-managed data warehouse built for the cloud, for structured and semi-structured data.||MemSQL is a distributed In-Memory Database that lets you process transactions and run analytics in real-time, using SQL.|
|Loading Methods||Runs several batch copies in parallel.||Streaming API or batch loading.||Microbatch copy with parallel processing of files.||Batch Copy|
|Speed||Fastest for real-time queries using columnar storage.||Uses as many resources as needed to run a query in seconds.||Powerful scaling capability enables fast real-time query and ingestion speeds.||Ultra-fast data ingestion, low latency queries with high user concurrency.|
|Scalability||Horizontally scalable by number of nodes.||Auto-cluster resizing without fixed storage size limits.||Option for auto-scale of data, workload, concurrent users, and applications or a point-and-click interface for scaling warehouse resources and performance with unlimited database size.||Massively parallel execution.|
|Integrations||• JDBC and ODBC connections to SQL client tools||•SQL Workbench
•JDBC and ODBC connections to SQL client tools
|Security||Encrypt data using SSL connections and protect access using virtual networking environment.||Identity and Access Management (IAM) to manage access to resources.||All data is encrypted over the internet and on-disk. Two-factor and federation authentication with single sign-on is supported.||Database logging writes all activities to a secure external location to support information security tasks.|
|Pricing Model||Pay per-hour, per-node||Pay per-query||Pay per-storage, per-warehouse usage||Pay per storage in memory|
|Best for this use case||•Frequent queries
•Complex data aggregations
•Inside AWS ecosystem
•Inside Google Cloud ecosystem
•Fully managed solution
•Handling unstructured data
•Control to manage resources
•Data sharing with other Snowflake accounts
|•Fast analytic workloads
•Flexibility to run on-premise, cloud, or hybrid cloud
There are two main ways to think about speed: Which warehouse will help you get the fastest query times, and how will you maintain your warehouse to keep that speed at optimal performance? Different solutions offer different ways to get the speed you need. Amazon Redshift stores data in columnar format and leverages parallel processing. Google BigQuery uses as many resources as needed to return a result in seconds. Snowflake doesn’t share compute clusters with other virtual warehouses, so a query on one virtual warehouse has no impact on the other virtual warehouses. And MemSQL has set a benchmark of processing 8 million upserts per second.
In order to keep your warehouse running quickly, we recommend an ETL solution that will make sure your data is cleaned, de-duped, and streamed in properly. This cuts down on how much time you’ll have to spend managing the infrastructure, and ultimately frees up your time to work with the data, instead of just maintaining it.
If your business is growing quickly, your warehouse solution needs to scale with you. Scalability can be measured in three ways: cost, resources, and simplicity.
BigQuery is incredibly elastic — it offers the fast and seamless resizing of a cluster up to a petabyte in scale, and it doesn’t need to be constantly tracked and analyzed to ensure that clusters are meeting dataset requirements. It works seamlessly in the background to smooth out variabilities associated with analytics workloads.
Redshift requires more maintenance than BigQuery, but it can also be more cost effective. Redshift is horizontally scalable, meaning that it’s easy to add more nodes and, if properly optimized, can be relatively inexpensive. You can scale from single 160GB nodes to 16TB nodes, giving you a petabyte-scale data warehouse and high performance for a competitive price.
Snowflake is also competitively priced and only charges customers for actual storage data used after compression, but it can be difficult to foresee your data needs and therefore hard to understand what it will cost over time. However, Snowflake offers an auto-scale option, which starts and stops clusters as needed to dynamically manage the workload on a warehouse.
With growth comes more data, data sources, and data users. As more people in your organization need to be able to access and use your data, it becomes increasingly important to ensure you can integrate different tools. If you can anticipate how much you’ll need to change data types or run multiple queries simultaneously, you’ll save yourself trouble down the road. An ETL tool can make it much easier to integrate any or all of your data sources, then easily analyze the combined data with a BI tool.
A cloud-based data warehouse is, on the whole, more reliable than an on-premise solution. A solution from Google, Amazon, or Microsoft is maintained by a network of the best data warehouse experts in the world, whereas an on-premise solution only has you and your team. But no solution is completely perfect. Consider how these cloud data warehouse service providers dealt with any recent incidents in terms of clarity of communication and time to resolution. You should also look into what kind of support they offer in case something goes wrong. If it’s hard for you to benchmark what a good response or support level is, try and collect feedback from others about their experience. Remember, by using a data pipeline service like Alooma, you ensure that you’ll always have extra support, expertise, and insight. Alooma can even help make it easier to switch data warehouses if you need to.
Naturally a key consideration is price. Over time there have been some new innovations in pricing models that make it cost effective and scalable for usage of any size and style.
For Amazon Redshift, you only pay for what you use. You are charged per-hour per node including both compute and storage. You are able to have unlimited users running unlimited analytics. Google BigQuery charges by the amount of data scanned by queries they run. It amounts to $5 per TB of data scanned and $0.02 per GB stored. All other operations from loading data, export, copy or metadata are free. Snowflake has two types of pricing models. Snowflake On Demand is pay per use of storage and compute with no long-term commitments. Snowflake capacity offers price discounts on pre-purchased storage per month, plus compute capacity that depends on your needs. MemSQL charges by GB held in memory, so you can store as much data as you want on disk for free.
The Use Cases
Ultimately, choosing the right data warehouse has a lot to do with how you’ll use it. Consider your organization’s needs and what use cases you’ll find yourself falling into when it comes to your data.
|Use Case||What is it?||The benefits of a cloud data warehouse|
|Ad-Hoc Analysis||• Flexible use of data where users go from viewing data in the aggregate to diving into row-by-row detail.
• Either done with a native SQL or an integrated BI tool like like Tableau, Zoomdata, or Looker.
|• Users can join, aggregate, and scan data in whatever kind of table they want.
• Built-in statistical functions makes it easy to build queries.
• Disk usage is optimized by using a columnstore table format.
|Machine Learning and Data Science||• Uses sophisticated algorithms to identify trends, discover hidden data relationships, and predict future events.
• Usually requires large volumes of data.
• Requires a great deal of processing power, so experiments are often done offline.
|• Can use a large variety of data formats.
• Rapid sandbox configuration allows for quick experimentation, even as load requirements change.
• Interoperational data preparation and statistical tooling.
|Real-Time and Operational Analytics||• Continuously querying data to monitor Key Performance Indicators (KPIs) to make sure teams are mapping towards their goals.||• Can query data in real-time, even as events happen.
• Higher availability and fewer outages.
• Easy to enrich and de-dupe data.
• Processes repeat queries quickly.
|Mixed Workload Analytics||• A combination of the use cases above needs to be supported across the organization.||• A single source of data for any query or use case your organization may need.
• Supports a broad range of queries without requiring additional hardware or complicated data configurations.
• Secures sensitive data.
• Supports broad data ingestion, allowing streaming and batch load data inputs.
Experts on Your Side
No matter what cloud data warehouse you choose, Alooma will help you load and maintain your data and keep it flowing. We’re a modern ETL solution designed from the ground up to work perfectly with modern cloud data warehouses. Alooma helps companies of every size make their cloud warehouses work for any use case, and our experts are knowledgeable about every data warehouse nuance.
Got questions about which data warehouse to choose? Contact us to get personalized advice on how to get the right data warehouse up and running.