Having a centralized, properly-configured data warehouse is essential to any data-driven organization. In order to execute queries and run complex analysis, you need a central place for all of your data to live so that teams across the organization can access it.
Cloud-based warehouses and repositories outshine their on-premise counterparts in terms of speed, reliability, security, and ease of use. They allow users to modernize their processes as quickly as new technology is developed, and make it simple for the entire organization to access data in real time.
Choosing the right cloud data warehouse requires a lot of careful consideration. After all, moving from one warehouse to another can be a time-consuming and disruptive process. The better informed your choice, the easier you’ll make things for yourself down the road.
And even if you don't need the full functionality of a data warehouse, the benefits of the cloud extend to data storage as well.
Here are the factors to consider when choosing your solution, and how they apply to different use cases.
|Description||AWS Redshift is a fast, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using SQL and your existing business intelligence tools.||Google BigQuery is a fully-managed, powerful Big Data analytics platform that enables super-fast SQL queries against append-only tables using the processing power of Google's infrastructure.||Snowflake is a fully-managed data warehouse built for the cloud, for structured and semi-structured data.||Microsoft Azure is a cloud computing service created by Microsoft for building, testing, deploying, and managing applications and services through a global network of Microsoft-managed data centers.||Amazon S3 is a cloud computing web service offered by Amazon's Amazon Web Services. Amazon S3 provides object storage through web services interfaces.|
|Loading Methods||Runs several batch copies in parallel.||Streaming API or batch loading.||Microbatch copy with parallel processing of files.||PolyBase (recommended) and BCP or SQLBulkCopy API||PUT objects up to 5GB, multipart upload API up to 5TB|
|Speed||Fastest for real-time queries using columnar storage.||Uses as many resources as needed to run a query in seconds.||Powerful scaling capability enables fast real-time query and ingestion speeds.||Scales compute to increase performance.||Continue to add prefixes to improve performance.|
|Scalability||Horizontally scalable by number of nodes.||Auto-cluster resizing without fixed storage size limits.||Option for auto-scale of data, workload, concurrent users, and applications or a point-and-click interface for scaling warehouse resources and performance with unlimited database size.||Scale compute and storage separately.||Auto-scale to meet request rate.|
|Security||Encrypt data using SSL connections and protect access using virtual networking environment.||Identity and Access Management (IAM) to manage access to resources.||All data is encrypted over the internet and on-disk. Two-factor and federation authentication with single sign-on is supported.||Protects data in transit and at rest, including encryption for data, files, applications, services, communications, and drives. Supports and uses numerous encryption mechanisms, including SSL/TLS, IPsec, and AES.||The full Amazon security suite in general, including these S3 specific measures: Default Encryption, Permission Checks, Cross-Region Replication ACL Overwrite, Cross-Region Replication with KMS, Detailed Inventory Report|
|Pricing Model||Pay per-hour, per-node||Pay per-query||Pay per-storage, per-warehouse usage||Monthly credits for storage, fee for compute||Pay for storage and requests|
|Best for this use case||•Frequent queries
•Complex data aggregations
•Inside AWS ecosystem
•Inside Google Cloud ecosystem
•Fully managed solution
•Handling unstructured data
•Control to manage resources
•Data sharing with other Snowflake accounts
•App & Media Hosting and Delivery
There are two main ways to think about speed: Access speed, and processing speed. Which warehouse will help you get the fastest query times? How quickly can you get your data into and out of the solution? And related to that, how will you maintain your warehouse to keep that speed at optimal performance? Different solutions offer different ways to get the speed you need. Amazon Redshift stores data in columnar format and leverages parallel processing. Google BigQuery uses as many resources as needed to return a result in seconds. Snowflake doesn’t share compute clusters with other virtual warehouses, so a query on one virtual warehouse has no impact on the other virtual warehouses. Azure gen 2 offers performance on par with that of the other solutions.
For a cloud storage solution, S3 scales via parallel requests so you can continue to add prefixes to attain the performance you need.
In order to keep your warehouse running quickly, we recommend an ETL solution that will make sure your data is cleaned, de-duped, and streamed in properly. This cuts down on how much time you’ll have to spend managing the infrastructure, and ultimately frees up your time to work with the data, instead of just maintaining it.
If your business is growing quickly, your warehouse solution needs to scale with you. Scalability can be measured in three ways: cost, resources, and simplicity.
BigQuery is incredibly elastic — it offers the fast and seamless resizing of a cluster up to a petabyte in scale, and it doesn’t need to be constantly tracked and analyzed to ensure that clusters are meeting dataset requirements. It works seamlessly in the background to smooth out variabilities associated with analytics workloads.
Redshift requires more maintenance than BigQuery, but it can also be more cost effective. Redshift is horizontally scalable, meaning that it’s easy to add more nodes and, if properly optimized, can be relatively inexpensive. You can scale from single 160GB nodes to 16TB nodes, giving you a petabyte-scale data warehouse and high performance for a competitive price.
Snowflake is also competitively priced and only charges customers for actual storage data used after compression, but it can be difficult to foresee your data needs and therefore hard to understand what it will cost over time. However, Snowflake offers an auto-scale option, which starts and stops clusters as needed to dynamically manage the workload on a warehouse.
Azure's approach is to decouple compute and storage, allowing for improved scalability/elasticity and cost effectiveness. This decoupling means that they can be scaled independently and you can pause and resume the cluster. The blob storage persists even when compute is scaled back.
Amazon's S3 is built to scale to meet extremely high request rates. Current benchmarks are at over 3,500 PUT/POST/DELETE and 5,500 GET requests per second. And that's per prefix, per bucket. The number of prefixes per bucket is unlimited, allowing for exponential scaling for read/write operations.
With growth comes more data, data sources, and data users. As more people in your organization need to be able to access and use your data, it becomes increasingly important to ensure you can integrate different tools. If you can anticipate how much you’ll need to change data types or run multiple queries simultaneously, you’ll save yourself trouble down the road. An ETL tool can make it much easier to integrate any or all of your data sources, then easily analyze the combined data with a BI tool.
A cloud-based data warehouse is, on the whole, more reliable than an on-premise solution. A solution from Google, Amazon, or Microsoft is maintained by a network of the best data warehouse experts in the world, whereas an on-premise solution only has you and your team. But no solution is completely perfect. Consider how these cloud data warehouse service providers dealt with any recent incidents in terms of clarity of communication and time to resolution. You should also look into what kind of support they offer in case something goes wrong. If it’s hard for you to benchmark what a good response or support level is, try and collect feedback from others about their experience. Remember, by using a data pipeline service like Alooma, you ensure that you’ll always have extra support, expertise, and insight. Alooma can even help make it easier to switch data warehouses if you need to.
For Amazon Redshift, you only pay for what you use. You are charged per-hour per node including both compute and storage. You are able to have unlimited users running unlimited analytics. Google BigQuery charges by the amount of data scanned by queries they run. It amounts to $5 per TB of data scanned and $0.02 per GB stored. All other operations from loading data, export, copy or metadata are free. Snowflake has two types of pricing models. Snowflake On Demand is pay per use of storage and compute with no long-term commitments. Snowflake capacity offers price discounts on pre-purchased storage per month, plus compute capacity that depends on your needs. Azure charges monthly credits for storage plus a fee for any hour in which compute resources were used, regardless of the amount of time. And for S3, you're charged for storage and number of requests.
The use cases
Ultimately, choosing the right data warehouse has a lot to do with how you’ll use it. Consider your organization’s needs and what use cases you’ll find yourself falling into when it comes to your data.
|Use Case||What is it?||The benefits of a cloud data warehouse|
|Ad-Hoc Analysis||•Flexible use of data where users go from viewing data in the aggregate to diving into row-by-row detail.
•Either done with a native SQL or an integrated BI tool like like Tableau, Zoomdata, or Looker.
|•Users can join, aggregate, and scan data in whatever kind of table they want.
•Built-in statistical functions makes it easy to build queries.
•Disk usage is optimized by using a columnstore table format.
|Machine Learning and Data Science||•Uses sophisticated algorithms to identify trends, discover hidden data relationships, and predict future events.
•Usually requires large volumes of data.
•Requires a great deal of processing power, so experiments are often done offline.
|•Can use a large variety of data formats.
•Rapid sandbox configuration allows for quick experimentation, even as load requirements change.
•Interoperational data preparation and statistical tooling.
|Real-Time and Operational Analytics||•Continuously querying data to monitor Key Performance Indicators (KPIs) to make sure teams are mapping towards their goals.||•Can query data in real-time, even as events happen.
•Higher availability and fewer outages.
•Easy to enrich and de-dupe data.
•Processes repeat queries quickly.
|Mixed Workload Analytics||•A combination of the use cases above needs to be supported across the organization.||•A single source of data for any query or use case your organization may need.
•Supports a broad range of queries without requiring additional hardware or complicated data configurations.
•Secures sensitive data.
•Supports broad data ingestion, allowing streaming and batch load data inputs.
|Application backend datastore||Store and retrieve data (including assets, files, etc.).||Storage and compute can be scaled up or down as usage fluxuates absorbing spikes in usage while also keeping costs down.
Data integrity, security, availability provided by Amazon.
Experts on your side
No matter what cloud data warehouse solution you choose, Alooma will help you load and maintain your data and keep it flowing. We’re a modern ETL solution designed from the ground up to work perfectly with modern cloud data warehouses. Alooma helps companies of every size make their cloud warehouses work for any use case, and our experts are knowledgeable about every data warehouse nuance.
Got questions about which data warehouse to choose? Contact us to get personalized advice on how to get the right data warehouse up and running.