If you're pulling data — particularly large volumes of disparate data — together to make analytics and reports, you'll need to store it somewhere. As a core component of data-driven business intelligence, a Data Warehouse (DW) — combined with an ETL (extract, transform, load) platform — integrates data from different, typically transactional, sources (like Marketo, CRM, Salesforce, inventory systems and even web analytics, to name a few). A DW often stores both current and historical data, and is always aimed at getting overviews and insights only possible from a single, canonical data store.
When choosing a DW solution for the first time, the very first consideration is typically one between an on-prem DW or a cloud-based one. And while a lot of folks brand new to the domain go straight to the cloud these days, there are still many reasons why you might want to choose an on-prem solution. Are you comfortable with your data in the cloud, or would you prefer to keep it on premise? Are there compliance requirements that you may find easier to ensure with an on-prem solution? Are you running legacy systems that do not integrate well with cloud offerings? Are your data volumes high enough (read: lots of connected devices) to justify the scale of the cloud, or small enough (read: mostly local, transactional and/or operational) such that you can comfortably keep your data in house? Do you need the support that a cloud offering provides? Note that most on-prem solutions these days do offer some capacity to partition and scale, but also note that you (or your team) will be the ones doing it.
On-prem data warehouse solutions
It's worth mentioning that healthcare organizations, as well as banks and insurance companies, occasionally still prefer on-prem DWs because of the control they have over them. This means, of course, keeping (and funding) their own IT staff to maintain their instances of these solutions and develop new capabilities for them. In fact, many of these organizations have IT teams that iteratively implement new features and bug fixes using agile methods, just like a software organization. This approach works quite well (read: the investment makes the most sense) in cases where there are still legacy systems in production, and when mostly low-level customizations (code, connectors, and configuration changes) are necessary for integration.
Features, functionality, and use cases
On-prem solutions might as well also be called "do it yourself" due to the nature of their deployment. To get going with one, you'll install test instances on your own commodity hardware, run benchmarks to try them out, sell the winner(s) to management, buy the appropriate licenses, and eventually deploy them. And you'll need the staff and expertise on hand to do the work.
Your lifecycle for these solutions could be quite long, and that's something you should think about when considering them. The long lifecycle has to do with this: after you've added your own unique features and invested time and money on sharding, replication, scaling, and maintenance, moving to something else later on could be a hard sell. Still, you might have a good thing if customizations are important or you're not starting from scratch (e.g. there are legacy systems in your data pipeline that aren't going away any time soon).
On-prem DW solutions typically offer the following features:
- Support (including customizability) for some legacy systems;
- Ease of access to the nuts and bolts of the system (great for low-level troubleshooting and custom feature development);
- Ease of control over physical data centers.
On-prem DW solutions can, and in some cases still do, serve the following use cases:
- Hospitals, which still occasionally build and host their own datacenters;
- Banks, where there are often legacy systems with which to integrate;
- Manufacturing and logistics - which often have data systems as old as the companies themselves;
- Applications where data is proprietary and mostly consumed or accessed inside the organization;
- Organizations with major concerns over accessibility (millisecond response times).
These advantages of on-prem solutions over cloud-based solutions are fast disappearing though, as more cloud vendors address these issues.
List of solutions
Note that most on-prem providers also offer either pure cloud, hybrid (combination of on-prem and cloud), or even multi-cloud storage solutions as well.
Cloud-based data warehouse solutions
The big advantage to a cloud-based solution is that, as a managed solution, tasks like sharding, replication, and scaling are done for you — with many even happening automatically, in the background! You also have fixed costs (although they tend to be different for each solution, so you'll need to do your research). There is no additional outlay for hardware, nor variable costs when something fails or needs to be upgraded. If you're building your data infrastructure from scratch, then cloud-based may well be the way to go, due to the very low barrier to entry.
Features, functionality, and use cases
Similar to on-prem solutions, cloud-based solutions will still, more than likely, require you to implement connectors, database schemas and streaming or ingestion mechanisms (like lambda functions or Kafka pipelines). However, it's hands off with a lot of the routine maintenance and scaling activities, and this alone can save you significant time and cost.
Cloud-based DW solutions typically offer the following features:
- No upfront requirement for hardware outlay;
- Ability to massively autoscale;
- Connectors for most major ETLs, data stores, and databases;
- Technical support and maintenance, bundled in.
Cloud-based DW solutions generally best serve the following use cases:
- Any product or company building a data infrastructure from scratch, where there are no legacy systems to accommodate;
- Any product or company building a data infrastructure around fairly standard components.
Read about how to choose a cloud data warehouse solution.
List of solutions
- Google BigQuery
- Amazon Redshift
- Amazon S3 (not a data warehouse per se, but used frequently for volume data storage in flat files)
- Microsoft Azure
Comparing data warehouse solutions
Here's a high-level comparison of some of the major solutions:
|Solution||Deployment||Best known for||Best used for|
|Oracle||On-prem, cloud and hybrid.||One of the oldest known names in relational databases; one of the best-known remaining "original RDBMS" solutions used in so-called "Big Data". Maximizes performance via features like HCC (Hybrid Columnar Compression).||Analytical and transactional workloads.|
|Informatica||Cloud||Entire product portfolio based on data integration, including ETL. Announced "cloud business" in 2006.||Used in industries such as utilities, banking, insurance, healthcare, and the public sector.|
|IBM||On-prem, cloud, and as an integrated appliance.||Previously known as DB2 and DashDB.||IBM has been widely deployed across the industry.|
|Teradata Database||On-prem, cloud and hybrid.||A flexible database and analytics engine in a scalable, on-prem package (intelliflex). Intellicloud is a similar cloud offering.||Long used in a wide range of industries including financial, retail, manufacturing, healthcare and logistics.|
|Amazon Redshift||Cloud||Fast, petabyte-scale DW service on AWS.||Frequent queries on AWS ecosystem, with complex data aggregations.|
|Google BigQuery||Cloud||Fully-managed big-data analytics platform that leverages Google's CPU processing power to enable super-fast SQL queries against append-only tables.||Fluctuating workloads inside of Google Cloud ecosystem.|
|Snowflake||Cloud||Fully-managed DW built for the cloud.||Semi- and unstructured data; sharing data easily with other Snowflake accounts.|
|Microsoft Azure||Cloud||Cloud computing service created by Microsoft. Uses a global network of Microsoft-managed data centers.||Frequent large-volume data queries on Microsoft's ecosystem.|
|Amazon S3||Cloud||Amazon's cloud computing storage service.||Not a DW per se, but frequently used for volume data storage and data hosting in flat files.|
What's best for you?
On-prem DWs are now increasingly giving way to cloud-based ones, and this trend is expected to continue. Deciding between the two is typically a measure of things like scalability, customizability, ease of access for users, speed (both throughput and latency), ability to manage, security, and future-proofness.
Cloud-based solutions are taking over everywhere due to their ability to scale and the lack of need for an expensive hardware outlay, as well as the support that managed solutions naturally provide. Cloud-based solutions have closed the gap on issues such as millisecond latency and compliance requirements like HIPAA.
Still, there may be some specific cases — support for legacy systems, or a need to keep data in house, among others — where an on-prem solution remains the better option.
How Alooma can help
Alooma is a modern enterprise ETL platform that can load data into all of the popular data warehouses (including Microsoft Azure, Google BigQuery, Amazon Redshift, and Snowflake). Alooma sets up in minutes, integrates with hundreds of data sources, and supports real-time ingestion.