So you’ve got a data warehouse (the new saying goes), and you want to better understand how it works, especially how your data gets into the warehouse — or the process of data replication.
What is database replication?
In simple terms, data replication takes data from your source databases — Oracle, MySQL, Microsoft SQL Server, PostgreSQL, MongoDB, etc. — and copies it into your cloud data warehouse. This can be a one-time operation or an ongoing process as your data is updated. Because your data warehouse is the important mechanism through which you’re able to access and analyze your data, proper data replication is necessary to avoid losing, duplicating, or otherwise mucking up valuable information.
For example, if you don’t do it right, you could import old data or the same data over and over, or you could miss updates to data that include changes or deletions. Furthermore, you’re likely to incur charges for each of those replication cycles by your data warehouse provider no matter the results. Running data analysis on outdated or incorrect data essentially defeats the purpose of a data warehouse, and can end up a waste of both time and money.
Fortunately, there are data replication methods built to integrate with today’s cloud-based data warehouses and suit a variety of use cases. Let’s discuss each of the three common methods of data replication and outline the option that may be best for you.
Understanding the three replication methods
Whether you’re interested in simplicity, speed, thoroughness, or all of the above, selecting the right data replication method has a lot to do with your particular source database(s) and how you store and collect data.
Full dump and load: Starting with the easiest method first, full dump and load replication begins with you defining a replication interval (could be two, four, six hours — whatever suits your needs). Then, at each interval, the tables you’re replicating are queried and a snapshot is taken. The new snapshot (dump) replaces (loads) the previous snapshot in your data warehouse. This method is best for small tables (typically less than 100 million rows), static data, or one-time imports. Because it takes time to perform the dump, it’s a typically slower method than the others.
Incremental: With the incremental method, you define an update indicator for each of your tables — typically a column that tracks the last updated time (typically something like "updated_at"). Every time a row in your database gets inserted or updated, the update indicator is updated. Your data tables are queried regularly to capture what has changed. The changes get copied to your data warehouse and are merged. Despite some upfront work setting up the indicator column, this method gives you lower latency and less load on your database. The incremental method works well for databases where new information gets added or existing data is updated.
Log replication, or change data capture (CDC): The fastest method — more or less the gold standard in data replication — is log replication, or CDC. It involves querying your database’s internal change log every few seconds, copying the changes into the data warehouse, and incorporating them frequently. All changes to the tables and objects you specify are loaded in by default, including deletes, so nothing goes missing. CDC is not only a faster, more reliable method, it also has a much lower impact on database performance during querying and helps you avoid loading duplicate events. However, it does require more initial setup work and possibly some cycles from a database admin. CDC is the best method for databases that are being updated continually and fully supports deletes.
Determining what’s right for you
If you have small tables, and limited access to database admin cycles, dump/load is probably a good choice. However, if you have more access to database administration and if you have huge amounts of data, or if it's updated frequently, you'll want to use incremental or log replication.
Each of these methods has its advantages and knowing which to use is key. Keep in mind that the simplest replication method may not necessarily be the best option for you, especially if you have large, complex, or constantly changing databases.
Alooma, the enterprise data platform built for the cloud, supports all three data replication methods, suiting different levels of administrative capacity and expertise, along with database configuration and complexity. While the CDC/log replication method is preferred — with greater flexibility, adaptability, power, and cost-effectiveness than other methods — your situation may dictate a different approach. If dump/load or incremental is right for your environment, Alooma has you covered.