Documentation

Database Replication Methods

Abstract

SummaryAn overview of the three different types of database replication available in Alooma, and the tables Alooma creates in your data destination.

You’ve got data in your transactional database - Elasticsearch, IBM Db2, Microsoft SQL Server, MongoDB, MySQL, Oracle, or PostgreSQL. Awesome, we can handle that!

When you create a new transactional database input, Alooma gives you up to 3 ways to replicate, each optimized for different data use cases:

  • Full dump and load

  • Incremental

  • Change data capture (CDC aka log replication)

What's the difference? Let’s dive in!

Full Dump and Load Replication

With full dump and load, you define a replication interval (such as every 4 hours). On that interval, we query each of the tables you’re replicating, take a full snapshot (dump) of the table, and replace the previous snapshot (load) in your data destination.

Advantage: Setup is simple, and this method works best for small tables.

Disadvantage: Tables only update every few hours due to the time it takes to perform the dump. This can also be resource heavy, which may slow down your database, especially for larger tables.

Incremental Replication

With incremental, you define an “update indicator column” (such as updated_at) for each of the tables you're replicating. Every time a row in your database is updated, it gets the next largest value in the updated indicator column.

Every few seconds, we’ll query your tables for what’s changed since the last time we checked, leveraging the update indicator column. We’ll copy those incremental changes to your data destination, and merge them in every 15 minutes. We'll also provide you with a view to see the changes in near real time.

Advantage: Setup is simple, doesn’t create a lot of load on your database, and is low latency.

Disadvantage: Requires an update indicator column for each table. Requires a primary key. Requires additional work to capture deletes.

Note

We recommend indexing the replication index column.

If you need to add a primary key (PK) to a table in a database input (MySQL, Oracle, PostgreSQL, SQL Server, etc.) to enable replication:

  1. Add the PK to the table via the Alooma API.

  2. Add the table to the list of tables being replicated by editing the input in the UI or via the API.

Change Data Capture (CDC) aka Log Replication/Change Tracking/etc.

The state of the art in database replication, CDC (or its equivalent) is the fastest and most reliable way to replicate. For the databases that support this type of functionality (Microsoft SQL Server, MongoDB, MySQL, Oracle, and PostgreSQL), we query the database's internal change log/table every few seconds to get the latest changes. We’ll copy that log of changes to your data destination, and merge them in, every 15 minutes. We'll also provide you with a view to see the changes in near real time.

When you use CDC replication, Alooma will replicate all tables/objects from all databases/collections by default. If you want to load a subset of the items in your database you can change that behavior when creating the input.

Advantage: Enables near real-time replication with very low query impact.

Disadvantage: Requires more initial setup work.

Note

For PostgreSQL WAL, Alooma can replicate partition tables but they will come in as the name of the child not the parent table.

Q: What happens to data already replicated when a user removes an item from the list of tables/objects/collections to replicate?

A: Nothing. The data remains in the data destination.

Q: What happens when user adds an item to the list of tables/objects/collections to replicate?

A: Alooma will start replicating it, and if "Perform initial snapshot" is checked, we will also dump its historical data.

Q: What happens if users toggles "Perform initial snapshot" after replication started?

A: Tables/objects/collections already being replicated are not impacted. Only those added to the list going forward will be impacted by this change.

Replication Method Comparison

Function

CDC Log Rep

Incremental

Dump

Initial Load

Optional

By default

Every interval

Row Sync

Changes (new, updated, deleted)

Only new and updated rows

All rows from the source

Deleted Records

Supported

Not Supported

Supported

Time

Less impacted by size of table

Generally faster

Can be slower, depending on size of table

Primary Key?

Optional

Must have

Optional

Identify Changes

Must define and update an indicator column

Supported Replication by Database

Database

CDC

Log Rep

Incremental

Dump

Elasticsearch

-

Via API Only

Supported

IBM Db2

-

Supported

-

Microsoft SQL Server

Supported

Supported

Supported

Mongo

Supported

-

-

MySQL

Supported

Supported

Supported

Oracle

Supported

Supported

Supported

PostGreSQL

Supported

Supported

Supported

Replication and Event Metadata

The type of replication you use will dictate some of the fields included in the event metadata. See this article for more information.

Tables in your Data Destination

To support replication, Alooma creates 2 tables and one view in your data destination:

Final (aka “Consolidated”) Table

This is the table which represents the source table from your input. Generally given the same name as your source table, the final table is updated every 15 minutes for incremental and CDC replication, or on the replication interval you defined for full dump and load replication.

When should I query the final table? When query speed is super important to you, and a 15 minute data delay is acceptable.

Log Table

This is where we write new data (be it the latest dump data, or the freshest incremental or log data) before it’s merged into the final table. For each final table, there’s a matching _log table. Every time we merge, we clean out some old data, to keep the log table small and the merges fast.

When should I query the log table? Never! Leave the querying to us :)

Near Realtime View (Incremental and CDC only)

Sometimes the 15 minute wait for the final table to be updated is just not fast enough. For that we created the Near Realtime (NRT) view, which for each table appears as a matching _nrt view. It gives you the same data as the final table, but on demand! The view can take up to a couple minutes to generate, which lowers the data latency from 15 minutes to just a couple of minutes.

When should I query the NRT view? When the freshness of the data is the most important, and it’s ok that the query can take a couple more minutes to run.

What is this “merge” you keep talking about?

As explained above, Alooma utilizes a log table as a temporary holding place for new data from your source tables. Then, at a regular interval of 15 (configurable) minutes for incremental and CDC, or the replication interval of your choice for full dump and load, Alooma merges that data into your final table using what we call “consolidation”.

This process is due to the limitations of data warehouses, which were built for appends and not for updates like transactional databases. Check out our Consolidation documentation for more info on this process, and why it's required.

Want to learn more about our replication options and what’s best for your data? Talk to us!

Search results

    No results found