How do I transfer Amazon Mysql RDS data into Amazon Redshift?

byYair Weinberger
Updated Sep 27, 2018

There are a few ways to address this problem, and it mostly depends on what the requirements are.

Brute Force: Dump and Load the entire database

The simplistic approach (which is mentioned in some of the other answers) would be to periodically dump the MySQL tables and load them into Redshift using the “copy” command.This approach, of course, has a heavy impact on the DB and is batch-oriented. It also won't scale very well - the bigger the database is, the longer the dump will take.

Dump and Load the Changes Only

An improvement to the “dump-load” approach would be to only load the changes that happened on the tables since the last load, which would involve:

  1. Having an “updated_at” column on each row
  2. A trigger that would update this column to the current timestamp on each update
  3. A query that will query the table “... where updated_at > last loaded timestamp

This approach can provide more “near real time” access to the data--with far less strain on the DB--and can scale much better than the brute force approach first mentioned.

However, a major downside is there will be several versions of each row, so there would need to be a process that derives a table with the most recent version of each row. More on that here: Updating and Inserting New Data.

Also, there is no way to remove deleted rows, as they will never appear in future queries, meaning they will be carried over to the derived table.

Reading the stream of changes: MySQL Binary Logs

A different, and arguably better, approach would be to read the Binary Logs (binlog replication). The binlog holds an ordered log of all the updates that were performed on a MySQL database. It can be streamed continuously to a binlog table in Redshift and use that table to recreate the original MySQL table in Redshift.

This approach allows near real-time access to the data and has almost no impact on the DB performance.

However, like the “delta query” approach, post-processing to derive a table with the most recent row updates is still required since several versions of the same row may exist.

Using a SaaS provider

If you want to try implement one of these solutions yourself, you should have enough info to get started. However, I can tell you from experience that this is a hugely time-consuming process which I imagine may not be directly related to your company’s core business. I would suggest checking out Alooma's MySQL to Redshift integration, both for RDS hosted and self hosted MySQL instances, using binary log replication.

Like what you read? Share on

Published at Quora. See Original Question here

Data IntegrationMySQLAmazon Redshift

Further reading

How can CSV data be loaded into Amazon Redshift?
Alooma Team • Updated Sep 27, 2018
What is Striim?
Alooma Team • Updated Jul 26, 2018
What is Microsoft Integration Services?
Alooma Team • Updated May 3, 2018
What is IBM Infosphere Information Server?
Alooma Team • Updated May 10, 2018
What is Oracle Data Integrator (ODI)?
Alooma Team • Updated Apr 3, 2018