How do I transfer Amazon Mysql RDS data into Amazon Redshift?
There are a few ways to address this problem, and it mostly depends on what the requirements are.
Brute Force: Dump and Load the entire database
The simplest approach (mentioned in some of the other answers) is to periodically dump the MySQL tables and load them into Redshift using the “COPY” command. This approach, of course, puts a heavy load on the DB and is batch-oriented. It also won't scale well: the bigger the database, the longer each dump takes.
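The dump-and-load cycle can be sketched end to end. This is only an illustration: SQLite stands in for the MySQL source, the dump goes to an in-memory buffer instead of a real file, and the S3 bucket and IAM role in the COPY statement are placeholders, not real values.

```python
import csv
import io
import sqlite3

# SQLite stands in for the MySQL source table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "bob")])

# Step 1: dump the entire table to a flat file (mysqldump or
# `SELECT ... INTO OUTFILE` plays this role against a real MySQL instance).
dump = io.StringIO()
csv.writer(dump).writerows(conn.execute("SELECT id, name FROM users ORDER BY id"))

# Step 2: upload the file to S3, then load it into Redshift with COPY.
# The bucket and IAM role below are placeholders.
copy_sql = (
    "COPY users FROM 's3://my-bucket/users.csv' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load' "
    "FORMAT AS CSV;"
)
print(dump.getvalue().strip())
```

Because the whole table is re-dumped every cycle, the cost of this loop grows linearly with table size, which is exactly why it stops scaling.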
Dump and Load the Changes Only
An improvement to the “dump-load” approach would be to only load the changes that happened on the tables since the last load, which would involve:
- Adding an “updated_at” column to each table
- A trigger that sets this column to the current timestamp on every update
- A query that selects only the changed rows: “... where updated_at > last loaded timestamp”
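The three ingredients above can be sketched with SQLite standing in for MySQL. The “updated_at” values here are an integer logical clock purely to keep the example deterministic; a real MySQL table would use a TIMESTAMP column kept current by “ON UPDATE CURRENT_TIMESTAMP” or a trigger.

```python
import sqlite3

# Stand-in for the MySQL source table.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY, amount INTEGER, updated_at INTEGER NOT NULL)""")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 100, 10), (2, 200, 11), (3, 300, 12)])

# High-water mark recorded by the previous load.
last_loaded = 11

# The incremental extract: only rows touched since the last load.
delta = conn.execute(
    "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
    (last_loaded,)).fetchall()
print(delta)  # only order 3 changed after timestamp 11
```

Each run then records the new maximum “updated_at” as the next high-water mark, so successive extracts stay small regardless of total table size.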
This approach can provide more “near real time” access to the data--with far less strain on the DB--and can scale much better than the brute force approach first mentioned.
However, a major downside is that the target table accumulates several versions of each row, so a process is needed to derive a table containing only the most recent version of each. More on that here: Updating and Inserting New Data.
Also, deletions cannot be detected this way: a deleted row simply stops appearing in the delta query, so it is carried over, stale, into the derived table.
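A minimal sketch of the derivation step. The “staged” landing table and its column names are hypothetical; the window-function query itself is standard SQL that also runs on Redshift (SQLite is used here only so the example executes locally).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "staged" mimics the landing table: every load appends new row
# versions, so the same id can appear several times.
conn.execute("CREATE TABLE staged (id INTEGER, amount INTEGER, updated_at INTEGER)")
conn.executemany("INSERT INTO staged VALUES (?, ?, ?)",
                 [(1, 100, 10), (1, 150, 13), (2, 200, 11)])

# Derive a table holding only the newest version of each row.
latest = conn.execute("""
    SELECT id, amount FROM (
        SELECT id, amount,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
        FROM staged)
    WHERE rn = 1
    ORDER BY id
""").fetchall()
print(latest)  # [(1, 150), (2, 200)]
```

Note that if row 2 had been deleted at the source, it would still appear here, which is the deletion problem described above.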
Reading the stream of changes: MySQL Binary Logs
A different, and arguably better, approach would be to read the Binary Logs (binlog replication). The binlog holds an ordered log of all the updates that were performed on a MySQL database. It can be streamed continuously into a binlog table in Redshift, and that table can then be used to recreate the original MySQL table in Redshift.
This approach allows near real-time access to the data and has almost no impact on the DB performance.
However, like the “delta query” approach, post-processing to derive a table with the most recent row updates is still required since several versions of the same row may exist.
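A toy sketch of binlog replay. The event dicts here stand in for parsed binlog entries (a library such as python-mysql-replication could produce the real thing against a MySQL server); replaying them in order reconstructs the current table state. Unlike the delta-query approach, note that the delete is captured and applied.

```python
# Hand-written stand-ins for parsed row-based binlog events,
# in the order they were written to the log.
binlog = [
    {"op": "insert", "id": 1, "row": {"name": "ada"}},
    {"op": "insert", "id": 2, "row": {"name": "bob"}},
    {"op": "update", "id": 1, "row": {"name": "ada l."}},
    {"op": "delete", "id": 2},
]

table = {}
for event in binlog:
    if event["op"] == "delete":
        table.pop(event["id"], None)
    else:
        # insert and update both end with the newest row version winning
        table[event["id"]] = event["row"]

print(table)  # {1: {'name': 'ada l.'}}
```

The post-processing mentioned above does essentially this replay in SQL: keep the latest version of each primary key, dropping keys whose last event is a delete.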
Using a SaaS provider
If you want to try to implement one of these solutions yourself, you should have enough info to get started. However, I can tell you from experience that this is a hugely time-consuming process, which I imagine may not be directly related to your company’s core business. I would suggest checking out Alooma's MySQL to Redshift integration, both for RDS hosted and self hosted MySQL instances, using binary log replication.
Published at Quora. See Original Question here