What is the best way to move data from AWS PostGres to Redshift?

By Yair Weinberger
Updated Jan 12, 2016

There are a few ways to address this problem, and it mostly depends on what the requirements are and where the server is hosted.

Brute Force: Dump and Load the entire database

The simplistic approach (which is mentioned in some of the other answers) would be to periodically dump the PostgreSQL tables and load them into Redshift using the COPY command.

This approach, of course, has a heavy impact on the DB and is batch-oriented. It also won't scale very well - the bigger the database is, the longer the dump will take.
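As a concrete sketch of the dump-and-load step: after dumping a table to CSV and uploading it to S3, the load itself is a single Redshift COPY command. The bucket, table, and IAM role names below are hypothetical placeholders, not part of the original answer.

```python
# Sketch of the brute-force "dump and load" approach.
# All bucket, table, and role names here are illustrative placeholders.

def build_copy_statement(table: str, s3_path: str, iam_role: str) -> str:
    """Build a Redshift COPY command that loads a gzipped CSV dump from S3."""
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV GZIP;"
    )

# In a real pipeline you would first dump the table, e.g.:
#   psql -c "\copy users TO 'users.csv' CSV" && gzip users.csv
# then upload users.csv.gz to S3 and execute the generated statement
# against Redshift:
sql = build_copy_statement(
    "users",
    "s3://my-etl-bucket/dumps/users.csv.gz",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
print(sql)
```

Note that COPY from S3 is by far the fastest way to bulk-load Redshift; row-by-row INSERTs should be avoided.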

Dump and Load the Changes Only

An improvement to the “dump-load” approach would be to load only the changes that happened on the tables since the last load, which would involve:

- an “updated_at” timestamp column on each table
- a trigger that sets this column to the current timestamp on each update
- a query of the form “... where updated_at > last loaded timestamp”
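The pieces above can be sketched as follows. The trigger DDL and the table/column names are illustrative assumptions, not a prescribed schema:

```python
# Sketch of the incremental ("delta") extraction setup.
# Table and column names ("users", "updated_at") are illustrative.

# Trigger that keeps updated_at current on every UPDATE (PostgreSQL DDL,
# held here as a string for reference):
TRIGGER_DDL = """
CREATE OR REPLACE FUNCTION touch_updated_at() RETURNS trigger AS $$
BEGIN
  NEW.updated_at = now();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_touch BEFORE UPDATE ON users
FOR EACH ROW EXECUTE PROCEDURE touch_updated_at();
"""

def delta_query(table: str, last_loaded: str) -> str:
    """Select only the rows changed since the last successful load."""
    return f"SELECT * FROM {table} WHERE updated_at > '{last_loaded}'"

print(delta_query("users", "2016-01-11 00:00:00"))
```

Each load then records the maximum updated_at it saw, to be used as the watermark for the next run.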

This approach can provide more “near real time” access to the data--with far less strain on the DB--and can scale much better than the brute force approach first mentioned.

However, a major downside is that the target will accumulate several versions of each row, so a post-processing step is needed to derive a table containing only the most recent version of each row.

Also, there is no way to detect deleted rows: once deleted, a row simply stops appearing in future delta queries, so it is carried over to the derived table indefinitely.
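The deduplication step can be illustrated with a small sketch. In practice this would typically be a window-function query inside Redshift; the pure-Python version below, with hypothetical "id" and "updated_at" fields, just shows the logic:

```python
# Sketch: collapse multiple loaded versions of each row into the latest one.
# Assumes each loaded record carries a primary key ("id") and an
# "updated_at" timestamp; both field names are illustrative.

def latest_versions(rows):
    """Keep only the most recent version of each row, keyed by primary key."""
    latest = {}
    for row in rows:
        key = row["id"]
        # ISO-8601 date strings compare correctly as plain strings.
        if key not in latest or row["updated_at"] > latest[key]["updated_at"]:
            latest[key] = row
    return list(latest.values())

loads = [
    {"id": 1, "updated_at": "2016-01-01", "name": "alice"},
    {"id": 2, "updated_at": "2016-01-01", "name": "bob"},
    {"id": 1, "updated_at": "2016-01-05", "name": "alice2"},  # later version wins
]
print(latest_versions(loads))
```

Note that a row deleted at the source would still survive this step, which is exactly the limitation described above.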

Reading the stream of changes: Write-Ahead Logs

A different, and arguably better, approach would be to read the WAL (Write-Ahead Log). Since version 9.4, PostgreSQL supports logical decoding, which streams the changes recorded in the WAL through a replication slot, in whatever format the chosen output plugin produces. Martin Kleppmann wrote Bottled Water, an open-source decoder and reader that replicates the write-ahead log into Apache Kafka.

A similar approach can be used to load the write-ahead logs into Amazon Redshift.

This approach allows near real-time access to the data and has almost no impact on the DB performance. However, like the “delta query” approach, post-processing to derive a table with the most recent row updates is still required since several versions of the same row may exist.
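The key difference from the delta-query approach is that delete events actually appear in the change stream. A minimal sketch of replaying such a stream into a current-state table (the event shape here is invented for illustration, not a real logical-decoding output format):

```python
# Sketch: replay a stream of decoded WAL events into a current-state table.
# The event dictionaries below are an illustrative format, not the output of
# any real logical-decoding plugin.

def apply_events(events):
    """Fold insert/update/delete events into the current state, keyed by id."""
    state = {}
    for ev in events:
        if ev["op"] in ("insert", "update"):
            state[ev["id"]] = ev["row"]
        elif ev["op"] == "delete":
            # Unlike the delta-query approach, deletes are visible in the
            # stream, so the row can actually be removed.
            state.pop(ev["id"], None)
    return state

stream = [
    {"op": "insert", "id": 1, "row": {"name": "alice"}},
    {"op": "insert", "id": 2, "row": {"name": "bob"}},
    {"op": "update", "id": 1, "row": {"name": "alice2"}},
    {"op": "delete", "id": 2},
]
print(apply_events(stream))  # row 2 is gone; row 1 holds its latest version
```
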

Unfortunately, if the source DB is hosted on Amazon RDS or Heroku, there is no way to access the write-ahead logs, and one of the other approaches above should be used.

Using a SaaS provider

If you want to try to implement one of these solutions yourself, the above should be enough info to get started. However, I can tell you from experience that this is a hugely time-consuming undertaking, and one that is likely not directly related to your company’s core business.

I would suggest checking out Alooma's PostgreSQL to Redshift solution, which allows seamless replication of PostgreSQL (and many other data sources) into Redshift, both for PostgreSQL hosted on Heroku and RDS (using the changes dump-load approach) and for self-hosted servers, using a logical decoder.

Published at Quora.