What is the easiest way to load ElasticSearch to Amazon Redshift?

by Yuval Barth
Updated Jul 11, 2016

If you’re looking for a way to do a one-time dump-and-load of the data in an ES index to Redshift, the easiest process I know is this:

  1. Export all the data (or some part of it, if you want it filtered) using a query on the relevant index in ES.
  2. Save the results of the query in a file.
  3. Upload said file to S3.
  4. Create a table with a column for each field name in the ES documents in your index.
  5. Use a COPY command to load the file into the table in Redshift.

HOW DO I DO THAT?

  1. Export the data:

    a. Elasticsearch has an excellent, well-documented REST API. The best method to export a big data set is usually the scan feature - check out the Elasticsearch documentation to learn how to query your ES server using scan.

  2. Convert the output to a CSV. This step is VERY annoying, as you need to create a superset of all the field names from the documents in your ES server.

    a. I would suggest skipping steps 1 and 2 by using the excellent es2csv tool.
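
If you would rather roll your own than use es2csv, a minimal Python sketch of steps 1 and 2 could look like the following. The host, index name, and file name are placeholders, nested objects are not handled, and everything is held in memory, so treat it as a starting point rather than a production exporter:

import csv

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://localhost:9200")  # placeholder host

# First collect every document and the superset of field names.
# (For a very large index you would stream this in two passes instead.)
docs = []
fields = set()
for hit in scan(es, index="my_index", query={"query": {"match_all": {}}}):
    source = hit["_source"]  # nested objects would need flattening first
    docs.append(source)
    fields.update(source.keys())

# Write a CSV whose header is the superset of all field names; fields a
# document doesn't have are written as the literal string NULL so the
# COPY command's NULL AS 'NULL' option (step 5) picks them up.
with open("my_index.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=sorted(fields), restval="NULL")
    writer.writeheader()
    writer.writerows(docs)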

  3. However you decided to do it, you are now in possession of a CSV file that can be loaded into a table in Redshift. Next step - upload your file to S3:

    a. This is easy to achieve using the s3cmd tool or using the S3 GUI in the AWS Console.
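
If you want to script this step as well, the same upload can be done with boto3 instead of s3cmd. The bucket and key names below are placeholders, and credentials come from your normal AWS configuration:

import boto3

s3 = boto3.client("s3")
# Upload the export to s3://my-bucket/exports/my_index.csv (placeholders).
s3.upload_file("my_index.csv", "my-bucket", "exports/my_index.csv")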

  4. Creating a table - You now need to get a list of all the field names in your ES documents (and their types) and create a table in Redshift with all said columns, so that the COPY command in the next step will work.
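
One way to avoid writing the DDL by hand is to generate it from the CSV header. The sketch below takes the lazy route of declaring every column as VARCHAR - real types should come from your index mapping - and the connection details are placeholders (Redshift speaks the PostgreSQL protocol, so psycopg2 works):

import csv

import psycopg2

# Read the column names from the CSV header produced in step 2.
with open("my_index.csv", newline="") as f:
    columns = next(csv.reader(f))

# Naive DDL: every column as VARCHAR. Map real types from the ES mapping
# (INTEGER, TIMESTAMP, BOOLEAN, ...) before doing this for real.
ddl = "CREATE TABLE my_index ({});".format(
    ", ".join('"{}" VARCHAR(512)'.format(col) for col in columns)
)

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="mydb",
    user="myuser",
    password="...",
)
with conn, conn.cursor() as cur:
    cur.execute(ddl)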

  5. Creating a COPY command - log into your Redshift DB and execute a COPY command (see AWS documentation for more details):

COPY <table_name> (<columns...>)
FROM 's3://<bucket>/<file-path>'
CREDENTIALS 'aws_access_key_id=<access_key>;aws_secret_access_key=<secret_key>'
REGION AS '<region>'
NULL AS 'NULL'
CSV QUOTE AS '''';
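
If you would rather run the load from code than from a SQL client, the same COPY can be executed over a psycopg2 connection. Everything below is a placeholder, and the IGNOREHEADER 1 / double-quote QUOTE options are there to match the header row and quoting produced by Python's csv module in the export sketch above - adjust them to whatever your file actually looks like:

import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="mydb", user="myuser", password="...",
)

copy_sql = """
    COPY my_index
    FROM 's3://my-bucket/exports/my_index.csv'
    CREDENTIALS 'aws_access_key_id=<access_key>;aws_secret_access_key=<secret_key>'
    REGION AS 'us-east-1'
    NULL AS 'NULL'
    CSV QUOTE AS '"'
    IGNOREHEADER 1;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)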

That’s it! Your data is now in Redshift! Yay… but:

If you followed this post, you probably noticed that Redshift and Elasticsearch have very different structures, as they are two very different solutions for data storage. While Elasticsearch is a full-text search engine built around schema-free JSON documents, Redshift is a columnar, schema-bound data warehouse. This makes the process of transferring data between the two... well, suck.

Therefore, if you’re looking for a robust and reliable way to continuously replicate the data in your Elasticsearch indices to Redshift, I would advise you to consider a managed data pipeline solution such as Alooma (full disclosure - I work at Alooma as an engineer). Using a managed solution will save you most of the heavy lifting of transforming and loading the data, allowing you to focus on actually using the data in Redshift instead of tackling the intricacies of the procedure - which, speaking from experience, are numerous and annoying :)


Published at Quora.

