What is the easiest way to load ElasticSearch to Amazon Redshift?
If you’re looking for a way to do a one-time dump-and-load of the data in an ES index to Redshift, the easiest process I know of is this:
- Export all the data (or some part of it, if you want it filtered) using a query on the relevant index in ES.
- Save the results of the query in a file.
- Upload said file to S3
- Create a table with a column for each fieldname in the ES documents in your index
- Use a COPY command to load the file into a table in Redshift
HOW DO I DO THAT?
Export the data:
a. Elasticsearch has an excellent, well-documented REST API. The best method to export a big data set is usually the scan/scroll feature - check the Elasticsearch scroll documentation to learn how to run scrolled queries against your ES server.
Convert the output to a CSV. This step is VERY annoying, as you need to build a superset of all the field names across the documents in your ES index.
a. I would suggest skipping steps 1 and 2 by using the excellent es2csv tool
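If you want to do the CSV conversion yourself, the tricky part is exactly that superset of field names. Here is a minimal sketch (plain Python standard library, sample documents are made up) that flattens nested documents with dotted keys and writes a CSV whose header covers every field seen in any document:

```python
import csv
import io

def docs_to_csv(docs, out):
    """Write ES documents (dicts) to CSV, using the superset of all
    field names across documents as the header. Missing fields become
    empty cells; nested objects are flattened into dotted keys."""
    def flatten(doc, prefix=""):
        flat = {}
        for key, value in doc.items():
            name = prefix + key
            if isinstance(value, dict):
                flat.update(flatten(value, prefix=name + "."))
            else:
                flat[name] = value
        return flat

    flat_docs = [flatten(d) for d in docs]
    # Superset of all field names, in a stable (sorted) order.
    fieldnames = sorted({k for d in flat_docs for k in d})
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(flat_docs)
    return fieldnames

# Hypothetical documents with differing fields:
docs = [
    {"user": "alice", "age": 30},
    {"user": "bob", "geo": {"lat": 1.0, "lon": 2.0}},
]
buf = io.StringIO()
cols = docs_to_csv(docs, buf)
```

Note the sorted header: you'll need the column order to match the table you create in Redshift later, so a deterministic order saves you pain.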
However you got there, you are now in possession of a CSV file that can be loaded into a table in Redshift. Next step - upload your file to S3:
a. This is easy to achieve using the s3cmd tool or the S3 GUI in the AWS Console.
Creating a table - You now need a list of all the field names in your ES documents (and their types) so you can create a table in Redshift with all said columns; otherwise the COPY command in the next stage will fail.
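Since ES won't hand you a relational schema, you can guess one from a sample of your (flattened) documents. A rough sketch - the type mapping and table name here are my assumptions, not anything ES or Redshift dictates:

```python
def infer_redshift_ddl(table, docs):
    """Build a CREATE TABLE statement from a sample of flattened
    documents. Rough type guesses: bool/int/float map to matching
    Redshift types, everything else falls back to VARCHAR."""
    types = {}
    for doc in docs:
        for key, value in doc.items():
            if isinstance(value, bool):
                t = "BOOLEAN"
            elif isinstance(value, int):
                t = "BIGINT"
            elif isinstance(value, float):
                t = "DOUBLE PRECISION"
            else:
                t = "VARCHAR(65535)"
            # If two documents disagree on a type, fall back to VARCHAR.
            if types.setdefault(key, t) != t:
                types[key] = "VARCHAR(65535)"
    cols = ",\n".join('    "%s" %s' % (name, types[name])
                      for name in sorted(types))
    return "CREATE TABLE %s (\n%s\n);" % (table, cols)

# Hypothetical sample:
ddl = infer_redshift_ddl("es_export", [
    {"user": "alice", "age": 30},
    {"user": "bob", "score": 9.5},
])
```

Column names are double-quoted because flattened dotted names like "geo.lat" aren't valid bare identifiers in Redshift.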
Creating a COPY command - log into your Redshift DB and execute a COPY command (see AWS documentation for more details):
COPY <table_name> (columns....) FROM 's3://<file-path>' CREDENTIALS 'aws_access_key_id=<access_key>;aws_secret_access_key=<secret_key>' REGION AS '<region>' NULL AS 'NULL' CSV QUOTE AS '"';
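If you're scripting the load, it helps to assemble that statement programmatically so the column list always matches the CSV header. A small sketch (table name, bucket path and credentials below are placeholders; in practice an IAM role is safer than inlined keys):

```python
def build_copy(table, columns, s3_path, access_key, secret_key, region):
    """Assemble a Redshift COPY statement for a CSV file on S3.
    Credentials are inlined for simplicity only."""
    cols = ", ".join('"%s"' % c for c in columns)
    creds = ("aws_access_key_id=%s;aws_secret_access_key=%s"
             % (access_key, secret_key))
    return ("COPY %s (%s) FROM '%s' CREDENTIALS '%s' "
            "REGION AS '%s' NULL AS 'NULL' CSV QUOTE AS '\"';"
            % (table, cols, s3_path, creds, region))

# Hypothetical values:
sql = build_copy("es_export", ["age", "user"],
                 "s3://my-bucket/export.csv",
                 "<access_key>", "<secret_key>", "us-east-1")
```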
That’s it! Your data is now in Redshift! Yay... but:
If you followed this post, you probably noticed that Redshift and Elasticsearch have a very different structure, as they are two very different solutions for data storage. While Elasticsearch is a full-text search engine based around schema-free JSON documents, Redshift is a columnar, schema’d data warehouse. This makes processes of data transfer between the two... well, suck.
Therefore, if you’re looking for a robust and reliable way to continuously replicate the data in your Elasticsearch indices to Redshift, I would advise you to consider a managed data pipeline solution such as Alooma (full disclosure - I work at Alooma as an engineer). Using a managed solution will save you most of the heavy lifting of transforming and loading the data, allowing you to focus on actually using the data in Redshift instead of tackling the intricacies of the procedure - which, speaking from experience, are numerous and annoying :)