Six pitfalls when connecting Elasticsearch to Redshift

by Yuval Barth  
5 min read  • 14 Mar 2017

Lessons learned moving billions of records to cloud data warehouses

Elasticsearch (based on Apache Lucene) is the most popular open source enterprise search engine. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.

Amazon Redshift is one of the most popular, simple, petabyte-scale and cost-effective data warehouses. It is a good fit for analyzing all of your data with your favorite business intelligence tools.

To move your data from Elasticsearch to Redshift you have a few options. Let’s say you have completed your “build versus buy” analysis for your CTO, CIO or technical lead and decided to build the data pipeline yourself. There is certainly no shortage of open source tools on the market to get started with. However, merely selecting an open source technology or a managed solution is only the tip of the iceberg.

The challenge:

Redshift and Elasticsearch have very different structures, as they are two very different solutions for data storage. While Elasticsearch is a full-text search engine built around schema-free JSON documents, Redshift is an SQL-based, columnar data warehouse with enforced schemas, based on PostgreSQL.

Let’s look at a few pitfalls of building your own data pipeline based on lessons learned when we developed Alooma’s Elasticsearch to Redshift data integration:

  1. By default Elasticsearch does not store the timestamp when documents are inserted, and IDs are not necessarily sequential

    Therefore, you have to dump and load the entire index periodically to get all the data into Redshift. This makes having real-time data harder and less efficient than, say, checking only for new documents on each run. We recommend setting or updating a timestamp on each document when it is inserted or updated, so that you can efficiently query only the recently updated records.
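A minimal sketch of such an incremental query, assuming documents carry an `updated_at` field (the field name here is our choice; use whatever timestamp field you add to your documents):

```python
from datetime import datetime, timezone

def incremental_query(last_sync: datetime, timestamp_field: str = "updated_at") -> dict:
    """Build an Elasticsearch search body that fetches only documents
    created or updated since the previous sync run."""
    return {
        "query": {
            "range": {
                timestamp_field: {"gt": last_sync.isoformat()}
            }
        },
        # Sorting by the timestamp lets a crashed sync resume from the
        # last document it successfully loaded.
        "sort": [{timestamp_field: "asc"}],
    }

body = incremental_query(datetime(2017, 3, 1, tzinfo=timezone.utc))
```

Each run then only needs to persist the timestamp of the last loaded document, instead of re-dumping the whole index.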

  2. Converting the Elasticsearch schema-less JSON-based documents into structured Redshift tables

    Elasticsearch documents are schema-free JSON, which means there is no built-in information about which Redshift column type a specific field should go into. This warrants an intelligent schema detection mechanism that fits every field into the appropriate Redshift type. Furthermore, there are no restrictions on the value type of a field: the same field can sometimes contain an integer, sometimes a string, or any other type. Thus, the mechanism that fits each field into a Redshift column also needs to handle schema changes over time, e.g. a field that contained integers starting to contain strings.
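As a rough sketch of what such a mechanism might look like, here is a simplified type detector plus a widening rule for conflicts (the specific type choices are illustrative, not a definitive mapping):

```python
def redshift_type(value) -> str:
    """Map a JSON value to a candidate Redshift column type."""
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE PRECISION"
    return "VARCHAR(65535)"

def widen(current: str, incoming: str) -> str:
    """Resolve a type conflict by widening to the more general type,
    falling back to text, which can hold anything."""
    if current == incoming:
        return current
    if {current, incoming} == {"BIGINT", "DOUBLE PRECISION"}:
        return "DOUBLE PRECISION"
    return "VARCHAR(65535)"
```

A field that suddenly starts receiving strings would then be widened to `VARCHAR`, typically via an `ALTER TABLE` plus a backfill of the old column.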

  3. You have to handle nested fields, for example by flattening the document

    Because of the way nested objects are indexed in Elasticsearch, joining the nested documents to the root document at query time is fast—almost as fast as if they were a single document. One way to store the nested fields in Redshift would be to flatten the document from Elasticsearch before loading into Redshift. However, you will need to name the nested fields correctly, to ensure there are no conflicts in the field names in Redshift.
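A minimal flattening sketch along these lines, joining parent and child key names with an underscore (the separator is an arbitrary choice; pick one that cannot collide with your real field names):

```python
def flatten(doc: dict, prefix: str = "", sep: str = "_") -> dict:
    """Flatten nested objects, prefixing child keys with the parent
    key to avoid name collisions among the resulting Redshift columns."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat

flat = flatten({"user": {"name": "xyz", "id": 1}, "ts": "2017-03-14"})
# {"user_name": "xyz", "user_id": 1, "ts": "2017-03-14"}
```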

  4. You will need to convert Elasticsearch index into Redshift table(s)

    You can either split the index into multiple tables or do a one-to-one mapping. If you create a table where columns are a superset of all the field names in all the documents in the Elasticsearch index you are trying to move to Redshift, the resulting table will be huge (i.e. lots of columns), which might cause performance problems.

    Alternatively, you might split the document into multiple tables, joined by the document ID field. This requires writing code that splits each document and loads it into multiple tables.
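A simplified sketch of such a splitter, assuming nested objects and arrays each get their own child table carrying the document ID as the join key (the table layout here is one illustrative choice among several):

```python
def split_document(doc: dict, doc_id: str) -> dict:
    """Split one document into per-table row lists. Every row carries
    the document ID so child tables can be joined back to the root."""
    root = {"doc_id": doc_id}
    tables = {"root": [root]}
    for key, value in doc.items():
        if isinstance(value, dict):
            # Nested object -> its own table, one row per document.
            tables[key] = [{"doc_id": doc_id, **value}]
        elif isinstance(value, list):
            # Array -> one child row per element.
            tables[key] = [{"doc_id": doc_id, "value": v} for v in value]
        else:
            root[key] = value
    return tables

tables = split_document({"name": "a", "tags": ["x", "y"], "addr": {"city": "TLV"}}, "doc-1")
```

Here `tables["tags"]` holds two rows and `tables["addr"]` one, all joinable to `tables["root"]` on `doc_id`.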

  5. If a “value in key” problem exists you need to transform the data before loading

    You may be wondering what a "value in key" problem is! It arises when the document in Elasticsearch is JSON and there is a field in that JSON where the key itself contains a changing value.

    For example, let’s say you have JSON documents with usernames and IDs, but instead of {“ID” : 1, “Username” : ”xyz”}, {“ID” : 2, “Username” : ”abc”} you have {“1” : ”xyz”}, {“2” : ”abc”}. If you try to map each key to a column name without any transformation, you would be creating multiple columns with names such as ‘1’, ‘2’, etc., whereas the desired result is a column named “ID” and another named “Username”.
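One possible transformation, sketched in Python (the column names `ID` and `Username` come from the example above):

```python
def unpivot(doc: dict, key_column: str = "ID", value_column: str = "Username") -> list:
    """Turn a value-in-key document like {"1": "xyz", "2": "abc"} into
    proper rows with fixed column names, instead of one column per key."""
    return [{key_column: k, value_column: v} for k, v in doc.items()]

rows = unpivot({"1": "xyz", "2": "abc"})
# [{"ID": "1", "Username": "xyz"}, {"ID": "2", "Username": "abc"}]
```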

  6. As with every other data source, you will need to somehow store and handle all the records which failed to be copied into Redshift

    COPY commands can fail! A wrong data format, cluster unavailability, or other problems can cause a temporary inability to load part or all of the data into Redshift. To avoid inconsistencies, you need to make sure you catch, fix, and recopy those documents. This issue becomes even more complex if the documents are being split into multiple tables, as some tables may have received the data while others have not.
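A simplified retry-with-dead-letter sketch of this idea; the actual COPY call is abstracted behind a `copy_batch` function you would supply (for example, one that issues the COPY over a database connection):

```python
def load_with_retry(copy_batch, batches, max_attempts: int = 3) -> list:
    """Attempt to load each batch, retrying transient failures.
    Batches that keep failing are collected into a dead-letter list
    for later inspection, fixing, and replay."""
    dead_letter = []
    for batch in batches:
        for attempt in range(max_attempts):
            try:
                copy_batch(batch)
                break  # batch loaded, move on
            except Exception:
                if attempt == max_attempts - 1:
                    dead_letter.append(batch)
    return dead_letter

# Demo with a loader that always fails for one batch.
def flaky_copy(batch):
    if batch == "bad":
        raise RuntimeError("COPY failed")

dead = load_with_retry(flaky_copy, ["ok", "bad", "ok2"])
```

In a real pipeline the dead-letter batches would be persisted (e.g. to S3) rather than kept in memory, and replayed once the underlying problem is fixed.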

And here is a bonus pitfall to watch out for: an Elasticsearch document might contain a long string which you must transform or simply load into a VARCHAR column in Redshift. But the maximum size of the Redshift VARCHAR type is only 65,535 bytes (not characters, so multi-byte UTF-8 data reaches the limit even sooner), which means that if the Elasticsearch document you are trying to copy contains a longer string, you will have to either truncate your data or split it across several columns.
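A sketch of byte-aware splitting for the second option (chunk boundaries here ignore word boundaries, which a real pipeline may care about):

```python
def split_long_string(text: str, max_bytes: int = 65535) -> list:
    """Split a string into chunks that each fit within the Redshift
    VARCHAR limit, measured in UTF-8 bytes rather than characters,
    without ever cutting a character in half."""
    chunks, current, size = [], [], 0
    for ch in text:
        b = len(ch.encode("utf-8"))
        if size + b > max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += b
    if current:
        chunks.append("".join(current))
    return chunks
```

The chunks can then be loaded into columns like `body_1`, `body_2`, … and concatenated back at query time.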

Hopefully you will find the above lessons helpful, whether you plan to build your own pipeline or to start evaluating hosted cloud-based solutions, like Alooma. Share your experiences with us; we love getting new insights.
