Solving the challenges of mapping your data

Yaara Gazit  •  8 min read  • 21 Jun 2017

Mapping data to the right data type has lasting benefits

One of our main goals here at Alooma is to help you get the most out of your data. One of the ways to achieve that is finding the most suitable mapping for your data.

When your data is correctly mapped, your data warehouse is optimized, which saves storage and computation costs and yields better results for your data analytics.

When mapping is done correctly, the pipeline process works smoothly with fewer loading errors. That’s why we at Alooma allow you to automate the mapping process, so you won’t have to deal with the dirty work of deciding which column should be mapped to which data type.

Alooma allows a user to either auto-detect the data types and create the target tables or manually customize the mapping in the target warehouse.

Data mapping challenges

When automating the mapping process, there are two main challenges we deal with:

  1. Handling different input data types

  2. Knowing which mapping to use for the output

The first challenge is easier, so we can address its solution rather simply. The second is more complex, so we’ll spend the bulk of this article explaining how to solve it.

Handling different input data types

When handling different data types, we can categorize data into two types:

  1. Structured data

  2. Semi-structured and unstructured data

The way we map events from each type differs in complexity, but the general process is the same: getting or creating an information schema, then applying that schema to the selected data warehouse according to its requirements.

For structured data, which usually means rows from relational databases, the information is predictable and the schema is known in advance; we can import the original table schema from the source DB, which is the first step in the mapping process.
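
As a minimal sketch of that first step, assuming a MySQL source and a standard DB-API connection (the function name is illustrative, not Alooma's API), the original schema can be read straight from information_schema:

import pymysql  # any DB-API driver works; pymysql is just an example

def fetch_source_schema(connection, database, table):
    """Return (column_name, data_type) pairs for a source table, in order."""
    query = """
        SELECT column_name, data_type
        FROM information_schema.columns
        WHERE table_schema = %s AND table_name = %s
        ORDER BY ordinal_position
    """
    with connection.cursor() as cursor:
        cursor.execute(query, (database, table))
        return cursor.fetchall()

# Example usage:
# conn = pymysql.connect(host="db.example.com", user="reader",
#                        password="secret", database="mydb")
# fetch_source_schema(conn, "mydb", "users")
# -> [("id", "int"), ("email", "varchar"), ("created_at", "datetime"), ...]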

For semi-structured and unstructured data, such as logs, APIs, cloud services like Mixpanel, and webhooks, the process is more complex because Alooma doesn't know the schema in advance, so a schema-generation process is performed.

Knowing which mapping to use for the output

What would be the best mapping to use? How do you keep data integrity? How do you fit the requirements of the data destination? That’s where our schema-generation logic comes in.

Schema-generation

Schema-generation is our proprietary generator that applies schemas to events. Each input generates event types, which are dictionaries of fields collected from a stream of events.

Sample Event Type:

{
  "_metadata": {
    "timestamp": 1497522159,
    "schema": "test",
    "event_type": "prbuilder11497521767",
    "@uuid": "7fbc1faf-fa57-4f72-94fd-afea634002f6",
    "input_label": "ms",
    "@timestamp": "2017-06-15T10:27:07.143Z",
    "input_id": "edf8dd53-6934-4bf7-b4a4-0bdfcb8f3a6d"
  },
  "id": "133432535353",
  "time1": "2017-06-15T10:22:39.000000",
  "int_field": 5,
  "float_field": 0.7492514848709106,
  "some_text": null,
}


From there, whenever we map an event type whose data types are known (i.e., the event originates from a structured source, like a database), the generator extracts the schema from the source.

If no original schema is available, i.e., when mapping an event type whose data types are unknown, the generator generates a ‘Stats Schema’ based on the statistics of the field values in the samples we collect for each event type.

"Solving data mapping"

The schema-generation process works as follows:

Step 1 - We define an ‘Alooma data type’ for our stats schema. The Alooma data types are similar to those of common databases such as MySQL, but they are pseudo-data types: an in-house, concrete description of the different valid values.

The Alooma data types we have selected to represent the stats schema are:

  1. Timestamp data type

  2. Integer data type

  3. Bigint data type

  4. Float data type

  5. Boolean data type

  6. Varchar data type (default)

The above list also depicts the order in which we check whether the field value statistics match each Alooma data type. Because the same value can fit multiple types (e.g., a Unix timestamp can also be considered a bigint), the order of the list is very important.
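
As a minimal sketch of that ordered check (the function names and cutoffs here are illustrative assumptions, not Alooma's implementation), each Alooma data type gets a predicate, and the predicates are tried in the order above:

from datetime import datetime

INT32_MAX = 2**31 - 1  # a common cutoff between integer and bigint

def is_timestamp(value):
    # Accept ISO-8601 date-time strings; epoch integers are handled by
    # the recency heuristic discussed below.
    try:
        datetime.fromisoformat(str(value))
        return True
    except ValueError:
        return False

def is_integer(value):
    try:
        return abs(int(str(value))) <= INT32_MAX
    except ValueError:
        return False

def is_bigint(value):
    try:
        int(str(value))
        return True
    except ValueError:
        return False

def is_float(value):
    try:
        float(str(value))
        return True
    except ValueError:
        return False

def is_boolean(value):
    return str(value).lower() in ("true", "false")

# Order matters: narrower interpretations are tried before wider ones,
# and varchar is the fallback when nothing else fits.
ALOOMA_TYPE_CHECKS = [
    ("timestamp", is_timestamp),
    ("integer", is_integer),
    ("bigint", is_bigint),
    ("float", is_float),
    ("boolean", is_boolean),
]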

Step 2 - We use a collection of samples with field values, known as our statistics, and start checking possible mappings for each field and each collection of values for a field. A possible mapping has to fit each and every sample in order to be considered a good match.

For example, the field ID has the following sample values: ["4555", "4343", "4555", "5678", "234", "3678"], collected from recent samples. Can you think of a suitable mapping for this field?

From the example above, ID will be detected and mapped as Integer.

However, if the samples we encountered had values like ["1489423842", "1284423842"], these values could be interpreted as both a timestamp and a numeric value (a bigint in most DBs, to be precise). So how do we know the meaning, and therefore the best mapping, of this value? How would we know for sure whether it is meant to be saved as the timestamp Mon, 13 Mar 2017 16:50:42 or as the plain bigint 1489423842?

Well, we can’t know for sure. We can only assume and select the most suitable mapping according to previous experience.

In the world of big data analysis, most of the data that runs through Alooma’s pipeline is real-time data, and it usually contains recent datetimes. This means that when checking the number 1489423842 and converting it to a possible timestamp, we can almost safely assume that if the value translates to a recent timestamp, it should also be mapped to timestamp in our schema.
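
A hedged sketch of that assumption (the plausibility window here is an arbitrary choice for illustration, not Alooma's actual threshold):

from datetime import datetime, timedelta, timezone

def looks_like_recent_timestamp(value, window_years=5):
    # Treat an integer as a Unix timestamp only if it falls within a
    # few years of "now"; otherwise prefer the plain numeric reading.
    try:
        ts = datetime.fromtimestamp(int(str(value)), tz=timezone.utc)
    except (ValueError, OverflowError, OSError):
        return False
    window = timedelta(days=365 * window_years)
    now = datetime.now(timezone.utc)
    return now - window <= ts <= now + window

# looks_like_recent_timestamp("1489423842") -> True  (13 Mar 2017, recent at
#   the time of writing), so the field is mapped to timestamp
# looks_like_recent_timestamp("42")         -> False (1970), stays numeric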

When checking the possible mapping matches for a field, we take the first match; that Alooma data type becomes the field’s mapping in the stats schema.


We will not map a field to a specific data type unless all of its sample values can be converted to that data type. If we can’t find a data type that matches every sample value, we map the field to the default varchar mapping.
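
Putting the pieces together, the per-field decision can be sketched like this (reusing the illustrative ALOOMA_TYPE_CHECKS list from above):

def infer_alooma_type(samples):
    """Return the first Alooma data type that fits every sample, else varchar."""
    # Nulls carry no type information, so they don't rule out a type.
    values = [v for v in samples if v is not None]
    for type_name, check in ALOOMA_TYPE_CHECKS:
        if values and all(check(v) for v in values):
            return type_name
    return "varchar"  # the default when no single type fits all samples

# infer_alooma_type(["4555", "4343", "4555", "5678", "234", "3678"]) -> "integer"
# infer_alooma_type(["4555", "hello", "234"])                        -> "varchar"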

Step 3 - After we have constructed a schema from all of the event type’s fields, the next phase is mapping from the schema of Alooma data types to an appropriate schema with the output warehouse’s data types.

Let the mapping begin

This phase applies to both structured and semi-structured data. After we have constructed a schema for each field from the event type, we want to decide on the best output mapping.

There are many data types in multiple inputs and outputs, and deciding on how to convert between those is a challenge.

After we have created a schema of Alooma data types (whether it’s from stats or from the original schema of an input), it’s time to determine the output schema and create the mapping accordingly.

An input has a subset of Alooma data types that match its fields, and the output has a corresponding subset of data types:

E.g., if the original input is Google Analytics, an event from that input has a field whose original mapping, according to Google Analytics’ schema, is String (there is no Varchar type in Google Analytics, only String), and it will be represented as Alooma’s String data type.

"Solving data mapping"

The conversion logic will then convert Alooma’s String data type to the appropriate data type from the output’s subset of data types.
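
For illustration, such conversion logic can be thought of as a per-output lookup table. The warehouse type names below are real Redshift and BigQuery types, but the pairings are assumptions for this sketch, not Alooma's published conversion rules:

# Illustrative conversion table: Alooma data type -> per-warehouse type.
CONVERSION_TABLE = {
    "string":    {"redshift": "VARCHAR(65535)",   "bigquery": "STRING"},
    "varchar":   {"redshift": "VARCHAR(65535)",   "bigquery": "STRING"},
    "timestamp": {"redshift": "TIMESTAMP",        "bigquery": "TIMESTAMP"},
    "integer":   {"redshift": "INTEGER",          "bigquery": "INT64"},
    "bigint":    {"redshift": "BIGINT",           "bigquery": "INT64"},
    "float":     {"redshift": "DOUBLE PRECISION", "bigquery": "FLOAT64"},
    "boolean":   {"redshift": "BOOLEAN",          "bigquery": "BOOL"},
}

def to_output_type(alooma_type, warehouse):
    # Look up the warehouse-specific data type for an Alooma data type.
    return CONVERSION_TABLE[alooma_type][warehouse]

# to_output_type("string", "bigquery") -> "STRING"
# to_output_type("float", "redshift")  -> "DOUBLE PRECISION"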


Final thoughts

Determining the best mapping for each input and output is not an easy task: you have to sync between the different data sources and outputs, handle frequent schema changes, and make the best decisions in order to save storage and computation costs.

By using the schema generator, we can auto-map from multiple inputs to multiple outputs easily, while making optimal use of your selected data warehouse.

While you could try wrangling your data mappings yourself, it can often be a messy and frustrating job. That’s why we’ve worked hard to do it for you easily.

Have ideas or suggestions on how we can better assist you?
Leave us a comment or feel free to contact us.


Get your data flowing today!
Contact us to start using Alooma for free