S3 Retention

Abstract

Detailed information on S3 Retention, including what it is, how to set it up in your Alooma system, and instructions for using it to reload events.

Alooma provides the option to automatically store all events’ raw data in an Amazon S3 bucket of your choice. Events are stored as they are received, before being processed by Alooma’s Code Engine or Mapper.

S3 Retention provides a backup in case your source data is no longer available (for example, event data from a mobile device) or when you want to reload data into your data destination. The raw data backup can also be helpful during the initial stages of an integration, to verify that events were loaded into Alooma as you expected, for example by counting the events that streamed in.

Data is stored in S3 in "append only" mode, meaning that each event (for example, an update or a delete) is simply stored as another row in S3.
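
For example, an update to a record that was previously inserted is stored as a second row rather than overwriting the first. A hypothetical illustration, with metadata omitted:

{"id": 42, "status": "pending"}
{"id": 42, "status": "shipped"}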

  • Setting up S3 Retention

  • Reloading events differently using S3 Retention

  • Reloading events and mimicking the original data

Setting up S3 Retention

To start saving events to an Amazon S3 bucket, select Settings ➔ S3 Retention and then configure the access credentials for the S3 bucket of your choice.

The S3 credentials you provide should have permission to read and write objects in the specified bucket and to list the bucket's contents. A minimal IAM policy would look like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1478676565000",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<YOUR_BUCKET_NAME_HERE>"
      ]
    },
    {
      "Sid": "Stmt1478676565001",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::<YOUR_BUCKET_NAME_HERE>/*"
      ]
    }
  ]
}

Data from each input is stored in its own directory. File names include the time the last event in the file was received by Alooma.

Optionally, you can specify a Prefix for the file names. For example, for an input named "mobile_sdk", dump files are created in the following format: s3://<BUCKET_NAME>/<PREFIX>/mobile_sdk/YYYY-mm-DD-HH-MM-SS_*
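
For instance, with a (hypothetical) bucket named my-bucket and the prefix raw, a dump file written at noon on January 1, 2017 would match:

s3://my-bucket/raw/mobile_sdk/2017-01-01-12-00-00_*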

If you choose the Include Metadata option, the events stored will be in the following format:

{
  "message": "<YOUR ORIGINAL EVENT, SERIALIZED, HERE>",
  "@uuid": "d2a62361-44dd-4a35-ac08-95a2c5d85ed6",
  "input_label": "test_5",
  "@timestamp": "2000-01-01T00:00:00.000Z",
  <MORE METADATA FIELDS HERE, DEPENDING ON INPUT TYPE>
}

If you choose not to include metadata, the files in S3 will simply contain the original events, separated by newlines.

You can choose whether files will be compressed by gzip (recommended) or saved uncompressed.
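
Because the retained files are plain JSON lines (optionally gzipped), you can verify event counts directly from S3. Here's a minimal sketch using boto3 that counts retained events for one input; the bucket and prefix names are hypothetical, and it assumes gzipped dumps carry a .gz suffix:

import gzip

import boto3

s3 = boto3.client('s3')
BUCKET = 'my-bucket'          # hypothetical bucket name
PREFIX = 'raw/mobile_sdk/'    # hypothetical <PREFIX>/<INPUT_NAME>/

count = 0
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get('Contents', []):
        body = s3.get_object(Bucket=BUCKET, Key=obj['Key'])['Body'].read()
        if obj['Key'].endswith('.gz'):  # assumption: gzipped files use .gz
            body = gzip.decompress(body)
        # Each non-empty line in a dump file is one event
        count += sum(1 for line in body.splitlines() if line.strip())

print('retained events:', count)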

Reloading events differently using S3 Retention

After events have already been loaded into your data destination, Alooma’s S3 Retention option enables you to reload them differently. You may have a variety of reasons for wanting to do this. For example:

  • To map these events to a different table, column or data type.

  • To include event properties that you previously discarded.

  • To fix code in the Code Engine that populated event properties with incorrect values.

To reload events
  1. Change the mapping in the Mapper and/or the code in the Code Engine so that newly received events are handled the way you want.

  2. Drop the data to be replaced from the target table in your data destination.

  3. Filter the files in the S3 bucket so that only those containing the events you wish to stream through Alooma remain. Alternatively, stream all the events into Alooma again and filter them in the Code Engine (see the sketch after these steps).

  4. In the Plumbing screen, select Add a new input ➔ S3.

  5. Specify the prefix and timestamp of the S3 files to be streamed into Alooma. Make sure you select JSON lines as the file format.

  6. Click Finish and the events will stream into Alooma.
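
If you take step 3's Code Engine alternative, returning None from the transform discards an event. Here's a minimal sketch that keeps only events received during a hypothetical reload window; it assumes the retention files were written with Include Metadata, so @timestamp is present on each event:

from datetime import datetime

# Hypothetical reload window: only events received during 2016
RELOAD_START = datetime(2016, 1, 1)
RELOAD_END = datetime(2017, 1, 1)

def transform(event):
    received = datetime.strptime(event['@timestamp'], '%Y-%m-%dT%H:%M:%S.%fZ')
    if not (RELOAD_START <= received < RELOAD_END):
        return None  # discard events outside the reload window
    return event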

Reloading events and mimicking the original data

If you have events stored in S3 Retention and want to re-import them into your data destination (and mimic the original data), you'll need to understand what happens when Alooma puts an event into S3 Retention. Basically, we do two things:

  • Serialize the event into the "message" field of the new retention event.

  • Store the original metadata fields on the root of the new event.

    Note

    The S3 Retention event receives its own metadata as it is created, but we are only interested in the metadata from the original event.

The idea is to take events from S3 Retention, make them resemble the original events, and then run them through the Code Engine so they follow the same path (have the same transformations applied) as the original events.

So, in order to do that, we need to "de-serialize" the event and restore the metadata. If we do this as the first step in the Code Engine, the events can then continue through the rest of the Code Engine path, just like the original events, removing the need for any additional transformations.

The process looks like this:

  1. Make sure that the events resemble the original source data via a transform in the Code Engine. This is basically two parts: de-serialize the message and restore the proper metadata.

  2. Create a new input for the S3 Retention data (Plumbing screen ➔ Add a new input ➔ S3; be sure to specify JSON lines as the file format).

Here's a stripped-down example of an event from S3 Retention. The original event is serialized in the "message" field, the original metadata fields sit at the root, and the metadata added when the retention file is re-read through an S3 input appears under "_metadata".

{
  "message": "<YOUR ORIGINAL EVENT, SERIALIZED, HERE>",
  "@uuid": "d2a62361-44dd-4a35-ac08-95a2c5d85ed6",
  "input_label": "test_5",
  "@timestamp": "2000-01-01T00:00:00.000Z",
  "_metadata": {
    "@uuid": "7a18b429-10e0-4a36-817f-4001d26a0719",
    "file_name": "mytable.txt",
    "line_number": 1,
    "total_rows": 1,
    "file_modified_ts": "2017-12-01T00:00:00+00:00",
    "bucket_name": "my-S3_bucket",
    <MORE METADATA FIELDS HERE, DEPENDING ON INPUT TYPE>
  }
}

Note

In our example, the S3 input adds its own "_metadata" object containing a new @uuid plus five additional fields (file_name, bucket_name, line_number, file_modified_ts, and total_rows).

Now, in order to take the resulting event from S3 Retention and make it look like the original event, we need to run a transform in the Code Engine to de-serialize the message and correct the metadata.

We also need to make sure that the resulting events follow the same codepath that the original events would follow before placing them in the data destination. For example, if your original events went through a transform that added a new field, you would need to make sure that the same field is added to the resulting event.

Note

Just for illustrative purposes, in our example transforms below, we simulate this theoretical codepath by assuming events have a metadata field called "input_label" with a value of either MySQL or MySQL-Backfill. The transform function sends original events (where input_label is MySQL) to the transform_mysql function which in this case simply simulates any extra transforms you might require. The key is that the S3 Retention events (input_label is MySQL-Backfill) need to follow the same path once they're through the backfill transform.

Here are the transforms:

import json

def transform_backfill(retention_event):
    del retention_event['_metadata'] # Remove S3 retention metadata fields
    original_event = json.loads(retention_event.pop('message')) # Deserialize original event

    # Now that we've removed _metadata & message from retention_event,
    # all that remains are the original _metadata fields.
    retention_event['input_type'] = 'backfill' # Marking the input type as 'backfill'

    return dict( # Returning original event with _metadata fields
        _metadata=retention_event,
        **original_event
    )


def transform_mysql(event):
    event['custom_transform'] = 'here'
    return event


def transform(event):
    if event['_metadata']['input_label'] == 'MySQL-Backfill':
        event = transform_backfill(event)
        # at this point the value of event['_metadata']['input_label']
        # is expected to be 'MySQL'.

    if event['_metadata']['input_label'] == 'MySQL':
        return transform_mysql(event)
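
To illustrate, here's how a simplified, hypothetical retention event flows through the transforms above (assuming they are defined in the same Code Engine module):

retention_event = {
    'message': json.dumps({'id': 1, 'name': 'alice'}),  # serialized original event
    '@uuid': 'd2a62361-44dd-4a35-ac08-95a2c5d85ed6',    # original metadata fields
    'input_label': 'MySQL',
    '@timestamp': '2000-01-01T00:00:00.000Z',
    '_metadata': {                                      # metadata added by the S3 input
        '@uuid': '7a18b429-10e0-4a36-817f-4001d26a0719',
        'input_label': 'MySQL-Backfill',
    },
}

print(transform(retention_event))
# {'_metadata': {'@uuid': 'd2a62361-...', 'input_label': 'MySQL',
#   '@timestamp': '2000-01-01T00:00:00.000Z', 'input_type': 'backfill'},
#  'id': 1, 'name': 'alice', 'custom_transform': 'here'}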

Note

In the code example above we set the input_type to backfill. Take this into account if you have other Code Engine logic based on the input_type field.
