Amazon Redshift for Custom Analytics - Full Guide

by Itamar Weiss  
25 min read  • 4 Apr 2018

In this blog post, we describe how we built our custom analytics solution over Amazon Redshift while dogfooding our own product in the process. At a glance, this article covers:

  • Why the world is shifting towards custom analytics solutions
  • Should you build a custom analytics solution?
  • Why Amazon Redshift became so popular
  • Choosing a BI visualization tool
  • Setting up your Amazon Redshift cluster
  • Building a data pipeline with Alooma
  • Popular SQL queries for creating engagement, retention, funnels and performance dashboards

The rise of custom analytics

It’s almost impossible to find a website or an app these days that doesn’t implement some sort of analytics. The notion of leveraging data to make product and strategic decisions is now common knowledge, and it is even more essential in the consumer space. While you may have a gut feeling about how your customers will react to product changes, their reactions can sometimes surprise you. Today's approach to product development with analytics allows you to test different hypotheses, measure how they perform, learn, and continuously improve your product or service.

Removing the barriers

Many products have come about to help make analytics accessible to web and mobile developers: Google Analytics, Flurry, Mixpanel, Localytics, Segment, and many others. These solutions usually provide easy-to-install SDKs for multiple platforms and out-of-the-box dashboards for the most popular types of analysis, including user engagement, funnels, and user retention. One of the primary benefits of these products is that they provide an end-to-end solution. Alternatively, you could develop, configure, and set up the client-side SDKs, build your own data pipelines to collect the metrics, set up the data warehouse to store the data, and create the analysis and visualization tools. Piece of cake, right?

Actually, as the cost and complexity of owning and managing your own data warehouse and data visualization tools decrease thanks to tools such as Amazon Redshift and Google BigQuery, we see more and more companies choosing to build their own custom analytics solutions. Giving up some out-of-the-box simplicity is worth it for the powerful insights you can gain from your own custom solution.

For example, we at Alooma built our own custom analytics in order to analyze and improve our marketing and sales pipeline. We collect data about the actions visitors take on our website and correlate it with data coming from our Facebook ads, Google AdWords, and Salesforce. This allows us to build easily digestible dashboards that analyze the performance of each component of our sales and marketing pipeline, and to get a holistic view of the ROI of each campaign. In short, it gives us a true understanding of what's working and which resources could be better allocated in our sales and marketing efforts, something an out-of-the-box solution would struggle to provide.

So let's dive deeper into how you can build your own custom analytics solution over Amazon Redshift by utilizing Alooma.

Why build your custom analytics solution?

The first question you may be asking is: why bother investing in building your own custom analytics solution when there are so many out-of-the-box options? We've spoken with hundreds of companies, and here are some of the main reasons we hear time and time again:

  1. Owning your data - Data is power. Companies today understand that the data they have about their customers can have tremendous value. For some companies, data is the most valuable asset they have. They want to make sure their data is handled correctly and to guarantee they will always have access to, and ownership of, that data.

  2. Custom queries and deeper insights - While the built-in dashboards of third-party solutions provide very quick time-to-value, you are always limited to the dashboards each third-party solution decided to implement. Every company has questions specific to its industry that require writing custom queries or building custom reports and dashboards to answer.

  3. Joining multiple datasets - If you want to get a holistic view of your business, you would need to join multiple data sets. For example, joining your web analytics data with your Salesforce data allows you to discern the true value your online marketing campaigns are providing. By integrating multiple sources (analytics, server logs, CRM, etc.) into your own data warehouse, you are able to join these datasets and therefore query them together to get a holistic view of your business.

  4. Machine learning - If you own your data, you are able to leverage machine learning to develop recommendation systems, create a custom experience for each of your users, or estimate the potential value of each user and retarget high-value users with customized campaigns. Without access to the entirety of your raw data, this level of granularity would be out of reach.

  5. Cost - Third-party solutions usually have low-cost (or even free) entry-level pricing, but they quickly become expensive as you scale. In order to cut costs, companies often begin to limit the amount of data they capture and store. While this may seem like a good idea at the time, it usually creates challenges later on, when the business asks questions about historical data that was never captured, all in the name of lowering a monthly bill. In our experience, when processing more than a few million events per day (and often even less), it can become more cost-effective to build your own custom solution.

Why choose Amazon Redshift as my data warehouse?

Redshift was launched in 2013 and took the world of big-data analytics by storm. It was the fastest growing product of Amazon Web Services in 2014. Redshift combines high performance with low cost, providing great value for money. It is based on columnar storage technology, so it is great for analytic queries that require aggregation and for high throughput data ingestion.

  • It is fast - Its massively parallel processing (MPP) architecture parallelizes and distributes SQL operations to take advantage of all available resources.
  • It is scalable - It only takes a couple of clicks to spin up more nodes for more storage and better performance, up to 2 petabytes!
  • It is secure - Your data can be encrypted at rest, and your Redshift cluster's network can be isolated using tools such as Amazon Virtual Private Cloud.
  • It is cost-effective - It is considered to be an order of magnitude cheaper than other commercial alternatives.

Should I build my data pipeline myself?

Moving data from one place to another sounds easy enough, right? Well, when you take into account high volume, ever-changing schemas and APIs from your third-party services, fault tolerance, and error handling, you start to see how daunting it can be (and it requires constant upkeep). We wrote about this extensively in our blog post: Building a professional grade data pipeline.

What visualization tool should I choose?

There are a variety of visualization tools you can choose from, depending on your specific needs and budget.

  1. re:dash - An open-source visualization tool. It has a relatively basic interface and feature-set, but it's open source and free. Since you need to install it on one of your servers yourself, installation might be a bit of a hassle... but hey, it's free.
  2. Amazon QuickSight - Amazon's business intelligence tool. It is probably the most cost-effective solution when considering cost vs feature-set.
  3. Mode - A well designed cloud-based visualization tool that allows you to quickly explore your data and share insights with different members of your team. Similar to re:dash, Mode relies quite heavily on writing SQL queries, which provides high flexibility but requires more advanced SQL skills.
  4. Looker - Looker is a cloud-based visualization tool. It allows you to perform complex analysis without the need to write complex SQL queries. It's a great option to create dashboards and interactive charts, especially if you're not a SQL guru.
  5. Tableau - Tableau is one of the most popular visualization tools. It is usually installed on-premises (although they now have a cloud-based offering as well). It's considered to be more expensive than the others, but it allows for very deep analytics through a simple drag-and-drop interface.

Step 1 - Setting up your Redshift cluster

The first step is to set up your Redshift cluster. AWS provides a great getting started guide that you can easily follow, along with detailed pricing information.

What cluster size should you choose?

A rule of thumb says you should estimate how much data you would like to store in your data warehouse, and then multiply it by three: one part would be used for your data, one for temporary tables created by queries, and the last third for extra space just in case.

For example, for six months of retention of 1 KB events streaming at a throughput of 1,000 events per second (EPS), you should plan for a cluster with at least 1 KB × 1,000 events/s × 3,600 s/hour × 24 hours/day × 180 days × 3 ≈ 46 TB of storage.

What hardware should I choose for my cluster?

Hardware will very much depend on your specific needs. If you plan to run complex queries and need very low latency, go for the high-performing dense compute hardware. If you need to store more data or have budget constraints, go for the dense storage nodes. You can find more details about the different Redshift node types here.

Once you have a Redshift cluster up and running, continue to the next step.

Step 2 - Building your data pipeline

Custom analytics data pipelines over Redshift are usually composed of the following functions:

  1. Connecting to your data sources
  2. Cleaning, filtering, enriching and transforming your data
  3. Mapping your data to Redshift tables
  4. Loading your data to Redshift

Given that we are a data pipeline company, we built our data pipeline with Alooma... surprising, right?

Connecting to your data sources

First, your data pipeline needs to be able to acquire your data from all of your data sources. Some of your data sources will push data to your pipeline, while others require your pipeline to actively pull data. For example, iOS, Android, and web applications push analytic events directly from the clients, while your data pipeline would have to regularly query the Salesforce API in order to collect the latest updates.

Alooma offers dozens of built-in connectors for easy and quick integration of all your data sources. It supports data sources including Google Analytics, Mixpanel, Localytics, Segment, Amazon S3, Azure Storage, Salesforce, MongoDB, MySQL, and PostgreSQL, and provides SDKs for iOS, Android, JavaScript, Python, and Java. New data sources are added every week. Alooma is built to accommodate both structured (tabular) and semi-structured (JSON) data and to handle high throughput even at peak times.

Adding a data source

Sending data

If you would like to track your website or mobile app, you can use Alooma's JavaScript, Android, and iOS SDKs. The JavaScript SDK automatically tracks all pageviews on your website; beyond that, you should also track all the relevant user actions. For example, in order to monitor our sales leads, we track when a user opens, closes, and submits our contact form.

If you use online marketing, we recommend tagging your landing page URLs with campaign parameters that allow you to track the source of each visitor. In our own implementation, we make sure that we know the campaign, the ad, and the keyword that brought each paid visitor. This allows us to analyze the real ROI of each campaign, ad, and keyword.

An example landing page URL would be: https://www.alooma.com/?utm_source=alooma&utm_medium=content&utm_campaign=blog&utm_content=custom_analytics_to_redshift

Google AdWords simplifies tracking by letting you set a tracking template at the campaign level. Facebook and other ad platforms, however, usually require you to add the tags manually to the URL you provide for each ad.
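Once these parameters land in your events table in Redshift (see Step 3 below), a simple first-touch breakdown becomes easy to compute. A minimal sketch, assuming the SDK flattens the query-string parameters into columns named properties_utm_source and properties_utm_campaign (hypothetical names; check how your own mapping names them):

-- Unique visitors per campaign, based on tagged pageview events.
-- Column names are assumptions and depend on your field mapping.
SELECT properties_utm_source   AS utm_source,
       properties_utm_campaign AS utm_campaign,
       count(DISTINCT properties_distinct_id) AS visitors
FROM events
WHERE event = 'mp_page_view'
  AND properties_utm_campaign IS NOT NULL
GROUP BY 1, 2
ORDER BY visitors DESC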

Cleaning, filtering, transforming and enriching your data

After acquiring the raw data from all your data sources, you need to prepare it before loading it into Redshift. Some of the data you will probably prefer to ignore completely, while for other data points you will want to keep only the relevant fields. Some data will need enrichment (e.g. adding geolocation based on IP address), while other data will need to be transformed.

One example we encounter quite often is timestamps: Mixpanel reports event times as Unix timestamps in seconds, which need to be converted into a format that Redshift's timestamp columns accept. Trying to load the Mixpanel data without fixing it first would result in corrupted timestamp values in your Redshift.
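As a point of reference, the same conversion can also be done at query time inside Redshift, which is handy for spot-checking raw values. A minimal sketch, assuming the raw Mixpanel value landed in a hypothetical integer column called properties_time holding seconds since the epoch:

-- Convert a Unix timestamp (seconds since epoch) into a Redshift timestamp.
-- properties_time is a hypothetical column name; adjust to your own mapping.
SELECT properties_time,
       TIMESTAMP 'epoch' + properties_time * INTERVAL '1 second' AS event_ts
FROM events
LIMIT 10

That said, it is usually better to fix the value once in the pipeline than in every query.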

Alooma allows you to write simple Python code that runs on each event in your data pipeline. You can filter, clean, and transform your data with Python code while Alooma takes care of deployment and scale. You can go even further and enrich your data with built-in third-party libraries.

Clean, filter, transform and enrich your data

Mapping your data to Redshift tables

After your data is clean, it’s time to define how your data loads into your Redshift.

Alooma helps you easily map each field of every event type to the corresponding Amazon Redshift column within a simple user interface. The auto-mapping functionality automatically chooses Redshift-compatible column names for each field and uses the field statistics to determine the optimal column type (varchar, integer, boolean, timestamp, etc.). When your data's schema changes and new fields are seen for the first time, you will receive a notification in the Alooma dashboard and via email. While this would normally result in data loss, Alooma saves the new field in a separate queue until you decide how it should be mapped. Once you update the mapping with the new field, the corresponding columns are created in Amazon Redshift for you.

Creating a new Redshift table

When creating a new table, it is important to carefully choose the distribution key and sort key. You can find more details about this process in Redshift's best practices guide.

If there is a key that is likely to be used for joins (e.g. user id, session id), use it as the distribution key. Amazon Redshift makes sure that all data points with the same distribution key are stored in the same partition. This reduces the amount of data sent over the network during queries and dramatically improves performance. If you use Alooma's web and mobile SDKs, a good choice for a distribution key is properties_distinct_id, which represents a distinct user. A good sort key is metadata_timestamp, the time of the event, since we will quite often use timestamps to query only a small slice of the data.
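To make this concrete, here is a minimal sketch of what such a table definition could look like if you created it by hand (the column list is abbreviated and hypothetical; in practice, Alooma's auto-mapping generates the table for you):

-- Simplified events table with an explicit distribution key and sort key.
CREATE TABLE events (
    metadata_timestamp     TIMESTAMP,      -- event time, used as the sort key
    properties_distinct_id VARCHAR(256),   -- distinct visitor id, used as the distribution key
    event                  VARCHAR(256),   -- event name, e.g. 'mp_page_view'
    page                   VARCHAR(2048),  -- page URL
    os                     VARCHAR(64)     -- operating system
    -- ... additional mapped columns ...
)
DISTKEY (properties_distinct_id)
SORTKEY (metadata_timestamp);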

Auto-mapping your data fields

Loading your data to Redshift

Loading your data to a data warehouse in general, and to Amazon Redshift in particular, can be tricky. In order to allow a high load throughput, data needs to be buffered and loaded in batches. To avoid data loss, your design should take into account that Redshift will sometimes be unavailable due to maintenance (resize, vacuum) or a long-running query.
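For reference, if you were to do this yourself, bulk loading into Redshift is typically done with the COPY command reading batched, compressed files from Amazon S3 rather than with row-by-row INSERTs. A minimal sketch (the bucket path and IAM role are placeholders):

-- Load a batch of gzipped JSON files from S3 into the events table.
COPY events
FROM 's3://your-bucket/events/2018/04/04/'              -- placeholder S3 prefix
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoad'  -- placeholder IAM role
FORMAT AS JSON 'auto'
TIMEFORMAT 'auto'
GZIP;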

Alooma takes all of these factors into account and handles the optimization of data loading for you. Alooma buffers your data, loads it in micro-batches, parallelizes the loading process (where possible), and maintains optimal throughput and latency. Normally, your data will be loaded and available in Redshift within seconds after it first reaches Alooma.

Your data pipeline

Step 3 - Analysis

At this point, you should have your data in Redshift. Let's demonstrate how we built our lead-generation funnel analysis at Alooma.

The events table

We have a Redshift table containing user actions: mainly pageviews, contact form opened, and contact form sent events. Our events table has the following (simplified) structure:

timestamp | distinct_id | event | page | os  | ...
...       | ...         | ...   | ...  | ... | ...

The timestamp is the time of the event, distinct_id is an id attached to a specific visitor, event is the name of the event (e.g. ‘pageview’, ‘contact_form_opened’, etc.), and os is the operating system of the user. Usually, each row contains many more fields (not shown here for simplification) such as referrer, screen size, country, city, and more.

The Salesforce tables

We import 2 main tables from Salesforce:

  • Leads - leads that come from the website - contain information about the lead, its source, its owner, and whether it converted or not.
  • Opportunities - If a lead is qualified (i.e. relevant to your business), it becomes an opportunity in Salesforce. This table contains information about the size of the opportunity in potential annual revenue and how the sales team estimates the probability of closing.

Engagement

When attempting to analyze the engagement of our users with our product, we can measure, for example, how many events we captured every day/week/month.

Events per week

A simple query to create this chart would look like this:

SELECT date_trunc('week',metadata_timestamp) AS week,
       count(1) AS COUNT
FROM events
GROUP BY week

Which will return a table similar to this:

Week                   | Count
6/29/2015, 12:00:00 AM | 1,230
7/6/2015, 12:00:00 AM  | 889
6/8/2015, 12:00:00 AM  | 1,081
...                    | ...

And ultimately can be used to create the following chart:

User engagement

If instead we have several types of events, we can create a stacked chart by running the following query:

SELECT date_trunc('week',metadata_timestamp) AS week,
       event,
       count(1) AS COUNT
FROM events
GROUP BY week, event

Which will return the column ‘event’ and will allow us to break the results down by the analytics event type:

Week                   | Event               | Count
6/29/2015, 12:00:00 AM | mp_page_view        | 1,230
6/29/2015, 12:00:00 AM | contact_form_opened | 889
6/8/2015, 12:00:00 AM  | mp_page_view        | 1,081
...                    | ...                 | ...

And finally, we can create a stacked graph that shows user actions much more clearly, yet with high granularity:

User engagement by event

Monthly active users (MAU)

The problem with the chart above is that we don’t know how many users generated the actions. We can see growth in the activity, but it’s hard for us to tell if it’s because there are more users, or because the existing users used the product more often.

Therefore, a popular metric to follow in addition is the monthly active users (MAU).

To show a chart of the MAU we can run the following query:

SELECT date_trunc('month',metadata_timestamp) AS month,
       count(distinct properties_distinct_id) AS COUNT
FROM events
GROUP BY month
ORDER BY month DESC

Month                  | Count
11/1/2015, 12:00:00 AM | 3,628
10/1/2015, 12:00:00 AM | 3,979
9/1/2015, 12:00:00 AM  | 1,081
...                    | ...

Monthly Active Users - MAU

Funnels

Engagement charts are interesting, but they are often referred to as “vanity metrics.” That’s because they don’t tell us anything about the improvement in the performance of our product. Engagement can go up and to the right just because we invest more money in marketing, for example.

Funnels allow us to measure whether we actually improved our product. They tell us, for example, how likely a visitor is to convert into a paying user on our website. If the conversion rates increase, it means the product itself got better.

Let’s say that we have a 3 step funnel:

  1. A user visits our website - page view event
  2. A user opens a contact form
  3. A user submits the contact form

A simple query that will allow us to see our funnel would be:

WITH timestamped_events AS
  (SELECT properties_distinct_id,
          min(CASE WHEN event = 'mp_page_view' THEN metadata_timestamp END) AS ts_landingpage,
          min(CASE WHEN event = 'contact_form_opened' THEN metadata_timestamp END) AS ts_contactformopened,
          min(CASE WHEN event = 'contact_form_sent' THEN metadata_timestamp END) AS ts_contactformsent
   FROM events
   GROUP BY properties_distinct_id)

SELECT sum(CASE WHEN ts_landingpage IS NOT NULL THEN 1 ELSE 0 END) AS landingpage_count,
       sum(CASE WHEN ts_contactformopened IS NOT NULL THEN 1 ELSE 0 END) AS contactformopened_count,
       sum(CASE WHEN ts_contactformsent IS NOT NULL THEN 1 ELSE 0 END) AS contactformsent_count
FROM timestamped_events

The first part of the query creates a temporary table of each visitor's distinct id and the first time they did any of the above actions. The second part of the query sums the number of distinct visitors who performed each step.

Landingpage_count | Contactformopened_count | Contactformsent_count
15,997            | 1,427                   | 174

Conversion funnel
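Building on the same query, we can also express each step as a conversion rate rather than an absolute count. A sketch that wraps the funnel counts and divides each step by the previous one (NULLIF guards against division by zero):

WITH timestamped_events AS
  (SELECT properties_distinct_id,
          min(CASE WHEN event = 'mp_page_view' THEN metadata_timestamp END) AS ts_landingpage,
          min(CASE WHEN event = 'contact_form_opened' THEN metadata_timestamp END) AS ts_contactformopened,
          min(CASE WHEN event = 'contact_form_sent' THEN metadata_timestamp END) AS ts_contactformsent
   FROM events
   GROUP BY properties_distinct_id),

-- count(expr) counts non-NULL values, equivalent to the sum(CASE ...) pattern above
funnel AS
  (SELECT count(ts_landingpage) AS landingpage_count,
          count(ts_contactformopened) AS contactformopened_count,
          count(ts_contactformsent) AS contactformsent_count
   FROM timestamped_events)

SELECT landingpage_count,
       contactformopened_count,
       contactformsent_count,
       round(100.0 * contactformopened_count / nullif(landingpage_count, 0), 1) AS open_rate_pct,
       round(100.0 * contactformsent_count / nullif(contactformopened_count, 0), 1) AS submit_rate_pct
FROM funnel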

We could also see how our funnel develops over time by running:

WITH timestamped_events AS
  (SELECT properties_distinct_id,
          min(CASE WHEN event = 'mp_page_view' THEN metadata_timestamp END) AS ts_landingpage,
          min(CASE WHEN event = 'contact_form_opened' THEN metadata_timestamp END) AS ts_contactformopened,
          min(CASE WHEN event = 'contact_form_sent' THEN metadata_timestamp END) AS ts_contactformsent
   FROM events
   GROUP BY properties_distinct_id )

SELECT date_trunc('week', ts_landingpage) AS week,
       sum(CASE WHEN ts_landingpage IS NOT NULL THEN 1 ELSE 0 END) AS landingpage_count,
       sum(CASE WHEN ts_contactformopened IS NOT NULL THEN 1 ELSE 0 END) AS contactformopened_count,
       sum(CASE WHEN ts_contactformsent IS NOT NULL THEN 1 ELSE 0 END) AS contactformsent_count
FROM timestamped_events
GROUP BY week

Week                   | Landingpage_count | Contactformopened_count | Contactformsent_count
6/15/2015, 12:00:00 AM | 171               | 4                       | 0
7/13/2015, 12:00:00 AM | 244               | 24                      | 3
8/17/2015, 12:00:00 AM | 816               | 58                      | 8

Conversion funnel by week

Retention

One of the most popular and important charts is the retention cohort analysis. It allows us to see how often our users come back to our product. We can learn how quickly users churn and what percentage of our initial users become regular/retained users.

If we look at our retention cohort chart, it allows us to gain insight into whether our product is improving or not. If users who start using our product this week churn less than users who started using our product last month, we can infer that updates to our product offering have had a positive impact on the user experience.

A sample query to group users by the week of their first use of our product could be:

WITH users AS
  ( SELECT properties_distinct_id AS user_id,
           date_trunc('week', min(metadata_timestamp)) AS activated_at
   FROM events
   GROUP BY 1) ,

events AS
  ( SELECT properties_distinct_id AS user_id,
           metadata_timestamp AS occurred_at
   FROM events)

SELECT DATE_TRUNC('week', u.activated_at) AS signup_date,
       TRUNC(EXTRACT('EPOCH' FROM e.occurred_at - u.activated_at)/(3600*24*7)) AS user_period,
       COUNT(DISTINCT e.user_id) AS retained_users
FROM users u
JOIN events e ON e.user_id = u.user_id
AND e.occurred_at >= u.activated_at
WHERE u.activated_at >= getdate() - INTERVAL '11 week'
GROUP BY 1,2
ORDER BY 1,2

Signup_date           | User_period | Retained_users
9/7/2015, 12:00:00 AM | 0.00        | 808
9/7/2015, 12:00:00 AM | 1.00        | 72
9/7/2015, 12:00:00 AM | 2.00        | 51
...                   | ...         | ...

User retention cohort
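To turn these raw counts into the familiar percentage-based cohort table, divide each row by the size of its cohort, i.e. the week-0 count. A sketch building on the query above (it repeats the same logic so it can run on its own):

WITH users AS
  (SELECT properties_distinct_id AS user_id,
          date_trunc('week', min(metadata_timestamp)) AS activated_at
   FROM events
   GROUP BY 1),

cohorts AS
  (SELECT u.activated_at AS signup_date,
          TRUNC(EXTRACT('EPOCH' FROM e.metadata_timestamp - u.activated_at)/(3600*24*7)) AS user_period,
          COUNT(DISTINCT e.properties_distinct_id) AS retained_users
   FROM users u
   JOIN events e ON e.properties_distinct_id = u.user_id
                AND e.metadata_timestamp >= u.activated_at
   WHERE u.activated_at >= getdate() - INTERVAL '11 week'
   GROUP BY 1, 2)

SELECT c.signup_date,
       c.user_period,
       c.retained_users,
       -- the week-0 row of each cohort is the cohort size
       round(100.0 * c.retained_users / s.retained_users, 1) AS pct_of_cohort
FROM cohorts c
JOIN cohorts s ON s.signup_date = c.signup_date
              AND s.user_period = 0
ORDER BY 1, 2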

Sales pipeline

Let’s use our Salesforce integration to analyze the reasons a lead didn’t convert into a paying customer. We can query the ‘opportunities’ table that originates from Salesforce.

SELECT Loss_Reason,
       Alternative,
       count(1) AS count
FROM opportunities
WHERE closed = TRUE
  AND won = FALSE
GROUP BY 1, 2

Loss_reason      | Alternative | Count
missing features | in-house    | 4
price            | in-house    | 3
...              | ...         | ...

Leads loss reason

Up until now, each of our queries used only one dataset at a time. Even though our custom solution allowed us to query our data at a higher resolution, we didn’t leverage one of the biggest benefits of building your own custom analytics solution: joining multiple different data sets.

In this example, we will join our analytics event data with sales data coming from Salesforce. These kinds of queries allow us to evaluate our product end-to-end and see how a specific click, for a specific keyword, for a specific ad, converted into a paying, recurring customer. While most analytics tools stop tracking the moment the conversation with the sales team moves offline, our integration allows us to keep using the sales data, even if a lead becomes a paying customer months after they first landed on our website.

For example, we can see a breakdown of the status of the lead in the sales pipeline by where the lead converted on our website.

Leads status by click source
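For reference, here is a hedged sketch of the kind of query behind such a breakdown, assuming the Salesforce leads table carries the visitor's distinct id in a custom field (called web_distinct_id__c here, a hypothetical name) so it can be joined back to the events table; the status and id column names are likewise assumptions that depend on how your Salesforce fields are mapped:

-- First page each visitor saw, joined to the current status of their Salesforce lead.
WITH first_touch AS
  (SELECT properties_distinct_id,
          page AS landing_page,
          ROW_NUMBER() OVER (PARTITION BY properties_distinct_id
                             ORDER BY metadata_timestamp) AS rn
   FROM events
   WHERE event = 'mp_page_view')

SELECT f.landing_page,
       l.status,                      -- lead status from Salesforce (assumed column name)
       count(DISTINCT l.id) AS leads
FROM leads l
JOIN first_touch f ON f.properties_distinct_id = l.web_distinct_id__c  -- hypothetical join key
                  AND f.rn = 1
GROUP BY 1, 2
ORDER BY leads DESC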

Phew, long post, hope it was useful! That's all for now. We would like to share more posts in the future that dive even deeper into analyzing and deriving insights from your data, so stay tuned and feel free to let us know if you have any ideas for us to write about.

Need help building your custom analytics solution? Contact us and tell us about your custom analytics use case.

Questions? Requests? Leave us a comment below.
