Amazon Athena - Initial Analysis

by Itamar Weiss  
3 min read  • 30 Nov 2016

Amazon Athena

AWS has just announced the release of Amazon Athena - an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. In this post we'll dive into what Amazon Athena is and how it compares to Amazon Redshift.

Athena has a couple of interesting characteristics:

  • Serverless - No need to setup or manage any servers or data warehouses. You can start querying your data wherever it's already stored. Just point Athena to your data in Amazon S3, define its schema, and you can start querying it. Athena takes care of spinning up all the required infrastructure for you behind the scenes.
  • Pay per query - Amazon chose the pay per query pricing model for Athena. Current pricing is $5 per terabyte scanned per query. Using compression, partitioning, and by storing your data in a columnar format you can get better performance and lower your costs.
  • Built on Presto - Amazon Athena runs Presto behind the scenes. Athena allows running standard ANSI SQL against the most common data formats such as CSV, JSON, ORC and Parquet. To learn more about Presto, skip to the "What is Presto?" section below.

Athena vs AWS Redshift

One might wonder why Amazon released Athena when it already offers Redshift as a data warehouse. As always, for some use-cases Athena over S3 will be the better choice, while for others Redshift might be superior.

Advantages of Athena:

  • Completely managed - No need to spin up clusters, decide which machine types to choose or to maintain your cluster (e.g. VACUUM).
  • Unlimited scale - Unlike Redshift, Athena's querying resources and storage resources are independent. You can store as much data as you like and still query parts of it cost-effectively, in high performance.
  • With Athena you can query the data where it lives in its original format. Nevertheless, different storage formats can have very different performance, and for efficient querying it's better to store the data in a columnar format such as ORC.

Since Redshift requires the user to define the data schema at load time, and allows the user to optimize storage by performing the VACUUM operation, performance might be better without special optimizations in Redshift.

In addition, Redshift still provides the fastest query performance for enterprise reporting and business intelligence workloads, especially those involving multiple joins and sub-queries.

What is Presto?

Presto is an open-source distributed SQL query engine optimized for low-latency, ad-hoc analysis of data. It supports the ANSI SQL standard, including complex queries, aggregations, joins, and window functions. Presto can process data from multiple data sources including the Hadoop Distributed File System (HDFS), Amazon S3, databases such as PostgreSQL and MySQL, and even Apache Kafka.

Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.

What makes Presto fast?

  • In-memory data during execution - Intermediate results are never persisted on disk between different stages of the query. Once the data is read from its source, it's stored and processed in-memory.
  • Pipelining and streaming - Presto was designed to minimize query latency: as soon as available data sources are discovered by the query engine, the queries start being executed without any delay and the results start to stream to the next stage of the query engine. Presto doesn't wait for one stage to finish before starting to process the next stage.
  • Advanced dev optimizations such as very careful coding of inner loops, using efficient flat-memory data structures that almost eliminates garbage collection and on-the-fly bytecode generation of parts of the queries.

Presto has multiple data connectors and it is designed to be pluggable - allowing the community to create new connectors with relative ease.

The Alooma Platform + Amazon Athena

Alooma is your real-time data pipeline as a service. With Alooma you can collect your data from all your different data sources (transactional databases, cloud services, server logs, etc.), transform it, organize it, partition it and load it to Amazon S3.

Structuring your data correctly can have a significant impact over your Athena costs and performance. Alooma helps you load your data in the right format in real-time so it is always ready for analysis at the lowest latency, highest performance, and lowest query cost.

If you are interested to learn more about how you can leverage your data using Alooma and Athena, contact us to discuss your data integration needs and join the Alooma + Athena beta group.

This might interest you as well