What is Amazon Web Services (AWS) ETL?
Amazon Web Services (AWS) is a cloud-based computing service offering from Amazon. AWS offers over 90 services and products on its platform, including some ETL services and tools. AWS Glue is a managed ETL service and AWS Data Pipeline is an automated ETL service. Also related are AWS Elastic MapReduce (EMR) and Amazon Athena/Redshift Spectrum, which are data offerings that assist in the ETL process.
Different AWS ETL methods
Method 1: Use AWS and an EMR Cluster on an S3 Bucket
Method 2: Use Athena or Redshift Spectrum to Analyze Data in Amazon S3
Both of these methods set up a data pipeline between your source and destination data stores. Once there are log files available in your Amazon S3 bucket, you can begin.
A prerequisite for both is that the S3 files for the largest table need to be in one of three formats: CSV, non-partitioned Parquet, and partitioned Parquet (Apache Parquet is a columnar storage format that makes data available to any project in the Hadoop ecosystem). The columnar format allows Redshift Spectrum to scan only needed columns, thereby saving Amazon Redshift charges.
The basic ETL steps are:
- Create external schema
- Define external tables
- Query data
Method 3: Use AWS Glue - ETL AWS GLUE is a fully managed ETL service run from the AWS Management Console.
In addition to enabling user friendly ETL, it also allows you to catalog, clean, and move data between data stores. Since Glue is on a pay-per-resource-used model, it is cost efficient for companies without adequate programming resources.
The user creates a data catalog, generates transformations, and schedules/runs ETL jobs in the console. Glue uses Python to generate the ETL code to run in the Glue Apache Spark environment. Schema discovery is automated, too.
Glue is fully integrated with other AWS data services. Data held on Amazon VPC’s in MySQL or PostgreSQL databases can also be queried. Developers can customize and port the Python code used in Glue anywhere. Logs and notifications are pushed to Amazon CloudWatch for monitoring and alerts.
Pros and cons
Using the Data Pipeline methods require skilled programming resources. The trade-off is that your ETL processes and queries can be customized in any way you need. For businesses with highly individualized data needs this method is likely worth the extra cost. Method 2 is advantageous for businesses with strong database administrators or SQL programmers and for those with multiple datasets on AWS. For those with limited programming resources and data analysis time, using Glue can provide the benefits of ETL without the overhead. However, as with any managed solution, if you do not or cannot customize the code then some analysis nuances may be lost.