Loading your data into Amazon Redshift might seem like a straightforward task, but in order to fully enjoy the benefits of having your data stored neatly in a data warehouse, some routine maintenance is required. Vacuuming your carpet is not the world's most pleasant task. But if you don’t, you’ll notice when the dirt builds up. Redshift has it's own set of maintenance tasks, namely VACUUM and ANALYZE operations. In this post we will briefly explain why they are needed and how they optimize the performance of your data cluster, before presenting our open-source tool to do the dirty job for you.
Why VACUUM and ANALYZE?
Amazon Redshift breaks down the UPDATE function into a DELETE query, followed by an INSERT of the updated tuple. Moreover, any deleted row is actually marked for deletion (logically deleted), but in fact still resides on the disk. Since disk I/O operations are time consuming, Redshift doesn't immediately delete all the values from the disk. While in the short term this behavior results in faster performance, in the long run you accumulate a large amount of deleted rows. Besides the obvious waste of space, your query performance is degraded because the query processor has to scan many deleted rows. In this situation, we want to run a VACUUM operation to get rid of the deleted rows, reclaim space and make sure further queries will not scan through lots of deleted rows.
Another advantage of frequently vacuuming your tables, is that the operation ensures data is fully sorted on the disk, in case the table has a sort key. Upon loading, the data is stored in an optimal way to support certain query operations, such as range-restricted scans. However, after many INSERT, UPDATE and DELETE transactions, the table gradually becomes unsorted, resulting in subpar performance.
There are not many disadvantages to vacuuming your table as much as possible. In fact, running your VACUUM operation often is certainly recommended, though not strictly required in situations when you don't run UPDATE and DELETE queries. Since VACUUM is a heavy I/O operation, it might take longer for larger tables and affect the speed of other queries. Therefore, it is recommended to schedule your vacuums during the time when the activity is minimal.
Another periodic maintenance tool that improves Redshift's query performance is ANALYZE. It updates the metadata and the statistics of a table, details that are later used by the Query Optimizer to process query requests. When a large number of rows have been updated or inserted, the table statistics may become outdated. In such cases, it is highly recommended to run an ANALYZE command, especially on columns that are involved in joins, sorting or grouping operations. Just like VACUUM, ANALYZE is a useful operation that would improve your system performance for the price of a slight discomfort.
Our Solution: Open-Source Vacuumer-Analyzer Tool
At Alooma, we care about your comfort. That's why we created Vacuum-Analyze-Redshift to help you schedule your maintenance tasks. Before you run it, all you need to install is Docker. We wrapped Amazon's Redshift-Utils code inside a slim Docker container, so you don't have to worry about building it. You are welcome to study the documentation of Amazon to understand how their script works. We believe that in most cases the default values set are sufficient.
Let's demonstrate how the task can be run by presenting three potential types of users:
- General User - By providing your mandatory Amazon Redshift credentials, this basic shell script will connect to the database and perform VACUUM and ANALYZE operations. If DB_SCHEMA is not provided, the tool will work on your 'public' schema. If you wish to export the log from the container into your machine, follow Docker Tutorial to understand how Volumes work.
docker run --rm \ -e DB_NAME=your_db_name \ -e DB_USER=your_db_user \ -e DB_PWD=your_db_pass \ -e DB_HOST=your.redshift.cluster \ [-e DB_SCHEMA=your_schema ] \ # default: 'public' [-v /var/log/log.txt:/your/path] \ # use this to define a volume alooma/vacuum-analyze-redshift
- Intermediate User - You can set up an e-mail notification system to receive the log of the operation once it's completed. This way, you can run the operation in the background and get the report at your convenience. Make sure you set up the SEND_EMAIL flag to True and provide all the required E-mail credentials.
docker run --rm \ -e DB_NAME=your_db_name \ -e DB_USER=your_db_user \ -e DB_PWD=your_db_pass \ -e DB_HOST=your.redshift.cluster \ -e SEND_EMAIL=True \ -e EMAIL_HOST=smtp.gmail.com \ -e EMAIL_PORT=587 \ -e EMAIL_SENDERfirstname.lastname@example.org \ -e EMAIL_PWD=senderpassword \ -e EMAIL_RECIPIENTemail@example.com \ [-e DB_SCHEMA=your_schema ] \ alooma/vacuum-analyze-redshift
- Advanced User - In Amazon's repository documentation there is a great explanation for those who would like to customize their script and include more parameters.
docker run --rm \ -e DB_NAME=your_db_name \ # mandatory -e DB_USER=your_db_user \ # mandatory -e DB_PWD=your_db_pass \ # mandatory -e DB_HOST=your.redshift.cluster \ # mandatory -e DB_SCHEMA=your_schema. \ # default: 'public' -e OUTPUT_FILE=... \ # default: /var/log/log.txt -e VACUUM_FLAG=.. \ # default: True -e ANALYZE_FLAG=... \ # default: True -e DEBUG=... \ # default: True -e IGNORE_ERRORS=... \ # default: False -e SLOT_COUNT=... \ # default: 2 -e MIN_UNSORTED_PCT=... \ # default: 5 -e VACUUM_PARAM=... \ # default: FULL -e MAX_UNSORTED_PCT=... \ # default: 50 -e DELETED_PCT=... \ # default: 15 -e STATS_OFF_PCT=... \ # default; 10 -e MAX_TABLE_SIZE_MB=... \ # default: 700*1024 -e SEND_EMAIL= \ # default: false -e EMAIL_HOST=... \ # mandatory if SEND_EMAIL is true -e EMAIL_PORT=... \ # mandatory if SEND_EMAIL is true -e EMAIL_SENDER=... \ # mandatory if SEND_EMAIL is true -e EMAIL_PWD=... \ # mandatory if SEND_EMAIL is true -e EMAIL_RECIPIENT=... \ # mandatory if SEND_EMAIL is true alooma/vacuum-analyze-redshift
Automate the Task
Using the cron utility of *nix operating systems, you can schedule the above-mentioned script to run periodically at a given time. For example, suppose you would like to run your Vacuum/Analyze task on Mondays at 3:15 AM. Open your terminal. Save the script in a known path and make sure it has executing permissions. Enter:
$ crontab -e
Then, add into your crontab file the following line:
$ 15 3 * * 1 /path/to/script.sh 1>/dev/null 2>&1
Make sure you saved the file before you exit. You can check if your cron job has been saved by typing:
$ crontab -l
Cron is a powerful tool and the above example is just a basic introduction. You are welcome to find a tutorial and create more complex tasks.
The script provided above is not perfect. It will certainly make your life easier, yet in situations when your database is too large, a simple VACUUM command might be insufficient. A common error we encounter is a Column Limit Exceeded Error, which is well documented on Amazon Redshift website. However, the solutions they recommend are not without drawbacks. Increasing wlm_query_slot_count might result in a much lower performance of other queries, while performing a deep copy is another slow, I/O heavy operation.