Are there any open source scheduling tools for AWS Redshift?
One common periodic task that comes up a lot with AWS Redshift users is running timely Vacuum and Analyze.
When you are using a third-party ETL pipeline-as-a-service (shameless plug - I work in one :-)), they take care of keeping your Redshift cluster clean and healthy. But in case you need to manage it on your own, you need to keep in mind the importance of Vacuum and Analyze operations.
AWS Docs provide a great explanation of why each operation is needed, so I will provide a tl;dr version of it:
- Redshift queues delete operations from the disk to save on I/O and network latency, so DELETE operations only logically delete the records (soft delete), but they still occupy disk space.
- Redshift handles UPDATE as a DELETE followed by an INSERT of the new values, suffering from the same issue of soft deletes.
- Having many logically deleted records is expensive not only in terms of disk space, but also in terms of the performance of queries (traversing many “dead” records) and inserts (maintaining order by key).
- Redshift query optimizer uses different statistics and metadata to select the best execution plan. Having many deleted rows skews these statistics and makes the optimization…well, suboptimal.
The solution to all these pains is to run VACUUM operation - which forces hard delete and re-sorts the table, and ANALYZE operation - to recalculate the metadata and help your query optimizer to do the work.
I have written an open-source tool that can run these tasks for you, sending e-mail once they are completed. It can be scheduled using the cron utility. I hope you find it useful!
Published at Quora. See Original Question here