Building a Professional Grade Data Pipeline

by Rami Amar  
5 min read  • 11 Jul 2018

Customer after customer, the same question almost always comes up:

“Well, why don’t we just build the pipeline ourselves?”

Frankly, you could.

But at what cost? If you are currently maintaining an in-house system, then you know first-hand the obstacles to building a reliable, feature-rich data pipeline.

This is why we created Alooma - to help you avoid the pitfalls and headaches of building data pipelines. But hey, I'm an engineer too. I get that sometimes you have to get your hands dirty and try to do it yourself. So, if you're still insisting on the "DIY approach", we want to help you foresee the usually unplanned issues you may encounter while building data pipelines.

Every Pipeline Starts With A Simple Script

“Copy data from A to B? I can write a script for that in 2 days!”

True. You probably could. Whenever you need to import data from a new source (e.g. your server logs) to a new destination (e.g. Redshift), call up any one of your developers, ask them to research the problem and write a solution... Ta-da! You’ll get a script up and running in just a few days. Whoever gets the challenge will spend a couple of days reading about Redshift’s best practices for loading data, and about monitoring files & directories for changes. The solution is a pretty straightforward script, easily implemented in Python:

  • Monitor a directory for new files (Python watchdog)
  • For each new file:
    • Convert the file to Redshift’s acceptable format
    • Copy the new files to S3 (Python boto or s3cmd)
    • Issue a Redshift Copy command
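
The steps above can be sketched roughly as follows. This is a minimal, happy-path sketch and not a production script: it polls the directory with the standard library instead of using watchdog, it assumes JSON-lines input with hypothetical `ts`, `level` and `msg` fields, and the `upload` and `copy` arguments are hypothetical callables standing in for your S3 upload and your Redshift COPY call.

```python
import csv
import json
from pathlib import Path


def convert_to_csv(src: Path, dest_dir: Path) -> Path:
    """Convert a JSON-lines log file into a CSV file that COPY can read."""
    dest = dest_dir / (src.stem + ".csv")
    with src.open() as fin, dest.open("w", newline="") as fout:
        writer = csv.writer(fout)
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            event = json.loads(line)
            # Assumed three-column schema: timestamp, level, message.
            writer.writerow([event["ts"], event["level"], event["msg"]])
    return dest


def process_new_files(watch_dir: Path, out_dir: Path, seen: set, upload, copy) -> list:
    """Run one polling pass: convert, upload and load every unseen *.log file."""
    loaded = []
    for path in sorted(watch_dir.glob("*.log")):
        if path.name in seen:
            continue
        csv_path = convert_to_csv(path, out_dir)
        s3_key = upload(csv_path)  # hypothetical: wraps your S3 client
        copy(s3_key)               # hypothetical: issues the Redshift COPY
        seen.add(path.name)
        loaded.append(s3_key)
    return loaded
```

Note everything this sketch ignores: files that are still being written, malformed lines, failed uploads, and a `seen` set that vanishes on restart. Those are exactly the leaks discussed below.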

For a data engineer who is experienced with Redshift and Python, this is a week's worth of coding. Unfortunately, it never ends there. The "happy path" in data pipelines is never followed, and when life deviates, not only does your data leak, but (even worse) your engineer will wake up unhappy in the middle of the night to fix it.

The First Leak: Schemas Change

The one part of the script that you do have to write on your own (and not just learn how to use), is the part that converts the server logs' format, or schema, to a schema that Redshift's COPY accepts. Your server logs might be in syslog format, Log4J format, JSON lines, or even (G-d forbid) XML. Redshift accepts tabular formats: TSV, CSV, and anything in between. About a year ago, Amazon even added support for COPYing JSON lines. Converting schemas (aka "parsing") is a tedious job most programmers hate. It requires meticulous attention to every detail of the data's format: do the commas have a space after them? Do the timestamps have milliseconds and a timezone? Does the number always contain a decimal point or just sometimes? This is just the beginning... And sometimes you don't know all the details of the schema ahead of time, so you write your script and then end up running to fix it about 20-30 times. And then again a week later when a really rare log message arrives. As Mike Driscoll quips,

"The best minds of my generation are deleting commas from log files, and that makes me sad."

It makes us sad, too.
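
To give a feel for those details, here is a small sketch of the parsing step. It assumes a hypothetical log layout of `<timestamp> <LEVEL> <message>`, and it assumes that timestamps without a timezone are UTC; your logs will almost certainly differ, which is the whole point.

```python
import re
from datetime import datetime, timezone

# Assumed layout: "<timestamp> <LEVEL> <free-text message>".
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")

# Timestamps may or may not carry milliseconds and a timezone.
TS_FORMATS = (
    "%Y-%m-%dT%H:%M:%S.%f%z",
    "%Y-%m-%dT%H:%M:%S%z",
    "%Y-%m-%dT%H:%M:%S.%f",
    "%Y-%m-%dT%H:%M:%S",
)


def normalize_timestamp(raw: str) -> str:
    """Return the timestamp in UTC as 'YYYY-MM-DD HH:MM:SS.ffffff'."""
    for fmt in TS_FORMATS:
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue  # try the next known variant
        if parsed.tzinfo is None:
            # Assumption: timestamps without a timezone are already UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S.%f")
    raise ValueError(f"unrecognized timestamp: {raw!r}")


def parse_line(line: str):
    """Return (timestamp, level, message), or None if the line does not match."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    return (normalize_timestamp(match["ts"]), match["level"], match["msg"])
```

Every new timestamp variant means another entry in `TS_FORMATS`, and every line that returns `None` is a row missing from your table until someone notices.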

Yes, the schema game goes on... Schemas change regularly. It's part of a healthy, growing application (and business). With every new feature there are more messages and more details to include in the schema and transfer to your database. Schemas also change on the output side. Your analysts thought they needed the data in one table structure, but once the table has been populated, they realize they actually need a different structure. If you're lucky, a simple "ALTER TABLE" will do the trick, but in our experience, you will probably end up dropping the table, recreating it, and reloading the data. Twice.
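
For the lucky "ALTER TABLE" case, a sketch of detecting new fields in an incoming event might look like the following. The table name, column names and the type mapping are all assumptions made for illustration, and the code only builds the statements without running them. Field names are not sanitized here, so do not feed it untrusted input as-is.

```python
def infer_column_type(value) -> str:
    """Map a Python value to an assumed Redshift column type."""
    # bool must be checked before int, since bool is a subclass of int.
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE PRECISION"
    return "VARCHAR(256)"


def plan_schema_changes(table: str, known_columns: set, event: dict) -> list:
    """Return ALTER TABLE statements for event fields the table lacks."""
    statements = []
    for name in sorted(event):
        if name not in known_columns:
            col_type = infer_column_type(event[name])
            statements.append(
                f'ALTER TABLE {table} ADD COLUMN "{name}" {col_type};'
            )
    return statements
```

This only covers added fields. Renamed fields, changed types, and restructured tables are where the dropping and reloading starts.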

The Second Leak: The Inputs / Outputs Fail

So you've handled all the different schema corners, and your analysts are happy with how the tables are structured. You've even implemented some sort of schema version management solution (like Confluent's Schema Registry). Unfortunately, your data engineer is still being paged in the middle of the night to fix leaks. Leaks that happen because your inputs and outputs have their own set of regular failures.

Going back to our directory monitoring script, it is definitely not error free: the machine may run out of disk space; the program writing the files may have errors; your script which monitors the directory needs to be restarted after an OS reboot; the DNS server fails and your script can't resolve your Redshift cluster's IP address. Redshift, although an incredible product by Amazon, has its issues too: the number of concurrent transactions maxed out and loading is stuck; someone ran a VACUUM on the table, and issuing a COPY command made everything freeze; Redshift is low on free space and some complex query brought it down to 0; or, you decided to resize your cluster (*gulp*). Trust us - we've seen it all.
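
A common first defense against failures like these is retrying with exponential backoff. The sketch below is one way to do it, under assumptions: `TransientError` is a hypothetical exception class that your own upload and load wrappers would raise for retryable problems such as a DNS hiccup or a busy cluster, and the delays are illustrative.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation(), retrying on TransientError with doubling delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and let the caller (or the pager) handle it
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Retries help with short outages. They do nothing for a full disk or a cluster resize, and they can hide a problem for hours if nobody is alerted when the attempts run out.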

Every time you have a leak, whether due to schema changes or input/output failures, sealing it is the first step. The second, and often forgotten, step is recovering the data.

The Third Leak: Recovering From the First Two Leaks

By now you should have an ironclad script - resilient to all schemas and failures of the past. It is not without scars - to your data or to your data engineer. Scars to your data - when leaks bring data to an unrecoverable state. Sometimes you can miss out on an hour of data, sometimes on a few days, and sometimes you realize your data has been skewed for weeks. We once met a CEO who told us it took his business over 6 months to realize their metrics were completely erroneous. This not only frustrates your data analysts, it hurts your business's ability to make reliable data-driven decisions. Scars to your data engineer - those are numerous. Other than the sleepless nights and work-full weekends, your data engineer may spend a few days recovering from a single leak: digging out files and offsets to understand where loading stopped; querying database tables to understand what has already been loaded; reading messy Python log files and stack traces. Again, we haven't just seen it; we roll up our sleeves and dive in when our customers need it.
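
Much of that digging for files and offsets can be avoided if the script records how far it got. Here is a minimal sketch of offset checkpointing for a single log file. It is a generic technique, not a description of how Alooma works, and `load_batch` is a hypothetical callable that loads a list of lines into your destination.

```python
import json
import os
from pathlib import Path


def read_checkpoint(path: Path) -> dict:
    """Return the saved {file name: byte offset} map, or {} if none exists."""
    if not path.exists():
        return {}
    return json.loads(path.read_text())


def write_checkpoint(path: Path, state: dict) -> None:
    """Write the checkpoint via a temp file and an atomic rename."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)


def load_from_checkpoint(log_path: Path, checkpoint_path: Path, load_batch) -> int:
    """Load complete lines added since the last checkpoint; return new offset."""
    state = read_checkpoint(checkpoint_path)
    offset = state.get(log_path.name, 0)
    with log_path.open("rb") as f:
        f.seek(offset)
        data = f.read()
    # Only consume up to the last newline; a partial line waits for next time.
    end = data.rfind(b"\n") + 1
    if end == 0:
        return offset
    load_batch(data[:end].decode("utf-8").splitlines())
    # Record progress only after the batch was loaded successfully.
    state[log_path.name] = offset + end
    write_checkpoint(checkpoint_path, state)
    return offset + end
```

Because the checkpoint is written after the load, a crash between the two steps replays the last batch on restart. That means duplicates rather than lost data, so the destination still needs a way to deduplicate.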

What do we do with leaks? Well, we came to an understanding that they're inevitable. So we developed our Restream concept, and invested significant efforts into tracking every (yes, every) event in the pipeline. (Come back soon to read about Alooma's Restream, and learn how your data engineer can sleep better and never lose data)

The Fourth & Hardest Leak: Your Business Scales

If your data pipeline starts and ends with server logs and Redshift, we envy your peace of mind. But humor aside, even a small 10-person mobile app shop has more data pipelines than that. As your business scales, your data scales. First, you have more users, and more users means more data. Your battle-tested, ironclad script was good for 100,000 users, but with 1,000,000 - it's just not keeping up. You may quickly find yourself in need of distributed stream processing, and scaling a data pipeline to run on a cluster... well, you are in the big leagues now.

Next, you develop another app, or have a 3rd party service you plan to discontinue. This means your demand for input sources and output databases increases. Consequently, the number of potential schema changes and failures multiplies. The engineering resources needed to build a data pipeline running on top of Kinesis, Kafka, Storm, Spark Streaming, AWS Lambda (or whichever technology you choose) may easily reach months and even years of work hours. Your data engineering becomes a crucial and constant struggle.

A Final Note About Data Engineering

All those leaks and data scars - that's our business. As corny as it sounds, we're excited to hear about every data pipeline architecture. We come to the office every morning, eager to design the next tool to help make your data engineering processes fast, clean, and efficient. Utilizing the best open source technologies, we connect these tools to create a simple yet powerful platform.

If building this yourself is the best option for you, we welcome you as a brother or sister in arms (and scars). But if you realize it's best for you not to go down the DIY path, we are here to help you build a professional grade data pipeline solution, from any input, to any output. Have ideas or suggestions on how we can better assist you? Leave us a comment or feel free to contact us.
