Data Pipelines of Tomorrow

by John Hammink  
7 min read  • 6 Dec 2018

By the time people built systems that imported user data at fixed intervals (for example, banks running nightly ETL uploads), they had also begun to see that input data could provide an effective feedback loop on the system itself. By then, data was not just a message but a key part of how the data pipeline, or the organization using it, would construct and harmonize itself.

In business systems, analytics data was also used to improve the process or product in question. Banking data, for instance, was fed back to the consumer as account balance statements, while also being used to optimize the business process. The data was used to automatically calculate incentive interest rates and fees for account holders, for example, and determine for product owners which demographic preferred which financial products.

Where things are today, and where they are headed

Nowadays, data and data pipelines are ubiquitous. Data no longer flows only from nightly batch ingestion to central data stores and out to user dashboards; it typically flows in both directions. Consumer devices may even have their own data pipelines built in, which provide input and feedback to the larger system. This polydirectional flow of data is just one of many factors causing the amount of data in the greater datasphere to grow exponentially.

Indeed, as IDC points out, the 16.1 zettabytes of user data generated around the world in 2016 is expected to grow tenfold to 163 zettabytes by 2025. Far from the days of nightly import cycles at the bank, users in this world will be interacting with a data-driven endpoint on average once every 18 seconds.
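To put those figures in perspective, here is a small back-of-the-envelope calculation. It uses only the numbers quoted above and assumes, purely for illustration, that growth compounds at a constant annual rate between 2016 and 2025; the variable names are our own.

```python
# Figures quoted in the text above (attributed there to IDC).
ZB_2016 = 16.1
ZB_2025 = 163.0
YEARS = 2025 - 2016

growth_factor = ZB_2025 / ZB_2016

# Assumption: a constant compound annual growth rate between the two endpoints.
annual_rate = growth_factor ** (1 / YEARS) - 1

# One interaction every 18 seconds, expressed per day.
SECONDS_PER_DAY = 24 * 60 * 60
interactions_per_day = SECONDS_PER_DAY / 18

print(f"growth factor: {growth_factor:.1f}x")
print(f"implied annual growth: {annual_rate:.1%}")
print(f"interactions per day: {interactions_per_day:.0f}")
```

Under that assumption, a tenfold increase over nine years works out to a bit under 30% growth per year, and one interaction every 18 seconds is 4,800 interactions per day.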

To get a better sense of this future, we'll look at data — and data pipelines — from a few different perspectives: which direction the data of the future will flow, what data engineers can expect with distributed ledgers and blockchain technologies, and how regulatory compliance will work in a future with the immutable, ordered event log. We'll also consider pipeline requirements like those of scalability, performance, and design(ability) for our future pipelines.

From unidirectional to core-to-endpoint

Today, data often flows in near real time and in many directions, is nearly ubiquitous for users, and can even help save lives. Consider the core-to-endpoint (also known as core-to-edge, or C2E) data pipeline.

Figure: A core-to-endpoint system

In the past, a data pipeline was something where data went in one end (often as a batch import) and came out the other end, in the form of analytics or a dashboard that helped (an often fairly limited group of) users understand the data.

In a C2E model, data may run polydirectionally, that is, from many points of ingestion back to central or edge data stores for processing, aggregation or analytics, and then back out to endpoint devices or dashboards for more processing. The data can also serve as instructions or training data for subsequent systems that run on AI (more on this later). There is no longer a one-size-fits-all data pipeline.

Where can we see examples of C2E pipelines?

  • IoT
  • Automotive
  • Mobile real-time data
  • Massively multiplayer online games
  • Infrastructure
  • And more...

Blockchain and the distributed ledger

Order is critical for transactions on the blockchain. You could expect that, if events were written in an arbitrary order, it would be impossible to reconstruct the state of your data at any point in time, or who did what to whom and when in a given transaction.
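A minimal sketch of why order matters, assuming a hypothetical account ledger with a "no overdrafts" rule (the `Event` type and `replay` function are our own illustration, not any particular blockchain's API). Replaying the same events in a different order produces a different state, and replaying a prefix of the log gives the state at an earlier point in time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    account: str
    amount: int  # positive = deposit, negative = withdrawal


def replay(events):
    """Rebuild account balances by applying events in the order given."""
    balances = {}
    for event in events:
        new_balance = balances.get(event.account, 0) + event.amount
        if new_balance < 0:
            continue  # hypothetical rule: a withdrawal that overdraws is rejected
        balances[event.account] = new_balance
    return balances


log = [Event("alice", 100), Event("alice", -80)]

in_order = replay(log)                  # deposit, then withdrawal
out_of_order = replay(reversed(log))    # withdrawal is rejected before the deposit lands
as_of_first_event = replay(log[:1])     # state after only the first event

print(in_order, out_of_order, as_of_first_event)
```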

However, whenever data is partitioned and distributed across a network, one must consider the CAP Theorem, that is, the idea that a user of such a network may, at scale, need to tweak the tradeoff between data consistency and availability. For this reason, we expect to see more users implementing their distributed ledgers as tunably-consistent distributed databases.
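One common way such a tradeoff is exposed is through read and write quorums, as in Dynamo-style databases. The sketch below is an illustration under that assumption, not a description of any specific ledger product: with N replicas, a write acknowledged by W replicas and a read that consults R replicas are guaranteed to overlap on at least one replica when R + W > N. The function names are hypothetical.

```python
def quorums_overlap(n_replicas: int, write_acks: int, read_acks: int) -> bool:
    """True when every read quorum must share a replica with every write quorum."""
    return read_acks + write_acks > n_replicas


def replicas_allowed_down(n_replicas: int, acks: int) -> int:
    """How many replicas can be unavailable while an operation still gets its acks."""
    return n_replicas - acks


N = 3
for w, r in [(2, 2), (3, 1), (1, 1)]:
    print(
        f"N={N} W={w} R={r}: "
        f"overlap={quorums_overlap(N, w, r)}, "
        f"writes survive {replicas_allowed_down(N, w)} replica(s) down, "
        f"reads survive {replicas_allowed_down(N, r)} replica(s) down"
    )
```

Lower values of W and R keep the system available when more replicas are down, at the cost of reads that may miss the latest write.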

Pub-Sub, compliance, and the immutable, ordered, event log

Currently, a pub-sub architecture is commonly what moves data from one datastore or location to another.

How does it work? Solutions like Apache Kafka, Apache Pulsar and Amazon Kinesis have producers publishing all of their items sequentially to some form of a log, while consumers pick and choose (or subscribe to) the sort of events they will consume. The pub-sub mechanism prevents the N*N complexity of a point-to-point integration, while removing complex data routing rules from producers by putting the onus on data consumers to consume based on preference.

Figure: The pub-sub mechanism
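The mechanism can be sketched in a few lines of Python. This is a toy in-memory log meant only to show the shape of the idea; the `EventLog` class and its methods are hypothetical and are not the API of Kafka, Pulsar, or Kinesis. Producers append to one sequential log without knowing who reads it, and each consumer tracks its own offset and filters for the topics it subscribed to.

```python
from collections import defaultdict


class EventLog:
    """A toy append-only log with per-consumer read offsets."""

    def __init__(self):
        self._entries = []                 # sequential list of (topic, payload)
        self._offsets = defaultdict(int)   # consumer name -> next offset to read

    def publish(self, topic, payload):
        """Append an event and return its offset in the log."""
        self._entries.append((topic, payload))
        return len(self._entries) - 1

    def poll(self, consumer, topics):
        """Return unread payloads for the given topics and advance the offset."""
        start = self._offsets[consumer]
        self._offsets[consumer] = len(self._entries)
        return [payload for topic, payload in self._entries[start:] if topic in topics]


log = EventLog()
log.publish("orders", {"id": 1})
log.publish("clicks", {"page": "/home"})
log.publish("orders", {"id": 2})

billing_batch = log.poll("billing", {"orders"})
analytics_batch = log.poll("analytics", {"orders", "clicks"})
billing_again = log.poll("billing", {"orders"})  # nothing new since the last poll

print(billing_batch, analytics_batch, billing_again)
```

Adding a new consumer requires no change to any producer, which is what keeps the integration from growing point-to-point.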

With an immutable log where all events are written, how would you comply with rules like those of GDPR, which stipulate, among other things, that EU citizens can withdraw consent to store their personal data? The rules also dictate that a user's data must be kept secure, up-to-date, and restricted to the minimum necessary.

You can, of course, simply discard the PII (personally identifiable information) before actually writing it (this also solves the problem of data storage). Yair Weinberger of Alooma, during a recent Q&A session at the Alooma CONNECT conference, proposed three other ways to secure a partitioned event log:

  1. If the data is impossible to access, it doesn't exist. Write different partitions of a log to different locations. When permissions sunset, remove access to associated data locations.
  2. Automatically anonymize identifying information via a hash as it's created or written.
  3. Encrypt different partitions with a different key. When the time's up, revoke/destroy the key.

Figure: Encrypting the event log by partition
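Options 2 and 3 can be sketched with the Python standard library. Everything here is a hypothetical illustration: `pseudonymize`, `PartitionedLog`, and the partition names are our own, and the cipher is a deliberately simple stand-in (a SHA-256 keystream XORed with the data) so the example has no dependencies. A real pipeline would use a vetted authenticated cipher and a proper key management service.

```python
import hashlib
import hmac
import secrets


def pseudonymize(value: str, secret: bytes) -> str:
    """Option 2: replace an identifier with a keyed hash before it is written."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


def _xor_cipher(data: bytes, key: bytes, nonce: bytes) -> bytes:
    # Stand-in cipher for illustration only; NOT suitable for production use.
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


class PartitionedLog:
    """Option 3: each partition is encrypted under its own key."""

    def __init__(self):
        self._keys = {}      # partition -> key (would live in a key store)
        self._records = {}   # partition -> list of (nonce, ciphertext)

    def append(self, partition: str, plaintext: bytes) -> None:
        if partition not in self._keys:
            self._keys[partition] = secrets.token_bytes(32)
        nonce = secrets.token_bytes(16)
        ciphertext = _xor_cipher(plaintext, self._keys[partition], nonce)
        self._records.setdefault(partition, []).append((nonce, ciphertext))

    def read(self, partition: str):
        key = self._keys[partition]  # raises KeyError once the key is destroyed
        return [_xor_cipher(c, key, n) for n, c in self._records[partition]]

    def destroy_key(self, partition: str) -> None:
        """The ciphertext stays in the immutable log but can no longer be read."""
        self._keys.pop(partition, None)


hash_secret = secrets.token_bytes(32)
user_token = pseudonymize("alice@example.com", hash_secret)

plog = PartitionedLog()
plog.append("user-42", b"signed up")
before = plog.read("user-42")
plog.destroy_key("user-42")
```

After `destroy_key`, the log itself is untouched, which is the point: nothing is rewritten, yet the partition's contents are no longer recoverable through this interface.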

In addition to those methods, it's already possible to simply query logs and perform analysis remotely, thus avoiding — or minimizing — local data storage requirements.

The data engineer of the future will be able to choose — and will need to decide — between these methods.

Scalability, reliability, and performance

To grow to scale, data pipeline owners may need to make a few decisions about the data that they store at rest. In the future, the quantity of data generated even within a system will likely outgrow the capacity to store it all. Thus, data engineers of the future will need to consider the following questions:

  1. Which data is to remain volatile (in memory only) and temporary?
  2. Which data is kept persistent and stored somewhere?

For the data that is stored, a pipeline's storage capacity will need to massively autoscale, while handling increasingly ambiguous formats.

This explains why we now see data pipelines with several different kinds of data stores running side by side. Elasticsearch, for example, works great for storing unstructured (or semi-structured) text-based data, and might be run alongside Redis (a key-value store) where super-fast lookups are needed, or a distributed database containing a ledger.

Reliability: How MTTF → ∞

The reliability of a system is typically measured by its mean time to failure (MTTF). MTTF can mean either the amount of time a system is up before it fails, or the number of events it can handle before it encounters a failure.
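Both readings reduce to the same arithmetic: the average of what was observed before each failure. A minimal sketch, with made-up sample numbers and a hypothetical function name:

```python
def mean_to_failure(observations):
    """Average the uptime (or event count) observed before each failure."""
    if not observations:
        raise ValueError("need at least one observed failure")
    return sum(observations) / len(observations)


# Hypothetical samples: hours of uptime before each of three failures.
mttf_hours = mean_to_failure([720, 1440, 360])

# Hypothetical samples: events processed before each of two failures.
mean_events = mean_to_failure([1_000_000, 3_000_000])

print(mttf_hours, mean_events)
```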

Simply put, the MTTF of data pipelines will approach, but never reach, infinity. With pipelines and their data growing in volume and becoming increasingly critical, failures that affect the availability of a system will only become less tolerated over time. Of course, the impact of faults can be mitigated somewhat in some parts of the pipeline by tuning the consistency/availability tradeoff, as mentioned earlier.

Performance in the future

On a similar note, we predict that the latency of access to data stores — and the time it takes to run queries — will continue to shrink. This has much to do with the need for data stores to be in sync as much as possible and for users to be able to act on the results of their data in real time. Computational performance and latency will continue to improve — with processing times shrinking — whether processing is happening at the core or at an endpoint device.

Designing the pipeline of tomorrow

Do you know SQL? Many have observed that the next step involves dissolving the conceptual barrier between data in motion and data at rest, so that all data in a pipeline will be available from a single query interface. Confluent already talks about the table/stream duality. These days, we're seeing the beginning of a trend where data can be queried from anywhere in the pipeline, even when in motion, using a streaming SQL variant such as KSQL, SQL on Amazon Kinesis, Apache Calcite, or Apache Pulsar SQL.
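The table/stream duality can be illustrated without any streaming engine. The sketch below is our own plain-Python illustration, not KSQL or any vendor's API: a table is what you get by folding a stream of keyed updates, and a stream is what you get by emitting the changes between two versions of a table. A value of `None` is used here, by assumption, to mean "delete this key".

```python
def table_from_stream(changelog):
    """Fold a stream of (key, value) updates into the latest value per key."""
    table = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)   # assumed convention: None deletes the key
        else:
            table[key] = value
    return table


def stream_from_table(old, new):
    """Emit the (key, value) changes that turn table `old` into table `new`."""
    for key in old.keys() - new.keys():
        yield (key, None)
    for key, value in new.items():
        if key not in old or old[key] != value:
            yield (key, value)


changelog = [("alice", 100), ("bob", 50), ("alice", 20), ("bob", None)]
balances = table_from_stream(changelog)

# Turning the table back into a stream and folding it again gives the same table.
round_trip = table_from_stream(stream_from_table({}, balances))

print(balances, round_trip)
```

A single query interface over both forms is plausible precisely because each can be derived from the other.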

We also believe that pipelines will be, like Alooma already is today, available as a service, with full, common configurations available out-of-the-box.

We've looked at a few aspects of data — and data pipelines — of the future: directionality, compatibility with emerging technologies, and regulatory compliance with the immutable, ordered event log. We've also looked at scalability, performance, and design(ability) in the future.

These days, enterprises already use Alooma to pull in data from disparate sources, transform it, map it, clean it, customize and enrich it before loading it into a data destination of choice. As data transforms and pipeline requirements change, we believe the best way to futureproof your data pipeline is to go with the best-in-class. Contact us to learn more.
