What is StreamSets?

by Alooma Team
Updated Dec 8, 2017

StreamSets is a cloud-native suite of products designed to control data drift: unplanned changes in data, data sources, data infrastructure, and data processing. The company calls its suite a data operations platform. Its features include a living data map, performance management indices, and smart pipelines that provide a level of control comparable to common business operations systems.

StreamSets provides two products: the Data Collector and the Dataflow Performance Manager (DPM). The Data Collector is an open source application that lets users build platform-agnostic data pipelines optimized for continuous, low-latency ingestion. Users can build batch and streaming dataflows with minimal coding. The Dataflow Performance Manager controls multiple dataflows within a visual user interface, where baselines and Key Performance Indicators (KPIs) are measured.

StreamSets and ETL

As with many newer products, StreamSets’ flexibility extends beyond traditional Extract, Transform, and Load (ETL). The DPM and Data Collector are useful for a variety of data management applications, such as real-time data mapping and maintaining corporate architecture documentation. With its real-time dataflow measurements, StreamSets also integrates easily with Agile processes.

StreamSets approaches ETL differently, providing ETL upon ingest: the ETL process occurs as streaming data is ingested, rather than as a separate three-step process. Creating a pipeline does not require coding. Pipeline data is parsed into a StreamSets Data Collector Record, a common record format that removes the need for custom data transformations. StreamSets uses general-purpose connectors to avoid custom coding. In addition, search results can be ingested, providing further segmentation of data. StreamSets easily loads data from relational databases, flat files such as Excel spreadsheets, or a CRM.
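To make the common-record idea concrete, here is an illustrative Python sketch (not the StreamSets API; `to_record` and `uppercase_name` are hypothetical names) of how parsing heterogeneous inputs into one record shape lets a transformation be written once, independent of the source format:

```python
# Illustrative sketch of a common record format, in the spirit of the
# StreamSets Data Collector Record. Not actual StreamSets code.
import csv
import io
import json

def to_record(source, payload):
    """Parse a source-specific payload into one common shape:
    a dict with header metadata and a field-name -> value map."""
    if source == "json":
        fields = json.loads(payload)
    elif source == "csv":
        fields = next(csv.DictReader(io.StringIO(payload)))
    else:
        raise ValueError(f"unsupported source: {source}")
    return {"header": {"source": source}, "fields": fields}

def uppercase_name(record):
    """A transformation written once against the common format,
    regardless of where the record originated."""
    record["fields"]["name"] = record["fields"]["name"].upper()
    return record

# The same transformation applies to JSON and CSV inputs alike.
r1 = uppercase_name(to_record("json", '{"name": "ada", "id": "1"}'))
r2 = uppercase_name(to_record("csv", "name,id\nada,1"))
```

Both records end up with identical fields, so downstream stages never see source-specific schemas.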

Pipelines in StreamSets are not single-purpose artifacts. Each has a single data origin, but can have multiple destinations. This streamlines the ETL process by allowing one pipeline to serve multiple applications, which increases analytics and ETL flexibility.
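The one-origin, many-destinations shape can be sketched in a few lines of Python (hypothetical names; this only mirrors the topology, not how StreamSets runs pipelines):

```python
# Hypothetical sketch: one origin fanning out to multiple destinations,
# as a single StreamSets pipeline can. Not actual StreamSets code.
def run_pipeline(origin, destinations):
    """Read each record once from the origin and deliver a copy
    to every destination."""
    for record in origin:
        for deliver in destinations:
            deliver(dict(record))  # copy so destinations stay independent

warehouse, search_index = [], []
run_pipeline(
    origin=[{"id": 1}, {"id": 2}],
    destinations=[warehouse.append, search_index.append],
)
```

Each record is read once but lands in both sinks, which is what lets a single pipeline feed, say, a warehouse and a search index at the same time.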
