Modern, state-of-the-art data ingestion
Mix and match
Enrich it on the fly
Ingest from a variety of sources
Convert any schema to any other
Catch errors in our safety net
Migrate it all — securely
Learn more about data ingestion
Data ingestion is a process by which data is moved from a source to a destination where it can be stored and further analyzed. Given that event data volumes are larger today than ever and that data is typically streamed rather than imported in batches, the ability to ingest and process data at speed and scale is critical.
Depending on the source or destination, data ingestion may be:
- continuous or asynchronous;
- batched, real-time, or a lambda architecture (a combination of both).
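To make the batched versus real-time distinction concrete, here is a minimal sketch in Python. The function names (`ingest_batch`, `ingest_stream`) and the in-memory destination are assumptions made for illustration only; they are not part of any particular product's API.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, int]


def ingest_batch(
    source: Iterable[Record],
    write_many: Callable[[List[Record]], None],
    batch_size: int = 3,
) -> int:
    """Group records into fixed-size batches; return the number of writes made."""
    batch: List[Record] = []
    writes = 0
    for record in source:
        batch.append(record)
        if len(batch) == batch_size:
            write_many(batch)
            writes += 1
            batch = []  # start a fresh batch; the written one is left untouched
    if batch:  # flush the final, partially filled batch
        write_many(batch)
        writes += 1
    return writes


def ingest_stream(source: Iterable[Record], write_one: Callable[[Record], None]) -> int:
    """Write each record as soon as it arrives; return the number of writes made."""
    writes = 0
    for record in source:
        write_one(record)
        writes += 1
    return writes


if __name__ == "__main__":
    events = [{"id": i} for i in range(7)]
    destination: List[Record] = []
    print(ingest_batch(events, destination.extend), "batched writes")
    print(ingest_stream(events, destination.append), "streamed writes")
```

Batching trades latency for fewer, larger writes, while streaming does the opposite. A lambda-style setup would, roughly speaking, run both paths over the same source.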
Data scientists often spend most of their time tidying and organizing (or cleansing) data, because the data at the source and the destination may not share the same schemas, formats, types, or timing.
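As an illustration of what cleansing can involve, the sketch below renames fields, coerces types, and normalizes a timestamp for a single record. The field names in `FIELD_MAP`, the assumption that the source sends numbers as strings, and the assumption that timestamps arrive as epoch seconds are all hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, Dict

# Hypothetical mapping from source field names to destination column names.
FIELD_MAP = {"userId": "user_id", "ts": "event_time", "amt": "amount"}


def cleanse(record: Dict[str, Any]) -> Dict[str, Any]:
    """Rename fields, coerce types, and normalize the timestamp to ISO 8601 UTC."""
    out: Dict[str, Any] = {
        dest_name: record.get(src_name) for src_name, dest_name in FIELD_MAP.items()
    }
    # Assumed: the source sends numbers as strings; the destination wants numerics.
    if out["user_id"] is not None:
        out["user_id"] = int(out["user_id"])
    if out["amount"] is not None:
        out["amount"] = float(out["amount"])
    # Assumed: the source sends epoch seconds; the destination wants an ISO string.
    if out["event_time"] is not None:
        out["event_time"] = datetime.fromtimestamp(
            int(out["event_time"]), tz=timezone.utc
        ).isoformat()
    return out


if __name__ == "__main__":
    print(cleanse({"userId": "42", "ts": 0, "amt": "9.50"}))
```

Fields missing at the source come through as `None` rather than raising, so a later validation step can decide what to do with incomplete records.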
There are about as many data ingestion best practices as there are DevOps people and data scientists managing data, but a few are worth considering for anyone ingesting data.
- Create zones for ingestion (like landing, trusted, staging, refined, production, and/or sandbox) where you can experiment with your data or implement different access controls, among other things.
- Automate it with tools that run batch or real-time ingestion, so you need not do it manually.
- Serve it by providing your users with easy-to-use tools like plug-ins, filters, or data-cleaning tools so they can easily add new data sources.
- Govern it by introducing data governance and a data steward responsible for schemas, guidelines, and the overall state of your data.
- Promote it to your data consumers by letting them know when the ingested, cleaned data is ready for use and who it is meant for.
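The idea of zones can be sketched with a few in-memory lists. This is only a toy model: real zones would more likely be buckets, schemas, or tables. The `quarantine` zone, the `REQUIRED_FIELDS` rule, and the function names are assumptions added for illustration.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]
Zones = Dict[str, List[Record]]

# Hypothetical validation rule: these fields must be present and non-null.
REQUIRED_FIELDS = ("id", "event")


def new_zones() -> Zones:
    """Create empty zones; 'quarantine' is an assumed holding area for rejects."""
    return {"landing": [], "trusted": [], "quarantine": []}


def is_valid(record: Record) -> bool:
    """A record is valid when every required field is present and not None."""
    return all(record.get(field) is not None for field in REQUIRED_FIELDS)


def promote(zones: Zones) -> None:
    """Move each landing record to 'trusted' if valid, otherwise to 'quarantine'."""
    for record in zones["landing"]:
        target = "trusted" if is_valid(record) else "quarantine"
        zones[target].append(record)
    zones["landing"].clear()


if __name__ == "__main__":
    zones = new_zones()
    zones["landing"] += [{"id": 1, "event": "click"}, {"id": None, "event": "view"}]
    promote(zones)
    print(len(zones["trusted"]), "trusted,", len(zones["quarantine"]), "quarantined")
```

Keeping rejected records in their own zone, rather than dropping them, leaves room to inspect and replay them later.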
Finally, where you can, keep your destination data schemas and types as close as possible to those of your source data. While Alooma provides all the tools you need to transform your data any way you like right in the pipeline, having similar source and destination schemas and types (when it makes sense to do so) will save you time and make troubleshooting easier when problems arise.
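One way to see how far a destination schema has drifted from its source is to compare the two field by field. The sketch below assumes each schema is represented as a plain mapping of field name to type name; that representation and the `schema_diff` helper are hypothetical.

```python
from typing import Dict, List


def schema_diff(source: Dict[str, str], destination: Dict[str, str]) -> Dict[str, List[str]]:
    """Report fields missing from the destination, extra in it, or typed differently."""
    shared = set(source) & set(destination)
    return {
        "missing": sorted(set(source) - set(destination)),
        "extra": sorted(set(destination) - set(source)),
        "type_mismatch": sorted(name for name in shared if source[name] != destination[name]),
    }


if __name__ == "__main__":
    src = {"id": "int", "name": "string", "created": "timestamp"}
    dst = {"id": "bigint", "name": "string", "loaded_at": "timestamp"}
    print(schema_diff(src, dst))
```

An empty report in all three categories means the schemas line up, which is the situation the advice above aims for.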
ETL was born in the world of batched, structured reporting from relational databases (RDBMSs), while data ingestion sprang forth in the era of IoT, where large volumes of data are generated every second.
Thus, ETL is generally better suited to importing data in batches from structured files or relational databases into another similarly structured format. Data ingestion, on the other hand, came about more recently and tends to be better suited to very large volumes of unstructured, schema-agnostic data streamed in real time.