How can data integrity be achieved, and what are some examples?
Maintaining data integrity is like cleaning your house: it’s a repetitive, iterative process of carefully examining each piece of data and making sure it is not skewed, missing, or duplicated. It’s a never-ending process, especially in our era of abundant data sources generating millions of records daily.
Data integrity is a core value of Alooma, and as the VP of engineering, I can write a long list of tips about how to keep your data as reliable as possible. So I’ll try to summarize:
- Integrate your sources into a single source of truth - the problem gets much more complicated if you have to verify integrity across different data platforms holding partially overlapping data sets. But if you launch a data warehouse project and move all your data there, it will be easier to verify (not to mention, your whole organization will benefit from a better data culture). The rest of the tips assume that you have such a data warehouse, and that your challenge is moving all your data into it reliably.
- For every data source, maintain some failsafe option to replay / reread the data. With many persistent data sources, this is fairly easy, as all you need to reliably store is what has been replicated and what hasn’t. One of the tools we built at Alooma allows exactly that.
- Beware the “exactly once” paradigm, in every step of your pipeline. Most delivery guarantees are really at-least-once, so design each stage so that replayed or retried records don’t corrupt your data.
- Keep IDs for all records, allowing you to trace any record in your data warehouse back to the original source record.
- Build periodic processes that run comparisons between tables in your data warehouse and your data sources. This is no easy task.