Data Integrity/Quality: Ain't Nobody Got Time for that
Simple practices to stop perpetuating garbage in, garbage out.
As data mesh, decentralization of teams, data team expansion, varying definitions of KPIs across departments, and the demand for quick outcomes become more prevalent, issues regarding data quality tend to arise. While following best practices from the beginning is ideal, it's not always possible due to overwhelming backlogs and stakeholder pressure for immediate results. This can result in shortcuts being taken or postponing tasks, which often leads to a bigger mess that becomes difficult to clean up effectively.
In this post, I’ll try to cover the basics in one place - choosing what to test and how, testing options in dbt, and more advanced techniques - i.e. a time-series forecasting use case with Snowpark that catches potential errors that are not visible in plain sight.
Choosing the right strategy
And Then There Was Light
You're in good shape if you’re just getting started with your data models. You won’t break anything, or there won’t be much impact downstream. Let’s cover this scenario first.
Let’s start with a couple of questions to ask yourself:
How much do I know about the area I’m building a model for?
What’s the importance of testing all of the columns?
What’s important for the business and downstream users?
How much cost will thoroughly testing everything add?
By answering these questions, you will be able to identify how many people will be involved, what the cost vs. benefit is, and what’s essential and what can be ditched.
If you’re not an area expert, you will have to talk to the data producers to understand the column content and what rules are applied at the source level. Speaking to the business, you’ll understand the criticality, importance, and logic of the tests you’d have to build.
If you’re in a decentralized team and others depend on your models, you’d want at least a verbal data contract with downstream users covering what’s handled on your end, the agreed structure, and how data change requests should come to you.
Being a technical expert, you will see queries generated by the tests, and you can foresee the potential cost added to the bill later on.
Your task now is to take the best from all sides: choose the most critical tests, implement them first, and keep adding more as long as it makes sense from the cost vs. benefit side.
Return of the data integrity
Now this will be the tricky part, where you’ll have to navigate the complexities of not breaking many things while improving the quality. Depending on the data team size and number of data models, it can be either a very smooth or painful journey.
If the team is big and this is a critical component in your roadmap - an all-hands-on-deck situation - it should be smooth as butter in a pan. Dividing the models up and reviewing them one by one on the value they bring and the freshness of their logic will also help a lot with housekeeping. Changing things in a development environment and adjusting models one by one will help you sail smoothly. This takes a lot of time, though, and having the whole data team spend that much time fixing things is usually not the preferred way to go. So this is more of a fairytale situation, at least in my experience.
A more realistic approach is that part of the team fixes things while the rest continues with business as usual. This is riskier and might even create a bit of downtime if not thought out carefully (i.e. no development environment, no standards). What’s most important here is communication with everyone using the data warehouse. You might wonder why - adding data validation, integrity, and quality checks might expose inconsistencies in the data that you’ll have to address immediately. This boils down to changes in dashboards and downstream models, just because we’re so used to solving things with a case statement here and a case statement there.
That later turns out to be a monster to maintain.
What helps in these scenarios while reworking is to run a data-diff between the two tables to ensure that the difference is acceptable, or at least that the reasons for it are understandable. If all is communicated well, it will work like a well-oiled piece of machinery.
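If you don’t have a dedicated diffing tool at hand, even a plain SQL comparison between the current and the reworked model gives a quick signal. A minimal sketch - the database, schema, and table names below are purely illustrative:

```sql
-- Rows present in the current model but missing from the reworked one, and vice versa.
(
    select 'only_in_current' as side, * from analytics.orders
    except
    select 'only_in_current' as side, * from analytics_dev.orders_reworked
)
union all
(
    select 'only_in_reworked' as side, * from analytics_dev.orders_reworked
    except
    select 'only_in_reworked' as side, * from analytics.orders
)
```

If this returns nothing, the rework is a drop-in replacement; if it returns rows, at least you know exactly which ones to explain to downstream users.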
Rise of dbt
While I recognize the importance of maintaining data integrity, I must admit that I have taken shortcuts in the past and faced the consequences as a result. I acknowledge that I have made mistakes, and I hope to use this post as an opportunity to share some ideas and tips to help you avoid making similar errors in the future.
What I like about dbt is that it already helps you test your data to at least ensure fundamental integrity. I’ll briefly cover some dbt generics (to have it all in one place) and then add packages I’ve tried and used, or that we’re using in our production pipelines. If something seems familiar, feel free to skip it - most likely it has been covered multiple times already, and there’s no point in reading it again and again.
Freshness
Have you had a scenario where your pipeline runs successfully, but in reality no new data was processed, simply because it wasn’t in the source? To catch that, one of the crucial components is the source freshness test, which can throw errors and/or warnings depending on your setup. If you’re not familiar with it, read more in the dbt docs.
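As a rough idea, the freshness check lives in the source definition; the source name, column, and thresholds below are purely illustrative:

```yaml
# models/staging/sources.yml (illustrative source and thresholds)
version: 2

sources:
  - name: shop
    database: raw
    loaded_at_field: _loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
```

You then run `dbt source freshness` (e.g. as the first task in the pipeline) and decide whether a warning or an error should stop the run.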
Generic tests
Usually good enough to cover the bare minimum for you. Taken straight from dbt docs.
unique: the order_id column in the orders model should be unique
not_null: the order_id column in the orders model should not contain null values
accepted_values: the status column in the orders model should be one of 'placed', 'shipped', 'completed', or 'returned'
relationships: each customer_id in the orders model exists as an id in the customers table (also known as referential integrity)
This allows you to ensure primary and foreign key constraints (unique and relationships), and you can enforce not-null and allowed values. Like I said - basic enough. Keep in mind that tests run after the model is built, so they don’t guarantee that the data isn’t already ruined by the time the test is running!
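Wired together in a schema file, the four generic tests above look roughly like this (model and column names follow the dbt docs example):

```yaml
# models/marts/orders.yml
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: id
```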
Singular tests
One thing I like about dbt is that you can customize it as much as you want. If you need custom logic to be tested, create a singular test with all the Jinja templating you want. The test fails if it returns any rows. So go as crazy here as you want; just don’t create costly queries while at it, and pick the right frequency to run them. You can read more in the dbt docs.
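A singular test is just a SQL file in the tests/ directory; the file name, model, and rule below are made up for illustration:

```sql
-- tests/assert_no_negative_order_amounts.sql
-- The test fails if this query returns any rows.
select
    order_id,
    amount
from {{ ref('orders') }}
where amount < 0
```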
Great expectations
Great Expectations is on the rise. I’ve seen many of my ex-colleagues using it, some even contributing to the open-source project. No surprise that there is a ported version of it for dbt called dbt_expectations.
This unlocks even more functionality, letting you assert the table shape and more complicated rules without creating multiple singular tests. That’s a great benefit since we’re sticking to YAML engineering, and that information will be pushed to dbt docs, where it will be visible to dbt docs explorers in the company.
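To give a flavour of it, here’s a sketch assuming dbt_expectations is installed via packages.yml; the model, column, and thresholds are illustrative:

```yaml
# models/marts/orders.yml (assumes dbt_expectations in packages.yml)
version: 2

models:
  - name: orders
    tests:
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 1
    columns:
      - name: amount
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              row_condition: "amount is not null"
```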
dbt constraints
Not going to lie; I like Snowflake a lot. But as much as I like it, there were some things I missed compared to the RDBMSs I had worked on before. I was stoked when I discovered the dbt_constraints library and its use of RELY. Even though it’s not a pure foreign key check, at least we’re getting a performance boost (read more in the Snowflake docs).
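As a rough sketch, assuming the Snowflake-Labs dbt_constraints package is installed and using the same illustrative model, the constraint tests sit right next to the regular ones:

```yaml
# models/marts/orders.yml (assumes dbt_constraints in packages.yml)
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - dbt_constraints.primary_key
      - name: customer_id
        tests:
          - dbt_constraints.foreign_key:
              pk_table_name: ref('customers')
              pk_column_name: id
```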
Re:data and Elementary
Both are dbt libraries but with cloud offerings (Elementary, for now, has its cloud offering in beta). I wrote about them here and covered other vendors/solutions, some more in-depth, some less.
Snowpark and Prophet
When working in a limited capacity, with different teams involved, you’re looking for the best tool for the job with what you have at hand. Similarly, we have chosen some anomaly detection strategies with our stack (MWAA, Snowflake, dbt). The release of Snowpark’s Python API untied our hands to try different things and run them isolated from the MWAA instances, where we might have library mismatches. You might think: why not the DockerOperator or the PythonVirtualenvOperator? The thing is that MWAA is running in containers (ECS), and creating a container inside a container is not an option.
Similarly, it won’t work with the virtualenv operator: if we have a clash of core libraries, we won’t be able to install additional ones.
How does Snowpark come into play? Snowpark allows us to create User Defined Table Functions (UDTFs). We can create a simple and lean environment and build objects/classes that can be triggered as a plain function call with some parameters in a simple query.
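To give a rough idea of the shape, here’s a minimal sketch of registering a Snowpark Python UDTF. The handler, function name, schema, and stage are all illustrative, and the actual forecasting logic is left out:

```python
# Minimal sketch of a Snowpark Python UDTF registration (names are illustrative).
from snowflake.snowpark import Session
from snowflake.snowpark.types import (
    StructType, StructField, StringType, DateType, FloatType,
)

class ForecastHandler:
    # process() is called for each input row; per-category model fitting
    # would typically happen in end_partition() instead.
    def process(self, category, day, value):
        # Placeholder: this is where the fit/predict logic would live.
        yield (category, day, value)

# connection_parameters is assumed to be defined elsewhere (account, user, etc.)
session = Session.builder.configs(connection_parameters).create()

session.udtf.register(
    ForecastHandler,
    name="anomaly_forecast",
    output_schema=StructType([
        StructField("category", StringType()),
        StructField("day", DateType()),
        StructField("predicted", FloatType()),
    ]),
    input_types=[StringType(), DateType(), FloatType()],
    packages=["prophet"],            # keep the environment lean
    is_permanent=True,
    stage_location="@udf_stage",     # hypothetical stage
    replace=True,
)
```

Once registered, it can be called from SQL roughly like `select * from my_table, table(anomaly_forecast(category, day, value) over (partition by category))`, which is what makes it so easy to slot into an existing dbt/SQL workflow.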
We’re using the Prophet library to achieve this. The environment has only Prophet installed since we’re not doing anything more complicated there. We’re passing the categories we want to predict on separately, adding holidays that help deal with certain expected drops in the statistics, and doing a simple fit-predict. I won’t reproduce our full code here, but I strongly suggest reading Felipe Hoffa's Medium posts, which inspired me to try it out in this particular way.
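Just to show the general shape of the fit-predict part, a minimal Prophet sketch could look like this; the dataframe, holiday dates, and horizon are purely illustrative:

```python
import pandas as pd
from prophet import Prophet

# df is assumed to hold one category's history with the columns Prophet expects:
# ds (date) and y (the metric we're forecasting).
holidays = pd.DataFrame({
    "holiday": "public_holiday",
    "ds": pd.to_datetime(["2023-01-01", "2023-12-25"]),  # illustrative dates
    "lower_window": 0,
    "upper_window": 1,
})

model = Prophet(holidays=holidays, interval_width=0.95)
model.fit(df)

future = model.make_future_dataframe(periods=1)   # forecast one period ahead
forecast = model.predict(future)

# yhat_lower / yhat_upper define the expected band; an actual value falling
# outside of it is what we treat as a potential anomaly.
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail(1))
```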
One thing to keep in mind is that this anomaly check runs after all other tests and validations have passed. We’re interested in acting only on sudden drops or changes in customers’ behaviour, so we do a bit more manipulation based on the results and shoot messages to Slack for the relevant people to take action.
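The notification part can be as simple as posting to a Slack incoming webhook; the function and message format below are hypothetical:

```python
import requests

def notify_slack(webhook_url: str, category: str, day: str,
                 actual: float, low: float, high: float) -> None:
    # webhook_url is a hypothetical Slack incoming-webhook URL kept in a secret store.
    text = (
        f":warning: Possible anomaly in *{category}* on {day}: "
        f"actual {actual:,.0f} is outside the expected range [{low:,.0f}, {high:,.0f}]."
    )
    response = requests.post(webhook_url, json={"text": text}, timeout=10)
    response.raise_for_status()
```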
Summary
Prioritizing data integrity testing by conducting frequent and intelligent tests is important. Neglecting to add tests can undermine stakeholder confidence in the quality of the data. It is essential to choose the critical tests and determine what should be flagged as a warning and what can be disregarded. However, with the advent of ELT and the creation of various data marts for specific queries, it can be challenging to maintain consistency. To address this, it is vital to include rules governing the data strategy and ensure everyone is on board. If you want to go a step further, either build a somewhat more sophisticated tool/framework in-house or use/buy something on the market that suits your needs.