It’s a Monday morning, you have your coffee in hand and are ready to get to work. You open your laptop only to find that your dbt tests have been failing all weekend. A data failure at the beginning of the week may just be every data person’s worst nightmare.
Even worse, with this test failing, you are clueless as to who is in charge of fixing it. Is it your responsibility or the software engineering team’s? You are unsure of what steps to take next, as you have never seen this particular test fail before.
Unfortunately, it’s not enough to have tests and monitoring in place; you also need to know what to do when they fail and have a concrete plan to act on. Luckily, this problem is easily solved when you are proactive about dbt test failures and create an incident management plan alongside building out your tests.
The different parts of an incident management plan
There are five main parts of an incident management plan: the who, what, why, how, and post-incident reflection.
Who refers to the person responsible for investigating a failure.
Your incident management plan should leave no room for questioning who remediates a test failure. If there is any doubt as to who should be handling it, your plan needs to be more detailed.
What refers to the thing that is failing. It also includes dependencies that are affected by the failure, as these become failures as well.
Your incident management plan must include clear directions on how to identify what is failing. A dbt test failing may be a symptom of the failure rather than the failure itself, making a strong understanding of the lineage of your dbt project crucial.
Why refers to identifying the root cause of a failure.
After you identify what is failing, you need to look into why it is failing. Incident management plans should help you eliminate potential causes based on problems you’ve seen in the past.
How refers to the action steps needed to fix the root cause.
While this depends on the specific issue and what caused it in the first place, your incident management plan must help you find a long-term solution rather than a quick fix. Rather than simply making the test failure go away, the focus should be on fixing the source of the problem.
The post-incident reflection involves looking back at the entire incident management process and how it can be improved for future failures.
It includes proper documentation within your incident management plan on what occurred and why. It also addresses what needs to be done in other areas of your data stack to prevent the same issue from occurring again to a different resource.
Who: Assigning someone to fix the data quality issue
The first element of an incident management plan comes down to deciding who is responsible for fixing a dbt test failure. The simplest solution for this is to assign owners to different data resources. In the case of dbt tests, Elementary allows you to assign owners to specific tests.
Questions to ask yourself when deciding who
When deciding who should be the owner of a resource, there are a few questions you should ask yourself:
Is this a test on a source or model?
Depending on how the resource is created, the test should be owned by software/data engineering or analytics.
If a test is on a source, it should be owned by the engineering team that generates that data source. Failing source tests are most likely caused by engineering adding constraints that don’t apply or by a bug in the code they wrote. Therefore, it makes the most sense to assign them as the owners, since they are the ones who own generating this data.
If a test is on a data model, it should be owned by the analytics team, as they are the ones who created that data model. They understand the code better than anyone else and can debug it to find what is failing.
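For illustration, this split in ownership might be encoded directly in your schema files using the same meta convention shown later in this post; the source, model, and email addresses below are hypothetical:

sources:
  - name: app_database            # raw data generated and loaded by engineering
    meta:
      owner: ["engineering-oncall@example.com"]
    tables:
      - name: raw_orders

models:
  - name: fct_orders              # transformation built and owned by analytics
    meta:
      owner: ["analytics@example.com"]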
Who built the model or implemented the code?
Once you’ve determined whether the test is on a source or model, you can narrow the owner down to the specific person responsible on the given team. The owner should be the person who wrote the code to generate the data table or data model; they know the code best, so they should be the one responsible for debugging it.
Is there a business owner of the model?
Test failures affect not only the data team but also the business teams who depend on that data for key metrics. For this reason, it can be helpful to have a business stakeholder as an owner of the model in addition to a technical owner. This will help identify any changes that may need to be made to the encoded business logic and will allow you to alert those who depend on the data that there is a failure.
How do we want to handle failures during non-business hours and weekends?
Once you assign owners to data resources, it becomes very clear who should be fixing the errors. However, this becomes a bit fuzzy during evenings and weekends, as most people are not checking for data alerts.
It can be helpful to evaluate how critical certain data issues are. Do they affect only the business or do they affect customers? Will waiting to solve an issue create a lot more work during the next business day?
Depending on your answer, you may want to create an on-call schedule with rotating engineers who check for alerts and fix them, independent of the owner of the data resource.
How to add an owner to a dbt test using Elementary
Elementary offers a few different ways to add owners to your models, but the easiest is a meta block like so:
models:
  - name: example_model
    meta:
      owner: ["madison@learnanalyticsengineering.com", "@madison"]
Here, I assigned two owners: one is my email and the other is my Slack email prefix. This way, the owner is tagged directly in Elementary’s Slack alerts, ensuring they are always notified when a test fails.
I also mentioned that you can add owners to specific dbt tests, making it even clearer who is responsible when a particular test fails. You would do this using a similar meta block:
tests:
  - not_null:
      meta:
        owner: ["madison@learnanalyticsengineering.com", "@madison"]
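For context, a test-level owner like this usually sits under a specific column in a schema.yml. Here is a minimal sketch mirroring the structure above, assuming a hypothetical fct_orders model with an order_id column:

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - not_null:
              meta:
                owner: ["madison@learnanalyticsengineering.com", "@madison"]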
What: Identifying the root of the failing test and its dependencies
Luckily, dbt tests are extremely useful for identifying issues in your data sources and models. When getting to the root of an issue with a model, look at the sources that make up that model and check whether any tests on those are also failing. This will tell you whether the model failure is caused by a source failure or by the model’s own code. Using this process of elimination gets you to the cause faster. It is also one of the reasons why it is so important to test both your sources and models, especially for uniqueness and null values!
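As a rough sketch of that idea, assuming a hypothetical raw_orders source feeding an fct_orders model, the same column can be tested in both places so a source-level failure is immediately distinguishable from a modeling bug:

sources:
  - name: app_database
    tables:
      - name: raw_orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null

If both tests fail, the problem is almost certainly upstream in the source; if only the model test fails, the model’s SQL is the first place to look.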
Speaking with business stakeholders during this stage can also be helpful, as they may have insight into something that’s changed within the product. I’ve experienced many dbt test failures caused by changes in the encoded business logic; the data team was the last to know about these changes!
It’s also important to identify other resources such as dashboards, reports, and other data models that depend on the failing resource so that you can properly alert the business. This is one of the reasons I recommend using dbt exposures: they help you identify dependencies outside of dbt in a pinch when critical data issues occur.
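A minimal exposure sketch might look like the following; the dashboard name, owner, and model reference are made up for illustration:

exposures:
  - name: weekly_revenue_dashboard
    type: dashboard
    description: Revenue dashboard used by the finance team
    owner:
      name: Finance Lead
      email: finance-lead@example.com
    depends_on:
      - ref('fct_orders')

When a test on fct_orders fails, the exposure tells you exactly which downstream asset is affected and which stakeholder to notify.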
Why: Identifying the root cause
The root cause depends on which tests are failing, whether engineering recently made any changes to the data, and whether any business logic has changed. It varies with every situation and will rarely be the same twice. Typically, the more dbt tests you have in place, the easier it is to eliminate what isn’t causing the issue.
Some helpful things to test include:
- schema changes
- null values
- uniqueness tests (especially on primary keys)
- freshness
Luckily, these tests can be found as generic tests within dbt or through Elementary, making them easy to implement from the beginning of building your dbt project.
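For example, null and uniqueness checks are covered by the column tests shown earlier, while freshness can be declared directly on a source and schema changes can be caught with a test from the Elementary package (assuming it is installed in your project); the names below are illustrative:

sources:
  - name: app_database
    loaded_at_field: _loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: raw_orders
        tests:
          - elementary.schema_changes

Source freshness is checked with dbt source freshness, while the schema change test runs as part of dbt test.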
These tests are the first place you would want to start when identifying the root cause, as they catch the most basic, yet most common, data quality issues.
How: Documenting the problem and solution
Once you identify the problem, you need to focus on how to fix it. Again, this is very specific to the root of the dbt test failure. However, no matter the solution, the process for documenting the problem and solution should always be the same.
Documentation of problems and their potential solutions should be included in your incident management plan from the very start (before a test fails!). This documentation should include the basics of what to investigate: upstream failures, uniqueness, freshness, and null values. Over time, you should add to this document the different test failures you’ve experienced, their causes, and the solutions.
By documenting every test failure and the incident management process, you’ll help speed up future debugging.
Reflection: Refining tests in place
The conclusion of any data incident needs to come with reflection. Reflection allows you to look back on what went wrong and how you could have prevented it. It gives you a chance to see the gaps in your data pipeline that existed before the incident, as well as the ones that still exist and need to be addressed.
Here are a few questions to ask yourself as you reflect:
- Were there additional tests you could have added that would have allowed you to eliminate potential causes?
- Are your sources AND models properly documented and tested?
- Was the owner of the source or model the best person to debug the test failure? (If not, change the owner!)
Reflecting on the test failure will allow you to discover other data quality issues before they occur, making your data pipeline stronger than it was before the incident. You should constantly reflect on and iterate on this plan as you learn more about your dbt project, helping to reduce incidents in the future.
Now it’s your turn…
Now that we’ve reviewed why an incident management plan is so important, especially for dbt test failures, it’s time to write your own. Don’t push something like this to the back burner, as it will only make your future more difficult.
Let’s be honest, none of us are going to avoid test failures entirely. It’s part of working in data! So why not take the extra time now to prepare for failures and ensure you have a solid plan for tackling them in the future?
Start with the who and then the what. The why and how sections will necessarily be more theoretical, outlining potential problems and their solutions. Lastly, brainstorm different reflection templates covering the details that everyone must fill out after working on an incident.
The more time invested in your incident management plan upfront, the quicker test failures can be solved and the sooner your data pipeline will be back up and running as usual.