Add observability to your dbt project: Top 3 dbt testing packages

Add data observability to your dbt projects by leveraging the top three dbt testing packages: dbt_utils, dbt_expectations, and Elementary. This guide covers installation, key features, and practical examples to help you implement advanced data tests and anomaly detection.

If you are a dbt user, you understand how powerful the tool is in ensuring you produce the highest quality data. Because of dbt’s emphasis on data quality, it is extremely easy to test your source data and data models.

dbt’s built-in generic tests allow you to test your primary keys, foreign keys, and other fields within your models. However, in addition to these pre-installed tests, dbt offers the ability to install external packages. These packages aren’t just boilerplate code, but rather highly thought-out, tested packages built by some of the brightest names in the data industry.

In this article, I will teach you how to leverage three of the most popular packages in dbt, helping to ensure you test edge cases in your data and produce the highest quality data models possible.

Installing dbt testing packages

All packages in dbt are added to a file called packages.yml, which is created automatically when you first create your dbt project. This basic YAML file lists the packages, and their versions, that you wish dbt to install.

To add a package to this file, specify its name and the latest version:

packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
    
  - package: calogica/dbt_expectations
    version: 0.10.3
    
  - package: elementary-data/elementary
    version: 0.15.2

You can find the latest version of a package and any dependencies on the dbt version required in the dbt package hub.

After you’ve added the packages you wish to install in your dbt project to packages.yml, you must run dbt deps to install them and begin using them in your code.

Don’t skip this step or dbt won’t be able to find the referenced tests!

Package #1: dbt_utils

dbt_utils is a well-rounded, handy package that provides you with a lot of everyday, core functions. It contains generic tests that you can apply in addition to your standard unique, not_null, relationships, and accepted_values tests in dbt. It not only includes tests but macros, materializations, and SQL generators as well.

I like to use this package to test expectations on my source data such as the proportion of NULL values in a field, natural key validity, and matching row counts.

Let’s look at how we can use dbt_utils to apply these tests to our data:

Testing for the proportion of NULL values

While dbt can test for the presence of NULL values, sometimes we need more customization on the simple not_null test.

I recently came across a scenario where NULL values were expected in one of my source’s fields, but not at the proportion I was seeing them. NULL values should have accounted for a very small portion of all the values in the field, but were instead a majority of the values. It turned out that this was an issue in how the data was being pulled from an external API.

To ensure this didn’t occur again without us knowing, I added the not_null_proportion test from dbt_utils. I set it to 0.50 so that the test would fail if a majority of the values in the field were once again NULL.

The test looked like this when applied:

sources:
  - name: raw  # source name assumed; adjust to your project
    tables:
      - name: external_accounts
        columns:
          - name: account_type
            tests:
              - dbt_utils.not_null_proportion:
                  at_least: 0.50

Now, if this unexpected issue with the API occurs again, the data team will be the first to know about it.
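The SQL that dbt_utils compiles is more involved, but the assertion itself reduces to a simple ratio check. Here is a minimal Python sketch of that logic (the function and sample values are illustrative, not the package’s actual implementation):

```python
def not_null_proportion(values, at_least):
    """Return True when the share of non-NULL (non-None) values
    in a column meets the at_least threshold, mirroring
    dbt_utils.not_null_proportion."""
    if not values:
        return False
    non_null = sum(1 for v in values if v is not None)
    return non_null / len(values) >= at_least

# Healthy column: 3 of 4 values present -> 0.75, passes at 0.50
healthy = ["checking", "savings", None, "credit"]
# Broken API pull: mostly NULLs -> 0.25, fails at 0.50
broken = [None, None, None, "checking"]

print(not_null_proportion(healthy, at_least=0.50))  # True
print(not_null_proportion(broken, at_least=0.50))   # False
```

When the function returns False, the dbt test would fail and surface the records for investigation.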

Testing for natural key validity

If your source table does not have a primary key, it is recommended to create a surrogate key in its staging model so that you have a field to represent a unique row in your table. Adding a unique_combination_of_columns test to the source is a great way to test the values of two or more fields, ensuring they will always create a unique key.

This test from dbt_utils is helpful in that it is performant even on large datasets.

To assert that the values in three fields, account_id, account_type, and state, are always unique across the combination of these three, we can apply the test unique_combination_of_columns like so:

models:
  - name: account_types_across_states
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - account_id
            - account_type
            - state

Keep in mind that the same value can exist more than once in each field, so generic unique tests would not work here. It is the uniqueness across the three columns that we are testing.
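The underlying check is equivalent to asserting that no tuple of the three values repeats across rows. A minimal Python sketch of that idea, with hypothetical sample rows:

```python
def unique_combination_of_columns(rows, columns):
    """Return True when the tuple of values across the given columns
    is unique for every row, mirroring
    dbt_utils.unique_combination_of_columns."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True

rows = [
    {"account_id": 1, "account_type": "basic", "state": "NY"},
    {"account_id": 1, "account_type": "basic", "state": "CA"},  # repeats per column, unique combo
    {"account_id": 2, "account_type": "basic", "state": "NY"},
]
key_columns = ["account_id", "account_type", "state"]

print(unique_combination_of_columns(rows, key_columns))  # True
# A fully duplicated row breaks the natural key:
print(unique_combination_of_columns(rows + [rows[0]], key_columns))  # False
```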

Testing for matching values (cardinal equality)

It is a best practice to check for the number of unique values in a field between upstream and downstream models.

For example, if I select from a staging model in one of my intermediate data models and use a left join to join it to another staging model, the output of this model should still have the same number of unique values for a field as it did in the first staging model. This would be a great example of when to test for cardinal equality.

A cardinal equality test can be applied similar to how a relationship test within dbt is applied. You want to reference the name of the other model and field that you are comparing in the test itself.

If we were to test a model referrals_joined_owners to ensure the count of unique values in the referral_id field always matched that of the id field in stg_raw__referrals, the test would look like this:

models:
  - name: referrals_joined_owners
    columns:
      - name: referral_id
        tests:
          - dbt_utils.cardinality_equality:
              field: id
              to: ref('stg_raw__referrals')

This test helps to ensure there are no orphan records in the model the test is applied to as compared to the staging model that is referenced in the test.
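Conceptually, the test just compares the number of distinct values on each side. A minimal Python sketch with made-up ids:

```python
def cardinality_equality(column, other_column):
    """Return True when both columns contain the same number of
    distinct values, mirroring dbt_utils.cardinality_equality."""
    return len(set(column)) == len(set(other_column))

# id values in the staging model
staging_ids = [1, 2, 3, 4]
# referral_id values after a left join; row 4 fanned out,
# but the distinct count still matches
joined_referral_ids = [1, 2, 3, 4, 4]

print(cardinality_equality(joined_referral_ids, staging_ids))  # True
# A dropped referral changes the distinct count and fails the test:
print(cardinality_equality([1, 2, 3], staging_ids))  # False
```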

Package #2: dbt_expectations

dbt_expectations is a dbt package primarily focused on asserting expectations about your data, hence the name. With each test, you can fill in the blank of this statement with the purpose of the test: “I expect ——, so fail the test if this expectation is not met”.

Keep in mind that asserting expectations is a reactive testing style: the data has already broken your expectations by the time you are alerted. However, it prevents downstream models from breaking and draws attention to a possible root cause.

dbt_expectations is my go-to package for checking the values of a data field and ensuring they look as I expect them to. It is also helpful for testing dimensions of your dataset like the number of columns and rows and specific names of columns.

Some of my most-used tests from this package include expect_table_column_count_to_equal, expect_column_values_to_be_increasing, and expect_column_pair_values_A_to_be_greater_than_B.

Testing for schema changes

While this is a makeshift solution for testing for schema changes, it’s what I used before I discovered Elementary. Oftentimes, engineers may change the schema of a table without alerting the data team, causing the table to behave in a way you didn’t expect.

Testing the number of columns in a table using expect_table_column_count_to_equal is a helpful way to know whether a column was added or removed. Of course, this has its limitations, but it is better than not being alerted at all.

For example, if I expected the accounts table to always have 11 columns, I would apply this test on the source like so:

sources:
  - name: raw  # source name assumed; adjust to your project
    tables:
      - name: accounts
        tests:
          - dbt_expectations.expect_table_column_count_to_equal:
              value: 11

The object that you apply the test on varies between tests in dbt_expectations, so make sure you read the documentation on whether the test is applied to a source, model, column, or seed.

Testing integer values

dbt_expectations is great for testing columns derived by transformations, especially integer fields. I like to think this acts as unit testing in a way. Yes, it is testing the values of the data, but it is also testing to ensure the code produces the values as expected.

Cumulative, event-driven data models are commonly built in dbt. To test that the values of a field, such as company revenue, are increasing when ordered, you can use a test called expect_column_values_to_be_increasing.

To use this test, you must specify a sort_column, which is typically a date field. dbt_expectations assumes that the values aren’t already ordered in the model, using this field to sort the values. In an event-driven model, this is typically a date field representing when the metric was calculated.

If we want to ensure revenue by day is always increasing, we could write a test like so:

models:
  - name: revenue_by_day
    columns:
      - name: revenue
        tests:
          - dbt_expectations.expect_column_values_to_be_increasing:
              sort_column: date_day

Notice that this test is applied to a specific column rather than a model.

With this specific test there are also optional parameters such as row_condition, strictly, and group_by that you can add to further customize the test.
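The core of the test is simple to reason about: order the rows by the sort_column, then walk the value column pairwise. A minimal Python sketch, with a hypothetical strictly flag matching the optional parameter of the same name:

```python
def column_values_increasing(rows, value_column, sort_column, strictly=False):
    """Return True when value_column never decreases once rows are
    ordered by sort_column; with strictly=True, ties also fail."""
    ordered = [r[value_column] for r in sorted(rows, key=lambda r: r[sort_column])]
    pairs = list(zip(ordered, ordered[1:]))
    if strictly:
        return all(b > a for a, b in pairs)
    return all(b >= a for a, b in pairs)

rows = [
    {"date_day": "2024-01-02", "revenue": 120},
    {"date_day": "2024-01-01", "revenue": 100},  # stored out of order on purpose
    {"date_day": "2024-01-03", "revenue": 120},  # flat day
]

print(column_values_increasing(rows, "revenue", "date_day"))                 # True
print(column_values_increasing(rows, "revenue", "date_day", strictly=True))  # False
```

Note how the flat day passes by default but fails under strictly, which is exactly the kind of behavior the optional parameters let you tune.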

Testing multi-field attributes

dbt_expectations also allows you to test the combination of different fields, making it easy to test transformation code within your data models. I like to use the test expect_column_pair_values_A_to_be_greater_than_B to test the behavior of calculations within my models. This is particularly helpful for looking at any type of currency calculation like price, tax, revenue, or profit.

For example, if I want to make sure that the price charged to a customer is always more than the actual price of the product (because of tax), I would apply this test to the columns product_price and price_charged:

models:
  - name: order_charges
    tests:
      - dbt_expectations.expect_column_pair_values_A_to_be_greater_than_B:
          column_A: price_charged
          column_B: product_price
          or_equal: True
          row_condition: "discount_id is NULL"

In this test, I am choosing to set or_equal to true to account for the case of a state not charging any sales tax on the product.

I also set a row condition where the discount_id must be NULL. This means the test will only be applied to records that do not have a discount_id. Therefore, this test will only fail if the product_price is greater than the price_charged for records without discounts.
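Putting the pieces together, the test is a pairwise comparison with an optional filter. A minimal Python sketch (sample orders and the no_discount predicate are illustrative):

```python
def pair_a_greater_than_b(rows, column_a, column_b, or_equal=False, row_condition=None):
    """Return True when column_a beats column_b on every row that
    passes row_condition, mirroring
    expect_column_pair_values_A_to_be_greater_than_B."""
    for row in rows:
        if row_condition is not None and not row_condition(row):
            continue  # row excluded by the condition, skip it
        a, b = row[column_a], row[column_b]
        if not (a >= b if or_equal else a > b):
            return False
    return True

orders = [
    {"price_charged": 108, "product_price": 100, "discount_id": None},
    {"price_charged": 100, "product_price": 100, "discount_id": None},  # no-sales-tax state
    {"price_charged": 80, "product_price": 100, "discount_id": 7},      # discounted, excluded
]
no_discount = lambda row: row["discount_id"] is None

print(pair_a_greater_than_b(orders, "price_charged", "product_price",
                            or_equal=True, row_condition=no_discount))  # True
```

Without the row condition, the discounted order would make the check fail, which is why the filter matters.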

Package #3: Elementary

Elementary is a dbt package offered as both manual testing through its open-source product and automated testing through its cloud product. It offers a wider range of testing, covering anomalies in volume, freshness, and different dimensions of your data.

Elementary does a great job at covering the data issues that are most likely to slip through the cracks, making them hard to detect in downstream models and dashboards. It also allows you to customize anomaly tests using different parameters, ensuring you catch all anomalies and not just the basic ones based on record count.

Testing for volume anomalies

Anomaly tests in Elementary are very customizable depending on the data you have available and your biggest needs.

For volume tests, it allows you to specify a timestamp field that is then used to bucket your data, calculating metrics for each bucket and comparing them to one another. This is different from other volume tests that simply look at the number of rows ingested into a table by time period, comparing it to an expected range.

Elementary also allows you to specify a where_expression to filter any data that should not be included in the anomaly detection. This is helpful for filtering out any data added to a table for testing purposes.

In addition, there are many configurations that let you play around with how and when anomalies should be detected. seasonality, fail_on_zero, and detection_delay are just a few of these.

Adding all of these parameters to a test in Elementary would look like this:

models:
  - name: order_charges
    tests:
      - elementary.volume_anomalies:
          timestamp_column: order_created_at
          where_expression: "order_name NOT LIKE '%test%'"
          detection_delay:
            period: hour
            count: 3

Notice that you have a lot more customizability with Elementary tests, allowing you to cover all possible edge cases in your data. Elementary often saves me from needing to write a custom test that aims to detect a very specific issue in my data.
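The bucket-and-compare idea behind volume anomaly tests can be sketched in a few lines of Python. This is a toy z-score version, not Elementary’s actual detection algorithm, and the daily counts and threshold are made up for illustration:

```python
from statistics import mean, stdev

def volume_anomalies(daily_counts, z_threshold=2.0):
    """Flag buckets whose row count deviates from the mean of all
    buckets by more than z_threshold standard deviations."""
    values = list(daily_counts.values())
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # perfectly stable volume, nothing to flag
    return [day for day, count in daily_counts.items()
            if abs(count - mu) / sigma > z_threshold]

# Nine ordinary days of ~100 rows, then one day where ingestion broke
counts = {f"2024-01-{d:02d}": 100 for d in range(1, 10)}
counts["2024-01-10"] = 5

print(volume_anomalies(counts))  # ['2024-01-10']
```

Elementary layers much more on top of this (seasonality, detection delays, training windows), but the bucketed comparison is the core concept.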

Testing for schema changes

I mentioned earlier that there was a better solution to testing for the number of columns in a model. Elementary offers a schema_changes test that alerts you of deleted tables, deleted or added columns, and changes in datatype.

The best part? This is all done in one simple test.

When applied to a model, it looks like this:

models:
  - name: accounts
    tests:
      - elementary.schema_changes

This is a test you can easily apply to all of your models without much thought, ensuring you are the first to know when the structure of any table has been changed and has the possibility of affecting downstream data models.
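In essence, a schema-change check diffs two snapshots of a table’s columns and types. A minimal Python sketch of that comparison (the snapshot dicts are hypothetical; Elementary reads this metadata from your warehouse):

```python
def schema_changes(previous, current):
    """Compare two {column: datatype} snapshots of a table and
    report added columns, removed columns, and type changes."""
    changes = []
    for col in previous.keys() - current.keys():
        changes.append(f"column removed: {col}")
    for col in current.keys() - previous.keys():
        changes.append(f"column added: {col}")
    for col in previous.keys() & current.keys():
        if previous[col] != current[col]:
            changes.append(f"type changed: {col} {previous[col]} -> {current[col]}")
    return sorted(changes)

before = {"id": "int", "name": "text", "created_at": "timestamp"}
after = {"id": "bigint", "name": "text", "owner_id": "int"}

print(schema_changes(before, after))
# ['column added: owner_id', 'column removed: created_at', 'type changed: id int -> bigint']
```

An empty result means the schema is unchanged; anything else is exactly the kind of silent structural drift the test is designed to surface.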

Automated testing

If you’re looking to start utilizing tests in dbt right away, but without much effort, Elementary offers a cloud solution with automated testing. The cloud tool monitors your metadata without the need to write any tests, checking for freshness, volume, and schema changes.

This is a great option for someone with many sources and models written in dbt, but with no tests. It can be a heavy lift to add tests to all of these resources all at once. However, you also want to focus on data quality as soon as possible, as you never know what you could be missing.

Automated testing with Elementary Cloud is also a perfect option for the user who is just beginning to use dbt. It is a great way to monitor your data quality from the very start, ensuring you are following best practices along the way.

Conclusion

I recommend using a mix of all three of these testing packages in your dbt project. dbt_utils, dbt_expectations, and Elementary each have their own unique value that they provide your project.

dbt_utils allows for advanced primary key testing. dbt_expectations helps you assert expectations of your data onto models, sources, seeds, and fields. Elementary allows you to configure advanced anomaly detection and apply generic tests automatically when you don’t have the time to write them yourself.

There are also a bunch of other tests available to use within dbt, which you can find listed by use case in Elementary’s dbt test hub.

Together, these tests help you create a robust testing environment, helping you to monitor your data quality at every layer.
