This page explains the expect_column_distinct_values_to_contain_set
test from the dbt-expectations package, which ensures that a set of specific values is present among the distinct values in a column. This test is useful for verifying the presence of critical categorical data elements within a dataset.
How it Works
The expect_column_distinct_values_to_contain_set
test identifies distinct values in a chosen column and checks whether all values listed in a predefined set are included among these distinct values. It is particularly beneficial for confirming the representation of essential values in the dataset, without requiring that all entries in the column are limited to this set.
Steps and Conditions:
- Column Selection: Choose the column that you intend to verify for the presence of certain values.
- Specify Required Values: Define the set of values that must appear at least once among the column's distinct entries.
- Optional Configuration:
- Quote Values: You may decide whether to treat string values as quoted (i.e., treated as strings). By default, this is set to
true
. - Row Condition: Specify a condition to limit the rows evaluated in this test. For instance, you might exclude rows where IDs are null if such rows are irrelevant or expected to be incomplete.
- Quote Values: You may decide whether to treat string values as quoted (i.e., treated as strings). By default, this is set to
- Execution: The test compiles the distinct values from the selected column, applies the row condition if present, and then checks if every value from the predefined set appears in this list of distinct column values.
- Outcome:
- Pass: If all values from the specified set are found among the distinct values of the column, the test passes.
- Fail: If one or more specified values are not found, the test fails, indicating missing expected values which might affect further analysis or operations.
Example Usage: E-commerce
An E-commerce company could use this test to ensure that key product categories are represented in the product listings. Ensuring that certain categories are always available can be essential for maintaining a diverse and attractive product portfolio.
Consider a scenario where the products
table has a category
column that categorizes each product. To verify that important product categories like "electronics" and "furniture" are always present in the catalog, you could set up this test as follows:
In this configuration, the test checks that the distinct categories of all active products must include "electronics" and "furniture". This is critical for the E-commerce platform to ensure that they continuously meet diverse customer demands across these significant product categories.