This page explains the expect_column_values_to_not_be_in_set
test from the dbt-expectations package. This test is key for ensuring that the values in a specified column do not include any disallowed or undesirable entries. It is useful for filtering out unwanted data entries that can negatively impact the integrity of the dataset.
How it Works
The expect_column_values_to_not_be_in_set
test verifies that none of the values in a targeted column appear in a specified set of unacceptable values. By enforcing this, the test helps maintain the purity of the data by preventing unwanted values from being present.
Steps and Conditions:
- Column Selection: Choose the column you wish to check for the presence of disallowed values.
- Define Unacceptable Value Set: Specify the set of values that should not be found in the chosen column using the
value_set
argument. - Optional Configuration:
- Quote Values: Use the
quote_values
argument to decide if the values invalue_set
should be treated as quoted. The default setting istrue
. - Row Condition: The
row_condition
argument allows you to specify conditions that filter which rows are considered in the test, such as excluding rows where specific fields might be null.
- Quote Values: Use the
- Execution: The test applies the optional row condition and then checks each row in the specified column to ensure that its value is not included in the
value_set
. - Outcome:
- Pass: If none of the values in the column are found in the unacceptable value set, the test passes.
- Fail: If any value in the column is detected in the set of disallowed values, the test fails, indicating a contamination of data that requires review and cleanup.
Example Usage: Fintech
For a Fintech company, maintaining accuracy and compliance in transaction categorization is crucial. Incorrect or unwanted transaction categories can lead to issues in reporting and regulatory compliance.
Consider a scenario where the transactions
table includes a category
column that classifies each financial transaction. Ensuring that this column does not include deprecated or erroneous categories is vital.
In this example, expect_column_values_to_not_be_in_set
ensures that the category
column in the transactions
table does not contain the values 'undefined', 'miscategorized', or 'deprecated', particularly for transactions where the amount is greater than zero. This setup helps the fintech company to avoid misclassification and maintain accurate and compliant financial records.