This page explains the expect_column_values_to_not_be_in_set test from the dbt-expectations package. This test is key for ensuring that the values in a specified column do not include any disallowed or undesirable entries. It is useful for filtering out unwanted data entries that can negatively impact the integrity of the dataset.
How it Works
The expect_column_values_to_not_be_in_set test verifies that none of the values in a targeted column appear in a specified set of unacceptable values. By enforcing this, the test helps maintain the purity of the data by preventing unwanted values from being present.
Steps and Conditions:
- Column Selection: Choose the column you wish to check for the presence of disallowed values.
- Define Unacceptable Value Set: Specify the set of values that should not be found in the chosen column using the
value_setargument. - Optional Configuration:
- Quote Values: Use the
quote_valuesargument to decide if the values invalue_setshould be treated as quoted. The default setting istrue. - Row Condition: The
row_conditionargument allows you to specify conditions that filter which rows are considered in the test, such as excluding rows where specific fields might be null.
- Quote Values: Use the
- Execution: The test applies the optional row condition and then checks each row in the specified column to ensure that its value is not included in the
value_set. - Outcome:
- Pass: If none of the values in the column are found in the unacceptable value set, the test passes.
- Fail: If any value in the column is detected in the set of disallowed values, the test fails, indicating a contamination of data that requires review and cleanup.
Example Usage: Fintech
For a Fintech company, maintaining accuracy and compliance in transaction categorization is crucial. Incorrect or unwanted transaction categories can lead to issues in reporting and regulatory compliance.
Consider a scenario where the transactions table includes a category column that classifies each financial transaction. Ensuring that this column does not include deprecated or erroneous categories is vital.
In this example, expect_column_values_to_not_be_in_set ensures that the category column in the transactions table does not contain the values 'undefined', 'miscategorized', or 'deprecated', particularly for transactions where the amount is greater than zero. This setup helps the fintech company to avoid misclassification and maintain accurate and compliant financial records.