This page introduces the expect_column_values_to_not_match_regex
test from the dbt-expectations package, which is designed to ensure that strings in a particular column do not match a specified regular expression pattern. This test is essential for validating the format of data entries and ensuring they adhere to expected patterns or rules that define what should not appear in the data.
How it Works
The expect_column_values_to_not_match_regex
test evaluates each entry in a specified column, ensuring that none of these entries match a given regular expression pattern. This test is useful for enforcing data integrity by confirming that unwanted patterns or characters do not appear in the data values.
Steps and Conditions:
- Regular Expression Definition: Define the regular expression pattern that should not appear in any of the string values of the targeted column.
- Column Validation: Apply the test to the column, checking every value against the defined regular expression.
- Optional Configurations:
- Row Condition: Specify a condition that filters which rows are considered during the validation.
- Raw Regular Expression: Determine whether the regular expression is treated as a raw string, which affects how special characters in the regex are interpreted.
- Regex Flags: Set any flags that modify the behavior of the regex matching, such as case-insensitivity.
- Outcome:
- Pass: The test passes if none of the column's values match the regex pattern.
- Fail: The test fails if at least one value matches the pattern, indicating that the data contains unexpected or forbidden patterns.
Example Usage: Fintech Company
For a Fintech company, maintaining the security and format of sensitive data like personal identifiers or account numbers is critical. Consider using the expect_column_values_to_not_match_regex
test on a customer_identifiers
column in a database that stores sensitive customer information.
In this example, the regex pattern is designed to fail if it finds any strings that contain "SSN" or "CreditCard," regardless of case. This setup ensures that sensitive data types are not stored improperly in the customer_identifiers
column, aligning with best practices for data security in the financial sector.