This page explains the expect_column_values_to_be_within_n_stdevs
test from the dbt-expectations package. This statistical test checks if the values of a specific column fall within a certain number of standard deviations (sigma) from the mean. It is useful for identifying outliers and ensuring that data points conform to expected statistical behaviors.
How it Works
The expect_column_values_to_be_within_n_stdevs
test applies a statistical check to verify that column values are within a certain number of standard deviations from the column's mean. This is crucial in detecting values that deviate significantly from the norm, which might indicate data issues or anomalies.
Steps and Conditions:
- Column Selection: Define the column for which the deviation from the mean is to be tested.
- Grouping (Optional): Identify if the test should be applied to grouped data using the
group_by
parameter. This can be used to perform the test within subsets of the data. - Define Sigma Threshold: Set the number of standard deviations (
sigma_threshold
) within which the column values should fall. The default threshold is 3, which corresponds to a standard rule for identifying outliers in a normal distribution. - Calculation: The test calculates the mean and standard deviation for the selected column (and possibly within each group). It then checks whether each value in the column lies within the defined range from the mean.
- Outcome:
- Pass: If all column values are within the specified number of standard deviations from the mean, the test passes, indicating no unusual deviation.
- Fail: If any value exceeds this range, the test fails, flagging these values as potential outliers.
Example Usage: Fintech Company
For a Fintech company that processes numerous transactions daily, ensuring the transaction amounts are within expected limits is crucial to detect fraud or entry errors.
Consider a scenario where the transactions
table records every transaction's amount in the amount
column, and the table also keeps track of which branch each transaction occurred in using a branch_id
column.
In this example, the expect_column_values_to_be_within_n_stdevs
test ensures that transaction amounts at each branch do not deviate significantly from the branch's typical transaction size by more than 3 standard deviations. Such monitoring is instrumental in early detection of anomalies like transaction fraud or processing errors at specific branches, maintaining the reliability and security of financial operations.