dbt hub

dbt test: expect_column_values_to_be_within_n_stdevs

USE CASE

Values distribution
Anomaly detection

APPLIES TO

Column

This page explains the expect_column_values_to_be_within_n_stdevs test from the dbt-expectations package. This statistical test checks if the values of a specific column fall within a certain number of standard deviations (sigma) from the mean. It is useful for identifying outliers and ensuring that data points conform to expected statistical behaviors.

How it Works

The expect_column_values_to_be_within_n_stdevs test applies a statistical check to verify that column values are within a certain number of standard deviations from the column's mean. This is crucial in detecting values that deviate significantly from the norm, which might indicate data issues or anomalies.

Steps and Conditions:

  1. Column Selection: Define the column for which the deviation from the mean is to be tested.
  2. Grouping (Optional): Identify if the test should be applied to grouped data using the group_by parameter. This can be used to perform the test within subsets of the data.
  3. Define Sigma Threshold: Set the number of standard deviations (sigma_threshold) within which the column values should fall. The default threshold is 3, which corresponds to a standard rule for identifying outliers in a normal distribution.
  4. Calculation: The test calculates the mean and standard deviation for the selected column (and possibly within each group). It then checks whether each value in the column lies within the defined range from the mean.
  5. Outcome:
    • Pass: If all column values are within the specified number of standard deviations from the mean, the test passes, indicating no unusual deviation.
    • Fail: If any value exceeds this range, the test fails, flagging these values as potential outliers.

Example Usage: Fintech Company

For a Fintech company that processes numerous transactions daily, ensuring the transaction amounts are within expected limits is crucial to detect fraud or entry errors.

Consider a scenario where the transactions table records every transaction's amount in the amount column, and the table also keeps track of which branch each transaction occurred in using a branch_id column.


tests:
  - dbt_expectations.expect_column_values_to_be_within_n_stdevs:
      column: amount
      group_by: branch_id
      sigma_threshold: 3

In this example, the expect_column_values_to_be_within_n_stdevs test ensures that transaction amounts at each branch do not deviate significantly from the branch's typical transaction size by more than 3 standard deviations. Such monitoring is instrumental in early detection of anomalies like transaction fraud or processing errors at specific branches, maintaining the reliability and security of financial operations.

The only data observability platform built into your dbt code

  • Get monitors on your production tables out-of-the-box with zero configuration
  • Add tests to your code in bulk with a simple UI
  • Track test results over time
  • Set owners and create meaningful alerts
  • Triage incidents faster using our end-to-end column-level lineage graph
An illustration showing a debugging console with dbt run commands and an error message. The image includes stylized icons: a smiling orange bar chart, the purple dbt logo, and a magnifying glass containing a pink error icon.