dbt hub

dbt test: column_anomalies

USE CASE

Anomaly detection
Missing values
String format

APPLIES TO

Column

This page details the column_anomalies test in dbt (data build tool), part of the anomaly detection suite provided by Elementary. This test focuses on monitoring and detecting anomalies at the column level within a table, based on specific metrics like missing data counts or value lengths. It's designed to ensure that each column's data remains consistent and within expected patterns, adjusting its checks based on the data type of each column.

How it Works

The column_anomalies test performs targeted anomaly detection on selected columns by analyzing specific aspects of the data, such as the count of missing values or the minimum length of string values. It can be customized to exclude certain columns and to focus on particular time periods or conditions, making it a flexible tool for maintaining data integrity.

Steps and Conditions:

  1. Column Selection and Configuration: The test targets specified columns, applying configured anomaly checks like missing_count or min_length.
  2. Data Type and Metric Analysis: It evaluates the data based on the selected metrics, considering the column's data type.
  3. Customization: Users can exclude columns, define where_expression conditions, set time buckets for temporal analysis, and adjust other parameters to tailor the test.
  4. Outcome:
    • Pass: If the data in the column meets the expected patterns and thresholds for the selected metrics, the test passes, indicating no detected anomalies.
    • Fail: If the data deviates significantly from expected patterns or thresholds, the test fails, signaling potential issues that may need investigation.

Example Usage: Videogame Company

For a Videogame company, monitoring player engagement and activity is crucial. The column_anomalies test can be applied to the login_events table to ensure that player login data, such as user_name, does not exhibit unexpected anomalies, like a sudden increase in missing values or a decrease in the minimum length of usernames, which could indicate issues with data collection or processing.

Consider a scenario where the login_events table tracks each player's login events, including details like user_name and the event's country_name. Monitoring for anomalies in the user_name column, such as an unusual number of missing values or changes in the typical length of usernames, can help identify potential issues early.


models:
  - name: login_events
    config:
      elementary:
        timestamp_column: 'loaded_at'
    columns:
      - name: user_name
        tests:
          - elementary.column_anomalies:
              column_anomalies:
                - missing_count
                - min_length
              where_expression: "event_type in ('event_1', 'event_2') and country_name != 'unwanted country'"
              time_bucket:
                period: day
                count: 1
              tags: ['elementary']

In this example, the column_anomalies test is configured to monitor the user_name column for anomalies related to missing values and minimum length, with a daily analysis period. This setup helps the Videogame company detect and address potential data quality issues, ensuring that player engagement metrics are accurate and reliable.

The only data observability platform built into your dbt code

  • Get monitors on your production tables out-of-the-box with zero configuration
  • Add tests to your code in bulk with a simple UI
  • Track test results over time
  • Set owners and create meaningful alerts
  • Triage incidents faster using our end-to-end column-level lineage graph