dbt hub

dbt test: expect_column_distinct_values_to_be_in_set

USE CASE

Values set

APPLIES TO

Column

This page describes the expect_column_distinct_values_to_be_in_set test from the dbt-expectations package. This test checks that all distinct values in a selected column are contained within a predefined set. It’s useful for ensuring that column data conforms to expected categorical outcomes without unexpected or erroneous values.

How it Works

The expect_column_distinct_values_to_be_in_set test verifies that every distinct value in a specified column is part of a given set. If any distinct value is not in this set, the test will fail, signaling discrepancies that require attention.

Steps and Conditions:

  1. Column Selection: Choose the column to analyze for distinct values.
  2. Define Value Set: List all acceptable distinct values for that column, using the value_set argument.
  3. Optional Configuration:
    • Quote Values: Optionally, specify whether values should be quoted with quote_values. The default setting is true.
    • Row Condition: Use row_condition to include only certain rows based on a condition. This can focus the test on relevant data subsets.
  4. Evaluation: The test gathers all distinct values from the column, considering any row conditions, and checks if each is included in the specified set.
  5. Outcome:
    • Pass: If all distinct column values are found within the specified set, the test passes.
    • Fail: If any distinct value in the column is outside the set, the test fails.

Example Usage: B2B SaaS

In a B2B SaaS context, ensuring consistent data labeling in a product's feature usage can be pivotal. Consider a scenario with a feature_access_log table that records each time a customer accesses a feature, noted in a column named feature_name.


models:
  - name: feature_access_log
    columns:
      - name: feature_name
        tests:
          - dbt_expectations.expect_column_distinct_values_to_be_in_set:
              value_set: ['dashboard', 'report', 'upload', 'download']
              row_condition: "access_date >= '2023-01-01'"

In this example, the test ensures that only the features 'dashboard', 'report', 'upload', and 'download' are logged in the feature_name column for entries since January 1, 2023. This helps the B2B SaaS company maintain precise tracking of feature usage, which is crucial for product management and customer success tasks.

The only data observability platform built into your dbt code

  • Get monitors on your production tables out-of-the-box with zero configuration
  • Add tests to your code in bulk with a simple UI
  • Track test results over time
  • Set owners and create meaningful alerts
  • Triage incidents faster using our end-to-end column-level lineage graph