dbt hub

dbt test: expect_column_distinct_count_to_be_greater_than

USE CASE

Distinct count

APPLIES TO

Column

This page explains the expect_column_distinct_count_to_be_greater_than test in dbt, which verifies that the distinct count of values in a specific column exceeds a predetermined threshold. This test is important for assessing the variability and comprehensiveness of the data in a column, ensuring that there's sufficient diversity for meaningful analysis.

How it Works

The expect_column_distinct_count_to_be_greater_than test calculates the unique count of entries in a designated column and checks if this number surpasses the set threshold, indicating adequate diversity in the data values.

Steps and Conditions:

  1. Column Selection: Choose the relevant column whose distinct value count needs to be verified.
  2. Define Threshold: Determine the minimum number of unique entries required in the column, specified with the value parameter.
  3. Optional Configuration:
    • Quote Values: Determine if the values should be quoted, with true as the default.
    • Group By: Optionally group the data by other columns to assess distinct counts within grouped subsets.
    • Row Condition: Filter which rows to include in the count based on a specified condition, like excluding null or specific values.
  4. Execution: Execute the test by applying any row conditions or groupings, then compute the distinct count in the column and compare it against the threshold.
  5. Outcome:
    • Pass: The distinct count exceeds the threshold, confirming sufficient diversity in the data.
    • Fail: The distinct count does not meet the threshold, indicating a potential issue with data variety or completeness.

Example Usage: E-commerce

In an E-commerce context, keeping track of the diversity in product offerings can be essential for market competitiveness and customer satisfaction.

Consider a situation where the product_listings table contains entries for each product on an E-commerce platform with a category_id column that helps in categorizing products.


tests:
  - dbt_expectations.expect_column_distinct_count_to_be_greater_than:
      column: category_id
      value: 50
      row_condition: "listing_date >= '2023-01-01'"

In this scenario, the expect_column_distinct_count_to_be_greater_than test is set to ensure that the category_id column in the product_listings table has more than 50 distinct values for listings added from January 1, 2023, onwards. This check helps confirm that the platform offers a wide range of product categories, which is crucial for attracting diverse customer interests and maintaining a competitive product portfolio.

The only data observability platform built into your dbt code

  • Get monitors on your production tables out-of-the-box with zero configuration
  • Add tests to your code in bulk with a simple UI
  • Track test results over time
  • Set owners and create meaningful alerts
  • Triage incidents faster using our end-to-end column-level lineage graph