dbt hub

dbt test: expect_column_distinct_count_to_be_less_than

USE CASE

Distinct count

APPLIES TO

Column

This page explains the expect_column_distinct_count_to_be_less_than test from the dbt-expectations package. This test checks that the number of distinct values in a specified column is less than a certain threshold. It's useful when you need to enforce a maximum limit on the variability or diversity of the data within a column.

How it Works

The expect_column_distinct_count_to_be_less_than test verifies that the number of unique entries in a column does not exceed a pre-set limit. This function is important for cases where high diversity might indicate incorrect data grouping, duplicate entries, or other inconsistencies.

Steps and Conditions:

  1. Column Selection: Choose the column you want to evaluate for distinct value count.
  2. Define a Maximum Threshold: Establish a maximum allowable number of distinct values using the value parameter.
  3. Optional Configuration:
    • Quote Values: Use the quote_values argument to determine if values should be quoted, default is true.
    • Group By: Utilize the group_by parameter to specify grouping by one or more columns, which is valuable for evaluating distinct counts in segmented datasets.
    • Row Condition: Employ the row_condition parameter to filter the rows that will be included in the test.
  4. Execution: After applying any row filters and groups, the test counts the distinct values in the target column and compares this count against the set threshold.
  5. Outcome:
    • Pass: The test passes if the number of distinct values is less than the threshold, confirming control over data variability.
    • Fail: If the distinct count exceeds the threshold, the test fails, signaling potential issues in data collection or categorization.

Example Usage: Fintech

For a Fintech company, monitoring the number of unique transaction types processed through an application is critical to ensure that all transactions are within expected boundaries and to prevent any unusual activities.

Consider a scenario where the transaction_types table records types of transactions processed, and a column type_id identifies each transaction type.


tests:
  - dbt_expectations.expect_column_distinct_count_to_be_less_than:
      column: type_id
      value: 5
      row_condition: "date >= '2023-01-01'"

In this example, the expect_column_distinct_count_to_be_less_than test is used to ensure that the type_id column in the transaction_types model does not have more than 5 distinct transaction types since January 1, 2023. This setup helps maintain confidence in the consistency and categorization of transaction types handled by the fintech platform.

The only data observability platform built into your dbt code

  • Get monitors on your production tables out-of-the-box with zero configuration
  • Add tests to your code in bulk with a simple UI
  • Track test results over time
  • Set owners and create meaningful alerts
  • Triage incidents faster using our end-to-end column-level lineage graph