This page explains the expect_column_proportion_of_unique_values_to_be_between
test from the dbt-expectations package. This test verifies that the proportion of unique values within a specified column falls between a given minimum and maximum value, ensuring the appropriate level of uniqueness within the data. This test is useful for assessing data consistency and uniqueness requirements.
How it Works
The expect_column_proportion_of_unique_values_to_be_between
test calculates the proportion of unique values in the chosen column and checks whether this proportion lies between the specified minimum and maximum boundaries.
Steps and Conditions:
- Column Selection: Choose the column to evaluate for unique value proportion.
- Define Limits: Set the minimum (
min_value
) and maximum (max_value
) values for the acceptable range of the proportion of unique values. - Optional Configurations:
- Group By: This parameter (
group_by
) allows the test to be applied to distinct groups within the data, aiding in granular analysis across different segments. - Row Condition: Use
row_condition
to filter which rows are considered in the calculation, based on specified conditions. - Strict Comparison: Specify if the bounds should be strictly enforced (
strictly: true
) or if they should include equality (strictly: false
).
- Group By: This parameter (
- Execution: Calculate the unique value proportion in the column, considering any groupings or conditions defined. Compare this proportion to the set range.
- Outcome:
- Pass: If the unique value proportion is within the specified range, the test passes.
- Fail: If the proportion is outside the specified range, the test fails.
Example Usage: E-commerce Company
In an E-commerce platform, managing the variety of products offered can be crucial for customer satisfaction. Consider a scenario where the product_listing
table records details about products, including a category_id
column that signifies different product categories.
In this example, the test checks that the proportion of unique category_id
values per store_id
for active listings falls between 10% and 50%. This range ensures that each store offers a healthy variety of product categories, not too limited (less than 10%) and not overly diversified to the point of dilution (>50%). This balance is critical for optimizing inventory and satisfying diverse customer preferences.