Data Health Scores
This video introduces the concept of data health on Elementary's Cloud Platform, which provides a high-level health score for data, broken down into quality dimensions like completeness, accuracy, uniqueness, validity, and consistency. Elementary automatically maps dbt tests and anomaly monitors into these categories. Users can tag tables by business domain, see health scores for each domain, and identify gaps in test coverage, enabling better communication between data teams and data consumers.
Data health is a concept that exists only on the Cloud Platform. It facilitates conversations between the technical side of the data team and the data consumers and data analysts, and it lets you get a high-level total health score for your data as well as a breakdown into quality dimension scores.
Quality dimensions are a common framework in the industry. It's not something that we invented, but we incorporated it into the Platform. What's cool about it is that we automatically mapped all the tests that exist in the ecosystem, including the dbt package tests (dbt_utils, dbt_expectations), the native dbt tests, and all the Elementary anomaly detection monitors, into these quality dimension categories. So, for example, all not_null tests in dbt are counted as completeness, unique tests are counted as uniqueness, and expression_is_true, a common test in dbt_utils, is counted as accuracy because it usually validates accuracy.
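As an illustration of the mapping above, here is a minimal dbt `schema.yml` sketch (the model and column names are hypothetical) whose tests Elementary would classify into the completeness and uniqueness dimensions:

```yaml
# models/schema.yml (hypothetical model and column names)
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null   # mapped to the completeness dimension
          - unique     # mapped to the uniqueness dimension
```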
Accuracy represents business logic, constraints, or requirements that are true only for your business. Validity usually represents that the data has the right structure and format, so it typically covers validations around formats, string lengths, min and max ranges for your numeric values, and so on.
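A validity check on structure and format might look like the following sketch, assuming the dbt_expectations package is installed (the model, column, and regex are hypothetical):

```yaml
# models/schema.yml (hypothetical; assumes the dbt_expectations package)
version: 2

models:
  - name: customers
    columns:
      - name: email
        tests:
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: "^[^@]+@[^@]+\\.[^@]+$"   # format check, mapped to validity
```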
Accuracy is more like this: for example, if you are an e-commerce shop and you sell products priced between $100 and $200, you can implement a validation for that business requirement. You will then get visibility into the share of your dataset that does not adhere to it.
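The price-range requirement from the example could be sketched with the dbt_utils `accepted_range` test (the model and column names are hypothetical):

```yaml
# models/schema.yml (hypothetical; assumes the dbt_utils package)
version: 2

models:
  - name: products
    columns:
      - name: price
        tests:
          - dbt_utils.accepted_range:
              min_value: 100   # business rule: prices between $100 and $200
              max_value: 200   # mapped to the accuracy dimension
```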
Consistency means validating that there are no missing values between source and target. If you're familiar with the relationships test in dbt, for example, that is considered a consistency validation. Running aggregations on your source tables and comparing them to your public reporting tables is another way to validate source to target and confirm that the data is consistent between your sources and your end-of-pipeline tables. Here you can also use tags to slice and dice the scores. If you use tags and leverage them to group your tables by business domain, you will get a health score for that domain specifically.
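The relationships test mentioned above is declared like this sketch (model and column names are hypothetical):

```yaml
# models/schema.yml (hypothetical model and column names)
version: 2

models:
  - name: orders
    columns:
      - name: customer_id
        tests:
          - relationships:   # consistency: every order must map to a customer
              to: ref('customers')
              field: customer_id
```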
What we recommend is to take all the tables that are used for analytics, the ones served as an API to your data analysts, and tag them with a business domain. This way, you can see a score for all of these tables across the different quality dimensions, and you can also leverage that to uncover where your coverage gaps are.
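Tagging tables by business domain can be sketched like this in dbt (the model names and domain tags are hypothetical):

```yaml
# models/schema.yml (hypothetical; tags group tables by business domain)
version: 2

models:
  - name: sales_summary
    config:
      tags: ['sales']        # domain tag used to slice health scores
  - name: marketing_attribution
    config:
      tags: ['marketing']
```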
For example, if I see that for the sales domain I don't have any validations on my business requirements, I know that I have a coverage gap there, and once I implement them, I will start getting visibility into it. From there, I can go directly to the UI and start adding tests.