Mixed Granularities
Handling Multiple Granularities
Many queries over real data mix granularities. For example, in the TPC-H derived model you might want a single query that counts both customers (on the `customer` entity) and orders (on the `orders` entity).
For each metric, Honeydew automatically applies the right granularity and handles any related complexities.
There are two types of “join traps” that arise from mixing granularities:
Fan-out trap
When joining `customer` and `orders`, the resulting JOIN has a row per order (orders is the higher granularity, having a many-to-one relation to a customer). Each order row carries a copy of the same customer data, so customer data is duplicated across rows.
Trying to count customers on the joined table would therefore overcount due to the row duplication. This is called a fan-out trap.
Honeydew automatically detects that situation and builds a query that deduplicates customer rows when counting them.
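The fan-out can be seen with a tiny in-memory example. This is an illustrative sketch using SQLite with made-up toy data, not Honeydew's actual generated SQL:

```python
import sqlite3

# Toy schema: 2 customers, 3 orders (many-to-one to customer).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_name TEXT);
    CREATE TABLE orders   (o_orderkey INTEGER PRIMARY KEY, o_custkey INTEGER);
    INSERT INTO customer VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders   VALUES (10, 1), (11, 1), (12, 2);
""")

# The join fans out: one row per order, customer data duplicated.
naive = con.execute("""
    SELECT COUNT(c.c_custkey)
    FROM customer c JOIN orders o ON o.o_custkey = c.c_custkey
""").fetchone()[0]

dedup = con.execute("""
    SELECT COUNT(DISTINCT c.c_custkey)
    FROM customer c JOIN orders o ON o.o_custkey = c.c_custkey
""").fetchone()[0]

print(naive)  # 3 -- inflated by the fan-out
print(dedup)  # 2 -- the true customer count
```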
Chasm trap
When joining data through a conformed dimension, the resulting JOIN can have massive row duplication.
For example, given a `logins` table and a `visits` table that share a conformed `users` dimension, you may want to combine visits and logins.
Joining the two directly on `user_id` would result in row duplication: the combined granularity is visits × logins, meaning each visit row is duplicated for every login row of the same user, and vice versa. Not only does this lead to incorrect results, it usually causes query failure due to running out of memory.
Honeydew automatically detects that situation and builds a query that avoids a chasm join and can compute in a performant manner.
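The chasm, and the aggregate-then-join plan that avoids it, can be sketched the same way. Again this is illustrative SQLite with toy data, not Honeydew's actual generated SQL:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE visits (user_id INTEGER, page TEXT);
    CREATE TABLE logins (user_id INTEGER, device TEXT);
    INSERT INTO visits VALUES (1,'home'),(1,'pricing'),(1,'docs'),(2,'home');
    INSERT INTO logins VALUES (1,'web'),(1,'mobile'),(2,'web');
""")

# Chasm join: user 1 contributes 3 visits x 2 logins = 6 rows,
# user 2 contributes 1 x 1 = 1 row. Counts over this are wrong.
chasm_rows = con.execute("""
    SELECT COUNT(*) FROM visits v JOIN logins l ON v.user_id = l.user_id
""").fetchone()[0]

# Safe plan: aggregate each fact table to the shared grain first,
# then join the (already one-row-per-user) aggregates.
safe = con.execute("""
    SELECT v.user_id, v.n_visits, l.n_logins
    FROM (SELECT user_id, COUNT(*) AS n_visits FROM visits GROUP BY user_id) v
    JOIN (SELECT user_id, COUNT(*) AS n_logins FROM logins GROUP BY user_id) l
      ON v.user_id = l.user_id
    ORDER BY v.user_id
""").fetchall()

print(chasm_rows)  # 7 rows instead of the expected per-user grain
print(safe)        # [(1, 3, 2), (2, 1, 1)] -- correct per-user counts
```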
Example
The `count` metric on the `customers` entity is defined as a plain COUNT at the customer level.
Note that this is not a COUNT DISTINCT: since the metric is defined on the `customer` entity, Honeydew knows its “native” granularity.
When you build a dynamic dataset that selects only this metric, it is counted on the customer table alone (150,000 in the TPCH_SF1 sample data).
However, if you want to break it down by order type, the metric must be computed across a join. Joining `orders` and `customer` directly would result in a fan-out trap, skewing the total customer count to well over 150,000.
In SQL, this particular case can be solved by using COUNT DISTINCT instead of COUNT, but that approach does not help for other types of aggregations such as AVG or SUM.
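To see why DISTINCT does not generalize, consider AVG over a fanned-out join. This is an illustrative SQLite sketch with made-up values (`c_acctbal` is a real TPC-H customer column, but the data here is toy):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_acctbal REAL);
    CREATE TABLE orders   (o_orderkey INTEGER PRIMARY KEY, o_custkey INTEGER);
    -- Two customers share the same balance; a third has another.
    INSERT INTO customer VALUES (1, 100.0), (2, 100.0), (3, 400.0);
    INSERT INTO orders   VALUES (10, 1), (11, 1), (12, 2), (13, 3);
""")

# The correct average balance, computed at the customer grain.
true_avg = con.execute(
    "SELECT AVG(c_acctbal) FROM customer").fetchone()[0]

# The fan-out skews AVG toward customers with more orders.
fanout_avg = con.execute("""
    SELECT AVG(c.c_acctbal)
    FROM customer c JOIN orders o ON o.o_custkey = c.c_custkey
""").fetchone()[0]

# AVG(DISTINCT ...) is also wrong: it collapses the two equal balances.
distinct_avg = con.execute("""
    SELECT AVG(DISTINCT c.c_acctbal)
    FROM customer c JOIN orders o ON o.o_custkey = c.c_custkey
""").fetchone()[0]

print(true_avg)      # 200.0
print(fanout_avg)    # 175.0 -- skewed by duplicated customer rows
print(distinct_avg)  # 250.0 -- DISTINCT on values, not on customers
```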
Honeydew instead performs query-time deduplication:
- Join orders and customers
- Deduplicate customer rows by the key `c_custkey` and the group-by columns
- Count the metric on the deduplicated data
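The steps above can be sketched in plain SQL (via SQLite here, on toy data with a made-up `o_orderstatus` breakdown; Honeydew's actual generated query will differ):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY);
    CREATE TABLE orders   (o_orderkey INTEGER PRIMARY KEY,
                           o_custkey INTEGER, o_orderstatus TEXT);
    INSERT INTO customer VALUES (1), (2), (3);
    INSERT INTO orders VALUES (10, 1, 'O'), (11, 1, 'O'),
                              (12, 2, 'F'), (13, 3, 'O');
""")

# Steps 1-2: join, then deduplicate by (group column, customer key).
# Step 3: count the metric on the deduplicated rows.
result = con.execute("""
    SELECT o_orderstatus, COUNT(c_custkey) AS customer_count
    FROM (SELECT DISTINCT o.o_orderstatus, c.c_custkey
          FROM customer c JOIN orders o ON o.o_custkey = c.c_custkey) AS d
    GROUP BY o_orderstatus
    ORDER BY o_orderstatus
""").fetchall()

print(result)  # [('F', 1), ('O', 2)] -- customer 1's two orders count once
```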
Most of the time, there is no need to use COUNT DISTINCT in Honeydew: since Honeydew knows the level of granularity, it can build more performant queries that produce distinct rows only when needed.