Handling Multiple Granularities
Many queries mix granularities. For example, in the example derived from TPC-H, you might want to count both customers (on the customer entity) and orders (on the orders entity) in the same query.
For each metric, Honeydew automatically applies the right granularity and handles any related complexities.
There are two types of “join traps” that arise from mixing granularities:
Fan-out trap
When joining customer and orders, the resulting JOIN has a row per order (orders is the higher-granularity entity, with a many:1 relation to customer). Each order row carries a duplicated copy of its customer's data.
Counting customers on the joined table therefore returns a wrong result due to the row duplication - this is called a fan-out trap.
Honeydew automatically detects that situation and builds a query that deduplicates customer rows when counting them.
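To see the trap in plain SQL over the TPC-H tables (customer and orders, joined on o_custkey = c_custkey), here is a minimal hand-written illustration - not a query Honeydew generates:

```sql
-- Fan-out trap: the join has one row per order, with customer data repeated,
-- so the "customer count" actually counts order rows.
SELECT COUNT(c.c_custkey) AS customer_count
FROM customer AS c
JOIN orders AS o
  ON o.o_custkey = c.c_custkey;
```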
Chasm trap
When joining data through a conformed dimension, the resulting JOIN can have massive row duplication. For example, when a logins table and a visits table share a conformed users dimension, you may want to combine visits and logins.
Joining the two directly on user_id would result in row duplication: the combined granularity is visits x logins, meaning each visit row is duplicated for every login row of the same user, and vice versa. Not only does this lead to wrong results, it often causes query failures due to running out of memory.
Honeydew automatically detects that situation and builds a query that avoids the chasm join and computes efficiently.
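The general chasm-safe pattern is to aggregate each fact table to the shared dimension's grain before joining. A minimal sketch in plain SQL, assuming the users, visits, and logins tables from the example above each carry a user_id column - this shows the pattern, not Honeydew's exact generated SQL:

```sql
-- Naive chasm join (avoid): FROM visits v JOIN logins l ON v.user_id = l.user_id
-- yields visits x logins rows per user before any aggregation.

-- Chasm-safe pattern: aggregate each fact to the shared user grain first,
-- then join the already-aggregated (one row per user) results.
SELECT
    u.user_id,
    COALESCE(v.visit_count, 0) AS visit_count,
    COALESCE(l.login_count, 0) AS login_count
FROM users AS u
LEFT JOIN (
    SELECT user_id, COUNT(*) AS visit_count
    FROM visits
    GROUP BY user_id
) AS v ON v.user_id = u.user_id
LEFT JOIN (
    SELECT user_id, COUNT(*) AS login_count
    FROM logins
    GROUP BY user_id
) AS l ON l.user_id = u.user_id;
```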
Example
The count metric in customers is defined as a simple count, without a COUNT DISTINCT - since it is defined on the customer entity, it knows its "native" level of granularity.
When you build a dynamic dataset that combines order-level attributes with the customer count, joining orders and customer directly would result in a fan-out trap, skewing the customer count to over 150,000 in total.
In SQL queries, this particular case can be solved by doing
COUNT DISTINCT instead of COUNT - but that approach would not help for other types of aggregations like AVG or SUM.
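For illustration, here is the work-around in plain SQL and why it does not generalize, using the TPC-H customer account balance column c_acctbal as an example customer-level measure:

```sql
-- The COUNT DISTINCT work-around fixes the count: DISTINCT collapses the
-- customer keys that were duplicated by the fan-out.
SELECT COUNT(DISTINCT c.c_custkey) AS customer_count
FROM customer AS c
JOIN orders AS o ON o.o_custkey = c.c_custkey;

-- But it does not help other aggregations: each customer's balance appears
-- once per order, so this "average" is weighted by each customer's order count.
SELECT AVG(c.c_acctbal) AS avg_customer_balance
FROM customer AS c
JOIN orders AS o ON o.o_custkey = c.c_custkey;
```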
Honeydew instead creates query-time deduplication (see the SQL sketch below):
- Join orders and customers
- Deduplicate customer rows by the key c_custkey and the group
- Count the metric on the deduplicated data
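A hand-written sketch of that pattern in plain SQL - not the query Honeydew actually generates - using o_orderpriority as an assumed example group-by column:

```sql
WITH joined AS (
    -- Step 1: join orders and customers (fanned out: one row per order)
    SELECT o.o_orderpriority, c.c_custkey
    FROM orders AS o
    JOIN customer AS c ON c.c_custkey = o.o_custkey
),
deduplicated AS (
    -- Step 2: deduplicate customer rows by the key and the group
    SELECT DISTINCT o_orderpriority, c_custkey
    FROM joined
)
-- Step 3: count the metric on the deduplicated data
SELECT o_orderpriority, COUNT(c_custkey) AS customer_count
FROM deduplicated
GROUP BY o_orderpriority;
```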
Most of the time, there is no need to use COUNT DISTINCT in Honeydew - Honeydew knows the level of granularity and can build more performant queries by deduplicating rows only when needed.