customer entity) and orders (on the orders entity).
For each metric, Honeydew automatically applies the right granularity and handles any related complexities.
There are two types of “join traps” that arise from mixing granularities:
For example, when joining customer and orders, the resulting JOIN would have a row per order (orders is the higher granularity, with a many:1 relation to customers). Each order row carries the same duplicated customer data. Trying to count customers on the joined table therefore produces a wrong result due to row duplication - this is called a fan-out trap.
Honeydew automatically detects that situation and builds a query that deduplicates customer rows when counting them.
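The fan-out trap is easy to reproduce directly. The sketch below uses SQLite with hypothetical customer and orders tables (the schema and keys are illustrative, not Honeydew's): a naive COUNT over the joined table counts one row per order, while deduplicating on the customer key restores the correct count.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY);
    CREATE TABLE orders (o_orderkey INTEGER PRIMARY KEY,
                         o_custkey  INTEGER REFERENCES customer(c_custkey));
    INSERT INTO customer VALUES (1), (2);
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")

# Naive COUNT over the joined table counts one row per order: 3, not 2.
fanned = con.execute("""
    SELECT COUNT(c.c_custkey)
    FROM customer c JOIN orders o ON o.o_custkey = c.c_custkey
""").fetchone()[0]

# Deduplicating the customer key yields the true customer count: 2.
correct = con.execute("""
    SELECT COUNT(DISTINCT c.c_custkey)
    FROM customer c JOIN orders o ON o.o_custkey = c.c_custkey
""").fetchone()[0]

print(fanned, correct)  # 3 2
```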
For example, a logins table and a visits table that share a conformed users dimension may need to be combined. Joining the two directly on user_id would result in row duplication: the combined granularity is visits × logins, meaning each visit row is duplicated for every login row of the same user, and vice versa. Not only does this lead to errors, it usually results in query failure due to running out of memory. This is called a chasm trap.
Honeydew automatically detects that situation and builds a query that avoids the chasm join and computes efficiently.
The count metric in customers is defined with COUNT DISTINCT - since it is defined on the customer entity, Honeydew knows its "native" granularity level.
When you build a dynamic dataset such as:
orders
and customer
directly would result in a fan-out trap, skewing the customer count to over 150,000 in total.
In SQL queries, this particular case can be solved by using COUNT DISTINCT instead of COUNT - but that approach would not help for other types of aggregations such as AVG or SUM.
Honeydew instead applies query-time deduplication: it reduces the joined rows to distinct rows on the customer key (c_custkey) and only then aggregates. Because Honeydew knows the level of granularity, it can build more performant queries by building distinct rows only when needed.
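To see why deduplication must happen on the key rather than on the aggregated value, consider a SUM over a fanned-out join. The sketch below (hypothetical SQLite schema; c_acctbal is an illustrative per-customer balance) shows that SUM over the join is inflated, SUM(DISTINCT value) is also wrong when two customers legitimately share a value, and only collapsing to one row per customer key before aggregating, as a query-time deduplication step does, gives the right answer.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_acctbal REAL);
    CREATE TABLE orders (o_orderkey INTEGER PRIMARY KEY, o_custkey INTEGER);
    -- Two customers with the same balance; customer 1 has two orders.
    INSERT INTO customer VALUES (1, 50.0), (2, 50.0);
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")

join = "FROM customer c JOIN orders o ON o.o_custkey = c.c_custkey"

# Fanned-out SUM counts customer 1's balance twice: 150.0 instead of 100.0.
fanned = con.execute(f"SELECT SUM(c.c_acctbal) {join}").fetchone()[0]

# SUM(DISTINCT value) collapses the two equal balances to one: 50.0, also wrong.
naive_distinct = con.execute(
    f"SELECT SUM(DISTINCT c.c_acctbal) {join}").fetchone()[0]

# Query-time deduplication: distinct rows per customer key, then aggregate.
dedup = con.execute(f"""
    SELECT SUM(c_acctbal) FROM (
        SELECT DISTINCT c.c_custkey, c.c_acctbal {join}
    )
""").fetchone()[0]

print(fanned, naive_distinct, dedup)  # 150.0 50.0 100.0
```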