Handling Multiple Granularities

Many queries in data will mix granularities, for example in the TPC-H derived example might want in the same query to count customers (on customer entity) and orders (on orders entity).

For each metric, Honeydew automatically applies the right granularity and handles any related complexities.

There are two types of “join traps” that arise from mixing granularities:

Fan-out trap

When joining customer and orders the resulting JOIN would have a row per order (as orders is the high granularity having a many:1 relation to a customer). The row will include duplicated data per customer - each order row will have the same customer data.

Trying to count customers on the joined table will result in a data error due to row duplication - this is called a fan-out trap.

Honeydew automatically detects that situation and builds a query that deduplicates customer rows when counting them.

Chasm trap

When joining data with a conformed dimension, the resulting JOIN can have massive row duplication.

For example when having a logins table and a visits table with a shared users conformed dimension, may want to combine visits and logins.

Joining the two directly on user_id would result in row duplication: the combined granularity is visits x logins, that is each visit row will have duplicated every login row of same user and vice versa. Not only this leads to errors, this usually results in query failure due to limited memory.

Honeydew automatically detects that situation and builds a query that avoids a chasm join and can compute in a performant manner.

Example

The count metric in customers is defined this way:

COUNT(customer.c_custkey)

Note this is not a COUNT DISTINCT - since it is on the customer entity it knows its “native” level.

When you build a dynamic dataset such as:

attributes: []
metrics:
  - customer.count
filters: []

It will count it on the customer table (150,000 in the TPCH_SF1 sample data).

However, if would want to break it down by order type:

attributes:
  - orders.o_orderstatus
metrics:
  - customer.count
filters: []

Joining orders and customer directly would result in a fan-out trap, skewing the customer count to over 150,000 in total.

In SQL queries, this particular case can be solved by doing COUNT DISTINCT instead of COUNT - but that approach would not help for other types of aggregations like AVG or SUM.

Honeydew instead creates query-time deduplication:

  1. Join orders and customers
  2. Deduplicate customer rows by the key c_custkey and the group
  3. Count the metric on deduplicated data

Most of the time, there is no need to use COUNT DISTINCT in Honeydew - Honeydew knows the level of granularity, and can build more performant queries by building distinct rows only when needed.

Was this page helpful?