Introduction

Honeydew entities can model many different data structures. However, it is very common in relational modeling to separate entities to facts and dimensions:

  1. Dimensions provide descriptive information for measurements. Dimensions typically represent business entities (such as a user, a product or a place), and include many properties for each instance. For example, a customer dimension will include per-customer information such as age or geography. In a star or snowflake schemas, dimensions are typically the leafs.
  2. Facts typically represent measurable events (i.e. transactions, website visits, etc). They typically include a time field and reference related dimensions (i.e. the transacting or visiting user) via foreign keys. In a star schema, the fact is the center of a star.

In a schema that has multiple facts (and thus multiple stars or snowflake schemas combined), dimension that more than more than one fact refers to are called Shared Dimensions.

The shared (also called conformed) dimension has the same meaning to every fact with which it relates.

For example, a user entity might appear in different facts describing things the user has done:

  • logins fact table describing application login events
  • visits fact table describing website visits events

Building Shared Dimensions

Honeydew automatically detects shared dimensions when more than one fact has a relationship to them.

There is no configuration required. Honeydew will create optimized queries that conform data to the shared dimension.

Honeydew never performs a fact-to-fact join on a shared dimension key, as that results in a chasm trap heavy query.

When calculating metrics on a shared dimensions (like logins and visits count per user), Honeydew will calculate them separately and only join once aggregated. That performance optimization is applied automatically.

Multiple Paths to Shared Dimensions

In more complex schemas where fact tables refer to other fact tables, more than one relationship path can be defined.

For example, can have a schema with two facts and a shared dimension:

  1. A customer shared dimension details information per customer.
  2. A orders fact entity has per-order data, and has a relation to the ordering customer.
  3. A lineitem fact entity has the line items for each order and has a direct relationship to the ordering customer.

In that case, there are two ways to relate lineitem and customer:

  1. Join directly lineitem and customer on lineitem.customer_id = customer.id
  2. Join lineitem to orders on order_id and them join orders to customers on orders.customer_id = customer.id

When there is more than one way to connected a shared dimension Honeydew requires a direct relationship to be defined.

Honeydew will always join via the direct relationship when multiple relationship paths are possible.

In the example above, Honeydew will join lineitem and customer directly when looking at line items of a customer.

When multiple paths are possible to connect a shared dimension, but there is no direct relation, then Honeydew will not be able to resolve a path. A direct relationship must be added.

When there are multiple paths to a shared dimension required in a query, and none of those paths is a direct connection, Honeydew will fail to compile.

Must create a direct connection for each entity to the shared dimension whenever multiple paths present.

Can use calculated attributes to copy join keys if are missing. For example if:

  1. Have a lineitem entity that relates to both customer and supplier.
  2. Both customer and supplier have a shared dimension called nation.

Then for lineitem would not know which nation to prefer when counting revenue by nation.

To resolve this situation:

  1. Create a calculated attribute in lineitem that equals to the preferred nation (for example that equals to customer.c_nationkey).
  2. Connect that calculated attribute directly to the shared nation entity.

Metrics that combine multiple facts

A common use case with shared dimensions is querying one fact based on information from another fact with a connected shared dimension.

For example, would want to build a metric that counts the visits of all users that had a login last month.

Since there is no direct connection between visits and logins to join on, need to go through the shared user in the users entity that takes data from the logins fact, so it can be used later on when joining users with visits.

Connect metrics through a shared dimension

Given the following schema:

  • Two facts: logins and visits
  • Connected through a single shared dimension: users

Build the following metric:

count_visits_of_recently_logged_in_users =
-- Count visits for users who also had logins in the last 30 days
visits.count FILTER(
   -- Did the visiting user had logins over the last 30 days?
   WHERE (
      -- Count logins
      logins.count
      -- Over last 30 days
      FILTER(WHERE logins.time >= CURRENT_DATE() - 30)
      -- For the shared user
      GROUP BY (users.user_id)
   ) > 0)

The internal metric on logins has a GROUP BY (users.id) qualifier that tells it to be calculated per user.

Since user is a shared dimension between logins and visits, the result can be applied to a metric in visits.

Note that if there are multiple shared dimensions, can connect through more than one.

Break to steps with a calculated attribute in the shared dimension

Use a calculated attribute when the connection between facts goes through a single shared dimension, and there is no dynamic calculation.

In that case, can built a simpler single fact metric that uses that calculated attribute.

To implement the question above with a calculated attribute approach:

  1. Build a calculated attribute in the users shared entity named logins_last_month defined as

    logins.count FILTER(WHERE logins.time >= CURRENT_DATE() - 30)
    

    This counts per user the logins (assuming logins has a count metric) in the last month (assuming a time attribute), at the granularity of a user (since the calculated attribute is part of the user entity). See also how metrics behave in calculated attributes.

  2. Build a metric in visits called count_visits_of_recently_logged_in_users defined as

    visits.count FILTER(WHERE users.logins_last_month > 0)
    

    This is a reusable metric based on data from two facts, counting first at the login fact granularity (logins last month) and then at the visit fact granularity (visits of users that logged in last month).