Multiple Facts and Shared Dimensions
Introduction
Honeydew entities can model many different data structures. However, it is very common in relational modeling to separate entities to facts and dimensions:
- Dimensions provide descriptive information for measurements. Dimensions typically represent business entities (such as a user, a product or a place), and include many properties for each instance. For example, a customer dimension will include per-customer information such as age or geography. In a star or snowflake schemas, dimensions are typically the leafs.
- Facts typically represent measurable events (i.e. transactions, website visits, etc). They typically include a time field and reference related dimensions (i.e. the transacting or visiting user) via foreign keys. In a star schema, the fact is the center of a star.
In a schema that has multiple facts (and thus multiple stars or snowflake schemas combined), dimension that more than more than one fact refers to are called Shared Dimensions.
The shared (also called conformed) dimension has the same meaning to every fact with which it relates.
For example, a user
entity might appear in different facts describing things the user has done:
logins
fact table describing application login eventsvisits
fact table describing website visits events
Building Shared Dimensions
Honeydew automatically detects shared dimensions when more than one fact has a relationship to them.
There is no configuration required. Honeydew will create optimized queries that conform data to the shared dimension.
Honeydew never performs a fact-to-fact join on a shared dimension key, as that results in a chasm trap heavy query.
When calculating metrics on a shared dimensions (like logins and visits count per user), Honeydew will calculate them separately and only join once aggregated. That performance optimization is applied automatically.
Multiple Paths to Shared Dimensions
In more complex schemas where fact tables refer to other fact tables, more than one relationship path can be defined.
For example, can have a schema with two facts and a shared dimension:
- A
customer
shared dimension details information per customer. - A
orders
fact entity has per-order data, and has a relation to the ordering customer. - A
lineitem
fact entity has the line items for each order and has a direct relationship to the ordering customer.
In that case, there are two ways to relate lineitem
and customer
:
- Join directly
lineitem
andcustomer
onlineitem.customer_id = customer.id
- Join
lineitem
toorders
onorder_id
and them join orders to customers onorders.customer_id = customer.id
When there is more than one way to connected a shared dimension Honeydew requires a direct relationship to be defined.
Honeydew will always join via the direct relationship when multiple relationship paths are possible.
In the example above, Honeydew will join lineitem
and customer
directly when looking at line items of a customer.
When multiple paths are possible to connect a shared dimension, but there is no direct relation, then Honeydew will not be able to resolve a path. A direct relationship must be added.
When there are multiple paths to a shared dimension required in a query, and none of those paths is a direct connection, Honeydew will fail to compile.
Must create a direct connection for each entity to the shared dimension whenever multiple paths present.
Can use calculated attributes to copy join keys if are missing. For example if:
- Have a
lineitem
entity that relates to bothcustomer
andsupplier
. - Both
customer
andsupplier
have a shared dimension callednation
.
Then for lineitem
would not know which nation
to prefer when counting revenue by nation.
To resolve this situation:
- Create a calculated attribute in
lineitem
that equals to the preferred nation (for example that equals tocustomer.c_nationkey
). - Connect that calculated attribute directly to the shared
nation
entity.
Metrics that combine multiple facts
A common use case with shared dimensions is querying one fact based on information from another fact with a connected shared dimension.
For example, would want to build a metric that counts the visits of all users that had a login last month.
Since there is no direct connection between visits and logins to join on, need to go through the shared user in the users
entity
that takes data from the logins
fact, so it can be used later on when joining users with visits
.
Connect metrics through a shared dimension
Given the following schema:
- Two facts:
logins
andvisits
- Connected through a single shared dimension:
users
Build the following metric:
The internal metric on logins
has a GROUP BY (users.id)
qualifier that tells it to be calculated per user.
Since user is a shared dimension between logins
and visits
, the result can be applied to a metric in visits.
Note that if there are multiple shared dimensions, can connect through more than one.
Break to steps with a calculated attribute in the shared dimension
Use a calculated attribute when the connection between facts goes through a single shared dimension, and there is no dynamic calculation.
In that case, can built a simpler single fact metric that uses that calculated attribute.
To implement the question above with a calculated attribute approach:
-
Build a calculated attribute in the
users
shared entity namedlogins_last_month
defined asThis counts per user the logins (assuming logins has a
count
metric) in the last month (assuming atime
attribute), at the granularity of a user (since the calculated attribute is part of the user entity). See also how metrics behave in calculated attributes. -
Build a metric in
visits
calledcount_visits_of_recently_logged_in_users
defined asThis is a reusable metric based on data from two facts, counting first at the login fact granularity (logins last month) and then at the visit fact granularity (visits of users that logged in last month).
Was this page helpful?