Introduction
Honeydew entities can model many different data structures. However, it is very common in relational modeling to separate entities to facts and dimensions:- Dimensions provide descriptive information for measurements. Dimensions typically represent business entities (such as a user, a product or a place), and include many properties for each instance. For example, a customer dimension will include per-customer information such as age or geography. In a star or snowflake schemas, dimensions are typically the leafs.
- Facts typically represent measurable events (i.e. transactions, website visits, etc). They typically include a time field and reference related dimensions (i.e. the transacting or visiting user) via foreign keys. In a star schema, the fact is the center of a star.
user entity might appear in different facts describing things the user has done:
loginsfact table describing application login eventsvisitsfact table describing website visits events
Building Shared Dimensions
Honeydew automatically detects shared dimensions when more than one fact has a relationship to them. There is no configuration required. Honeydew will create optimized queries that conform data to the shared dimension.Honeydew never performs a fact-to-fact join on a shared dimension key, as that results in a
chasm trap heavy query.When calculating metrics on a shared
dimensions (like logins and visits count per user), Honeydew will calculate them separately and only join once
aggregated. That performance optimization is applied automatically.
Multiple Paths to Shared Dimensions
In more complex schemas where fact tables refer to other fact tables, more than one relationship path can be defined. For example, can have a schema with two facts and a shared dimension:- A
customershared dimension details information per customer. - A
ordersfact entity has per-order data, and has a relation to the ordering customer. - A
lineitemfact entity has the line items for each order and has a direct relationship to the ordering customer.
lineitem and customer:
- Join directly
lineitemandcustomeronlineitem.customer_id = customer.id - Join
lineitemtoordersonorder_idand them join orders to customers onorders.customer_id = customer.id
Honeydew will always join via the direct relationship when multiple relationship paths are possible.In the example above, Honeydew will join
lineitem and customer directly when looking at line items of a customer.When there are multiple paths to a shared dimension required in a query, and none of those paths is a direct connection, Honeydew will fail to compile.Must create a direct connection for each entity to the shared dimension whenever multiple paths present.Can use calculated attributes to copy join keys if are missing. For example if:
- Have a
lineitementity that relates to bothcustomerandsupplier. - Both
customerandsupplierhave a shared dimension callednation.
lineitem would not know which nation to prefer when counting revenue by nation.To resolve this situation:- Create a calculated attribute in
lineitemthat equals to the preferred nation (for example that equals tocustomer.c_nationkey). - Connect that calculated attribute directly to the shared
nationentity.
Metrics that combine multiple facts
A common use case with shared dimensions is querying one fact based on information from another fact with a connected shared dimension. For example, would want to build a metric that counts the visits of all users that had a login last month. Since there is no direct connection between visits and logins to join on, need to go through the shared user in theusers entity
that takes data from the logins fact, so it can be used later on when joining users with visits.
Connect metrics through a shared dimension
Given the following schema:- Two facts:
loginsandvisits - Connected through a single shared dimension:
users
The internal metric on
logins has a GROUP BY (users.id) qualifier that tells it to be calculated per user.Since user is a shared dimension between logins and visits, the result can be applied to a metric in visits.Note that if there are multiple shared dimensions, can connect through more than one.Break to steps with a calculated attribute in the shared dimension
Use a calculated attribute when the connection between facts goes through a single shared dimension, and there is no dynamic calculation. In that case, can built a simpler single fact metric that uses that calculated attribute. To implement the question above with a calculated attribute approach:-
Build a calculated attribute in the
usersshared entity namedlogins_last_monthdefined asThis counts per user the logins (assuming logins has acountmetric) in the last month (assuming atimeattribute), at the granularity of a user (since the calculated attribute is part of the user entity). See also how metrics behave in calculated attributes. -
Build a metric in
visitscalledcount_visits_of_recently_logged_in_usersdefined asThis is a reusable metric based on data from two facts, counting first at the login fact granularity (logins last month) and then at the visit fact granularity (visits of users that logged in last month).