TPC-H is an example data set used mostly for databases & data warehouses performance benchmarking.
The data represents an ordering system. It contains customers and orders, where each order is comprised of order line items, and each line item is a certain part.
A dataset derived from TPC-H is readily available as part of any Snowflake instance, in the
Honeydew Example Data
This is the ERD for a schema derived from TPC-H, as exists in the Snowflake sample data.
Honeydew examples are based on those tables, as well as their relations.
They are build on the
SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 schema: TPC-H with a scale factor of 1.
Full schema can be found at https://github.com/honeydew-ai/tpch-demo
What the example does not show
The TPC-H represents a warehouse modeled with dimensional modeling. Data lakes today include data modeled in other ways, such as semi-nested data or wide fact tables.
Honeydew is built to support a wide variety of data structures. However, some of the techniques used in the example are more applicable for dimensional modeling, and less for other types of data.
TPC-H Derived Data Model
This is the ERD as is reflected in the Honeydew UI:
Honeydew includes information about every entity and relationship in data.
The simplest type of an entity is a table in the data warehouse. For example, the
customer entity is based on the
Every entity also has a granularity associated with it (think of it as a virtual table with rows) - for example, the
customer entity represents a customer. Every “row” in it, is a unique customer.
The granularity is part of the entity definition and called its “key”. For customer, it is the column
c_custkey in the table.
Simple entities in the model include:
customer- based on the
CUSTOMERtable with key
orders- based on the
ORDERStable with key
parts- based on the
PARTStable with key
region- based on the
REGIONtable with key
supplier- based on the
SUPPLIERtable with key
nation- based on the
NATIONtable with key
A table with a single key is the most common type of entity.
Note that in some cases an entity key is more than one column. For example, the
lineitem entity represents the lines of an order.
Its unique key is a combination of an order key and a line offset in it.
There are two entities like that:
lineitem- lines in an order. Based on the
LINEITEMtable with key (
partsupp- data on supplier/part combination. Based on the
PARTSUPPtable with key (
The last entity in the example is called
supplier_quarters. It is a “virtual entity” - an entity that does not exist as a table, but is based
on a calculation. More about it and when are virtual entities useful, later.
Entities by themselves don’t help build interesting queries. Insights come from connecting different business domains and entities. In SQL terms, different entities need to be JOINed.
Relationships tell Honeydew which path to take to JOIN entities, and how to compute metrics across multiple entities.
The basic information of relationship in Honeydew includes:
- Its direction (1:many or many:1)
- The columns to JOIN on (might be more than one)
lineitem relationship is:
- 1:many (an order has many line items in it)
- Based on joining
orders.o_orderkey = lineitem.l_orderkey
In the Honeydew UI, the direction of the arrow signifies the relative granularity (1:many):
Relations of customer
orders is also connected to
- many:1 (a customer has many orders)
- Based on joining
orders.o_custkey = customer.c_custkey
One of the advanced properties of a relationship is its default JOIN direction (left/right/inner/automatic).
Since some customers don’t have orders, but every order has a customer, the best default JOIN direction is
customers relationship is set that way (marked in the UI with the small join type marker on the arrow).
Most modern data models will use LEFT as join configuration between facts and dimensions.
Relations of nation
Note that the
nation entity has two paths - it can be the nation of a customer, or the nation of a supplier.
This situation, when the meaning of a table depends on the join path taken, is called role playing.
In short, if there are different paths, they are given names (“roles”) so a query can distinguish which “nation” is needed.
There are two roles defined on relationships of
supplier_nationif meaning the nation of a supplier (
nation.n_nationkey = supplier.s_nation_key)
customer_nationif meaning the nation of a customer (
nation.n_nationkey = customer.c_nation_key)
Role-playing relations are important, but are not very common.
The TPC-H benchmark includes different queries that can be used to highlight different concepts of using Honeydew for semantic modeling.
All the “Q##” here refer to questions defined in the benchmark.
Most of the time using Honeydew will be done with a UI or API. However can explore the calculations in their source form:
- The semantic model is at https://github.com/honeydew-ai/tpch-demo/tree/main/tpch_demo/schema
- Specific datasets built with it at https://github.com/honeydew-ai/tpch-demo/tree/main/tpch_demo/perspectives
Honeydew separates between a semantic model (entities and metrics) and queries on it (a specific metric in a specific grain).
Queries defined by the TPC-H benchmark are implemented as Dynamic Datasets, showcasing different ways to ask about data.
The order of queries here is based on best order of learning.
The first one: Q01 - “Pricing Summary Report”
- How to slice and dice metrics by attributes of a single table
- How to make a reusable metric object
- How to use filters
This question counts business metrics by status of transactions - what was billed, shipped, and returned.
A dynamic dataset in Honeydew is a collection of attributes and metrics. By default, every metric is grouped by all the attributes.
So, to implement it need to cut the data by status of transactions - the attributes
Counting line items is the dynamic dataset
attributes: - lineitem.l_linestatus - lineitem.l_returnflag metrics: - lineitem.count
count metric is automatically created for every entity.
The Q01 query calls for a number of other metrics as well. For example, it asks for Total and Average Quantity:.
Can build a dynamic dataset to include those metrics:
attributes: - lineitem.l_linestatus - lineitem.l_returnflag metrics: - lineitem.count - SUM(lineitem.l_quantity) as sum_qty - AVG(lineitem.l_quantity) as avg_qty
This computes them in a query. But does not make those metrics part of the semantic model. To add them to the semantic model
create them in the
Create the metric
avg_qty metrics in the
lineitem entity with their SQL, as
Now the dataset can look like
attributes: - lineitem.l_linestatus - lineitem.l_returnflag metrics: - lineitem.count - lineitem.l_sum_qty - lineitem.l_avg_qty
But what if want to see this metric only on a subset of the data? Let’s filter
attributes: - lineitem.l_linestatus - lineitem.l_returnflag metrics: - lineitem.count - lineitem.l_sum_qty - lineitem.l_avg_qty filters: - lineitem.l_shipdate <= date('1998-12-01')
This will not count any line items after Dec 1998. (yes, the TPC-H sample data happens in the 90s).
Check out the full dynamic dataset
tpch_q01 for more.
Calculated attributes: Q16 - “Parts/Supplier Relationship”
- How to build calculated attributes.
- How to use them in a filter.
This question focuses on finding out how many good suppliers can supply parts. A good supplier is one that has not received complaints.
The way a “bad” supplier is marked in the data is the supplier comment field:
supplier.s_comment like '%Customer%Complaints%'.
So easy to count suppliers that had not complaints in a dynamic dataset:
attributes:  metrics: - supplier.count filters: - not supplier.s_comment like '%Customer%Complaints%'
But perhaps want to use the logic of a supplier not having complaints elsewhere?
So far, all the attributes we used were columns in the data. A calculated attribute is like a “virtual column”.
Let’s codify this logic in a reusable way by making a calculated attribute in the
supplier entity called
supplier.s_comment like '%Customer%Complaints%'
Now can check filter out suppliers with complaints:
attributes:  metrics: - supplier.count filters: - not supplier.has_complaints
Can also use
has_complaints as an attribute to group by, or as a part of another expression.
If a calculation is defined in a dynamic dataset, it is not reusable elsewhere. Make it a calculated attribute or a metric in the semantic model to reuse it and manage changes.
Check out the full
tpch_q16 for more.
Combining entities: Q03 - “Shipping Priority”
- How to combine different entities together.
- How to use parameters to filter.
This question looks at unshipped orders at a given date and the unshipped value, to get unshipped orders with the highest value.
The previous query operated only on one entity and one table -
But what if need to combine entities and tables together? In this query, would want to see a metric of
order value (counted on the items of an order in
lineitem) but break it down by order date and shipping priority.
Let’s start with creating a value metric called
lineitem, that counts the actual price paid including discounts:
Now can use it in a dynamic dataset, such as:
attributes: - orders.o_orderdate - orders.o_shippriority metrics: - lineitem.revenue
lineitem, grouped by attributes in
Under the hood, Honeydew joins the
lineitem tables based on the relationship between them.
Can combine any entities together, as long as there is a relationship between them that supports the question.
Don’t worry, if there is no way to combine data, Honeydew will tell you.
Now, this is not the whole question - we want revenue of only unshipped orders at a particular date.
So, lets filter orders that happened before a given date, but their ship date has not arrived yet:
attributes: - orders.o_orderdate - orders.o_shippriority metrics: - lineitem.revenue filters: - lineitem.l_shipdate > $date - orders.o_orderdate < $date
Filters can be ad-hoc or reusable. All the filters apply together (AND), and every separate filter is pushed down as close to the data as possible for performance.
When there are a number of filters (like here), the best practice is write separate filters instead
of one big expression with an
AND, as Honeydew can deliver better performance that way.
Wait, but what is
$date is a parameter, to make this query a parameterized dynamic dataset - so you can choose the date later.
The default value is defined in the dynamic dataset as well, but can be changed at query time:
parameters: - name: date value: '''1995-03-15''::date' description: ''
Check out the full
tpch_q03 for more.
Derived metrics: Q14 - “Promotion Effect”
- How to create a filtered metric and a ratio metric
- Combining different entities in a single metric
- Alternatives to using parameters
This questions checks how promotions affect revenue.
A promotion is market in the data in the type part -
part.p_type LIKE 'PROMO%.
We have already defined revenue in the previous query (Q03). The definition of revenue was in the metric
The revenue of promotions only counts lineitems that match a promoted part, or:
lineitem.revenue FILTER (WHERE part.p_type LIKE 'PROMO%')
This expression takes the
revenue as a black-box, and applies a filter on the rows that it counts on. This is called a filtered metric.
The expression for that metric used a field from
part and a field from
lineitem. As long as there is a relationship path connection the
two, Honeydew will figure out the needed JOINs.
Can check this in a dynamic dataset that only has this.
However, the query calls for promotion effect - or, which percentage of the revenue went to promotions?
The way to define it is a ratio metric, which is called
100.0 * lineitem.revenue FILTER (WHERE part.p_type LIKE 'PROMO%') / lineitem.revenue
Same metric (
lineitem.revenue here) can appear multiple times in an expression, each time with different filters or groupings.
tpch_q14 looks at the revenue on a parameterized time range:
attributes:  metrics: - lineitem.promo_revenue filters: - lineitem.l_shipdate between $date and dateadd('month', 1, $date)
This follow the TPC-H specification. But what if want to just calculated the promo revenue on all months, and let the user decide later which is interesting?
This is done by just using an attribute as a group:
attributes: - date_trunc('month', lineitem.l_shipdate) as ship_month metrics: - lineitem.promo_revenue filters: 
The TPC-H benchmark heavily uses paramaters for filtering attributes in queries.
However, many use cases can be better off with adding the field filtered with a parameter as a group instead, and let the user filter later.
Other examples of replacing parameters with groups see
Metrics as attributes: Q04 - “Order Priority Checking”
- How aggregations become calculated attributes,and why do it
- Use one in a filter
This question checks how many orders had have late deliveries?
The way a late delivery is defined is that it is an order that has an item received after the commitment date.
Or, in SQL, a late order is one that has line items where
l_commitdate < l_receiptdate.
How can we make “is this order late” an attribute of an order?
The idea is that there is aggregation on lineitems (number of late items), that we want to count for each order
lineitem.count FILTER (WHERE lineitem.l_commitdate < lineitem.l_receiptdate)
To achieve that, need to create a calculated attribute in
orders (because we want to build a property of an order).
First, can just count all items of an order, not just late. This calculated attribute in
orders will count all line items for each order:
If a calculated attribute of an entity has an aggregation SQL or a metric, that aggregation is counted for each entity key.
So if a calculated attribute in
lineitem.count it means the count of items for
granularity key of orders).
However, what we want is to count late items and mark the order as late if there is more than 0.
So make a calculated attribute in
lineitem.count FILTER (WHERE lineitem.l_commitdate < lineitem.l_receiptdate) > 0
This will be a boolean attribute (TRUE/FALSE) of an order that is true when there was at least one late item.
Now can use it in other expressions, for example to see how many late orders do we have based on their shipping priority:
attributes: - orders.o_orderpriority metrics: - orders.count filters: - orders.has_late_deliveries
See more about metrics as calculated attributes or checkout the full
Reusable filters and layered calculations: Q17 - “Small-Quantity-Order Revenue”
- That metrics, attributes, and filters from any entity can be easily combined
- To build a filter
How much revenue was from orders of small quantities of parts?
A small quantity would be defined as an order of less then 20% of an average order quantity of a part.
Or in other words:
Make a metric counting average quantity of line items called
Now use it in as a calculated attribute in
part to remember the average quantity a part is ordered:
Now use it to make a filter in
is_small_quantity_items (note it uses the field from
lineitem.l_quantity < 0.2*part.avg_qty
Filters are very similar to calculated attributes, but signify reusable cuts of data. When a filter is used in a dynamic dataset, its expression is pushed down.
Now use this for a question - how much revenue do we have from small quantity orders?
attributes:  metrics: - lineitem.revenue filters: - lineitem.is_small_quantity_items
Check out the full
tpch_q17 that does few things more.
Use a metric as a group by: Q13 - “Customer Distribution”
- How to build a distribution histogram by using a metric as a group by attribute
This question wants to understand how many customers are there, based on how many orders did they make? How many of the customers do a lot of orders and how many do very few?
To implement it, first want to remember per customer how many orders a customer had.
For that, first create a calculated attribute in
counts its orders by using the
Now can use it as a group by for customer count - build the following dynamic dataset:
attributes: - customer.order_count metrics: - customer.count
It will count customers by the property of how many orders each had - a distribution query.
Check out the full
tpch_q13 that combines this with parameters and filtered metrics.