TPC-H Derived Example
Background
TPC-H is an example data set used mostly for benchmarking the performance of databases and data warehouses.
The data represents an ordering system. It contains customers and orders, where each order is composed of line items, and each line item refers to a specific part.
A dataset derived from TPC-H is readily available as part of any Snowflake instance, in the SNOWFLAKE_SAMPLE_DATA database.
Honeydew Example Data
This is the ERD for a schema derived from TPC-H, as it exists in the Snowflake sample data.
Honeydew examples are based on those tables, as well as their relations.
They are built on the SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 schema: TPC-H with a scale factor of 1.
The full schema can be found at https://github.com/honeydew-ai/tpch-demo
What the example does not show
The TPC-H schema represents a warehouse built with dimensional modeling. Data lakes today also include data modeled in other ways, such as semi-structured (nested) data or wide fact tables.
Honeydew is built to support a wide variety of data structures. However, some of the techniques used in the example are more applicable to dimensional modeling than to other types of data.
TPC-H Derived Data Model
This is the ERD as reflected in the Honeydew UI:
Honeydew includes information about every entity and relationship in the data.
Entities
The simplest type of entity is a table in the data warehouse. For example, the customer entity is based on the SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER table.
Every entity also has a granularity associated with it (think of it as a virtual table with rows) - for example, the customer entity represents a customer. Every “row” in it is a unique customer.
The granularity is part of the entity definition and is called its “key”. For customer, it is the c_custkey column in the table.
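For intuition, a key can be checked directly against the underlying table. A minimal SQL sketch against the Snowflake sample data that verifies c_custkey is unique (and can therefore serve as the entity key):

```sql
-- If c_custkey is a valid key, this returns no rows
SELECT c_custkey, COUNT(*) AS row_count
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
GROUP BY c_custkey
HAVING COUNT(*) > 1;
```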
Simple entities in the model include:
- customer - based on the CUSTOMER table with key c_custkey
- orders - based on the ORDERS table with key o_orderkey
- part - based on the PART table with key p_partkey
- region - based on the REGION table with key r_regionkey
- supplier - based on the SUPPLIER table with key s_suppkey
- nation - based on the NATION table with key n_nationkey
A table with a single key is the most common type of entity.
Note that in some cases an entity key is more than one column. For example, the lineitem entity represents the lines of an order.
Its unique key is a combination of an order key and a line offset within it.
There are two entities like that:
- lineitem - lines in an order. Based on the LINEITEM table with key (l_orderkey, l_linenumber)
- partsupp - data on supplier/part combination. Based on the PARTSUPP table with key (ps_partkey, ps_suppkey)
The last entity in the example is called supplier_quarters. It is a “virtual entity” - an entity that does not exist as a table, but is based on a calculation. More about it, and when virtual entities are useful, later.
For more information, read about entities.
Relations
Entities by themselves don’t help build interesting queries. Insights come from connecting different business domains and entities. In SQL terms, different entities need to be JOINed.
Relationships tell Honeydew which path to take to JOIN entities, and how to compute metrics across multiple entities.
The basic information of a relationship in Honeydew includes:
- Its direction (1:many or many:1)
- The columns to JOIN on (might be more than one)
For example, the orders and lineitem relationship is:
- 1:many (an order has many line items in it)
- Based on joining orders.o_orderkey = lineitem.l_orderkey
In the Honeydew UI, the direction of the arrow signifies the relative granularity (1:many):
For more information, read about relations.
Relations of customer
orders is also connected to customer:
- many:1 (a customer has many orders)
- Based on joining orders.o_custkey = customer.c_custkey
One of the advanced properties of a relationship is its default JOIN direction (left/right/inner/automatic).
Since some customers don’t have orders, but every order has a customer, the best default JOIN direction is LEFT JOIN.
The orders to customer relationship is set that way (marked in the UI with the small join type marker on the arrow).
Most modern data models use LEFT as the join configuration between facts and dimensions.
Examples: Basic
The TPC-H benchmark includes different queries that can be used to highlight different concepts of using Honeydew for semantic modeling.
All the “Q##” here refer to questions defined in the benchmark.
Most of the time, Honeydew is used through a UI or API. However, you can explore the calculations in their source form:
- The semantic model is at https://github.com/honeydew-ai/tpch-demo/tree/main/tpch_demo/schema
- Specific datasets built with it at https://github.com/honeydew-ai/tpch-demo/tree/main/tpch_demo/perspectives
Honeydew separates the semantic model (entities and metrics) from queries on it (a specific metric at a specific grain).
Queries defined by the TPC-H benchmark are implemented as Dynamic Datasets, showcasing different ways to ask about data.
Queries can also be ad-hoc, using the SQL (JDBC) interface, the Native App, or a BI tool.
The queries here are ordered for learning, starting with the basics.
The first one: Q01 - “Pricing Summary Report”
Learn here:
- How to slice and dice metrics by attributes of a single table
- How to make a reusable metric object
- How to use filters
This question computes business metrics by transaction status - what was billed, shipped, and returned.
A dynamic dataset in Honeydew is a collection of attributes and metrics. By default, every metric is grouped by all the attributes.
So, to implement it, we need to cut the data by transaction status - the attributes lineitem.l_linestatus and lineitem.l_returnflag.
Counting line items is the dynamic dataset
The count metric is automatically created for every entity.
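As a rough sketch, such a dynamic dataset corresponds to SQL like the following (the SQL Honeydew actually generates may differ):

```sql
-- Count of line items, grouped by the two status attributes
SELECT
    l_returnflag,
    l_linestatus,
    COUNT(*) AS lineitem_count
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.LINEITEM
GROUP BY l_returnflag, l_linestatus;
```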
The Q01 query calls for a number of other metrics as well. For example, it asks for total and average quantity.
You can build a dynamic dataset that includes those metrics:
This computes them in a query, but does not make those metrics part of the semantic model. To add them to the semantic model, create them in the lineitem entity.
Metrics are standalone aggregations. They can be defined once and reused anywhere.
Create the sum_qty and avg_qty metrics in the lineitem entity with their SQL expressions.
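Following the TPC-H Q01 definitions, their SQL expressions would be roughly:

```sql
-- lineitem.sum_qty: total quantity across line items
SUM(l_quantity)

-- lineitem.avg_qty: average quantity per line item
AVG(l_quantity)
```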
Now the dataset can look like this:
But what if we want to see these metrics only on a subset of the data? Let's add a filter:
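A sketch of such a filter, based on the TPC-H Q01 predicate:

```sql
-- Keep only line items shipped up to the Q01 cutoff date
lineitem.l_shipdate <= DATE '1998-12-01'
```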
This will not count any line items after Dec 1998. (yes, the TPC-H sample data happens in the 90s).
Check out the full dynamic dataset tpch_q01 for more.
Calculated attributes: Q16 - “Parts/Supplier Relationship”
Learn here:
- How to build calculated attributes.
- How to use them in a filter.
This question focuses on finding out how many good suppliers can supply parts. A good supplier is one that has not received complaints.
The way a “bad” supplier is marked in the data is the supplier comment field: supplier.s_comment like '%Customer%Complaints%'.
So it is easy to count suppliers that had no complaints in a dynamic dataset:
But what if we want to use the logic of a supplier not having complaints elsewhere?
So far, all the attributes we used were columns in the data. A calculated attribute is like a “virtual column”.
Let’s codify this logic in a reusable way by making a calculated attribute in the supplier entity called has_complaints:
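A sketch of the attribute's SQL expression, using the comment-field pattern above:

```sql
-- supplier.has_complaints: TRUE when the supplier comment mentions customer complaints
s_comment LIKE '%Customer%Complaints%'
```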
Now we can filter out suppliers with complaints:
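One way, sketched here, is to add a filter that keeps only suppliers without the flag:

```sql
-- Keep only suppliers without complaints
NOT supplier.has_complaints
```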
You can also use has_complaints as an attribute to group by, or as part of another expression.
If a calculation is defined in a dynamic dataset, it is not reusable elsewhere. Make it a calculated attribute or a metric in the semantic model to reuse it and manage changes.
Check out the full tpch_q16 for more.
Combining entities: Q03 - “Shipping Priority”
Learn here:
- How to combine different entities together.
- How to use parameters to filter.
This question looks at orders not yet shipped as of a given date and their value, to find the unshipped orders with the highest value.
The previous query operated only on one entity and one table - lineitem.
But what if we need to combine entities and tables together? In this query, we want to see a metric of order value (computed on the items of an order in lineitem), but break it down by order date and shipping priority.
Let’s start by creating a value metric called revenue in lineitem that computes the actual price paid, including discounts:
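Its SQL expression follows the standard TPC-H revenue formula:

```sql
-- lineitem.revenue: extended price after discount, summed over line items
SUM(l_extendedprice * (1 - l_discount))
```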
Now we can use it in a dynamic dataset, such as:
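A sketch of one such dataset: the o_orderdate and o_shippriority attributes from orders together with the revenue metric from lineitem. Roughly, it corresponds to this SQL (the actual generated SQL may differ):

```sql
-- Revenue per order date and shipping priority
SELECT
    o.o_orderdate,
    o.o_shippriority,
    SUM(l.l_extendedprice * (1 - l.l_discount)) AS revenue
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS o
JOIN SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.LINEITEM l
  ON o.o_orderkey = l.l_orderkey
GROUP BY o.o_orderdate, o.o_shippriority;
```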
This calculates revenue on lineitem, grouped by attributes in orders.
Under the hood, Honeydew joins the orders and lineitem tables based on the relationship between them.
You can combine any entities together, as long as there is a relationship path between them that supports the question.
Don’t worry - if there is no way to combine the data, Honeydew will tell you.
Now, this is not the whole question - we want revenue of only unshipped orders at a particular date.
So, let's filter to orders that were placed before a given date but whose items have not shipped yet by that date:
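A sketch of these two filters, following the TPC-H Q03 predicates (the $date parameter is explained below):

```sql
-- The order was placed before the given date
orders.o_orderdate < $date

-- ...but its items had not shipped yet by that date
lineitem.l_shipdate > $date
```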
Filters can be ad-hoc or reusable boolean attributes. All the filters apply together (AND), and every separate filter is pushed down as close to the data as possible, for performance gain.
When there are a number of filters (like here), the best practice is to write separate filters instead of one big expression with an AND, as Honeydew can deliver better performance that way.
Wait, but what is $date?
$date is a parameter that makes this query a parameterized dynamic dataset - so you can choose the date later.
The default value is defined in the dynamic dataset as well, but it can be changed at query time:
Check out the full tpch_q03 for more.
Derived metrics: Q14 - “Promotion Effect”
Learn here:
- How to create a filtered metric and a ratio metric
- Combining different entities in a single metric
- Alternatives to using parameters
This question checks how promotions affect revenue.
A promotion is marked in the data by the part type - part.p_type LIKE 'PROMO%'.
We have already defined revenue in the previous query (Q03). The definition of revenue was in the metric lineitem.revenue:
The revenue of promotions only counts line items that match a promoted part.
Such a metric takes revenue as a black box and applies a filter on the rows it aggregates over. This is called a filtered metric.
The expression for that metric uses a field from part and a field from lineitem. As long as there is a relationship path connecting the two, Honeydew will figure out the needed JOINs.
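In plain SQL terms (not necessarily Honeydew's exact filtered-metric syntax), the promotion revenue amounts to roughly:

```sql
-- Revenue counted only for line items whose part is a promotion
SUM(CASE WHEN part.p_type LIKE 'PROMO%'
         THEN lineitem.l_extendedprice * (1 - lineitem.l_discount)
         ELSE 0 END)
```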
You can check this in a dynamic dataset that contains only this metric.
However, the query calls for promotion effect - or, what percentage of the revenue went to promotions?
The way to define it is a ratio metric, which is called lineitem.promo_revenue:
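Following the TPC-H Q14 formula, its definition is roughly the promotion revenue divided by the total revenue:

```sql
-- lineitem.promo_revenue: percent of revenue coming from promoted parts
100.00 * SUM(CASE WHEN part.p_type LIKE 'PROMO%'
                  THEN l_extendedprice * (1 - l_discount) ELSE 0 END)
       / SUM(l_extendedprice * (1 - l_discount))
```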
The same metric (lineitem.revenue here) can appear multiple times in an expression, each time with different filters or groupings.
The full tpch_q14 looks at the revenue on a parameterized time range:
This follows the TPC-H specification. But what if we want to calculate the promo revenue for all months, and let the user decide later which ones are interesting?
This is done by just using an attribute as a group:
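For example, instead of a parameterized date-range filter, a (hypothetical) calculated attribute for the ship month could be added and used as a group:

```sql
-- lineitem.ship_month (illustrative name): the month a line item shipped
DATE_TRUNC('MONTH', l_shipdate)
```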
The TPC-H benchmark heavily uses parameters for filtering attributes in queries.
However, many use cases are better served by adding the field that would be filtered with a parameter as a group instead, and letting the user filter later.
For other examples of replacing parameters with groups, see q02_expanded and q08_expanded.
Examples: Intermediate
Metrics as attributes: Q04 - “Order Priority Checking”
Learn here:
- How aggregations become calculated attributes, and why to do it
- How to use one in a filter
This question checks how many orders had late deliveries.
A late delivery is defined as an order that has an item received after the commitment date.
Or, in SQL, a late order is one that has line items where l_commitdate < l_receiptdate.
How can we make “is this order late” an attribute of an order?
The idea is that there is an aggregation on line items (the number of late items) that we want to compute for each order.
To achieve that, we need to create a calculated attribute in orders (because we want to build a property of an order).
First, we can just count all items of an order, not only the late ones. This calculated attribute in orders counts all line items for each order:
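A sketch of that calculated attribute - it simply references the built-in count metric of lineitem (the attribute name is illustrative):

```sql
-- orders.lineitem_count (illustrative name): number of line items in the order
lineitem.count
```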
If a calculated attribute of an entity has an aggregation SQL or a metric, that aggregation is counted for each entity key.
So if a calculated attribute in orders says lineitem.count, it means the count of line items for each o_orderkey (the granularity key of orders).
However, what we want is to count the late items and mark the order as late if there are more than 0.
So make a calculated attribute in orders called has_late_deliveries:
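A sketch of its SQL expression, counting late items per order and checking that there is at least one (the exact expression in the demo may differ):

```sql
-- orders.has_late_deliveries: TRUE when any item was received after its commit date
COUNT_IF(lineitem.l_commitdate < lineitem.l_receiptdate) > 0
```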
This will be a boolean attribute (TRUE/FALSE) of an order that is true when there was at least one late item.
Now we can use it in other expressions, for example to see how many late orders we have based on their order priority:
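Roughly, that corresponds to this SQL (grouping by o_orderpriority, as in the TPC-H Q04 definition):

```sql
-- How many orders had late deliveries, per order priority
SELECT
    o.o_orderpriority,
    COUNT(*) AS order_count
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS o
WHERE EXISTS (
    SELECT 1
    FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.LINEITEM l
    WHERE l.l_orderkey = o.o_orderkey
      AND l.l_commitdate < l.l_receiptdate
)
GROUP BY o.o_orderpriority;
```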
See more about metrics as calculated attributes, or check out the full tpch_q04.
Reusable filters and layered calculations: Q17 - “Small-Quantity-Order Revenue”
Learn here:
- That metrics, attributes, and filters from any entity can be easily combined
- How to build a filter
How much revenue was from orders of small quantities of parts?
A small quantity is defined as an order of less than 20% of the average order quantity of a part.
Or in other words:
Make a metric counting average quantity of line items called avg_qty:
Now use it as a calculated attribute in part to remember the average quantity in which a part is ordered:
Now use it to make a filter in lineitem called is_small_quantity_items (note that it uses the field from part).
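Sketches of these three definitions (the part attribute name is illustrative; the exact Honeydew syntax may differ):

```sql
-- lineitem.avg_qty: average ordered quantity of a line item
AVG(l_quantity)

-- part.avg_ordered_qty (illustrative name): a calculated attribute of part,
-- so the aggregation is computed per p_partkey
lineitem.avg_qty

-- lineitem.is_small_quantity_items: the line orders less than 20% of the part's average quantity
l_quantity < 0.2 * part.avg_ordered_qty
```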
Now use this for a question - how much revenue do we have from small quantity orders?
Check out the full tpch_q17 that does a few more things.
Use a metric as a group by: Q13 - “Customer Distribution”
Learn here:
- How to build a distribution histogram by using a metric as a group by attribute
This question wants to understand how many customers there are based on how many orders they made: how many customers make a lot of orders, and how many make very few?
To implement it, we first want to record, per customer, how many orders that customer had.
For that, first create a calculated attribute in customer that counts its orders by using the orders.count metric:
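A sketch of that calculated attribute (the attribute name is illustrative):

```sql
-- customer.num_orders (illustrative name): how many orders the customer made
orders.count
```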
Now we can use it as a group by for the customer count - build the following dynamic dataset:
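With the customer count metric grouped by that attribute, the result roughly corresponds to this SQL:

```sql
-- How many customers fall into each "number of orders" bucket
SELECT
    num_orders,
    COUNT(*) AS customer_count
FROM (
    SELECT c.c_custkey, COUNT(o.o_orderkey) AS num_orders
    FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER c
    LEFT JOIN SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS o
      ON o.o_custkey = c.c_custkey
    GROUP BY c.c_custkey
) per_customer
GROUP BY num_orders;
```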
It will count customers by the property of how many orders each had - a distribution query.
Check out the full tpch_q13 that combines this with parameters and filtered metrics.