Background
TPC-H is an example data set used mostly for databases & data warehouses performance benchmarking. The data represents an ordering system. It contains customers and orders, where each order is comprised of order line items, and each line item is a certain part. A dataset derived from TPC-H is readily available as part of any Snowflake instance, in theSNOWFLAKE_SAMPLE_DATA
database.
Honeydew Example Data
This is the ERD for a schema derived from TPC-H, as exists in the Snowflake sample data. Honeydew examples are based on those tables, as well as their relations. They are build on theSNOWFLAKE_SAMPLE_DATA.TPCH_SF1
schema: TPC-H with a scale factor of 1.
Full schema can be found at https://github.com/honeydew-ai/tpch-demo

What the example does not show
The TPC-H represents a warehouse modeled with dimensional modeling. Data lakes today include data modeled in other ways, such as semi-nested data or wide fact tables. Honeydew is built to support a wide variety of data structures. However, some of the techniques used in the example are more applicable for dimensional modeling, and less for other types of data.TPC-H Derived Data Model
This is the ERD as is reflected in the Honeydew UI:
Entities
The simplest type of an entity is a table in the data warehouse. For example, thecustomer
entity is based on the SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
table.
Every entity also has a granularity associated with it (think of it as a virtual table with rows) - for example, the customer
entity represents a customer. Every “row” in it, is a unique customer.
The granularity is part of the entity definition and called its “key”. For customer, it is the column c_custkey
in the table.
Simple entities in the model include:
customer
- based on theCUSTOMER
table with keyc_custkey
orders
- based on theORDERS
table with keyo_orderkey
parts
- based on thePARTS
table with keyp_partkey
region
- based on theREGION
table with keyr_regionkey
supplier
- based on theSUPPLIER
table with keys_suppkey
nation
- based on theNATION
table with keyn_nationkey
A table with a single key is the most common type of entity.
lineitem
entity represents the lines of an order.
Its unique key is a combination of an order key and a line offset in it.
There are two entities like that:
lineitem
- lines in an order. Based on theLINEITEM
table with key (l_orderkey
,l_linenumber
)partsupp
- data on supplier/part combination. Based on thePARTSUPP
table with key (ps_partkey
,ps_suppkey
)
supplier_quarters
. It is a “virtual entity” - an entity that does not exist as a table, but is based
on a calculation. More about it and when are virtual entities useful, later.
For more information, read on entities
Relations
Entities by themselves don’t help build interesting queries. Insights come from connecting different business domains and entities. In SQL terms, different entities need to be JOINed. Relationships tell Honeydew which path to take to JOIN entities, and how to compute metrics across multiple entities. The basic information of relationship in Honeydew includes:- Its direction (1:many or many:1)
- The columns to JOIN on (might be more than one)
orders
and lineitem
relationship is:
- 1:many (an order has many line items in it)
- Based on joining
orders.o_orderkey = lineitem.l_orderkey

For more information, read on relations
Relations of customer
orders
is also connected to customers
:
- many:1 (a customer has many orders)
- Based on joining
orders.o_custkey = customer.c_custkey
LEFT JOIN
.
The orders
to customers
relationship is set that way (marked in the UI with the small join type marker on the arrow).
Most modern data models will use LEFT as join configuration between facts and dimensions.
Examples: Basic
The TPC-H benchmark includes different queries that can be used to highlight different concepts of using Honeydew for semantic modeling. All the “Q##” here refer to questions defined in the benchmark. Most of the time using Honeydew will be done with a UI or API. However can explore the calculations in their source form:- The semantic model is at https://github.com/honeydew-ai/tpch-demo/tree/main/tpch_demo/schema
- Specific datasets built with it at https://github.com/honeydew-ai/tpch-demo/tree/main/tpch_demo/perspectives
Queries can also be ad-hoc, using the [SQL interface][/jdbc], the Native App or a BI tool.
The first one: Q01 - “Pricing Summary Report”
Learn here:This question counts business metrics by status of transactions - what was billed, shipped, and returned. A dynamic dataset in Honeydew is a collection of attributes and metrics. By default, every metric is grouped by all the attributes. So, to implement it need to cut the data by status of transactions - the attributes
- How to slice and dice metrics by attributes of a single table
- How to make a reusable metric object
- How to use filters
lineitem.l_linestatus
and lineitem.l_returnflag
.
Counting line items is the dynamic dataset
The
count
metric is automatically created for every entity.lineitem
entity.
Metrics are standalone aggregations. They can be defined once and reused anywhere.
sum_qty
and avg_qty
metrics in the lineitem
entity with their SQL, as
tpch_q01
for more.
Calculated attributes: Q16 - “Parts/Supplier Relationship”
Learn here:This question focuses on finding out how many good suppliers can supply parts. A good supplier is one that has not received complaints. The way a “bad” supplier is marked in the data is the supplier comment field:
- How to build calculated attributes.
- How to use them in a filter.
supplier.s_comment like '%Customer%Complaints%'
.
So easy to count suppliers that had not complaints in a dynamic dataset:
supplier
entity called has_complaints
:
has_complaints
as an attribute to group by, or as a part of another expression.
If a calculation is defined in a dynamic dataset, it is not reusable elsewhere. Make it a calculated attribute or a metric in the
semantic model to reuse it and manage changes.
tpch_q16
for more.
Combining entities: Q03 - “Shipping Priority”
Learn here:This question looks at unshipped orders at a given date and the unshipped value, to get unshipped orders with the highest value. The previous query operated only on one entity and one table -
- How to combine different entities together.
- How to use parameters to filter.
lineitem
.
But what if need to combine entities and tables together? In this query, would want to see a metric of
order value (counted on the items of an order in lineitem
) but break it down by order date and shipping priority.
Let’s start with creating a value metric called revenue
in lineitem
, that counts the actual price paid including discounts:
revenue
on lineitem
, grouped by attributes in orders
.
Under the hood, Honeydew joins the orders
and lineitem
tables based on the relationship between them.
Can combine any entities together, as long as there is a relationship between them that supports the question.Don’t worry, if there is no way to combine data, Honeydew will tell you.
Filters can be ad-hoc or reusable boolean attributes.
All the filters apply together (AND), and every separate
filter is pushed down as close to the data as possible, for performance gain.When there are a number of filters (like here), the best practice is write separate filters instead
of one big expression with an
AND
, as Honeydew can deliver better performance that way.$date
?
$date
is a parameter, to make this query a parameterized dynamic dataset - so you can choose the date later.
The default value is defined in the dynamic dataset as well, but can be changed at query time:
tpch_q03
for more.
Derived metrics: Q14 - “Promotion Effect”
Learn here:This questions checks how promotions affect revenue. A promotion is market in the data in the type part -
- How to create a filtered metric and a ratio metric
- Combining different entities in a single metric
- Alternatives to using parameters
part.p_type LIKE 'PROMO%
.
We have already defined revenue in the previous query (Q03). The definition of revenue was in the metric lineitem.revenue
:
revenue
as a black-box, and applies a filter on the rows that it counts on. This is called a filtered metric.
The expression for that metric used a field from
part
and a field from lineitem
. As long as there is a relationship path connection the
two, Honeydew will figure out the needed JOINs.lineitem.promo_revenue
:
Same metric (
lineitem.revenue
here) can appear multiple times in an expression, each time with different filters or groupings.tpch_q14
looks at the revenue on a parameterized time range:
The TPC-H benchmark heavily uses paramaters for filtering attributes in queries.However, many use cases can be better off with adding the field filtered with a parameter as a group instead, and let the user filter later.Other examples of replacing parameters with groups see
q02_expanded
and q08_expanded
.Examples: Intermediate
Metrics as attributes: Q04 - “Order Priority Checking”
Learn here:This question checks how many orders had have late deliveries? The way a late delivery is defined is that it is an order that has an item received after the commitment date. Or, in SQL, a late order is one that has line items where
- How aggregations become calculated attributes,and why do it
- Use one in a filter
l_commitdate < l_receiptdate
.
How can we make “is this order late” an attribute of an order?
The idea is that there is aggregation on lineitems (number of late items), that we want to count for each order
orders
(because we want to build a property of an order).
First, can just count all items of an order, not just late. This calculated attribute in orders
will count all line items for each order:
If a calculated attribute of an entity has an aggregation SQL or a metric, that aggregation is counted for each entity key.So if a calculated attribute in
orders
says lineitem.count
it means the count of items for o_orderkey
(the
granularity key of orders).orders
called has_late_deliveries
:
tpch_q04
.
Reusable filters and layered calculations: Q17 - “Small-Quantity-Order Revenue”
Learn here:How much revenue was from orders of small quantities of parts? A small quantity would be defined as an order of less then 20% of an average order quantity of a part. Or in other words: Make a metric counting average quantity of line items called
- That metrics, attributes, and filters from any entity can be easily combined
- To build a filter
avg_qty
:
part
to remember the average quantity a part is ordered:
lineitems
called is_small_quantity_items
(note it uses the field from part
)
tpch_q17
that does few things more.
Use a metric as a group by: Q13 - “Customer Distribution”
Learn here:This question wants to understand how many customers are there, based on how many orders did they make? How many of the customers do a lot of orders and how many do very few? To implement it, first want to remember per customer how many orders a customer had. For that, first create a calculated attribute in
- How to build a distribution histogram by using a metric as a group by attribute
customer
that
counts its orders by using the orders.count
metric:
tpch_q13
that combines this with parameters and filtered metrics.