Background

TPC-H is an example data set used mostly for databases & data warehouses performance benchmarking.

The data represents an ordering system. It contains customers and orders, where each order is comprised of order line items, and each line item is a certain part.

A dataset derived from TPC-H is readily available as part of any Snowflake instance, in the SNOWFLAKE_SAMPLE_DATA database.

Honeydew Example Data

This is the ERD for a schema derived from TPC-H, as exists in the Snowflake sample data.

Honeydew examples are based on those tables, as well as their relations. They are build on the SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 schema: TPC-H with a scale factor of 1.

Full schema can be found at https://github.com/honeydew-ai/tpch-demo

What the example does not show

The TPC-H represents a warehouse modeled with dimensional modeling. Data lakes today include data modeled in other ways, such as semi-nested data or wide fact tables.

Honeydew is built to support a wide variety of data structures. However, some of the techniques used in the example are more applicable for dimensional modeling, and less for other types of data.

TPC-H Derived Data Model

This is the ERD as is reflected in the Honeydew UI:

Honeydew includes information about every entity and relationship in data.

Entities

The simplest type of an entity is a table in the data warehouse. For example, the customer entity is based on the SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER table.

Every entity also has a granularity associated with it (think of it as a virtual table with rows) - for example, the customer entity represents a customer. Every “row” in it, is a unique customer.

The granularity is part of the entity definition and called its “key”. For customer, it is the column c_custkey in the table.

Simple entities in the model include:

  • customer - based on the CUSTOMER table with key c_custkey
  • orders - based on the ORDERS table with key o_orderkey
  • parts - based on the PARTS table with key p_partkey
  • region - based on the REGION table with key r_regionkey
  • supplier - based on the SUPPLIER table with key s_suppkey
  • nation - based on the NATION table with key n_nationkey

A table with a single key is the most common type of entity.

Note that in some cases an entity key is more than one column. For example, the lineitem entity represents the lines of an order. Its unique key is a combination of an order key and a line offset in it.

There are two entities like that:

  • lineitem - lines in an order. Based on the LINEITEM table with key (l_orderkey, l_linenumber)
  • partsupp - data on supplier/part combination. Based on the PARTSUPP table with key (ps_partkey, ps_suppkey)

The last entity in the example is called supplier_quarters. It is a “virtual entity” - an entity that does not exist as a table, but is based on a calculation. More about it and when are virtual entities useful, later.

For more information, read on entities

Relations

Entities by themselves don’t help build interesting queries. Insights come from connecting different business domains and entities. In SQL terms, different entities need to be JOINed.

Relationships tell Honeydew which path to take to JOIN entities, and how to compute metrics across multiple entities.

The basic information of relationship in Honeydew includes:

  • Its direction (1:many or many:1)
  • The columns to JOIN on (might be more than one)

For example, orders and lineitem relationship is:

  • 1:many (an order has many line items in it)
  • Based on joining orders.o_orderkey = lineitem.l_orderkey

In the Honeydew UI, the direction of the arrow signifies the relative granularity (1:many):

For more information, read on relations

Relations of customer

orders is also connected to customers:

  • many:1 (a customer has many orders)
  • Based on joining orders.o_custkey = customer.c_custkey

One of the advanced properties of a relationship is its default JOIN direction (left/right/inner/automatic).

Since some customers don’t have orders, but every order has a customer, the best default JOIN direction is LEFT JOIN.

The orders to customers relationship is set that way (marked in the UI with the small join type marker on the arrow).

Most modern data models will use LEFT as join configuration between facts and dimensions.

Examples: Basic

The TPC-H benchmark includes different queries that can be used to highlight different concepts of using Honeydew for semantic modeling.

All the “Q##” here refer to questions defined in the benchmark.

Most of the time using Honeydew will be done with a UI or API. However can explore the calculations in their source form:

Honeydew separates between a semantic model (entities and metrics) and queries on it (a specific metric in a specific grain).

Queries defined by the TPC-H benchmark are implemented as Dynamic Datasets, showcasing different ways to ask about data.

Queries can also be ad-hoc, using the [SQL interface][/jdbc], the Native App or a BI tool.

The order of queries here is based on best order of learning.

The first one: Q01 - “Pricing Summary Report”

Learn here:

  1. How to slice and dice metrics by attributes of a single table
  2. How to make a reusable metric object
  3. How to use filters

This question counts business metrics by status of transactions - what was billed, shipped, and returned.

A dynamic dataset in Honeydew is a collection of attributes and metrics. By default, every metric is grouped by all the attributes.

So, to implement it need to cut the data by status of transactions - the attributes lineitem.l_linestatus and lineitem.l_returnflag.

Counting line items is the dynamic dataset

attributes:
  - lineitem.l_linestatus
  - lineitem.l_returnflag
metrics:
  - lineitem.count

The count metric is automatically created for every entity.

The Q01 query calls for a number of other metrics as well. For example, it asks for Total and Average Quantity:.

Can build a dynamic dataset to include those metrics:

attributes:
  - lineitem.l_linestatus
  - lineitem.l_returnflag
metrics:
  - lineitem.count
  - SUM(lineitem.l_quantity) as sum_qty
  - AVG(lineitem.l_quantity) as avg_qty

This computes them in a query. But does not make those metrics part of the semantic model. To add them to the semantic model create them in the lineitem entity.

Metrics are standalone aggregations. They can be defined once and reused anywhere.

Create the metric sum_qty and avg_qty metrics in the lineitem entity with their SQL, as

SUM(lineitem.l_quantity)

Now the dataset can look like

attributes:
  - lineitem.l_linestatus
  - lineitem.l_returnflag
metrics:
  - lineitem.count
  - lineitem.l_sum_qty
  - lineitem.l_avg_qty

But what if want to see this metric only on a subset of the data? Let’s filter

attributes:
  - lineitem.l_linestatus
  - lineitem.l_returnflag
metrics:
  - lineitem.count
  - lineitem.l_sum_qty
  - lineitem.l_avg_qty
filters:
  - lineitem.l_shipdate <= date('1998-12-01')

This will not count any line items after Dec 1998. (yes, the TPC-H sample data happens in the 90s).

Check out the full dynamic dataset tpch_q01 for more.

Calculated attributes: Q16 - “Parts/Supplier Relationship”

Learn here:

  1. How to build calculated attributes.
  2. How to use them in a filter.

This question focuses on finding out how many good suppliers can supply parts. A good supplier is one that has not received complaints.

The way a “bad” supplier is marked in the data is the supplier comment field: supplier.s_comment like '%Customer%Complaints%'.

So easy to count suppliers that had not complaints in a dynamic dataset:

attributes: []
metrics:
  - supplier.count
filters:
  - not supplier.s_comment like '%Customer%Complaints%'

But perhaps want to use the logic of a supplier not having complaints elsewhere?

So far, all the attributes we used were columns in the data. A calculated attribute is like a “virtual column”.

Let’s codify this logic in a reusable way by making a calculated attribute in the supplier entity called has_complaints:

supplier.s_comment like '%Customer%Complaints%'

Now can check filter out suppliers with complaints:

attributes: []
metrics:
  - supplier.count
filters:
  - not supplier.has_complaints

Can also use has_complaints as an attribute to group by, or as a part of another expression.

If a calculation is defined in a dynamic dataset, it is not reusable elsewhere. Make it a calculated attribute or a metric in the semantic model to reuse it and manage changes.

Check out the full tpch_q16 for more.

Combining entities: Q03 - “Shipping Priority”

Learn here:

  1. How to combine different entities together.
  2. How to use parameters to filter.

This question looks at unshipped orders at a given date and the unshipped value, to get unshipped orders with the highest value.

The previous query operated only on one entity and one table - lineitem.

But what if need to combine entities and tables together? In this query, would want to see a metric of order value (counted on the items of an order in lineitem) but break it down by order date and shipping priority.

Let’s start with creating a value metric called revenue in lineitem, that counts the actual price paid including discounts:

sum(lineitem.l_extendedprice*(1-lineitem.l_discount))

Now can use it in a dynamic dataset, such as:

attributes:
  - orders.o_orderdate
  - orders.o_shippriority
metrics:
  - lineitem.revenue

This calculated revenue on lineitem, grouped by attributes in orders.

Under the hood, Honeydew joins the orders and lineitem tables based on the relationship between them.

Can combine any entities together, as long as there is a relationship between them that supports the question.

Don’t worry, if there is no way to combine data, Honeydew will tell you.

Now, this is not the whole question - we want revenue of only unshipped orders at a particular date.

So, lets filter orders that happened before a given date, but their ship date has not arrived yet:

attributes:
  - orders.o_orderdate
  - orders.o_shippriority
metrics:
  - lineitem.revenue
filters:
  - lineitem.l_shipdate > $date
  - orders.o_orderdate < $date

Filters can be ad-hoc or reusable boolean attributes. All the filters apply together (AND), and every separate filter is pushed down as close to the data as possible, for performance gain.

When there are a number of filters (like here), the best practice is write separate filters instead of one big expression with an AND, as Honeydew can deliver better performance that way.

Wait, but what is $date?

$date is a parameter, to make this query a parameterized dynamic dataset - so you can choose the date later.

The default value is defined in the dynamic dataset as well, but can be changed at query time:

parameters:
  - name: date
    value: '''1995-03-15''::date'
    description: ''

Check out the full tpch_q03 for more.

Derived metrics: Q14 - “Promotion Effect”

Learn here:

  1. How to create a filtered metric and a ratio metric
  2. Combining different entities in a single metric
  3. Alternatives to using parameters

This questions checks how promotions affect revenue.

A promotion is market in the data in the type part - part.p_type LIKE 'PROMO%.

We have already defined revenue in the previous query (Q03). The definition of revenue was in the metric lineitem.revenue:

sum(lineitem.l_extendedprice*(1-lineitem.l_discount))

The revenue of promotions only counts lineitems that match a promoted part, or:

lineitem.revenue FILTER (WHERE part.p_type LIKE 'PROMO%')

This expression takes the revenue as a black-box, and applies a filter on the rows that it counts on. This is called a filtered metric.

The expression for that metric used a field from part and a field from lineitem. As long as there is a relationship path connection the two, Honeydew will figure out the needed JOINs.

Can check this in a dynamic dataset that only has this.

However, the query calls for promotion effect - or, which percentage of the revenue went to promotions?

The way to define it is a ratio metric, which is called lineitem.promo_revenue:

100.0 * lineitem.revenue FILTER (WHERE part.p_type LIKE 'PROMO%') / lineitem.revenue

Same metric (lineitem.revenue here) can appear multiple times in an expression, each time with different filters or groupings.

The full tpch_q14 looks at the revenue on a parameterized time range:

attributes: []
metrics:
  - lineitem.promo_revenue
filters:
  - lineitem.l_shipdate between $date and dateadd('month', 1, $date)

This follow the TPC-H specification. But what if want to just calculated the promo revenue on all months, and let the user decide later which is interesting?

This is done by just using an attribute as a group:

attributes:
  - date_trunc('month', lineitem.l_shipdate) as ship_month
metrics:
  - lineitem.promo_revenue
filters: []

The TPC-H benchmark heavily uses paramaters for filtering attributes in queries.

However, many use cases can be better off with adding the field filtered with a parameter as a group instead, and let the user filter later.

Other examples of replacing parameters with groups see q02_expanded and q08_expanded.

Examples: Intermediate

Metrics as attributes: Q04 - “Order Priority Checking”

Learn here:

  1. How aggregations become calculated attributes,and why do it
  2. Use one in a filter

This question checks how many orders had have late deliveries?

The way a late delivery is defined is that it is an order that has an item received after the commitment date.

Or, in SQL, a late order is one that has line items where l_commitdate < l_receiptdate.

How can we make “is this order late” an attribute of an order?

The idea is that there is aggregation on lineitems (number of late items), that we want to count for each order

lineitem.count FILTER (WHERE lineitem.l_commitdate < lineitem.l_receiptdate)

To achieve that, need to create a calculated attribute in orders (because we want to build a property of an order).

First, can just count all items of an order, not just late. This calculated attribute in orders will count all line items for each order:

lineitems.count

If a calculated attribute of an entity has an aggregation SQL or a metric, that aggregation is counted for each entity key.

So if a calculated attribute in orders says lineitem.count it means the count of items for o_orderkey (the granularity key of orders).

However, what we want is to count late items and mark the order as late if there is more than 0.

So make a calculated attribute in orders called has_late_deliveries:

lineitem.count FILTER (WHERE lineitem.l_commitdate < lineitem.l_receiptdate) > 0

This will be a boolean attribute (TRUE/FALSE) of an order that is true when there was at least one late item.

Now can use it in other expressions, for example to see how many late orders do we have based on their shipping priority:

attributes:
  - orders.o_orderpriority
metrics:
  - orders.count
filters:
  - orders.has_late_deliveries

See more about metrics as calculated attributes or checkout the full tpch_q04.

Reusable filters and layered calculations: Q17 - “Small-Quantity-Order Revenue”

Learn here:

  1. That metrics, attributes, and filters from any entity can be easily combined
  2. To build a filter

How much revenue was from orders of small quantities of parts?

A small quantity would be defined as an order of less then 20% of an average order quantity of a part.

Or in other words:

Make a metric counting average quantity of line items called avg_qty:

avg(lineitem.l_quantity)

Now use it in as a calculated attribute in part to remember the average quantity a part is ordered:

lineitem.avg_qty

Now use it to make a filter in lineitems called is_small_quantity_items (note it uses the field from part)

lineitem.l_quantity < 0.2*part.avg_qty

Now use this for a question - how much revenue do we have from small quantity orders?

attributes: []
metrics:
  - lineitem.revenue
filters:
  - lineitem.is_small_quantity_items

Check out the full tpch_q17 that does few things more.

Use a metric as a group by: Q13 - “Customer Distribution”

Learn here:

  1. How to build a distribution histogram by using a metric as a group by attribute

This question wants to understand how many customers are there, based on how many orders did they make? How many of the customers do a lot of orders and how many do very few?

To implement it, first want to remember per customer how many orders a customer had.

For that, first create a calculated attribute in customer that counts its orders by using the orders.count metric:

orders.count

Now can use it as a group by for customer count - build the following dynamic dataset:

attributes:
  - customer.order_count
metrics:
  - customer.count

It will count customers by the property of how many orders each had - a distribution query.

Check out the full tpch_q13 that combines this with parameters and filtered metrics.