SNOWFLAKE_SAMPLE_DATA
database.
SNOWFLAKE_SAMPLE_DATA.TPCH_SF1
schema: TPC-H with a scale factor of 1.
Full schema can be found at https://github.com/honeydew-ai/tpch-demo
customer
entity is based on the SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
table.
Every entity also has a granularity associated with it (think of it as a virtual table with rows) - for example, the customer
entity represents a customer. Every “row” in it, is a unique customer.
The granularity is part of the entity definition and called its “key”. For customer, it is the column c_custkey
in the table.
Simple entities in the model include:
customer
- based on the CUSTOMER
table with key c_custkey
orders
- based on the ORDERS
table with key o_orderkey
parts
- based on the PARTS
table with key p_partkey
region
- based on the REGION
table with key r_regionkey
supplier
- based on the SUPPLIER
table with key s_suppkey
nation
- based on the NATION
table with key n_nationkey
lineitem
entity represents the lines of an order.
Its unique key is a combination of an order key and a line offset in it.
There are two entities like that:
lineitem
- lines in an order. Based on the LINEITEM
table with key (l_orderkey
, l_linenumber
)partsupp
- data on supplier/part combination. Based on the PARTSUPP
table with key (ps_partkey
, ps_suppkey
)supplier_quarters
. It is a “virtual entity” - an entity that does not exist as a table, but is based
on a calculation. More about it and when are virtual entities useful, later.
orders
and lineitem
relationship is:
orders.o_orderkey = lineitem.l_orderkey
orders
is also connected to customers
:
orders.o_custkey = customer.c_custkey
LEFT JOIN
.
The orders
to customers
relationship is set that way (marked in the UI with the small join type marker on the arrow).
Learn here:This question counts business metrics by status of transactions - what was billed, shipped, and returned. A dynamic dataset in Honeydew is a collection of attributes and metrics. By default, every metric is grouped by all the attributes. So, to implement it need to cut the data by status of transactions - the attributes
- How to slice and dice metrics by attributes of a single table
- How to make a reusable metric object
- How to use filters
lineitem.l_linestatus
and lineitem.l_returnflag
.
Counting line items is the dynamic dataset
count
metric is automatically created for every entity.lineitem
entity.
sum_qty
and avg_qty
metrics in the lineitem
entity with their SQL, as
tpch_q01
for more.
Learn here:This question focuses on finding out how many good suppliers can supply parts. A good supplier is one that has not received complaints. The way a “bad” supplier is marked in the data is the supplier comment field:
- How to build calculated attributes.
- How to use them in a filter.
supplier.s_comment like '%Customer%Complaints%'
.
So easy to count suppliers that had not complaints in a dynamic dataset:
supplier
entity called has_complaints
:
has_complaints
as an attribute to group by, or as a part of another expression.
tpch_q16
for more.
Learn here:This question looks at unshipped orders at a given date and the unshipped value, to get unshipped orders with the highest value. The previous query operated only on one entity and one table -
- How to combine different entities together.
- How to use parameters to filter.
lineitem
.
But what if need to combine entities and tables together? In this query, would want to see a metric of
order value (counted on the items of an order in lineitem
) but break it down by order date and shipping priority.
Let’s start with creating a value metric called revenue
in lineitem
, that counts the actual price paid including discounts:
revenue
on lineitem
, grouped by attributes in orders
.
Under the hood, Honeydew joins the orders
and lineitem
tables based on the relationship between them.
AND
, as Honeydew can deliver better performance that way.$date
?
$date
is a parameter, to make this query a parameterized dynamic dataset - so you can choose the date later.
The default value is defined in the dynamic dataset as well, but can be changed at query time:
tpch_q03
for more.
Learn here:This questions checks how promotions affect revenue. A promotion is market in the data in the type part -
- How to create a filtered metric and a ratio metric
- Combining different entities in a single metric
- Alternatives to using parameters
part.p_type LIKE 'PROMO%
.
We have already defined revenue in the previous query (Q03). The definition of revenue was in the metric lineitem.revenue
:
revenue
as a black-box, and applies a filter on the rows that it counts on. This is called a filtered metric.
part
and a field from lineitem
. As long as there is a relationship path connection the
two, Honeydew will figure out the needed JOINs.lineitem.promo_revenue
:
lineitem.revenue
here) can appear multiple times in an expression, each time with different filters or groupings.tpch_q14
looks at the revenue on a parameterized time range:
q02_expanded
and q08_expanded
.Learn here:This question checks how many orders had have late deliveries? The way a late delivery is defined is that it is an order that has an item received after the commitment date. Or, in SQL, a late order is one that has line items where
- How aggregations become calculated attributes,and why do it
- Use one in a filter
l_commitdate < l_receiptdate
.
How can we make “is this order late” an attribute of an order?
The idea is that there is aggregation on lineitems (number of late items), that we want to count for each order
orders
(because we want to build a property of an order).
First, can just count all items of an order, not just late. This calculated attribute in orders
will count all line items for each order:
orders
says lineitem.count
it means the count of items for o_orderkey
(the
granularity key of orders).orders
called has_late_deliveries
:
tpch_q04
.
Learn here:How much revenue was from orders of small quantities of parts? A small quantity would be defined as an order of less then 20% of an average order quantity of a part. Or in other words: Make a metric counting average quantity of line items called
- That metrics, attributes, and filters from any entity can be easily combined
- To build a filter
avg_qty
:
part
to remember the average quantity a part is ordered:
lineitems
called is_small_quantity_items
(note it uses the field from part
)
tpch_q17
that does few things more.
Learn here:This question wants to understand how many customers are there, based on how many orders did they make? How many of the customers do a lot of orders and how many do very few? To implement it, first want to remember per customer how many orders a customer had. For that, first create a calculated attribute in
- How to build a distribution histogram by using a metric as a group by attribute
customer
that
counts its orders by using the orders.count
metric:
tpch_q13
that combines this with parameters and filtered metrics.