Schema in Git

Every building block in Honeydew (calculations, entities, dynamic datasets) is defined under the hood in a text file managed in Git.

While Honeydew is in Beta, the underlying schema may change.

Example

TPC-H is an example data set (back from the ‘90s), used mostly for databases & data warehouses performance benchmarking.

See TPC-H Example.

In Honeydew, we will build four entities to represent these four tables, as well as their relations.

Note that for any component, the name of the yaml file is the same as the name field within its definition.

Parts

Let’s start with the entity parts. Here is how the definition in parts/parts.yml will look like:

type: entity
name: parts
keys: [partkey]
key_dataset: tpch_parts

As we can see, it contains a dataset called tpch_parts. It also contains a single key called partkey. This is a hint for Honeydew semantic engine regarding how this entity can be referred from other entities as part of the ERD. We will expand on this later.

This is the definition for parts/datasets/tpch_parts.yml:

type: dataset
entity: parts
name: tpch_parts
sql: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.PART
dataset_type: table
attributes:
  - column: p_partkey
    name: partkey
    datatype: string
  - column: p_name
    name: name
    datatype: string
  - column: p_mfgr
    name: manufacturer
    datatype: string
  - column: p_brand
    name: brand
    datatype: string
  - column: p_type
    name: part_type
    datatype: string
  - column: p_size
    name: part_size
    datatype: string
  - column: p_container
    name: partcontainer
    datatype: string
  - column: p_retailprice
    name: retailprice
    datatype: float

We can see its SQL relates to a table called SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.PART in Snowflake. We know it’s a table since we have dataset_type: table. The other options are sql - when we want to define the table over a certain SQL select statement, and entity - when we want to create the dataset over an existing entity.

It has the attributes partkey, name, manufacturer, brand, part_type, etc…

Each attribute has a name, as well as the mapping to the column in the data warehouse, through the column field. Each attribute also has a datatype definition. The values can be string, date, number of float.

All the attributes of this data set are directly available as part of the entity parts.

Customers

Let’s take a look at another example, for the entity customers.

This is how customers/customers.yml looks like:

type: entity
name: customers
keys: [custkey]
key_dataset: tpch_customers

This is the definition for customers/datasets/tpch_customers.yml (very similar to what we saw for tpch_parts.yml):

type: dataset
entity: customers
name: tpch_customers
sql: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
dataset_type: table
attributes:
  - column: c_custkey
    name: custkey
    datatype: string
  - column: c_name
    name: name
    datatype: string
  - column: c_nationkey
    name: nationkey
    datatype: string
  - column: c_acctbal
    name: balance
    datatype: float

However, for customers, we also have a definition of a metric called count.

This is how it is defined in customers/metrics/count.yml:

type: metric
entity: customers
name: count
datatype: number
sql: |-
  COUNT(*)

The sql field defines how the aggregation of this metric is performed. In this case, it’s a simple count(*) metric.

Orders

Let now look at the next entity, called orders, which has many more attributes.

This is orders/orders.yml:

type: entity
name: orders
keys: [orderkey]
key_dataset: tpch_orders
relations:
  - src_field: o_custkey
    target_entity: customers
    target_field: custkey
    rel_type: many-to-one

Note the relations defined here, it’s the first time we see them. Here we can see that the order entity is related to customers in a many-to-one relation (there can be multiple orders for each customer).

And this is orders/datasets/tpch_orders.yml:

type: dataset
entity: orders
name: tpch_orders
sql: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS
dataset_type: table
attributes:
  - column: o_orderkey
    name: orderkey
    datatype: string
  - column: o_custkey
    name: custkey
    datatype: string
  - column: o_orderstatus
    name: status
    datatype: string
  - column: o_orderdate
    name: date
    datatype: date
  - column: o_shippriority
    name: shippriority
    datatype: string
  - column: o_orderpriority
    name: orderpriority
    datatype: string

So far it all looks familiar.

For this entity we are also creating a few calculated attributes.

Let’s start with is_valid_status. This is how it is defined in orders/attributes/is_valid_status.yml:

type: calculated_attribute
entity: orders
name: is_valid_status
datatype: number
sql: |-
  orders.status IN ('O', 'F')

We see that it is a calculated attribute according to type: calculated_attribute.

Here we have an sql definition for this attribute, which checks whether status is included in a list of predefined values.

Let’s see another attribute - orders/attributes/is_good_order.yml:

type: calculated_attribute
entity: orders
name: is_good_order
datatype: number
sql: |-
  orders.is_valid_status AND customers.c_nationkey LIKE 'a%'

Here we have a rather simple SQL definition as well. Note that here we refer two different fields on two different entities - is_valid_status on orders, and c_nationkey on customers. Note how we don’t have to tell anything about the way to join these entities - the semantic engine will figure it out from these two entities’ definitions.

Now let’s take a look at a rather complex calculated attribute - orders/attributes/days_since_last_order.yml:

type: calculated_attribute
entity: orders
name: days_since_last_order
datatype: number
sql: |-
  orders.date - LAG(orders.date, 1) OVER (PARTITION BY customers.custkey ORDER BY orders.date)

Here the SQL is a window function calculation on top of both orders and customers entities.

Here’s another simple calculated attribute - orders/attributes/order_month.yml:

type: calculated_attribute
entity: orders
name: order_month
datatype: date
sql: |-
  DATE_TRUNC('MONTH', orders.date)

It simply calculates the month for the given date.

We have a similar one also for the year - orders/attributes/order_year.yml:

type: calculated_attribute
entity: orders
name: order_year
datatype: date
sql: |-
  DATE_TRUNC('YEAR', orders.date)

For orders we also have a couple of filters defined. Let’s see how a filter definition looks like.

First, let’s take a look at orders/filters/is_fully_shipped.yml:

type: filter
entity: orders
name: is_fully_shipped
filter_sql: |-
  orders.status = 'O'

This filter encapsulates the semantic definition for fully-shipped orders. Note that the definition is done through the filter_sql field.

Here’s another one - orders/filters/this_year.yml:

type: filter
entity: orders
name: this_year
filter_sql: |-
  orders.date BETWEEN DATE('1995-01-01') AND DATE('1995-12-31')

The concept is pretty similar.

And yes, we know, it’s funny to call 1995 “this year”, but that’s how old TPC-H is 🙂.

We have a few metrics defined on orders, let’s take a look at some of them, starting with count_done_orders (orders/metrics/count_done_orders.yml):

type: metric
entity: orders
name: count_done_orders
datatype: number
sql: |-
  COUNT(*) FILTER (WHERE orders.status = 'O')

In this metric we are introducing the concept of filtered metrics. It would automatically count only done orders, at any level of granularity that this metric is used.

Here’s another one, for calculating average time between orders orders/metrics/avg_time_between_orders.yml :

type: metric
entity: orders
name: avg_time_between_orders
datatype: number
sql: |-
  AVG(orders.days_since_last_order)

Remember the days_since_last_order attribute we defined earlier on? This metric uses that attribute to calculate this metric.

Line Items

Let look at the last of the four entities - lineitems.

Here’s lineitems/lineitems.yml:

type: entity
name: lineitems
keys: [l_orderkey, l_linenumber]
key_dataset: tpch_lineitems
relations:
  - target_entity: orders
    rel_type: many-to-one
    connection:
      - src_field: l_orderkey
        target_field: orderkey
  - target_entity: parts
    rel_type: many-to-one
    connection:
      - src_field: l_partkey
        target_field: partkey

Here we can see that the order_line entity is related to orders in a many-to-one relation (there can be multiple order lines within each order). We can also see that lineitems is also related to parts in a many-to-one relation (there can be multiple order lines referring to the same part).

And the dataset supporting it - lineitems/datasets/tpch_lineitems:

type: dataset
entity: lineitems
name: tpch_lineitems
sql: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.LINEITEM
dataset_type: table
attributes:
  - column: l_linenumber
    name: linenumber
    datatype: number
  - column: l_orderkey
    name: orderkey
    datatype: string
  - column: l_partkey
    name: partkey
    datatype: string
  - column: l_discount
    name: discount
    datatype: float
  - column: l_extendedprice
    name: price
    datatype: float
  - column: l_quantity
    name: quantity
    datatype: number
  - column: l_returnflag
    name: returnflag
    datatype: string
  - column: l_linestatus
    name: status
    datatype: string
  - column: l_tax
    name: tax
    datatype: float
  - column: l_shipdate
    name: shipdate
    datatype: date
  - column: l_commitdate
    name: commitdate
    datatype: date
  - column: l_receiptdate
    name: receiptdate
    datatype: date
  - column: l_shipmode
    name: shipmode
    datatype: string