Introduction to calculated attributes

A calculated attribute (sometimes called a “dimension”) adds a virtual column to an entity.

In other words, the attribute can be computed for each unique row of an entity.

Attributes within entities can be simple computations (like a CASE WHEN or multiplying a value). They can also be complex computations that span multiple related entities and aggregations.

Metadata

Attributes may include metadata such as their owner, business description, labels. See metadata section for more details.

Basic calculated attribute

A basic calculated attribute only depends on other attributes in the same entity.

For example, the lineitems entity (which represents lines of an order), there is an attribute that is called extendedprice. It represents the price not accounting for discount. You might want to have an attribute that reflects the actual price.

The definition of price is:

lineitems.extendedprice*(1-lineitems.discount)*(1-lineitems.tax)

This code is standard SQL. Anything in SQL (like window functions, nested data unpacking, user-defined functions, and type conversions) just works.

Combining entities

Calculated attributes can be based on multiple entities.

Which entities can be used depends on their relationship and level of granularity:

An attribute may use it its calculation

  • Any attribute from a related entity that has the same or lower level of granularity as its own.
  • Any aggregation on a related entity that has a higher level of granularity

Eventually, a calculated attribute is built per unique key (row) of its entity and must be able to resolve its dependencies for a row.

For example, an attribute in the orders entity has a value for each order.

If you want to combine it with attributes of other entities, it can use any attribute from the customers entity, but not from the lineitem entity:

  • Each order has a single associated customer: the customers entity is related to orders entity with a lower granularity (a many:1 relationship). Any attribute in orders may refer to any attribute in customers. Technically, the row per order can be extended with columns coming from its customer.
  • Each order has many associated lineitems: the lineitem entity is related to the orders entity with a higher granularity (a 1:many relationship). So the is no single lineitem per order to choose from for the calculation, so using it would result in an error.

However, it is possible to aggregate lineitems for an order: it possible to create an attribute per order that counts how many lineitems it had. See examples below.

Multi-entity attribute

The price of a line item is based on the quantity acquired times the price per unit (if you buy 3 apples for $2, then the price of the line is $6).

The calculation would combine parts and the lineitems entity: every lineitem has an associated part that can be used in the calculation.

lineitems.quantity * parts.retailprice

When a calculated attribute refers to multiple entities, the engine will verify it can build a JOIN between them to compute. This is possible when referring to related entities that have a many-to-one relationship.

Since in this example, every row of lineitems has exactly one associated part from parts this calculation is possible. If the relation was reversed (a line item would have multiple possible parts), it would result in an error as it would not be possible to choose a part.

Note - This is originally encoded in the TPC-H extendedprice attribute of lineitems. We can recreate the same calculation.

Using aggregations

Attributes may use aggregations as part of the calculation. It is usually required when referring to entities with a higher granularity than the one of the attribute.

For example, a customer has many orders, so an attribute in customer can’t refer to a single order, but it can look at any aggregation over them like a count.

While it is possible to use aggregations on entities of same or lower granularity, it has little value: when referring to a lower granularity only a single row matches (an order has a single customer), so aggregating over can be meaningless.

When referring to aggregations, that are two types to consider:

  • Ad-hoc aggregations (like a SUM or a COUNT)
  • Metrics - named aggregations (see metrics for more)

The behavior for both is the same, but an ad-hoc aggregation in an attribute is not as reusable as a metric as no other attribute can use it. Whether to use an ad-hoc aggregation or a metric depends on data modeling and reusability requirements. The best practice is use a metric if the aggregation code is expected to be reused, and ad-hoc otherwise.

Attribute with aggregations

The total price of an order is the sum of the prices of every line item bought in that order.

sum(lineitems.price)

Note - this is originally encoded in the totalprice attribute of orders TPC-H sample data. We can recreate the same calculation as a calculated field.

Note this example uses the price calculated attribute defined in the previous section.

Any aggregation defined in a calculated attribute of an entity is grouped by that entity key. Here, every unique row of an order, gets the sum of related line items.

Note that this can part of a more complex calculation:

case
    when orders.totalprice_calculated > 250000 then 'big'
    when orders.totalprice_calculated < 25000 then 'small'
    else 'regular'
end

Attribute with metrics (named aggregations)

The TPC-H schema defines that the status of order this way:

  • When status of all lines is F (fulfilled) then status of order is F
  • When status of all lines is O (ordered) then status of order is O
  • Else the status of order is P (partially fulfilled)

This calculation can be a calculated field using the metric lineitems.count that is the count of lineitems, combined with a filter:

case
    when (lineitems.count = lineitems.count FILTER (where lineitems.linestatus = 'F')) then 'F'
    when (lineitems.count = lineitems.count FILTER (where lineitems.linestatus = 'O')) then 'O'
    else 'P'
end

Metrics are a very powerful abstractions for reusability. When a metric is combined with a FILTER qualifier like her it can reduced to aggregating over a selection of related rows to the attribute instead of them all. Check out metrics for more.

Note - this is originally encoded in the orderstatus attribute of orders TPC-H sample data.

Constant attributes

Attributes can be constants and don’t have to compute anything. Using an attribute like that will create a virtual column with a constant value.

However, in most use cases that call for a constant, it is better to use parameters both for reusability reasons (parameters allow to change their value in an ad-hoc computation) and for performance reasons.

YAML Schema

Every attribute is backed by a text file in git that defines it, and keeps history of every change.

The schema for an attribute is:

type: calculated_attribute
entity: [entity name]
name: [attribute name]
display_name: [display name]
owner: [owner]
labels: [...]
folder: [folder]
desciption: |-
  [description]
datatype: [datatype - number/string/bool/...]
timegrain: [the time grain for time attributes - hour/day/month/...]
sql: |-
  [sql statement]

Fields:

  • entity: Name of entity the attribute belongs to
  • name: Name of attribute
  • display_name, owner, description, labels, folder: Metadata
  • datatype: Data type of attribute, as exposed to JDBC (string by default)
  • timegrain: Time grain of the attribute, if it is a date/time attribute. This can be used in time-based aggregations.
  • sql: SQL statement of the calculated attribute

For example, a calculated attribute extracting the month of an order:

type: calculated_attribute
entity: orders
name: order_month
datatype: date
sql: |-
  DATE_TRUNC('MONTH', orders.o_orderdate)