Attributes
Introduction to calculated attributes
A calculated attribute (sometimes called a “dimension”) adds a virtual column to an entity.
In other words, the attribute can be computed for each unique row of an entity.
Attributes within entities can be simple computations (like a CASE WHEN
or multiplying a value). They can also be complex computations that span multiple related entities and aggregations.
Metadata
Attributes may include metadata such as their owner, business description, labels. See metadata section for more details.
Basic calculated attribute
A basic calculated attribute only depends on other attributes in the same entity.
For example, the lineitems
entity (which represents lines of an order), there is an attribute that is called extendedprice
. It represents the price not accounting for discount. You might want to have an attribute that reflects the actual price.
The definition of price
is:
This code is standard SQL. Anything in SQL (like window functions, nested data unpacking, user-defined functions, and type conversions) just works.
Combining entities
Calculated attributes can be based on multiple entities.
Which entities can be used depends on their relationship and level of granularity:
An attribute may use it its calculation
- Any attribute from a related entity that has the same or lower level of granularity as its own.
- Any aggregation on a related entity that has a higher level of granularity
Eventually, a calculated attribute is built per unique key (row) of its entity and must be able to resolve its dependencies for a row.
For example, an attribute in the orders
entity has a value for each order.
If you want to combine it with attributes of other entities, it can use any attribute from the customers
entity, but not from the lineitem
entity:
- Each order has a single associated customer: the
customers
entity is related toorders
entity with a lower granularity (a many:1 relationship). Any attribute inorders
may refer to any attribute incustomers
. Technically, the row per order can be extended with columns coming from its customer. - Each order has many associated lineitems: the
lineitem
entity is related to theorders
entity with a higher granularity (a 1:many relationship). So the is no single lineitem per order to choose from for the calculation, so using it would result in an error.
However, it is possible to aggregate lineitems
for an order: it possible to create an attribute per order that counts how many lineitems it had. See examples below.
Multi-entity attribute
The price of a line item is based on the quantity acquired times the price per unit (if you buy 3 apples for $2, then the price of the line is $6).
The calculation would combine parts
and the lineitems
entity: every lineitem has an associated part that can be used in the calculation.
When a calculated attribute refers to multiple entities, the engine will verify it can build a JOIN between them to compute. This is possible when referring to related entities that have a many-to-one relationship.
Since in this example, every row of lineitems
has exactly one associated part from parts
this calculation is possible. If the relation was reversed (a line item would have multiple possible parts), it would result in an error as it would not be possible to choose a part.
Note - This is originally encoded in the TPC-H extendedprice
attribute of lineitems
. We can recreate the same calculation.
Lookup attributes
Assume there is a table promotion_part_types
that has a list of every promoted part type.
is_promotion
attribute for a part would be TRUE is it matches a value from that table.
This is called a lookup attribute:
An attribute lookup would be compiled to a WHERE IN (SELECT ...)
over a
subquery of potential attribute values.
Using aggregations
Attributes may use aggregations as part of the calculation. It is usually required when referring to entities with a higher granularity than the one of the attribute.
For example, a customer has many orders, so an attribute in customer
can’t refer to a single order, but it can look at any aggregation over them like a count.
While it is possible to use aggregations on entities of same or lower granularity, it has little value: when referring to a lower granularity only a single row matches (an order has a single customer), so aggregating over can be meaningless.
When referring to aggregations, that are two types to consider:
- Ad-hoc aggregations (like a
SUM
or aCOUNT
) - Metrics - named aggregations (see metrics for more)
The behavior for both is the same, but an ad-hoc aggregation in an attribute is not as reusable as a metric as no other attribute can use it. Whether to use an ad-hoc aggregation or a metric depends on data modeling and reusability requirements. The best practice is use a metric if the aggregation code is expected to be reused, and ad-hoc otherwise.
Attribute with aggregations
The total price of an order is the sum of the prices of every line item bought in that order.
Note - this is originally encoded in the totalprice
attribute of orders
TPC-H sample data. We can recreate the same calculation as a calculated field.
Note this example uses the price
calculated attribute defined in the previous section.
Any aggregation defined in a calculated attribute of an entity is grouped by that entity key. Here, every unique row of an order, gets the sum of related line items.
Note that this can part of a more complex calculation:
Attribute with metrics (named aggregations)
The TPC-H schema defines that the status of order this way:
- When status of all lines is
F
(fulfilled) then status of order isF
- When status of all lines is
O
(ordered) then status of order isO
- Else the status of order is
P
(partially fulfilled)
This calculation can be a calculated field using the metric lineitems.count
that is the count of lineitems, combined with a filter:
Metrics are a very powerful abstractions for reusability. When a metric is
combined with a FILTER
qualifier like her it can reduced to aggregating over
a selection of related rows to the attribute instead of them all. Check out
metrics for more.
Note - this is originally encoded in the orderstatus
attribute of orders
TPC-H sample data.
Constant attributes
Attributes can be constants and don’t have to compute anything. Using an attribute like that will create a virtual column with a constant value.
However, in most use cases that call for a constant, it is better to use parameters both for reusability reasons (parameters allow to change their value in an ad-hoc computation) and for performance reasons.
Semi-structured attributes
Attributes can be semi-structured objects like a JSON or an array.
Can also create regular attributes by unpacking nested fields, e.g.
And
Note that it is common to explicitly cast to the correct type when unpacking a semi-structured attribute.
Since most BI tools do not support semi-structured objects, their data type in SQL interface is reported as a string.
YAML Schema
Every attribute is backed by a text file in git that defines it, and keeps history of every change.
The schema for an attribute is:
Fields:
entity
: Name of entity the attribute belongs toname
: Name of attributedisplay_name
,owner
,description
,labels
,folder
,hidden
: Metadatadatatype
: Data type of attribute, as exposed to JDBC (string by default)timegrain
: Time grain of the attribute, if it is a date/time attribute. This can be used in time-based aggregations.sql
: SQL statement of the calculated attributemetadata
: Additional metadata for the attribute (see examples for Tableau and Power BI)
For example, a calculated attribute extracting the month of an order: