Attributes
Introduction to calculated attributes
A calculated attribute (sometimes called a “dimension”) adds a virtual column to an entity.
In other words, the attribute can be computed for each unique row of an entity.
Attributes within entities can be simple computations (like a CASE WHEN
or multiplying a value). They can also be complex computations that span multiple related entities and aggregations.
Metadata
Attributes may include metadata such as their owner, business description, labels. See metadata section for more details.
Basic calculated attribute
A basic calculated attribute only depends on other attributes in the same entity.
For example, the lineitems
entity (which represents lines of an order), there is an attribute that is called extendedprice
. It represents the price not accounting for discount. You might want to have an attribute that reflects the actual price.
The definition of price
is:
This code is standard SQL. Anything in SQL (like window functions, nested data unpacking, user-defined functions, and type conversions) just works.
Build the new calculated attribute
Build the new calculated attribute
Go to the lineitems
entity, and press “+Attribute”. Call it price
, enter
the SQL calculation formula above and save it.
Check it with Preview Data
Check it with Preview Data
You can preview data by going to the ‘Preview’ tab.
Combining entities
Calculated attributes can be based on multiple entities.
Which entities can be used depends on their relationship and level of granularity:
An attribute may use it its calculation
- Any attribute from a related entity that has the same or lower level of granularity as its own.
- Any aggregation on a related entity that has a higher level of granularity
Eventually, a calculated attribute is built per unique key (row) of its entity and must be able to resolve its dependencies for a row.
For example, an attribute in the orders
entity has a value for each order.
If you want to combine it with attributes of other entities, it can use any attribute from the customers
entity, but not from the lineitem
entity:
See the relationship graph
See the relationship graph
You can view it in the product by going to the main page of entities.
- Each order has a single associated customer: the
customers
entity is related toorders
entity with a lower granularity (a many:1 relationship). Any attribute inorders
may refer to any attribute incustomers
. Technically, the row per order can be extended with columns coming from its customer. - Each order has many associated lineitems: the
lineitem
entity is related to theorders
entity with a higher granularity (a 1:many relationship). So the is no single lineitem per order to choose from for the calculation, so using it would result in an error.
However, it is possible to aggregate lineitems
for an order: it possible to create an attribute per order that counts how many lineitems it had. See examples below.
Multi-entity attribute
The price of a line item is based on the quantity acquired times the price per unit (if you buy 3 apples for $2, then the price of the line is $6).
The calculation would combine parts
and the lineitems
entity: every lineitem has an associated part that can be used in the calculation.
When a calculated attribute refers to multiple entities, the engine will verify it can build a JOIN between them to compute. This is possible when referring to related entities that have a many-to-one relationship.
Since in this example, every row of lineitems
has exactly one associated part from parts
this calculation is possible. If the relation was reversed (a line item would have multiple possible parts), it would result in an error as it would not be possible to choose a part.
Note - This is originally encoded in the TPC-H extendedprice
attribute of lineitems
. We can recreate the same calculation.
Build the new calculated attribute
Build the new calculated attribute
Go to the lineitems
entity, and press “+Attribute”.
Call it extendedprice_calculated
, and enter the SQL calculation lineitems.quantity * parts.retailprice
and save it.
Check it with a Dynamic Dataset
Check it with a Dynamic Dataset
Build a new dynamic dataset that has the following attributes:
lineitems.extendedprice
(the original from TPC-H)lineitems.extendedprice_calculated
(the new one you just built)
You can see they are very similar. They are not the same due to rounding errors from the calculation used in building the TPC-H sample dataset.
Update the `price` attribute to use it
Update the `price` attribute to use it
Go to the lineitems
entity, and edit the price
attribute.
Change its calculation from
To
It will now use the calculated attribute we just built
Lookup attributes
Assume there is a table promotion_part_types
that has a list of every promoted part type.
is_promotion
attribute for a part would be TRUE is it matches a value from that table.
This is called a lookup attribute:
An attribute lookup would be compiled to a WHERE IN (SELECT ...)
over a
subquery of potential attribute values.
Using aggregations
Attributes may use aggregations as part of the calculation. It is usually required when referring to entities with a higher granularity than the one of the attribute.
For example, a customer has many orders, so an attribute in customer
can’t refer to a single order, but it can look at any aggregation over them like a count.
See the relationship graph
See the relationship graph
You can view it in the product by going to the main page of entities.
While it is possible to use aggregations on entities of same or lower granularity, it has little value: when referring to a lower granularity only a single row matches (an order has a single customer), so aggregating over can be meaningless.
When referring to aggregations, that are two types to consider:
- Ad-hoc aggregations (like a
SUM
or aCOUNT
) - Metrics - named aggregations (see metrics for more)
The behavior for both is the same, but an ad-hoc aggregation in an attribute is not as reusable as a metric as no other attribute can use it. Whether to use an ad-hoc aggregation or a metric depends on data modeling and reusability requirements. The best practice is use a metric if the aggregation code is expected to be reused, and ad-hoc otherwise.
Attribute with aggregations
The total price of an order is the sum of the prices of every line item bought in that order.
Note - this is originally encoded in the totalprice
attribute of orders
TPC-H sample data. We can recreate the same calculation as a calculated field.
Note this example uses the price
calculated attribute defined in the previous section.
Any aggregation defined in a calculated attribute of an entity is grouped by that entity key. Here, every unique row of an order, gets the sum of related line items.
Note that this can part of a more complex calculation:
Build the total price attribute
Build the total price attribute
Go to the orders
entity, and press “+Attribute”.
Call it totalprice_calculated
, enter the SQL calculation formula:
Build the order size classification attribute
Build the order size classification attribute
Go to the orders
entity, and press “+Attribute”.
Call it order_size_classification
, enter the SQL calculation formula:
Check it with a Dynamic Dataset
Check it with a Dynamic Dataset
Build a new dynamic dataset, with orders.order_size_classification
attributes and a single metric COUNT(*) as count
. Preview its data to see how many orders of each size classification there are.
It should be something like:
orders.order_size_classification | count |
---|---|
big | 228,149 |
small | 86,560 |
regular | 1,185,291 |
Attribute with metrics (named aggregations)
The TPC-H schema defines that the status of order this way:
- When status of all lines is
F
(fulfilled) then status of order isF
- When status of all lines is
O
(ordered) then status of order isO
- Else the status of order is
P
(partially fulfilled)
This calculation can be a calculated field using the metric lineitems.count
that is the count of lineitems, combined with a filter:
Metrics are a very powerful abstractions for reusability. When a metric is
combined with a FILTER
qualifier like her it can reduced to aggregating over
a selection of related rows to the attribute instead of them all. Check out
metrics for more.
Note - this is originally encoded in the orderstatus
attribute of orders
TPC-H sample data.
Constant attributes
Attributes can be constants and don’t have to compute anything. Using an attribute like that will create a virtual column with a constant value.
However, in most use cases that call for a constant, it is better to use parameters both for reusability reasons (parameters allow to change their value in an ad-hoc computation) and for performance reasons.
Semi-structured attributes
Attributes can be semi-structured objects like a JSON or an array.
Can also create regular attributes by unpacking nested fields, e.g.
And
Note that it is common to explicitly cast to the correct type when unpacking a semi-structured attribute.
Since most BI tools do not support semi-structured objects, their data type in SQL interface is reported as a string.
YAML Schema
Every attribute is backed by a text file in git that defines it, and keeps history of every change.
The schema for an attribute is:
Fields:
-
entity
: Name of entity the attribute belongs to -
name
: Name of attribute -
display_name
,owner
,description
,labels
,folder
,hidden
: Metadata -
datatype
: Data type of attribute, as exposed to JDBC (string
is the default, if not specified). Possible values are:bool
- a boolean value.number
- an integer numberfloat
- a floating point numberdate
- a date valuetimestamp
- a timestamp valuestring
- a string value, used also for any other type of data (like array, JSON, etc.)
-
timegrain
: Time grain of the attribute, if it is a date/time attribute. This can be used in time-based aggregations. Possible values are:microsecond
,millisecond
,second
,minute
,hour
,day
,week
,month
,quarter
,year
. -
sql
: SQL statement of the calculated attribute -
metadata
: Additional metadata for the attribute (see examples for AI, Tableau, Power BI and Lightdash)
For example, a calculated attribute extracting the month of an order: