Entities
Entity
An Entity in Honeydew is defined as “a collection of stuff that shares the same granularity”.
For example in TPC-H, we can make an entity called orders
- all the data we have per orders.
The orders
entity is based on a TABLE in Snowflake, that has a unique key column in it (orderkey
), and one row per order. All properties of a single order are in the columns.
When modeling a SQL data warehouse, entities like customers may be stored in tables called “Dimensions”, and events on them like payments in tables called “Facts”. If you start from a well-architected data warehouse then import each existing table in it (dimension or fact) as an entity in Honeydew. But don’t worry if you don’t! We are here to make modeling business entities easy.
Granularity Key
The most important thing in an Entity is its ”Granularity Key”. This is what defines what is a unique instance of the Entity - for example in orders
entity, the key is orderkey
defining a unique order.
Granularity Key can be a combination of a few attributes that are unique together. This is called composite or compound key. For example, in TPC-H lineitems
entity, the key is a combination of orderkey
and linenumber
.
An entity can be key-less, but there is significant limitations to key-less entities, the most important is that it only be the higher granularity of every entity related to it.
Honeydew engine assumes entity keys are unique and non-null. Unexpected results may be otherwise.
It is good practice to test the key column(s) for uniqueness (automatic testing within Honeydew coming soon)
Default Date Field
When using time metrics, an entity may set a default date field to join all metrics in an entity to a date spine in the same way.
See default date field in time metrics for more details.
Metadata
Entities may include metadata such as their owner, business description, labels. See metadata section for more details.
Source Table
Entities don’t have to be based on a physical table in the data warehouse. Following entity types are possible:
- A physical table or a view
- A custom SQL query that defines the data for the entity
- A virtual entity, that is based on a calculation
The definition of the source table (regardless of its type) sets columns that become attributes of the entity.
When defining entities in the UI, the entity granularity key must come from its source table. If the entity key is based on a calculated attribute, then create a key-less entity first, make the calculated attribute, and then use it as its key.
Configuration
Source tables based on a data warehouse path (database.schema.table
) can be of any kind supported by Snowflake (TABLE, VIEW, DYNAMIC TABLE, MATERIALIZED VIEW, etc.).
Source tables based on a custom SQL query, will run that query for any access to the entity data.
Custom SQL may result in lower performance:
- Filters might not be pushed down into it, resulting in larger table scans
- Column selection might not be pushed down into it, resulting in unnecessary data movement.
If the source data is a large dataset that is commonly filtered for performance reasons, consider to encapsulate it in a table. In the source data has many columns, consider to encapsulate it in a view.
In particular, always prefer pointing to a table over using SELECT * FROM table
custom SQL source table.
Following configuration is supported:
- Choice of columns from the source table to add to the entity as attributes
- Names of columns: may rename columns before making them into attributes
Use renaming to remove unnecessary prefixes from columns names
Virtual Entity
An Entity is defined by its granularity key and source table.
However, sometimes that granularity comes from attributes that exist in the semantic model.
Use Cases
There are few reasons to make a virtual entity:
- Nested or Denormalized data tables that include few levels of granularity together.
- Build metrics on a level of granularity that is not an entity key.
- Build 1:many relationships to a level of granularity that is not an entity key.
Interface
For a virtual entity, must define:
- Source entity (it can be virtual as well)
- Granularity key that comes from the source entity
- Attributes that come from the source entity that are at the virtual entity granularity
For example an event table might include a user_id
column and a user_name
, that is per user_id
and is duplicated in the event table.
If you have an events entity (on the event table), can build from it a users entity with user_id
as key, and user_name
as an attribute.
Virtual entities have a single granularity key - multi-attribute granularity is not supported.
However, it is easy to make one:
- Build in the source entity a calculated attribute
virtual_key
that is a concatenation of the multiple attributes - Build a virtual entity that has
virtual_key
as a key, and all the multiple attributes as attributes from the source
Example - virtual entity to change metric granularity
For example, in TPC-H, we may want to look at orders.orderdate
as a Granularity Key of a ""Date” entity so we can do daily metrics:
- Make a virtual entity based on
orders
withorderdate
as its key - Make a calculated attribute in it that calculates daily revenue (defined as
lineitems.revenue
) - Make a metric in it that calculated average daily revenue (defined as
AVG(daily_revenue)
) - Use that metric to look at average daily revenue by
lineitems.payment_method
YAML Schema
Entity
Every entity is backed by a text file in git that defines it, and keeps history of every change.
The schema for entity is:
type: entity
name: [name]
display_name: [display name]
owner: [owner]
description: |-
[description...]
labels: [...]
folder: [folder]
hidden: [True/False/Yes/No]
keys:
- [key attr 1]
- ...
key_dataset: [key dataset]
default_date_field: [date field]
is_time_spine: [True/False/Yes/No]
relations: [relations - see relations for details]
Fields:
name
: Name of entitydisplay_name
,owner
,description
,labels
,folder
,hidden
: Metadatakeys
: Granularity keys of entity (can be multiple). Attributes must exist either in dataset or as calculated attributes.key_dataset
: Reference to source table (see below)default_date_field
: Default date field for time metrics defined in the entityis_time_spine
: If the entity is a time spine, used for time metrics
Source Table
Every entity relies on a source table that is in a separate text file in git.
The schema for source table is:
type: dataset
entity: [entity name]
name: [source table name]
display_name: [display name]
sql: |
[sql statement]
owner: [owner]
description: |-
[description]
labels: [...]
folder: [folder]
hidden: [True/False/Yes/No]
dataset_type: [table/sql/entity]
attributes:
- column: [src column name]
name: [attribute name]
display_name: [display name]
datatype: [date type - number/string/...]
timegrain: [the time grain for time attributes - hour/day/month/...]
desciption: [desciption]
labels: [...]
folder: [folder]
hidden: [True/False/Yes/No]
Fields:
name
: Name of source tabledisplay_name
,owner
,description
,labels
,folder
,hidden
: Metadatadataset_type
: Choose if a table, a SQL query or a virtual entitysql
: The name of table (when type is a table) / the base entity (virtual table) / SQL query (custom sql)attributes
: List of attributes to bring from the source, with optional renaming
Was this page helpful?