Datasets
How to build datasets based on shared semantics
Intro
A Dynamic Dataset is a way to look at data using Honeydew. A typical Honeydew deployment has many thousands of pieces of logic. An analysis might need just a subset, for example, the orders of a particular customer.
The simplest dataset is “all the data of an entity”.
Dynamic datasets are built from:
- Attributes
- Metrics
- Filters
The components can come from fields defined in the semantic layer (attributes and metrics) or from ad-hoc SQL expressions.
Unlike attributes or metrics defined in entities, ad-hoc calculations defined in a dataset are not reusable by other calculations.
Use Cases
- Build static datasets for data science (e.g. to load into a Python data frame)
- Share data with external tools that accept a table
- Deliver semantic models into dbt
- Set up aggregate-aware caches for query acceleration
Metadata
Dynamic datasets may include metadata such as owner, business description, and labels. See the metadata section for more details.
Example
Build a new dynamic dataset, and select two fields from `lineitem`: `l_linestatus` (an attribute) and `sum_qty` (a metric).
This sums up item quantities per delivery status.
The schema for it is -
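A sketch of what such a definition might look like in YAML, following the fields documented in the YAML Schema section below (the dataset name is hypothetical, and the exact layout may differ between Honeydew versions):

```yaml
# Minimal sketch: one attribute and one metric, names taken from the
# example above. The dataset name "lineitem_by_status" is illustrative.
name: lineitem_by_status
attributes:
  - lineitem.l_linestatus
metrics:
  - lineitem.sum_qty
```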
Accessing the data
Dynamic datasets can power different delivery modes:
- Snowflake - with a view, a table or a dynamic table
- SQL interface - using a JDBC connector for live data access
- BI connection - combining SQL interface with metadata specific to a BI tool such as Looker or Tableau
- Python - with the Honeydew `hdapi` Python module
- 3rd party tools, such as a dbt model - using Honeydew APIs
Filtering Order
Filters defined as part of a dynamic dataset definition will apply to its query.
For more details, see order of computation.
Parameterized Datasets
Dynamic datasets can be parameterized with Parameters.
Parameters can be specific to a dynamic dataset, or can override global parameters.
Domains
Dynamic datasets operate within the context of a domain if one is set, and the default domain otherwise.
Filters, parameters and any other domain configuration will apply to the dynamic dataset.
Materialization in Snowflake
When a dynamic dataset is delivered to Snowflake, all fields are created in the chosen Snowflake objects (a view or a table):
- Attributes as columns.
- Metrics as columns. Metrics are aggregated, and by default grouped by all the attributes (unless specified otherwise with a `GROUP BY` in the metric definition).
- All filters of the dataset apply.
Dynamically changing the level of granularity is not possible after delivery - if you need to explore the same dataset at different levels of granularity, use the SQL interface.
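As an illustration of this behavior, a delivered view for the `lineitem` example above might compile to something like the following sketch (column names come from the TPC-H sample schema; the generated SQL is an assumption, not Honeydew's actual output):

```sql
-- Sketch only: a dataset view with one attribute and one metric.
-- The metric is aggregated, grouped by all attribute columns.
CREATE OR REPLACE VIEW lineitem_by_status AS
SELECT
    l_linestatus,               -- attribute column
    SUM(l_quantity) AS sum_qty  -- metric column, grouped by the attribute
FROM lineitem
GROUP BY l_linestatus;
```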
Views
Datasets can be deployed as a Snowflake VIEW.
Honeydew will maintain the underlying query and will detect if its logic differs from the shared source of truth.
Use views to:
- Build live datasets for sharing
- Integrate with tools that can’t use a live JDBC connection
- Deliver real-time data
Parameterized Datasets can be deployed as VIEWs. Parameters can be controlled with the Snowflake `SET` SQL command.
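For example, assuming a parameterized view that reads a Snowflake session variable named `start_date` (both the parameter and view names here are hypothetical), a session might set the parameter before querying:

```sql
-- "start_date" and "my_parameterized_view" are illustrative names.
SET start_date = '1997-01-01';
SELECT * FROM my_parameterized_view;
```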
Tables
Datasets can be deployed as a Snowflake TABLE.
Honeydew will NOT automatically detect if the logic that built a TABLE has changed.
Use tables to:
- “Freeze” a dataset for ad-hoc analysis
- Support use cases where a dynamic table is impossible to use
Dynamic Tables
Datasets can be deployed as a Snowflake Dynamic Table.
Honeydew will maintain the underlying query and will detect if its logic differs from the shared source of truth.
The following configuration is possible:
- Target database and schema
- Virtual data warehouse to use
- Dynamic table lag units (seconds / minutes / hours / days)
- Dynamic table lag amount
- Whether to refresh downstream dynamic tables
Use dynamic tables for:
- Large data or complex queries that require performance
- Continuous materialization with automatic refresh
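These options map onto Snowflake's dynamic table DDL. A sketch of what such a table might look like, with illustrative names and lag (the actual DDL Honeydew generates is an assumption):

```sql
-- Illustrative only: target schema, warehouse name, and lag are assumptions.
CREATE OR REPLACE DYNAMIC TABLE analytics.datasets.lineitem_by_status
  TARGET_LAG = '60 minutes'   -- lag amount + units
  WAREHOUSE = transform_wh    -- virtual warehouse to use
AS
SELECT
    l_linestatus,
    SUM(l_quantity) AS sum_qty
FROM lineitem
GROUP BY l_linestatus;
```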
Example
A dataset called `example` that includes:
- The attributes `lineitem.l_linestatus` and `orders.o_orderdate`
- The metrics `lineitem.sum_qty` and `lineitem.revenue`
- A filter such as `lineitem.l_commitdate < date('1997-01-01')`
It allows exploration on two attributes and two metrics, over a filtered set of data.
A live SQL query against this dataset can return a table with 3 columns:
- `lineitem.l_linestatus` - unique values
- `revenue` - aggregated named metric, grouped by `lineitem.l_linestatus`
- `min_order_date` - an ad-hoc metric, grouped by `lineitem.l_linestatus`

and will filter the data with two filters:
- `"orders.o_orderdate" > date('1997-01-01')` - given in the query
- `lineitem.l_commitdate < date('1997-01-01')` - defined in the dynamic dataset
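A query of that shape, issued through the SQL interface, might look like the following sketch (the dataset is exposed as `example`; the quoting style follows the field references above, but the exact surface syntax is an assumption):

```sql
-- Sketch only: "lineitem.revenue" is resolved as a named metric, and
-- MIN("orders.o_orderdate") is the ad-hoc metric, both grouped by
-- the selected attribute. The dataset filter is applied implicitly.
SELECT
    "lineitem.l_linestatus",
    "lineitem.revenue" AS revenue,
    MIN("orders.o_orderdate") AS min_order_date
FROM example
WHERE "orders.o_orderdate" > date('1997-01-01')
GROUP BY "lineitem.l_linestatus";
```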
YAML Schema
The schema for a dynamic dataset is:
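A sketch of a full definition for the `example` dataset above, built from the fields listed below (the `owner` and `description` values are illustrative, and the exact layout may differ between versions):

```yaml
# Sketch based on the documented fields; attribute, metric, and filter
# values come from the Example section above.
name: example
owner: analytics team
description: Filtered line item revenue and quantities
attributes:
  - lineitem.l_linestatus
  - orders.o_orderdate
metrics:
  - lineitem.sum_qty
  - lineitem.revenue
filters:
  - lineitem.l_commitdate < date('1997-01-01')
use_cache: true
```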
Fields:
- `name`: Name of the dataset
- `owner`, `description`, `labels`, `folder`, `display_name`, `hidden` (all optional): Metadata
- `domain` (optional): Domain in which context to run
- `attributes` (optional): List of attribute expressions (can be named attributes or ad-hoc expressions)
- `metrics` (optional): List of metric expressions (can be named metrics or ad-hoc expressions)
- `filters` (optional): List of filter expressions (can be named boolean attributes or ad-hoc expressions)
- `parameters` (optional): List of parameters and values (override workspace values, if they exist)
- `use_cache` (optional): Whether to use caches (for entities and/or pre-aggregations). Default - yes.
- `transform_sql` (optional): An additional SQL transformation that will be applied to the dataset. A typical use case includes adding `ORDER BY` or `LIMIT` clauses.