Intro
A Dynamic Dataset is a way to look at data using Honeydew. A typical Honeydew deployment has many thousands of pieces of logic. An analysis might need just a subset, for example, the orders of a particular customer. The simplest dataset is “all the data of an entity”. Dynamic datasets are built from:
- Attributes
- Metrics
- Filters
Use Cases
- Build static datasets for data science (e.g., to load into a Python data frame)
- Share data with external tools that accept a table
- Deliver semantic models into dbt
- Set up aggregate-aware caches for query acceleration
Metadata
Dynamic datasets may include metadata such as owner, business description, and labels. See the metadata section for more details.
Example
Build a new dynamic dataset, and select two fields from lineitem: l_linestatus (an attribute) and sum_qty (a metric).
This sums up line item quantity per delivery status.
The schema for it is:
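As a hedged sketch, a definition like this could be expressed in YAML following the schema described later on this page (the dataset name and exact field syntax here are assumptions):

```yaml
name: lineitem_status_qty        # hypothetical dataset name
attributes:
  - lineitem.l_linestatus        # attribute: one row per delivery status
metrics:
  - lineitem.sum_qty             # metric: quantity summed per status
```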
Interface
Accessing the data
Dynamic datasets can power different delivery modes:
- Snowflake - with a view, a table, or a dynamic table
- BI - creating an extract based on a dynamic dataset that was deployed to Snowflake
- Python - with the Honeydew hdapi Python module
- 3rd party such as a dbt model - using Honeydew APIs
Filtering Order
Filters defined as part of a dynamic dataset definition will apply to its query. For more details, see order of computation.
Parameterized Datasets
Dynamic datasets can be parameterized with Parameters. Parameters can be specific to a dynamic dataset, or can override global parameters.
Domains
Dynamic datasets operate within the context of a domain if one is set, and within the default domain otherwise. Filters, parameters, and any other domain configuration will apply to the dynamic dataset.
Materialization in Snowflake
When a dynamic dataset is delivered to Snowflake, all fields are created in the chosen Snowflake object (a view or a table):
- Attributes as columns.
- Metrics as columns. Metrics are aggregated, grouped by default by all the attributes (unless specified otherwise with a GROUP BY in the metric definition).
- All filters of the dataset apply.
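As a hedged illustration of these rules (the table, columns, and filter below are illustrative, not taken from a real model), the query behind a dataset with one attribute, one metric, and one filter would look roughly like:

```sql
-- Hypothetical shape of the SQL generated for a dataset with
-- attribute l_linestatus, metric sum_qty, and a date filter.
SELECT
    l_linestatus,                          -- attribute: plain column
    SUM(l_quantity) AS sum_qty             -- metric: aggregated column
FROM lineitem
WHERE l_commitdate < DATE '1997-01-01'     -- dataset filter applies
GROUP BY l_linestatus;                     -- grouped by all attributes
```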
Views
Datasets can be deployed as a Snowflake VIEW. Honeydew will maintain the underlying query and will detect if its logic differs from the shared source of truth. Use views for:
- Building live datasets for sharing
- Integration with tools that can’t use a live JDBC connection
- Real-time data
Parameterized Datasets can be deployed as VIEWs. Parameters can be controlled with the Snowflake SET SQL command.
Tables
Datasets can be deployed as a Snowflake TABLE. Honeydew will NOT automatically detect if the logic that built a TABLE has changed. Use tables to:
- “Freeze” a dataset for an ad-hoc analysis on it
- Cover use cases where a dynamic table is impossible to use
Dynamic Tables
Datasets can be deployed as a Snowflake Dynamic Table. Honeydew will maintain the underlying query and will detect if its logic differs from the shared source of truth. The following configuration is possible:
- Target database and schema
- Virtual data warehouse to use
- Dynamic table lag units (seconds / minutes / hours / days)
- Dynamic table lag amount
- Whether to refresh downstream dynamic tables
- Dynamic table refresh mode (AUTO / FULL / INCREMENTAL). Default is FULL, if not provided.
- Dynamic table initialize mode (ON_CREATE / ON_SCHEDULE). Default is ON_CREATE, if not provided.
- Dynamic table clustering expressions (optional)
Use dynamic tables for:
- Large data or complex queries that require performance
- Continuous materialization with automatic refresh
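These configuration options map onto Snowflake’s dynamic table DDL. A hedged sketch of what a deployment might produce (the database, schema, warehouse, and query below are hypothetical):

```sql
CREATE OR REPLACE DYNAMIC TABLE analytics.datasets.example
    TARGET_LAG   = '30 minutes'   -- lag amount + lag units
    WAREHOUSE    = transform_wh   -- virtual data warehouse to use
    REFRESH_MODE = FULL           -- AUTO / FULL / INCREMENTAL
    INITIALIZE   = ON_CREATE      -- ON_CREATE / ON_SCHEDULE
    CLUSTER BY (l_linestatus)     -- optional clustering expressions
AS
SELECT l_linestatus, SUM(l_quantity) AS sum_qty
FROM lineitem
GROUP BY l_linestatus;            -- query maintained by Honeydew
```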
Example
A dataset called example that includes:
- The attributes lineitem.l_linestatus and orders.o_orderdate
- The metrics lineitem.sum_qty and lineitem.revenue
- A filter such as lineitem.l_commitdate < date('1997-01-01')
When this dataset is queried, its fields and filters behave as follows:
- lineitem.l_linestatus - unique values
- revenue - aggregated named metric grouped by lineitem.l_linestatus
- min_order_date - ad-hoc metric grouped by lineitem.l_linestatus
- "orders.o_orderdate" > date('1997-01-01') - given in the query
- lineitem.l_commitdate < date('1997-01-01') - defined in the dynamic dataset
YAML Schema
Each dynamic dataset is defined by a YAML file in Git, which also tracks and preserves the full history of every change. The schema for a dynamic dataset is:
- name: Name of the dataset
- owner, description, labels, folder, display_name, hidden (all optional): Metadata
- domain (optional): Domain in which context to run
- attributes (optional): List of attribute expressions (can be named attributes or ad-hoc expressions)
- metrics (optional): List of metric expressions (can be named metrics or ad-hoc expressions)
- filters (optional): List of filter expressions (can be named boolean attributes or ad-hoc expressions)
- parameters (optional): List of parameters and values (override workspace values, if they exist)
- use_cache (optional): Whether to use caches (for entities and/or pre-aggregations). Default - yes.
- transform_sql (optional): An additional SQL transformation that will be applied to the dataset. A typical use case includes adding ORDER BY or LIMIT clauses.
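Putting the schema together, a hedged sketch of a complete dynamic dataset YAML file, using the example dataset from above (the metadata values are illustrative, and the exact expression syntax inside each list is an assumption):

```yaml
name: example
owner: data-team                  # optional metadata
description: Line item quantity and revenue by delivery status
labels: [demo]
folder: datasets
domain: default                   # optional: domain context to run in
attributes:
  - lineitem.l_linestatus
  - orders.o_orderdate
metrics:
  - lineitem.sum_qty
  - lineitem.revenue
filters:
  - lineitem.l_commitdate < date('1997-01-01')
use_cache: true                   # default: caches are used
```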