Datasets
Intro
A dynamic dataset is a way to look at data through Honeydew. A typical Honeydew deployment contains many thousands of pieces of logic, while an analysis usually needs only a subset, for example the orders of a particular customer.
The simplest dataset is “all the data of an entity”.
Dynamic datasets are built from:
- Attributes
- Metrics
- Filters
These components can come from fields defined in the semantic layer (attributes and metrics) or from ad-hoc SQL expressions defined in the dataset.
Unlike attributes or metrics defined in Entities, ad-hoc calculations defined in a dataset are not reusable by other calculations.
Metadata
Dynamic datasets may include metadata such as owner, business description, and labels. See the metadata section for more details.
Git Schema
Every dynamic dataset is backed by a text file in git that defines it and keeps a history of every change. See the schema section for more details on the underlying representation.
Example
Build a new dynamic dataset and select two fields from lineitem: l_linestatus (an attribute) and sum_qty (a metric).
This sums up item quantity per delivery status.
The schema for it is:
attributes:
- lineitem.l_linestatus
metrics:
- lineitem.sum_qty
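Such a dataset can then be queried, for example through the live SQL interface described below. A minimal sketch, assuming the dataset is saved as qty_by_status (a hypothetical name; the dynamic_datasets schema and AGG follow the SQL interface conventions shown later on this page):

    -- A sketch of querying the example dataset through the live SQL interface.
    -- qty_by_status is a hypothetical dataset name.
    SELECT
        "lineitem.l_linestatus",
        AGG("lineitem.sum_qty") AS sum_qty
    FROM dynamic_datasets.qty_by_status
    GROUP BY 1;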
Interface
Accessing the data
Dynamic datasets can power different delivery modes:
- Snowflake - with a view, a table or a dynamic table
- SQL interface - using a JDBC connector for live data access
- BI connection - combining SQL interface with metadata specific to a BI tool such as Looker or Tableau
- Python - with the honeydew hdapi Python module
- 3rd party tools such as dbt models - using Honeydew APIs
Filtering Order
Filters defined as part of a dynamic dataset definition apply to every query.
Other filters (such as those defined in a filtered metric) apply as well with the following order of evaluation:
- Source table filters - defined as part of entity source table
- Filters of filtered metrics used in calculated attributes
- Dynamic Datasets filters
- Filters that come from a live SQL interface
- Filters of filtered metrics referenced in the dynamic dataset
Parameterized Datasets
Dynamic datasets can be parameterized with Parameters.
Parameters can be specific to a dynamic dataset, or can override global parameters.
Materialization in Snowflake
When a dynamic dataset is delivered to Snowflake, all fields are created in the chosen Snowflake object (a view, a table, or a dynamic table):
- Attributes as columns.
- Metrics as columns. Metrics are aggregated and grouped by all the attributes by default (unless specified otherwise with a GROUP BY in the metric definition).
- All filters of the dataset apply.
Dynamically changing the level of granularity is not possible after delivery. To explore the same dataset at different levels of granularity, use the SQL interface.
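For the example dataset above, the delivered object would hold data roughly equivalent to the following query. A sketch only: the actual SQL is generated by Honeydew, and the lineitem source columns are assumed from TPC-H:

    -- Roughly what the delivered view or table contains (a sketch):
    -- the metric is aggregated and grouped by all attributes of the dataset.
    SELECT
        l_linestatus,
        SUM(l_quantity) AS sum_qty
    FROM lineitem
    GROUP BY
        l_linestatus;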
Views
Datasets can be deployed as a Snowflake VIEW.
Honeydew will maintain the underlying query and will detect if its logic differs from the shared source of truth.
Use views for:
- Building live datasets for sharing
- Integrating with tools that can’t use a live JDBC connection
- Real-time data
Parameterized datasets can be deployed as VIEWs. Parameters can be controlled with the Snowflake SET SQL command.
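For instance, a consumer could set a parameter for their session before selecting from the view. A minimal sketch, assuming a parameter surfaced as a session variable named start_date and a view named analytics.orders_view (both hypothetical):

    -- Hypothetical parameter and view names; SET is a standard Snowflake command.
    SET start_date = '1997-01-01';
    SELECT * FROM analytics.orders_view;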
Tables
Datasets can be deployed as a Snowflake TABLE.
Honeydew will NOT automatically detect if the logic that built a TABLE has changed.
Use tables for:
- “Freezing” a dataset for ad-hoc analysis
- Use cases where a dynamic table cannot be used
Dynamic Tables
Datasets can be deployed as a Snowflake Dynamic Table.
Honeydew will maintain the underlying query and will detect if its logic differs from the shared source of truth.
The following configuration is possible:
- Target database and schema
- Virtual data warehouse to use
- Dynamic table lag units (seconds / minutes / hours / days)
- Dynamic table lag amount
- Whether to refresh downstream dynamic tables
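These options map onto standard Snowflake dynamic table properties. As a sketch (with assumed names, lag, and query; the actual DDL is generated by Honeydew), the resulting object resembles:

    -- A sketch of the kind of dynamic table such a configuration produces.
    -- Database, schema, warehouse, and lag values are assumed.
    CREATE OR REPLACE DYNAMIC TABLE analytics.datasets.qty_by_status
        TARGET_LAG = '1 hour'
        WAREHOUSE = transform_wh
        AS
        SELECT
            l_linestatus,
            SUM(l_quantity) AS sum_qty
        FROM lineitem
        GROUP BY l_linestatus;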
Use dynamic tables for:
- Large data or complex queries that require performance
- Continuous materialization with automatic refresh
Live access with a SQL interface
A dynamic dataset is accessible through a JDBC connection for interactive exploration.
- Attributes in the dataset are accessible as SQL columns.
- Metrics in the dataset are accessible with the AGG command.
- All filters of the dataset apply.
In addition, a live SQL query may add its own filters and ad-hoc calculations on top of the dynamic dataset.
When a dataset is accessed through the SQL interface, it behaves according to the user query:
- Group by the attributes selected in GROUP BY, as long as they use fields defined in the dataset.
- Aggregate by the given metrics (with AGG) or aggregation functions (SUM, COUNT, etc.).
- Filter by the given WHERE and HAVING conditions.
In all cases, the expressions must use fields defined in the dataset.
Dynamic datasets used in the live SQL interface can only include named fields (attributes and metrics) coming from the semantic layer.
Ad-hoc attributes or metrics defined in the dynamic dataset are not supported through the SQL interface.
If ad-hoc attributes or metrics are needed, add them in the live SQL query.
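For example, a live query can layer an ad-hoc expression over dataset fields. A sketch against the example dataset defined in the next section:

    -- YEAR over a dataset attribute is an ad-hoc attribute added in the query;
    -- the dataset itself only carries named fields from the semantic layer.
    SELECT
        YEAR("orders.o_orderdate") AS order_year,
        AGG("lineitem.revenue") AS revenue
    FROM dynamic_datasets.example
    GROUP BY 1;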
Example
A dataset called example that includes:
- The attributes lineitem.l_linestatus and orders.o_orderdate
- The metrics lineitem.sum_qty and lineitem.revenue
- A filter such as lineitem.l_commitdate < date('1997-01-01')
It allows exploration of two attributes and two metrics over a filtered set of data.
The following live SQL query:
SELECT
    "lineitem.l_linestatus",
    AGG("lineitem.revenue") as revenue,
    MIN("orders.o_orderdate") as min_order_date
FROM dynamic_datasets.example
WHERE
    "orders.o_orderdate" > date('1997-01-01')
GROUP BY 1
will return a table with 3 columns:
- lineitem.l_linestatus - unique values
- revenue - the aggregated named metric, grouped by lineitem.l_linestatus
- min_order_date - an ad-hoc metric, grouped by lineitem.l_linestatus
and will filter the data with two filters:
"orders.o_orderdate" > date('1997-01-01')
given in the querylineitem.l_commitdate < date('1997-01-01')
defined in the dynamic dataset