Intro

A Dynamic Dataset is a way to look at data with Honeydew. A typical Honeydew deployment contains many thousands of pieces of logic, while an analysis might need just a subset - for example, the orders of a particular customer.

The simplest dataset is “all the data of an entity”.

Dynamic datasets are built from three components:

  1. Attributes
  2. Metrics
  3. Filters

These components can come from fields defined in the semantic layer (attributes and metrics) or from ad-hoc SQL field expressions.

Unlike attributes or metrics defined in Entities, ad-hoc calculations defined in a dataset are not reusable by other calculations.
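For example, a minimal sketch of a dataset definition that mixes a named attribute with an ad-hoc expression (the YEAR(...) expression below is illustrative, not a field defined in the semantic layer):

attributes:
- lineitem.l_linestatus
- YEAR(orders.o_orderdate)
metrics:
- lineitem.sum_qty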

Use Cases

  1. Build static datasets for data science (e.g., to load into a Python data frame)
  2. Share data with external tools that accept a table
  3. Deliver semantic models into dbt
  4. Set up aggregate-aware caches for query acceleration

Metadata

Dynamic datasets may include metadata such as owner, business description, and labels. See the metadata section for more details.

Example

Build a new dynamic dataset, and select two fields from lineitem: l_linestatus (an attribute) and sum_qty (a metric). This sums up item quantity per delivery status.

The schema for it is:

attributes:
- lineitem.l_linestatus
metrics:
- lineitem.sum_qty
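
Once delivered (for example, as a Snowflake view), the dataset behaves like the following aggregation. This is a sketch only - the underlying source table and the definition of sum_qty as SUM(l_quantity) are assumptions based on the TPC-H schema:

SELECT
    l_linestatus,
    SUM(l_quantity) AS sum_qty
FROM lineitem
GROUP BY l_linestatus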

Interface

Accessing the data

Dynamic datasets can power different delivery modes:

  • Snowflake - with a view, a table or a dynamic table
  • SQL interface - using a JDBC connector for live data access
  • BI connection - combining SQL interface with metadata specific to a BI tool such as Looker or Tableau
  • Python - with the Honeydew hdapi Python module
  • 3rd party tools such as dbt models - using Honeydew APIs

Filtering Order

Filters defined as part of a dynamic dataset definition will apply to every query on it.

For more details, see order of computation.

Parameterized Datasets

Dynamic datasets can be parameterized with Parameters.

Parameters can be specific to a dynamic dataset, or can override global parameters.

Domains

Dynamic datasets operate within the context of a domain if one is set, and the default domain otherwise.

Filters, parameters and any other domain configuration will apply to the dynamic dataset.

Materialization in Snowflake

When a dynamic dataset is delivered to Snowflake, all fields are created in the chosen Snowflake object (a view, a table, or a dynamic table):

  • Attributes as columns.
  • Metrics as columns. By default, metrics are aggregated and grouped by all the attributes (unless specified otherwise with a GROUP BY in the metric definition).
  • All filters of the dataset apply.

Dynamically changing the level of granularity is not possible after delivery - if you need to explore the same dataset at different levels of granularity, use the SQL interface.

Views

Datasets can be deployed as a Snowflake VIEW.

Honeydew will maintain the underlying query and will detect if its logic differs from the shared source of truth.

Use views for:

  • Building live datasets for sharing
  • Integrating with tools that can’t use a live JDBC connection
  • Real-time data

Parameterized Datasets can be deployed as VIEWs. Parameters can be controlled with the Snowflake SET SQL command.
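
For example, a minimal sketch, assuming a dataset parameter named start_date and a view delivered as dynamic_datasets.example_view (both names are hypothetical):

-- Set the session variable the parameterized view reads
SET start_date = '1997-01-01';

-- Query the view; it is evaluated with the parameter value set above
SELECT * FROM dynamic_datasets.example_view;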

Tables

Datasets can be deployed as a Snowflake TABLE.

Honeydew will NOT automatically detect if the logic that built a TABLE has changed.

Use tables for:

  • “Freezing” a dataset for ad-hoc analysis
  • Use cases where a dynamic table cannot be used

Dynamic Tables

Datasets can be deployed as a Snowflake Dynamic Table.

Honeydew will maintain the underlying query and will detect if its logic differs from the shared source of truth.

The following configuration is possible (an illustrative DDL sketch follows the list):

  • Target database and schema
  • Virtual data warehouse to use
  • Dynamic table lag units (seconds / minutes / hours / days)
  • Dynamic table lag amount
  • Should it refresh downstream dynamic tables
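
These settings map to the options of Snowflake dynamic table DDL. A sketch of the kind of object this produces - all names, values, and the query body are illustrative, not necessarily what Honeydew generates:

CREATE OR REPLACE DYNAMIC TABLE analytics.datasets.example
  TARGET_LAG = '15 minutes'  -- lag amount and units; DOWNSTREAM is also valid
  WAREHOUSE = transform_wh   -- virtual data warehouse to use for refreshes
AS
SELECT l_linestatus, SUM(l_quantity) AS sum_qty
FROM lineitem
GROUP BY l_linestatus;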

Use dynamic tables for:

  • Large data volumes or complex queries that require performance
  • Continuous materialization with automatic refresh

Example

A dataset called example that includes:

  • The attributes lineitem.l_linestatus and orders.o_orderdate
  • The metrics lineitem.sum_qty and lineitem.revenue
  • A filter such as lineitem.l_commitdate < date('1997-01-01')

It allows exploration on two attributes and two metrics, over a filtered set of data.

Consider the following live SQL query:

SELECT
    "lineitem.l_linestatus",
    AGG("lineitem.revenue") as revenue,
    MIN("orders.o_orderdate") as min_order_date
FROM dynamic_datasets.example
WHERE
    "orders.o_orderdate" > date('1997-01-01')
GROUP BY 1

It will return a table with 3 columns:

  • The unique values of lineitem.l_linestatus
  • revenue - the aggregated named metric, grouped by lineitem.l_linestatus
  • min_order_date - an ad-hoc metric, grouped by lineitem.l_linestatus

and will filter the data with two filters:

  • "orders.o_orderdate" > date('1997-01-01') given in the query
  • lineitem.l_commitdate < date('1997-01-01') defined in the dynamic dataset

YAML Schema

The schema for a dynamic dataset is:

type: perspective
name: [dataset name]
owner: [owner]
domain: [domain name]
description: [description]
labels: [label list]
folder: [folder]
display_name: [display name]
hidden: [yes/no/true/false]
use_cache: [yes/no/true/false]
attributes:
  - [attribute expression]
  - ...
metrics:
  - [metric expression]
  - ...
filters:
  - [filter expression]
  - ...
parameters:
  - name: [parameter name]
    value: [parameter value]
  - ...
transform_sql: [SQL expression]

Fields:

  • name: Name of the dataset
  • owner, description, labels, folder, display_name, hidden (all optional): Metadata
  • domain (optional): Domain in which context to run
  • attributes (optional): List of attribute expressions (can be named attributes or ad-hoc expressions)
  • metrics (optional): List of metric expressions (can be named metrics or ad-hoc expressions)
  • filters (optional): List of filter expressions (can be named boolean attributes or ad-hoc expressions)
  • parameters (optional): List of parameters and values (override workspace values, if they exist)
  • use_cache (optional): Whether to use caches (for entities and/or pre-aggregations). Default - yes.
  • transform_sql (optional): An additional SQL transformation that is applied to the dataset. Typical use cases include adding ORDER BY or LIMIT clauses.
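
Putting it together, a minimal sketch of a concrete dataset definition (the parameter and the transform_sql line are illustrative; in particular, how the dataset is referenced inside transform_sql is an assumption here):

type: perspective
name: example
description: Line item quantities by delivery status
attributes:
- lineitem.l_linestatus
metrics:
- lineitem.sum_qty
filters:
- lineitem.l_commitdate < date('1997-01-01')
parameters:
- name: start_date
  value: '1997-01-01'
transform_sql: SELECT * FROM dataset ORDER BY 1 LIMIT 100  # table alias is an assumption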