Datasets

Intro

A Dynamic Dataset is a way to look at data using Honeydew. A typical Honeydew deployment has many thousands of pieces of logic. An analysis might need just a subset, for example, the orders of a particular customer. The simplest dataset is “all the data of an entity”. Dynamic datasets are built:

Attributes
Metrics
Filters

The components can come from fields defined in the semantic layer (attributes and metrics) or defined at ad-hoc field SQL expressions. Unlike attributes or metrics defined in Entities, ad-hoc calculations defined in a dataset is not reusable by other calculations.

Use Cases

Build static datasets for data science (i.e. to load into a Python data frame)
Share data with external tools that accept a table
Deliver semantic models into dbt
Setup aggregate aware caches for query acceleration

Metadata

Dynamic datasets may include metadata such as their owner, business description, labels. See metadata section for more details.

Example

Build a new dynamic dataset, and select from lineitem two fields: l_linestatus (an attribute) and sum_qty (a metric). The sums up item quantity per their status delivery status. The schema for it is -

attributes:
- lineitem.l_linestatus
metrics:
- lineitem.sum_qty

Interface

Accessing the data

Dynamic datasets can power different delivery modes:

Snowflake - with a view, a table or a dynamic table
BI - creating an extract based on a dynamic dataset that was deployed to Snowflake
Python - with the honeydew hdapi python module
3rd party such as dbt model - using honeydew APIs

Filtering Order

Filters defined as part of a dynamic dataset definition will apply to its query. For more details, see order of computation

Parameterized Datasets

Dynamic datasets can be parameterized with Parameters. Parameters can be specific to a dynamic dataset, or can overrule global parameters.

Domains

Dynamic datasets operate with the context of a domain if set and default domain otherwise. Filters, parameters and any other domain configuration will apply to the dynamic dataset.

Materialization in Snowflake

When a dynamic dataset is delivered to Snowflake, all fields are created in the chosen Snowflake objects (a view or a table):

Attributes as columns.
Metrics as columns. Metrics are aggregated and grouped by default by all the attributes (unless specified otherwise with a GROUP BY in the metric definition).
All filters of the dataset apply.

Dynamically changing the level of granularity is not possible after delivery - if need to explore the same dataset at different levels of granularity use the SQL interface.

Views

Datasets can be deployed as a Snowflake VIEW. Honeydew will maintain the underlying query and will detect if its logic different from the shared source of truth. Use views for:

Build live datasets for sharing
Integration with tools that can’t use live JDBC connection
Real-time data

Parameterized Datasets can be deployed as VIEWs. Parameters can be controlled with the Snowflake SET SQL command.

Tables

Datasets can be deployed as a Snowflake TABLE.

Honeydew will NOT automatically detect if the logic that built a TABLE has changed.

Use tables for:

“Freeze” a dataset for an ad-hoc analysis on it
Use cases where a dynamic table is impossible to use

Dynamic Tables

Datasets can be deployed as a Snowflake Dynamic Table. Honeydew will maintain the underlying query and will detect if its logic different from the shared source of truth. Following configuration is possible:

Target database and schema
Virtual data warehouse to use
Dynamic table lag units (seconds / minutes / hours / days)
Dynamic table lag amount
Should it refresh downstream dynamic tables
Dynamic table refresh mode (AUTO / FULL / INCREMENTAL). Default is FULL, if not provided.
Dynamic table initialize mode (ON_CREATE / ON_SCHEDULE). Default is ON_CREATE, if not provided.
Dynamic table clustering expressions (optional)

You can find more information about Dynamic Tables creation parameters here.

Use dynamic tables for:

Large data or complex queries that require performance
Continuous materialization with automatic refresh

Example

A dataset called example that include:

The attributes lineitem.l_linestatus and orders.o_orderdate
The metrics lineitem.sum_qty and lineitem.revenue
A filter such as lineitem.l_commitdate < date('1997-01-01')

It allows exploration on two attributes and two metrics, over a filtered set of data. The following live SQL query will:

SELECT
    "lineitem.l_linestatus",
    AGG("lineitem.revenue") as revenue,
    MIN("orders.o_orderdate") as min_order_date
WHERE
    "orders.o_orderdate" > date('1997-01-01')
FROM dynamic_datasets.example
GROUP BY 1

return a table with 3 columns:

lineitem.l_linestatus unique values
revenue aggregated named metric grouped by lineitem.l_linestatus
min_order_date ad-hoc metric grouped by lineitem.l_linestatus

and will filter the data with two filters:

"orders.o_orderdate" > date('1997-01-01') given in the query
lineitem.l_commitdate < date('1997-01-01') defined in the dynamic dataset

YAML Schema

The schema for a dynamic dataset is:

type: perspective
name: [dataset name]
owner: [owner]
domain: [domain name]
description: [description]
labels: [label list]
folder: [folder]
display_name: [display name]
hidden: [yes/no/true/false]
use_cache: [yes/no/true/false]
attributes:
  - [attribute expression]
  - ...
metrics:
  - [metric expression]
  - ...
filters:
  - [filter expression]
  - ...
parameters:
  - name: [parameter name]
    value: [parameter value]
  - ...
transform_sql: [SQL expression]
delivery:
  - snowflake:
    target: [view/table/dynamic_table]
    name: [target table/view name]
    schema: [target schema name (optional)]
    database: [target database name (optional)]
    warehouse: [name of the warehouse to use for materialization (optional)]
    dynamic_table_settings: [used if dynamic_table target is selected]
      - lag: [optional]
        - num: [integer]
        - units: [seconds/minutes/hours/days]
      - downstream: [yes/no/true/false]
      - refresh_mode: [AUTO/FULL/INCREMENTAL]
      - initialize: [ON_CREATE/ON_SCHEDULE]
      - cluster_by: [comma-separated clustering keys (optional)]
    enabled: [yes/no/true/false]

Fields:

name: Name of the dataset
owner, description, labels, folder, display_name, hidden (all optional): Metadata
domain (optional): Domain in which context to run
attributes (optional): List of attribute expressions (can be named attributes or ad-hoc expressions)
metrics (optional): List of metric expressions (can be named metrics or ad-hoc expressions)
filters (optional): List of filter expressions (can be named boolean attributes or ad-hoc expressions)
parameters (optional): List of parameters and values (override workspace values, if exist)
use_cache (optional): Whether to use caches (for entities and/or pre-aggregations). Default - yes.
transform_sql (optional): An additional SQL transformation that will be applied to the dataset. A typical use case includes adding ORDER BY or LIMIT clauses.

Quick Start

Migration Guides

Reference

Integrations

Security

Release Notes

Intro

Use Cases

Metadata

Example

Interface

Accessing the data

Filtering Order

Parameterized Datasets

Domains

Materialization in Snowflake

Views

Tables

Dynamic Tables

Example

YAML Schema

Quick Start

Migration Guides

Reference

Integrations

Security

Release Notes

​Intro

​Use Cases

​Metadata

​Example

​Interface

​Accessing the data

​Filtering Order

​Parameterized Datasets

​Domains

​Materialization in Snowflake

​Views

​Tables

​Dynamic Tables

​Example

​YAML Schema

Intro

Use Cases

Metadata

Example

Interface

Accessing the data

Filtering Order

Parameterized Datasets

Domains

Materialization in Snowflake

Views

Tables

Dynamic Tables

Example

YAML Schema