> ## Documentation Index
> Fetch the complete documentation index at: https://honeydew.ai/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Datasets

> How to build datasets based on shared semantics

<video className="w-full aspect-video rounded-xl" controls preload="none" poster="/docs/videos/Datasets-poster.png" width="1280" height="720">
  <source src="https://mintcdn.com/honeydew/HFebY7suyOhLXdiF/videos/Datasets.webm?fit=max&auto=format&n=HFebY7suyOhLXdiF&q=85&s=c1853ef619d826d1ada534cbd48e01d4" type="video/webm" data-path="videos/Datasets.webm" />
</video>

## Intro

A Dynamic Dataset is a way to look at data using Honeydew. A typical Honeydew deployment has many thousands of pieces of logic. An analysis might need just a subset, for example, the orders of a particular customer.

The simplest dataset is “all the data of an entity”.

Dynamic datasets are built:

1. Attributes
2. Metrics
3. Filters

The components can come from fields defined in the semantic layer (attributes and metrics) or defined at ad-hoc field SQL expressions.

Unlike attributes or metrics defined in Entities, ad-hoc calculations defined in a dataset is not reusable by other calculations.

### Use Cases

1. Build static datasets for data science (i.e. to load into a Python data frame)
2. Share data with external tools that accept a table
3. Deliver semantic models into [dbt](/integration/etl-tools/dbt)
4. Setup [aggregate aware caches](/performance/aggregate-awareness) for query acceleration

### Metadata

Dynamic datasets may include metadata such as their owner, business description, labels. See [metadata](/governance/metadata) section for more details.

### Example

Build a new dynamic dataset, and select from `lineitem` two fields: `l_linestatus` (an attribute) and `sum_qty` (a metric).
The sums up item quantity per their status delivery status.

The schema for it is -

```yaml theme={null}
attributes:
- lineitem.l_linestatus
metrics:
- lineitem.sum_qty
```

## Interface

### Accessing the data

Dynamic datasets can power different delivery modes:

* Snowflake - with a view, a table, a dynamic table or an interactive table
* Databricks - with a view or a table
* BI - creating an extract based on a dynamic dataset that was deployed
  to Snowflake or Databricks
* Python - with the honeydew `hdapi` python module
* 3rd party such as dbt model - using honeydew APIs

### Filtering Order

Filters defined as part of a dynamic dataset definition will apply to its query.

For more details, see [order of computation](/advanced-modeling/order-of-computation)

### Parameterized Datasets

Dynamic datasets can be parameterized with [Parameters](/parameters).

Parameters can be specific to a dynamic dataset, or can overrule global parameters.

### Domains

Dynamic datasets operate with the context of a [domain](/domains) if set and default domain otherwise.

Filters, parameters and any other domain configuration will apply to the dynamic dataset.

## Materialization

When a dynamic dataset is delivered to a data warehouse, all fields are created in the chosen objects (a view or a table):

* Attributes as columns.
* Metrics as columns. Metrics are aggregated and grouped by default by all the attributes (unless specified otherwise with a `GROUP BY` in the metric definition).
* All filters of the dataset apply.

Dynamically changing the level of granularity is not possible after delivery - if you need
to explore the same dataset at different levels of granularity use the SQL interface.

### Snowflake

#### Views

Datasets can be deployed as a Snowflake VIEW.

Honeydew will maintain the underlying query and will detect if its logic is different from the shared source of truth.

Use views for:

* Build live datasets for sharing
* Integration with tools that can't use live JDBC connection
* Real-time data

<Tip>
  Parameterized Datasets can be deployed as VIEWs. Parameters can be controlled
  with the Snowflake `SET` SQL command.
</Tip>

#### Tables

Datasets can be deployed as a Snowflake TABLE.

<Warning>
  Honeydew will NOT automatically detect if the logic that built a TABLE has changed
</Warning>

Use tables for:

* "Freeze" a dataset for an ad-hoc analysis on it
* Use cases where a dynamic table is impossible to use

Following configuration is possible:

* Target database and schema
* Virtual data warehouse to use
* Should it be a transient table (the default is to create transient tables)
* Table clustering expressions (optional)

You can find more information about Tables creation parameters
[here](https://docs.snowflake.com/en/sql-reference/sql/create-table).

#### Dynamic Tables

Datasets can be deployed as a Snowflake [Dynamic Table](https://docs.snowflake.com/en/user-guide/dynamic-tables-about).

Honeydew will maintain the underlying query and will detect if its logic different from the shared source of truth.

Following configuration is possible:

* Target database and schema
* Virtual data warehouse to use
* Dynamic table lag amount and units (seconds / minutes / hours / days)
* Should it refresh downstream dynamic tables
* Dynamic table refresh mode (`AUTO` / `FULL` / `INCREMENTAL`). Default is `FULL`, if not provided.
* Dynamic table initialize mode (`ON_CREATE` / `ON_SCHEDULE`). Default is `ON_CREATE`, if not provided.
* Dynamic table clustering expressions (optional)

You can find more information about Dynamic Tables creation parameters
[here](https://docs.snowflake.com/en/sql-reference/sql/create-dynamic-table).

<Tip>
  Use dynamic tables for:

  * Large data or complex queries that require performance
  * Continuous materialization with automatic refresh
</Tip>

#### Interactive Tables

Datasets can be deployed as a Snowflake [Interactive Table](https://docs.snowflake.com/en/user-guide/interactive).

<Note>
  Snowflake Interactive Tables are currently in preview and are not generally available yet.
  Please refer to [Snowflake documentation](https://docs.snowflake.com/en/user-guide/interactive) for more details.
</Note>

Honeydew will maintain the underlying query and will detect if its logic different from the shared source of truth.

Following configuration is possible:

* Target database and schema
* Virtual data warehouse to use
* Interactive table clustering expressions (required)
* Interactive table lag amount and units (seconds / minutes / hours / days) (optional)

You can find more information about Interactive Tables creation parameters
[here](https://docs.snowflake.com/en/sql-reference/sql/create-interactive-table).

<Tip>
  Use interactive tables for dynamic datasets that need to be optimized for low-latency, interactive queries
</Tip>

### Databricks

#### Views

Datasets can be deployed as a Databricks VIEW.

Honeydew will maintain the underlying query and will detect
if its logic is different from the shared source of truth.

Use views for:

* Build live datasets for sharing
* Integration with tools that can't use a live JDBC connection
* Real-time data

#### Tables

Datasets can be deployed as a Databricks TABLE.

<Warning>
  Honeydew will NOT automatically detect if the logic that built a TABLE has changed
</Warning>

Use tables for:

* "Freeze" a dataset for an ad-hoc analysis on it

Following configuration is possible:

* Target catalog and schema
* HTTP path for cluster or warehouse connection
* Table clustering expressions (optional)

## YAML Schema

Each dynamic dataset is defined by a YAML file in Git, which also tracks and preserves the full history of every change.

The schema for a dynamic dataset is:

```yaml theme={null}
type: perspective
name: <dataset name>
owner: <owner>
domain: <domain name>
description: <description>
labels: [label1, label2, ...]
tags:
  - key: <key>
    value: <value>
    source: <snowflake/databricks/bigquery/honeydew/other>
  - ...
folder: <folder>
display_name: <display name>
hidden: <yes/no/true/false>
use_cache: <yes/no/true/false>
attributes:
  - <attribute expression>
  - ...
metrics:
  - <metric expression>
  - ...
filters:
  - <filter expression>
  - ...
parameters:
  - name: <parameter name>
    value: <parameter value>
  - ...
transform_sql: <SQL expression>
group_by_all: <yes/no/true/false> # group by all attributes to
  deduplicate the data, when no metrics are provided
delivery:
  - snowflake:
    target: <view/table/dynamic_table/interactive_table>
    name: <target table/view name>
    schema: <target schema name (optional)>
    database: <target database name (optional)>
    warehouse: <name of the warehouse to use for materialization (optional)>
    table_settings: <used if table target is selected>
      - transient: <yes/no/true/false (optional - default is true)>
      - cluster_by: <comma-separated clustering keys (optional)>
    dynamic_table_settings: <used if dynamic_table target is selected>
      - lag: <optional>
        - num: <integer>
        - units: <seconds/minutes/hours/days>
      - downstream: <yes/no/true/false>
      - refresh_mode: <AUTO/FULL/INCREMENTAL>
      - initialize: <ON_CREATE/ON_SCHEDULE>
      - cluster_by: <comma-separated clustering keys (optional)>
    interactive_table_settings: <used if interactive_table target is selected>
      - lag: <optional>
        - num: <integer>
        - units: <seconds/minutes/hours/days>
      - cluster_by: <comma-separated clustering keys (optional)>
    enabled: <yes/no/true/false>
  - databricks:
    target: <view/table>
    name: <target table/view name>
    schema: <target schema name (optional)>
    catalog: <target catalog name (optional)>
    http_path: <HTTP path for cluster/warehouse (optional)>
    table_settings: <used if table target is selected>
      - cluster_by: <comma-separated clustering keys (optional)>
    enabled: <yes/no/true/false>
```

Fields:

* `name`: Name of the dataset
* `owner`, `description`, `labels`, `tags`, `folder`, `display_name`, `hidden` (all optional): Metadata
* `domain` (optional): [Domain](/domains) in which context to run
* `attributes` (optional): List of [attribute](/calculations/attributes) expressions (can be named attributes or ad-hoc expressions)
* `metrics` (optional): List of [metric](/calculations/metrics) expressions (can be named metrics or ad-hoc expressions)
* `filters` (optional): List of filter expressions (can be named boolean attributes or ad-hoc expressions)
* `parameters` (optional): List of [parameters](/parameters) and values (override workspace values, if exist)
* `use_cache` (optional): Whether to use caches (for [entities](/performance/entity-caching) and/or
  [pre-aggregations](/performance/aggregate-awareness)). Default - yes.
* `transform_sql` (optional): An additional SQL transformation that will be applied to the dataset.
  A typical use case includes adding `ORDER BY` or `LIMIT` clauses.
