Introduction

Many analytic workloads require aggregations or joins on large datasets (such as fact tables), which can be slow or expensive at scale.

Honeydew can improve the performance of aggregate queries of any kind by leveraging Honeydew-managed pre-aggregations in Snowflake.

The aggregated cache is defined as dynamic datasets in Honeydew that are stored as tables in your Snowflake account.

Aggregate-aware caching can improve performance by orders of magnitude on large datasets.

Aggregate caching is built on dynamic datasets that are set up to act as a cache.

Those dynamic datasets typically include:

  1. Metrics that require acceleration (if users commonly ask about daily revenue, it can be included in a per-day aggregate).
  2. Groups needed to accelerate user queries (if users commonly ask for metrics by day/month/quarter, per-day results can be pre-computed).
  3. A domain that matches user queries (as different domains might have different configurations).

Once a dynamic dataset is configured to be used for pre-aggregate caching, deploy it to Snowflake via Honeydew UI or with your ETL/ELT orchestrator.

One-time refreshes can be done from Honeydew UI.

Periodic refreshes of pre-aggregated caches require an orchestration framework, such as dbt, Airflow or scheduled ETL queries.

Honeydew checks if the dataset table exists in Snowflake to use it for acceleration.

A dynamic dataset that is configured for pre-aggregate caching but is not deployed to Snowflake will not be used.

How It Works

Honeydew automatically matches a user query with a pre-aggregated dataset that can accelerate it.

Honeydew verifies all semantic definitions (groups, filters in use, join paths in use) before deciding that a pre-aggregation can serve the user query.

Matching Groups

A pre-aggregated dataset will match a user query only when its groups can serve the groups of the user query.

Roll Up Matching

Metrics based on additive aggregations (like SUM) will match when the groups in the user query can be computed by rolling up from the pre-aggregated dataset (see more details below).

For example, if the user count metric total_users is additive (defined as a COUNT), and there is a preaggregate for total_users by year, city:

  1. A user query for total_users by year, city will match exactly, since it has the same groups.
  2. A user query for total_users by city will match and summarize over all years.
  3. A user query for overall total_users (no groups) will match and summarize over all years and cities.

Aggregate-aware caching is most effective when used with additive metrics.
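The roll-up cases above correspond to simple aggregate queries over the cached table. As a sketch (the table and column names here are assumptions, not the actual names Honeydew generates):

```sql
-- Assumed pre-aggregate table: total_users by year, city
-- Case 2: total_users by city rolls up over all years
SELECT city, SUM(total_users) AS total_users
FROM preagg_total_users_by_year_city
GROUP BY city;

-- Case 3: total_users with no groups rolls up over everything
SELECT SUM(total_users) AS total_users
FROM preagg_total_users_by_year_city;
```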

Exact Matching

Metrics based on non-additive aggregations (like COUNT(DISTINCT)) will match only when the groups in the user query are exactly the same as in the pre-aggregated dataset.

For example, if the unique user count metric unique_users is non-additive (defined as a COUNT(DISTINCT)), and there is a preaggregate for unique_users by year, city:

  1. A user query for unique_users by year, city will match exactly, since it has the same groups.
  2. A user query for unique_users by city will not match.

Distinct count metrics are not additive, but approximate distinct metrics are.
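To see why, note that summing per-year distinct counts would double-count users who appear in more than one year; an exact distinct count has to be computed from the raw rows. A sketch (table and column names are assumptions):

```sql
-- WRONG: SUM(unique_users) over the preaggregate double-counts
-- users who are active in multiple years.
-- An exact distinct count must go back to the underlying fact table:
SELECT city, COUNT(DISTINCT user_id) AS unique_users
FROM fact_events
GROUP BY city;
```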

Matching Filters

A pre-aggregated dataset will match a user query only when its filters match fully.

For example, if the preaggregate dynamic dataset was built for 2023, only queries filtering for 2023 will use it.

Aggregate caches are most effective when built without explicit filters.

Domain filters can be used to filter both the pre-aggregated caches and the user queries to the same domain:

  1. Create a cached pre-aggregate dataset in a domain filtered_domain with a filter.
  2. Any query in the domain filtered_domain will use the cached pre-aggregate dataset.

The user query can be further filtered on groups that match the pre-aggregate.

For example, given a preaggregate for revenue by year and city, where year = 2023:

  1. A user query for revenue by year, city, where year = 2023 will use the preaggregate (since it matches all preaggregate filters and groups)
  2. A user query for revenue by year, city, where year = 2023 and city='Seattle' will also use the preaggregate (since it matches all preaggregate filters and groups, and its additional filters are on the city group that exists in the pre-aggregated dataset)
  3. A user query for revenue by year, where year = 2023 and city='Seattle' will use the preaggregate in the same way, but roll up over all cities to get a yearly total.
  4. A user query for revenue by year, city without any filters will not match the preaggregate (since the query is not filtered by year = 2023)
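Case 3 above can be sketched as the following query against the cached table (the table name is an assumption):

```sql
-- Pre-aggregate: revenue by year, city, where year = 2023
-- User query: revenue by year, where year = 2023 and city = 'Seattle'
SELECT year, SUM(revenue) AS revenue
FROM preagg_revenue_2023_by_year_city
WHERE city = 'Seattle'   -- extra filter on a group present in the preaggregate
GROUP BY year;
```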

Domains

A pre-aggregated dataset will match a user query only when the Source Filters of the user query’s domain and of the pre-aggregated dataset’s domain are the same.

Queries done in the same domain as the cached pre-aggregate dataset will always match its source filters.

Additive Metrics

Metrics based on aggregations like COUNT, SUM, MAX, and MIN are additive: they can be “rolled up” from the preaggregate.

If there is a count metric pre-aggregated per day, it can be summed up to get the value per month.

This is called rolling up a metric from groups.
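For example, rolling a daily pre-aggregate up to months is a plain re-aggregation (table and column names are assumptions):

```sql
-- Assumed pre-aggregate: order_count per day
SELECT DATE_TRUNC('month', day) AS month,
       SUM(order_count) AS order_count
FROM preagg_orders_by_day
GROUP BY 1;
```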

Partial Group Matching

When an aggregate is built over multiple groups with additive metrics, it can serve queries on any group subset.

For example, if the revenue metric is additive (defined as a SUM), and there is a preaggregate for revenue by year, city:

  1. A user query for revenue by year, city will match exactly, since it has the same groups.
  2. A user query for revenue by city will match, and will roll up from year groups to get totals per city.
  3. A user query for revenue by year will match, and will roll up from city groups to get totals per year.
  4. A user query for total revenue (no groups) will match, and will roll up from all groups to get the total.
  5. A user query for revenue by year, color will not match since it has different groups.

Matching Entity Keys

When an aggregate is built over an entity key group with additive metrics, it can serve queries on any group that comes from an attribute in that entity.

For example, if the revenue metric is additive (defined as a SUM), and there is a preaggregate for revenue by customer.id, then user queries on any group that comes from the customer entity will roll up.

  1. A user query for revenue by customer.id will match exactly.
  2. A user query for revenue by customer.name will match, and will roll up from the customer.id rows.
  3. A user query for revenue by customer.name, customer.age will also match, and roll up.
  4. A user query for revenue totals (no groups) will match, and will roll up over all groups to get the total.

If an entity has a composite key, all parts of the key must be present in the aggregate dataset to leverage this functionality.

When building pre-aggregates for additive metrics, always prefer entity keys, as that allows the preaggregate cache to accelerate a wider set of queries at once.
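Rolling up from an entity key works by joining the cached table back to the entity and re-aggregating. A sketch of case 2 above (table and column names are assumptions):

```sql
-- Assumed pre-aggregate: revenue by customer.id
SELECT c.name, SUM(p.revenue) AS revenue
FROM preagg_revenue_by_customer_id p
JOIN customer c ON c.id = p.customer_id
GROUP BY c.name;
```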

Configuring Additive Metrics

Honeydew detects simple additive metrics automatically (metrics that are an aggregation or direct derivatives thereof).

Any metric can be set to be additive using the rollup metric property:

type: metric
## ... metric definition
rollup: sum

Use rollup to allow aggregate aware roll-ups on complex metrics that are not detected automatically to be additive.

Possible values for the roll-up function are:

  • sum: Run a SUM aggregation to roll up. Used for metrics that are a SUM or COUNT aggregation.
  • min, max: Run a MIN or MAX aggregation to roll up. Used for the corresponding aggregations.
  • hll: (coming soon) Run an approximate distinct count combine (HLL_COMBINE) to roll up (see Estimating the Number of Distinct Values).
  • no_rollup: Disable roll-up behavior.

Tips for Effective Pre-aggregations

  1. Build for additive metrics - a single pre-aggregated dataset can serve many queries with roll-ups.
  2. Don’t use explicit filters on the data - any filter limits user queries that can be served.
  3. For workloads that operate on a subset of the data, build pre-aggregate caches in a filtered domain.
  4. Use entity keys as groups with additive metrics - more queries can be served with roll-ups from entity keys.

Don’t use pre-aggregates when metric results are not deterministic (e.g., a metric like SUM(RANDOM())).

Automatic Aggregate Caching

Honeydew can automatically set up and use aggregate caching based on recent queries.

This is most useful for accelerating live queries that calculate a dataset and then slice and dice it, such as Pivot Tables.

To enable short term aggregate cache in a domain, add this to the domain YAML:

type: domain
...
short_term_aggregate_cache_ttl_seconds: [the duration for which a query result can be reused as part of the short-term aggregate cache]

The typical duration for short-term aggregate caching is 600 seconds (10 minutes), but it can be configured to any value up to 24 hours.
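For example, a domain that reuses query results for 10 minutes would carry the following setting (the rest of the domain definition is elided):

```yaml
type: domain
# ... domain definition
short_term_aggregate_cache_ttl_seconds: 600  # reuse results for 10 minutes
```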

Short-Term Automatic Aggregate Caching is in Beta - reach out to support@honeydew.ai to enable in your account.

Configuration

Pre-aggregated datasets are set up by building a dynamic dataset that is enabled to act as a cache.

Configuration when building from ETL code or Honeydew

In the Dynamic Dataset YAML schema, set up cache delivery settings for Snowflake:

type: perspective
# ... dynamic dataset configuration
delivery:
  # enable Snowflake as preaggregate cache
  use_for_cache: snowflake
  # Set Snowflake delivery settings (where the entity cache resides)
  snowflake:
    enabled: true
    name: <name_of_table>
    schema: <name_of_schema>
    target: table

If using the Honeydew UI “deploy” action or Honeydew deploy API, a deployed dynamic dataset will be immediately ready to accelerate user queries.

If building the table with ETL code, your ETL code should call the appropriate Honeydew API (see below).

Configuration when using dbt as orchestrator

In the Dynamic Dataset YAML schema, set up cache delivery settings for dbt:

type: perspective
# ... dynamic dataset configuration
delivery:
  # enable dbt materialization as preaggregate cache
  use_for_cache: dbt
  # dbt settings (name of dbt model that creates the table in Snowflake)
  dbt:
    enabled: true
    dbt_model: name_of_model

Your dbt code should call the Honeydew dbt APIs to build the dynamic dataset (see below).

Orchestration

Entity Cache refresh relies on external orchestration (with dbt or otherwise) or manual deployments.

Set up with Honeydew

Use the Honeydew deploy functionality to write the dynamic dataset to a Snowflake table from the UI, or using the Native Application Deploy API:

select SEMANTIC_LAYER_ENTERPRISE_EDITION.API.DEPLOY_DYNAMIC_DATASET(
    -- workspace & branch
    'workspace_name', 'branch_name',
    -- dataset name
    'dataset_name'
    );

Set up with ETL tools

Use the Snowflake Native Application API to get SQL for entity cache:

select SEMANTIC_LAYER_ENTERPRISE_EDITION.API.GET_SQL_FOR_DYNAMIC_DATASET(
        -- workspace & branch
        'workspace_name', 'branch_name',
        -- dataset name
        'dataset_name'
    );

Create the table using that SQL in Snowflake. Honeydew uses the table update time to detect cache validity.
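Your ETL job can then materialize the returned SQL as a table, for example (the fully qualified table name is a placeholder, and the inner SELECT stands in for the SQL text returned by the API call):

```sql
-- Wrap the SQL returned by GET_SQL_FOR_DYNAMIC_DATASET in a CTAS:
CREATE OR REPLACE TABLE my_db.my_schema.dataset_name AS
-- <SQL text returned by the API call goes here>
SELECT ...;
```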

Dynamic datasets can be defined within a domain. All domain configuration (such as filters) will be applied to the dynamic dataset used for cache.

Set up with dbt

To set up dbt as a cache orchestrator:

  1. In dbt, create an entity cache model by using the Honeydew dbt cache macro
  2. In dbt, use the config macro to set up materialization settings such as clustering or dynamic tables
  3. In Honeydew, set the entity’s dbt delivery settings to the chosen dbt model name

For example, this can be the monthly_kpi model in dbt:

-- Set up materialization parameters for cache
{{ config(materialized='table') }}

-- Set up any additional dependencies in dbt with
-- depends_on: {{ ref('upstream_parent_model') }}

-- Cache for customers entity
{{ honeydew.get_dataset_sql('monthly_kpi') }}

Pre-aggregated datasets are usually not incremental.

See more in Honeydew dbt documentation