Performance

Performance is a concern in a few scenarios:

  1. Complex multi-stage computations (that might require many JOINs)
  2. Large data volumes (such as event tables)
  3. Ad-hoc exploration

Honeydew combines the following tactics for acceleration:

  • Preparation: Cache entities to avoid re-computation
  • Optimization: Push down filters and prune joins to minimize scanned data
  • Materialization: Deploy dynamic datasets as tables

Join pruning and filter pushdown

The Honeydew engine only joins the tables that are actually needed for a specific user query.

If there are filters, they will be pushed down as close to the data as possible. See order of filtering for more details.
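
For illustration, consider a dataset that spans customers and orders (dataset and column names here are hypothetical). A query that touches only customer attributes never joins the orders table, and its filter is applied directly at the scan:

-- hypothetical query against a Honeydew-exposed dataset
-- only customer attributes are referenced, so the orders table is pruned
-- and the region filter is pushed down to the customers scan
SELECT customer_name, customer_region
FROM exploration_dataset
WHERE customer_region = 'EMEA';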

Caching

Entities may have calculated attributes that are expensive to compute, involving aggregations, JOINs, or large table scans.

By default, every access to a calculated attribute recomputes it.

The cache makes it possible to materialize entities and update them on triggers or on a set schedule, in order to avoid re-computation.

The cache is stored as tables in your Snowflake account and recomputes based on its configuration.

When the entity cache is enabled, Honeydew automatically leverages cached data when applicable.

Entity Cache requires dbt integration. Contact us for other orchestration requirements.

Entity Caching with dbt

To set up dbt as a cache orchestrator:

  1. In dbt, create an entity cache model using the Honeydew dbt cache macro
  2. In dbt, use the config macro to set up materialization settings such as clustering or dynamic tables
  3. In Honeydew, set entity dbt delivery settings to the chosen dbt model name

In Honeydew, set up a dbt source for the entity table, and Honeydew will maintain the reference.

For example, this can be the customers model in dbt:

-- Set up materialization parameters for cache
{{ config(materialized='table') }}

-- Set up any additional dependencies in dbt with
-- depends_on: {{ ref('upstream_parent_model') }}

-- Cache for customers entity
{{ get_honeydew_entity_sql('customers') }}

You may use the is_incremental() dbt function in combination with the Honeydew SQL macro to build incremental caches. However, make sure the computation itself has not changed between runs, to avoid mixing different versions of the logic in the same table.
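
For example, a minimal incremental sketch could look like the following (the updated_at column is an assumed change-tracking attribute; use whatever attribute your entity exposes):

-- Incremental cache for the customers entity
{{ config(materialized='incremental') }}

SELECT *
FROM ({{ get_honeydew_entity_sql('customers') }}) AS entity
{% if is_incremental() %}
-- updated_at is a hypothetical change-tracking attribute of the entity
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}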

Data Exploration

Ad-hoc exploration means the question to be asked is not known in advance and has no pre-computed answer.

Oftentimes, a cache is not sufficient for ad-hoc exploration.

The best practice for faster exploration is to limit the data being explored, using filtering or sampling.
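
For instance, a large source can be pre-sampled in Snowflake before exploring it (table names here are hypothetical):

-- hypothetical pre-sampled copy of a large fact table
CREATE TABLE events_sample AS
SELECT * FROM events SAMPLE (1);  -- keeps roughly 1% of rows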

Domain filters

Domain filters limit the data that is accessed by a user query, which improves performance.

For example, you can add to a domain a filter that limits data in a fact table to the last week only. Any user query, regardless of the choice of attributes and metrics, will then be limited to the last week's data. The Honeydew engine pushes the filter down to reduce data access and improve performance.

In some cases of logical data partitioning (for example, a column indicating the tenant in a multi-tenant environment, or a column indicating the data sampling level), a domain source filter can be applied to further improve performance.
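
As an illustration, such domain filters can be plain SQL conditions like the following (the tenant column and parameter are hypothetical; orders.o_orderdate follows the example used later in this section):

-- limit a fact table to the last week
orders.o_orderdate >= DATEADD('day', -7, CURRENT_DATE())

-- restrict a multi-tenant table to a single tenant
events.tenant_id = $tenant_to_explore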

Exploration with a SQL interface

Data exploration from a live BI connection or a user SQL tool is a common way to explore data.

For each user query, Honeydew will:

  1. Push down any filters so they reduce the amount of data scanned
  2. Build only the JOINs required for the user query
  3. Optimize the query structure for performance

For example, you can build a dynamic dataset for exploration that includes 1,000 attributes and 300 metrics. If a user accesses just one metric and one dimension, Honeydew will compute only what that query requests.
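
For instance, a query like the following (dataset and column names are hypothetical) computes one metric grouped by one dimension, and the remaining attributes and metrics in the dataset add no cost:

-- snowflake query over a live connection
-- only the requested dimension and metric are computed
SELECT customer_region, SUM(sales_amount) AS total_sales
FROM exploration_dataset
GROUP BY customer_region;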

Exploration with dynamic datasets deployed as VIEWs

Dynamic Datasets can be deployed as a Snowflake VIEW.

When they are deployed as a VIEW, all metrics and attributes are part of the VIEW query, so optimizations such as reducing JOINs or pushing down filters are not possible.

However, you can increase the performance of VIEWs by combining parameters and filters.

If a dynamic dataset is defined with a parameterized filter, any access to the view can be controlled with the view parameter.

For example, if you defined a view with the filter orders.o_orderdate = $date_to_explore, then you can run:

-- snowflake query
SET date_to_explore='1998-01-01'::DATE;
-- will only process a single chosen date
SELECT * FROM honeydew_view;

Dynamic Datasets

Dynamic Datasets that are deployed to Snowflake can be materialized as a TABLE or a DYNAMIC TABLE.

This is important when the dataset processes a large amount of data or involves complex multi-stage computations.
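
For illustration, a materialization as a Snowflake DYNAMIC TABLE could look like the sketch below (names and settings are hypothetical; Honeydew generates the actual dataset query):

-- hypothetical dynamic table refreshing the dataset every hour
CREATE OR REPLACE DYNAMIC TABLE sales_by_region
  TARGET_LAG = '1 hour'
  WAREHOUSE = transform_wh
AS
SELECT c.c_mktsegment AS customer_segment,
       SUM(o.o_totalprice) AS total_sales
FROM orders AS o
JOIN customers AS c ON o.o_custkey = c.c_custkey
GROUP BY c.c_mktsegment;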

See materialization for more details.