Many analytic workloads require aggregations or joins on large datasets (such as fact tables), which can be slow or expensive when the data is large.
Honeydew can improve the performance of aggregative queries of any kind by leveraging Honeydew-managed pre-aggregations in Snowflake.
The aggregated cache is defined as dynamic datasets in Honeydew that are stored as tables in your Snowflake.
Aggregate aware caching can improve perfomance by orders of magnitude on large datasets.
Aggregate caching is built on dynamic datasets that are set up to act as a cache.
Those dynamic datasets typically include:
Once a dynamic dataset is configured to be used for pre-aggregate caching, deploy it to Snowflake via Honeydew UI or with your ETL/ELT orchestrator.
One-time refreshes can be done from Honeydew UI.
Periodic refreshes of pre-aggregated caches require an orchestration framework, such as dbt, Airflow or scheduled ETL queries.
Honeydew checks if the dataset table exists in Snowflake to use it for acceleration.
A dynamic dataset that is configured for pre-aggregate caching but is not deployed to Snowflake will not be used.
Honeydew automatically matches a user query with a pre-aggregated dataset that can accelerate it.
Honeydew verifies all semantic definitions (groups, filters in use, join paths in use) before deciding that a pre-aggregation can serve the user query.
A pre-aggregated dataset will match a user query only when its groups can help the user query.
Metric based on additive aggregations (like SUM
) will match the groups in the user query can be computed
by rolling up from the pre-aggregated dataset (see more details below).
For example, if the user count metric is additive (defined as a COUNT
), and there is a preaggregate
for total_users
by year, city
:
total_users
by year, city
will match exactly since it has same groups.total_users
by city
will match and summarize over all years.total_users
(no groups) will match and summarize over all years and cities.Aggregate aware caching is best when used with additive metrics.
Metrics based on non-additive aggregations (like COUNT(DISTINCT)
) will match only when the groups in the
user query are exactly the same as in the pre-aggregated dataset.
For example, if the unique user count metric is non-additive (defined as a COUNT(DISTINCT)
), and there is a preaggregate
for unique_users
by year, city
:
unique_users
by year, city
will match exactly since it has same groups.unique_users
by city
will not match.Distinct count metric are not additive, but approximate distinct metrics are.
A pre-aggregated dataset will match a user query only when its filters match fully.
For example, if the preaggregate dynamic dataset was built for 2023, only queries filtering for 2023 will use it.
Aggregate caches are most effective when built without explicit filters.
Domain filters can be used to filter both the pre-aggregated caches and the user queries to the same domain:
filtered_domain
with a filter.filtered_domain
will use the cached pre-aggregate dataset.The user query can be further filtered on groups that match the pre-aggregate.
For example, given a preaggregate for revenue
by year
and city
, where year = 2023
:
revenue
by year, city
, where year = 2023
will use the preaggregate (since it matches
all preaggregate filters and groups)revenue
by year, city
, where year = 2023 and city='Seattle'
will also use the
preaggregate (since it matches all preaggregate filters and groups, and its additional filters are on the
city
group that exists in the pre-aggregated dataset)revenue
by year
, where year = 2023 and city='Seattle'
will use the preaggregate
in the same way, but roll up over all cities to get a yearly total.revenue
by year, city
without any filters will not match the preaggregate
(since it was not filtered by year = 2023
)A pre-aggregated dataset will match a user query only when Source Filters of user’s query domain and the pre-aggregated dataset domain are same.
Queries done in the same domain as the cached pre-aggregate dataset will always match its source filters.
Metrics based on aggregation like COUNT
, SUM
, MAX
, MIN
are additive: they can be “rolled-up” from the preaggregate.
If there is a count metric pre-aggregated for each day, it can be summed up to see value per month.
This is called rolling up a metric from groups.
When an aggregate is built over multiple groups with additive metrics, it can serve queries on any group subset.
For example, if the revenue metric is additive (defined as a SUM
), and there is a preaggregate for
revenue
by year, city
:
revenue
by year, city
will match exactly since it has same groups.revenue
by city
will match, and will roll up from year
groups to get totals per city.revenue
by year
will match, and will roll up from city
groups to get totals per year.revenue
(no groups) will match, and will roll up from all groups to get the total.revenue
by year, color
will not match since it has different groups.When an aggregate is built over an entity key group with additive metrics, it can serve queries on any group that comes from an attribute in that entity.
For example, if the revenue metric is additive (defined as a SUM
), and there is a preaggregate for revenue
by
customer.id
, then user queries on any group that comes from the customer
entity will roll up.
revenue
by customer.id
will match exactly.revenue
by customer.name
will match, and will roll up from customer.id
lines.revenue
by customer.name, customer.age
will also match, and roll up.revenue
totals (no groups) will match, and will roll up over all groups to get the total.If an entity has a composite key, then all parts of the key must present in the aggregate dataset to be able to leverage this functionality.
When building pre-aggregates for additive metrics, always prefer entity keys, as that allows the preaggregate cache to accelerate a wider set of queries at once.
Honeydew detects simple additive metrics automatically (metrics that are an aggregation or direct derivatives thereof).
Any metric can be set to be additive using the rollup
metric property:
Use rollup
to allow aggregate aware roll-ups on complex metrics that are not detected automatically to
be additive.
Possible values for roll up functions are:
sum
: Run a SUM aggregation to roll up. Used for metrics that are a SUM
or COUNT
aggregation.min
, max
: Run a MIN or MAX aggregation to roll up. Used for the corresponding aggregations.hll
: (coming soon) Run an approximate distinct count combine (HLL_COMBINE
) to roll up (see
Estimating the Number of Distinct Values).no_rollup
: Disable roll-up behavior.Don’t use pre-aggregates when metric results are not deterministic (for a metric like SUM(RANDOM())
).
Honeydew can automatically set up and use aggregate caching based on recent queries.
This is most useful for accelerating live queries that calculate a dataset and then slice and dice it, such as Pivot Tables.
To enable short term aggregate cache in a domain, add this to the domain YAML:
The typical duration for short-term aggregate caching is 600 seconds (10 minutes), but it can be configured to any value up to 24 hours.
Short-Term Automatic Aggregate Caching is currently in Beta. Contact support@honeydew.ai to activate it for your account.
Pre-aggregated datasets are set up via building a dynamic dataset that is enabled to be used as cache.
Set up in Dynamic Dataset YAML schema cache delivery settings for Snowflake:
If using the Honeydew UI “deploy” action or Honeydew deploy API, a deployed dynamic dataset will be immediately ready to accelerate user queries.
If building the table in ETL code - your ETL code should call the appropriate Honeydew API (see below).
Set up in Dynamic Dataset YAML schema cache delivery settings for dbt:
Your dbt code should call the Honeydew dbt APIs to build the dynamic dataset (see below).
Entity Cache refresh relies on external orchestration (with dbt or otherwise) or manual deployments.
Use the Honeydew deploy functionality to write the dynamic dataset to a Snowflake table from the UI, or using the Native Application Deploy API:
Use the Snowflake Native Application API to get SQL for entity cache:
Create the table using that SQL in Snowflake. Honeydew uses the table update time to detect cache validity.
Dynamic datasets can be defined within a domain. All domain configuration (such as filters) will be applied to the dynamic dataset used for cache.
To set up dbt as a cache orchestrator:
config
macro to set up materialization settings such as clustering or dynamic tablesFor example, this can be the monthly_kpi
model in dbt:
Pre-aggregated datasets are usually not incremental.
See more in Honeydew dbt documentation