Aggregate Aware Caching
Introduction
Many analytic workloads require aggregations or joins on large datasets (such as fact tables), which can be slow or expensive when the data is large.
Honeydew can improve the performance of aggregative queries of any kind by leveraging Honeydew-managed pre-aggregations in Snowflake.
The aggregated cache is defined as dynamic datasets in Honeydew that are stored as tables in your Snowflake.
Aggregate aware caching can improve perfomance by orders of magnitude on large datasets.
Aggregate caching is built on dynamic datasets that are set up to act as a cache.
Those dynamic datasets typically include:
- Metrics that require acceleration (if users commonly ask about daily revenue, can put it in the per-day aggregate).
- Groups that are needed to accelerate user queries (if users commonly ask for metrics by day/month/quarter, can pre-compute per-day results).
- Domain that needs to match user queries (as different domains might have different configurations).
Once a dynamic dataset is configured to be used for pre-aggregate caching, deploy it to Snowflake via Honeydew UI or with your ETL/ELT orchestrator.
One-time refreshes can be done from Honeydew UI.
Periodic refreshes of pre-aggregated caches require an orchestration framework, such as dbt, Airflow or scheduled ETL queries.
Honeydew checks if the dataset table exists in Snowflake to use it for acceleration.
A dynamic dataset that is configured for pre-aggregate caching but is not deployed to Snowflake will not be used.
How It Works
Honeydew automatically matches a user query with a pre-aggregated dataset that can accelerate it.
Honeydew verifies all semantic definitions (groups, filters in use, join paths in use) before deciding that a pre-aggregation can serve the user query.
Matching Groups
A pre-aggregated dataset will match a user query only when its groups can help the user query.
Roll Up Matching
Metric based on additive aggregations (like SUM
) will match the groups in the user query can be computed
by rolling up from the pre-aggregated dataset (see more details below).
For example, if the user count metric is additive (defined as a COUNT
), and there is a preaggregate
for total_users
by year, city
:
- A user query for
total_users
byyear, city
will match exactly since it has same groups. - A user query for
total_users
bycity
will match and summarize over all years. - A user query for total
total_users
(no groups) will match and summarize over all years and cities.
Aggregate aware caching is best when used with additive metrics.
Exact Matching
Metrics based on non-additive aggregations (like COUNT(DISTINCT)
) will match only when the groups in the
user query are exactly the same as in the pre-aggregated dataset.
For example, if the unique user count metric is non-additive (defined as a COUNT(DISTINCT)
), and there is a preaggregate
for unique_users
by year, city
:
- A user query for
unique_users
byyear, city
will match exactly since it has same groups. - A user query for
unique_users
bycity
will not match.
Distinct count metric are not additive, but approximate distinct metrics are.
Matching Filters
A pre-aggregated dataset will match a user query only when its filters match fully.
For example, if the preaggregate dynamic dataset was built for 2023, only queries filtering for 2023 will use it.
Aggregate caches are most effective when built without explicit filters.
Domain filters can be used to filter both the pre-aggregated caches and the user queries to the same domain:
- Create a cached pre-aggregate dataset in a domain
filtered_domain
with a filter. - Any query in the domain
filtered_domain
will use the cached pre-aggregate dataset.
The user query can be further filtered on groups that match the pre-aggregate.
For example, given a preaggregate for revenue
by year
and city
, where year = 2023
:
- A user query for
revenue
byyear, city
, whereyear = 2023
will use the preaggregate (since it matches all preaggregate filters and groups) - A user query for
revenue
byyear, city
, whereyear = 2023 and city='Seattle'
will also use the preaggregate (since it matches all preaggregate filters and groups, and its additional filters are on thecity
group that exists in the pre-aggregated dataset) - A user query for
revenue
byyear
, whereyear = 2023 and city='Seattle'
will use the preaggregate in the same way, but roll up over all cities to get a yearly total. - A user query for
revenue
byyear, city
without any filters will not match the preaggregate (since it was not filtered byyear = 2023
)
Domains
A pre-aggregated dataset will match a user query only when:
- Active Roles of user’s query domain and the pre-aggregated dataset domain are same.
- Source Filters of user’s query domain and the pre-aggregated dataset domain are same.
Queries done in the same domain as the cached pre-aggregate dataset will always match its source filters and active roles.
Additive Metrics
Metrics based on aggregation like COUNT
, SUM
, MAX
, MIN
are additive: they can be “rolled-up” from the preaggregate.
If there is a count metric pre-aggregated for each day, it can be summed up to see value per month.
This is called rolling up a metric from groups.
Partial Group Matching
When an aggregate is built over multiple groups with additive metrics, it can serve queries on any group subset.
For example, if the revenue metric is additive (defined as a SUM
), and there is a preaggregate for
revenue
by year, city
:
- A user query for
revenue
byyear, city
will match exactly since it has same groups. - A user query for
revenue
bycity
will match, and will roll up fromyear
groups to get totals per city. - A user query for
revenue
byyear
will match, and will roll up fromcity
groups to get totals per year. - A user query for total
revenue
(no groups) will match, and will roll up from all groups to get the total. - A user query for
revenue
byyear, color
will not match since it has different groups.
Matching Entity Keys
When an aggregate is built over an entity key group with additive metrics, it can serve queries on any group that comes from an attribute in that entity.
For example, if the revenue metric is additive (defined as a SUM
), and there is a preaggregate for revenue
by
customer.id
, then user queries on any group that comes from the customer
entity will roll up.
- A user query for
revenue
bycustomer.id
will match exactly. - A user query for
revenue
bycustomer.name
will match, and will roll up fromcustomer.id
lines. - A user query for
revenue
bycustomer.name, customer.age
will also match, and roll up. - A user query for
revenue
totals (no groups) will match, and will roll up over all groups to get the total.
If an entity has a composite key, then all parts of the key must present in the aggregate dataset to be able to leverage this functionality.
When building pre-aggregates for additive metrics, always prefer entity keys, as that allows the preaggregate cache to accelerate a wider set of queries at once.
Configuring Additive Metrics
Honeydew detects simple additive metrics automatically (metrics that are an aggregation or direct derivatives thereof).
Any metric can be set to be additive using the rollup
metric property:
Use rollup
to allow aggregate aware roll-ups on complex metrics that are not detected automatically to
be additive.
Possible values for roll up functions are:
sum
: Run a SUM aggregation to roll up. Used for metrics that are aSUM
orCOUNT
aggregation.min
,max
: Run a MIN or MAX aggregation to roll up. Used for the corresponding aggregations.hll
: (coming soon) Run an approximate distinct count combine (HLL_COMBINE
) to roll up (see Estimating the Number of Distinct Values).no_rollup
: Disable roll-up behavior.
Tips for Effective Pre-aggregations
- Build for additive metrics - a single pre-aggregated dataset can serve many queries with roll-ups.
- Don’t use explicit filters on the data - any filter limits user queries that can be served.
- For pre-aggregate caches for workloads that operate on a subset of the data, build them in a filtered domain.
- Use entity keys as groups with additive metrics - more queries can be served with roll-ups from entity keys.
Don’t use pre-aggregates when metric results are not deterministic (for a metric like SUM(RANDOM())
).
Automatic Aggregate Caching
Honeydew can automatically set up and use aggregate caching based on recent queries.
This is most useful for accelerating live queries that calculate a dataset and then slice and dice it, such as Pivot Tables.
To enable short term aggregate cache in a domain, add this to the domain YAML:
The typical duration for short-term aggregate caching is 600 seconds (10 minutes), but it can be configured to any value up to 24 hours.
Short-Term Automatic Aggregate Caching is in Beta - reach out to support@honeydew.ai to enable in your account.
Configuration
Pre-aggregated datasets are set up via building a dynamic dataset that is enabled to be used as cache.
Configuration when building from ETL code or Honeydew
Set up in Dynamic Dataset YAML schema cache delivery settings for Snowflake:
If using the Honeydew UI “deploy” action or Honeydew deploy API, a deployed dynamic dataset will be immediately ready to accelerate user queries.
If building the table in ETL code - your ETL code should call the appropriate Honeydew API (see below).
Configuration when using dbt as orchestrator
Set up in Dynamic Dataset YAML schema cache delivery settings for dbt:
Your dbt code should call the Honeydew dbt APIs to build the dynamic dataset (see below).
Orchestration
Entity Cache refresh relies on external orchestration (with dbt or otherwise) or manual deployments.
Set up with Honeydew
Use the Honeydew deploy functionality to write the dynamic dataset to a Snowflake table from the UI, or using the Native Application Deploy API:
Set up with ETL tools
Use the Snowflake Native Application API to get SQL for entity cache:
Create the table using that SQL in Snowflake. Honeydew uses the table update time to detect cache validity.
Dynamic datasets can be defined within a domain. All domain configuration (such as active roles or filters) will be applied to the dynamic dataset used for cache.
Set up with dbt
To set up dbt as a cache orchestrator:
- In dbt, create an entity cache model by using the Honeydew dbt cache macro
- In dbt, use the
config
macro to set up materialization settings such as clustering or dynamic tables - In Honeydew, set the entity’s dbt delivery settings to the chosen dbt model name
For example, this can be the monthly_kpi
model in dbt:
Pre-aggregated datasets are usually not incremental.
See more in Honeydew dbt documentation
Was this page helpful?