Filtering and Sampling
Filtering and Sampling
Honeydew pushes filters down to data, in order to reduce data scans.
Use domain filters to reduce scanned data in ad-hoc exploration
Domain filters limit the data that is accessed by a user query, which improves performance.
For example, can add a filter to a domain, that limits data in a fact table to the last week only. Any user query, regardless of the choice of attributes and metrics, will then be limited to the last week’s data. Honeydew engine will push down the filter to reduce data access and improve performance.
In some cases of logical data partitioning (for example, having a column indicating a tenant in a multi-tenant environment, or a column indicating data sampling level) a domain source filter can be further applied to improve performance.
Domain source filters can be used to implement data sampling:
- In a large fact table to be sampled, create a column with sampling level in the fact table. Filtering on that field will allow to select a data sample - for example, 0.1% of events will have the value “0.1”. Related events are commonly sampled together (i.e. events from the same user/session).
- Add the sampling level column to Snowflake clustering settings of the fact table.
- Create a “sampled data domain” with a source filter on the sampling level column.
Use parameters to reduced scanned data in dynamic datasets deployed as VIEWs
Dynamic Datasets can be deployed as a Snowflake VIEW.
When they are deployed as a VIEW, all metrics and attributes are part of VIEW query, so improvements such as reducing JOINs or pushing down filters are not possible.
However, performance of VIEWs can be increased by combining parameters and filters.
If a dynamic dataset is defined with a parameterized filter, any access to the view can be controlled with the view parameter.
For example, if a view with the following filter orders.o_orderdate = $date_to_explore
was defined, then can do:
Was this page helpful?