Entity Source Data

The source data definition determines where the data for an entity comes from.

Source data is associated with a single entity.

The source data definition specifies:

  • The origin of the data (e.g., a physical database object or a SQL query).
  • The columns to be included as attributes of an entity, along with their designated names.
  • The metadata associated with those attributes, derived from the source.

Columns

Columns from the source data are added as attributes in the entity.

Attribute Mapping and Renaming

As part of the source data definition, you can optionally specify a name for the created attribute. If no name is provided, the column name is used as the attribute’s default name.

A common pattern for light data cleaning transformations involves adding a column as a hidden attribute and then creating a calculated attribute with the same name that applies the transformation.

For example, if the city_name column in the source data contains surrounding whitespace, you can normalize the column by adding it as a hidden attribute and creating a calculated attribute to remove the whitespace:

  1. Add the city_name column as a hidden attribute named _city_name_raw (see the sketch below).
  2. Create a calculated attribute called city_name with the expression TRIM(entity._city_name_raw).
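
In the YAML schema described at the end of this page, the hidden attribute of step 1 could look like this sketch (the column and attribute names are taken from the example above):

attributes:
  - column: city_name
    name: _city_name_raw
    hidden: True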

Structure and Type Synchronization

When source data is created, its column structure and data types are initially imported from the source.

If the structure or data types change over time, you can re-synchronize them by using the “synchronize from source” option available in the UI or API.

Data Mapping

Source data can be a database object (such as a table, view, or custom SQL query).

Source objects can include any Snowflake object containing data, such as TABLE, VIEW, DYNAMIC TABLE, MATERIALIZED VIEW, and more. All types of tables are supported, including regular tables, Iceberg tables, and external tables.

Custom SQL may result in lower performance due to the following reasons:

  • Filters may not be pushed down into the query, leading to larger table scans.
  • Column selection may not be pushed down, causing unnecessary data movement.

If the source data is a large dataset that is frequently filtered, it is recommended for performance reasons to encapsulate it in a table. If the source data contains many columns, consider encapsulating it in a view.

In particular, it’s always better to reference a table directly rather than using a SELECT * FROM table custom SQL source.
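
For example, a sketch of the preferred setup for a wide table (the object and column names below are hypothetical):

-- avoid a custom SQL source that merely wraps the table:
--   SELECT * FROM DATABASE.SCHEMA.WIDE_TABLE
-- instead, encapsulate the needed columns in a view ...
CREATE VIEW DATABASE.SCHEMA.WIDE_TABLE_SLIM AS
SELECT id, city_name, created_at  -- hypothetical columns
FROM DATABASE.SCHEMA.WIDE_TABLE;
-- ... and reference WIDE_TABLE_SLIM directly as a table source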

Semi-Structured and JSON Data

Source data, whether it’s a query, table, or view, can include semi-structured columns (such as JSON, arrays, or maps).

A Honeydew attribute can be of a semi-structured type and be used directly as an object. However, most BI tools cannot process semi-structured data and require simple, structured columns.

There are two approaches to normalize semi-structured data for BI:

  1. Use the source as-is with semi-structured attributes and create calculated attributes to unpack specific fields (sketched below).
  2. Unpack all fields with a custom SQL query source.
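
For the first approach, create a calculated attribute that unpacks a single field. A sketch, assuming the entity has a semi-structured attribute named json_data and that Snowflake's path syntax applies in the expression:

-- a calculated attribute called "id": unpack the "id" field and cast it
entity.json_data:"id"::string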

For the second approach, given a JSON column with different fields, a structured custom SQL query can look like this:

-- the table JSON_TABLE has a column named json_data with different nested values
SELECT
    -- unpack a top-level field by key
    json_data:"id"::string AS id,
    -- unpack a nested sub-field by key
    json_data:"address":"city"::string AS address_city,
    -- unpack a sub-field of an array element by index and key
    json_data:"visits"[0]:"time"::timestamp AS first_visit_time
FROM DATABASE.SCHEMA.JSON_TABLE;

Note that when unpacking semi-structured data, it is recommended to explicitly cast the columns to the appropriate types.

In most cases, it is recommended to avoid using a LATERAL FLATTEN join to flatten a JSON object.

When using LATERAL FLATTEN, each key in the JSON object becomes a separate row with a key and a value, which can alter the granularity of the entity. If this behavior is desired, use the SEQ column, which provides a unique sequence number, as part of the entity key to maintain consistency and avoid issues with key duplication.
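
As a sketch of that pattern (assuming the JSON_TABLE source from above), a custom SQL source can expose SEQ alongside the flattened keys, so that (flatten_seq, property_key) can serve as the entity key:

SELECT
    -- SEQ is a unique sequence number per input row,
    -- so together with KEY it identifies each flattened row
    f.seq AS flatten_seq,
    f.key AS property_key,
    f.value::string AS property_value
FROM DATABASE.SCHEMA.JSON_TABLE t,
    LATERAL FLATTEN(input => t.json_data) f;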

YAML Schema

Each entity is based on a source table defined in a separate text file stored in Git.

The schema for a source table is:

type: dataset
entity: [entity name]
name: [source table name]
display_name: [display name]
sql: |
  [sql statement]
owner: [owner]
description: |-
  [description]
labels: [...]
folder: [folder]
hidden: [True/False/Yes/No]
dataset_type: [table/sql/entity]
attributes:
  - column: [src column name]
    name: [attribute name]
    display_name: [display name]
    datatype: [data type - number/string/...]
    timegrain: [the time grain for time attributes - hour/day/month/...]
    description: [description]
    labels: [...]
    folder: [folder]
    hidden: [True/False/Yes/No]

Fields:

  • name: Name of the source table
  • display_name, owner, description, labels, folder, hidden: Metadata
  • dataset_type: Whether the source is a table, a custom SQL query, or a virtual entity
  • sql: The name of the table (when dataset_type is table), the base entity (when entity), or the SQL query (when sql)
  • attributes: List of attributes to bring from the source, with optional renaming
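
Putting it together, a minimal sketch of a dataset file (the names are hypothetical, reusing the city_name cleanup pattern from above):

type: dataset
entity: city
name: cities
display_name: Cities
dataset_type: table
sql: DATABASE.SCHEMA.CITIES
attributes:
  - column: city_name
    name: _city_name_raw
    hidden: True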