Update utils dbt docs (#622)
* Update utils dbt docs

* Add on normalize changes

---------

Co-authored-by: Emiel <emiel.verkade@gmail.com>
2 people authored and agnessnowplow committed Oct 6, 2023
1 parent 6c67bcb commit db4c1fb
Showing 6 changed files with 26 additions and 11 deletions.
@@ -34,6 +34,7 @@ All variables in Snowplow packages start with `snowplow__` but we have removed t
| `days_late_allowed` | The maximum allowed number of days between the event creation and it being sent to the collector. Exists to reduce lengthy table scans that can occur as a result of late arriving data. If set to `-1`, this filter is disabled entirely, which can be useful if you have events with no `dvce_sent_tstamp` value. | 3 |
| `lookback_window_hours` | The number of hours to look back before the latest processed event, to account for late arriving data, which comes out of order. | 6 |
| `start_date` | The date to start processing events from in the package on first run or a full refresh, based on `collector_tstamp`. | '2020-01-01' |
| `session_timestamp` | Determines which timestamp is used to process sessions of data. It's a good idea for this to be the same timestamp field that your events table is partitioned on. | 'collector_tstamp' |
| `upsert_lookback_days` | Number of days to look back over the incremental derived tables during the upsert. Where performance is not a concern, this should be set to as long a value as possible. Having too short a period can result in duplicates. Please see the [Snowplow Optimized Materialization](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-advanced-usage/dbt-incremental-materialization/index.md) section for more details. | 30 |
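
As a rough illustration (the `snowplow_<package>` scope and the values shown are placeholders, not recommendations), a few of these variables could be overridden in `dbt_project.yml` like so:

```yml
# Illustrative sketch only — replace snowplow_<package> with your package's
# actual scope (e.g. snowplow_web) and choose values appropriate to your data.
vars:
  snowplow_<package>:
    snowplow__start_date: '2023-01-01'
    snowplow__session_timestamp: 'collector_tstamp'
    snowplow__lookback_window_hours: 6
    snowplow__upsert_lookback_days: 30
```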

### Contexts, filters, and logs
@@ -113,6 +114,7 @@ export const GROUPS = [
"snowplow__days_late_allowed",
"snowplow__lookback_window_hours",
"snowplow__start_date",
"snowplow__session_timestamp",
"snowplow__upsert_lookback_days"] },
{ title: "Contexts, Filters, and Logs", fields: ["snowplow__app_id"] },
{ title: "Warehouse Specific", fields: ["snowplow__databricks_catalog",
@@ -53,6 +53,13 @@ models:
snowplow_<package>_custom_models:
+tags: snowplow_<package>_incremental #Adds tag to all models in the 'snowplow_<package>_custom_models' directory
```
:::tip
If you are using your own version of our `base` macros provided in our [`utils` package](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-models/dbt-utils-data-model/dbt-utils-advanced-operation/index.md), your tag will be `<package_name>_incremental`, where `<package_name>` is what you set in the `snowplow_incremental_post_hook` hook.

:::

These models should also make use of the optimized materialization by setting `materialized='incremental'` and `snowplow_optimize=true` in your model config. Finally, as well as referencing a `_this_run` table, these models should make use of the `is_run_with_new_events` macro to only process the table when new events are available in the current run. This macro, `snowplow_utils.is_run_with_new_events(package_name)`, evaluates whether the particular model, i.e. `{{ this }}`, has already processed the events in the given run. It returns a boolean and effectively blocks the upsert to incremental models if the run only contains old data, which protects your derived incremental tables from being temporarily updated with incomplete data during batched back-fills of other models.

```jinja2 title="/models/snowplow_<package>_custom_models/my_custom_model.sql"
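{# Illustrative sketch only — the model body and column names are hypothetical,
   and the unique_key/upsert_date_key values are placeholders; the config mirrors
   the requirements described above (optimized incremental materialization, the
   incremental tag, and the is_run_with_new_events guard). #}
{{
    config(
        materialized='incremental',
        unique_key='event_id',
        upsert_date_key='collector_tstamp',
        tags=['snowplow_<package>_incremental'],
        snowplow_optimize=true
    )
}}

select *
from {{ ref('snowplow_<package>_base_events_this_run') }}
where {{ snowplow_utils.is_run_with_new_events('snowplow_<package>') }}
```
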
@@ -15,7 +15,7 @@ Throughout this page we refer to entity columns only by their major version (e.g

:::

The general idea behind the Snowplow-utils base functionality is to be able to add custom identifiers, bespoke behaviour, and customize naming Snowplow tables that get created as part of the incremental process. Here you can get a better understanding of what kinds of behaviours the package supports, how those scenarios actually work in terms of implementation, and a better understanding of all of the variables that can be used for these macros to unlock the full customization capability that has been built into the Snowplow-utils base functionality. You can find some example projects in the following [repository](https://github.com/snowplow-incubator/dbt-example-project).
The general idea behind the Snowplow-utils base functionality is to be able to add custom identifiers, bespoke behavior, and customize naming Snowplow tables that get created as part of the incremental process. Here you can get a better understanding of what kinds of behaviors the package supports, how those scenarios actually work in terms of implementation, and a better understanding of all of the variables that can be used for these macros to unlock the full customization capability that has been built into the Snowplow-utils base functionality. You can find some example projects in the following [repository](https://github.com/snowplow-incubator/dbt-example-project).

## Preface
:::info
@@ -503,7 +503,7 @@ vars:

Note that you can simply add more entities or self-describing events to join by adding more dicts to the list.

Once you've added in the entities or self-describing events that you want to leverage, you can use `snowplow__custom_sql` to transform them and surface that in your `snowplow_base_events_this_run` table. Similiarly to the example for other warehouses, suppose you have a custom context called `contexts_com_mycompany_click_1` which contains a `id` that you want to concat with Snowplow's `domain_sessionid`. You can add that transformation by adding the following to your `dbt_project.yml`:
Once you've added in the entities or self-describing events that you want to leverage, you can use `snowplow__custom_sql` to transform them and surface that in your `snowplow_base_events_this_run` table. Similarly to the example for other warehouses, suppose you have a custom context called `contexts_com_mycompany_click_1` which contains a `id` that you want to concat with Snowplow's `domain_sessionid`. You can add that transformation by adding the following to your `dbt_project.yml`:

```yml title="dbt_project.yml"
vars:
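  # Illustrative sketch only — the exact expression depends on your warehouse
  # and on how the click entity is joined or aliased (see the examples above);
  # the 'click' alias here is hypothetical.
  snowplow__custom_sql: 'click.id || domain_sessionid as click_session_id'
```
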
@@ -51,6 +51,6 @@ This model consists of a series of macros that generate models, together culmina
- `snowplow_base_new_event_limits.sql`: This model keeps track of the upper and lower limit of events that are going to be considered, which is referenced downstream in other models to limit table scans.
- `snowplow_base_sessions_lifecycle_manifest.sql`: This model maintains a session lifecycle manifest, outlining for each session (based on the specified `session_identifier` and `user_identifier`) the start and end timestamp of the session.
- `snowplow_base_sessions_this_run.sql`: This model identifies which sessions are relevant to be processed for the current run, and picks only those sessions from the session lifecycle manifest outlined above.
- `snowplow_base_events_this_run.sql`: This model finally takes the output of the base sessions table and extracts all relevant events from the sessions identifed in the base sessions table, along with any custom contexts specified. This table can then be leveraged down the line by other Snowplow dbt packages or by a user's custom models.
- `snowplow_base_events_this_run.sql`: This model finally takes the output of the base sessions table and extracts all relevant events from the sessions identified in the base sessions table, along with any custom contexts specified. This table can then be leveraged down the line by other Snowplow dbt packages or by a user's custom models.

With these macros, you can specify which columns get used as timestamps for incremental data processing, and which identifiers are used on a `session` and `user` level. By default, the `session_tstamp` is the `collector_tstamp`, the `session_identifier` is `domain_sessionid`, and the `user_identifier` is `domain_userid`. For a more in-depth explanation into how you can customize these values, you can read the [Utils quickstart](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-quickstart/utils/index.md#6-setting-up-the-sessions-lifecycle-manifest-macro) docs.
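
As a hedged sketch of how those defaults might be customized (the argument names are assumptions based on the utils quickstart and may differ between package versions), the lifecycle manifest macro call could look roughly like this:

```jinja2
{# Hypothetical sketch — confirm the argument names against the utils quickstart
   for your package version before using. #}
{{ snowplow_utils.base_create_snowplow_sessions_lifecycle_manifest(
    session_identifiers=[{"schema": "atomic", "field": "domain_sessionid"}],
    user_identifiers=[{"schema": "atomic", "field": "domain_userid"}],
    session_timestamp='collector_tstamp'
) }}
```
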
@@ -40,7 +40,7 @@ import DbtPackageInstallation from "@site/docs/reusable/dbt-package-installation
## Setup

:::info
You can largely skip redundant copy + pasting by cloning the following dbt project repository that we have created in GitHub. You can find it [here](https://github.com/snowplow-incubator/dbt-example-project), and this has all of the boilerplate setup for you already. If you want to customise model names or parameter values, you can still follow the quickstart guide below to help you understand how to do that, and what changing each variable will mean for your models. Feel free to skip steps 1 and 2, however.
You can largely skip redundant copy + pasting by cloning the following dbt project repository that we have created in GitHub. You can find it [here](https://github.com/snowplow-incubator/dbt-example-project), and this has all of the boilerplate setup for you already. If you want to customize model names or parameter values, you can still follow the quickstart guide below to help you understand how to do that, and what changing each variable will mean for your models. Feel free to skip steps 1 and 2, however.

:::

@@ -242,7 +242,7 @@ If you have more than one session or user identifier, you can specify multiple e
`}
</code></pre>
For Redshift & Postgres we also introduce the `prefix` and `alias` fields, where `prefix` is the `prefix` that is put infront of each field name in the context, and `alias` is the table alias used upon joining. This can be useful when you are using custom SQL. As an example, using the above configuration we could access the `internal_user_id` field using the following SQL:
For Redshift & Postgres we also introduce the `prefix` and `alias` fields, where `prefix` is the `prefix` that is put in front of each field name in the context, and `alias` is the table alias used upon joining. This can be useful when you are using custom SQL. As an example, using the above configuration we could access the `internal_user_id` field using the following SQL:

```sql
mcc_iud.mcc_internal_user_id as internal_user_id,
```

@@ -283,7 +283,7 @@ For the `snowplow_base_sessions_this_run` model, you will need to add a post-hoo

Here the parameters passed to both macros are only used to point the macros at the right model names, so if you've chosen to modify any of the table names you should adjust them here accordingly. For the `base_quarantine_sessions` macro you simply pass the maximum session duration in days, taken from the `snowplow__max_session_days` variable, and the name of the `snowplow_base_quarantined_sessions` table, specified by the `snowplow__quarantined_sessions` variable.

For the `base_create_snowplow_sessions_this_run` macro call, you specify the name of the `lifecycle_manifest_table` and the `new_event_limits_table`. The boilerplate contains their default names, and so if you have not customised anything you can simply copy this code into your `snowplow_base_sessions_this_run` model.
For the `base_create_snowplow_sessions_this_run` macro call, you specify the name of the `lifecycle_manifest_table` and the `new_event_limits_table`. The boilerplate contains their default names, and so if you have not customized anything you can simply copy this code into your `snowplow_base_sessions_this_run` model.
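
Putting the two macro calls together, a sketch of the `snowplow_base_sessions_this_run` model with the default names might look roughly as follows; treat it as illustrative and check the example project boilerplate for the exact signatures.

```jinja2
{# Illustrative sketch assuming the default table names described above;
   confirm against the boilerplate in the example project. #}
{{
    config(
        post_hook=["{{ snowplow_utils.base_quarantine_sessions(var('snowplow__max_session_days', 3), var('snowplow__quarantined_sessions', 'snowplow_base_quarantined_sessions')) }}"]
    )
}}

{{ snowplow_utils.base_create_snowplow_sessions_this_run(
    lifecycle_manifest_table='snowplow_base_sessions_lifecycle_manifest',
    new_event_limits_table='snowplow_base_new_event_limits'
) }}
```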

### 8. Setting up the events this run macro
For the `snowplow_base_events_this_run` model, you will need to run the following two macros in your model:
@@ -310,7 +310,7 @@ For the `snowplow_base_events_this_run` model, you will need to run the followin
Here you once again have a number of parameters that the macro can take, and to get an in-depth explanation of each variable passed here, please refer to the [configuration](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-configuration/utils/index.md) page. The variables used here are largely either self-explanatory or overlapping with those in the [lifecycle manifest](/docs/modeling-your-data/modeling-your-data-with-dbt/dbt-quickstart/utils/index.md#6-setting-up-the-sessions-lifecycle-manifest-macro) section, except that you can now specify custom names for your `snowplow_base_sessions_this_run` table through the `snowplow__base_sessions` variable.

### 9. Modify your `dbt_project.yml`
To properly configure your dbt project to utilise and update the manifest tables correctly, you will need to add the following hooks to your `dbt_project.yml`
To properly configure your dbt project to utilize and update the manifest tables correctly, you will need to add the following hooks to your `dbt_project.yml`:

```yml
# Completely or partially remove models from the manifest during run start.
@@ -319,10 +319,16 @@ on-run-start:
# Update manifest table with last event consumed per successfully executed node/model
on-run-end:
- "{{ snowplow_utils.snowplow_incremental_post_hook() }}"
- "{{ snowplow_utils.snowplow_incremental_post_hook(package_name='snowplow', incremental_manifest_table_name=var('snowplow__incremental_manifest', 'snowplow_incremental_manifest'), base_events_this_run_table_name='snowplow_base_events_this_run', session_timestamp=var('snowplow__session_timestamp')) }}"
```

The `snowplow_delete_from_manifest` macro is called to remove models from manifest if specified using the `models_to_remove` variable, in case of a partial or full refresh. The `snowplow_incremental_post_hook` is used to update the manifest table with the timestamp of the last event consumed successfully for each Snowplow model.
The `snowplow_delete_from_manifest` macro is called to remove models from the manifest if specified using the `models_to_remove` variable, in case of a partial or full refresh. The `snowplow_incremental_post_hook` is used to update the manifest table with the timestamp of the last event consumed successfully for each Snowplow incremental model - make sure to change the `base_events_this_run_table_name` if you used a different table name.

:::tip

The `package_name` variable here is not the name of your project; instead, it is used to identify your incremental models, which should be tagged with `<package_name>_incremental`.

:::
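
For example (the project and directory names here are hypothetical), with `package_name='snowplow'` as in the hook above, custom models would carry the matching tag in `dbt_project.yml`:

```yml
# Illustrative only: the tag must match the package_name used in the post hook.
models:
  my_dbt_project:
    my_custom_models:
      +tags: snowplow_incremental
```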

### 10. Run your models

4 changes: 2 additions & 2 deletions src/componentVersions.js
@@ -43,9 +43,9 @@ export const versions = {
dbtSnowplowUnified: '0.1.0',
dbtSnowplowWeb: '0.16.0',
dbtSnowplowMobile: '0.7.3',
dbtSnowplowUtils: '0.15.0',
dbtSnowplowUtils: '0.15.1',
dbtSnowplowMediaPlayer: '0.6.0',
dbtSnowplowNormalize: '0.3.2',
dbtSnowplowNormalize: '0.3.3',
dbtSnowplowFractribution: '0.3.5',
dbtSnowplowEcommerce: '0.5.3',

