# Microbatch - Adapter Maintainers Guide #371
MichelleArk started this conversation in General
## What is Microbatch?
As part of `dbt-core==1.9.0` and `dbt-adapters==1.10.3`, we have introduced support for a new built-in `incremental_strategy` called `microbatch`. This new incremental strategy materializes large, event-oriented datasets in an opinionated and ergonomic way using time ranges. Our “happy path” use case is “I need to process new (partitions of) data (each hour, day, etc.), and efficiently upsert them into an existing table.” For more detail, see the microbatch beta documentation. Additional context is also available in the GH Discussion and GH Epic.
## Support Considerations
This is opt-in functionality, which means you can choose not to support this in your adapter. For an overview of the minimal work required to support the new microbatch incremental strategy, refer to the `dbt-redshift` implementation here: https://github.com/dbt-labs/dbt-redshift/pull/924/files. The rest of this section dives deeper into the details of support considerations, using `dbt-snowflake` and `dbt-bigquery` as examples.

### Adapter Requirements
To support the `microbatch` incremental strategy, each adapter is responsible for:

- Extending the `BaseAdapter.valid_incremental_strategies` method to include `"microbatch"` in its result (example); a minimal sketch follows this list.
- Implementing a `<my-adapter>__get_incremental_microbatch_sql` macro (example).

In practice, this may look most similar to an existing implementation of either the `insert_overwrite` (preferred, if available) or `delete+insert` strategy.
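As a rough illustration of the first requirement, the Python-side change might look like the following sketch. The class name and the set of existing strategies are placeholders; the linked dbt-redshift and dbt-snowflake examples are the authoritative references.

```python
# dbt/adapters/<my_adapter>/impl.py -- illustrative sketch only; names are placeholders.
from typing import List

from dbt.adapters.sql import SQLAdapter


class MyAdapterAdapter(SQLAdapter):
    def valid_incremental_strategies(self) -> List[str]:
        # Advertise "microbatch" alongside whichever strategies the adapter already supports.
        return ["append", "delete+insert", "merge", "microbatch"]
```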
For each batch materialized by dbt, the following new properties are available in the jinja context via the new `model.batch` attribute, which is only available when running microbatch models. When not in a microbatch model context, `model.batch` will be `None` and access to its sub-attributes is unsafe.

- `model.batch.event_time_start`: datetime
- `model.batch.event_time_end`: datetime
- `model.batch.id`: string representation of `model.batch.event_time_start`, with no spaces, `-`, or `_` characters (see the illustration after this list).

`model.batch.event_time_start` and `model.batch.event_time_end` represent the time bounds of the running batch, and should be used to filter any `delete` or `merge` statements in the strategy implementation. This is necessary both for efficiency and for correctness. `model.batch.id` may be helpful for logging purposes, and is baked into the default `make_temp_relation` macro, acting as an additional suffix to `dbt_tmp` tables, so that each batch gets an isolated temp table (implementation here).
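For intuition about the documented shape of `model.batch.id`, here is a small illustration of deriving a compact string from `event_time_start`. This is not dbt-core's implementation, and the exact format may vary (for example, with `batch_size`):

```python
# Illustration only: a compact string form of event_time_start with no spaces, '-', or '_'.
# The exact id format dbt-core produces may differ (e.g. by batch_size granularity).
from datetime import datetime

event_time_start = datetime(2025, 1, 15)
illustrative_batch_id = event_time_start.strftime("%Y%m%d")
print(illustrative_batch_id)  # -> 20250115
```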
It may also be necessary to override the base implementation of the new method `BaseRelation._render_event_time_filtered`. This method accepts an `EventTimeFilter` from `dbt-core`, and generates the appropriate SQL to wrap a `ref` statement with `where` filters using the `event_time_start`, `event_time_end`, and `event_time`.
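A rough sketch of what such an override can look like is below. The adapter and dialect are hypothetical, and the `field_name`/`start`/`end` attribute names on the filter object are assumptions to verify against your pinned `dbt-adapters` version; the dbt-bigquery PR linked in the examples list is the authoritative reference.

```python
# dbt/adapters/<my_adapter>/relation.py -- illustrative sketch only; not the dbt-bigquery code.
from dataclasses import dataclass

from dbt.adapters.base.relation import BaseRelation


@dataclass(frozen=True, eq=False, repr=False)
class MyAdapterRelation(BaseRelation):
    def _render_event_time_filtered(self, event_time_filter) -> str:
        # event_time_filter is the EventTimeFilter passed in by dbt; the field_name/start/end
        # attribute names used here are assumptions to verify against your dbt-adapters version.
        conditions = []
        if event_time_filter.start:
            # Hypothetical dialect that wants explicit timestamp casts on literals.
            conditions.append(
                f"{event_time_filter.field_name} >= timestamp '{event_time_filter.start}'"
            )
        if event_time_filter.end:
            conditions.append(
                f"{event_time_filter.field_name} < timestamp '{event_time_filter.end}'"
            )
        return " and ".join(conditions)
```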
Examples:

- `valid_incremental_strategies` implementation: https://github.com/dbt-labs/dbt-snowflake/blob/0b24a5a2d311ffec5f996ca28532076a637aa6b3/dbt/adapters/snowflake/impl.py#L426-L427
- `snowflake__get_incremental_microbatch_sql` macro implementation
- `BaseRelation._render_event_time_filtered`: https://github.com/dbt-labs/dbt-bigquery/pull/1422/files

### Opting in to concurrency support
It is possible to opt in to supporting concurrency for your adapter's microbatch strategy. The benefits of opting in are primarily realized during `--full-refresh` runs, whereby running batches concurrently (with respect to the global `--threads` variable) leads to significantly reduced build times for users.

Determine whether your `<my-adapter>__get_incremental_microbatch_sql` macro is safe to run concurrently, and set the `MicrobatchConcurrency` capability to `True` (a sketch follows the list below). By default, `MicrobatchConcurrency` is set to `False`, which directs dbt to execute each batch in serial. Common concurrency considerations are:

- Ensuring that temporary tables created by the strategy are unique per batch. It is recommended to use the `model.batch.id` jinja global to do so, or to use the global `make_temp_relation` macro, which will do this automatically if the default `suffix` is provided.
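A minimal sketch of declaring the capability, modeled on how adapters already declare capabilities such as `SchemaMetadataByRelations`; treat the exact class attribute and enum usage as assumptions to verify against your pinned `dbt-adapters` version:

```python
# dbt/adapters/<my_adapter>/impl.py -- illustrative sketch only.
from dbt.adapters.capability import (
    Capability,
    CapabilityDict,
    CapabilitySupport,
    Support,
)
from dbt.adapters.sql import SQLAdapter


class MyAdapterAdapter(SQLAdapter):
    # Declare that this adapter's microbatch strategy is safe to run concurrently.
    _capabilities = CapabilityDict(
        {Capability.MicrobatchConcurrency: CapabilitySupport(support=Support.Full)}
    )
```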
Beyond ensuring correctness of the strategy across threads, it is recommended to benchmark the performance implications of supporting `MicrobatchConcurrency` by simulating a large input to a microbatch model and running a `--full-refresh` “backfill” of the microbatch model with 1 (serial execution), 4, and 8 threads. The overall runtime may go down significantly, but the total time executed against the warehouse may increase, because the platform will need to manage merging into the main dataset safely (e.g. via locking).

Users are able to opt in and out of concurrency support at the `model.config` level via the `batch_concurrency: bool` configuration, even if your adapter supports `MicrobatchConcurrency`. This means the end user can determine whether any tradeoffs associated with concurrent microbatch invocations are acceptable for their use cases.

For reference benchmarking, please refer to: dbt-labs/dbt-snowflake#1259 (comment)
## Testing
A new base test, `BaseMicrobatch`, has been implemented for concrete adapters to inherit and test against. It is possible to override `microbatch_model_sql`, `input_model_sql`, and `insert_two_rows_sql` via fixtures; a rough sketch follows the links below.

- Base test: https://github.com/dbt-labs/dbt-adapters/blob/main/dbt-tests-adapter/dbt/tests/adapter/incremental/test_incremental_microbatch.py
- Example override in dbt-snowflake: https://github.com/dbt-labs/dbt-snowflake/pull/1179/files#diff-54c15a3b4b6e274116439d3d4ac9416141d9bd39f9e6ffedc0724ab304cb81eb
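A hypothetical functional test module inheriting the base test might look like this; the file path and the overridden model SQL are illustrative, and you only need to override the fixtures whose defaults don't fit your warehouse's dialect:

```python
# tests/functional/adapter/test_incremental_microbatch.py -- illustrative sketch only.
import pytest

from dbt.tests.adapter.incremental.test_incremental_microbatch import BaseMicrobatch

# Hypothetical override of the microbatch model fixture; adjust the SQL to your dialect.
_MICROBATCH_MODEL_SQL = """
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    unique_key='id',
    event_time='event_time',
    batch_size='day',
    begin=modules.datetime.datetime(2020, 1, 1, 0, 0, 0)
) }}
select * from {{ ref('input_model') }}
"""


class TestMicrobatchMyAdapter(BaseMicrobatch):
    @pytest.fixture(scope="class")
    def microbatch_model_sql(self) -> str:
        # Override only when the stock fixture's SQL doesn't work on your warehouse.
        return _MICROBATCH_MODEL_SQL
```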
## Behaviour Flag: `require_batched_execution_for_custom_microbatch_strategy`
Lastly, `dbt-core` has introduced a new global behavior flag, `require_batched_execution_for_custom_microbatch_strategy`. This behavior flag is configurable under `flags` in `dbt_project.yml`, and defaults to `False`. It is intended to protect users that have created a custom incremental strategy called 'microbatch', since we are now effectively claiming that name as a built-in / reserved one.

By default, projects with a custom incremental strategy called 'microbatch' will not run through the new microbatch execution framework, whereby dbt-core computes individual batches and resolves `ref` and `source` calls with an `EventTimeFilter`. If the user turns this flag on, they are effectively opting in to using their custom 'microbatch' strategy in combination with dbt-core's new execution framework.

Documentation on this flag is available here: https://docs.getdbt.com/reference/global-configs/behavior-changes#custom-microbatch-strategy