Incremental models #4834

Merged: 32 commits merged into main on May 15, 2024

Conversation

@begelundmuller (Contributor) commented May 6, 2024

This PR adds initial support for incremental models across multiple connectors. Specifically, it:

  • Adds optional incremental state to models that can be referenced in a model's SQL using templating
  • Adds optional support for outputting the results of a model to another connector
  • Refactors the internal abstractions such that:
    • The model reconciler is no longer tightly coupled to outputting data to an OLAP table
    • Connectors can implement two new interfaces to support modeling:
      • AsModelExecutor for running a model and materializing it to the same or another connector
      • AsModelManager for managing model output in the connector (e.g. checks/renames/deletes)
    • The new interfaces are based on the current Transporter interface and will eventually replace it. This enables unifying sources and models.
    • Models now return result_connector and result_properties instead of a table. This enables models that output data to object stores (i.e. produce a path instead of a table).
  • Initially has executors for the following input/output connector combinations:
    • DuckDB -> DuckDB
    • ClickHouse -> ClickHouse
    • SQLStore (BigQuery, Snowflake, Athena, Redshift) -> DuckDB
  • For DuckDB models, it supports two new output properties:
    • incremental_strategy: {append,merge} for configuring how data is incrementally inserted into the output table
    • unique_key: [columns...] for configuring the key to merge on when the incremental_strategy is merge
  • When a model is triggered, we determine whether to run a full or incremental execution based on the following rules:
    • Full run: if the model output does not exist, or the model's properties have changed, or the model's refs' properties have changed
    • Incremental run: for scheduled refreshes, or if one of the model's refs has been refreshed

Here is a detailed example model that incrementally processes data from an upstream source/model:

-- models/bar.sql
-- @incremental: true
SELECT *, current_timestamp as inserted_on
FROM {{ ref "bar_source" }}
{{ if incremental }} WHERE event_time > (SELECT MAX(event_time) FROM bar) {{ end }}
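
For illustration (not part of the original PR text), on an incremental run the template above would render roughly as the query below; on a full run the WHERE clause is simply omitted. The table names come from the example itself, and this is only a sketch of the templating behavior described above:

-- Hypothetical rendered query for an incremental run of models/bar.sql,
-- assuming "bar_source" resolves to a table of the same name:
SELECT *, current_timestamp as inserted_on
FROM bar_source
WHERE event_time > (SELECT MAX(event_time) FROM bar)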

Here is a detailed example model that incrementally ingests data from BigQuery into DuckDB:

# models/foo.yaml
refresh:
  cron: 0 * * * *

connector: bigquery
sql: >
  SELECT *
  FROM my_data
  {{ if incremental }} WHERE updated_on > '{{ .state.max_updated_on }}' {{ end }}

incremental: true
state:
  sql: SELECT MAX(updated_on) as max_updated_on FROM foo

output:
  connector: duckdb
  incremental_strategy: merge
  unique_key: [event_id]
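
For illustration (again, not part of the original PR text), suppose the state query last returned max_updated_on = '2024-05-01 00:00:00' (a hypothetical value). An incremental run would then render the BigQuery SQL roughly as:

-- Hypothetical rendered query for an incremental run of models/foo.yaml,
-- with .state.max_updated_on substituted from the previous state query:
SELECT *
FROM my_data
WHERE updated_on > '2024-05-01 00:00:00'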

The changes in this PR implement an initial subset of this proposal. It closes #4746.

@begelundmuller begelundmuller self-assigned this May 6, 2024
@begelundmuller begelundmuller changed the title Incremental ingestion scaffolding Incremental, multi-connector models May 13, 2024
@begelundmuller begelundmuller changed the title Incremental, multi-connector models Incremental models May 13, 2024
@begelundmuller begelundmuller marked this pull request as ready for review May 13, 2024 13:31
@nishantmonu51 (Collaborator) commented

/review

@rill-dev (Collaborator) commented May 14, 2024

Code Review Agent Run Status

  • AI Based Review: Successful

Code Review Overview

  • Summary: This PR introduces significant changes to support incremental models across multiple connectors, focusing on enhancing flexibility and functionality in data handling and processing. It includes major refactoring and deprecation of features, particularly in the handling of metrics and catalog definitions, suggesting a shift towards more dynamic data validation and model management strategies.
  • Code change type: Refactoring, Feature Addition, Documentation
  • Unit tests added: True
  • Estimated effort to review (1-5, lower is better): 4, due to the extensive changes across multiple files and systems, including refactoring and new feature integrations, which require careful review to ensure compatibility and functionality.

The Bito AI Code Review Agent successfully reviewed 70 files and discovered 9 issues. Please review these issues along with suggested fixes in the Changed Files.

High-level Feedback

Ensure comprehensive documentation for all new features and refactored components. Consider implementing more robust error handling and validation mechanisms to prevent potential security risks, such as SQL injection. Regularly update and maintain unit tests to cover new changes and scenarios.

@k-anshul (Member) left a comment

Overall sounds good to me, but I am somewhat concerned about data loss in the case of the in-place merge strategy.

The inline discussion below refers to this excerpt from the diff:

}

// Insert the new data into the target table
return c.execWithLimits(ctx, &drivers.Statement{
Member

Is it a conscious decision that query can be cancelled after older data has been dropped ?

@begelundmuller (Contributor Author) replied

It's a good point, and admittedly the merge support is somewhat experimental at this point (I'm also not sure how it will perform).

Doing a delete+insert is a common merge strategy for databases without native merge support. Ideally though, we would do it in a transaction, but I wonder if that might give us other problems with DuckDB.

For external table storage, I added the inPlace == false option, where it takes a copy of the table to prevent this issue. It's used for DuckDB->DuckDB models, but unfortunately the SQLStore->DuckDB transporter currently does streaming writes, so it couldn't be used there.

I'm planning more refactors here to eventually enable safe writes (among other things), but for now this PR is growing big and I would prefer to get it merged. Since the merge support is experimental and doesn't break any existing features, I think it should be fine.

If you would prefer, we could try adding a transaction here for the short term?
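
For readers: a delete+insert merge keyed on a unique_key typically boils down to something like the sketch below. The table names are hypothetical and the statements are illustrative only, not necessarily what this PR generates; the key (event_id) matches the unique_key from the example above.

-- Hypothetical delete+insert merge of staged rows into the target table,
-- keyed on unique_key = [event_id]:
DELETE FROM foo
WHERE event_id IN (SELECT event_id FROM foo_staging);

INSERT INTO foo
SELECT * FROM foo_staging;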

@begelundmuller (Contributor Author) commented May 15, 2024

(And to address the question of context specifically – it's probably best to use the main context since these operations can run for a long time. And in terms of error scenarios, using a new context would not address the issue of the second query failing or the process being terminated.)

@k-anshul (Member) commented May 15, 2024

Yeah, given it's not an issue for DuckDB->DuckDB models, which are also going to be the most common ones, we can handle merge for SQL stores in later PRs.

@begelundmuller begelundmuller merged commit 7bcd797 into main May 15, 2024
7 checks passed
@begelundmuller begelundmuller deleted the begelundmuller/incremental-models branch May 15, 2024 09:41
Development

Successfully merging this pull request may close these issues.

Feat(runtime): Implement incremental ETL