Incremental models #4834

Merged: 32 commits merged into main on May 15, 2024

Conversation

@begelundmuller (Contributor) commented May 6, 2024

This PR adds initial support for incremental models across multiple connectors. Specifically, it:

  • Adds optional incremental state to models that can be referenced in a model's SQL using templating
  • Adds optional support for outputting the results of a model to another connector
  • Refactors the internal abstractions such that:
    • The model reconciler is no longer tightly coupled to outputting data to an OLAP table
    • Connectors can implement two new interfaces to support modeling:
      • AsModelExecutor for running a model and materializing it to the same or another connector
      • AsModelManager for managing model output in the connector (e.g. checks/renames/deletes)
    • The new interfaces are based on the current Transporter interface and will eventually replace it. This enables unifying sources and models.
    • Models now return result_connector and result_properties instead of a table. This enables models that output data to object stores (i.e. produce a path instead of a table).
  • Initially has executors for the following input/output connector combinations:
    • DuckDB -> DuckDB
    • ClickHouse -> ClickHouse
    • SQLStore (BigQuery, Snowflake, Athena, Redshift) -> DuckDB
  • For DuckDB models, it supports two new output properties:
    • incremental_strategy: {append,merge} for configuring how data is incrementally inserted into the output table
    • unique_key: [columns...] for configuring the key to merge on when the incremental_strategy is merge
  • When a model is triggered, we determine whether to run a full or incremental execution based on the following rules:
    • Full run: if the model output does not exist, or the model's properties have changed, or the model's refs' properties have changed
    • Incremental run: for scheduled refreshes, or if one of the model's refs has been refreshed

Here is a detailed example model that incrementally processes data from an upstream source/model:

-- models/bar.sql
-- @incremental: true
SELECT *, current_timestamp as inserted_on
FROM {{ ref "bar_source" }}
{{ if incremental }} WHERE event_time > (SELECT MAX(event_time) FROM bar) {{ end }}
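
For illustration (not part of the original PR text), on an incremental run the template above would render roughly as the query below; on a full run the WHERE clause is simply omitted. The table names come from the example itself, and this is only a sketch of the templating behavior described above:

-- Hypothetical rendered query for an incremental run of models/bar.sql,
-- assuming "bar_source" resolves to a table of the same name:
SELECT *, current_timestamp as inserted_on
FROM bar_source
WHERE event_time > (SELECT MAX(event_time) FROM bar)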

Here is a detailed example model that incrementally ingests data from BigQuery into DuckDB:

# models/foo.yaml
refresh:
  cron: 0 * * * *

connector: bigquery
sql: >
  SELECT *
  FROM my_data
  {{ if incremental }} WHERE updated_on > '{{ .state.max_updated_on }}' {{ end }}

incremental: true
state:
  sql: SELECT MAX(updated_on) as max_updated_on FROM foo

output:
  connector: duckdb
  incremental_strategy: merge
  unique_key: [event_id]
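
For illustration (again, not part of the original PR text), suppose the state query last returned max_updated_on = '2024-05-01 00:00:00' (a hypothetical value). An incremental run would then render the BigQuery SQL roughly as:

-- Hypothetical rendered query for an incremental run of models/foo.yaml,
-- with .state.max_updated_on substituted from the previous state query:
SELECT *
FROM my_data
WHERE updated_on > '2024-05-01 00:00:00'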

The changes in this PR implement an initial subset of this proposal. It closes #4746.

@begelundmuller begelundmuller self-assigned this May 6, 2024
@begelundmuller begelundmuller changed the title Incremental ingestion scaffolding Incremental, multi-connector models May 13, 2024
@begelundmuller begelundmuller changed the title Incremental, multi-connector models Incremental models May 13, 2024
@begelundmuller begelundmuller marked this pull request as ready for review May 13, 2024 13:31
@nishantmonu51 (Collaborator) commented

/review

@rill-dev (Collaborator) commented May 14, 2024

Code Review Agent Run Status

  • AI Based Review: Successful

Code Review Overview

  • Summary: This PR introduces significant changes to support incremental models across multiple connectors, focusing on enhancing flexibility and functionality in data handling and processing. It includes major refactoring and deprecation of features, particularly in the handling of metrics and catalog definitions, suggesting a shift towards more dynamic data validation and model management strategies.
  • Code change type: Refactoring, Feature Addition, Documentation
  • Unit tests added: True
  • Estimated effort to review (1-5, lower is better): 4, due to the extensive changes across multiple files and systems, including refactoring and new feature integrations, which require careful review to ensure compatibility and functionality.

The Bito AI Code Review Agent successfully reviewed 70 files and discovered 9 issues. Please review these issues along with suggested fixes in the Changed Files.

High-level Feedback

Ensure comprehensive documentation for all new features and refactored components. Consider implementing more robust error handling and validation mechanisms to prevent potential security risks, such as SQL injection. Regularly update and maintain unit tests to cover new changes and scenarios.

@k-anshul (Member) left a comment

Overall sounds good to me, but I am somewhat concerned about data loss in the case of the in-place merge strategy.

The inline discussion below refers to this excerpt from the diff:

}

// Insert the new data into the target table
return c.execWithLimits(ctx, &drivers.Statement{
Member

Is it a conscious decision that query can be cancelled after older data has been dropped ?

@begelundmuller (Contributor Author) replied

It's a good point, and admittedly the merge support is somewhat experimental at this point (I'm also not sure how it will perform).

Doing a delete+insert is a common merge strategy for databases without native merge support. Ideally though, we would do it in a transaction, but I wonder if that might give us other problems with DuckDB.

For external table storage, I added the inPlace == false option, where it takes a copy of the table to prevent this issue. It's used for DuckDB->DuckDB models, but unfortunately the SQLStore->DuckDB transporter currently does streaming writes, so it couldn't be used there.

I'm planning more refactors here to eventually enable safe writes (among other things), but for now this PR is growing big and I would prefer to get it merged. Since the merge support is experimental and doesn't break any existing features, I think it should be fine.

If you would prefer, we could try adding a transaction here for the short term?
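
For readers: a delete+insert merge keyed on a unique_key typically boils down to something like the sketch below. The table names are hypothetical and the statements are illustrative only, not necessarily what this PR generates; the key (event_id) matches the unique_key from the example above.

-- Hypothetical delete+insert merge of staged rows into the target table,
-- keyed on unique_key = [event_id]:
DELETE FROM foo
WHERE event_id IN (SELECT event_id FROM foo_staging);

INSERT INTO foo
SELECT * FROM foo_staging;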

@begelundmuller (Contributor Author) commented May 15, 2024

(And to address the question of context specifically – it's probably best to use the main context since these operations can run for a long time. And in terms of error scenarios, using a new context would not address the issue of the second query failing or the process being terminated.)

@k-anshul (Member) commented May 15, 2024

Yeah, given it's not an issue for DuckDB->DuckDB models, which are also going to be the most common ones, we can handle merge for SQL stores in later PRs.

@begelundmuller begelundmuller merged commit 7bcd797 into main May 15, 2024
7 checks passed
@begelundmuller begelundmuller deleted the begelundmuller/incremental-models branch May 15, 2024 09:41
Development

Successfully merging this pull request may close these issues.

Feat(runtime): Implement incremental ETL