Incremental models #4834
Conversation
/review
Code Review Agent Run Status

Code Review Overview

High-level Feedback: Ensure comprehensive documentation for all new features and refactored components. Consider implementing more robust error handling and validation mechanisms to prevent potential security risks, such as SQL injection. Regularly update and maintain unit tests to cover new changes and scenarios.
Overall sounds good to me, but I am somewhat concerned about data loss in the case of the InPlace `merge` strategy.
```go
}

// Insert the new data into the target table
return c.execWithLimits(ctx, &drivers.Statement{
```
Is it a conscious decision that the query can be cancelled after the older data has been dropped?
It's a good point, and admittedly the `merge` support is somewhat experimental at this point (I'm also not sure how it will perform).

Doing a delete+insert is a common merge strategy for databases without native merge support. Ideally, we would do it in a transaction, but I wonder if that might give us other problems with DuckDB.

For external table storage, I added the `inPlace == false` option, where it takes a copy of the table to prevent this issue. It's used for DuckDB->DuckDB models, but unfortunately the SQLStore->DuckDB transporter currently does streaming writes, so I couldn't use it there.

I'm planning more refactors here to eventually enable safe writes (among other things), but for now this PR is growing big and I would prefer to get it merged. Since the merge support is experimental and doesn't break any existing features, I think it should be fine.
If you would prefer, we could try adding a transaction here for the short term?
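(For illustration, a minimal sketch of what a short-term transactional delete+insert could look like, written against plain `database/sql` rather than Rill's driver API; the `target`/`staging` tables and `id` key are hypothetical, not code from this PR:)

```go
package merge

import (
	"context"
	"database/sql"
	"fmt"
)

// mergeDeleteInsert emulates a merge for engines without native MERGE
// support: within a single transaction, delete the target rows whose unique
// key appears in the staging data, then insert the staging rows. If either
// statement fails, or the context is cancelled midway, the rollback restores
// the target table to its previous state.
func mergeDeleteInsert(ctx context.Context, db *sql.DB) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit has succeeded

	// Drop the old versions of the incoming rows.
	if _, err := tx.ExecContext(ctx,
		`DELETE FROM target WHERE id IN (SELECT id FROM staging)`); err != nil {
		return fmt.Errorf("merge delete: %w", err)
	}

	// Insert the new data into the target table.
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO target SELECT * FROM staging`); err != nil {
		return fmt.Errorf("merge insert: %w", err)
	}

	return tx.Commit()
}
```

Whether DuckDB's transaction semantics would make this safe in practice is exactly the open question above.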
(And to address the question of context specifically – it's probably best to use the main context since these operations can run for a long time. And in terms of error scenarios, using a new context would not address the issue of the second query failing or the process being terminated.)
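(A small illustration of that trade-off: since Go 1.21, `context.WithoutCancel` can detach the second statement from caller cancellation, but as noted it still does nothing for a failing insert or a killed process. The `Statement` and `exec` names below are placeholders standing in for `drivers.Statement` and `c.execWithLimits`, not Rill's actual API:)

```go
package sketch

import "context"

// Statement and exec are hypothetical stand-ins for the driver types used
// in the diff above.
type Statement struct{ Query string }

func exec(ctx context.Context, stmt *Statement) error { return nil }

// insertDetached runs the insert on a context that inherits values from ctx
// but ignores its cancellation. This narrows the window where a caller-side
// cancel strands the table between the DELETE and the INSERT, but it does
// not help if the insert itself fails or the process dies, which is why a
// transaction (or the inPlace == false copy) is the more complete fix.
func insertDetached(ctx context.Context, insertQuery string) error {
	return exec(context.WithoutCancel(ctx), &Statement{Query: insertQuery})
}
```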
Yeah, given it's not an issue for DuckDB->DuckDB models, which are also going to be the most common ones, we can handle `merge` for SQL stores in later PRs.
This PR adds initial support for incremental models across multiple connectors. Specifically, it:

- Adds `AsModelExecutor` for running a model and materializing it to the same or another connector
- Adds `AsModelManager` for managing model output in the connector (e.g. checks/renames/deletes)
- The new interfaces are similar to the `Transporter` interface and will eventually replace it. This enables unifying sources and models.
- Tracks model results as a `result_connector` and `result_properties` instead of a `table`. This enables models that output data to object stores (i.e. produce a `path` instead of a `table`).
- Adds `incremental_strategy: {append,merge}` for configuring how data is incrementally inserted into the output table
- Adds `unique_key: [columns...]` for configuring a key to use when the `incremental_strategy` is `merge`
Here is a detailed example model that incrementally processes data from an upstream source/model:
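(The original example is not reproduced here. As a stand-in, a minimal sketch of what such a model might look like; the file name, table names, state query, and templating syntax are all assumptions, with only `incremental_strategy` and `unique_key` taken from this PR:)

```yaml
# models/events_incremental.yaml (hypothetical example)
incremental: true

# Assumed state query tracking the high-water mark of already-processed rows.
state:
  sql: SELECT MAX(event_time) AS max_event_time FROM events_incremental

# On incremental runs, only scan rows newer than the stored high-water mark.
sql: >
  SELECT * FROM upstream_events
  {{ if incremental }} WHERE event_time > '{{ .state.max_event_time }}' {{ end }}

# From this PR: merge new rows into the output table by unique key.
incremental_strategy: merge
unique_key: [event_id]
```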
Here is a detailed example model that incrementally ingests data from BigQuery into DuckDB:
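(Again a sketch only; the connector and output keys and all names below are assumptions rather than the example shipped with this PR:)

```yaml
# models/bq_events.yaml (hypothetical example)
incremental: true

# Read from BigQuery; project/dataset/table names are placeholders.
connector: bigquery
sql: >
  SELECT * FROM my_project.my_dataset.events
  {{ if incremental }} WHERE updated_on > '{{ .state.max_updated_on }}' {{ end }}

state:
  sql: SELECT MAX(updated_on) AS max_updated_on FROM bq_events

# Materialize the result into DuckDB, appending new rows on each run.
output:
  connector: duckdb
incremental_strategy: append
```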
The changes in this PR implement an initial subset of this proposal. It closes #4746.