auto full-refresh when schema changes for incremental models #320
I'm interested in working on this. Conceptually, I assumed that after compilation the code would need to issue a query to the target database to get the list of existing columns, then compare that to the sqlparse-compiled AST of the model, and then conditionally do something like
it looks like there's a If this was run directly in the ModelRunner, would that take care of both the normal dbt run case as well as the single-model dbt run --models=$foo case? Any other edge cases I should be aware of?
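The column-comparison idea above can be sketched very simply. This is an illustrative sketch only, not dbt code: it assumes the caller has already fetched the existing columns from the target database (e.g. via information_schema) and the columns the compiled model would produce.

```python
# Hypothetical sketch: decide whether an incremental model needs a full
# refresh by comparing the destination table's columns to the model's.
# How the two column lists are obtained is left to the caller.

def needs_full_refresh(existing_columns, model_columns):
    """Return True when the column sets differ (a schema change)."""
    return set(existing_columns) != set(model_columns)

# The target table has (id, name); the new model version adds `email`
print(needs_full_refresh(["id", "name"], ["id", "name", "email"]))  # True
# Column order alone doesn't count as a schema change here
print(needs_full_refresh(["id", "name"], ["name", "id"]))           # False
```

Comparing sets rather than lists deliberately ignores column ordering; a stricter implementation might also compare data types.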
@adamhaney cool! I think parsing the model SQL probably isn't the way to go here -- I can imagine things like

We're in the process of making dbt less aware of the different materialization options. As such, the code for the incremental materialization is entirely defined here. So ideally, any incremental-specific code will also be defined inside of that incremental materialization definition. If needed, we can totally make changes to dbt to provide useful context variables in the incremental materialization to support functionality like this.

I've been thinking about MD5ing the model contents and comparing against the previous value stored in the

I'm not necessarily certain that's the best option here, but it's where my head is at. Do you buy all of that?
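The MD5 idea above can be sketched in a few lines. Where the previous hash would be stored is left abstract here (the thread later discusses a persistence layer); the function names are illustrative, not part of dbt.

```python
# Minimal sketch of the proposal: hash the model's SQL and compare it to a
# hash stored on the previous run. A changed hash signals a full refresh.
import hashlib

def model_hash(sql):
    """MD5 digest of the model's raw SQL text."""
    return hashlib.md5(sql.encode("utf-8")).hexdigest()

def model_changed(sql, previous_hash):
    # No stored hash means a first run, which is effectively a full refresh
    return previous_hash is None or model_hash(sql) != previous_hash

old = model_hash("select id, name from users")
print(model_changed("select id, name from users", old))         # False
print(model_changed("select id, name, email from users", old))  # True
```

Note that hashing the raw SQL means any edit triggers a refresh, including comment or whitespace changes that don't actually alter the schema.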
I like the idea of being able to automatically do a full refresh when a model changes. We may also want to do a full refresh of all downstream models too. E.g., the parent selects columns a, b & c, then is changed to select a, b, c & d, and the child does a select * from parent. Having said that, I'd probably avoid doing select * in my models, so that might not be required. The proposed MD5ing of the model content, and exposing the fact that it is different via Jinja, sounds like a great approach.
thanks for your input @DavidAmesPup! I don't think this issue is on our near-term roadmap, but I think it would be super valuable nonetheless. Happy to point you (or anyone reading this comment) in the right direction if you're interested in contributing a PR!
Contributing a PR sounds like a great way for us to get some experience with the dbt codebase during our POC phase. Sure, please point me in the right direction. We are not a Python shop, but I'm sure we will manage :)
Hey @drewbanin - we are going to pick this item up in our next sprint or two - do you have any pointers or design requests? |
hey @DavidAmesPup, awesome, glad to hear it! To be honest, I've given this some thought and I'm not sure of the perfect answer. I'd love to hear a little more about what you're thinking too, but here are my updated thoughts: I think the

Instead, we've been talking a lot about adding a persistence layer to dbt in the form of a database. That could either look like a SQLite db, or the data warehouse itself. This database would contain information about when models last ran, who ran them, a hash of the model SQL, etc. I think we'd be able to use this database to understand whether or not a full refresh is in order for incremental models.

Obviously, adding a persistence layer is quite an undertaking! We're definitely going to work on this in the near-to-medium term, but it is probably more involved than a simple

Happy to discuss here, and keen to hear what you think
hey @drewbanin Agreed that

Rather than trying to hack around inside of dbt to get this to work, we are thinking of writing a simple external util that can look at a folder of models and tell which ones have changed (it will persist a hash of the previous contents to its own persistence store). In our orchestration logic, we can run this util; if it detects any model changes, we will execute a dbt run --full-refresh with the list of models that changed, and then do a normal dbt run to get the latest data for any incremental models not rebuilt.

The biggest challenge that I can see is for us to know which models were built successfully and which failed. Obviously, if we did a full refresh on a model and the full refresh failed, then we would want to run the full refresh on that model the next time around. I'd rather implement this without parsing the console output from dbt - I can see a couple of options
We would like some guidance as to which you think would be the most useful option for dbt - we can then raise an issue, submit a PR, etc. We are also happy to hear other options that you may have. Cheers,
I think that sounds like a really great solution @DavidAmesPup! You're in luck on both counts:
```
>>> import dbt
>>> import dbt.main
>>> res, success = dbt.main.handle_and_check(['run'])
>>> res[0]
<dbt.node_runners.RunModelResult object at 0x11237b358>
>>> res[0].node
{'name': 'some_model', ...}
>>> res[0].node['config']['materialized']
'incremental'
>>> res[0].errored or res[0].failed or res[0].skipped
False
```

We don't support this API, and it may change in a future release, but this is the best way to get pass/fail info about a run at the moment. Hope this helps!
I want to refresh this conversation: I'm not sure that an auto full-refresh is a good idea. It should be gated by some flag, so the user gives explicit agreement to rebuild the whole incremental model (which can be quite a long process). As a way to signal to the user that a refresh is needed, I suggest introducing a warning indicating that the schema has changed compared to the existing schema in the DB. What do you say about that kind of first step?
@sphinks I think that's a great idea! This approach would work well with
@drewbanin could you guide me a little bit on where I should start looking to integrate this kind of warning? Not so good with Python, but I want to help you.
I think this might be pretty challenging to implement today! We have some code already in the incremental materialization which expands column types in the destination table if required. This could happen if dbt wants to insert a

I think if I were going to implement this, I'd make a new method,
If we want to get more clever with our approach, like adding or deleting columns, then we could certainly do that in this method too. This won't extend well to the approach of full-refreshing automatically, which I think is ok. The more I think about this issue, the less inclined I am to make dbt automatically full-refresh when the relevant

Somewhat relevant: dbt's Snapshots already add columns to the destination table when columns in the source query are changed. Long-term, I'd like to combine the logic of

I don't want to discourage you from trying this out, but I do want to say: there are a lot of gnarly edge cases in this part of the codebase. The column types dbt knows about vary by database (BigQuery has
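The warning idea discussed above can be sketched independently of dbt's internals. This is an assumption-laden illustration, not dbt's actual comparison logic: columns are represented as (name, data_type) pairs, and the function reports mismatches instead of rebuilding anything.

```python
# Illustrative sketch: compare the destination table's columns against the
# columns the model would produce, and return human-readable warnings for
# any mismatch, leaving the decision to full-refresh up to the user.

def schema_change_warnings(dest_columns, source_columns):
    dest = dict(dest_columns)  # columns already in the table
    src = dict(source_columns)  # columns the model query produces
    warnings = []
    for name in src.keys() - dest.keys():
        warnings.append("column '%s' exists in the model but not the table" % name)
    for name in dest.keys() - src.keys():
        warnings.append("column '%s' exists in the table but not the model" % name)
    for name in dest.keys() & src.keys():
        if dest[name] != src[name]:
            warnings.append("column '%s' changed type: %s -> %s"
                            % (name, dest[name], src[name]))
    return sorted(warnings)
```

As the comment above notes, a real implementation is harder than this: type names differ per database, and some "changes" (like widening a varchar) are already handled by the existing column-expansion code.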
Closing in favor of #1132. The big idea:
|
Since we have the "state" method for node selectors, it should be possible to find all modified incremental models (and their children) automatically. Or am I missing something? Here is my idea. Run the command:
with the selector defined as:
If there's no updated incremental model, the selector returns nothing; therefore, you can keep this in your CI pipeline.
Hi, this is what I want. In the dbt command, what is
Regarding |
@rrbarbosa, that's not the case (just FYI):