-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add state:modified and state:new selectors #2695
Add state:modified and state:new selectors #2695
Conversation
6a397c7
to
d77113d
Compare
…hings We can fix this when we drop python 3.6 and unions stop collapsing types
982395c
to
153eb7e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to celebrate just what a tremendous leap this is. The fact that I can define a selector,
selectors:
- name: deltas
definition:
method: state
value: modified
set an environment variable, and then just dbt run --selector deltas
is pretty neat, given that these are all new constructs in v0.18.0.
Playing around with this has helped me realize which modified signals make a lot of sense, and which make very little. I totally agree with your sentiment:
There's probably room to be less fiddly about what's a change and what isn't!
Name (database/schema/alias)
- If the
target.database
ortarget.schema
is different between the current invocation and the manifest, all nodes are considered modified. As this is almost always true for development/CI runs—it's the premise behind deferred runs—we definitely need a different behavior here. - This gets even trickier with
generate_thing_name
macros for database/schema/alias. E.g. if the macro ignoresschema
configs in non-prod envs, changing theschema
config won't mark any nodes as modified in dev/ci. - This has spun me around 180 degrees: I now think we should never consider database/schema/alias for marking nodes as modified, because these are high env-aware values that are likely to be different across environments. What do you think?
Vars
- In v1
dbt_project.yml
config, vars are scoped to a set of model nodes, so changing a variable marks all models in that scope as changed. I guess that's right, and it's only in the old config. - In v2 config, which scopes vars at the project level, changing a var does not mark any nodes as having been changed.
- In the future, if possible, we should identify which nodes depend on vars, similar to macro dependencies. (I know vars vis-a-vis parsing can be a nightmare.) In the meantime, would it be possible to raise a similar warning?
During a state comparison, dbt detected a change in vars: 'var1', 'var2'. This will not be marked as a modification.
core/dbt/graph/selector_methods.py
Outdated
) -> Iterator[UniqueId]: | ||
if self.previous_state is None or self.previous_state.manifest is None: | ||
raise RuntimeException( | ||
'Got a state selector method, but no deferred manifest' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nit, "comparison" feels like the right word here (you use it the warning above)
'Got a state selector method, but no deferred manifest' | |
'Got a state selector method, but no comparison manifest' |
core/dbt/graph/selector_methods.py
Outdated
if self.macros_were_modified: | ||
logger.warning(warning_tag( | ||
'During a state comparison, dbt detected a change in ' | ||
'macros. This will not be marked as a modification.' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to list up to three of the modified macros? Or is this an order of magnitude simpler as a boolean check? I don't feel that strongly
f'previous file had a checksum type of ' | ||
f'{second.checksum.name}, so it has changed' | ||
) | ||
warn_or_error(msg, node=first) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love this warning/error series. Could we include the seed name as well?
I think this would be fine, perhaps not intuitive. One thing we could look into is comparing the configured database/schema/alias values, before
First of all, if we're considering vars, I think we should just not support v1 configs here. If you set We can do a vars check - as you've correctly identified, it's just like the macro dependencies issue. One issue we'll see is that we don't really have the vars in the manifest currently, so we'll have to include those. We'll have the same problem with |
Use of these macros is pretty widespread, and hard to account for in its permutations. I feel strongly that we need state comparison to work for projects that override default database/schema/alias behavior in an env-based way, and I want to worry as little as possible about the ways in which they do it. So I'm trying to think of a counterargument. What's the downside of completely ignoring changes to database/schema/alias in a CI environment? We already have good errors at compile time if dbt finds multiple resources with the same database representation. I agree this would be less intuitive. It warrants an explanation in our docs. What about for snapshots, where
|
Well, I imagine a world where you change a database name field or something and it ends up being queried by an external dashboard. Should |
I have to think about this more, intuitively I think CLI vars are like env vars, but vars defined in your dbt_project.yml are like config items. That's not a very good answer though, because that's a pain to implement. |
It's a really good thought. I think that'd be a separate (and really cool) project, maybe plugging into existing thoughts on catalog comparison + having dbt know more about downstream model "exposures." Separate from running models in CI, I could imagine providing a warning in CI: "FYI you changed some columns in model X / the alias of model X, be aware that this may impact Y, Z downstream exposures that depend on it." I'm good with punting on |
I would really like to do that, because then we can remove v1 config support before we do it. I think I will feel a lot better about this when calculating vars for a given model only requires its package. That's the kind of thing we can just cram into the manifest. If we're going to say that database/schema/alias is no longer considered a change for nodes, should it be for sources? Edit: and is a table's quoting setting not a change then? Doesn't feel like a change. |
[EDIT: OVERTAKEN BY EVENTS. This turned into an offline discussion that I'll summarize in a new issue] Sources are an excellent question. Changes to a source's database representation could indicate a meaningful modification, such as switching an ingestion mechanism, which is definitely worth testing downstream dependencies. By default, dbt expects sources to have the same database representations across environments. (Same as snapshots.) That said, users can (and often do) define sources with env-aware logic like: sources:
- name: application_db
schema: "{{ 'application_db_prod' if target.name == 'prod' else 'application_db_staging' }}" The former, which is an "actual" modification, takes place quite rarely. The latter would be identified as a false positive every single run. So while I don't feel great about it, I don't think we should compare source database/schema/identifier. That also keeps us consistent across node types. Here are more complex approaches I can imagine taking:
Quoting still feels like a change. Accidentally switching off quoting could definitely be a breaking change. Also, I would never expect quoting behavior to be different across environments. Even though this is tightly linked to database representation, it feels more like a model config in the vein of |
80d3bcd
to
13fb235
Compare
…os are added/removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome! I confirmed locally:
- Macro warning lists some of the changed macros
- Seed warning includes name of too-big seed
- Modifications to database representation are based on the user-configured database/schema/alias rather than the resolved values
resolves #2641
Description
Added
state:modified
andstate:new
selectors. All new nodes are modified nodes.There are a few kinds of modifications:
Tags configured via dbt_project.yml are not a change (doing it via
{{ config(...) }}
is a raw_sql change, sadly).Seed files larger than 1mb do not get compared.
There's probably room to be less fiddly about what's a change and what isn't!
If you pass a
state:modified
selector and dbt detects that any of your macros have changed, dbt logs a blanket warning once per selector to let the user know thatstate:modified
won't see those changes. This warning is never an error (it goes directly throughlogger.warning
, notwarn_or_error
).Two small structural changes to dbt:
main
to accept--state
to commands that use selectors and merge some common logic on that front. This was inspired by making all these changes and then realizing I forgot to handlels
because it's totally separate... hopefully this is better.This PR does bloat the manifest a little bit with checksums. It seemed hard to avoid, though technically we could change it so only seeds have them.
Checklist
CHANGELOG.md
and added information about my change to the "dbt next" section.