Add state:modified and state:new selectors #2695

beckjake · 2020-08-11T20:20:25Z

resolves #2641

Description

Added state:modified and state:new selectors. All new nodes are modified nodes.

There are a few kinds of modifications:

names (database, schema, alias)
content (for seeds, a file checksum if <1mb or path if >=1mb. For tests, the test name + kwargs. For everything else, the raw sql of the node)
configuration (materialization, fqn, quoting, ...)
description fields, if persist_docs is set to persist the changed description

Tags configured via dbt_project.yml are not a change (doing it via {{ config(...) }} is a raw_sql change, sadly).

Seed files larger than 1mb do not get compared.

There's probably room to be less fiddly about what's a change and what isn't!

If you pass a state:modified selector and dbt detects that any of your macros have changed, dbt logs a blanket warning once per selector to let the user know that state:modified won't see those changes. This warning is never an error (it goes directly through logger.warning, not warn_or_error).

Two small structural changes to dbt:

I moved the various things about files themselves into a new contracts module alongside the manifest contracts (where they used to live) as part of this. Now that checksums live on nodes, it caused an import cycle. Also, they never really belonged in the manifest module.
I re-worked main to accept --state to commands that use selectors and merge some common logic on that front. This was inspired by making all these changes and then realizing I forgot to handle ls because it's totally separate... hopefully this is better.

This PR does bloat the manifest a little bit with checksums. It seemed hard to avoid, though technically we could change it so only seeds have them.

Checklist

I have signed the CLA
I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

…hings We can fix this when we drop python 3.6 and unions stop collapsing types

jtcohen6

I want to celebrate just what a tremendous leap this is. The fact that I can define a selector,

selectors:
  - name: deltas
    definition:
      method: state
      value: modified

set an environment variable, and then just dbt run --selector deltas is pretty neat, given that these are all new constructs in v0.18.0.

Playing around with this has helped me realize which modified signals make a lot of sense, and which make very little. I totally agree with your sentiment:

There's probably room to be less fiddly about what's a change and what isn't!

Name (database/schema/alias)

If the target.database or target.schema is different between the current invocation and the manifest, all nodes are considered modified. As this is almost always true for development/CI runs—it's the premise behind deferred runs—we definitely need a different behavior here.
This gets even trickier with generate_thing_name macros for database/schema/alias. E.g. if the macro ignores schema configs in non-prod envs, changing the schema config won't mark any nodes as modified in dev/ci.
This has spun me around 180 degrees: I now think we should never consider database/schema/alias for marking nodes as modified, because these are high env-aware values that are likely to be different across environments. What do you think?

Vars

In v1 dbt_project.yml config, vars are scoped to a set of model nodes, so changing a variable marks all models in that scope as changed. I guess that's right, and it's only in the old config.
In v2 config, which scopes vars at the project level, changing a var does not mark any nodes as having been changed.
In the future, if possible, we should identify which nodes depend on vars, similar to macro dependencies. (I know vars vis-a-vis parsing can be a nightmare.) In the meantime, would it be possible to raise a similar warning?

During a state comparison, dbt detected a change in vars: 'var1', 'var2'. This will not be marked as a modification.

jtcohen6 · 2020-08-12T19:48:06Z

core/dbt/graph/selector_methods.py

+    ) -> Iterator[UniqueId]:
+        if self.previous_state is None or self.previous_state.manifest is None:
+            raise RuntimeException(
+                'Got a state selector method, but no deferred manifest'


This is a nit, "comparison" feels like the right word here (you use it the warning above)

Suggested change

'Got a state selector method, but no deferred manifest'

'Got a state selector method, but no comparison manifest'

jtcohen6 · 2020-08-12T19:50:02Z

core/dbt/graph/selector_methods.py

+            if self.macros_were_modified:
+                logger.warning(warning_tag(
+                    'During a state comparison, dbt detected a change in '
+                    'macros. This will not be marked as a modification.'


Would it be possible to list up to three of the modified macros? Or is this an order of magnitude simpler as a boolean check? I don't feel that strongly

jtcohen6 · 2020-08-12T19:54:23Z

core/dbt/contracts/graph/parsed.py

+                f'previous file had a checksum type of '
+                f'{second.checksum.name}, so it has changed'
+            )
+        warn_or_error(msg, node=first)


Love this warning/error series. Could we include the seed name as well?

beckjake · 2020-08-12T20:39:41Z

I now think we should never consider database/schema/alias for marking nodes as modified, because these are high env-aware values that are likely to be different across environments. What do you think?

I think this would be fine, perhaps not intuitive. One thing we could look into is comparing the configured database/schema/alias values, before target has been considered (so they can be None). Would that be better? I'm not sure exactly how easy it is, but it sure feels like it should be easy! This wouldn't account for generate_*_name macros though.

Vars

First of all, if we're considering vars, I think we should just not support v1 configs here. If you set state:modified and your config is v1, it's an error, can't do it. I don't want to think about it and support every silly edge case just for deprecated functionality.

We can do a vars check - as you've correctly identified, it's just like the macro dependencies issue. One issue we'll see is that we don't really have the vars in the manifest currently, so we'll have to include those. We'll have the same problem with env_var and I don't have a solution for that: we can't dump the env, there's secrets in there!

jtcohen6 · 2020-08-12T21:05:48Z

This wouldn't account for generate_*_name macros though.

Use of these macros is pretty widespread, and hard to account for in its permutations. I feel strongly that we need state comparison to work for projects that override default database/schema/alias behavior in an env-based way, and I want to worry as little as possible about the ways in which they do it.

So I'm trying to think of a counterargument. What's the downside of completely ignoring changes to database/schema/alias in a CI environment? We already have good errors at compile time if dbt finds multiple resources with the same database representation.

I agree this would be less intuitive. It warrants an explanation in our docs.

What about for snapshots, where target_database+ target_schema tend to be stable across environments? I don't think it matters. It's not currently possible to dbt snapshot --select state:modified, and I don't see a use case for "only execute my modified snapshots," anyway.

Vars

Totally agree with eschewing v1
Sounds feasible for v2
Env vars I don't care about. I see them in the same category as profile/target configs and database/schema/alias: these things should be different across deployment + development environments, so diffing on them does not help. (I'm open to the argument that vars should fall into this category, too, but for now I don't think they do.)

beckjake · 2020-08-12T21:10:09Z

What's the downside of completely ignoring changes to database/schema/alias in a CI environment?

Well, I imagine a world where you change a database name field or something and it ends up being queried by an external dashboard. Should dbt run --models state:modified pick that up? Maybe not, I actually don't have much intuition here.

beckjake · 2020-08-12T21:12:39Z

I'm open to the argument that vars should fall into this category, too, but for now I don't think they do.

I have to think about this more, intuitively I think CLI vars are like env vars, but vars defined in your dbt_project.yml are like config items. That's not a very good answer though, because that's a pain to implement.

jtcohen6 · 2020-08-12T21:36:29Z

Well, I imagine a world where you change a database name field or something and it ends up being queried by an external dashboard.

It's a really good thought. I think that'd be a separate (and really cool) project, maybe plugging into existing thoughts on catalog comparison + having dbt know more about downstream model "exposures." Separate from running models in CI, I could imagine providing a warning in CI: "FYI you changed some columns in model X / the alias of model X, be aware that this may impact Y, Z downstream exposures that depend on it."

I'm good with punting on vars until v0.19.0.

beckjake · 2020-08-13T14:50:34Z

I'm good with punting on vars until v0.19.0.

I would really like to do that, because then we can remove v1 config support before we do it. I think I will feel a lot better about this when calculating vars for a given model only requires its package. That's the kind of thing we can just cram into the manifest.

If we're going to say that database/schema/alias is no longer considered a change for nodes, should it be for sources?

Edit: and is a table's quoting setting not a change then? Doesn't feel like a change.

jtcohen6 · 2020-08-13T15:29:37Z

[EDIT: OVERTAKEN BY EVENTS. This turned into an offline discussion that I'll summarize in a new issue]

Sources are an excellent question. Changes to a source's database representation could indicate a meaningful modification, such as switching an ingestion mechanism, which is definitely worth testing downstream dependencies. By default, dbt expects sources to have the same database representations across environments. (Same as snapshots.)

That said, users can (and often do) define sources with env-aware logic like:

sources:
  - name: application_db
    schema: "{{ 'application_db_prod' if target.name == 'prod' else 'application_db_staging' }}"

The former, which is an "actual" modification, takes place quite rarely. The latter would be identified as a false positive every single run. So while I don't feel great about it, I don't think we should compare source database/schema/identifier. That also keeps us consistent across node types.

Here are more complex approaches I can imagine taking:

If a source's database/schema/name/identifier is a Jinja expression, we don't try to compare it; if it's a string literal, we do. This may not be possible, and it may be too confusing to be desirable, anyway.
An eventual resolution (not for v0.18.0) would be the ability to compare state against one manifest and defer to a different one. This would allow someone to dbt compile --target ci, have all the namespaces in that manifest be right for comparison, but then defer to another manifest with a different set of namespaces.

Quoting still feels like a change. Accidentally switching off quoting could definitely be a breaking change. Also, I would never expect quoting behavior to be different across environments. Even though this is tightly linked to database representation, it feels more like a model config in the vein of materialized, sort/dist, etc.

…os are added/removed

jtcohen6

This is awesome! I confirmed locally:

Macro warning lists some of the changed macros
Seed warning includes name of too-big seed
Modifications to database representation are based on the user-configured database/schema/alias rather than the resolved values

cla-bot bot added the cla:yes label Aug 11, 2020

beckjake force-pushed the feature/state-modified-selector branch from 6a397c7 to d77113d Compare August 11, 2020 20:43

beckjake requested review from jtcohen6, gshank and kwigley August 11, 2020 21:00

beckjake marked this pull request as ready for review August 11, 2020 21:07

Jacob Beck added 3 commits August 12, 2020 07:44

Add state:modified and state:new selectors

ebe5b46

mypy is rightfully mad about our funky inheritance, just copy+paste t…

7ba52d4

…hings We can fix this when we drop python 3.6 and unions stop collapsing types

flake8 update found more things to complain about

153eb7e

beckjake force-pushed the feature/state-modified-selector branch from 982395c to 153eb7e Compare August 12, 2020 13:45

jtcohen6 reviewed Aug 12, 2020

View reviewed changes

kwigley approved these changes Aug 13, 2020

View reviewed changes

gshank approved these changes Aug 13, 2020

View reviewed changes

PR feedback

13fb235

beckjake force-pushed the feature/state-modified-selector branch from 80d3bcd to 13fb235 Compare August 14, 2020 14:49

fix macro change check to account for new list return value when macr…

f3f713a

…os are added/removed

jtcohen6 mentioned this pull request Aug 14, 2020

Subselectors for state:modified #2704

Closed

fix ls tests

c8453d8

beckjake mentioned this pull request Aug 14, 2020

Feature/defer tests #2706

Closed

4 tasks

jtcohen6 approved these changes Aug 14, 2020

View reviewed changes

beckjake merged commit d554835 into dev/marian-anderson Aug 14, 2020

beckjake deleted the feature/state-modified-selector branch August 14, 2020 19:46

jtcohen6 mentioned this pull request Aug 18, 2020

Add pseudo selectors that select models based on artifact states #2465

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add state:modified and state:new selectors #2695

Add state:modified and state:new selectors #2695

beckjake commented Aug 11, 2020 •

edited

Loading

jtcohen6 left a comment

jtcohen6 Aug 12, 2020

jtcohen6 Aug 12, 2020

jtcohen6 Aug 12, 2020

beckjake commented Aug 12, 2020 •

edited

Loading

jtcohen6 commented Aug 12, 2020

beckjake commented Aug 12, 2020

beckjake commented Aug 12, 2020

jtcohen6 commented Aug 12, 2020

beckjake commented Aug 13, 2020 •

edited

Loading

jtcohen6 commented Aug 13, 2020 •

edited

Loading

jtcohen6 left a comment

	'Got a state selector method, but no deferred manifest'
	'Got a state selector method, but no comparison manifest'

Add state:modified and state:new selectors #2695

Add state:modified and state:new selectors #2695

Conversation

beckjake commented Aug 11, 2020 • edited Loading

Description

Checklist

jtcohen6 left a comment

Choose a reason for hiding this comment

jtcohen6 Aug 12, 2020

Choose a reason for hiding this comment

jtcohen6 Aug 12, 2020

Choose a reason for hiding this comment

jtcohen6 Aug 12, 2020

Choose a reason for hiding this comment

beckjake commented Aug 12, 2020 • edited Loading

jtcohen6 commented Aug 12, 2020

beckjake commented Aug 12, 2020

beckjake commented Aug 12, 2020

jtcohen6 commented Aug 12, 2020

beckjake commented Aug 13, 2020 • edited Loading

jtcohen6 commented Aug 13, 2020 • edited Loading

jtcohen6 left a comment

Choose a reason for hiding this comment

beckjake commented Aug 11, 2020 •

edited

Loading

beckjake commented Aug 12, 2020 •

edited

Loading

beckjake commented Aug 13, 2020 •

edited

Loading

jtcohen6 commented Aug 13, 2020 •

edited

Loading