state:modified picks up unchanged models if they have round brackets in their configs #2941

XiaozhouWang85 · 2020-12-08T17:33:27Z

Describe the bug

When using defer / state to run changed models. A lot of models that are definitely unchanged are selected and run.
dbt run -m state:modified+ --defer --state ./

I've already read these and don't believe this is covered there:

After comparing production manifest.json to the one I'm generating locally I noticed that the only difference appears to be lists in config of those models sometimes have different order. This problems only seems to occur when round brackets are used instead of square brackets.

Steps To Reproduce

When defining config as below, state:modified selects unchanged models.

{{
  config(
    unique_key = 'my_key',
    materialized = 'table',
    cluster_by = ('my_cluster_id_1', 'my_cluster_id_2' )
    partition_by = {'field': 'partition_id', 'data_type': 'timestamp'}
  )
}}

This problem seems to disappear once square brackets are used.

{{
  config(
    unique_key = 'my_key',
    materialized = 'table',
    cluster_by = ['my_cluster_id_1', 'my_cluster_id_2' ]
    partition_by = {'field': 'partition_id', 'data_type': 'timestamp'}
  )
}}

Expected behavior

There should not be a difference in behaviour between square and round brackets. Ideally state:modified should work correctly with both.

Screenshots and log output

n/a

System information

Which database are you using dbt with?

The output of dbt --version:

installed version: 0.18.0
   latest version: 0.18.1

Your version of dbt is out of date! You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation

Plugins:
  - bigquery: 0.18.0

The operating system you're using:
MacOS / Linux

The output of python --version:
Python 3.7.0

Additional context

Add any other context about the problem here.

The text was updated successfully, but these errors were encountered:

jtcohen6 · 2020-12-08T21:59:28Z

Nice find @XiaozhouWang85! Thanks for providing a thorough explication of the bug and a reproduction case.

We made the data type of cluster_by flexible because it can accept a single value (as a string) or multiple values (as an array). This is nice insofar as supplying a tuple also works, and doesn't throw a validation error:

https://github.com/fishtown-analytics/dbt/blob/886b574987488b2e8d1809d07d692da6b611d8e5/plugins/bigquery/dbt/adapters/bigquery/impl.py#L107

When dbt ultimately renders cluster_by and writes it to the manifest, though, it's as either a string or an array. Then, when it comes time for dbt to compare the unrendered config (tuple) against the other manifest (array), the simple == equality is affected by the data type mismatch:
https://github.com/fishtown-analytics/dbt/blob/886b574987488b2e8d1809d07d692da6b611d8e5/core/dbt/contracts/graph/model_config.py#L246-L259

ipdb> unrendered
{'unique_key': 'my_key', 'materialized': 'table', 'cluster_by': ('my_cluster_id_1', 'my_cluster_id_2'), 'partition_by': {'field': 'partition_id', 'data_type': 'timestamp'}}
ipdb> other
{'unique_key': 'my_key', 'materialized': 'table', 'cluster_by': ['my_cluster_id_1', 'my_cluster_id_2'], 'partition_by': {'field': 'partition_id', 'data_type': 'timestamp'}}
ipdb> key
'cluster_by'
ipdb> unrendered[key] == other[key]
False

I think our options are one of:

Disallow tuples for cluster_by configs. Document this and update validation.
Change the equality check in compare_key to return True if it's the same data in a different representation (i.e. list vs tuple, string vs. single-item list).

I'm leaning toward the former, for simplicity's sake. @kwigley I'd be curious to hear which of these you prefer.

One other thought: It's obvious that, if you went in and changed the cluster_by from a string to a single-item list, you'd be updating the file contents and thereby modifying the model. But if you were to configure it from dbt_project.yml instead, and modify the config there...

models:
  +cluster_by: 'my_cluster_id_1'

becomes

models:
  +cluster_by:
      - my_cluster_id_1

It's not clear to me whether switching the representation of the same, single cluster key from string to list should mark the model as modified. If we pursue path #1, it would be a modification; if we pursue path #2, it wouldn't be.

dmateusp · 2021-09-30T09:56:56Z

I just hit that issue where we have a custom materialization which takes a list of tuples as an argument like:

        deduplication_date_cols=[
            ("event_date", "desc"),
            ("extraction_date", "desc"),
            ("batch_date", "asc")
        ],

Inside the config, which leads to be always picked up as a change.

I changed it to

        deduplication_date_cols={
            "event_date": "desc",
            "extraction_date": "desc",
            "batch_date": "asc"
        },

and adapted my loops to be

- {% for col, order in deduplication_date_cols %}
+ {% for col, order in deduplication_date_cols.items() %}

@jtcohen6 what do you think about adding it as a caveat in https://docs.getdbt.com/reference/node-selection/state-comparison-caveats#false-positives ?

jtcohen6 · 2021-10-07T15:58:03Z

@dmateusp There is a note at the bottom of that docs page that links to all open issues with the tag state, so there is a way to find this one, if not by the quickest path. I'm not opposed to adding a note there to encourage arrays/dicts instead of tuples.

Frankly, I'd be even more interested in resolving this one, plain and simple. What do you think? Should we adapt the compare_key to match tuples and lists that contain otherwise identical content?

dmateusp · 2021-10-07T16:02:48Z

sounds good @jtcohen6 (re: fixing it rather than adding more docs)

github-actions · 2022-04-06T02:04:52Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

XiaozhouWang85 added bug Something isn't working triage labels Dec 8, 2020

jtcohen6 added state Stateful selection (state:modified, defer) and removed triage labels Dec 8, 2020

justinwagg mentioned this issue Jul 29, 2021

using partition in model config causes state:modified false positive #3645

Closed

5 tasks

github-actions bot added the stale Issues that have gone stale label Apr 6, 2022

github-actions bot closed this as completed Apr 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

state:modified picks up unchanged models if they have round brackets in their configs #2941

state:modified picks up unchanged models if they have round brackets in their configs #2941

XiaozhouWang85 commented Dec 8, 2020

jtcohen6 commented Dec 8, 2020

dmateusp commented Sep 30, 2021

jtcohen6 commented Oct 7, 2021

dmateusp commented Oct 7, 2021

github-actions bot commented Apr 6, 2022

state:modified picks up unchanged models if they have round brackets in their configs #2941

state:modified picks up unchanged models if they have round brackets in their configs #2941

Comments

XiaozhouWang85 commented Dec 8, 2020

Describe the bug

Steps To Reproduce

Expected behavior

Screenshots and log output

System information

Additional context

jtcohen6 commented Dec 8, 2020

dmateusp commented Sep 30, 2021

jtcohen6 commented Oct 7, 2021

dmateusp commented Oct 7, 2021

github-actions bot commented Apr 6, 2022