Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

state:modified picks up unchanged models if they have round brackets in their configs #2941

Closed
1 of 5 tasks
XiaozhouWang85 opened this issue Dec 8, 2020 · 5 comments
Closed
1 of 5 tasks
Labels
bug Something isn't working stale Issues that have gone stale state Stateful selection (state:modified, defer)

Comments

@XiaozhouWang85
Copy link

Describe the bug

When using defer / state to run changed models. A lot of models that are definitely unchanged are selected and run.
dbt run -m state:modified+ --defer --state ./

I've already read these and don't believe this is covered there:

After comparing production manifest.json to the one I'm generating locally I noticed that the only difference appears to be lists in config of those models sometimes have different order. This problems only seems to occur when round brackets are used instead of square brackets.

Steps To Reproduce

When defining config as below, state:modified selects unchanged models.

{{
  config(
    unique_key = 'my_key',
    materialized = 'table',
    cluster_by = ('my_cluster_id_1', 'my_cluster_id_2' )
    partition_by = {'field': 'partition_id', 'data_type': 'timestamp'}
  )
}}

This problem seems to disappear once square brackets are used.

{{
  config(
    unique_key = 'my_key',
    materialized = 'table',
    cluster_by = ['my_cluster_id_1', 'my_cluster_id_2' ]
    partition_by = {'field': 'partition_id', 'data_type': 'timestamp'}
  )
}}

Expected behavior

There should not be a difference in behaviour between square and round brackets. Ideally state:modified should work correctly with both.

Screenshots and log output

n/a

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

installed version: 0.18.0
   latest version: 0.18.1

Your version of dbt is out of date! You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation

Plugins:
  - bigquery: 0.18.0

The operating system you're using:
MacOS / Linux

The output of python --version:
Python 3.7.0

Additional context

Add any other context about the problem here.

@XiaozhouWang85 XiaozhouWang85 added bug Something isn't working triage labels Dec 8, 2020
@jtcohen6
Copy link
Contributor

jtcohen6 commented Dec 8, 2020

Nice find @XiaozhouWang85! Thanks for providing a thorough explication of the bug and a reproduction case.

We made the data type of cluster_by flexible because it can accept a single value (as a string) or multiple values (as an array). This is nice insofar as supplying a tuple also works, and doesn't throw a validation error:

https://github.com/fishtown-analytics/dbt/blob/886b574987488b2e8d1809d07d692da6b611d8e5/plugins/bigquery/dbt/adapters/bigquery/impl.py#L107

When dbt ultimately renders cluster_by and writes it to the manifest, though, it's as either a string or an array. Then, when it comes time for dbt to compare the unrendered config (tuple) against the other manifest (array), the simple == equality is affected by the data type mismatch:
https://github.com/fishtown-analytics/dbt/blob/886b574987488b2e8d1809d07d692da6b611d8e5/core/dbt/contracts/graph/model_config.py#L246-L259

ipdb> unrendered
{'unique_key': 'my_key', 'materialized': 'table', 'cluster_by': ('my_cluster_id_1', 'my_cluster_id_2'), 'partition_by': {'field': 'partition_id', 'data_type': 'timestamp'}}
ipdb> other
{'unique_key': 'my_key', 'materialized': 'table', 'cluster_by': ['my_cluster_id_1', 'my_cluster_id_2'], 'partition_by': {'field': 'partition_id', 'data_type': 'timestamp'}}
ipdb> key
'cluster_by'
ipdb> unrendered[key] == other[key]
False

I think our options are one of:

  1. Disallow tuples for cluster_by configs. Document this and update validation.
  2. Change the equality check in compare_key to return True if it's the same data in a different representation (i.e. list vs tuple, string vs. single-item list).

I'm leaning toward the former, for simplicity's sake. @kwigley I'd be curious to hear which of these you prefer.

One other thought: It's obvious that, if you went in and changed the cluster_by from a string to a single-item list, you'd be updating the file contents and thereby modifying the model. But if you were to configure it from dbt_project.yml instead, and modify the config there...

models:
  +cluster_by: 'my_cluster_id_1'

becomes

models:
  +cluster_by:
      - my_cluster_id_1

It's not clear to me whether switching the representation of the same, single cluster key from string to list should mark the model as modified. If we pursue path #1, it would be a modification; if we pursue path #2, it wouldn't be.

@jtcohen6 jtcohen6 added state Stateful selection (state:modified, defer) and removed triage labels Dec 8, 2020
@dmateusp
Copy link
Contributor

I just hit that issue where we have a custom materialization which takes a list of tuples as an argument like:

        deduplication_date_cols=[
            ("event_date", "desc"),
            ("extraction_date", "desc"),
            ("batch_date", "asc")
        ],

Inside the config, which leads to be always picked up as a change.

I changed it to

        deduplication_date_cols={
            "event_date": "desc",
            "extraction_date": "desc",
            "batch_date": "asc"
        },

and adapted my loops to be

- {% for col, order in deduplication_date_cols %}
+ {% for col, order in deduplication_date_cols.items() %}

@jtcohen6 what do you think about adding it as a caveat in https://docs.getdbt.com/reference/node-selection/state-comparison-caveats#false-positives ?

@jtcohen6
Copy link
Contributor

jtcohen6 commented Oct 7, 2021

@dmateusp There is a note at the bottom of that docs page that links to all open issues with the tag state, so there is a way to find this one, if not by the quickest path. I'm not opposed to adding a note there to encourage arrays/dicts instead of tuples.

Frankly, I'd be even more interested in resolving this one, plain and simple. What do you think? Should we adapt the compare_key to match tuples and lists that contain otherwise identical content?

@dmateusp
Copy link
Contributor

dmateusp commented Oct 7, 2021

sounds good @jtcohen6 (re: fixing it rather than adding more docs)

@github-actions
Copy link
Contributor

github-actions bot commented Apr 6, 2022

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Apr 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working stale Issues that have gone stale state Stateful selection (state:modified, defer)
Projects
None yet
Development

No branches or pull requests

3 participants