partial parsing: support overlapping sets of files by parser #1778

kjgorman · 2019-09-24T11:48:14Z

Describe the feature

Basically a feature request to track this TODO: https://github.com/fishtown-analytics/dbt/blob/f9c8442260e48bdd8bb7805b2e7541ab91492bb1/core/dbt/loader.py#L128-L130

I have a scenario where there are some overlapping model and test files which fails to be partially parsed:

CompilationException('dbt found two resources with the name "my_resource_name". Since these resources have the same name,\ndbt will be unable to find the correct resource when ref("validate_unknown_scheme") is used. To fix this,\nchange the name of one of these resources:\n- model.my_project.my_resource_name (models/my_project/my_resource_name.sql)\n- model.my_project.my_resource_name (models/my_project/my_resource_name.sql)', None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/dbt/main.py", line 89, in main
    results, succeeded = handle_and_check(args)
  File "/usr/local/lib/python3.7/site-packages/dbt/main.py", line 167, in handle_and_check
    task, res = run_from_args(parsed)
  File "/usr/local/lib/python3.7/site-packages/dbt/main.py", line 219, in run_from_args
    results = task.run()
  File "/usr/local/lib/python3.7/site-packages/dbt/task/runnable.py", line 286, in run
    self._runtime_initialize()
  File "/usr/local/lib/python3.7/site-packages/dbt/task/runnable.py", line 80, in _runtime_initialize
    super()._runtime_initialize()
  File "/usr/local/lib/python3.7/site-packages/dbt/task/runnable.py", line 51, in _runtime_initialize
    self.manifest, self.linker = monzo_runtime_patches.runtime_initialize(self.config, self.args)
  File "/usr/local/lib/python3.7/site-packages/dbt/monzo/patches/runtime_init.py", line 132, in runtime_initialize
    manifest = load_manifest(config)
  File "/usr/local/lib/python3.7/site-packages/dbt/task/runnable.py", line 37, in load_manifest
    manifest = GraphLoader.load_all(config, internal_manifest=internal)
  File "/usr/local/lib/python3.7/site-packages/dbt/loader.py", line 295, in load_all
    loader.load(internal_manifest=internal_manifest)
  File "/usr/local/lib/python3.7/site-packages/dbt/loader.py", line 191, in load
    self.parse_project(project, macro_manifest, old_results)
  File "/usr/local/lib/python3.7/site-packages/dbt/loader.py", line 168, in parse_project
    self.parse_with_cache(path, parser, old_results)
  File "/usr/local/lib/python3.7/site-packages/dbt/loader.py", line 120, in parse_with_cache
    if not self._get_cached(block, old_results):
  File "/usr/local/lib/python3.7/site-packages/dbt/loader.py", line 136, in _get_cached
    return self.results.sanitized_update(block.file, old_results)
  File "/usr/local/lib/python3.7/site-packages/dbt/parser/results.py", line 150, in sanitized_update
    self.add_node(source_file, node)
  File "/usr/local/lib/python3.7/site-packages/dbt/parser/results.py", line 76, in add_node
    _check_duplicates(node, self.nodes)
  File "/usr/local/lib/python3.7/site-packages/dbt/parser/results.py", line 27, in _check_duplicates
    raise_duplicate_resource_name(value, src[value.unique_id])
  File "/usr/local/lib/python3.7/site-packages/dbt/exceptions.py", line 645, in raise_duplicate_resource_name
    node_2.unique_id, node_2.original_file_path))
  File "/usr/local/lib/python3.7/site-packages/dbt/exceptions.py", line 345, in raise_compiler_error
    raise CompilationException(msg, node)
dbt.exceptions.CompilationException: Compilation Error
  dbt found two resources with the name "my_resource_name". Since these resources have the same name,
  dbt will be unable to find the correct resource when ref("my_resource_name") is used. To fix this,
  change the name of one of these resources:
  - model.my_project.my_resource_name (models/validation/my_resource_name.sql)
  - model.my_project.my_resource_name (models/validation/my_resource_name.sql)

In this instance, we have a validation table which we want to also generate an output data set to be observed outside of a test run, so the files are processed by both dbt.parser.models.ModelParser and dbt.parser.data_test.DataTestParser. (Perhaps there's some other way of achieving this that I'm not aware of—I'm quite new to dbt—which would obviate the need for this ticket.)

Looking briefly through the cached node processing, it seems like processing an already parsed file isn't actually a function contingent on what parser is reading the files, so it would perhaps be possible to simply skip files that are already added:

        if old_results.has_file(block.file):
+           # already loaded this file (perhaps by another parser)
+           if block.file.search_key in self.results.files:
+               return True

            return self.results.sanitized_update(block.file, old_results)

Although of course the presence of the TODO suggests there's more to it.

Describe alternatives you've considered

Just not have overlapping sets of parser candidate files like this, and not need to deal with this
Just not use the partial parsing functionality when we have these overlapping files
At least print a more useful error message, rather than asking to rename one of two items, where actually the two items are the same item

Additional context

Who will this benefit?

Me! 🙂

The text was updated successfully, but these errors were encountered:

drewbanin · 2019-09-26T02:47:18Z

Hey @kjgorman - thanks for the thorough writeup!

I think a good approach for your use case is to decouple your model and test definitions. The approach I would recommend would be to:

make a model that defines your validation dataset
make a custom data test that just does a select * from {{ ref('validation_model_name') }}

While this approach will require you to make two files for each validation query, it will also allow you to configure each resource independently. You could, for instance, configure the model to be materialized as a table while configuring the severity of the test. These are two different types of resources, and I definitely think they warrant two different declarations. For that reason alone, I don't think it would be a good idea to support overlapping paths for different resource parsers.

More broadly, I want dbt's top-level file paths (models/, tests/) to become less important over time. In order to achieve that, we'll likely leverage Jinja tags as we already do for snapshots. If we had tags for models and data tests, you could imagine blocks like:

{% model my_model %}
  select ...
{% endmodel %}

and

{% test test_a_thing_is_true %}
  select ...
{% endmodel %}

Since these blocks define resources explicitly, they could live side-by-side in a given directory (or even in the same file!). I think this approach does a good job of 1) explicitly defining resources and 2) keeping related logic close together. When we add these blocks to dbt, I'll feel a lot better about allowing multiple parsers to operate in the same directories.

Check out some related issues:

Very happy to discuss this further - keen to hear what you think.

kjgorman · 2019-09-30T19:07:21Z

Hey @drewbanin — thanks for writing that out for me, and sorry it's taken me so long to get back to you!

The approach you describe makes sense to me — I've refactored our validation tests today to use the simple refs and resolved the issue. It's a little unwieldy to manage the extra files, but we'll see if it's really a problem in practice—we were thinking a small helper shell function to generate the boilerplate might suffice.

The approach of explicit tagging with blocks is nice — with the path-based "convention over configuration" model, running into this error by using a somewhat irregular directory structure made the error more or less indecipherable until I'd spent some time debugging the parser source code.

drewbanin · 2019-10-01T14:33:47Z

Yeah - I totally buy that. I saw another report of a user that put a .yml file in the snapshots/ directory which caused dbt to miss it. I'm excited to make these paths less significant!

kjgorman added enhancement New feature or request triage labels Sep 24, 2019

drewbanin removed the triage label Sep 26, 2019

drewbanin closed this as completed Oct 1, 2019

jtcohen6 added the partial_parsing label Jun 5, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

partial parsing: support overlapping sets of files by parser #1778

partial parsing: support overlapping sets of files by parser #1778

kjgorman commented Sep 24, 2019

drewbanin commented Sep 26, 2019 •

edited

Loading

kjgorman commented Sep 30, 2019

drewbanin commented Oct 1, 2019

partial parsing: support overlapping sets of files by parser #1778

partial parsing: support overlapping sets of files by parser #1778

Comments

kjgorman commented Sep 24, 2019

Describe the feature

Describe alternatives you've considered

Additional context

Who will this benefit?

drewbanin commented Sep 26, 2019 • edited Loading

kjgorman commented Sep 30, 2019

drewbanin commented Oct 1, 2019

drewbanin commented Sep 26, 2019 •

edited

Loading