Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

partial parsing: support overlapping sets of files by parser #1778

Closed
kjgorman opened this issue Sep 24, 2019 · 3 comments
Closed

partial parsing: support overlapping sets of files by parser #1778

kjgorman opened this issue Sep 24, 2019 · 3 comments
Labels
enhancement New feature or request partial_parsing

Comments

@kjgorman
Copy link

Describe the feature

Basically a feature request to track this TODO: https://github.com/fishtown-analytics/dbt/blob/f9c8442260e48bdd8bb7805b2e7541ab91492bb1/core/dbt/loader.py#L128-L130

I have a scenario where there are some overlapping model and test files which fails to be partially parsed:

CompilationException('dbt found two resources with the name "my_resource_name". Since these resources have the same name,\ndbt will be unable to find the correct resource when ref("validate_unknown_scheme") is used. To fix this,\nchange the name of one of these resources:\n- model.my_project.my_resource_name (models/my_project/my_resource_name.sql)\n- model.my_project.my_resource_name (models/my_project/my_resource_name.sql)', None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/dbt/main.py", line 89, in main
    results, succeeded = handle_and_check(args)
  File "/usr/local/lib/python3.7/site-packages/dbt/main.py", line 167, in handle_and_check
    task, res = run_from_args(parsed)
  File "/usr/local/lib/python3.7/site-packages/dbt/main.py", line 219, in run_from_args
    results = task.run()
  File "/usr/local/lib/python3.7/site-packages/dbt/task/runnable.py", line 286, in run
    self._runtime_initialize()
  File "/usr/local/lib/python3.7/site-packages/dbt/task/runnable.py", line 80, in _runtime_initialize
    super()._runtime_initialize()
  File "/usr/local/lib/python3.7/site-packages/dbt/task/runnable.py", line 51, in _runtime_initialize
    self.manifest, self.linker = monzo_runtime_patches.runtime_initialize(self.config, self.args)
  File "/usr/local/lib/python3.7/site-packages/dbt/monzo/patches/runtime_init.py", line 132, in runtime_initialize
    manifest = load_manifest(config)
  File "/usr/local/lib/python3.7/site-packages/dbt/task/runnable.py", line 37, in load_manifest
    manifest = GraphLoader.load_all(config, internal_manifest=internal)
  File "/usr/local/lib/python3.7/site-packages/dbt/loader.py", line 295, in load_all
    loader.load(internal_manifest=internal_manifest)
  File "/usr/local/lib/python3.7/site-packages/dbt/loader.py", line 191, in load
    self.parse_project(project, macro_manifest, old_results)
  File "/usr/local/lib/python3.7/site-packages/dbt/loader.py", line 168, in parse_project
    self.parse_with_cache(path, parser, old_results)
  File "/usr/local/lib/python3.7/site-packages/dbt/loader.py", line 120, in parse_with_cache
    if not self._get_cached(block, old_results):
  File "/usr/local/lib/python3.7/site-packages/dbt/loader.py", line 136, in _get_cached
    return self.results.sanitized_update(block.file, old_results)
  File "/usr/local/lib/python3.7/site-packages/dbt/parser/results.py", line 150, in sanitized_update
    self.add_node(source_file, node)
  File "/usr/local/lib/python3.7/site-packages/dbt/parser/results.py", line 76, in add_node
    _check_duplicates(node, self.nodes)
  File "/usr/local/lib/python3.7/site-packages/dbt/parser/results.py", line 27, in _check_duplicates
    raise_duplicate_resource_name(value, src[value.unique_id])
  File "/usr/local/lib/python3.7/site-packages/dbt/exceptions.py", line 645, in raise_duplicate_resource_name
    node_2.unique_id, node_2.original_file_path))
  File "/usr/local/lib/python3.7/site-packages/dbt/exceptions.py", line 345, in raise_compiler_error
    raise CompilationException(msg, node)
dbt.exceptions.CompilationException: Compilation Error
  dbt found two resources with the name "my_resource_name". Since these resources have the same name,
  dbt will be unable to find the correct resource when ref("my_resource_name") is used. To fix this,
  change the name of one of these resources:
  - model.my_project.my_resource_name (models/validation/my_resource_name.sql)
  - model.my_project.my_resource_name (models/validation/my_resource_name.sql)

In this instance, we have a validation table which we want to also generate an output data set to be observed outside of a test run, so the files are processed by both dbt.parser.models.ModelParser and dbt.parser.data_test.DataTestParser. (Perhaps there's some other way of achieving this that I'm not aware of—I'm quite new to dbt—which would obviate the need for this ticket.)

Looking briefly through the cached node processing, it seems like processing an already parsed file isn't actually a function contingent on what parser is reading the files, so it would perhaps be possible to simply skip files that are already added:

        if old_results.has_file(block.file):
+           # already loaded this file (perhaps by another parser)
+           if block.file.search_key in self.results.files:
+               return True

            return self.results.sanitized_update(block.file, old_results)

Although of course the presence of the TODO suggests there's more to it.

Describe alternatives you've considered

  • Just not have overlapping sets of parser candidate files like this, and not need to deal with this
  • Just not use the partial parsing functionality when we have these overlapping files
  • At least print a more useful error message, rather than asking to rename one of two items, where actually the two items are the same item

Additional context

Who will this benefit?

Me! 🙂

@kjgorman kjgorman added enhancement New feature or request triage labels Sep 24, 2019
@drewbanin drewbanin removed the triage label Sep 26, 2019
@drewbanin
Copy link
Contributor

drewbanin commented Sep 26, 2019

Hey @kjgorman - thanks for the thorough writeup!

I think a good approach for your use case is to decouple your model and test definitions. The approach I would recommend would be to:

  1. make a model that defines your validation dataset
  2. make a custom data test that just does a select * from {{ ref('validation_model_name') }}

While this approach will require you to make two files for each validation query, it will also allow you to configure each resource independently. You could, for instance, configure the model to be materialized as a table while configuring the severity of the test. These are two different types of resources, and I definitely think they warrant two different declarations. For that reason alone, I don't think it would be a good idea to support overlapping paths for different resource parsers.

More broadly, I want dbt's top-level file paths (models/, tests/) to become less important over time. In order to achieve that, we'll likely leverage Jinja tags as we already do for snapshots. If we had tags for models and data tests, you could imagine blocks like:

{% model my_model %}
  select ...
{% endmodel %}

and

{% test test_a_thing_is_true %}
  select ...
{% endmodel %}

Since these blocks define resources explicitly, they could live side-by-side in a given directory (or even in the same file!). I think this approach does a good job of 1) explicitly defining resources and 2) keeping related logic close together. When we add these blocks to dbt, I'll feel a lot better about allowing multiple parsers to operate in the same directories.

Check out some related issues:

Very happy to discuss this further - keen to hear what you think.

@kjgorman
Copy link
Author

Hey @drewbanin — thanks for writing that out for me, and sorry it's taken me so long to get back to you!

The approach you describe makes sense to me — I've refactored our validation tests today to use the simple refs and resolved the issue. It's a little unwieldy to manage the extra files, but we'll see if it's really a problem in practice—we were thinking a small helper shell function to generate the boilerplate might suffice.

The approach of explicit tagging with blocks is nice — with the path-based "convention over configuration" model, running into this error by using a somewhat irregular directory structure made the error more or less indecipherable until I'd spent some time debugging the parser source code.

@drewbanin
Copy link
Contributor

Yeah - I totally buy that. I saw another report of a user that put a .yml file in the snapshots/ directory which caused dbt to miss it. I'm excited to make these paths less significant!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request partial_parsing
Projects
None yet
Development

No branches or pull requests

3 participants