-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move partial parsing to end of parsing. Switch to using msgpack for saved manifest. #3364
Conversation
Things that were changed to support using msgpack instead of pickle:
|
Changes to support partial parsing:
|
e98d6ed
to
2c9b90e
Compare
Note: testing of most filetypes has not been completed, but that should mainly involve changes in core/dbt/parser/partial.py. The rest of it should be more or less stable. |
@gshank I've only taken for a brief spin so far with our internal analytics project, and it works magnificently (and very quickly) when there are no file changes. When I do make changes and re-parse (by running
When I get around the hook error (by commenting it out + full re-parse), and I change a model file, I see a dependency-related compilation error:
If I do the same and change a
Does that look familiar to you? |
The double hook thing is because I'm not preventing hook parsing when partial parsing. I'll fix that. I'm not surprised about the source thing; I haven't tested source changes at all, only models and model sections in schema files. Will test those next. |
c312f1e
to
34519fa
Compare
Got it, makes sense. Confirmed that the hook-related error went away with the latest commit.
Update after reading comment below: You're right! I was testing this between some other fixes, and got my dependency versions crossed. |
I think you don't have Mashumaro 2.5, which is needed to distinguish between SourceFile and SchemaSourceFile. Are you just running from a branch? You probably need to do pip install -r dev-requirements.txt -r editable-requirements.txt |
d030902
to
2a7d86b
Compare
Ok, I was able to get this working! Thanks for helping me debug the error above. I'm going to jot down some of the things I've noticed while trying this out (in addition to the issues with
In both cases, I was expecting to see the message from In all cases, it's very fast :) One funny thing I noticed: the |
Good catch on the refs. Will have to look for those and re-schedule them for parsing. We could do a total file count too, probably by looping through the files dictionary and counting them. At the end of parsing, maybe. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the last few fixes! I think there's still some testing to do around deleted/disabled models, so I'm glad it's on the radar.
This is really coming along in local testing. The biggest outstanding issue is with sources. If I edit a resource that depends on a source (whether a model's .sql or associated yaml properties), I get a missing target error when I run dbt parse
:
Model 'model.fishtown_internal_analytics.stg_slack__users' (models/staging/slack/stg_slack__users.sql) depends on a source named 'slack.users' which was not found
I also see that error if I edit unrelated files but then I run a more involved task, e.g. dbt run
(or even dbt list
). I know source dependency checks is very much front of mind; I'd say it's the biggest blocker to more extensive testing right now.
Other, smaller things:
- If I edit a docs block, I get
dbt found two resources with the name "<docs block name>"
, which seems similar to the error we saw previously with hook parsing. - If I edit the
depends_on
inside an exposure, I get a key error that's just the name of the exposure, with the line below in the stacktrace. (If I change any other exposure properties, we're in okay shape, and if I add a net-new one, that's ok too, but I also see this error if I delete an existing exposure entirely, since that obviously involves editingdepends_on
.)
File ".../dbt/core/dbt/parser/partial.py", line 557, in delete_schema_exposure
exposure = self.saved_manifest.exposures[unique_id])
@@ -442,12 +477,13 @@ class ParsedMacro(UnparsedBaseNode, HasUniqueID): | |||
docs: Docs = field(default_factory=Docs) | |||
patch_path: Optional[str] = None | |||
arguments: List[MacroArgument] = field(default_factory=list) | |||
created_at: int = field(default_factory=lambda: int(time.time())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is epoch time, right? No reason to expect any wackiness around timezone / system differences? I ask only because I've been burned before, as someone who used to do a lot of analytics work :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's epoch time, because it's a lot more straightforward to a simple "later than" test. It's UTC by definition. I think the only issue would be if people created a partial parse saved file on a server with a seriously messed up clock.
'project hash mismatch: values missing, cache invalidated: {}' | ||
.format(missing_keys) | ||
) | ||
logger.info("Unable to do partial parsing because a project dependency has been added") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So glad to have info messages like this one!
Would it be possible (and simple) to do a full comparison between the project_hashes
, rather than just comparing k in self.manifest
?
Example: If I had a package installed, parsed my project, manually deleted the installed package files from dbt_modules
, I'll see a key error for the now-missing file (e.g. 'dbt_utils://macros/cross_db_utils/intersect.sql'
). This is definitely a corner case, since users almost never manually remove installed modules, they're much more likely to run dbt clean
(which also removes partial_parse.msgpack
).
This is taking me a long time to go through, and I'm not sure I have a full grasp of the edges of this change since it's so big. I would like to see comments on all the individual functions/methods describing more of why they need to exist. Most of them it's clear what they do, but I'm not really sure how they fit into the bigger picture if that makes sense. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! This is now working, for me, in almost all the ways I can think to test it in our internal analytics project. I'm also heartened by how quickly and iterative we've been able to make these last-mile improvements.
We should keep mettle-testing this, and be ready to respond to bugs that pop up during the RC period. I left a comment with one more error I managed to trigger, related to editing yaml files with tests.
The other biggest thing I'd like is to update our tracking to disambiguate between:
- Total number of files in a project
- Number of files parsed
- Whether partial parsing was enabled
- Whether partial parsing was possible (this could be inferred if total files == files parsed, but serves us well to be explicit)
The update to tracking does not need to come in the current PR, nor do I think it should block the RC. It should come ahead of the final release. We could coordinate alongside experimental-parser tracking.
else: | ||
pp_test_index[source_name] = [test_unique_id] | ||
else: | ||
tested_node_id = test_node.depends_on.nodes[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I rename a schema.yml
entry to be wrong:
seeds:
- name: wrong_name_of_seed
description: ...
columns: ... # includes tests
And then correctly rename it to be right:
seeds:
- name: correct_name_of_seed
description: ...
columns: ... # includes tests
I get:
File "/Users/jerco/dev/product/dbt/core/dbt/parser/partial.py", line 462, in get_tests_for
tested_node_id = test_node.depends_on.nodes[0]
IndexError: list index out of range
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is a little weird. We actually create those test nodes even when the seed is invalid. I was using the 'depends_on.nodes' to identify the node for which the tests were created, but that doesn't exist. I updated the code to check for the existence of test_node.depends_on.nodes, but that means those bogus test nodes will hang around. We might need a different way of determining what nodes a test node is connected to in the long run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but that means those bogus test nodes will hang around
I think this is the best option we have today. We have other code that disables tests if they depend on a node that is disabled or missing, and raises a user-facing warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the new partial parsing those nodes hang around even after the name of the seed has been fixed, because without something to connect them up to the badly named entry they won't be deleted. They'll go away on a non-pp run
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhh ok, I follow you now. We can add that our list of partial parsing nice-to-haves. It's an annoyance, but it won't prevent people from using it in the meantime
b130d76
to
c6980dd
Compare
method. Switch to using msgpack for saved manifest.
cabc1a9
to
7563b99
Compare
resolves #3217, #3292
Description
This pull request switches to using msgpack for the partial parsing saved state file, and moves the partial parsing save point to after patches are applied and refs and sources are resolved and docs rendered.
Checklist
CHANGELOG.md
and added information about my change to the "dbt next" section.