
Move partial parsing to end of parsing. Switch to using msgpack for saved manifest. #3364

Merged
gshank merged 1 commit into develop from move_partial_parsing on Jun 2, 2021

Conversation

@gshank (Contributor) commented May 17, 2021

resolves #3217, #3292

Description

This pull request switches to using msgpack for the partial parsing saved state file, and moves the partial parsing save point to after patches are applied, refs and sources are resolved, and docs are rendered.
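
For concreteness, here is a minimal sketch of the msgpack round trip using the msgpack-python library directly. The actual PR serializes the Manifest through mashumaro; the file path and dictionary shape below are only illustrative.

import msgpack  # msgpack-python

PARTIAL_PARSE_PATH = "target/partial_parse.msgpack"  # illustrative location

def save_parse_state(state: dict) -> None:
    # Serialize the parsed state to a compact binary msgpack blob and write it out.
    with open(PARTIAL_PARSE_PATH, "wb") as fp:
        fp.write(msgpack.packb(state, use_bin_type=True))

def load_parse_state() -> dict:
    # Read the saved blob back; raw=False decodes keys and values as str, not bytes.
    with open(PARTIAL_PARSE_PATH, "rb") as fp:
        return msgpack.unpackb(fp.read(), raw=False)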

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@cla-bot cla-bot bot added the cla:yes label May 17, 2021
@gshank (Contributor, Author) commented May 17, 2021

Things that were changed to support using msgpack instead of pickle:

  • Removed MutableMapping as a superclass of BaseConfig. The Python code handling subclasses of typing.MutableMapping was buggy and problematic for serialization. There was also a bug in the implementation, so I suspect those features weren't heavily used. I did retain the methods that allow getting and setting a config attribute as a dictionary, since we had tests that required that.
  • Serialization had a hard time with the abstract base class PackageAwareCache, so I refactored that code to avoid it. In addition I renamed those classes from '...Cache' to '...Lookup' because they aren't caches.
  • The various node classes used a resource_type field metadata 'restrict' setting to force one ParsedNode class or another, but that only worked with Hologram. I had to add a mashumaro SerializableType and instantiate a particular node class based on the resource_type, since many of these node classes are not differentiated by their attributes at all and we are removing empty attributes (see the sketch after this list).
  • I had to do a similar thing for the SourceFile and SchemaSourceFile objects.
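
As a rough illustration of the SerializableType approach mentioned above, here is a minimal sketch with hypothetical ModelNode and SeedNode classes; the real dbt node classes carry many more fields, so this only shows the dispatch-on-resource_type idea.

from dataclasses import dataclass, asdict
from mashumaro.types import SerializableType

@dataclass
class BaseNode(SerializableType):
    name: str
    resource_type: str

    def _serialize(self) -> dict:
        # A plain dict is what ends up in the msgpack payload.
        return asdict(self)

    @classmethod
    def _deserialize(cls, value: dict) -> "BaseNode":
        # Choose the concrete class from resource_type, since the serialized
        # attributes alone do not distinguish the node variants.
        node_cls = NODE_CLASSES.get(value["resource_type"], BaseNode)
        return node_cls(**value)

@dataclass
class ModelNode(BaseNode):
    pass

@dataclass
class SeedNode(BaseNode):
    pass

NODE_CLASSES = {"model": ModelNode, "seed": SeedNode}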

@gshank
Copy link
Contributor Author

gshank commented May 17, 2021

Changes to support partial parsing:

  • I split SourceFile up into SourceFile and SchemaSourceFile, since they use largely different attributes. Mypy had a hard time with some of that; I did my best to make it happy but had to resort to 'Any' for 'file' in the FileBlock class.
  • I split out Schema parsing so that all non-schema files are parsed before all schema files. That enabled us to resolve node unique_ids when processing schema files and skip the separate patching step (except for sources).
  • Switched to using file_ids of the format 'my_package://models/my_model.sql' (see the sketch after this list). This positions us better for not having access to the actual filesystem.
  • Moved a bunch of source patching code from the schema parser to the source patcher, where it belongs. There is still common code in the generic test parsing codepath, but there is less confusion about which object is doing what.
  • Removed old code that supported the old method of partial parsing (parse_with_cache, sanitized_update, etc).
  • Added a 'created_at' attribute to nodes. This enables us to skip processing nodes that were created before parsing started.
  • Added log messages for cases where partial parse is not enabled due to some issue.
  • Added core/dbt/parser/partial.py with the new partial parsing code
  • Created test/unit/test_partial_parsing.py and test/integration/067_partial_parsing_tests
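
A minimal sketch of the file_id convention described above; the helper name is hypothetical and only shows the 'package://relative/path' shape.

def build_file_id(package_name: str, relative_path: str) -> str:
    # Hypothetical helper: file ids take the form 'package://relative/path',
    # so lookups keyed on them never need to touch the real filesystem.
    return f"{package_name}://{relative_path}"

# e.g. build_file_id("my_package", "models/my_model.sql")
# -> "my_package://models/my_model.sql"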

@gshank gshank requested a review from jtcohen6 May 17, 2021 18:13
@gshank gshank force-pushed the move_partial_parsing branch from e98d6ed to 2c9b90e on May 17, 2021 at 20:04
@gshank (Contributor, Author) commented May 17, 2021

Note: testing of most filetypes has not been completed, but that should mainly involve changes in core/dbt/parser/partial.py. The rest of it should be more or less stable.

@jtcohen6 (Contributor) commented May 17, 2021

@gshank I've only taken this for a brief spin so far with our internal analytics project, and it works magnificently (and very quickly) when there are no file changes.

When I do make changes and re-parse (by running the dbt parse command), I noticed two odd things:

  • Hooks seem to be parsed twice. We only have one on-run-end hook defined, but during the Manifest load step, I see a compilation error: dbt found two resources with the name "project_name-on-run-end-0"
  • Dependencies seem to be missing when re-parsing changed resource files / resource properties.

When I get around the hook error (by commenting it out + full re-parse), and I change a model file, I see a dependency-related compilation error:

Partial parsing enabled: 0 files deleted, 0 files added, 1 files changed.
Compilation Error in model model_name (path/to/model_name.sql)
  Model 'model.project_name.model_name' (models/path/to/model_name.sql) depends on a source named 'source.table' which was not found

If I do the same and change a .yml file, e.g. editing a test definition on a source table, I see:

Partial parsing enabled: 0 files deleted, 0 files added, 1 files changed.
Encountered an error:
'source.fishtown_internal_analytics.source.table'

Does that look familiar to you?

@gshank (Contributor, Author) commented May 17, 2021

The double hook thing is because I'm not preventing hook parsing when partial parsing. I'll fix that.

I'm not surprised about the source thing; I haven't tested source changes at all, only models and model sections in schema files. Will test those next.

@gshank gshank force-pushed the move_partial_parsing branch 2 times, most recently from c312f1e to 34519fa on May 17, 2021 at 22:05
@jtcohen6 (Contributor) commented May 17, 2021

Got it, makes sense. Confirmed that the hook-related error went away with the latest commit. Now if I edit just a model file (models/whatever.sql) that depends on other models (only ref, no source), I see:

Partial parsing enabled: 0 files deleted, 0 files added, 1 files changed.
Encountered an error:
'SchemaSourceFile' object has no attribute 'nodes'

Which looks like it's coming from:

  File "/Users/jerco/dev/product/dbt/core/dbt/parser/partial.py", line 182, in update_mssat_in_saved
    unique_id = old_source_file.nodes[0]

Update after reading comment below: You're right! I was testing this between some other fixes, and got my dependency versions crossed.

@gshank (Contributor, Author) commented May 17, 2021

I think you don't have Mashumaro 2.5, which is needed to distinguish between SourceFile and SchemaSourceFile. Are you just running from a branch? You probably need to do pip install -r dev-requirements.txt -r editable-requirements.txt

@gshank gshank force-pushed the move_partial_parsing branch 2 times, most recently from d030902 to 2a7d86b on May 17, 2021 at 23:45
@jtcohen6 (Contributor) commented May 18, 2021

Ok, I was able to get this working! Thanks for helping me debug the error above.

I'm going to jot down some of the things I've noticed while trying this out (in addition to the issues with sources that I noted above). My goal is to help with sketching out test cases—let me know if it would be better to note these elsewhere! I can also follow up tomorrow with feedback on the specific log messages.

  • If I disable a model that is ref'd by another model, partial parsing does not raise an error
  • If I delete a model (remove the .sql file) that is ref'd by another model, it encounters a key error:
Encountered an error:
'model.package_name.deleted_model_name'
  File "/Users/jerco/dev/product/dbt/core/dbt/parser/partial.py", line 193, in delete_node_in_saved
    node = self.saved_manifest.nodes.pop(unique_id)

In both cases, I was expecting to see the message from get_target_not_found_or_disabled_msg: Model ... depends on a node named '...' which is disabled / was not found. (I do see this error with partial parsing if I edit a ref to a bad reference.)

In all cases, it's very fast :)

One funny thing I noticed: the path_count in perf_info.json is now the number of changed files detected by partial parsing, rather than the total number of files in the project. That... makes sense! For the purposes of tracking these metrics, would it be possible to include a total file count as well (perhaps sourced from partial_parse.msgpack)? If that's too tricky, we could switch to using total resource count (models + tests + the rest) as our measure of project size.

@gshank (Contributor, Author) commented May 18, 2021

Good catch on the refs. Will have to look for those and re-schedule them for parsing.

We could do a total file count too, probably by looping through the files dictionary and counting them. At the end of parsing, maybe.
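
Something like this minimal sketch would do it (attribute names assumed):

def total_project_file_count(manifest) -> int:
    # Assuming manifest.files is the mapping of file_id -> source file objects,
    # the project-wide file count is just the size of that mapping, independent
    # of how many files partial parsing actually re-read.
    return len(manifest.files)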

@jtcohen6 (Contributor) left a comment:

Thanks for the last few fixes! I think there's still some testing to do around deleted/disabled models, so I'm glad it's on the radar.

This is really coming along in local testing. The biggest outstanding issue is with sources. If I edit a resource that depends on a source (whether a model's .sql or associated yaml properties), I get a missing target error when I run dbt parse:

Model 'model.fishtown_internal_analytics.stg_slack__users' (models/staging/slack/stg_slack__users.sql) depends on a source named 'slack.users' which was not found

I also see that error if I edit unrelated files but then run a more involved task, e.g. dbt run (or even dbt list). I know source dependency checking is very much front of mind; I'd say it's the biggest blocker to more extensive testing right now.

Other, smaller things:

  • If I edit a docs block, I get dbt found two resources with the name "<docs block name>", which seems similar to the error we saw previously with hook parsing.
  • If I edit the depends_on inside an exposure, I get a key error that's just the name of the exposure, with the line below in the stacktrace. (If I change any other exposure properties, we're in okay shape, and if I add a net-new one, that's ok too, but I also see this error if I delete an existing exposure entirely, since that obviously involves editing depends_on.)

File ".../dbt/core/dbt/parser/partial.py", line 557, in delete_schema_exposure
    exposure = self.saved_manifest.exposures[unique_id])


@@ -442,12 +477,13 @@ class ParsedMacro(UnparsedBaseNode, HasUniqueID):
docs: Docs = field(default_factory=Docs)
patch_path: Optional[str] = None
arguments: List[MacroArgument] = field(default_factory=list)
created_at: int = field(default_factory=lambda: int(time.time()))
Contributor:

This is epoch time, right? No reason to expect any wackiness around timezone / system differences? I ask only because I've been burned before, as someone who used to do a lot of analytics work :)

Contributor (Author):

It's epoch time, because it's a lot more straightforward to do a simple "later than" test. It's UTC by definition. I think the only issue would be if someone created a partial parse saved file on a server with a seriously messed-up clock.
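
A minimal sketch of that "later than" check, with assumed attribute names:

import time

parse_start = int(time.time())  # seconds since the Unix epoch; UTC by definition

def created_during_this_parse(node) -> bool:
    # Plain integer comparison: no timezone or DST handling to get wrong.
    return node.created_at >= parse_start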

'project hash mismatch: values missing, cache invalidated: {}'
.format(missing_keys)
)
logger.info("Unable to do partial parsing because a project dependency has been added")
Contributor:

So glad to have info messages like this one!

Would it be possible (and simple) to do a full comparison between the project_hashes, rather than just comparing k in self.manifest?

Example: if I install a package, parse my project, and then manually delete the installed package files from dbt_modules, I'll see a key error for the now-missing file (e.g. 'dbt_utils://macros/cross_db_utils/intersect.sql'). This is definitely a corner case, since users almost never manually remove installed modules; they're much more likely to run dbt clean (which also removes partial_parse.msgpack).
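
A sketch of what a full comparison might look like, with hypothetical helper and argument names:

def project_hashes_match(saved: dict, current: dict) -> bool:
    # Invalidate the cache if any key is missing on either side or any hash
    # value differs, instead of only checking that the saved keys still exist.
    missing = saved.keys() - current.keys()
    added = current.keys() - saved.keys()
    changed = {k for k in saved.keys() & current.keys() if saved[k] != current[k]}
    return not (missing or added or changed)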

core/dbt/parser/manifest.py (resolved review thread)
@nathaniel-may (Contributor) commented:

This is taking me a long time to go through, and I'm not sure I have a full grasp of the edges of this change since it's so big. I would like to see comments on all the individual functions/methods describing more of why they need to exist. For most of them it's clear what they do, but I'm not really sure how they fit into the bigger picture, if that makes sense.

@jtcohen6 (Contributor) left a comment:

Nice! This is now working, for me, in almost all the ways I can think to test it in our internal analytics project. I'm also heartened by how quickly and iteratively we've been able to make these last-mile improvements.

We should keep mettle-testing this, and be ready to respond to bugs that pop up during the RC period. I left a comment with one more error I managed to trigger, related to editing yaml files with tests.

The other biggest thing I'd like is to update our tracking to disambiguate between:

  • Total number of files in a project
  • Number of files parsed
  • Whether partial parsing was enabled
  • Whether partial parsing was possible (this could be inferred if total files == files parsed, but serves us well to be explicit)

The update to tracking does not need to come in the current PR, nor do I think it should block the RC. It should come ahead of the final release. We could coordinate alongside experimental-parser tracking.

core/dbt/parser/partial.py (two outdated review threads, resolved)
else:
pp_test_index[source_name] = [test_unique_id]
else:
tested_node_id = test_node.depends_on.nodes[0]
Contributor:

If I rename a schema.yml entry to be wrong:

seeds:
  - name: wrong_name_of_seed
    description: ...
    columns: ... # includes tests

And then correctly rename it to be right:

seeds:
  - name: correct_name_of_seed
    description: ...
    columns: ... # includes tests

I get:

  File "/Users/jerco/dev/product/dbt/core/dbt/parser/partial.py", line 462, in get_tests_for
    tested_node_id = test_node.depends_on.nodes[0]
IndexError: list index out of range

Contributor (Author):

This one is a little weird. We actually create those test nodes even when the seed name is invalid. I was using 'depends_on.nodes' to identify the node for which the tests were created, but in this case it doesn't exist. I updated the code to check for the existence of test_node.depends_on.nodes, but that means those bogus test nodes will hang around. In the long run we might need a different way of determining which nodes a test node is connected to.
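
Roughly, the guard looks like this (hypothetical helper name):

def tested_node_id(test_node):
    # Guard against test nodes whose depends_on.nodes is empty (e.g. tests
    # attached to a misnamed seed): return None instead of indexing into an
    # empty list and raising IndexError.
    nodes = test_node.depends_on.nodes
    return nodes[0] if nodes else None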

Contributor:

but that means those bogus test nodes will hang around

I think this is the best option we have today. We have other code that disables tests if they depend on a node that is disabled or missing, and raises a user-facing warning.

Contributor (Author):

In the new partial parsing those nodes hang around even after the name of the seed has been fixed, because without something to connect them to the badly named entry they won't be deleted. They'll go away on a non-partial-parsing run.

Contributor:

Ahhh ok, I follow you now. We can add that to our list of partial parsing nice-to-haves. It's an annoyance, but it won't prevent people from using it in the meantime.

@gshank gshank force-pushed the move_partial_parsing branch 7 times, most recently from b130d76 to c6980dd on May 31, 2021 at 17:48
@gshank gshank temporarily deployed to Snowflake, Redshift, and Bigquery on June 2, 2021 at 19:06 (multiple deployments, all now inactive)
@gshank gshank force-pushed the move_partial_parsing branch from cabc1a9 to 7563b99 on June 2, 2021 at 19:33
@gshank gshank temporarily deployed to Postgres, Bigquery, Snowflake, and Redshift on June 2, 2021 at 19:39 (multiple deployments, all now inactive)
@gshank gshank merged commit e5c6388 into develop Jun 2, 2021
@gshank gshank deleted the move_partial_parsing branch June 2, 2021 20:02
Successfully merging this pull request may close these issues: Move the save point for partial parsing.

3 participants