
Move partial parsing to end of parsing. Switch to using msgpack for saved manifest. #3364

Merged
gshank merged 1 commit into develop from move_partial_parsing on Jun 2, 2021

Conversation

@gshank (Contributor) commented May 17, 2021

resolves #3217, #3292

Description

This pull request switches to using msgpack for the partial parsing saved state file, and moves the partial parsing save point to after patches are applied, refs and sources are resolved, and docs are rendered.
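
For concreteness, here is a minimal sketch of the msgpack round trip using the msgpack-python library directly. The actual PR serializes the Manifest through mashumaro; the file path and dictionary shape below are only illustrative.

import msgpack  # msgpack-python

PARTIAL_PARSE_PATH = "target/partial_parse.msgpack"  # illustrative location

def save_parse_state(state: dict) -> None:
    # Serialize the parsed state to a compact binary msgpack blob and write it out.
    with open(PARTIAL_PARSE_PATH, "wb") as fp:
        fp.write(msgpack.packb(state, use_bin_type=True))

def load_parse_state() -> dict:
    # Read the saved blob back; raw=False decodes keys and values as str, not bytes.
    with open(PARTIAL_PARSE_PATH, "rb") as fp:
        return msgpack.unpackb(fp.read(), raw=False)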

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@cla-bot cla-bot bot added the cla:yes label May 17, 2021
@gshank (Contributor, Author) commented May 17, 2021

Things that were changed to support using msgpack instead of pickle:

  • Removed MutableMapping as a superclass of BaseConfig. The Python code handling subclasses of typing.MutableMapping was buggy and problematic for serialization. There was also a bug in the implementation, so I suspect those features weren't heavily used. I did retain the methods that allow getting and setting a config attribute as a dictionary, since we had tests that required that.
  • Serialization had a hard time with the abstract base class PackageAwareCache, so I refactored that code to avoid it. In addition I renamed those classes from '...Cache' to '...Lookup' because they aren't caches.
  • The various node classes used a resource_type field metadata 'restrict' setting to force one ParsedNode class or another, but that only worked with Hologram. I had to add a mashumaro SerializableType and instantiate a particular node class based on the resource_type, since many of these node classes are not differentiated by their attributes at all and we are removing empty attributes (see the sketch after this list).
  • I had to do a similar thing for the SourceFile and SchemaSourceFile objects.
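
As a rough illustration of the SerializableType approach mentioned above, here is a minimal sketch with hypothetical ModelNode and SeedNode classes; the real dbt node classes carry many more fields, so this only shows the dispatch-on-resource_type idea.

from dataclasses import dataclass, asdict
from mashumaro.types import SerializableType

@dataclass
class BaseNode(SerializableType):
    name: str
    resource_type: str

    def _serialize(self) -> dict:
        # A plain dict is what ends up in the msgpack payload.
        return asdict(self)

    @classmethod
    def _deserialize(cls, value: dict) -> "BaseNode":
        # Choose the concrete class from resource_type, since the serialized
        # attributes alone do not distinguish the node variants.
        node_cls = NODE_CLASSES.get(value["resource_type"], BaseNode)
        return node_cls(**value)

@dataclass
class ModelNode(BaseNode):
    pass

@dataclass
class SeedNode(BaseNode):
    pass

NODE_CLASSES = {"model": ModelNode, "seed": SeedNode}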

@gshank
Copy link
Contributor Author

gshank commented May 17, 2021

Changes to support partial parsing:

  • I split SourceFile up into SourceFile and SchemaSourceFile, since they use largely different attributes. Mypy had a hard time with some of that; I did my best to make it happy but had to resort to 'Any' for 'file' in the FileBlock class.
  • I split out Schema parsing so that all non-schema files are parsed before all schema files. That enabled us to resolve node unique_ids when processing schema files and skip the separate patching step (except for sources).
  • Switched to using file_ids of the format 'my_package://models/my_model.sql' (see the sketch after this list). This positions us better for not having access to the actual filesystem.
  • Moved a bunch of source patching code from the schema parser to the source patcher, where it belongs. There is still common code in the generic test parsing codepath, but there is less confusion about which object is doing what.
  • Removed old code that supported the old method of partial parsing (parse_with_cache, sanitized_update, etc).
  • Added a 'created_at' attribute to nodes. This enables us to skip processing nodes that were created before parsing started.
  • Added log messages for cases where partial parse is not enabled due to some issue.
  • Added core/dbt/parser/partial.py with the new partial parsing code
  • Created test/unit/test_partial_parsing.py and test/integration/067_partial_parsing_tests
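
A minimal sketch of the file_id convention described above; the helper name is hypothetical and only shows the 'package://relative/path' shape.

def build_file_id(package_name: str, relative_path: str) -> str:
    # Hypothetical helper: file ids take the form 'package://relative/path',
    # so lookups keyed on them never need to touch the real filesystem.
    return f"{package_name}://{relative_path}"

# e.g. build_file_id("my_package", "models/my_model.sql")
# -> "my_package://models/my_model.sql"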

@gshank gshank requested a review from jtcohen6 May 17, 2021 18:13
@gshank gshank force-pushed the move_partial_parsing branch from e98d6ed to 2c9b90e on May 17, 2021 at 20:04
@gshank (Contributor, Author) commented May 17, 2021

Note: testing of most filetypes has not been completed, but that should mainly involve changes in core/dbt/parser/partial.py. The rest of it should be more or less stable.

@jtcohen6 (Contributor) commented May 17, 2021

@gshank I've only taken this for a brief spin so far with our internal analytics project, and it works magnificently (and very quickly) when there are no file changes.

When I do make changes and re-parse (by running the dbt parse command), I noticed two odd things:

  • Hooks seem to be parsed twice. We only have one on-run-end hook defined, but during the Manifest load step, I see a compilation error: dbt found two resources with the name "project_name-on-run-end-0"
  • Dependencies seem to be missing when re-parsing changed resource files / resource properties.

When I get around the hook error (by commenting it out + full re-parse), and I change a model file, I see a dependency-related compilation error:

Partial parsing enabled: 0 files deleted, 0 files added, 1 files changed.
Compilation Error in model model_name (path/to/model_name.sql)
  Model 'model.project_name.model_name' (models/path/to/model_name.sql) depends on a source named 'source.table' which was not found

If I do the same and change a .yml file, e.g. editing a test definition on a source table, I see:

Partial parsing enabled: 0 files deleted, 0 files added, 1 files changed.
Encountered an error:
'source.fishtown_internal_analytics.source.table'

Does that look familiar to you?

@gshank (Contributor, Author) commented May 17, 2021

The double hook thing is because I'm not preventing hook parsing when partial parsing. I'll fix that.

I'm not surprised about the source thing; I haven't tested source changes at all, only models and model sections in schema files. Will test those next.

@gshank gshank force-pushed the move_partial_parsing branch 2 times, most recently from c312f1e to 34519fa on May 17, 2021 at 22:05
@jtcohen6 (Contributor) commented May 17, 2021

Got it, makes sense. Confirmed that the hook-related error went away with the latest commit. Now if I edit just a model file (models/whatever.sql) that depends on other models (only ref, no source), I see:

Partial parsing enabled: 0 files deleted, 0 files added, 1 files changed.
Encountered an error:
'SchemaSourceFile' object has no attribute 'nodes'

Which looks like it's coming from:

  File "/Users/jerco/dev/product/dbt/core/dbt/parser/partial.py", line 182, in update_mssat_in_saved
    unique_id = old_source_file.nodes[0]

Update after reading comment below: You're right! I was testing this between some other fixes, and got my dependency versions crossed.

@gshank (Contributor, Author) commented May 17, 2021

I think you don't have Mashumaro 2.5, which is needed to distinguish between SourceFile and SchemaSourceFile. Are you just running from a branch? You probably need to do pip install -r dev-requirements.txt -r editable-requirements.txt

@gshank gshank force-pushed the move_partial_parsing branch 2 times, most recently from d030902 to 2a7d86b on May 17, 2021 at 23:45
@jtcohen6 (Contributor) commented May 18, 2021

Ok, I was able to get this working! Thanks for helping me debug the error above.

I'm going to jot down some of the things I've noticed while trying this out (in addition to the issues with sources that I noted above). My goal is to help with sketching out test cases—let me know if it would be better to note these elsewhere! I can also follow up tomorrow with feedback on the specific log messages.

  • If I disable a model that is ref'd by another model, partial parsing does not raise an error
  • If I delete a model (remove the .sql file) that is ref'd by another model, it encounters a key error:
Encountered an error:
'model.package_name.deleted_model_name'
  File "/Users/jerco/dev/product/dbt/core/dbt/parser/partial.py", line 193, in delete_node_in_saved
    node = self.saved_manifest.nodes.pop(unique_id)

In both cases, I was expecting to see the message from get_target_not_found_or_disabled_msg: Model ... depends on a node named '...' which is disabled / was not found. (I do see this error with partial parsing if I edit a ref to a bad reference.)

In all cases, it's very fast :)

One funny thing I noticed: the path_count in perf_info.json is now the number of changed files detected by partial parsing, rather than the total number of files in the project. That... makes sense! For the purposes of tracking these metrics, would it be possible to include a total file count as well (perhaps sourced from partial_parse.msgpack)? If that's too tricky, we could switch to using total resource count (models + tests + the rest) as our measure of project size.

@gshank (Contributor, Author) commented May 18, 2021

Good catch on the refs. Will have to look for those and re-schedule them for parsing.

We could do a total file count too, probably by looping through the files dictionary and counting them. At the end of parsing, maybe.
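
Something like this minimal sketch would do it (attribute names assumed):

def total_project_file_count(manifest) -> int:
    # Assuming manifest.files is the mapping of file_id -> source file objects,
    # the project-wide file count is just the size of that mapping, independent
    # of how many files partial parsing actually re-read.
    return len(manifest.files)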

@jtcohen6 (Contributor) left a comment:

Thanks for the last few fixes! I think there's still some testing to do around deleted/disabled models, so I'm glad it's on the radar.

This is really coming along in local testing. The biggest outstanding issue is with sources. If I edit a resource that depends on a source (whether a model's .sql or associated yaml properties), I get a missing target error when I run dbt parse:

Model 'model.fishtown_internal_analytics.stg_slack__users' (models/staging/slack/stg_slack__users.sql) depends on a source named 'slack.users' which was not found

I also see that error if I edit unrelated files but then run a more involved task, e.g. dbt run (or even dbt list). I know source dependency checking is very much front of mind; I'd say it's the biggest blocker to more extensive testing right now.

Other, smaller things:

  • If I edit a docs block, I get dbt found two resources with the name "<docs block name>", which seems similar to the error we saw previously with hook parsing.
  • If I edit the depends_on inside an exposure, I get a key error that's just the name of the exposure, with the line below in the stacktrace. (If I change any other exposure properties, we're in okay shape, and if I add a net-new one, that's ok too, but I also see this error if I delete an existing exposure entirely, since that obviously involves editing depends_on.)

File ".../dbt/core/dbt/parser/partial.py", line 557, in delete_schema_exposure
    exposure = self.saved_manifest.exposures[unique_id])


@@ -442,12 +477,13 @@ class ParsedMacro(UnparsedBaseNode, HasUniqueID):
docs: Docs = field(default_factory=Docs)
patch_path: Optional[str] = None
arguments: List[MacroArgument] = field(default_factory=list)
created_at: int = field(default_factory=lambda: int(time.time()))
Contributor:

This is epoch time, right? No reason to expect any wackiness around timezone / system differences? I ask only because I've been burned before, as someone who used to do a lot of analytics work :)

Contributor (Author):

It's epoch time, because it's a lot more straightforward to do a simple "later than" test. It's UTC by definition. I think the only issue would be if someone created a partial parse saved file on a server with a seriously messed-up clock.
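
A minimal sketch of that "later than" check, with assumed attribute names:

import time

parse_start = int(time.time())  # seconds since the Unix epoch; UTC by definition

def created_during_this_parse(node) -> bool:
    # Plain integer comparison: no timezone or DST handling to get wrong.
    return node.created_at >= parse_start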

'project hash mismatch: values missing, cache invalidated: {}'
.format(missing_keys)
)
logger.info("Unable to do partial parsing because a project dependency has been added")
Contributor:

So glad to have info messages like this one!

Would it be possible (and simple) to do a full comparison between the project_hashes, rather than just comparing k in self.manifest?

Example: if I install a package, parse my project, and then manually delete the installed package files from dbt_modules, I'll see a key error for the now-missing file (e.g. 'dbt_utils://macros/cross_db_utils/intersect.sql'). This is definitely a corner case, since users almost never manually remove installed modules; they're much more likely to run dbt clean (which also removes partial_parse.msgpack).
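
A sketch of what a full comparison might look like, with hypothetical helper and argument names:

def project_hashes_match(saved: dict, current: dict) -> bool:
    # Invalidate the cache if any key is missing on either side or any hash
    # value differs, instead of only checking that the saved keys still exist.
    missing = saved.keys() - current.keys()
    added = current.keys() - saved.keys()
    changed = {k for k in saved.keys() & current.keys() if saved[k] != current[k]}
    return not (missing or added or changed)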

core/dbt/parser/manifest.py (resolved review thread)
@nathaniel-may (Contributor) commented:

This is taking me a long time to go through, and I'm not sure I have a full grasp of the edges of this change since it's so big. I would like to see comments on all the individual functions/methods describing more of why they need to exist. For most of them it's clear what they do, but I'm not really sure how they fit into the bigger picture, if that makes sense.

@jtcohen6 (Contributor) left a comment:

Nice! This is now working, for me, in almost all the ways I can think to test it in our internal analytics project. I'm also heartened by how quickly and iteratively we've been able to make these last-mile improvements.

We should keep mettle-testing this, and be ready to respond to bugs that pop up during the RC period. I left a comment with one more error I managed to trigger, related to editing yaml files with tests.

The other biggest thing I'd like is to update our tracking to disambiguate between:

  • Total number of files in a project
  • Number of files parsed
  • Whether partial parsing was enabled
  • Whether partial parsing was possible (this could be inferred if total files == files parsed, but serves us well to be explicit)

The update to tracking does not need to come in the current PR, nor do I think it should block the RC. It should come ahead of the final release. We could coordinate alongside experimental-parser tracking.

core/dbt/parser/partial.py (two outdated review threads, resolved)
else:
pp_test_index[source_name] = [test_unique_id]
else:
tested_node_id = test_node.depends_on.nodes[0]
Contributor:

If I rename a schema.yml entry to be wrong:

seeds:
  - name: wrong_name_of_seed
    description: ...
    columns: ... # includes tests

And then correctly rename it to be right:

seeds:
  - name: correct_name_of_seed
    description: ...
    columns: ... # includes tests

I get:

  File "/Users/jerco/dev/product/dbt/core/dbt/parser/partial.py", line 462, in get_tests_for
    tested_node_id = test_node.depends_on.nodes[0]
IndexError: list index out of range

Contributor (Author):

This one is a little weird. We actually create those test nodes even when the seed name is invalid. I was using 'depends_on.nodes' to identify the node for which the tests were created, but in this case it doesn't exist. I updated the code to check for the existence of test_node.depends_on.nodes, but that means those bogus test nodes will hang around. In the long run we might need a different way of determining which nodes a test node is connected to.
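
Roughly, the guard looks like this (hypothetical helper name):

def tested_node_id(test_node):
    # Guard against test nodes whose depends_on.nodes is empty (e.g. tests
    # attached to a misnamed seed): return None instead of indexing into an
    # empty list and raising IndexError.
    nodes = test_node.depends_on.nodes
    return nodes[0] if nodes else None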

Contributor:

but that means those bogus test nodes will hang around

I think this is the best option we have today. We have other code that disables tests if they depend on a node that is disabled or missing, and raises a user-facing warning.

Contributor (Author):

In the new partial parsing those nodes hang around even after the name of the seed has been fixed, because without something to connect them to the badly named entry they won't be deleted. They'll go away on a non-partial-parsing run.

Contributor:

Ahhh ok, I follow you now. We can add that to our list of partial parsing nice-to-haves. It's an annoyance, but it won't prevent people from using it in the meantime.

@gshank gshank force-pushed the move_partial_parsing branch 7 times, most recently from b130d76 to c6980dd on May 31, 2021 at 17:48
@gshank gshank temporarily deployed to Snowflake, Redshift, and Bigquery on June 2, 2021 at 19:06 (multiple deployments, all now inactive)
@gshank gshank force-pushed the move_partial_parsing branch from cabc1a9 to 7563b99 on June 2, 2021 at 19:33
@gshank gshank temporarily deployed to Postgres, Bigquery, Snowflake, and Redshift on June 2, 2021 at 19:39 (multiple deployments, all now inactive)
@gshank gshank merged commit e5c6388 into develop Jun 2, 2021
@gshank gshank deleted the move_partial_parsing branch June 2, 2021 20:02
Successfully merging this pull request may close these issues: Move the save point for partial parsing.

3 participants