Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CT 1808 diff based partial parsing #6873

Merged
merged 19 commits into from
Mar 7, 2023
Merged

Conversation

gshank
Copy link
Contributor

@gshank gshank commented Feb 6, 2023

resolves #6592

Description

This is at the proof of concept stage, for discussion purposes. There are a number of edge cases and questions to resolve before being ready to officially review.

Checklist

@gshank gshank requested review from a team as code owners February 6, 2023 13:53
@gshank gshank requested review from emmyoop and Fleid February 6, 2023 13:53
@cla-bot cla-bot bot added the cla:yes label Feb 6, 2023
@gshank gshank marked this pull request as draft February 6, 2023 13:54
Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great start to discussion! Next step: pull together a group of colleagues to talk much more about this :)

Comment on lines +241 to +243
# For now, we assume that all files are in the root_project, until
# we've determined whether project name will be provided or deduced
# from the directory.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ultimately, I think it would make sense to have a separate previous step where we resolve Project config, for the root project and all installed packages, and can use that to determine the project name. For now, good to call this out as a known limitation of the proof-of-concept approach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it turns out, the only other local files besides the root project are local dependencies, and there currently is no way to connect up a local package with its directory, and in addition local dependency "files" are generated off of where they're copied by deps. This issue has not been resolved, because it's tied up in the question about what to do about deps.

if dfy:
validate_yaml(source_file.path.original_file_path, dfy)
source_file.dfy = dfy
# TODO: ensure we have a file object even for empty files, such as schema files
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my understanding - this is defensive code in the case that we've been handed an empty yaml file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concern with empty files is whether those providing the file would consider it a new or a changed file when an empty file becomes non-empty. It feels more consistent to me that it would be a changed file, in which case we should change the dbt-core behavior to include empty files in the files dictionary, and handle the emptiness in parsing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we leave the empty part to where we figure out what are the file diffs? Would that leads to simpler code on the parsing side?
I think it is reasonable to thing the input to parsing could be a manifest + file diff in the following format

{
  "updated":{
    "file1": "full content of file1",
    "file2": "full content of file2",
    ...
  },
  "added":{
    "file3": "full content of file3",
    "file4": "full content of file4",
    ...
  },
 "deleted": ["file4", "file5"],
}

updated and added could also be merged into one dictionary.

From dat-core's perspective an empty file would be the same as that file doesn't exist

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the format that we're using now -- are you suggesting something different?

The empty file thing only applies to schema files, and we can't depend on it being simply an empty string -- it might have comments or whitespace, etc. I've changed not storing a schema file if it doesn't reduce to a real dictionary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I missed the fact that this is just for schema files. And the comments/whitespace are good call out!

project_root=self.all_projects[project_name].project_root,
)

# Now use the extension and "searched_path" to determine which file_type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good callout here as well. The file extension feels like a perfectly reasonable way of determining file type, but it does start feeling a bit odd in a world without files. We might want the option of passing extension/language as an explicit argument.

@gshank
Copy link
Contributor Author

gshank commented Feb 7, 2023

A minor nit is that currently we do a checksum on the content of the file, then do a "strip" and store it. To keep the checksums from changing we should either do the checksum after the strip or not do the strip. My inclination is to do the checksum after the strip. If we start storing empty files, that would be slightly easier.

@gshank
Copy link
Contributor Author

gshank commented Feb 8, 2023

I put a little thought into how we would handle non-base project updates, and I think we have some open questions about how this will work with regard to "deps". The "deps" command takes both remote and local dependencies and copies them into the "dbt_packages" directory. Would these files have to be reflected back to the UI server which is providing the file_diff? Or would these files be "installed" only on the remote server which is accepting the file diffs? So would dependencies be local to the UI server or local to the remote dbt server? If the "deps" command is processing a local dependency, would the ui server need to know that and provide all of those files, in a different kind of interface?

@gshank
Copy link
Contributor Author

gshank commented Feb 9, 2023

There is currently nothing to connect a local dependency directory to the project that it produces, so you can't go from the PackageSpec (which is the only place that the local dependency shows up) to the Project, or from the Project to the local dependency. This would be fairly easy to fix by either copying the package name into the package spec or the source directory into the project.

But it does point out that the actual file/node objects in the manifest for a local dependency all point to the files that have been copied into dbt_packages. It would certainly be possible to parse a local_dependency in place instead, that way updates to those files in the remote repo could by synced, and we could even do a single project deps update when detecting those changes. Whether any of those possibilities are a good idea is an open question.

It would also be possible, once the package spec is connected to the project (or project name, at least), to connect file updates by path to the existing file objects in the dbt_package directory. So an update to "local_dependency/models/my_model.sql" would actually update a file "local_dep://models/my_model.sql". That would automatically cause those parts of the local dependency to update, but would not capture changes to the dbt_project.yml for the dependency, which would require loading and rebuilding the Project for that dependency and if it has changed force a reparse.

@@ -73,7 +73,6 @@ def __init__(self, saved_manifest: Manifest, new_files: MutableMapping[str, AnyS
self.project_parser_files: Dict = {}
self.saved_files = self.saved_manifest.files
self.project_parser_files = {}
self.deleted_manifest = Manifest()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the deleted_manifest because it's not necessary. Originally I thought it might be helpful for diagnosing, but it wasn't.

file_contents = load_file_contents(path.absolute_path, strip=False)
source_file.checksum = FileHash.from_contents(file_contents)
source_file.contents = file_contents.strip()
# We strip the file_contents before generating the checksum because we want
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a change in behavior. It felt really wrong that we couldn't actually generate the checksum from the stored file contents, since it was done on the file contents prior to the contents being stripped and stored.

The alternative would have been to store the unstripped version of the contents.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me.

@gshank gshank marked this pull request as ready for review March 4, 2023 15:04
@gshank gshank requested review from peterallenwebb and MichelleArk and removed request for Fleid and lostmygithubaccount March 4, 2023 15:05
@gshank
Copy link
Contributor Author

gshank commented Mar 4, 2023

Although we aren't moving forward with more complete plans for file diff parsing, this pull request contains some refactoring which we don't want to lose, and it can serve as a base for using file_diff parsing in our tests, as a possible performance improvement and additional proof-of-concept.

Copy link
Contributor

@peterallenwebb peterallenwebb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good and makes sense as far as I can tell, but I think we should clean up the unused event.


def message(self) -> str:
return f"Error retrieving modification time for file {self.path}"
# Skipped Z004
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like you removed this event from types.proto.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pfft. Good catch. Have removed.

file_contents = load_file_contents(path.absolute_path, strip=False)
source_file.checksum = FileHash.from_contents(file_contents)
source_file.contents = file_contents.strip()
# We strip the file_contents before generating the checksum because we want
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me.

dbt_ignore_spec,
)
@dataclass
class ReadFilesFromFileSystem:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just generally, more typing would have helped me understand more of this file more quickly. That said, I think the best way to capture shared behavior between classes like ReadFilesFromFileSystem and ReadFilesFromDiff is by defining a Protocol. But Protocols were introduced in Python 3.8, so we have a few more months without them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do use protocols in a few places already, so I think there are compatibility things for before 3.8.

@gshank gshank merged commit 8d98752 into main Mar 7, 2023
@gshank gshank deleted the ct-1808-diff_based_partial_parsing branch March 7, 2023 21:37
peterallenwebb added a commit that referenced this pull request Mar 22, 2023
* ct-2198: clean up some type names and uses

* CT-2198: Unify constraints and constraints_check properties on columns

* Make mypy version consistently 0.981 (#7134)

* CT 1808 diff based partial parsing (#6873)

* model contracts on models materialized as views (#7120)

* first pass

* rename tests

* fix failing test

* changelog

* fix functional test

* Update core/dbt/parser/base.py

* Update core/dbt/parser/schemas.py

* Create method for env var deprecation (#7086)

* update to allow adapters to change model name resolution in py models (#7115)

* update to allow adapters to change model name resolution in py models

* add changie

* fix newline adds

* move quoting into macro

* use single quotes

* add env DBT_PROJECT_DIR support #6078 (#6659)

Co-authored-by: Jeremy Cohen <jeremy@dbtlabs.com>

* Add new index.html and changelog yaml files from dbt-docs (#7141)

* Make version configs optional (#7060)

* [CT-1584] New top level commands: interactive compile (#7008)

Co-authored-by: Github Build Bot <buildbot@fishtownanalytics.com>

* CT-2198: Add changelog entry

* CT-2198: Fix tests which broke after merge

* CT-2198: Add explicit validation of constraint types w/ unit test

* CT-2198: Move access property, per code review

* CT-2198: Remove a redundant macro

* CT-1298: Rework constraints to be adapter-generated in Python code

* CT-2198: Clarify function name per review

---------

Co-authored-by: Gerda Shank <gerda@dbtlabs.com>
Co-authored-by: Emily Rockman <emily.rockman@dbtlabs.com>
Co-authored-by: Stu Kilgore <stu.kilgore@dbtlabs.com>
Co-authored-by: colin-rogers-dbt <111200756+colin-rogers-dbt@users.noreply.github.com>
Co-authored-by: Leo Schick <67712864+leo-schick@users.noreply.github.com>
Co-authored-by: Jeremy Cohen <jeremy@dbtlabs.com>
Co-authored-by: FishtownBuildBot <77737458+FishtownBuildBot@users.noreply.github.com>
Co-authored-by: dave-connors-3 <73915542+dave-connors-3@users.noreply.github.com>
Co-authored-by: Kshitij Aranke <kshitij.aranke@dbtlabs.com>
Co-authored-by: Github Build Bot <buildbot@fishtownanalytics.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CT-1808] [Spike] "diff"-based entry point into partial parsing
5 participants