-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CT 1808 diff based partial parsing #6873
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great start to discussion! Next step: pull together a group of colleagues to talk much more about this :)
# For now, we assume that all files are in the root_project, until | ||
# we've determined whether project name will be provided or deduced | ||
# from the directory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ultimately, I think it would make sense to have a separate previous step where we resolve Project
config, for the root project and all installed packages, and can use that to determine the project name. For now, good to call this out as a known limitation of the proof-of-concept approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it turns out, the only other local files besides the root project are local dependencies, and there currently is no way to connect up a local package with its directory, and in addition local dependency "files" are generated off of where they're copied by deps. This issue has not been resolved, because it's tied up in the question about what to do about deps.
if dfy: | ||
validate_yaml(source_file.path.original_file_path, dfy) | ||
source_file.dfy = dfy | ||
# TODO: ensure we have a file object even for empty files, such as schema files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my understanding - this is defensive code in the case that we've been handed an empty yaml file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The concern with empty files is whether those providing the file would consider it a new or a changed file when an empty file becomes non-empty. It feels more consistent to me that it would be a changed file, in which case we should change the dbt-core behavior to include empty files in the files dictionary, and handle the emptiness in parsing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we leave the empty part to where we figure out what are the file diffs? Would that leads to simpler code on the parsing side?
I think it is reasonable to thing the input to parsing could be a manifest + file diff in the following format
{
"updated":{
"file1": "full content of file1",
"file2": "full content of file2",
...
},
"added":{
"file3": "full content of file3",
"file4": "full content of file4",
...
},
"deleted": ["file4", "file5"],
}
updated
and added
could also be merged into one dictionary.
From dat-core's perspective an empty file would be the same as that file doesn't exist
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the format that we're using now -- are you suggesting something different?
The empty file thing only applies to schema files, and we can't depend on it being simply an empty string -- it might have comments or whitespace, etc. I've changed not storing a schema file if it doesn't reduce to a real dictionary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I missed the fact that this is just for schema files. And the comments/whitespace are good call out!
project_root=self.all_projects[project_name].project_root, | ||
) | ||
|
||
# Now use the extension and "searched_path" to determine which file_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good callout here as well. The file extension feels like a perfectly reasonable way of determining file type, but it does start feeling a bit odd in a world without files. We might want the option of passing extension/language as an explicit argument.
A minor nit is that currently we do a checksum on the content of the file, then do a "strip" and store it. To keep the checksums from changing we should either do the checksum after the strip or not do the strip. My inclination is to do the checksum after the strip. If we start storing empty files, that would be slightly easier. |
directories (not done)
I put a little thought into how we would handle non-base project updates, and I think we have some open questions about how this will work with regard to "deps". The "deps" command takes both remote and local dependencies and copies them into the "dbt_packages" directory. Would these files have to be reflected back to the UI server which is providing the file_diff? Or would these files be "installed" only on the remote server which is accepting the file diffs? So would dependencies be local to the UI server or local to the remote dbt server? If the "deps" command is processing a local dependency, would the ui server need to know that and provide all of those files, in a different kind of interface? |
There is currently nothing to connect a local dependency directory to the project that it produces, so you can't go from the PackageSpec (which is the only place that the local dependency shows up) to the Project, or from the Project to the local dependency. This would be fairly easy to fix by either copying the package name into the package spec or the source directory into the project. But it does point out that the actual file/node objects in the manifest for a local dependency all point to the files that have been copied into dbt_packages. It would certainly be possible to parse a local_dependency in place instead, that way updates to those files in the remote repo could by synced, and we could even do a single project deps update when detecting those changes. Whether any of those possibilities are a good idea is an open question. It would also be possible, once the package spec is connected to the project (or project name, at least), to connect file updates by path to the existing file objects in the dbt_package directory. So an update to "local_dependency/models/my_model.sql" would actually update a file "local_dep://models/my_model.sql". That would automatically cause those parts of the local dependency to update, but would not capture changes to the dbt_project.yml for the dependency, which would require loading and rebuilding the Project for that dependency and if it has changed force a reparse. |
@@ -73,7 +73,6 @@ def __init__(self, saved_manifest: Manifest, new_files: MutableMapping[str, AnyS | |||
self.project_parser_files: Dict = {} | |||
self.saved_files = self.saved_manifest.files | |||
self.project_parser_files = {} | |||
self.deleted_manifest = Manifest() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the deleted_manifest because it's not necessary. Originally I thought it might be helpful for diagnosing, but it wasn't.
file_contents = load_file_contents(path.absolute_path, strip=False) | ||
source_file.checksum = FileHash.from_contents(file_contents) | ||
source_file.contents = file_contents.strip() | ||
# We strip the file_contents before generating the checksum because we want |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a change in behavior. It felt really wrong that we couldn't actually generate the checksum from the stored file contents, since it was done on the file contents prior to the contents being stripped and stored.
The alternative would have been to store the unstripped version of the contents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me.
Although we aren't moving forward with more complete plans for file diff parsing, this pull request contains some refactoring which we don't want to lose, and it can serve as a base for using file_diff parsing in our tests, as a possible performance improvement and additional proof-of-concept. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good and makes sense as far as I can tell, but I think we should clean up the unused event.
|
||
def message(self) -> str: | ||
return f"Error retrieving modification time for file {self.path}" | ||
# Skipped Z004 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look like you removed this event from types.proto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pfft. Good catch. Have removed.
file_contents = load_file_contents(path.absolute_path, strip=False) | ||
source_file.checksum = FileHash.from_contents(file_contents) | ||
source_file.contents = file_contents.strip() | ||
# We strip the file_contents before generating the checksum because we want |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me.
dbt_ignore_spec, | ||
) | ||
@dataclass | ||
class ReadFilesFromFileSystem: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just generally, more typing would have helped me understand more of this file more quickly. That said, I think the best way to capture shared behavior between classes like ReadFilesFromFileSystem and ReadFilesFromDiff is by defining a Protocol. But Protocols were introduced in Python 3.8, so we have a few more months without them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do use protocols in a few places already, so I think there are compatibility things for before 3.8.
* ct-2198: clean up some type names and uses * CT-2198: Unify constraints and constraints_check properties on columns * Make mypy version consistently 0.981 (#7134) * CT 1808 diff based partial parsing (#6873) * model contracts on models materialized as views (#7120) * first pass * rename tests * fix failing test * changelog * fix functional test * Update core/dbt/parser/base.py * Update core/dbt/parser/schemas.py * Create method for env var deprecation (#7086) * update to allow adapters to change model name resolution in py models (#7115) * update to allow adapters to change model name resolution in py models * add changie * fix newline adds * move quoting into macro * use single quotes * add env DBT_PROJECT_DIR support #6078 (#6659) Co-authored-by: Jeremy Cohen <jeremy@dbtlabs.com> * Add new index.html and changelog yaml files from dbt-docs (#7141) * Make version configs optional (#7060) * [CT-1584] New top level commands: interactive compile (#7008) Co-authored-by: Github Build Bot <buildbot@fishtownanalytics.com> * CT-2198: Add changelog entry * CT-2198: Fix tests which broke after merge * CT-2198: Add explicit validation of constraint types w/ unit test * CT-2198: Move access property, per code review * CT-2198: Remove a redundant macro * CT-1298: Rework constraints to be adapter-generated in Python code * CT-2198: Clarify function name per review --------- Co-authored-by: Gerda Shank <gerda@dbtlabs.com> Co-authored-by: Emily Rockman <emily.rockman@dbtlabs.com> Co-authored-by: Stu Kilgore <stu.kilgore@dbtlabs.com> Co-authored-by: colin-rogers-dbt <111200756+colin-rogers-dbt@users.noreply.github.com> Co-authored-by: Leo Schick <67712864+leo-schick@users.noreply.github.com> Co-authored-by: Jeremy Cohen <jeremy@dbtlabs.com> Co-authored-by: FishtownBuildBot <77737458+FishtownBuildBot@users.noreply.github.com> Co-authored-by: dave-connors-3 <73915542+dave-connors-3@users.noreply.github.com> Co-authored-by: Kshitij Aranke <kshitij.aranke@dbtlabs.com> Co-authored-by: Github Build Bot <buildbot@fishtownanalytics.com>
resolves #6592
Description
This is at the proof of concept stage, for discussion purposes. There are a number of edge cases and questions to resolve before being ready to officially review.
Checklist
changie new
to create a changelog entry