-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Support for alternative hashes (#1676) #2022
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#1920 - address comments from the previous PR
checksum_types, modchecksum.LOCAL_SUPPORTED_CHECKSUM_TYPES | ||
): | ||
return True | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not simply:
return bool(modchecksum.checksum_types_from_str(...))
?
info = self.info.get(t) | ||
if info: | ||
return info | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be:
return next((self.info[t] for t in self.remote.checksum_types() if t in self.info), None)
# or
return next(filter(bool, map(self.info.get, self.remote.checksum_types())), None)
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or
return self.info[self.checksum]
|
||
def checksum_type(self): | ||
return self._checksum_types[0] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this properties as above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also is that an error if we have more than one here? If yes than I suggest adding an assert.
@@ -275,13 +299,55 @@ def get_checksum(self, path_info): | |||
checksum = self.get_file_checksum(path_info) | |||
|
|||
if checksum: | |||
self.state.save(path_info, checksum) | |||
self.state.save(path_info, checksum, checksum_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a decorator to abstract away caching?
types = set(checksum_info.keys()) | ||
supported_types = set(self.checksum_types()) | ||
intersection = supported_types.intersection(types) | ||
if not supported_types.intersection(types): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to call .keys()
just types = set(checksum_info)
. Actually Python has optimized code path to make sets from dicts. intersection
is computed twice.
addition_check, path_info, checksum_info | ||
) | ||
) | ||
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this addition_check
only purpose to pass there self.changed_cache
? If so won't it be more clear to simply use bool parameter?
AZURE_SUPPORTED_CHECKSUM_TYPES = [CHECKSUM_ETAG] | ||
GS_SUPPORTED_CHECKSUM_TYPES = [CHECKSUM_MD5] | ||
HDFS_SUPPORTED_CHECKSUM_TYPES = [CHECKSUM_CHECKSUM] | ||
S3_SUPPORTED_CHECKSUM_TYPES = [CHECKSUM_ETAG] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this go to appropriate modules and just stay contained there? This will eliminate one level of indirection - these constants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They will go to the appropriate modules for sure 👍
@@ -207,7 +203,7 @@ def walk(self, path_info): | |||
return os.walk(path_info.path) | |||
|
|||
def get_file_checksum(self, path_info): | |||
return file_md5(path_info.path)[0] | |||
return file_checksum(path_info.path, self._checksum_types[0])[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like file_md5/file_checksum
second result is never used. Maybe now is a good time to drop it and return only hex.
Optional(PARAM_CMD): Or(str, None), | ||
Optional(PARAM_WDIR): Or(str, None), | ||
Optional(PARAM_DEPS): Or(And(list, Schema([dependency.SCHEMA])), None), | ||
Optional(PARAM_OUTS): Or(And(list, Schema([output.SCHEMA])), None), | ||
Optional(PARAM_LOCKED): bool, | ||
Optional(PARAM_META): object, | ||
} | ||
}.items(): | ||
SCHEMA[k] = v |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just:
SCHEMA.update({
Optional(PARAM_CMD): Or(str, None),
...
})
self._changed_deps(), | ||
self._changed_outs(), | ||
self._changed_checksum(), | ||
] | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using ors instead of any()
here will simplify code and provide short-circuit behavior.
v = d.get(k) | ||
if v: | ||
checksum[k] = v | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checksum = {k: d[k] for k in modchecksum.CHECKSUM_MAP if d.get(k)}
# Or if you don't expect falsy values in `d`:
checksum = {k: d[k] for k in modchecksum.CHECKSUM_MAP if k in d}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or using funcy:
from funcy import project
checksum = project(d, modchecksum.CHECKSUM_MAP)
Funcy is my lib :)
r = {} | ||
for key, value in self.checksum.items(): | ||
if value: | ||
r[key] = value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r = {k: v for k, v in self.checksum.items() if v}
Or using funcy:
from funcy import compact
r = compact(self.checksum)
if value: | ||
r[key] = value | ||
|
||
return r |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole thing is merging of two dicts with a filter:
from itertools import chain
d1 = self.checksum
d2 = {...}
return {k: v for k, v in chain(d1.items(), d2.items()) if v}
# or with funcy
from funcy import merge, compact
d1 = self.checksum
d2 = {...}
return compact(merge(d1, d2))
return None | ||
for h in checksum_types: | ||
if h not in supported_types: | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be
if not set(checksum_types) <= set(supported_types):
return None
dvc/utils/checksum.py
Outdated
ret = [] | ||
for e in d: | ||
ret.append(dict_filter(e, exclude)) | ||
return ret |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A good place to use list comprehension.
self.assertEqual( | ||
d["sha256"], | ||
"6bc1683385ab4218fed1f53195c41d95d186c085a9e929cd77f012d4f541dd34", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These look like functional tests not unit ones. Also could be easily written in new pytest style using dvc
fixture:
def test_compatibility(dvc):
ret = main(...)
assert ret == 0
# ....
@Suor thanks for reviewing this. It was started by an external contributor and am trying to get it done, I don't expect it to be fast though. I'll ping you again when it's in a better shape and some fundamental issues are solved. I'll address your comments along the way, thanks! |
rename get_checksum_type_list to checksum_types
[x ] Have you followed the guidelines in our
Contributing document?
Does your PR affect documented changes or does it add new functionality
that should be documented? If yes, have you created a PR for
dvc.org documenting it or at
least opened an issue for it? If so, please add a link to it.
Fixes #1676
Based on a fork by @vyloy .