Feat task 0702 find enhancements #1072
Conversation
- detect replaced files already in data store
- store column names in the pickle
        hash (str): digest with sha1 algorithm
    """
    with open(file_path, "rb") as f:
        digest = hashlib.sha1(f.read()).hexdigest()
IMO this is quite dangerous.
It reads the whole image file into memory and eventually passes the entire content to the hashing function. For extremely large (possibly malicious) image files this could cause an OOM, depending on available memory.
It would be better to read and hash the file in chunks.
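For reference, a minimal sketch of the chunked approach; the standalone function shape and the 64 KB chunk size are illustrative choices, not the exact code in this PR:

```python
import hashlib

def find_hash_of_file(file_path: str, chunk_size: int = 65536) -> str:
    """Hash a file with sha1 without loading the whole content into memory."""
    sha1 = hashlib.sha1()
    with open(file_path, "rb") as f:
        # read fixed-size chunks until EOF so memory usage stays bounded
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha1.update(chunk)
    return sha1.hexdigest()
```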
Do you recommend doing something like this: https://stackoverflow.com/a/64994148/7846405
Exactly.
As a bonus for safety, I would also include a configurable filter in the dataset traversal to exclude files whose size exceeds a certain threshold: in other parts deepface loads the whole file into memory with cv.imread() anyway.
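A sketch of such a filter in the dataset traversal; the function name, the max_file_size_mb parameter, and the extension list are assumptions for illustration:

```python
import os

def list_images_under_limit(root_dir: str, max_file_size_mb: float = 10.0) -> list:
    """Collect image paths, skipping files above a configurable size threshold."""
    max_bytes = max_file_size_mb * 1024 * 1024
    images = []
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            if not name.lower().endswith((".jpg", ".jpeg", ".png")):
                continue
            path = os.path.join(dirpath, name)
            # os.path.getsize checks the size without opening the file
            if os.path.getsize(path) <= max_bytes:
                images.append(path)
    return images
```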
Yeah, having a file size threshold is wise. But I will hash file properties instead of the file content itself.
alpha_hash = current_representation["hash"]
beta_hash = package_utils.find_hash_of_file(identity)
if alpha_hash != beta_hash:
    logger.debug(f"Even though {identity} represented before, it's replaced later.")
    replaced_images.append(identity)
This is a bit expensive.
You compute the hash of a file here, which is eventually discarded.
The file with the unmatched hash is then added to the list of new files passed to _find_bulk_embeddings, where its hash is recomputed once again.
I suggest, instead of hashing the whole content of the file, hashing only its most immediate (and lighter) properties: name, creation timestamp, last modification timestamp, and size.
It is quite unlikely that a completely different image overwrites the original one while keeping the very same attributes.
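A sketch of what such a property-based hash could look like, assuming os.stat metadata is used; note that st_ctime means creation time on Windows but metadata-change time on Linux:

```python
import hashlib
import os

def find_hash_of_file_properties(file_path: str) -> str:
    """Hash cheap file metadata (path, timestamps, size) instead of the content."""
    stats = os.stat(file_path)
    properties = f"{file_path}_{stats.st_ctime}_{stats.st_mtime}_{stats.st_size}"
    return hashlib.sha1(properties.encode("utf-8")).hexdigest()
```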
I like that approach and I'm convinced. I will do it in another PR soon.
        expand_percentage=expand_percentage,
    )
except ValueError as err:
    raise ValueError("Exception while processing img1_path") from err
Tickets
- find: #1061

What has been done
With this PR:
1- some improvements were made in the find function: detect replaced images, store column names in the pickle.
2- upper case links are not discarded anymore.
3- pre-calculated embeddings are supported in the verify function (sketched below).
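For item 3, usage presumably looks like the sketch below: the embedding is computed once with DeepFace.represent and then passed to verify in place of an image path. Whether verify accepts a plain list of floats exactly like this should be confirmed against the PR diff:

```python
from deepface import DeepFace

# compute an embedding once ...
embedding_objs = DeepFace.represent(img_path="img1.jpg", model_name="Facenet")
img1_embedding = embedding_objs[0]["embedding"]

# ... and reuse it in verify instead of re-reading and re-embedding the image
result = DeepFace.verify(
    img1_path=img1_embedding,
    img2_path="img2.jpg",
    model_name="Facenet",
)
print(result["verified"])
```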
How to test