
Feat task 0702 find enhancements #1072

Merged
merged 6 commits into from
Mar 8, 2024

Conversation

serengil
Owner

@serengil serengil commented Mar 8, 2024

Tickets

What has been done

With this PR,

1- Some improvements in the find function: detect replaced images and store column names in the pickle.
2- Do not discard upper-case links anymore.
3- Support pre-calculated embeddings in the verify function (see the sketch below).
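
For item 3, a minimal sketch of how pre-calculated embeddings might be passed to verify (assuming verify now accepts an embedding list in place of an image path; the model name and interface details are illustrative, not confirmed by this excerpt):

from deepface import DeepFace

# represent returns a list of dicts, each carrying an "embedding" key
img1_embedding = DeepFace.represent(img_path="img1.jpg", model_name="Facenet")[0]["embedding"]
img2_embedding = DeepFace.represent(img_path="img2.jpg", model_name="Facenet")[0]["embedding"]

# assumption: the embedding lists can be fed directly instead of image paths
result = DeepFace.verify(
    img1_path=img1_embedding,
    img2_path=img2_embedding,
    model_name="Facenet",
)
print(result["verified"])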

How to test

make lint && make test

        hash (str): digest with sha1 algorithm
    """
    with open(file_path, "rb") as f:
        digest = hashlib.sha1(f.read()).hexdigest()
Contributor

Imo this is quite dangerous.
It attempts to read a whole image file into memory and eventually passes the entire content to the hashing function. In case of (maybe malicious) extremely large image files, it could cause OOM (also depending on available memory).
Maybe it is better to read and hash the file in chunks.
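
For illustration, a chunked variant along those lines (a minimal sketch; the 64 KB chunk size is arbitrary and not taken from the PR):

import hashlib

def find_hash_of_file(file_path: str, chunk_size: int = 65536) -> str:
    # feed the file to sha1 in fixed-size chunks so it never sits fully in memory
    sha1 = hashlib.sha1()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha1.update(chunk)
    return sha1.hexdigest()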

Owner Author

Do you recommend doing something like this? https://stackoverflow.com/a/64994148/7846405

Contributor

Exactly.
As a bonus for safety, I would include a configurable filter in the dataset traversal to exclude files above a certain size threshold: other parts of deepface load the whole file into memory with cv.imread() anyway.
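
A size filter of that kind could be as simple as the sketch below (the 10 MB limit and helper name are illustrative, not part of the PR):

import os

MAX_FILE_SIZE_BYTES = 10 * 1024 * 1024  # assumed threshold, should be configurable

def has_acceptable_size(file_path: str) -> bool:
    # skip files above the threshold before they are ever read into memory
    return os.path.getsize(file_path) <= MAX_FILE_SIZE_BYTES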

Owner Author

Yeah, having a file size threshold is wise. But I will use file properties instead of the file itself for hashing.

@serengil serengil deleted the feat-task-0702-find-enhancements branch March 8, 2024 20:24
Comment on lines +155 to +159
alpha_hash = current_representation["hash"]
beta_hash = package_utils.find_hash_of_file(identity)
if alpha_hash != beta_hash:
    logger.debug(f"Even though {identity} represented before, it's replaced later.")
    replaced_images.append(identity)
Contributor

This is a bit expensive.
You compute the hash of a file here, which is eventually discarded.
Then the file with the unmatched hash is added to the list of new files, which are passed to _find_bulk_embeddings, where the hash of the file is recomputed once again.
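
One way to avoid the double computation (a sketch only, not what this PR does) would be to keep the freshly computed hash next to the identity so _find_bulk_embeddings can reuse it:

# assumption: store (identity, hash) pairs instead of bare paths,
# then have _find_bulk_embeddings accept the precomputed hash
replaced_images.append((identity, beta_hash))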

Contributor

@AndreaLanfranchi AndreaLanfranchi Mar 8, 2024

I suggest, instead of hashing the whole content of the file, hashing only its most immediate (and lighter) properties: name, creation timestamp, last modification timestamp, and size.
It is quite unlikely that a completely different image overwrites the original one while keeping the very same attributes.
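
Such a properties-based hash might look like the following sketch (an assumption about the follow-up change, not code from this PR):

import hashlib
import os

def find_hash_of_file(file_path: str) -> str:
    # hash cheap file metadata (path, timestamps, size) instead of the full content
    stats = os.stat(file_path)
    properties = f"{file_path}-{stats.st_ctime}-{stats.st_mtime}-{stats.st_size}"
    return hashlib.sha1(properties.encode("utf-8")).hexdigest()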

Owner Author

I like that approach and am convinced. Will do it in another PR soon.

            expand_percentage=expand_percentage,
        )
    except ValueError as err:
        raise ValueError("Exception while processing img1_path") from err

