
Don't cleanup when overwriting uniqueness/mistakenness/hardness runs #164

Merged: 1 commit into develop from run-cleanup-flag on Jan 14, 2024

Conversation

brimoor (Contributor) commented Dec 30, 2023

Requires voxel51/fiftyone#3978 to function.

Multiple users have wanted (and expected) to be able to run the code below to compute uniqueness on disjoint subset views of a dataset and store the uniqueness values under the same field.

Previously, this code would silently delete the existing uniqueness values because, technically, the user is overwriting a brain run with the same key (brain_key == uniqueness_field), which triggers a call to the existing run's cleanup() method, which deletes the entire uniqueness_field.

Now, cleanup() is not called when compute_uniqueness() is run multiple times with the same uniqueness_field, which allows the user to build up uniqueness values over multiple runs. The tradeoff, and the reason this was not previously allowed, is that methods like load_brain_view() will now only load the last view on which uniqueness was computed; the dataset is no longer "aware" that multiple brain runs were executed on it.
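
For illustration, a minimal sketch of this caveat, assuming the per-class loop from the example below has already been run (before its final cleanup step):

# The stored run only remembers the last view it was computed on
last_view = dataset.load_brain_view("uniqueness")

# This is just the view for the final class in the loop (e.g. "truck" for
# CIFAR-10), not the union of all the per-class views...
print(last_view.distinct("ground_truth.label"))

# ...even though every sample in the dataset has a uniqueness value
print(len(dataset.exists("uniqueness")))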

The way to have the best of both worlds would be to add a separate brain_key argument, independent of uniqueness_field, and also update the cleanup() method to only clear the uniqueness_field values when the input collection is a view rather than the full dataset (a sketch of this alternative appears after the example below). However, I think the less invasive change in this PR is sufficient for now.

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("cifar10", split="test")

model = foz.load_zoo_model("mobilenet-v2-imagenet-torch")
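# Precompute embeddings once; each per-class uniqueness run below reuses them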
dataset.compute_embeddings(model, embeddings_field="embeddings")

#
# Run uniqueness on each class individually
#

classes = dataset.distinct("ground_truth.label")
for label in classes:
    view = dataset.match(F("ground_truth.label") == label)
    fob.compute_uniqueness(view, embeddings="embeddings")

assert len(dataset) == len(dataset.exists("uniqueness"))
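# Every sample now has a uniqueness value, even though the values were
# computed class-by-class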

#
# Cleanup
#

dataset.delete_brain_run("uniqueness")
assert dataset.has_sample_field("uniqueness") is False
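
As a rough sketch only (this is not what this PR implements, and brain_key is not currently an argument of compute_uniqueness()), the "best of both worlds" approach described above might look something like this:

#
# Hypothetical API: track each per-class run under its own brain key while
# still writing values into the shared "uniqueness" field
#

for label in classes:
    view = dataset.match(F("ground_truth.label") == label)
    fob.compute_uniqueness(
        view,
        uniqueness_field="uniqueness",
        embeddings="embeddings",
        brain_key=f"uniqueness_{label}",  # hypothetical argument
    )

# Each per-class run could then be loaded or deleted independently, and a
# view-aware cleanup() would only clear values for the samples in that view
# dataset.load_brain_view("uniqueness_cat")
# dataset.delete_brain_run("uniqueness_cat")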

brimoor added the enhancement label Dec 30, 2023
brimoor requested a review from a team Dec 30, 2023
brimoor self-assigned this Dec 30, 2023
brimoor merged commit fc0d7a7 into develop Jan 14, 2024
2 of 4 checks passed
brimoor deleted the run-cleanup-flag branch Jan 14, 2024
kaixi-wang (Contributor) commented:

Isn't the uniqueness value dependent on the input set of samples, since it's a relative measure of the distance between samples? So don't you have to delete previous values, because adding or removing samples may change the values for all of them?

brimoor (Contributor, Author) commented Jan 15, 2024

Yes, the values are dependent on the input samples, but users seem to prefer being able to run it on disjoint subsets (e.g. each class individually), knowing that the values cannot necessarily be compared between classes.
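
For example (a minimal sketch, assuming the per-class runs from the PR description have been performed and the field has not yet been cleaned up), the values are best compared within a single class view rather than across the whole dataset:

# Rank samples by uniqueness within one class only
cats = dataset.match(F("ground_truth.label") == "cat")
most_unique_cats = cats.sort_by("uniqueness", reverse=True)
print(most_unique_cats.first().uniqueness)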

brimoor mentioned this pull request Jan 16, 2024