Use image retrieval techniques to find similar images #27

dlangenk · 2019-01-04T09:41:40Z

This is more of a nice-to-have.

I just browsed through the results of novelty detection. Unfortunately, the classes are quite scattered, so selection takes some time. In addition, some classes are much more abundant than others, so the rare classes might get "lost" in the downstream steps. It would be nice to have a "show me more thumbnails that look like this one" mechanism. Algorithms for that are available from image retrieval. We could, for example, use MPEG-7 features or something similar to build a tree structure from the data that makes it easier to browse. Creating that structure shouldn't take much time or many resources.

mzur · 2021-04-07T14:12:50Z

#66 should be implemented first.

mzur · 2021-04-07T14:15:24Z

Idea for the UI: If this feature is active (it is optional, and disabled if not enough training data is available), the grid of image patches in MAIA is split vertically (e.g. 80% of the rows show the regular patches and 20% of the rows show patches suggested by this method). This way the original MAIA workflow is still possible even if this method performs poorly for a given use case.

mzur · 2021-08-13T09:31:23Z

This can be done with the image features and similarity search implemented for biigle/core#336. The function should be available for training proposals and annotation candidates.

mzur · 2021-10-18T07:05:58Z

Next idea for the UI: The selected proposal/candidate is shown fixed and highlighted at the first position in the grid. The remaining grid items are sorted by their similarity to the selected patch. They scroll and can be interacted with as usual. The sorting can be enabled with a hover button on each patch. It can be disabled with a button on the highlighted fixed patch.
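A minimal sketch of the ordering described above, assuming each patch already has a feature vector and that cosine similarity is the measure (the thread does not fix one). The names `cosine_similarity`, `sort_by_similarity` and the example IDs are hypothetical and not taken from the BIIGLE code base.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equally long vectors (0.0 if one is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sort_by_similarity(features, selected_id):
    """Return patch IDs with the selected patch pinned first.

    `features` maps a (hypothetical) patch ID to its feature vector. The
    remaining IDs follow in order of decreasing similarity to the selection.
    """
    reference = features[selected_id]
    rest = [pid for pid in features if pid != selected_id]
    rest.sort(key=lambda pid: cosine_similarity(features[pid], reference), reverse=True)
    return [selected_id] + rest


# Toy 2-dimensional features; real feature vectors would be much longer.
features = {
    10: [1.0, 0.0],
    11: [0.0, 1.0],
    12: [0.9, 0.1],
    13: [0.5, 0.5],
}
order = sort_by_similarity(features, selected_id=10)
print(order)
```

Disabling the sorting would simply mean falling back to the regular grid order. With many patches, the ordering would more likely be computed by a similarity search in the database than in application code.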

mzur · 2021-11-30T15:34:40Z

Updated the title to make clear that this should be implemented both for training proposals and annotation candidates.

mzur · 2023-02-09T09:29:22Z

With the student experiments based on DINO features and #96 done, this can move forward now.

mzur · 2023-10-04T12:37:56Z

I want to pick this up again. New thoughts:

- Use DINOv2 for feature extraction.
- Use pgvector to store the features directly in the database (this will work with annotation patches, too).
- I thought about using a separate (vector) database for storing the features, but (1) it's too convenient to use the existing constraints and logic to update/delete the rows and (2) there are probably no performance issues (right now) with the amount of data we manage.
- pgvector supports indexing of up to 2000 dimensions per feature vector. Each dimension requires ~4 bytes. DINOv2 can produce feature vectors with between 384 and 1536 dimensions. A 1536-dimensional feature vector would take ~6144 bytes. From a rough estimate, the features of the current BIIGLE image annotations would require >90 GB, which is too much, IMO. A 384-dimensional feature vector would result in ~23 GB of additional storage. As a start, I'll experiment with patches of size 224x224 and vits14 (384 dimensions).
- We could use PCA for dimension reduction with MAIA, but we can't for the other use cases (i.e. Largo), as the annotations are created continuously and we can't know the principal components in advance.
- When this goes live (also for regular annotations in Largo), we must think about migrating the database host to a flavor with more storage.
- We also have to implement incremental backups, I think (for the "frequent" backup). The hourly backups should be fine even with a much larger DB size.
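The storage estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the annotation count is not stated in this thread, so `ASSUMED_ANNOTATION_COUNT` is a hypothetical round number back-calculated from the ">90 GB" figure, and any per-row overhead is ignored.

```python
BYTES_PER_DIMENSION = 4  # ~4 bytes per dimension, as stated above

# ASSUMPTION: not a number from this thread, chosen so that the
# 1536-dimensional case lands slightly above 90 GB.
ASSUMED_ANNOTATION_COUNT = 15_000_000


def vector_bytes(dimensions):
    """Approximate size of one feature vector in bytes."""
    return dimensions * BYTES_PER_DIMENSION


def total_gigabytes(dimensions, count):
    """Approximate size of `count` feature vectors in GB (10^9 bytes)."""
    return vector_bytes(dimensions) * count / 1e9


for dims in (384, 1536):
    size = total_gigabytes(dims, ASSUMED_ANNOTATION_COUNT)
    print(f"{dims} dims: {vector_bytes(dims)} bytes per vector, ~{size:.1f} GB in total")
```

Going from 1536 to 384 dimensions cuts the storage by a factor of four, which is the main argument for starting with vits14.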

Here is a notebook with a minimal feature-extraction example with DINOv2: https://colab.research.google.com/drive/1LbtYkzdOezl2SadyxCRJFYhLd_aQNjlq?usp=sharing

mzur · 2023-10-04T13:41:11Z

Thinking about it, maybe I prefer decoupling the vector database from our main database. With MAIA and Largo it's easy to implement cleanup of vector database rows, since the annotation/candidate/proposal patch files are also cleaned. Cleanup can be asynchronous as well.

This has the advantage that the vector DB does not have an impact on the regular DB backups. It can have its own (less frequent) backups and be run on a different host.

Laravel can work with different database connections (also for migrations). We only need to sync (and index) the model IDs from the regular DB to the vector DB but this shouldn't be a problem.
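Keeping the two databases in sync comes down to comparing model IDs. The sketch below illustrates that idea with plain Python sets, assuming both ID lists can be fetched cheaply. The function name and the example IDs are hypothetical, and the actual implementation would run as Laravel jobs against the two database connections.

```python
def plan_sync(main_ids, vector_ids):
    """Compare the model IDs of the main DB with those of the vector DB.

    Returns a tuple (missing, stale):
    - missing: IDs in the main DB that have no feature vector yet
    - stale: IDs in the vector DB whose model no longer exists
    """
    main_ids = set(main_ids)
    vector_ids = set(vector_ids)
    missing = sorted(main_ids - vector_ids)
    stale = sorted(vector_ids - main_ids)
    return missing, stale


# Hypothetical IDs: model 4 was created recently, model 9 was deleted.
missing, stale = plan_sync(main_ids=[1, 2, 3, 4], vector_ids=[1, 2, 3, 9])
print("compute features for:", missing)
print("delete from vector DB:", stale)
```

Because stale rows only waste space and do not break anything, the cleanup can run asynchronously, for example together with the cleanup of the patch files.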

I'll still stick with pgvector, as I don't want to introduce a new technology to the stack.

dlangenk added the discuss label Jan 4, 2019

mzur added the student label Jan 4, 2019

mzur mentioned this issue Aug 13, 2021

Sort proposals/candidates by similarity #66

Closed

mzur removed the discuss label Aug 13, 2021

mzur added this to BIIGLE Roadmap Oct 15, 2021

mzur moved this to Medium Priority in BIIGLE Roadmap Oct 15, 2021

mzur changed the title ~~Use image retrieval techniques to find similar images after novelty detection~~ Use image retrieval techniques to find similar images Nov 30, 2021

mzur mentioned this issue Sep 27, 2022

Maia data import #95

Closed

mzur removed the student label Oct 20, 2022

mzur moved this from Medium Priority to High Priority in BIIGLE Roadmap Feb 9, 2023

mzur self-assigned this Jun 13, 2023

mzur mentioned this issue Oct 4, 2023

Implement separate vector database biigle/core#670

Merged

1 task

mzur added a commit that referenced this issue Oct 4, 2023

WIP Start work on TrainingProposalEmbedding

fdcc2be

References #27 References biigle/core#670

mzur mentioned this issue Oct 4, 2023

Similarity sort #128

Merged

8 tasks

mzur linked a pull request Oct 12, 2023 that will close this issue

Similarity sort #128

Merged

8 tasks

mzur closed this as completed in #128 Oct 13, 2023

mzur mentioned this issue Dec 12, 2023

Outlier detection biigle/largo#120

Merged

33 tasks
