
Breaking the Frame: Visual Place Recognition by Overlap Prediction

Updates

2024.06. Available on arXiv.

2024.10. Accepted at WACV 2025.

Summary

The proposed method identifies overlapping image regions without requiring expensive feature detection and matching. It extracts patch-level embeddings with a DINOv2 backbone, establishes patch-to-patch correspondences, and uses a voting mechanism to compute overlap scores for candidate database images, providing a more nuanced image retrieval metric in challenging scenarios.

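For intuition, here is a minimal sketch of the voting idea, assuming L2-normalized patch embeddings; function and variable names are illustrative, not the repo's actual API.

```python
import numpy as np

def overlap_score(query_patches, db_patches, radius=0.5):
    """Vote-based overlap score between two images (illustrative only).

    query_patches: (Nq, D) L2-normalized patch embeddings of the query.
    db_patches:    (Nd, D) L2-normalized patch embeddings of a database image.
    radius:        similarity threshold below which a vote is rejected.
    """
    # Cosine similarity between every query patch and every database patch.
    sim = query_patches @ db_patches.T          # (Nq, Nd)
    # Each query patch votes for its most similar database patch,
    # but only if the similarity exceeds the radius threshold.
    best = sim.max(axis=1)                      # (Nq,)
    return int((best > radius).sum())

def rank_database(query_patches, db_patches_list, radius=0.5, k=5):
    """Rank database images by their overlap score with the query."""
    scores = [overlap_score(query_patches, p, radius) for p in db_patches_list]
    order = np.argsort(scores)[::-1]            # highest vote count first
    return order[:k], [scores[i] for i in order[:k]]
```

In the actual pipeline the patch embeddings come from the trained encoder on top of DINOv2, and the votes can be TF-IDF weighted (see the `--weighted` flag below); the sketch omits both.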

Installation

torch == 2.3.1
Python == 3.10.13
OpenCV == 4.10.0.84
OmegaConf == 2.3.0
h5py == 3.11.0
tqdm == 4.66.4
faiss-gpu == 1.7.2
lightglue
hloc

Try the proposed VOP on one example image pair and visualize their matched patches.

Evaluation

Step 1. Preprocess the test data and GT information (e.g., camera parameters R, K) if available: load the images, run the frozen DINOv2 backbone on them, and save the [CLS] tokens and patch embeddings.
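A minimal sketch of what this dump step produces, assuming the torch.hub DINOv2 ViT-L/14 model and illustrative file names; the repo's dump_data.py handles the dataset-specific details.

```python
import h5py
import torch
import torchvision.transforms.functional as TF
from PIL import Image

# Frozen DINOv2 ViT-L/14 backbone (1024-dim tokens, matching input_dim in the config).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").eval().cuda()

@torch.no_grad()
def dump_image(path, out_group):
    img = Image.open(path).convert("RGB")
    # Resize so both sides are multiples of the 14-pixel patch size (illustrative choice).
    img = TF.resize(img, (518, 518))
    x = TF.normalize(TF.to_tensor(img), [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    feats = model.forward_features(x[None].cuda())
    # Save the [CLS] token (used for the shortlist) and the patch tokens (used for voting).
    out_group.create_dataset("cls", data=feats["x_norm_clstoken"].cpu().numpy())
    out_group.create_dataset("patches", data=feats["x_norm_patchtokens"].cpu().numpy())

with h5py.File("features.h5", "w") as f:
    dump_image("example.jpg", f.create_group("example.jpg"))
```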

Step 2. Load the best checkpoint and run the trained encoder on the test set. Perform retrieval and save a list of images with high overlap scores.
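The -pre / --pre_filter shortlist can be thought of as a nearest-neighbour search on the [CLS] tokens before the patch-level voting. A hedged sketch with faiss, using illustrative names rather than the repo's API:

```python
import faiss
import numpy as np

def cls_shortlist(query_cls, db_cls, pre_filter=20):
    """Shortlist database images by [CLS]-token similarity before patch voting.

    query_cls: (D,) query [CLS] token.
    db_cls:    (N, D) database [CLS] tokens.
    """
    db = np.ascontiguousarray(db_cls.astype("float32"))
    faiss.normalize_L2(db)
    index = faiss.IndexFlatIP(db.shape[1])     # inner product == cosine after L2 norm
    index.add(db)
    q = np.ascontiguousarray(query_cls[None].astype("float32"))
    faiss.normalize_L2(q)
    _, idx = index.search(q, pre_filter)       # indices of the pre_filter best candidates
    return idx[0]
```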

Step 3. Evaluate the retrieval results by running relative pose estimation or localization.
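The relative-pose evaluation follows the standard essential-matrix pipeline; below is a hedged sketch with OpenCV, assuming matched keypoints and known intrinsics are already available (relative_pose.py is the actual entry point).

```python
import cv2
import numpy as np

def relative_pose(pts0, pts1, K0, K1, ransac_thresh=1.0):
    """Estimate the relative pose (R, t up to scale) from 2D-2D matches with RANSAC.

    pts0, pts1: (N, 2) matched pixel coordinates in the two images.
    K0, K1:     (3, 3) camera intrinsics.
    """
    # Normalize the matches with the intrinsics so a single essential matrix suffices.
    p0 = cv2.undistortPoints(pts0.reshape(-1, 1, 2).astype(np.float64), K0, None).reshape(-1, 2)
    p1 = cv2.undistortPoints(pts1.reshape(-1, 1, 2).astype(np.float64), K1, None).reshape(-1, 2)
    # Express the pixel threshold in normalized coordinates via the mean focal length.
    thresh = ransac_thresh / np.mean([K0[0, 0], K1[0, 0]])
    E, inliers = cv2.findEssentialMat(p0, p1, np.eye(3), method=cv2.RANSAC,
                                      prob=0.9999, threshold=thresh)
    _, R, t, _ = cv2.recoverPose(E, p0, p1, np.eye(3), mask=inliers)
    return R, t, inliers
```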

Here are the instructions for the test sets used in the paper. The best checkpoint is downloaded automatically.

💥 Important: before data preprocessing, create or update the source data directory for the specific dataset in dump_datasets/data_dirs.yaml:

dataset_dirs:
  inloc: <src_path>
[MegaDepth]
  1. Download the data from glue-factory, including images and scene_info.

  2. Data preprocessing and top-1/5/10 retrieval.

python dump_data.py -ds megadepth
python register.py -k 5 -m best -pre 20 -ds megadepth
  3. Relative pose estimation using RANSAC.
python relative_pose.py -k 5 -m best -pre 20 -ds megadepth
[ETH3D]
  1. Download ETH3D (5.6 GB).
  2. Data preprocessing and top-1/5/10 retrieval.
python dump_data.py -ds eth3d
python register.py -k 5 -m best -pre 20 -ds eth3d
  3. Relative pose estimation using RANSAC.
python relative_pose.py -k 5 -m best -pre 20 -ds eth3d
[InLoc]
  1. Download the DB images and format the data into database/cutouts/; download the queries into query/iphone7/.
  2. Data preprocessing and top-40 retrieval.
python dump_data.py -ds inloc
python retrieve.py -ds inloc -k 40 -m best -pre 100
  3. Install and run hloc for localization.
python inloc_localization.py --loc_pairs outputs/inloc/best/cls_100/top40_overlap_pairs.txt -m best -ds inloc -out output_local
  4. Submit the result poses to the long-term visual localization benchmark.
[Customized data]
  1. Add the path of the custom data to data_dirs.yaml and create a dump script to load the images and, if needed and available, the GT pose information (see the sketch after this list).

  2. Run retrieve.py to find overlapping DB images for the queries, or register.py to search for overlapping images for each image in the pool.

  3. Run the evaluation of relative pose estimation or localization, or use the saved retrieved pairs elsewhere as you wish.
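A rough sketch of how the GT part of such a dump script could look, assuming features are dumped as above; the pose-file format, group names, and dataset keys here are purely illustrative, so adapt them to how the existing dump scripts store data.

```python
import h5py
import numpy as np

def dump_custom_gt(pose_file, out_path):
    """Store per-image GT camera parameters next to the dumped features (illustrative format).

    pose_file is assumed to contain one line per image:
    name qw qx qy qz tx ty tz fx fy cx cy
    """
    with h5py.File(out_path, "a") as f:
        for line in open(pose_file):
            name, *vals = line.split()
            vals = np.array(vals, dtype=np.float64)
            g = f.require_group(name)
            g.create_dataset("qvec", data=vals[:4])    # rotation as a quaternion
            g.create_dataset("tvec", data=vals[4:7])   # translation
            K = np.array([[vals[7], 0.0, vals[9]],
                          [0.0, vals[8], vals[10]],
                          [0.0, 0.0, 1.0]])
            g.create_dataset("K", data=K)              # intrinsics
```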

Training

Step 1. Download the GT depths of MegaDepth for training supervision from here.

Step 2. Customize the configs and start training based on glue-factory. We provide a default config with fixed, pre-saved positive/negative image pairs (fast), and a config with random positive/negative pairs (slow).

python -m gluefactory.train best_easy_retrain --conf train_configs/best_easy.yaml

Note that the easy version requires prepared labels; please download them from the train and validation links.

Important configs:

data:
    data_dir: ""
    info_dir: ""
    # choose the data augmentation type: 'flip', 'dark', 'lightglue'
    photometric: {
        "name": "flip",
        "p": 0.95,
        # 'difficulty': 1.0,  # currently unused
    }
    gt_label_path: ""


model:
    matcher:
        name: overlap_predictor # our model
        input_dim: 1024 # the dimension of the pretrained DINOv2 features
        embedding_dim: 256 # projected embedding dim
        dropout_prob: 0.5    # dropout probability
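To make the dimensions concrete: the config above projects 1024-dim DINOv2 tokens into 256-dim embeddings with dropout. A minimal sketch of such a projection head, not necessarily the repo's overlap_predictor architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchProjector(nn.Module):
    """Projects frozen DINOv2 patch tokens into a smaller retrieval embedding space."""

    def __init__(self, input_dim=1024, embedding_dim=256, dropout_prob=0.5):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(input_dim, embedding_dim),
            nn.ReLU(),
            nn.Dropout(dropout_prob),
            nn.Linear(embedding_dim, embedding_dim),
        )

    def forward(self, patch_tokens):        # (B, N, 1024) DINOv2 patch tokens
        emb = self.proj(patch_tokens)       # (B, N, 256) projected embeddings
        return F.normalize(emb, dim=-1)     # unit norm for cosine-similarity voting
```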

Notes

[Useful configs]
--model, name of the loaded model.
--k, top-k retrievals.
--radius, default=-1, compute the median similarity over 100 random samples as the radius threshold.
--cls, default=True, action True, whether the CLS-token prefilter is used.
--pre_filter, default=20, shortlist length.
--weighted, default=True, action True, whether to use TF-IDF weights for voting.
--overwrite, default=False, action True.
--conf, config path used for training.
[Acknowledgement]

glue-factory

long-term visual localization benchmark

pre-commit

[Contact] Contact me at weitongln@gmail.com or weitong@fel.cvut.cz.
