Download the SA-1B, COCO 2017, and LVIS annotations.
To run box-prompted instance segmentation, you first need a source_json_file containing detected bounding boxes. Follow the instructions of ViTDet, YOLOv8, and GroundingDINO to generate these files, or download our pre-generated ones (a quick sanity-check sketch follows the directory layout below).
Expected directory structure:
coco
├── train2017
├── val2017
├── annotations
│ ├── instances_val2017.json
│ ├── lvis_v1_val.json
├── source_json_file
│ ├── coco_groundingdino.json
│ ├── coco_vitdet.json
│ ├── coco_yolov8.json
│ ├── lvis_vitdet.json
sam
├── images
├── masks
├── sa_images_ids.txt
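As a quick sanity check after downloading, the sketch below loads one of the detection files. It assumes the standard COCO detection-results format (a list of records with image_id, category_id, bbox, and score); this is an assumption for illustration, not a documented guarantee.

import json

# load one of the pre-generated detection files (assumed COCO detection-results format)
with open("coco/source_json_file/coco_vitdet.json") as f:
    detections = json.load(f)

print(len(detections), "detections")
print(detections[0])  # expected keys (assumed): image_id, category_id, bbox, score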
Latency and throughput are measured on the NVIDIA Jetson AGX Orin and the NVIDIA A100 GPU with TensorRT and fp16. Data transfer time is included.
Model | Resolution | COCO mAP | LVIS mAP | Params | MACs | Jetson Orin Latency (bs1) | A100 Throughput (bs16) | Checkpoint |
---|---|---|---|---|---|---|---|---|
EfficientViT-SAM-L0 | 512x512 | 45.7 | 41.8 | 34.8M | 35G | 8.2ms | 762 images/s | link |
EfficientViT-SAM-L1 | 512x512 | 46.2 | 42.1 | 47.7M | 49G | 10.2ms | 638 images/s | link |
EfficientViT-SAM-L2 | 512x512 | 46.6 | 42.7 | 61.3M | 69G | 12.9ms | 538 images/s | link |
EfficientViT-SAM-XL0 | 1024x1024 | 47.5 | 43.9 | 117.0M | 185G | 22.5ms | 278 images/s | link |
EfficientViT-SAM-XL1 | 1024x1024 | 47.8 | 44.4 | 203.3M | 322G | 37.2ms | 182 images/s | link |
Table 1: Summary of all EfficientViT-SAM variants. COCO mAP and LVIS mAP are measured using ViTDet's predicted bounding boxes as prompts. End-to-end Jetson Orin latency and A100 throughput are measured with TensorRT and fp16.
# segment anything
from efficientvit.sam_model_zoo import create_sam_model
from efficientvit.models.efficientvit.sam import (
    EfficientViTSamAutomaticMaskGenerator,
    EfficientViTSamPredictor,
)

# build the model and load the pretrained weights
efficientvit_sam = create_sam_model(
    name="xl1", weight_url="assets/checkpoints/sam/xl1.pt",
)
efficientvit_sam = efficientvit_sam.cuda().eval()

# prompt-based prediction (points / boxes)
efficientvit_sam_predictor = EfficientViTSamPredictor(efficientvit_sam)

# automatic "segment everything" mask generation
efficientvit_mask_generator = EfficientViTSamAutomaticMaskGenerator(efficientvit_sam)
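For reference, a minimal usage sketch built on the objects above. It assumes EfficientViTSamPredictor and EfficientViTSamAutomaticMaskGenerator follow the original SAM predictor/mask-generator interface (set_image/predict and generate); the image path is a placeholder.

import cv2
import numpy as np

# load an RGB image (the path is a placeholder)
image = cv2.cvtColor(cv2.imread("assets/fig/example.jpg"), cv2.COLOR_BGR2RGB)

# box prompt: embed the image once, then predict a mask for the box [x1, y1, x2, y2]
efficientvit_sam_predictor.set_image(image)
masks, scores, _ = efficientvit_sam_predictor.predict(
    box=np.array([150, 70, 640, 400]),
    multimask_output=False,
)

# segment everything: generate masks over the whole image
annotations = efficientvit_mask_generator.generate(image)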
Note: For LVIS evaluation, please manually install the lvis package (check this issue for more details).
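The lvis package is usually available from PyPI, so installing it directly should work in most environments:

pip install lvis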
# COCO (ground-truth box prompts)
torchrun --nproc_per_node=8 eval_sam_model.py --dataset coco --image_root coco/val2017 --annotation_json_file coco/annotations/instances_val2017.json --model xl1 --weight_url assets/checkpoints/sam/xl1.pt --prompt_type box
# expected results: all=79.927, large=83.748, medium=82.210, small=75.833
# LVIS (ground-truth box prompts)
torchrun --nproc_per_node=8 eval_sam_model.py --dataset lvis --image_root coco --annotation_json_file coco/annotations/lvis_v1_val.json --model xl1 --weight_url assets/checkpoints/sam/xl1.pt --prompt_type box
# expected results: all=79.886, large=91.577, medium=88.447, small=74.412
# COCO (box prompts from ViTDet detections)
torchrun --nproc_per_node=8 eval_sam_model.py --dataset coco --image_root coco/val2017 --annotation_json_file coco/annotations/instances_val2017.json --model xl1 --weight_url assets/checkpoints/sam/xl1.pt --prompt_type box_from_detector --source_json_file coco/source_json_file/coco_vitdet.json
# expected results:
# Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.478
# LVIS (box prompts from ViTDet detections)
torchrun --nproc_per_node=8 eval_sam_model.py --dataset lvis --image_root coco --annotation_json_file coco/annotations/lvis_v1_val.json --model xl1 --weight_url assets/checkpoints/sam/xl1.pt --prompt_type box_from_detector --source_json_file coco/source_json_file/lvis_vitdet.json
# expected results:
# Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=300 catIds=all] = 0.444
# COCO (single-point prompts)
torchrun --nproc_per_node=8 eval_sam_model.py --dataset coco --image_root coco/val2017 --annotation_json_file coco/annotations/instances_val2017.json --model xl1 --weight_url assets/checkpoints/sam/xl1.pt --prompt_type point --num_click 1
# expected results: all=59.757, large=62.132, medium=63.837, small=55.029
# LVIS (single-point prompts)
torchrun --nproc_per_node=8 eval_sam_model.py --dataset lvis --image_root coco --annotation_json_file coco/annotations/lvis_v1_val.json --model xl1 --weight_url assets/checkpoints/sam/xl1.pt --prompt_type point --num_click 1
# expected results: all=56.624, large=72.442, medium=71.796, small=47.750
Please run demo_sam_model.py to visualize our segment anything models.
Example:
# segment everything
python demo_sam_model.py --model xl1 --mode all
# prompt with points
python demo_sam_model.py --model xl1 --mode point
# prompt with box
python demo_sam_model.py --model xl1 --mode box --box "[150,70,640,400]"
# Export Encoder
python deployment/sam/onnx/export_encoder.py --model xl1 --weight_url assets/checkpoints/sam/xl1.pt --output assets/export_models/sam/onnx/xl1_encoder.onnx
# Export Decoder
python deployment/sam/onnx/export_decoder.py --model xl1 --weight_url assets/checkpoints/sam/xl1.pt --output assets/export_models/sam/onnx/xl1_decoder.onnx --return-single-mask
# ONNX Inference
python -m deployment.sam.onnx.inference --model xl1 --encoder_model assets/export_models/sam/onnx/xl1_encoder.onnx --decoder_model assets/export_models/sam/onnx/xl1_decoder.onnx --mode point
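For a quick check of the exported encoder outside the provided script, here is a rough onnxruntime sketch. It assumes the encoder takes a 1x3x1024x1024 float tensor named input_image (consistent with the trtexec shapes below); the resize and normalization constants are placeholders borrowed from the original SAM preprocessing and may differ from what deployment.sam.onnx.inference actually uses.

import cv2
import numpy as np
import onnxruntime as ort

# load the exported encoder (CPU provider for portability)
session = ort.InferenceSession(
    "assets/export_models/sam/onnx/xl1_encoder.onnx",
    providers=["CPUExecutionProvider"],
)

# read an image and bring it to the expected 1x3x1024x1024 layout
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB).astype(np.float32)
image = cv2.resize(image, (1024, 1024))
# normalization constants assumed from the original SAM preprocessing
mean = np.array([123.675, 116.28, 103.53], dtype=np.float32)
std = np.array([58.395, 57.12, 57.375], dtype=np.float32)
image = ((image - mean) / std).transpose(2, 0, 1)[None]

# run the encoder and inspect the image embedding shape
outputs = session.run(None, {"input_image": image})
print(outputs[0].shape)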
# Export Encoder
trtexec --onnx=assets/export_models/sam/onnx/xl1_encoder.onnx --minShapes=input_image:1x3x1024x1024 --optShapes=input_image:4x3x1024x1024 --maxShapes=input_image:4x3x1024x1024 --saveEngine=assets/export_models/sam/tensorrt/xl1_encoder.engine
# Export Decoder
trtexec --onnx=assets/export_models/sam/onnx/xl1_decoder.onnx --minShapes=point_coords:1x1x2,point_labels:1x1 --optShapes=point_coords:16x2x2,point_labels:16x2 --maxShapes=point_coords:16x2x2,point_labels:16x2 --fp16 --saveEngine=assets/export_models/sam/tensorrt/xl1_decoder.engine
# TensorRT Inference
python -m deployment.sam.tensorrt.inference --model xl1 --encoder_engine assets/export_models/sam/tensorrt/xl1_encoder.engine --decoder_engine assets/export_models/sam/tensorrt/xl1_decoder.engine --mode point
Download the distilled models and place them under assets/distilled_checkpoints.
We use torchrun to launch distributed jobs.
torchrun --nproc_per_node=8 train_sam_model.py configs/sam/xl1.yaml --path .exp/sam/efficientvit_sam_xl1 --resume
bash slurm_run_sam.sh
Note: Make sure the training config matches the one used in the slurm script.
If EfficientViT or EfficientViT-SAM is useful or relevant to your research, please recognize our contributions by citing our papers:
@article{cai2022efficientvit,
  title={Efficientvit: Enhanced linear attention for high-resolution low-computation visual recognition},
  author={Cai, Han and Gan, Chuang and Han, Song},
  journal={arXiv preprint arXiv:2205.14756},
  year={2022}
}

@article{zhang2024efficientvit,
  title={EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss},
  author={Zhang, Zhuoyang and Cai, Han and Han, Song},
  journal={arXiv preprint arXiv:2402.05008},
  year={2024}
}