PyTorch implementation of Masked Autoencoders (MAE)
Due to limited resources, I only test a ViT-Tiny of my own design on the CIFAR10 dataset. It is not my goal to reproduce MAE perfectly; rather, my implementation follows the official MAE as closely as possible so that users can learn MAE quickly and accurately.
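At the heart of MAE is per-sample random masking: each image's patch tokens are shuffled with uniform noise, only a small visible fraction (25% by default) is fed to the encoder, and the permutation is remembered so the decoder can restore the original patch order. Below is a minimal PyTorch sketch of this step, closely following the `random_masking` function in the official MAE code:

```python
import torch

def random_masking(x, mask_ratio=0.75):
    # x: (N, L, D) batch of patch-embedding sequences.
    N, L, D = x.shape
    len_keep = int(L * (1 - mask_ratio))

    noise = torch.rand(N, L, device=x.device)        # one noise value per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # ascending: lowest noise is kept
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse permutation

    # Keep the first len_keep patches of the shuffled sequence.
    ids_keep = ids_shuffle[:, :len_keep]
    x_masked = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    # Binary mask in the original patch order: 0 = kept, 1 = masked.
    mask = torch.ones(N, L, device=x.device)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return x_masked, mask, ids_restore
```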
We provide the bash script train_pretrain.sh for pretraining. You can modify the hyperparameters in the script to suit your needs:

```shell
bash train_pretrain.sh
```
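For reference, MAE's pretraining objective is a simple mean-squared error between the decoder's predicted patches and the patchified target image, averaged only over the masked patches; the official code can optionally normalize each target patch by its own mean and variance (its `--norm_pix_loss` option). A sketch of the loss, mirroring the official `forward_loss` (the shapes of `pred`, `target`, and `mask` below are assumptions):

```python
import torch

def mae_loss(pred, target, mask, norm_pix_loss=True):
    # pred, target: (N, L, patch_dim) predicted / ground-truth patch pixels.
    # mask: (N, L) with 1 for masked patches, 0 for visible ones.
    if norm_pix_loss:
        mean = target.mean(dim=-1, keepdim=True)
        var = target.var(dim=-1, keepdim=True)
        target = (target - mean) / (var + 1e-6).sqrt()
    loss = ((pred - target) ** 2).mean(dim=-1)   # per-patch MSE
    return (loss * mask).sum() / mask.sum()      # average over masked patches only
```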
We also provide the bash script train_finetune.sh for finetuning. Again, you can modify the hyperparameters in the script to suit your needs:

```shell
bash train_finetune.sh
```
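Finetuning starts from the MAE-pretrained checkpoint: only the encoder weights are transferred, while the decoder and mask token are discarded and the classification head is trained from scratch. A sketch of the weight transfer, assuming the checkpoint layout of the official MAE (the key names here are assumptions and may differ in this repo):

```python
import torch

def load_pretrained_encoder(vit, ckpt_path):
    # Hypothetical helper: copy MAE encoder weights into a ViT classifier.
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state = ckpt.get("model", ckpt)  # official MAE stores weights under "model"
    # Drop decoder weights and the mask token: only the encoder is finetuned.
    state = {k: v for k, v in state.items()
             if not k.startswith("decoder") and k != "mask_token"}
    missing, unexpected = vit.load_state_dict(state, strict=False)
    print("missing keys:", missing)  # e.g. the freshly initialized classifier head
    print("unexpected keys:", unexpected)
    return vit
```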
- Evaluate the top-1 & top-5 accuracy of ViT-Tiny on the CIFAR10 dataset:

```shell
python train_finetune.py --dataset cifar10 -m vit_tiny --batch_size 256 --img_size 32 --patch_size 2 --eval --resume path/to/vit_tiny_cifar10.pth
```
- Evaluate the top-1 & top-5 accuracy of ViT-Tiny on the ImageNet-1K dataset (a top-k accuracy sketch follows this command):

```shell
python train_finetune.py --dataset imagenet_1k -m vit_tiny --batch_size 256 --img_size 224 --patch_size 16 --eval --resume path/to/vit_tiny_imagenet_1k.pth
```
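For reference, top-1 / top-5 accuracy simply checks whether the ground-truth label appears among the k highest-scoring logits. A minimal sketch (the actual evaluation code in train_finetune.py may differ):

```python
import torch

@torch.no_grad()
def topk_accuracy(logits, targets, ks=(1, 5)):
    # logits: (N, num_classes) model outputs; targets: (N,) class indices.
    _, pred = logits.topk(max(ks), dim=1)      # (N, max_k) top predicted classes
    hits = pred.eq(targets.unsqueeze(1))       # (N, max_k) boolean matches
    return [hits[:, :k].any(dim=1).float().mean().item() * 100 for k in ks]
```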
- Evaluate MAE-ViT-Tiny on the CIFAR10 dataset:

```shell
python train_pretrain.py --dataset cifar10 -m mae_vit_tiny --resume path/to/mae_vit_tiny_cifar10.pth --img_size 32 --patch_size 2 --eval --batch_size 1
```
- Evaluate MAE-ViT-Tiny on the ImageNet-1K dataset (an evaluation-loop sketch follows this command):

```shell
python train_pretrain.py --dataset imagenet_1k -m mae_vit_tiny --resume path/to/mae_vit_tiny_imagenet_1k.pth --img_size 224 --patch_size 16 --eval --batch_size 1
```
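Evaluating the pretrained MAE amounts to averaging the reconstruction loss over the validation set (`--batch_size 1` presumably so each image can also be inspected individually). A sketch of such an eval loop, assuming the model's forward pass returns `(loss, pred, mask)` as in the official MAE:

```python
import torch

@torch.no_grad()
def evaluate_mae(model, loader, device="cuda"):
    model.eval()
    total, n = 0.0, 0
    for imgs, _ in loader:                    # labels are unused for reconstruction
        loss, pred, mask = model(imgs.to(device))
        total += loss.item() * imgs.size(0)
        n += imgs.size(0)
    return total / n                          # mean reconstruction loss
```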
- Visualization on the CIFAR10 validation set (left to right: masked image, original image, reconstructed image):
- Visualization on the ImageNet-1K validation set (same layout):
...
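These visualizations can be produced by unpatchifying the decoder output and pasting the reconstructed patches back into the masked positions, as in the official MAE demo notebook. A sketch assuming the model returns `(loss, pred, mask)` and exposes `unpatchify` (this model API is an assumption based on the official code):

```python
import torch

@torch.no_grad()
def visualize_triplet(model, img):
    # img: (1, 3, H, W) normalized input image.
    loss, pred, mask = model(img)              # pred: (1, L, p*p*3), mask: (1, L)
    recon = model.unpatchify(pred)             # (1, 3, H, W) full reconstruction
    # Broadcast the per-patch mask to pixel space: 1 = masked patch.
    mask = mask.unsqueeze(-1).repeat(1, 1, pred.shape[-1])
    mask = model.unpatchify(mask)              # (1, 3, H, W)
    masked_img = img * (1 - mask)              # blank out the masked patches
    pasted = img * (1 - mask) + recon * mask   # visible pixels + reconstruction
    return masked_img, img, pasted             # the three panels shown above
```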
- On CIFAR10

| Method | Model | Epochs | Top-1 (%) | Weight | MAE weight |
|---|---|---|---|---|---|
| MAE | ViT-T | 100 | 91.2 | ckpt | ckpt |
- On ImageNet-1K

| Method | Model | Epochs | Top-1 (%) | Weight | MAE weight |
|---|---|---|---|---|---|
| MAE | ViT-T | 100 | | | |
Thank you to Kaiming He for his inspiring work on MAE; his research effectively elucidates the semantic differences between vision and language and offers valuable insights for subsequent vision studies. I am also grateful for the official MAE source code, and I appreciate IcarusWizard's effort in reproducing the MAE implementation.