Unofficial implementation of the Visual Transformers: Token-based Image Representation and Processing for Computer Vision paper.
python main.py task_mode learning_mode data --model --weights

where:

- `task_mode`: `classification` or `semantic_segmentation` for the corresponding task.
- `learning_mode`: `train` to train `--model` from scratch, or `test` to validate `--model` with `--weights` on validation data.
- `data`: path to the dataset; for classification it should point to ImageNet, for semantic segmentation to COCO.
- `--model`:
  ○ classification: `ResNet18` or `VT_ResNet18` (used by default).
  ○ semantic segmentation: `PanopticFPN` or `VT_FPN` (used by default).
- `--weights`: must be provided if `learning_mode` equals `test`; not used in `train` mode.
- `--from_pretrained`: used to continue training from a saved point; should be a `state_dict` that contains `model_state_dict`, `optimizer_state_dict` and `epoch`.
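The `--from_pretrained` checkpoint is just a dictionary with those three keys. A minimal sketch of writing and reading such a file (the helper names and the use of `pickle` here are illustrative assumptions; a PyTorch project would normally go through `torch.save`/`torch.load`, which pickle under the hood):

```python
import pickle

def save_checkpoint(model_state, optimizer_state, epoch, path):
    # Bundle everything --from_pretrained expects into a single dict.
    checkpoint = {
        "model_state_dict": model_state,
        "optimizer_state_dict": optimizer_state,
        "epoch": epoch,
    }
    with open(path, "wb") as f:
        pickle.dump(checkpoint, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Example: resume bookkeeping from epoch 7 (toy states, not real weights)
save_checkpoint({"w": [0.1]}, {"lr": 0.01}, 7, "ckpt.pkl")
ckpt = load_checkpoint("ckpt.pkl")
print(ckpt["epoch"])  # 7
```

Resuming then amounts to restoring the two state dicts and continuing the epoch loop from `ckpt["epoch"] + 1`.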
- final metrics and losses after 15 epochs of classification and 5 epochs of semantic segmentation respectively.
- loss and metric curves of classification (cross entropy loss, accuracy) and semantic segmentation (pixel-wise cross entropy loss, mIoU).
- Efficiency and parameters
| Model | Params (M) | FLOPs (M) | Forward-backward pass (s) |
|---|---|---|---|
| ResNet18 | 11.2 | 822 | 0.016 |
| VT-ResNet18 | 12.7 | 543 | 0.02 |
| Panoptic FPN | 16.4 | 67412 | 0.08 |
| VT-FPN | 40.3 | 110019 | 0.062 |
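The per-pass timings above were presumably obtained by averaging repeated forward-backward passes after a few warm-up runs; a generic timing helper along those lines (a sketch with names of my choosing, not code from this repo):

```python
import time

def mean_pass_time(step, n_warmup=2, n_runs=10):
    """Average wall-clock time of one call to `step`, after warm-up runs."""
    for _ in range(n_warmup):
        step()  # warm-up: caches, lazy initialization, etc.
    start = time.perf_counter()
    for _ in range(n_runs):
        step()  # e.g. one forward + backward pass of the model
    return (time.perf_counter() - start) / n_runs

# Toy stand-in for a forward-backward pass
t = mean_pass_time(lambda: sum(i * i for i in range(100_000)), n_runs=5)
print(f"{t:.4f} s per pass")
```

For GPU models, the real measurement would also need device synchronization (e.g. `torch.cuda.synchronize()`) before reading the clock, otherwise asynchronous kernel launches make the numbers meaningless.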
- classification: ResNet18, VT-ResNet18
- semantic segmentation: Panoptic FPN, VT-FPN