This repository contains the implementation of "MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering".
Data preparation follows prior work for each benchmark:
- VQA2.0, GQA: follow LXMERT
- Visual7W, TDIUC: follow CTI
- VQA1.0: download from the official VQA website
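A minimal sketch of fetching those resources, assuming the repositories below are the LXMERT and CTI codebases this README points to (the URLs are my assumption, not taken from this repository):

```bash
# Assumed sources for data preparation; verify before use.
git clone https://github.com/airsplay/lxmert         # VQA2.0 / GQA features & annotations
git clone https://github.com/aioz-ai/ICCV19_VQA-CTI  # Visual7W / TDIUC preparation
# VQA1.0 questions and annotations: https://visualqa.org (official VQA site)
```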
Under ./pretrain:

```bash
bash run.bash exp_name gpuid
```
Training parameters can be adjusted by editing run.bash.
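For example, a pre-training run might be launched as follows; the experiment name mirtt_pretrain is an arbitrary tag, and 0 is the GPU id:

```bash
cd pretrain
# exp_name tags the run; gpuid selects the GPU to train on
bash run.bash mirtt_pretrain 0
```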
Under ./main:

```bash
bash run.bash exp_name gpuid
```
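For example (again with an arbitrary experiment name):

```bash
cd main
# same argument convention as pre-training: run tag, then GPU id
bash run.bash mirtt_main 0
```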
Two-stage workflow

Stage one: train a bilinear model (BAN, SAN, or MLP).
Under ./bilinear_method:

```bash
bash run.bash exp_name gpuid mod dataset model
```
After training, an answer list can be generated for each dataset; in this way, free-form open-ended (FFOE) VQA is simplified into multiple-choice (MC) VQA. A hypothetical invocation is sketched below.
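For instance, a stage-one run of BAN on TDIUC might look like the following; the values for mod, dataset, and model are guesses for illustration, so check run.bash for the strings it actually accepts:

```bash
cd bilinear_method
# exp_name: run tag; 0: GPU id; the mode/dataset/model values below are assumed
bash run.bash ban_tdiuc 0 train tdiuc ban
```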
Stage two: MIRTT. Under ./main, run the same command as in the previous section.
This repository is still being updated.