# Show and Tell, but on video
A quick Jupyter mini-guide on how to use the COCO Python API is included.

```shell
pip install pycocotools-windows
```
Run the make script to fetch the COCO dataset (2017 challenge, ~50 GB; requires GNU wget).

Download the YOLO weights (the class names and config file are included in this repo, but the weights are too big to ship):

```shell
cd YOLO
wget https://pjreddie.com/media/files/yolov3.weights
```
Train:

```shell
python train.py
```

Run:

```shell
python run.py
```
Following the original architecture (and repo), this project uses ResNet-152 as the encoder and an LSTM as the decoder.
| CNN Encoder | RNN Decoder |
| --- | --- |
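As a sketch of this encoder-decoder setup (hypothetical layer sizes and class name; the repo's actual hyperparameters may differ), the pooled CNN feature can seed the LSTM's initial state before word logits are produced:

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """LSTM decoder sketch: a pooled image feature (e.g. from ResNet-152)
    initializes the hidden state, then caption tokens are scored step by step.
    Sizes are illustrative, not the repo's exact configuration."""
    def __init__(self, feat_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)  # CNN feature -> h0
        self.init_c = nn.Linear(feat_dim, hidden_dim)  # CNN feature -> c0
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)    # hidden -> vocab logits

    def forward(self, feats, captions):
        h0 = self.init_h(feats).unsqueeze(0)           # (1, batch, hidden)
        c0 = self.init_c(feats).unsqueeze(0)
        emb = self.embed(captions)                     # (batch, seq, embed)
        out, _ = self.lstm(emb, (h0, c0))
        return self.fc(out)                            # (batch, seq, vocab)

feats = torch.randn(2, 2048)                # stand-in for ResNet-152 pooled features
caps = torch.randint(0, 10000, (2, 7))      # dummy token ids
logits = CaptionDecoder()(feats, caps)
print(logits.shape)                         # torch.Size([2, 7, 10000])
```

Note the original Show and Tell paper feeds the image embedding as the first LSTM input; seeding the hidden state, as above, is a common variant of the same design.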
Darknet's YOLO is used to constrain where the model should look.
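For illustration, here is a minimal helper (hypothetical; not the repo's actual post-processing) that keeps only the frame regions YOLO detected above a confidence threshold, so the encoder sees just those crops:

```python
import numpy as np

def crop_detections(frame, boxes, conf_thresh=0.5):
    """Crop the regions YOLO flagged in a frame.
    `boxes` holds (x, y, w, h, confidence) tuples in pixel coordinates.
    Hypothetical helper to show the idea; the repo's code may differ."""
    h, w = frame.shape[:2]
    crops = []
    for x, y, bw, bh, conf in boxes:
        if conf < conf_thresh:
            continue                      # drop low-confidence detections
        x0, y0 = max(0, int(x)), max(0, int(y))
        x1, y1 = min(w, int(x + bw)), min(h, int(y + bh))
        crops.append(frame[y0:y1, x0:x1])
    return crops

frame = np.zeros((416, 416, 3), dtype=np.uint8)      # dummy YOLO-sized frame
boxes = [(10, 20, 100, 80, 0.9), (0, 0, 50, 50, 0.3)]
crops = crop_detections(frame, boxes)
print(len(crops), crops[0].shape)  # 1 (80, 100, 3)
```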
```bibtex
@misc{https://doi.org/10.48550/arxiv.1411.4555,
  doi = {10.48550/ARXIV.1411.4555},
  url = {https://arxiv.org/abs/1411.4555},
  author = {Vinyals, Oriol and Toshev, Alexander and Bengio, Samy and Erhan, Dumitru},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
  title = {Show and Tell: A Neural Image Caption Generator},
  publisher = {arXiv},
  year = {2014},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```
To be improved ✔️ (visit the new repo):

- Migrate to an OpenCV GPU build
- Add an attention mechanism to the decoder
- Reduce the model's parameter count for faster inference
- Replace the greedy nearest-word search with a beam search over the vocabulary
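The last item above can be sketched as a generic beam search. This is a toy, self-contained version (the `step_fn` and tokens are hypothetical; a real implementation would score the LSTM's vocabulary logits at each step):

```python
import heapq
import math

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=10):
    """Keep the `beam_width` best partial sequences instead of only the
    single greedy best. `step_fn(seq)` returns {token: probability} for
    the next word given the sequence so far (hypothetical interface)."""
    beams = [(0.0, [start_token])]            # (negative log-prob, sequence)
    finished = []
    for _ in range(max_len):
        candidates = []
        for nll, seq in beams:
            if seq[-1] == end_token:          # sequence already complete
                finished.append((nll, seq))
                continue
            for tok, p in step_fn(seq).items():
                candidates.append((nll - math.log(p), seq + [tok]))
        if not candidates:
            break
        beams = heapq.nsmallest(beam_width, candidates)  # prune to best beams
    finished.extend(b for b in beams if b[1][-1] == end_token)
    return min(finished)[1] if finished else min(beams)[1]

# Toy next-word distribution keyed on the last token (illustrative only).
probs = {"<s>": {"a": 0.6, "b": 0.4},
         "a": {"</s>": 1.0},
         "b": {"c": 1.0},
         "c": {"</s>": 1.0}}
best = beam_search(lambda seq: probs[seq[-1]], "<s>", "</s>")
print(best)  # ['<s>', 'a', '</s>']
```

Unlike the greedy search, the beam keeps lower-probability prefixes alive in case they lead to a better complete caption.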