Video Search Engine

Authors:

Abby Gray
Akshat Shrivastava
Kevin Bi
Sarah Yu

Semantically search through a database of videos using generated summaries.

Take a look at our poster!

System Overview

The video below shows exactly how the entire system works end to end.

The user-facing system shown here gives an overview of the overall system architecture.

The backend and video summarization system is distributed so that it can handle large videos. The architecture is shown in the image below.

Video Summarization Overview

In this project, we attempted to solve video summarization using image captioning. The architecture and motivation are explained in this section.

Below is the initial architecture of the video summarization network used to generate video summaries.

We converted this into the following network for the final project.

The steps of the final network are explained below:

We break apart frames into semantically different groups.
- Here we use SSIM (the structural similarity index measure) to determine whether two frames are similar
- We define a threshold for comparison
- Any sequence of frames within that threshold belongs to a specific group.
Randomly sample from each group
- Since each group contains semantically similar frames, we reduce redundancy in the frame captions by selecting a very small subset (1-5 frames) from each group
Feed each selected frame to an image captioning network to determine what happens in the frame
- This uses an Encoder-Decoder model for captioning the images, as described in Object2Text
- Model description
  - Encoder
    - EncoderCNN
      - Uses a pretrained ResNet-152 to encode the frame's features into a feature vector
    - YoloEncoder
      - Performs bounding box object detection on the frame to determine the objects it contains and their bounding boxes
      - Uses an RNN structure (an LSTM for this model) to encode the sequence of objects and their names
      - Uses the result to create another encoded feature vector
  - Decoder
    - Combines the two feature vectors from the EncoderCNN and the YoloEncoder to create a new feature vector, and uses that feature vector as input to start language generation for the frame caption
- Training
  - Dataset: uses COCO for training
  - Bounding Box: during training, uses TinyYOLO for faster training, which also lets the network train on a less reliable detector; the more reliable version is used during testing
Use extractive summarization to select unique phrases from all the selected frame captions to create a reasonable description of what occurred in the video.
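
The first two steps (grouping frames by similarity, then sampling from each group) can be illustrated with a minimal sketch. It is not the project's code: it assumes frames are flat lists of grayscale pixel values, uses a simplified single-window SSIM rather than the windowed version in pytorch-ssim, and compares each frame with the one before it. The names `global_ssim`, `group_frames`, and `sample_groups`, and the threshold of 0.8, are hypothetical choices made for illustration.

```python
import random


def global_ssim(a, b, data_range=255.0):
    """Simplified single-window SSIM between two equal-length grayscale frames.

    Assumption: the whole frame is treated as one window. Real SSIM
    implementations average the score over many local windows.
    """
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def group_frames(frames, threshold=0.8):
    """Split frame indices into consecutive groups of similar frames.

    A frame joins the current group when its SSIM with the previous frame
    is at least `threshold` (a hypothetical value); otherwise it starts a
    new group.
    """
    groups = []
    for i, frame in enumerate(frames):
        if i > 0 and global_ssim(frames[i - 1], frame) >= threshold:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def sample_groups(groups, k=2, seed=0):
    """Randomly pick up to k frame indices from each group."""
    rng = random.Random(seed)
    return [sorted(rng.sample(group, min(k, len(group)))) for group in groups]
```

Comparing each frame with its predecessor keeps the grouping to a single pass over the video. Comparing against the first frame of the current group would be an equally plausible reading of the description above.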

The next section shows example output:

Example output

Given a minute-long video of traffic in Dhaka, Bangladesh, the system produced:

(
    'a man riding a bike down a street next to a large truck .',
    'a man riding a bike down a street next to a traffic light .',
    'a green truck with a lot of cars on it',
    'a green truck with a lot of cars on the road .',
    'a city bus driving down a street next to a traffic light .'
)
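
As an illustration of the extractive step that selects unique phrases from the frame captions, here is a minimal sketch of a greedy filter that keeps a caption only if it differs enough from those already kept. This is an assumed stand-in, not the project's summarizer: the Jaccard word-overlap measure, the `max_overlap` value of 0.7, and the function names are all hypothetical.

```python
def tokens(caption):
    """Lowercased set of alphabetic words in a caption (punctuation dropped)."""
    return {word for word in caption.lower().split() if word.isalpha()}


def jaccard(a, b):
    """Jaccard overlap between two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def select_unique(captions, max_overlap=0.7):
    """Keep each caption whose overlap with every kept caption is at most max_overlap."""
    kept = []
    for caption in captions:
        words = tokens(caption)
        if all(jaccard(words, tokens(other)) <= max_overlap for other in kept):
            kept.append(caption)
    return kept


# Hypothetical frame captions, not output from the project.
frame_captions = [
    "a man riding a bike down a street .",
    "a man riding a bike down a street .",
    "a man riding a bike down the street .",
    "a city bus driving down a street .",
]
summary = select_unique(frame_captions)
```

A lower `max_overlap` gives a shorter, more aggressive summary; a higher one keeps near-duplicates that differ by only a word or two.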

User Interface

To use our search engine, we built a Flask-based application, similar to Google, for searching through our database.

Main UI

This page features the main search functionality, with a simple design similar to Google's.

Results UI

This page features all the results for a given query. Every video in our database is returned, sorted by relevance. We use TF-IDF to score each video's summary against the query and rank the videos by that score.
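
To make the ranking step concrete, here is a minimal sketch of TF-IDF scoring over video summaries. It is written under assumptions rather than taken from the project: the tokenizer, the smoothed IDF formula, the `rank_videos` name, and the video file names are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer that drops punctuation-only tokens."""
    return [word for word in text.lower().split() if word.isalnum()]


def rank_videos(query, summaries):
    """Return every video name, ordered from highest to lowest TF-IDF score.

    `summaries` maps a video name to its generated summary text.
    """
    counts = {name: Counter(tokenize(text)) for name, text in summaries.items()}
    n_docs = len(counts)

    def idf(term):
        # Smoothed inverse document frequency (one common variant).
        doc_freq = sum(1 for c in counts.values() if term in c)
        return math.log((1 + n_docs) / (1 + doc_freq)) + 1

    scores = {}
    for name, term_counts in counts.items():
        total = sum(term_counts.values()) or 1
        scores[name] = sum(
            (term_counts[term] / total) * idf(term) for term in tokenize(query)
        )
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical database of summaries keyed by video file name.
database = {
    "traffic.mp4": "a man riding a bike down a street",
    "kitchen.mp4": "a woman cutting vegetables in a kitchen",
    "bus.mp4": "a city bus driving down a street",
}
ranking = rank_videos("bike street", database)
```

Videos that share no terms with the query score zero and land at the end of the list, which matches the behavior described above of returning every video in sorted order.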

Set Up

To set up the Python code, create a Python 3 environment as follows:

# create a virtual environment
$ python3 -m venv env

# activate environment
$ source env/bin/activate

# install all requirements
$ pip install -r requirements.txt

# install data files
$ python dataloader.py

If you add a new package you will have to update the requirements.txt with the following command:

# add new packages
$ pip freeze > requirements.txt

And if you want to deactivate the virtual environment:

# deactivate the virtual env
$ deactivate

Training Captioning Network

Caption Network Set up

python VideoSearchEngine/ImageCaptioningNoYolo/resize.py --image_dir data/coco/train2014/ 
python VideoSearchEngine/ImageCaptioningNoYolo/resize.py --image_dir data/coco/val2014/ --output_dir data/val_resized2014

Plan

Broadly, our project attempts video search through video summarization. To do this, we propose the following objectives and resulting action plan:

Break videos down into semantically different groups of frames
Recognize objects in an image (i.e. a frame)
Convert a frame to text
Merge summaries of all frames of a video into one large overall summary
Build a search engine to query videos via summary.

Data Sets to Use

TaCos MultiModal Data Set

Lots of labeled data for text generation of video summaries.

Paper about how data was collected and performance.

The location of the video dataset: Source

Citations

Papers

Microsoft Research Paper on Video Summarization
YOLO Paper for bounding box object detection
Using YOLO for image captioning
Unsupervised Video Summarization with Adversarial Networks
Long-term Recurrent Convolutional Networks
Coherent Multi-Sentence Video Description with Variable Level of Detail

GitHubs

Original YOLO implementation
Code for YOLO -> LSTM for image captioning
YOLO PyTorch Implementation for Guidance
Tiny YOLO Implementation
machinebox -> video analysis/frame partitioning
Code to break video into frames
Po-Hsun-Su/pytorch-ssim

Blogs and Other Websites

A Guide for YOLO
Another YOLO Guide (same author as above)
