
Ego-Vision: hand motion recognition

Introduction

This repository contains the 1st place solution for the DACON Ego-Vision Hand Gesture Recognition AI Contest. We developed a hand motion classification model using sequences of images extracted from Ego-Vision video. The key strategies are as follows:

  1. Custom augmentation is applied, designed around the characteristics of the data.
  2. Because dynamic hand movements are reduced to sequences of static images, the final predictions for some labels are determined by a post-processing algorithm.

This is the overall model process.

Dataset description

  • The data consists of images extracted from first-person (egocentric) video. There are 157 classes; the training data comprises 649 folders in total and the test data 217 folders. The model's task is to predict the class of a given folder by looking at the images inside it (a folder-level aggregation sketch follows this list).
  • The 157 classes are further characterized by 'hand_type' and 'view_type': (Left / Right)-handed and (First / Third)-person view, respectively.
  • Additionally, hand keypoint data containing (x, y, z) coordinates is exploited in post-processing.
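
Since a prediction is made per folder rather than per image, the per-image model outputs have to be aggregated. Below is a minimal sketch of one common aggregation scheme (averaging per-image softmax probabilities and taking the arg-max); the function and array names are hypothetical, and the repository may aggregate differently.

import numpy as np

def predict_folder(image_probs):
    """Aggregate per-image class probabilities into one folder-level prediction.

    image_probs: array of shape (num_images, num_classes) holding softmax
    outputs for every image extracted from a single folder.
    """
    # Average the class probabilities over all images in the folder,
    # then pick the class with the highest mean probability.
    folder_probs = image_probs.mean(axis=0)
    return int(folder_probs.argmax())

# Example with dummy probabilities for a folder of 3 images and 4 classes.
dummy = np.array([[0.10, 0.70, 0.10, 0.10],
                  [0.20, 0.60, 0.10, 0.10],
                  [0.10, 0.80, 0.05, 0.05]])
print(predict_folder(dummy))  # -> 1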

Main strategy (other training techniques are omitted)

  • Custom Data Augmentation

    • Random Crop using key points

      • Most images in a single folder are very similar to each other.
      • We applied random crop augmentation with probabilistic margins around the hand keypoints so that the hand position varies across crops.
    • Flip Augmentation

      • Horizontal flip augmentation has to be handled carefully because some gestures are assigned different classes depending on whether the left or right hand is used, even when the poses are the same.
      • We identified those classes and swapped the class label whenever the flip was applied.
      • Experiments over flip probabilities from 0.1 to 0.5 showed the best result at a ratio of 0.3 (see the augmentation sketch after this list).
  • Post-Processing

    • Motivation

      • We found that the score deviation across folds remained severe even after an appropriate train/valid split.
      • We analyzed the classes for which the model produced weak prediction confidence.
      • As a result, we found that a few specific classes accounted for a disproportionately large share of the overall log loss.
      • The confused class pairs are case 1: (Number 1, Negative Sign) and case 2: (Clenching Fist, Sticking out one's Fist).
      • Because the images within each pair look very similar, the classes are hard to distinguish from any single image, but they can be separated by measuring finger movement across the images of a folder using the keypoint values.
    • A Rule-based Approach using Key points

      Case 1: Class "Number 1" & "Negative Sign (Waving Index Finger)"

      The image stream of a folder shows that for the "Number 1" class the overall hand position is maintained, whereas for the "Negative Sign" class the index finger moves.

      Details:

      • A threshold on the maximum displacement observed for the index-finger keypoint decides between class "Number 1" and class "Negative Sign".
      • The index fingertip is identified as the keypoint with the smallest Y-coordinate among the hand keypoints.
      • The prediction is made by computing how much that keypoint moves along the X-axis.
      1. Number 1
      2. Negative sign (waving index finger)

      Case 2: Class "Clenching Fist" & "Sticking out one's Fist"

      The image stream of a folder shows that for the "Clenching Fist" class the overall hand position is maintained, whereas for the "Sticking out one's Fist" class the hand moves toward the camera.

      Details:

      • "Clenching Fist" and "Sticking out one's Fist" can be separated by the change in position of a particular keypoint.
      • Thresholding is applied to the keypoint with the maximum amount of change along the X-axis.
      • The maximum keypoint change observed for class "Clenching Fist" in the training data is used as the decision threshold.
      1. Clenching fist
      2. Sticking out one's fist

    --> The thresholds for each case are determined from the train dataset and applied unchanged at inference (see the rule sketch below).
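
Below is a minimal sketch of the two custom augmentations described above (keypoint-based random crop and flip with label swap). The margin sampling scheme, the FLIP_LABEL_MAP pairs, and all function names are illustrative assumptions, not the repository's exact implementation.

import random
import numpy as np
from PIL import Image

# Hypothetical mapping between left-handed and right-handed class ids;
# the real pairs must come from the contest's label definition.
FLIP_LABEL_MAP = {3: 17, 17: 3}  # example pair only

def random_crop_around_keypoints(img, keypoints, max_margin=0.2):
    """Crop around the hand bounding box with a random margin per side.

    keypoints: (N, 2) array of (x, y) pixel coordinates for one hand.
    Sampling the margins independently makes the hand land at a
    different position in every crop.
    """
    w, h = img.size
    x_min, y_min = keypoints.min(axis=0)
    x_max, y_max = keypoints.max(axis=0)
    left = max(0, x_min - random.uniform(0, max_margin) * w)
    top = max(0, y_min - random.uniform(0, max_margin) * h)
    right = min(w, x_max + random.uniform(0, max_margin) * w)
    bottom = min(h, y_max + random.uniform(0, max_margin) * h)
    return img.crop((int(left), int(top), int(right), int(bottom)))

def flip_with_label_swap(img, label, p=0.3):
    """Horizontally flip with probability p (0.3 worked best in 0.1-0.5).

    For handedness-dependent classes the label is swapped to its mirrored
    counterpart so the flipped image stays consistent with its label.
    """
    if random.random() < p:
        img = img.transpose(Image.FLIP_LEFT_RIGHT)
        label = FLIP_LABEL_MAP.get(label, label)
    return img, label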

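And here is a minimal sketch of the case 1 rule ("Number 1" vs. "Negative Sign"). The keypoint array layout, the class ids, and the threshold value are placeholders (the real threshold comes from the training data); the case 2 rule follows the same pattern with a different keypoint and threshold.

import numpy as np

# Placeholder class ids and threshold; the actual threshold is taken from the
# training set (the maximum index-finger movement observed for "Number 1").
CLASS_NUMBER_ONE = 0
CLASS_NEGATIVE_SIGN = 1
X_MOVE_THRESHOLD = 0.05

def resolve_case1(folder_keypoints):
    """Decide between "Number 1" and "Negative Sign" for one folder.

    folder_keypoints: (num_frames, num_keypoints, 3) array of (x, y, z)
    hand keypoints for the images of a single folder.
    """
    # Take the index fingertip as the keypoint with the smallest Y-coordinate
    # (averaged over frames to be robust to per-frame noise).
    fingertip = folder_keypoints[..., 1].mean(axis=0).argmin()
    # Maximum displacement of that keypoint along the X-axis across frames.
    x_track = folder_keypoints[:, fingertip, 0]
    x_movement = x_track.max() - x_track.min()
    # Large horizontal movement means a waving index finger -> "Negative Sign".
    return CLASS_NEGATIVE_SIGN if x_movement > X_MOVE_THRESHOLD else CLASS_NUMBER_ONE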

Requirements

  • Ubuntu 18.04, CUDA 11.1
  • opencv-python
  • numpy
  • pandas
  • timm
  • torch==1.8.0, torchvision==0.9.0 (CUDA 11.1)
  • natsort
  • scikit-learn==1.0.0
  • pillow
  • torch_optimizer
  • tqdm
  • ptflops
  • easydict
  • matplotlib
pip install -r requirements.txt

How to Use

  1. Prepare data
python make_data.py
  2. Training
# 1 Single model training
python main.py --img_size=288 --exp_num=0

# 2 Multiple training runs via shell script
sh multi_train.sh
  3. Inference & Post-processing
# Download pretrained weights from GitHub
mkdir -p results
wget -i https://raw.githubusercontent.com/wooseok-shin/Egovision-1st-place-solution/main/load_pretrained.txt -P results
python test_post_processing.py
