The pose estimation task involves predicting the 2D positions of human body keypoints for every person in each image.
The BDD100K dataset contains 2D human pose annotations for 14K images (10K/1.5K/2.5K for train/val/test). Each annotation contains labels for 18 body keypoints. For details about downloading the data and the annotation format for this task, see the official documentation.
For training the models listed below, we follow the common settings used by MMPose (model zoo here), unless otherwise stated. See the config files for the detailed settings of each model. All models are trained on either 4 GeForce RTX 2080 Ti GPUs or 4 TITAN RTX GPUs with a total batch size of 4x64=256.
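As a rough illustration of what these config files control, the fragment below sketches the MMPose 0.x config style; the field values are examples chosen for illustration, not the exact settings of any model in the tables.

# Illustrative MMPose-0.x-style config fragment (values are examples only)
data_cfg = dict(
    image_size=[192, 256],    # model input size (width, height), e.g. 256 * 192
    heatmap_size=[48, 64],    # output heatmap resolution
    num_output_channels=18,   # 18 body keypoints
)
data = dict(
    samples_per_gpu=64,       # 4 GPUs x 64 samples = total batch size 256
    workers_per_gpu=2,
)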
Top-down methods first detect human bounding boxes and then estimate the keypoint locations for each human.
For the models below, we use a Cascade R-CNN with an R-101-FPN backbone as the human detector, which achieves 32.69 AP on humans on the BDD100K detection validation set (model here). You can find the human detections for the validation set here and for the test set here.
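To make the two-stage structure concrete, here is a minimal sketch of a top-down pipeline. The detector and pose estimator are passed in as generic callables, since this is an illustration rather than the actual MMPose or BDD100K API.

import numpy as np

def top_down_pose(image, detect_humans, estimate_keypoints):
    """Illustrative top-down pipeline; `detect_humans` and `estimate_keypoints`
    are placeholder callables, not part of the actual MMPose API.

    detect_humans(image)      -> list of (x1, y1, x2, y2) person boxes
    estimate_keypoints(crop)  -> (18, 2) array of keypoints in crop coordinates
    """
    poses = []
    for x1, y1, x2, y2 in detect_humans(image):
        # Stage 1: crop the detected person (the real pipeline also resizes
        # the crop to the model input size, e.g. 256 * 192).
        crop = image[int(y1):int(y2), int(x1):int(x2)]
        # Stage 2: predict keypoints inside the crop, then map them back
        # to full-image coordinates.
        keypoints = estimate_keypoints(crop)
        poses.append(keypoints + np.array([x1, y1]))
    return poses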
Deep Residual Learning for Image Recognition [CVPR 2016]
Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
Abstract
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

Backbone | Input Size | Pose AP-val | Scores-val | Pose AP-test | Scores-test | Config | Weights | MD5 | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 256 * 192 | 46.15 | scores | 43.73 | scores | config | model | MD5 | preds | visuals |
ResNet-101 | 256 * 192 | 46.87 | scores | 43.48 | scores | config | model | MD5 | preds | visuals |
ResNet-50 | 320 * 256 | 47.44 | scores | 44.36 | scores | config | model | MD5 | preds | visuals |
ResNet-101 | 320 * 256 | 48.08 | scores | 44.93 | scores | config | model | MD5 | preds | visuals |
MobileNetV2: Inverted Residuals and Linear Bottlenecks [CVPR 2018]
Authors: Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen
Abstract
In this paper we describe a new mobile architecture, MobileNetV2, that improves the state of the art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes. We also describe efficient ways of applying these mobile models to object detection in a novel framework we call SSDLite. Additionally, we demonstrate how to build mobile semantic segmentation models through a reduced form of DeepLabv3 which we call Mobile DeepLabv3. The MobileNetV2 architecture is based on an inverted residual structure where the input and output of the residual block are thin bottleneck layers, opposite to traditional residual models which use expanded representations in the input. MobileNetV2 uses lightweight depthwise convolutions to filter features in the intermediate expansion layer. Additionally, we find that it is important to remove non-linearities in the narrow layers in order to maintain representational power. We demonstrate that this improves performance and provide an intuition that led to this design. Finally, our approach allows decoupling of the input/output domains from the expressiveness of the transformation, which provides a convenient framework for further analysis. We measure our performance on ImageNet classification, COCO object detection, and VOC image segmentation. We evaluate the trade-offs between accuracy and number of operations measured by multiply-adds (MAdd), as well as the number of parameters.

Backbone | Input Size | Pose AP-val | Scores-val | Pose AP-test | Scores-test | Config | Weights | MD5 | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|
MobileNetV2 | 256 * 192 | 43.82 | scores | 41.02 | scores | config | model | MD5 | preds | visuals |
MobileNetV2 | 320 * 256 | 45.15 | scores | 42.32 | scores | config | model | MD5 | preds | visuals |
Deep High-Resolution Representation Learning for Visual Recognition [CVPR 2019 / TPAMI 2020]
Authors: Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, Bin Xiao
Abstract
High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions in series (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named as High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) Connect the high-to-low resolution convolution streams in parallel; (ii) Repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems. All the codes are available at [this https URL](https://github.com/HRNet).

Backbone | Input Size | Pose AP-val | Scores-val | Pose AP-test | Scores-test | Config | Weights | MD5 | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|
HRNet-w32 | 256 * 192 | 48.83 | scores | 46.13 | scores | config | model | MD5 | preds | visuals |
HRNet-w48 | 256 * 192 | 50.32 | scores | 47.36 | scores | config | model | MD5 | preds | visuals |
HRNet-w32 | 320 * 256 | 49.86 | scores | 46.90 | scores | config | model | MD5 | preds | visuals |
HRNet-w48 | 320 * 256 | 50.16 | scores | 47.32 | scores | config | model | MD5 | preds | visuals |
Distribution-Aware Coordinate Representation for Human Pose Estimation [CVPR 2020]
Authors: Feng Zhang, Xiatian Zhu, Hanbin Dai, Mao Ye, Ce Zhu
Abstract
While being the de facto standard coordinate representation in human pose estimation, heatmap is never systematically investigated in the literature, to our best knowledge. This work fills this gap by studying the coordinate representation with a particular focus on the heatmap. Interestingly, we found that the process of decoding the predicted heatmaps into the final joint coordinates in the original image space is surprisingly significant for human pose estimation performance, which nevertheless was not recognised before. In light of the discovered importance, we further probe the design limitations of the standard coordinate decoding method widely used by existing methods, and propose a more principled distribution-aware decoding method. Meanwhile, we improve the standard coordinate encoding process (i.e. transforming ground-truth coordinates to heatmaps) by generating accurate heatmap distributions for unbiased model training. Taking the two together, we formulate a novel Distribution-Aware coordinate Representation of Keypoint (DARK) method. Serving as a model-agnostic plug-in, DARK significantly improves the performance of a variety of state-of-the-art human pose estimation models. Extensive experiments show that DARK yields the best results on two common benchmarks, MPII and COCO, consistently validating the usefulness and effectiveness of our novel coordinate representation idea.

Backbone | Input Size | Pose AP-val | Scores-val | Pose AP-test | Scores-test | Config | Weights | MD5 | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 256 * 192 | 47.07 | scores | 44.05 | scores | config | model | MD5 | preds | visuals |
ResNet-101 | 256 * 192 | 47.43 | scores | 43.98 | scores | config | model | MD5 | preds | visuals |
ResNet-50 | 320 * 256 | 47.55 | scores | 44.73 | scores | config | model | MD5 | preds | visuals |
ResNet-101 | 320 * 256 | 48.44 | scores | 45.06 | scores | config | model | MD5 | preds | visuals |
MobileNetV2 | 256 * 192 | 44.44 | scores | 41.25 | scores | config | model | MD5 | preds | visuals |
MobileNetV2 | 320 * 256 | 45.02 | scores | 42.26 | scores | config | model | MD5 | preds | visuals |
HRNet-w32 | 256 * 192 | 48.92 | scores | 46.02 | scores | config | model | MD5 | preds | visuals |
HRNet-w48 | 256 * 192 | 50.08 | scores | 47.30 | scores | config | model | MD5 | preds | visuals |
HRNet-w32 | 320 * 256 | 49.84 | scores | 46.95 | scores | config | model | MD5 | preds | visuals |
HRNet-w48 | 320 * 256 | 50.31 | scores | 46.91 | scores | config | model | MD5 | preds | visuals |
The Devil Is in the Details: Delving Into Unbiased Data Processing for Human Pose Estimation [CVPR 2020]
Authors: Junjie Huang, Zheng Zhu, Feng Guo, Guan Huang
Abstract
Being a fundamental component in training and inference, data processing has not been systematically considered in the human pose estimation community, to the best of our knowledge. In this paper, we focus on this problem and find that the devil of human pose estimation evolution is in the biased data processing. Specifically, by investigating the standard data processing in state-of-the-art approaches, mainly including coordinate system transformation and keypoint format transformation (i.e., encoding and decoding), we find that the results obtained by the common flipping strategy are unaligned with the original ones in inference. Moreover, there is a statistical error in some keypoint format transformation methods. The two problems couple together and significantly degrade the pose estimation performance, thus laying a trap for the research community. This trap has given birth to many suboptimal remedies, which are always unreported, confusing but influential. By causing failures in reproduction and unfairness in comparison, the unreported remedies seriously impede technological development. To tackle this dilemma from the source, we propose Unbiased Data Processing (UDP), consisting of two technical aspects for the two aforementioned problems respectively (i.e., unbiased coordinate system transformation and unbiased keypoint format transformation). As a model-agnostic approach and a superior solution, UDP successfully pushes the performance boundary of human pose estimation and offers a higher and more reliable baseline for the research community. Code is publicly available at [this https URL](https://github.com/HuangJunJie2017/UDP-Pose).

Backbone | Input Size | Pose AP-val | Scores-val | Pose AP-test | Scores-test | Config | Weights | MD5 | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 256 * 192 | 46.38 | scores | 44.17 | scores | config | model | MD5 | preds | visuals |
ResNet-101 | 256 * 192 | 46.27 | scores | 43.72 | scores | config | model | MD5 | preds | visuals |
ResNet-50 | 320 * 256 | 47.45 | scores | 44.66 | scores | config | model | MD5 | preds | visuals |
MobileNetV2 | 256 * 192 | 43.97 | scores | 41.03 | scores | config | model | MD5 | preds | visuals |
MobileNetV2 | 320 * 256 | 45.98 | scores | 42.59 | scores | config | model | MD5 | preds | visuals |
HRNet-w32 | 256 * 192 | 49.52 | scores | 46.42 | scores | config | model | MD5 | preds | visuals |
a. Create a conda virtual environment and activate it.
conda create -n bdd100k-mmpose python=3.8
conda activate bdd100k-mmpose
b. Install PyTorch and torchvision following the official instructions, e.g.,
conda install pytorch torchvision -c pytorch
Note: Make sure that your compilation CUDA version and runtime CUDA version match. You can check the supported CUDA version for precompiled packages on the PyTorch website.
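For example, assuming your system uses CUDA 10.2 (adjust the cudatoolkit version to match your driver), the install command would look like:

conda install pytorch torchvision cudatoolkit=10.2 -c pytorch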
c. Install mmcv and mmpose.
pip install mmcv-full
pip install mmpose==0.18.0
You can also refer to the official instructions.
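If pip cannot find or build a suitable mmcv-full package, you can install a prebuilt wheel from the OpenMMLab index; the CUDA and PyTorch version segments in the URL below are placeholders that must match your environment:

pip install mmcv-full -f https://download.openmmlab.com/mmcv/dist/cu102/torch1.9.0/index.html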
Single GPU inference:
python ./test.py ${CONFIG_FILE} --format-dir ${OUTPUT_DIR} [--cfg-options]
Multiple GPU inference:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
    --nproc_per_node=4 --master_port=12000 ./test.py ${CONFIG_FILE} \
    --format-dir ${OUTPUT_DIR} [--cfg-options] --launcher pytorch
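For instance, a single-GPU run might look like the following; the config path and output directory are placeholders, so substitute the actual config file of the model you want to evaluate:

python ./test.py configs/pose/hrnet_w32_256x192_pose_bdd100k.py --format-dir ./preds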
To evaluate the pose estimation performance on the BDD100K validation set, you can use the official evaluation script provided by BDD100K:
python -m bdd100k.eval.run -t pose \
-g ../data/bdd100k/labels/pose/pose_${SET_NAME}.json \
-r ${OUTPUT_DIR}/result_keypoints.json \
[--out-file ${RESULTS_FILE}] [--nproc ${NUM_PROCESS}]
You can obtain the performance on the BDD100K test set by submitting your model predictions to our evaluation server hosted on EvalAI.
For visualization, you can use the visualization tool provided by Scalabel.
Below is an example:
import os
import numpy as np
from PIL import Image
from scalabel.label.io import load
from scalabel.vis.label import LabelViewer
# load prediction frames
frames = load('$OUTPUT_DIR/result_keypoints.json').frames
viewer = LabelViewer()
for frame in frames:
    img = np.array(Image.open(os.path.join('$IMG_DIR', frame.name)))
    viewer.draw(img, frame)
    viewer.save(os.path.join('$VIS_DIR', frame.name))
You can include your models in this repo as well! Please follow the contribution instructions.