In this project, we build on the encoder-decoder with skip-connections proposed by Alhashim and Wonka [1], and construct improved deep neural networks that predict the depth map from a single RGB image in an end-to-end way. The network takes RGB images as input and outputs the corresponding estimated depth maps. The two datasets we use for training, NYU Depth v2 and KITTI, contain RGB images with corresponding ground-truth depth maps.
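As a minimal sketch of the end-to-end prediction (assuming the `PTModel` class name from the DenseDepth PyTorch port in NYU/Initial_Model/model.py; the weights file name is hypothetical):

```python
import torch
from model import PTModel  # NYU/Initial_Model/model.py

# Hypothetical checkpoint; any weights saved during training would work here.
model = PTModel().cuda().eval()
model.load_state_dict(torch.load("nyu_weights.pth"))

# RGB input as a (N, 3, H, W) tensor with values in [0, 1].
rgb = torch.rand(1, 3, 480, 640).cuda()

with torch.no_grad():
    depth = model(rgb)  # estimated depth map (half the input resolution in [1])
```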
To run the whole pipeline, simply run one of the .ipynb files on Colab; each notebook automatically downloads the entire NYU Depth v2 dataset or the KITTI dataset.
NYU/Initial_Model/train.py
trains a model on the NYU Depth v2 dataset.
NYU/Initial_Model/data.py
reads and pre-processes the NYU Depth v2 dataset.
NYU/Initial_Model/loss.py
contains the loss functions.
NYU/Initial_Model/model.py
contains the encoder-decoder model for monocular depth estimation. This file comes from the model.py that can be downloaded from https://github.com/ialhashim/DenseDepth/tree/master/PyTorch.
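For orientation, one decoder stage in that model upsamples the features and concatenates the matching encoder feature map through a skip connection. A simplified sketch (names and channel handling are illustrative, not the exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpSample(nn.Module):
    """One decoder stage: 2x bilinear upsampling, concatenation with the
    matching encoder feature map (the skip connection), then two convolutions."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.convA = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.convB = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        self.relu = nn.LeakyReLU(0.2)

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[2:], mode='bilinear', align_corners=True)
        x = torch.cat([x, skip], dim=1)  # skip connection from the encoder
        return self.relu(self.convB(self.relu(self.convA(x))))
```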
NYU/NYU_Standard_MonocularDepth.ipynb
can be run directly on Colab and automatically downloads the entire NYU Depth v2 dataset. This notebook uses the original encoder-decoder architecture.
NYU/NYU_addbatch_MonocularDepth.ipynb
This is the modified encoder-decoder architecture with BatchNormalization layers added; a sketch of the change follows.
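An illustrative version of the decoder block above with BatchNorm added after each convolution (the exact placement is defined in the notebook):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpSampleBN(nn.Module):
    """Decoder stage with BatchNorm after each convolution; illustrative
    only, the notebook defines the actual placement."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.convA = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.bnA = nn.BatchNorm2d(out_channels)
        self.convB = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        self.bnB = nn.BatchNorm2d(out_channels)
        self.relu = nn.LeakyReLU(0.2)

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[2:], mode='bilinear', align_corners=True)
        x = torch.cat([x, skip], dim=1)
        x = self.relu(self.bnA(self.convA(x)))
        return self.relu(self.bnB(self.convB(x)))
```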
NYU/NYU_addup_MonocularDepth.ipynb
This is the modified encoder-decoder architecture with one additional 2x upsampling layer; see the sketch below.
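The original decoder in [1] predicts depth at half the input resolution, so one plausible reading of this change is a final 2x upsampling step that brings the prediction to full size (a hypothetical sketch):

```python
import torch
import torch.nn.functional as F

def upsample_to_full(depth: torch.Tensor) -> torch.Tensor:
    """One extra 2x bilinear upsampling applied to the predicted depth map."""
    return F.interpolate(depth, scale_factor=2, mode='bilinear', align_corners=True)
```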
NYU/NYU_simpledecoder_MonocularDepth.ipynb
This is the modified encoder-decoder architecture with the skip-connections removed; see the sketch below.
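Without skip connections, each decoder stage sees only the upsampled features; an illustrative skip-free block:

```python
import torch.nn as nn
import torch.nn.functional as F

class UpSampleNoSkip(nn.Module):
    """Illustrative decoder stage without the encoder concatenation."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.relu = nn.LeakyReLU(0.2)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True)
        return self.relu(self.conv(x))
```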
KITTI/kitti_equalloss_MonocularDepth.ipynb
can be run directly on Colab and automatically downloads the KITTI dataset. This notebook uses the original encoder-decoder architecture with the equal-weighted loss function, and its data-loading module is rewritten for the KITTI dataset (a sketch follows).
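A minimal sketch of what the rewritten KITTI data-loading module might look like (directory layout and file naming are assumptions; the actual code lives in the notebook):

```python
import os
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class KittiDepthDataset(Dataset):
    """Hypothetical KITTI loader pairing each RGB frame with its
    ground-truth depth map; assumes matching file names in two folders."""
    def __init__(self, rgb_dir, depth_dir):
        self.rgb_dir, self.depth_dir = rgb_dir, depth_dir
        self.names = sorted(os.listdir(rgb_dir))
        self.to_tensor = transforms.ToTensor()

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        rgb = Image.open(os.path.join(self.rgb_dir, name)).convert('RGB')
        depth = Image.open(os.path.join(self.depth_dir, name))
        return self.to_tensor(rgb), self.to_tensor(depth)
```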
The NYU Depth v2 dataset is composed of 464 indoor scenes recorded at a resolution of 640x480 by the RGB and depth cameras of a Microsoft Kinect, which captures ground-truth depth directly.
KITTI is a large dataset composed of 56 outdoor scenes, including the "city" and "residential" categories of the raw data, among others.
The small test dataset for NYU Depth v2 can be downloaded here.
The total loss function is the weighted sum of three loss functions:
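Following [1], the three terms are a point-wise L1 depth loss, an L1 loss on image gradients, and an SSIM term, combined as L = λ·L_depth + L_grad + L_SSIM with λ = 0.1 in the paper. A sketch of the combination (the SSIM term is passed in, since its implementation lives in loss.py):

```python
import torch

def depth_loss(pred, target):
    # Point-wise L1 depth loss.
    return torch.mean(torch.abs(pred - target))

def gradient_loss(pred, target):
    # L1 loss on finite-difference image gradients along y and x.
    dy_p = pred[..., 1:, :] - pred[..., :-1, :]
    dx_p = pred[..., :, 1:] - pred[..., :, :-1]
    dy_t = target[..., 1:, :] - target[..., :-1, :]
    dx_t = target[..., :, 1:] - target[..., :, :-1]
    return torch.mean(torch.abs(dy_p - dy_t)) + torch.mean(torch.abs(dx_p - dx_t))

def total_loss(pred, target, ssim_loss, lam=0.1):
    # Weighted sum from [1]: lambda * L_depth + L_grad + L_SSIM.
    return lam * depth_loss(pred, target) + gradient_loss(pred, target) + ssim_loss(pred, target)
```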
Reference: https://ece.uwaterloo.ca/~z70wang/research/ssim/
[1] I. Alhashim and P. Wonka, “High quality monocular depth estimation via transfer learning,” arXiv:1812.11941, 2018.