Robotic Pick-and-Place of Novel Objects in Clutter

This repository contains implementations for the major components of robot perception as part of MIT-Princeton's 1st place winning entry for the stow task at the Amazon Robotics Challenge 2017. Featuring:

- Suction-Based Grasping - a Torch implementation of fully convolutional neural networks (FCNs) for directly predicting suction-based grasping affordances from RGB-D images.
  - Baseline Algorithm - a Matlab implementation of a baseline algorithm that predicts suction-based grasping affordances by computing the variance of surface normals of a 3D point cloud (projected from RGB-D images), where lower variance = higher affordance.
- Parallel-Jaw Grasping - a Torch implementation of fully convolutional neural networks (FCNs) for directly predicting parallel-jaw grasping affordances from heightmaps (created from RGB-D images).
  - Baseline Algorithm - a Matlab implementation of a baseline algorithm for detecting anti-podal parallel-jaw grasps by detecting "hill-like" geometric structures over a 3D point cloud (projected from RGB-D images).
- Image Matching - a Torch implementation of two-stream convolutional neural networks for matching observed images of grasped objects to their product images for recognition.

For more information about our approach, please visit our project webpage and check out our paper:

Robotic Pick-and-Place of Novel Objects in Clutter with Multi-Affordance Grasping and Cross-Domain Image Matching ( pdf | arxiv | webpage )

Andy Zeng, Shuran Song, Kuan-Ting Yu, Elliott Donlon, Francois R. Hogan, Maria Bauza, Daolin Ma, Orion Taylor, Melody Liu, Eudald Romo, Nima Fazeli, Ferran Alet, Nikhil Chavan Dafle, Rachel Holladay, Isabella Morona, Prem Qu Nair, Druck Green, Ian Taylor, Weber Liu, Thomas Funkhouser, Alberto Rodriguez

IEEE International Conference on Robotics and Automation (ICRA) 2018

Abstract: This paper presents a robotic pick-and-place system that is capable of grasping and recognizing both known and novel objects in cluttered environments. The key new feature of the system is that it handles a wide range of object categories without needing any task-specific training data for novel objects. To achieve this, it first uses a category-agnostic affordance prediction algorithm to select among four different grasping primitive behaviors. It then recognizes picked objects with a cross-domain image classification framework that matches observed images to product images. Since product images are readily available for a wide range of objects (e.g., from the web), the system works out-of-the-box for novel objects without requiring any additional training data. Exhaustive experimental results demonstrate that our multi-affordance grasping achieves high success rates for a wide variety of objects in clutter, and our recognition algorithm achieves high accuracy for both known and novel grasped objects. The approach was part of the MIT-Princeton Team system that took 1st place in the stowing task at the 2017 Amazon Robotics Challenge.

Citing

If you find this code useful in your work, please consider citing:

@inproceedings{zeng2018robotic,
	title={Robotic Pick-and-Place of Novel Objects in Clutter with Multi-Affordance Grasping and Cross-Domain Image Matching}, 
	author={Zeng, Andy and Song, Shuran and Yu, Kuan-Ting and Donlon, Elliott and Hogan, Francois Robert and Bauza, Maria and Ma, Daolin and Taylor, Orion and Liu, Melody and Romo, Eudald and Fazeli, Nima and Alet, Ferran and Dafle, Nikhil Chavan and Holladay, Rachel and Morona, Isabella and Nair, Prem Qu and Green, Druck and Taylor, Ian and Liu, Weber and Funkhouser, Thomas and Rodriguez, Alberto}, 
	booktitle={Proceedings of the IEEE International Conference on Robotics and Automation}, 
	year={2018} 
}

License

This code is released under the Apache License v2.0 (refer to the LICENSE file for details).

Datasets

Information and download links for our grasping dataset and image matching dataset can be found on our project webpage.

Contact

If you have any questions or find any bugs, please let me know: Andy Zeng andyz[at]princeton[dot]edu

Change Log

Nov. 16, 2017. Fix: added require 'util' to DataLoader.lua.

Requirements and Dependencies

- NVIDIA GPU with compute capability 3.5+
- Torch with packages: image, optim, inn, cutorch, cunn, cudnn, hdf5
- Matlab 2015b or later

Our implementations have been tested on Ubuntu 16.04 with an NVIDIA Titan X. Our full pick-and-place system implementation (outside the scope of this repository) uses a lightweight C++ ROS service as a wrapper to control Torch/Lua and Matlab processes via TCP sockets. The processes share data by reading from and writing to a RAM disk.

Suction-Based Grasping

A Torch implementation of fully convolutional neural networks for predicting pixel-level affordances (here higher values indicate better surface locations for grasping with suction) given an RGB-D image as input.

Quick Start

To run our pre-trained model to get pixel-level affordances for grasping with suction:

Clone this repository and navigate to arc-robot-vision/suction-based-grasping/convnet

git clone https://github.com/andyzeng/arc-robot-vision.git
cd arc-robot-vision/suction-based-grasping/convnet

Download our pre-trained model for suction-based grasping:
```
wget http://vision.princeton.edu/projects/2017/arc/downloads/suction-based-grasping-snapshot-10001.t7
```
Direct download link: suction-based-grasping-snapshot-10001.t7 (450.1 MB)
Run our model on an optional target RGB-D image. Input color images should be 24-bit RGB PNG, while depth images should be 16-bit PNG, where depth values are saved in deci-millimeters (10^-4m).
```
th infer.lua # creates results.h5
```
or
```
imgColorPath=<image.png> imgDepthPath=<image.png> modelPath=<model.t7> th infer.lua # creates results.h5
```
Visualize the predictions in Matlab. The script shows a heat map of confidence values, where hotter regions indicate better locations for grasping with suction. It also displays the computed surface normals, which can be used to choose between the robot motion primitives suction-down and suction-side. Run the following in Matlab:
```
visualize; % creates results.png and normals.png
```
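The depth encoding expected by the inference step above (16-bit values in deci-millimeters, 10^-4 m) can be sketched as follows. This is a minimal illustration in plain Python and is not code from this repository: the function names are hypothetical, and clamping out-of-range depths to the 16-bit limits is an assumption about how one might handle them.

```python
# Hypothetical helpers (not part of this repository) illustrating the
# 16-bit deci-millimeter depth encoding described above.

DECIMM_PER_METER = 10000  # 1 deci-millimeter = 10^-4 m
UINT16_MAX = 65535        # largest value a 16-bit PNG pixel can hold


def meters_to_decimm(depth_m):
    """Convert a depth in meters to a 16-bit deci-millimeter integer."""
    value = int(round(depth_m * DECIMM_PER_METER))
    # Assumption: clamp to the representable range (0 m to 6.5535 m).
    return max(0, min(UINT16_MAX, value))


def decimm_to_meters(depth_raw):
    """Convert a 16-bit deci-millimeter integer back to meters."""
    return depth_raw / DECIMM_PER_METER


print(meters_to_decimm(0.5))   # 5000
print(decimm_to_meters(5000))  # 0.5
```

Under this encoding a 16-bit pixel covers depths up to about 6.55 m at 0.1 mm resolution.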

Training

To train your own model:

Navigate to arc-robot-vision/suction-based-grasping
```
cd arc-robot-vision/suction-based-grasping
```
Download our suction-based grasping dataset and save the files into arc-robot-vision/suction-based-grasping/data. More information about the dataset can be found on our project webpage.
```
wget http://vision.princeton.edu/projects/2017/arc/downloads/suction-based-grasping-dataset.zip
unzip suction-based-grasping-dataset.zip # unzip dataset
```
Direct download link: suction-based-grasping-dataset.zip (1.6 GB)
Download the Torch ResNet-101 model pre-trained on ImageNet:
```
cd convnet
wget http://vision.princeton.edu/projects/2017/arc/downloads/resnet-101.t7
```
Direct download link: resnet-101.t7 (409.4 MB)
Run training (set optional parameters through command line arguments):
```
th train.lua
```
Tip: if you run out of GPU memory (CUDA error=2), reduce batch size or modify the network architecture in model.lua to use the smaller ResNet-50 (256.7 MB) model pre-trained on ImageNet.

Evaluation

To evaluate a trained model:

Navigate to arc-robot-vision/suction-based-grasping/convnet
```
cd arc-robot-vision/suction-based-grasping/convnet
```
Run our pre-trained model to get affordance predictions for the testing split of our grasping dataset:
```
th test.lua # creates evaluation-results.h5
```
or run your own model:
```
modelPath=<model.t7> th test.lua # creates evaluation-results.h5
```
Run the evaluation script in Matlab to compute pixel-level precision against manual annotations from the grasping dataset, as reported in our paper:
```
evaluate;
```

Baseline Algorithm

Our baseline algorithm predicts affordances for suction-based grasping by first computing 3D surface normals of the point cloud (projected from the RGB-D image), then measuring the variance of the surface normals (higher variance = lower affordance). To run our baseline algorithm over the testing split of our grasping dataset:

Navigate to arc-robot-vision/suction-based-grasping/baseline
```
cd arc-robot-vision/suction-based-grasping/baseline
```
Run the following in Matlab:
```
test; % creates results.mat
evaluate;
```
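The idea behind the baseline (surface normals, then local variance of those normals) can be sketched in plain Python. This is not the Matlab code in this repository. It assumes a small height grid with unit pixel spacing instead of a point cloud projected with camera intrinsics, and the window radius and the mapping from variance to affordance (`1 / (1 + variance)`) are hypothetical choices made only for illustration.

```python
import math


def surface_normals(z, spacing=1.0):
    """Estimate unit normals of a height grid z[r][c] (at least 2x2).

    Uses central differences in the interior and one-sided
    differences at the borders.
    """
    h, w = len(z), len(z[0])
    normals = [[(0.0, 0.0, 1.0)] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            r0, r1 = max(r - 1, 0), min(r + 1, h - 1)
            c0, c1 = max(c - 1, 0), min(c + 1, w - 1)
            dzdx = (z[r][c1] - z[r][c0]) / ((c1 - c0) * spacing)
            dzdy = (z[r1][c] - z[r0][c]) / ((r1 - r0) * spacing)
            n = (-dzdx, -dzdy, 1.0)
            norm = math.sqrt(sum(v * v for v in n))
            normals[r][c] = tuple(v / norm for v in n)
    return normals


def normal_variance(normals, radius=1):
    """Sum of per-axis variances of the normals in a square window."""
    h, w = len(normals), len(normals[0])
    var = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            window = [normals[rr][cc]
                      for rr in range(max(r - radius, 0), min(r + radius + 1, h))
                      for cc in range(max(c - radius, 0), min(c + radius + 1, w))]
            total = 0.0
            for axis in range(3):
                vals = [n[axis] for n in window]
                mean = sum(vals) / len(vals)
                total += sum((v - mean) ** 2 for v in vals) / len(vals)
            var[r][c] = total
    return var


def suction_affordance(z, radius=1):
    """Higher variance gives lower affordance (hypothetical mapping)."""
    var = normal_variance(surface_normals(z), radius)
    return [[1.0 / (1.0 + v) for v in row] for row in var]


# A flat region next to a sharp step: the flat side scores higher.
step = [[0, 0, 0, 5, 5, 5] for _ in range(6)]
scores = suction_affordance(step)
print(scores[3][0] > scores[3][2])  # True
```

Flat regions give identical normals, zero variance, and therefore the maximum score, while edges and creases are penalized.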

Parallel-Jaw Grasping

A Torch implementation of fully convolutional neural networks for predicting pixel-level affordances for parallel-jaw grasping. The network takes an RGB-D heightmap as input and outputs affordances for horizontal grasps. Because input heightmaps can be rotated by any angle, a single model can predict grasp affordances for any grasping angle.

Heightmaps are generated by orthographically re-projecting 3D point clouds (from RGB-D images) upwards along the gravity direction where the height value of bin bottom = 0 (see getHeightmap.m).
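As a rough illustration of the orthographic re-projection, the sketch below bins 3D points into a top-down grid and keeps the highest point per cell. It is not a port of getHeightmap.m. It assumes the points are already expressed in a bin-aligned frame with z pointing up and z = 0 at the bin bottom, and the function name, grid extents, and cell size are hypothetical.

```python
def make_heightmap(points, x_range, y_range, cell_size):
    """Project (x, y, z) points onto a top-down grid of maximum heights.

    Cells that receive no points keep a height of 0 (the bin bottom).
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    cols = int(round((x_max - x_min) / cell_size))
    rows = int(round((y_max - y_min) / cell_size))
    heightmap = [[0.0] * cols for _ in range(rows)]
    for x, y, z in points:
        col = int((x - x_min) // cell_size)
        row = int((y - y_min) // cell_size)
        # Ignore points that fall outside the grid.
        if 0 <= row < rows and 0 <= col < cols:
            heightmap[row][col] = max(heightmap[row][col], z)
    return heightmap


points = [
    (0.05, 0.05, 0.02),
    (0.05, 0.05, 0.04),  # same cell, higher point wins
    (0.15, 0.05, 0.01),
    (0.90, 0.90, 0.50),  # outside the grid, ignored
]
print(make_heightmap(points, (0.0, 0.2), (0.0, 0.2), 0.1))
# [[0.04, 0.01], [0.0, 0.0]]
```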

Quick Start

To run our pre-trained model to get pixel-level affordances for parallel-jaw grasping:

Clone this repository and navigate to arc-robot-vision/parallel-jaw-grasping/convnet

git clone https://github.com/andyzeng/arc-robot-vision.git
cd arc-robot-vision/parallel-jaw-grasping/convnet

Download our pre-trained model for parallel-jaw grasping:
```
wget http://vision.princeton.edu/projects/2017/arc/downloads/parallel-jaw-grasping-snapshot-20001.t7
```
Direct download link: parallel-jaw-grasping-snapshot-20001.t7 (450.1 MB)
To generate an RGB-D heightmap given two RGB-D images, run the following in Matlab:
```
getHeightmap;
```
Run our model on an optional target RGB-D heightmap. Input color images should be 24-bit RGB PNG, while height images (depth) should be 16-bit PNG, where height values are saved in deci-millimeters (10^-4m) and bin bottom = 0.
```
th infer.lua # creates results.h5
```
or
```
imgColorPath=<image.png> imgDepthPath=<image.png> modelPath=<model.t7> th infer.lua # creates results.h5
```
Visualize the predictions in Matlab. The script shows a heat map of confidence values, where hotter regions indicate better locations for horizontal parallel-jaw grasping. Run the following in Matlab:
```
visualize; % creates results.png
```

Training

To train your own model:

Navigate to arc-robot-vision/parallel-jaw-grasping
```
cd arc-robot-vision/parallel-jaw-grasping
```
Download our parallel-jaw grasping dataset and save the files into arc-robot-vision/parallel-jaw-grasping/data. More information about the dataset can be found on our project webpage.
```
wget http://vision.princeton.edu/projects/2017/arc/downloads/parallel-jaw-grasping-dataset.zip
unzip parallel-jaw-grasping-dataset.zip # unzip dataset
```
Direct download link: parallel-jaw-grasping-dataset.zip (711.8 MB)
Pre-process the input data and labels for the parallel-jaw grasping dataset and save the files into arc-robot-vision/parallel-jaw-grasping/convnet/training. Pre-processing includes rotating heightmaps into 16 discrete rotations, converting raw grasp labels (two-point lines) into dense pixel-wise labels, and augmenting labels with small amounts of jittering. Either run the following in Matlab:
```
cd convnet;
processLabels;
```
or download our already pre-processed input:
```
cd convnet;
wget http://vision.princeton.edu/projects/2017/arc/downloads/training-parallel-jaw-grasping-dataset.zip
unzip training-parallel-jaw-grasping-dataset.zip # unzip dataset
```
Direct download link: training-parallel-jaw-grasping-dataset.zip (740.0 MB)
Download the Torch ResNet-101 model pre-trained on ImageNet:
```
wget http://vision.princeton.edu/projects/2017/arc/downloads/resnet-101.t7
```
Direct download link: resnet-101.t7 (409.4 MB)
Run training (set optional parameters through command line arguments):
```
th train.lua
```
Tip: if you run out of GPU memory (CUDA error=2), reduce batch size or modify the network architecture in model.lua to use the smaller ResNet-50 (256.7 MB) model pre-trained on ImageNet.
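To make the 16 discrete rotations from the pre-processing step more concrete, the sketch below lists evenly spaced rotation angles and snaps a two-point grasp label to the nearest one. It does not reproduce processLabels: the even spacing over a full 360 degrees, the function names, and the idea of assigning a label to a rotation by the direction of its line are all assumptions for illustration.

```python
import math

NUM_ROTATIONS = 16  # number of discrete heightmap rotations, from the text above


def rotation_angles(num_rotations=NUM_ROTATIONS, span_deg=360.0):
    """Evenly spaced rotation angles in degrees (the span is an assumption)."""
    step = span_deg / num_rotations
    return [i * step for i in range(num_rotations)]


def nearest_rotation_index(p1, p2, num_rotations=NUM_ROTATIONS, span_deg=360.0):
    """Index of the rotation closest to the direction from p1 to p2."""
    angle = math.degrees(math.atan2(p2[1] - p1[1], p2[0] - p1[0])) % span_deg
    step = span_deg / num_rotations
    # Wrap around so angles just below the span map back to index 0.
    return int(round(angle / step)) % num_rotations


print(rotation_angles()[:3])                   # [0.0, 22.5, 45.0]
print(nearest_rotation_index((0, 0), (0, 1)))  # 4
```

If grasps that differ by 180 degrees are treated as equivalent, passing `span_deg=180.0` gives a finer spacing of 11.25 degrees.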

Evaluation

To evaluate a trained model:

Navigate to arc-robot-vision/parallel-jaw-grasping/convnet
```
cd arc-robot-vision/parallel-jaw-grasping/convnet
```
Run the model to get affordance predictions for the testing split of our grasping dataset:
```
modelPath=<model.t7> th test.lua # creates evaluation-results.h5
```
Run the evaluation script in Matlab to compute pixel-level precision against manual annotations from the grasping dataset, as reported in our paper:
```
evaluate;
```

Baseline Algorithm

Our baseline algorithm detects anti-podal parallel-jaw grasps by searching (with a brute-force sliding window) for "hill-like" geometric features in the 3D point cloud of an input heightmap (no color). These geometric features should satisfy two constraints: (1) the gripper fingers fit within the concavities along the sides of the hill, and (2) the top of the hill is at least 2 cm above the lowest points of the concavities. Each valid grasp is ranked by an affordance score, computed as the percentage of 3D surface points between the gripper fingers that are above the lowest points of the concavities. To run our baseline algorithm over the testing split of our grasping dataset:

Navigate to arc-robot-vision/parallel-jaw-grasping/baseline
```
cd arc-robot-vision/parallel-jaw-grasping/baseline
```
Run the following in Matlab:
```
test; % creates results.mat
evaluate;
```
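The scoring rule for a single candidate grasp can be sketched on a 1D height profile taken along the gripper closing direction. This is a simplified illustration in plain Python, not the Matlab baseline. The function name, the finger width measured in samples, and the choice of the lower of the two finger gaps as "the lowest points of the concavities" are assumptions.

```python
MIN_HILL_HEIGHT_M = 0.02  # hill top must be at least 2 cm above the concavities


def grasp_affordance(profile, left, right, finger_width=1):
    """Score a candidate grasp on a 1D height profile (heights in meters).

    profile: heights sampled along the gripper closing direction.
    left, right: start indices of the left and right finger gaps.
    Returns a score in [0, 1], or 0.0 if the hill constraint fails.
    """
    left_gap = profile[left:left + finger_width]
    right_gap = profile[right:right + finger_width]
    between = profile[left + finger_width:right]
    if not left_gap or not right_gap or not between:
        return 0.0
    # Lowest point of the concavities the fingers would descend into.
    floor = min(min(left_gap), min(right_gap))
    # Constraint (2): the hill must rise at least 2 cm above that floor.
    if max(between) - floor < MIN_HILL_HEIGHT_M:
        return 0.0
    # Score: fraction of points between the fingers above the floor.
    above = sum(1 for h in between if h > floor)
    return above / len(between)


print(grasp_affordance([0.0, 0.05, 0.06, 0.05, 0.0], left=0, right=4))  # 1.0
print(grasp_affordance([0.0, 0.01, 0.01, 0.0], left=0, right=3))        # 0.0
```

A brute-force search would evaluate a function like this for every window position and orientation and keep the highest-scoring candidates.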

Image Matching

A Torch implementation of two-stream convolutional neural networks for matching observed images of grasped objects to their product images for recognition. One stream computes 2048-dimensional feature vectors for product images while the other stream computes 2048-dimensional feature vectors for observed images. During training, both streams are optimized so that features are more similar for images of the same object and dissimilar otherwise. During testing, product images of both known and novel objects are mapped onto a common feature space. We recognize observed images by mapping them to the same feature space and finding the nearest neighbor product image match.
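The recognition step (nearest-neighbor lookup in the shared feature space) can be sketched as follows. This is a toy illustration in plain Python, not the evaluation code in this repository. The feature vectors are 3-dimensional stand-ins for the 2048-dimensional features, the product labels are made up, and the use of Euclidean distance is an assumption.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(observed_feature, product_features):
    """Return the label of the product whose feature vector is nearest.

    product_features: dict mapping a product label to its feature vector.
    """
    return min(
        product_features,
        key=lambda label: l2_distance(observed_feature, product_features[label]),
    )


# Toy features; real ones would come from the two network streams.
products = {
    "mug": [1.0, 0.0, 0.0],
    "tape": [0.0, 1.0, 0.0],
}
print(recognize([0.9, 0.1, 0.0], products))  # mug
```

Because a novel object only needs its product image mapped into the feature space, adding it amounts to adding one more entry to `products`, with no retraining.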

Training

To train a model:

Navigate to arc-robot-vision/image-matching
```
cd arc-robot-vision/image-matching
```
Download our image matching dataset and save the files into arc-robot-vision/image-matching/data. More information about the dataset can be found on our project webpage.
```
wget http://vision.princeton.edu/projects/2017/arc/downloads/image-matching-dataset.zip
unzip image-matching-dataset.zip # unzip dataset
```
Direct download link: image-matching-dataset.zip (4.6 GB)
Download the Torch ResNet-50 model pre-trained on ImageNet:
```
wget http://vision.princeton.edu/projects/2017/arc/downloads/resnet-50.t7
```
Direct download link: resnet-50.t7 (256.7 MB)
Run training (change variable trainMode in train.lua depending on which architecture you want to train):
```
th train.lua
```

Evaluation

To evaluate a trained model:

Navigate to arc-robot-vision/image-matching
```
cd arc-robot-vision/image-matching
```

Download our pre-trained models (K-net and N-net) for two-stage cross-domain image matching:

wget http://vision.princeton.edu/projects/2017/arc/downloads/k-net.zip
unzip k-net.zip 
wget http://vision.princeton.edu/projects/2017/arc/downloads/n-net.zip
unzip n-net.zip

Direct download links: k-net.zip (175.3 MB) and n-net.zip (174.0 MB)

Run our pre-trained models to compute features for the testing split of our image matching dataset (change variable trainMode depending on which architecture you want to test):

trainMode=1 snapshotsFolder=snapshots-with-class snapshotName=snapshot-170000 th test.lua # for k-net: creates HDF5 output file and saves into snapshots folder
trainMode=2 snapshotsFolder=snapshots-no-class snapshotName=snapshot-8000 th test.lua # for n-net: creates HDF5 output file and saves into snapshots folder

Run the evaluation script in Matlab to compute 1 vs 20 object recognition accuracies over our image matching dataset, as reported in our paper:
```
evaluateTwoStage;
```
or run the following in Matlab for evaluation on a single model (instead of a two stage system):
```
evaluateModel;
```
