As stated in the challenge overview, we provide two data sources: one is the simulator Ground Truth and the other is a realistic Perception Pipeline.
- Ground Truth: The agent observes ground truth (i.e., error free) information that is provided directly from the simulator.
- Perception Pipeline: The agent observes output of Kimera, which is an open-source C++ library for real-time metric-semantic visual-inertial Simultaneous Localization And Mapping (SLAM). Note that the types (and dimensions) of observations provided are the same as before; however, the error characteristics are now representative of a real perception system.
This page gives an overview of Data Conventions and the Perception Pipeline Algorithms.
The agent observes several data products: monocular RGB, semantic segmentation, depth, and agent pose. Conventions related to these data products are as follows.
See the details page for a list of the semantic segmentation classes and corresponding RGB values. Note, the class index corresponds to the value of the first color channel.
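As a minimal sketch of the first-channel convention above, the class index map can be read straight from the first color channel. This assumes the segmentation image is an `(H, W, 3)` array of 8-bit integer color values; if your observations are scaled to `[0, 1]`, they would need to be rescaled first. The helper name `segmentation_to_class_ids` is hypothetical and not part of `tesse_gym`.

```python
import numpy as np


def segmentation_to_class_ids(segmentation_rgb: np.ndarray) -> np.ndarray:
    """Return an (H, W) array of class indices from an (H, W, C) color image.

    Assumption: color values are integers in [0, 255], so the first
    channel can be used directly as the class index.
    """
    if segmentation_rgb.ndim != 3 or segmentation_rgb.shape[-1] < 1:
        raise ValueError("expected an (H, W, C) color image")
    return segmentation_rgb[..., 0].astype(np.int64)


# Toy 2x2 image: only the first channel carries the class index.
example = np.array(
    [[[0, 10, 20], [3, 0, 0]],
     [[10, 5, 5], [3, 9, 9]]],
    dtype=np.uint8,
)
class_ids = segmentation_to_class_ids(example)
```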
Depth is rendered within a specified range, known as clip planes, and then mapped to the range `[0, 1]`. The simulator uses minimum and maximum clip planes of `0.05` and `50` meters, so to recover depth in meters, multiply the provided image by `50`. Note that depth beyond 50 meters is truncated. In the evaluation interface this would look like:
import numpy as np
from tesse_gym.tasks.goseek import decode_observations


class Agent:
    """ Interface for submitting an agent for evaluation. """

    def act(self, observation: np.ndarray) -> int:
        far_clip_plane = 50  # maximum clip plane, in meters
        rgb, segmentation, depth, pose = decode_observations(observation)
        depth *= far_clip_plane  # convert depth to meters
        ...
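Because depth beyond the far clip plane is truncated, a pixel at the maximum value only says that the surface is at least 50 meters away. The sketch below converts a normalized depth image to meters and flags those saturated pixels. The helper name `depth_to_meters` is hypothetical, and treating a normalized value of exactly `1.0` as saturated is an assumption rather than documented simulator behavior.

```python
import numpy as np

FAR_CLIP_PLANE_M = 50.0  # maximum clip plane stated above, in meters


def depth_to_meters(depth_normalized, far_clip=FAR_CLIP_PLANE_M):
    """Scale a [0, 1] depth image to meters and flag truncated pixels."""
    depth_normalized = np.asarray(depth_normalized, dtype=np.float64)
    depth_m = depth_normalized * far_clip
    # Assumption: pixels at the top of the range were clipped at the far plane.
    saturated = depth_normalized >= 1.0
    return depth_m, saturated


depth_m, saturated = depth_to_meters(np.array([[0.1, 0.5], [1.0, 0.02]]))
```

Downstream code can then ignore or down-weight the pixels where `saturated` is true instead of treating them as measurements at exactly 50 meters.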
We use the left-handed coordinate system native to Unity (see Unity documentation). For world coordinates, the x- and z-axes are aligned with the horizontal plane, and the y-axis is aligned with up. Pose is given as the vector `(x, z, yaw)`, where the z-axis is aligned with forward, the x-axis is positive to the right, and yaw is measured about the positive (up) y-axis.
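To make the pose convention concrete, here is a small sketch that expresses a world-frame point in the agent's frame. It rests on assumptions the text above does not state: yaw is in radians, zero yaw faces the +z axis, and positive yaw turns toward +x (clockwise when viewed from above, as is usual in Unity). Check these against the simulator before relying on them. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def heading_vector(yaw: float) -> Tuple[float, float]:
    """Unit (x, z) vector along the agent's forward direction.

    Assumption: yaw in radians, yaw = 0 faces +z, positive yaw turns toward +x.
    """
    return math.sin(yaw), math.cos(yaw)


def world_to_agent(pose: Sequence[float], point_xz: Sequence[float]) -> Tuple[float, float]:
    """Return (right, forward) offsets of a world point relative to the agent."""
    x, z, yaw = pose
    dx, dz = point_xz[0] - x, point_xz[1] - z
    fx, fz = heading_vector(yaw)
    forward = dx * fx + dz * fz  # projection onto the forward axis
    right = dx * fz - dz * fx    # projection onto the right axis (fz, -fx)
    return right, forward


# Agent at (1, 2) facing +z; a point 3 m straight ahead.
right_a, forward_a = world_to_agent((1.0, 2.0, 0.0), (1.0, 5.0))
# Agent at the origin turned by pi/2, so forward is +x under the assumptions.
right_b, forward_b = world_to_agent((0.0, 0.0, math.pi / 2), (2.0, 0.0))
```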
During evaluation, participants will be provided with realistic perception data from Kimera, an open-source C++ library for real-time metric-semantic visual-inertial Simultaneous Localization And Mapping (SLAM). Realistic perception estimates are obtained by passing ground truth simulator data through this pipeline. Thus, the types and dimensions of observations will remain the same; however, the error characteristics are now representative of a real perception system.
Please note that running the perception pipeline requires significantly more computation than the ground truth pipeline, so episodes will run several times slower. The simulator runs in a continuous dynamics mode (compared to discrete dynamics in the ground truth pipeline) and outputs imagery at a higher rate. This higher-rate data is needed only for estimating pose, so agent policies will still receive data at the same rate as before. In addition, several perception algorithms are now running, as described below.
We recommend thinking carefully about how you use this pipeline. It may be less feasible to generate data with it for policy training, for example.
A U-Net provides segmentation estimates for the 11 GOSEEK semantic classes. The segmentation-models.pytorch project was used to train the model on data collected from scenes 1-4 of the GOSEEK simulator. Scene 5 was used to collect a validation set on which the model achieves an Intersection-over-Union (IoU) score of roughly 0.8. The model was then exported to an ONNX file, provided in this release, for inference in TensorRT.
Training details can be found in this notebook. The inference framework is implemented here.
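For readers unfamiliar with the metric, the following is a minimal sketch of a mean Intersection-over-Union computation over integer class maps. The source does not say how the reported score was averaged, so this shows one common definition (mean over the classes present in either map), not necessarily the one used for the figure above. The function name `mean_iou` is hypothetical.

```python
import numpy as np

NUM_CLASSES = 11  # number of GOSEEK semantic classes stated above


def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int = NUM_CLASSES) -> float:
    """Mean IoU over classes that appear in the prediction or the target."""
    ious = []
    for class_id in range(num_classes):
        pred_mask = pred == class_id
        target_mask = target == class_id
        union = np.logical_or(pred_mask, target_mask).sum()
        if union == 0:
            continue  # class absent from both maps; skip it
        intersection = np.logical_and(pred_mask, target_mask).sum()
        ious.append(intersection / union)
    return float(np.mean(ious)) if ious else float("nan")


target = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target)  # class 0: 1/2, class 1: 2/3
```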
Depth is estimated via stereo reconstruction. Using a stereo image pair from the simulator, we use the ROS `stereo_image_proc` node to generate a point cloud, which is then projected into the camera plane to produce a depth image.
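The projection step can be illustrated with a pinhole camera model. This is a sketch under stated assumptions, not the code used in the challenge pipeline: points are assumed to be in the camera optical frame (x right, y down, z forward, in meters), the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical parameters, and pixels that receive no point are left as `inf`.

```python
import numpy as np


def project_points_to_depth(points, fx, fy, cx, cy, height, width):
    """Project (N, 3) camera-frame points into an (height, width) depth image."""
    depth = np.full((height, width), np.inf)
    pts = np.asarray(points, dtype=np.float64).reshape(-1, 3)
    # Drop invalid points and points behind the camera.
    valid = np.isfinite(pts).all(axis=1) & (pts[:, 2] > 0)
    pts = pts[valid]
    # Pinhole projection to integer pixel coordinates.
    u = np.round(fx * pts[:, 0] / pts[:, 2] + cx).astype(int)
    v = np.round(fy * pts[:, 1] / pts[:, 2] + cy).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    # Keep the nearest point when several land on the same pixel.
    np.minimum.at(depth, (v[inside], u[inside]), pts[inside, 2])
    return depth


cloud = np.array([
    [0.0, 0.0, 2.0],   # projects to the image center
    [0.0, 0.0, 1.0],   # same pixel, but closer
    [1.0, 0.0, 2.0],   # one pixel to the right of center
    [0.0, 0.0, -1.0],  # behind the camera, discarded
])
depth_image = project_points_to_depth(cloud, fx=2.0, fy=2.0, cx=1.0, cy=1.0, height=3, width=3)
```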
Pose is provided by Kimera-VIO, a Visual Inertial Odometry pipeline for State Estimation from Stereo and IMU data.
Below is a comparison of the data provided in Ground Truth and Perception modes. Pose is illustrated over a 100-step trajectory, with the carets representing position and heading at every step.
| | Ground Truth | Perception |
|---|---|---|
| Monocular RGB | | |
| Semantic Segmentation | | |
| Depth | | |
| Pose | | |
DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited.
This material is based upon work supported by the Under Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Under Secretary of Defense for Research and Engineering.
(c) 2020 Massachusetts Institute of Technology.
MIT Proprietary, Subject to FAR52.227-11 Patent Rights - Ownership by the contractor (May 2014)
The software/firmware is provided to you on an As-Is basis
Delivered to the U.S. Government with Unlimited Rights, as defined in DFARS Part 252.227-7013 or 7014 (Feb 2014). Notwithstanding any copyright notice, U.S. Government rights in this work are defined by DFARS 252.227-7013 or DFARS 252.227-7014 as detailed above. Use of this work other than as specifically authorized by the U.S. Government may violate any copyrights that exist in this work.