This repository is a version of VINS-Fusion with a CUDA reimplementation of Bundle Adjustment.
The modifications are as follows:
- The code for the Bundle Adjustment reimplementation is in vins_estimator/src/cuda_bundle_adjustment.
- Estimator::optimization() in estimator.cpp is rewritten as follows:
```cpp
void Estimator::optimization() {
    if (frame_count == WINDOW_SIZE) {
        optimization_with_cuda();  // solve and marginalize with CUDA when the sliding window is full
    } else {
        optimization_with_ceres(); // solve with Ceres while the sliding window is not yet full
    }
}
```
The essential software environment is the same as VINS-Fusion's; this repository has been tested on Ubuntu 18.04 & ROS Melodic.
While the Bundle Adjustment in estimator.cpp is reimplemented with CUDA and Eigen, this repository still requires Ceres Solver for non-linear optimization in:
- the Bundle Adjustment in estimator.cpp before frame_count reaches WINDOW_SIZE
- GlobalSFM::construct() in initial_sfm.cpp
- PoseGraph::optimize4DoF() in pose_graph.cpp
Meanwhile, the CUDA reimplementation of Bundle Adjustment in estimator.cpp requires CUDA and an Nvidia GPU.
Before building this repo, some CMake variables in vins_estimator/src/cuda_bundle_adjustment/CMakeLists.txt need to be modified to fit your environment:
```cmake
set(CMAKE_CUDA_COMPILER /usr/local/cuda/bin/nvcc) # set it to your path to nvcc
set(CUDA_TOOLKIT_ROOT_DIR /usr/local/cuda)        # set it to the root of your CUDA toolkit
set(CMAKE_CUDA_ARCHITECTURES 52)                  # e.g. if your device's compute capability is 6.2, set it to 62
```
If your device's compute capability is >= 6.0, you can simply change MyAtomicAdd() in vins_estimator/src/cuda_bundle_adjustment/cuda_kernel_funcs/device_utils.cu into the following:
```cuda
template<typename T>
__device__ T MyAtomicAdd(T* address, T val) {
    return atomicAdd(address, val); // native FP64 atomicAdd requires compute capability >= 6.0
}
```
Alternatively, you can replace MyAtomicAdd() with atomicAdd() wherever MyAtomicAdd() is called.
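On devices below compute capability 6.0, the usual workaround for the missing FP64 atomicAdd() (and presumably what MyAtomicAdd() does) is a compare-and-swap loop on the 64-bit bit pattern of the double. Below is a CPU sketch of that pattern using std::atomic; `my_atomic_add` is an illustrative name, not the repo's function, and on the GPU the same loop is written with atomicCAS() and the `__double_as_longlong`-style reinterpretation intrinsics.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

// CPU sketch of the CAS-loop emulation of a double-precision atomic add:
// reinterpret the double as a 64-bit integer and retry the compare-and-swap
// until no other thread has modified the value in between.
double my_atomic_add(std::atomic<std::uint64_t>* address, double val) {
    std::uint64_t old_bits = address->load();
    std::uint64_t new_bits;
    double old_val;
    do {
        std::memcpy(&old_val, &old_bits, sizeof(double));
        double new_val = old_val + val;
        std::memcpy(&new_bits, &new_val, sizeof(double));
        // compare_exchange_weak reloads old_bits on failure, so the loop retries
    } while (!address->compare_exchange_weak(old_bits, new_bits));
    return old_val; // like atomicAdd(), return the previous value
}
```

This is lock-free but can retry under contention, which is one reason the native FP64 atomicAdd() on compute capability >= 6.0 is preferable.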
The basic steps to compile and run this repo are the same as for VINS-Fusion.
Each cell below shows the "no RVIZ / with RVIZ" values.

Sequence | Iterations | CPU (Intel i7-6700K): solve | CPU: marginalization | GPU (Nvidia 980 Ti): solve (at least 10 iterations) & marginalization
---|---|---|---|---
MH_01_easy, WINDOW_SIZE == 10, max feature count == 150 | 5.73 / 5.73 | 61.17 ms / 74.74 ms | 12.25 ms / 21.21 ms | 29.38 ms / 37.39 ms
MH_05_difficult, WINDOW_SIZE == 10, max feature count == 150 | 6.46 / 6.46 | 63.17 ms / 74.36 ms | 9.73 ms / 17.71 ms | 27.23 ms / 35.49 ms
2011_10_03_drive_0027_sync, WINDOW_SIZE == 10, max feature count == 200 | 4.71 / 4.70 | 18.79 ms / 19.81 ms | 6.07 ms / 7.18 ms | 19.98 ms / 22.27 ms
MH_01_easy, WINDOW_SIZE == 20, max feature count == 300 | 7.13 / 7.13 | 126.09 ms / 163.89 ms | 20.47 ms / 28.88 ms | 51.56 ms / 65.35 ms
MH_05_difficult, WINDOW_SIZE == 20, max feature count == 300 | 6.61 / 6.61 | 103.50 ms / 137.44 ms | 15.66 ms / 23.88 ms | 45.66 ms / 58.60 ms
2011_10_03_drive_0027_sync, WINDOW_SIZE == 20, max feature count == 400 | 4.94 / 4.93 | 44.14 ms / 46.91 ms | 9.22 ms / 11.99 ms | 34.18 ms / 38.40 ms
Since the theoretical FP64 performance of the Nvidia 980 Ti GPU (compute capability 5.2) is only 189.4 GFLOPS, and atomicAdd() for FP64 is not available on devices with compute capability lower than 6.0, it is plausible to expect a better speed-up on more recent hardware.
P.S. RVIZ seems to slow down this reimplementation considerably.
MH_01_easy (WINDOW_SIZE == 10; graphs generated by evo):
KITTI 2011_10_03_drive_0027_sync:
- WINDOW_SIZE is 10
- the green path is generated by the VO (estimator.cpp)
- the blue path is generated by fusing the VO with GPS (globalOpt.cpp)
Levenberg-Marquardt is used to solve for the delta, with at least 10 iterations.
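The damped Levenberg-Marquardt update can be sketched on a 1-D toy residual; this is illustrative only (the repo solves a large sparse BA system, and `lm_solve`, the residual, and the damping schedule here are all made up for the sketch):

```cpp
#include <cassert>
#include <cmath>

// Minimal Levenberg-Marquardt sketch on the toy residual r(x) = x^2 - 2.
// Each step solves the damped normal equation (J^T J + mu) * delta = -J^T r,
// accepts the step only if it lowers the cost, and adapts the damping mu.
double lm_solve(double x, int min_iters = 10) {
    double mu = 1e-3; // damping factor
    for (int i = 0; i < 100; ++i) {
        double r = x * x - 2.0;      // residual
        double J = 2.0 * x;          // Jacobian of r w.r.t. x
        double H = J * J + mu;       // damped Gauss-Newton Hessian
        double delta = -(J * r) / H; // solve H * delta = -J^T r
        double r_new = (x + delta) * (x + delta) - 2.0;
        if (r_new * r_new < r * r) { // cost decreased: accept, relax damping
            x += delta;
            mu *= 0.5;
        } else {                     // cost increased: reject, raise damping
            mu *= 2.0;
        }
        if (i >= min_iters && std::fabs(delta) < 1e-12) break;
    }
    return x;
}
```

Starting from x = 1, this converges to sqrt(2), the root of the residual.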
Since the bottom-right part of the big Hessian matrix (which corresponds to the inverse depths) is diagonal, a Schur-complement trick is used to solve for the system states before solving for the inverse depths.
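The Schur-complement elimination can be illustrated on a tiny block system [A B; Bᵀ D][x; d] = [b1; b2] with a diagonal (here scalar) inverse-depth block D. All sizes and values are made up for the sketch; the real system has many pose/speed-bias states and inverse depths:

```cpp
#include <cassert>
#include <cmath>

struct Result { double x0, x1, depth; };

// Eliminate the diagonal inverse-depth block D first, solve the reduced
// system for the 2-DoF state x, then back-substitute for the inverse depth.
Result schur_solve(const double A[2][2], const double B[2], double D,
                   const double b1[2], double b2) {
    // Schur complement S = A - B * D^{-1} * B^T and reduced RHS r = b1 - B * D^{-1} * b2
    double S00 = A[0][0] - B[0] * B[0] / D;
    double S01 = A[0][1] - B[0] * B[1] / D;
    double S10 = A[1][0] - B[1] * B[0] / D;
    double S11 = A[1][1] - B[1] * B[1] / D;
    double r0 = b1[0] - B[0] * b2 / D;
    double r1 = b1[1] - B[1] * b2 / D;
    // Solve the small dense system S * x = r (Cramer's rule)
    double det = S00 * S11 - S01 * S10;
    double x0 = (r0 * S11 - S01 * r1) / det;
    double x1 = (S00 * r1 - r0 * S10) / det;
    // Back-substitute: D * depth = b2 - B^T * x
    double depth = (b2 - B[0] * x0 - B[1] * x1) / D;
    return {x0, x1, depth};
}
```

Because D is diagonal, D⁻¹ is trivial to apply, so the expensive solve happens only on the (much smaller) state block.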
All Jacobians, residuals, robust information matrices, Hessian blocks, and RHS blocks are computed on the GPU.
No explicit big Jacobian matrix or big residual vector is formed.
The big Hessian matrix and RHS vector are formed as follows:
- Tiny Jacobian blocks and residual blocks are multiplied to form tiny Hessian blocks and RHS blocks inside the kernel functions.
- The tiny Hessian and RHS blocks are added into the big Hessian and RHS by calling atomicAdd() inside the kernel functions.
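The two accumulation steps above can be sketched on the CPU. Here a serial loop stands in for the kernels, and `scatter_residual` and the tiny sizes (one scalar residual touching two parameter blocks of size 2) are illustrative, not the repo's actual dimensions:

```cpp
#include <cassert>

// CPU sketch of the scatter-add: each residual forms its tiny J^T*J and
// -J^T*r blocks and adds them into the big Hessian/RHS at the owning
// parameter blocks' offsets. On the GPU every "+=" below is an atomicAdd(),
// since many residuals touch the same blocks concurrently.
constexpr int kDim = 4; // big system: two parameter blocks of size 2

void scatter_residual(double H[kDim][kDim], double rhs[kDim],
                      const double Ji[2], const double Jj[2], // 1x2 Jacobians
                      double r,                               // scalar residual
                      int oi, int oj) {                       // block offsets
    const double* J[2] = {Ji, Jj};
    const int off[2] = {oi, oj};
    for (int a = 0; a < 2; ++a) {
        for (int b = 0; b < 2; ++b) {
            // tiny Hessian block H_ab = J_a^T * J_b, accumulated in place
            for (int u = 0; u < 2; ++u)
                for (int v = 0; v < 2; ++v)
                    H[off[a] + u][off[b] + v] += J[a][u] * J[b][v]; // atomicAdd on GPU
        }
        // tiny RHS block rhs_a = -J_a^T * r
        for (int u = 0; u < 2; ++u)
            rhs[off[a] + u] += -J[a][u] * r; // atomicAdd on GPU
    }
}
```

This is why no explicit big Jacobian or residual vector is ever needed: each residual's contribution goes straight into the big Hessian and RHS.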
In the original implementation with Ceres, an eigen-decomposition is done during marginalization to form a MarginalizationFactor for the next frame. This is because ceres-1.14 has to build Hessians from Jacobians instead of accepting pre-formed Hessians through its API. Ceres-1.14 also evaluates the error of the MarginalizationFactor at new linearization points as part of its iteration strategy. However, this reimplementation does not use Ceres and ignores the error of the MarginalizationFactor (new linearization points are still used to update the Hessian prior and RHS prior), so the eigen-decomposition of the Hessian prior (which usually takes about 5~6 ms on my Nvidia 980 Ti GPU) is not needed during marginalization.
Currently, only scenarios where all of the following conditions are satisfied are supported:
- STEREO == true
- ESTIMATE_TD == false
- ESTIMATE_EXTRINSIC == false
The data type must be FP64. An FP32 version was also implemented; however, it drifts away.
This repository is based upon VINS-Fusion. Many thanks also to VINS-Course for its step-by-step demonstrations of how Bundle Adjustment works.