VINS-Fusion-GPU-BA

This repository is a version of VINS-Fusion with a CUDA reimplementation of Bundle Adjustment.

Modifications are as follows:

void Estimator::optimization() {
    if(frame_count == WINDOW_SIZE) {
        optimization_with_cuda();  // solve and marginalize with cuda when the sliding window is full
    } else {
        optimization_with_ceres(); // solve with ceres when the sliding window is not yet full
    }
}

Dependencies

The essential software environment is the same as for VINS-Fusion; tested on Ubuntu 18.04 and ROS Melodic.

While the Bundle Adjustment in estimator.cpp is reimplemented with CUDA and Eigen, this repository still requires Ceres Solver for non-linear optimization in:

  • Bundle Adjustment in estimator.cpp before frame_count reaches WINDOW_SIZE
  • GlobalSFM::construct() in initial_sfm.cpp
  • PoseGraph::optimize4DoF() in pose_graph.cpp

Meanwhile, the CUDA reimplementation of Bundle Adjustment in estimator.cpp requires:

  • C++14

  • CUDA (>= 11.0)

  • CUBLAS

  • CUSOLVER

  • Eigen (>= 3.3.9)

How To Build

Before building this repo, some CMake variables in vins_estimator/src/cuda_bundle_adjustment/CMakeLists.txt need to be modified to fit your environment:

set(CMAKE_CUDA_COMPILER  /usr/local/cuda/bin/nvcc)   # set it to your path to nvcc
set(CUDA_TOOLKIT_ROOT_DIR  /usr/local/cuda)          # set it to your CUDA toolkit root directory
set(CMAKE_CUDA_ARCHITECTURES  52)    # for example, if your device's compute capability is 6.2, then set it to 62

If your device's compute capability is >= 6.0, you can simply change MyAtomicAdd() in vins_estimator/src/cuda_bundle_adjustment/cuda_kernel_funcs/device_utils.cu to the following:

template<typename T>
__device__ T MyAtomicAdd(T* address, T val) { return atomicAdd(address, val); }

Alternatively, you can replace MyAtomicAdd() with atomicAdd() wherever MyAtomicAdd() is called.
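
For context, double-precision atomicAdd() is only available in hardware on devices with compute capability >= 6.0; on older devices it is typically emulated with an atomicCAS() loop. The following is a sketch of the standard workaround from the CUDA C++ Programming Guide (the repo's actual MyAtomicAdd() may differ in detail):

// Standard emulation of atomicAdd() on double for compute capability < 6.0,
// using a compare-and-swap loop (adapted from the CUDA C++ Programming Guide).
__device__ double AtomicAddDouble(double* address, double val) {
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);  // retry if another thread changed the value meanwhile
    return __longlong_as_double(old);
}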

The basic steps to compile and run this repo are the same as for VINS-Fusion.

Speed-up

Each cell shows the value without RVIZ / with RVIZ.

| Sequence | CPU (Intel i7-6700K) iterations | CPU solve | CPU marginalization | GPU (Nvidia 980 Ti) solve (at least 10 iterations) & marginalization |
| --- | --- | --- | --- | --- |
| MH_01_easy (WINDOW_SIZE == 10, max feature count == 150) | 5.73 / 5.73 | 61.17 ms / 74.74 ms | 12.25 ms / 21.21 ms | 29.38 ms / 37.39 ms |
| MH_05_difficult (WINDOW_SIZE == 10, max feature count == 150) | 6.46 / 6.46 | 63.17 ms / 74.36 ms | 9.73 ms / 17.71 ms | 27.23 ms / 35.49 ms |
| 2011_10_03_drive_0027_sync (WINDOW_SIZE == 10, max feature count == 200) | 4.71 / 4.70 | 18.79 ms / 19.81 ms | 6.07 ms / 7.18 ms | 19.98 ms / 22.27 ms |
| MH_01_easy (WINDOW_SIZE == 20, max feature count == 300) | 7.13 / 7.13 | 126.09 ms / 163.89 ms | 20.47 ms / 28.88 ms | 51.56 ms / 65.35 ms |
| MH_05_difficult (WINDOW_SIZE == 20, max feature count == 300) | 6.61 / 6.61 | 103.50 ms / 137.44 ms | 15.66 ms / 23.88 ms | 45.66 ms / 58.60 ms |
| 2011_10_03_drive_0027_sync (WINDOW_SIZE == 20, max feature count == 400) | 4.94 / 4.93 | 44.14 ms / 46.91 ms | 9.22 ms / 11.99 ms | 34.18 ms / 38.40 ms |

Since the theoretical FP64 performance of the Nvidia 980 Ti (compute capability 5.2) is only 189.4 GFLOPS, and atomicAdd() for FP64 is not available on devices with compute capability lower than 6.0, a larger speed-up can plausibly be expected on more recent hardware.

P.S. RVIZ appears to slow this reimplementation down considerably.

Precision

MH_01_easy (WINDOW_SIZE == 10, graphs generated by evo):

[evo plots]

KITTI 2011_10_03_drive_0027_sync:

  • WINDOW_SIZE is 10
  • green path is generated by VO (estimator.cpp)
  • blue path is generated by fusing VO with GPS (globalOpt.cpp)

[trajectory plot]

Some Implementation Details

Levenberg-Marquardt is used to solve for the delta, with at least 10 iterations.

Since the bottom-right block of the big Hessian matrix (which corresponds to the inverse depths) is diagonal, a Schur complement trick is used to solve for the system states before solving for the inverse depths.
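
As a rough illustration of this step, the reduced system can be built and solved as in the following Eigen sketch (names and the block layout are illustrative assumptions, not the repo's actual code; LM damping is assumed to be already folded into the diagonal blocks):

#include <Eigen/Dense>

// Solve [ H_pp   H_pm ] [dx_p]   [b_p]
//       [ H_pm^T H_mm ] [dx_m] = [b_m]   with H_mm diagonal (inverse depths),
// by eliminating dx_m via the Schur complement, then back-substituting.
void SchurSolve(const Eigen::MatrixXd& H_pp, const Eigen::MatrixXd& H_pm,
                const Eigen::VectorXd& H_mm_diag,  // diagonal of H_mm
                const Eigen::VectorXd& b_p, const Eigen::VectorXd& b_m,
                Eigen::VectorXd& dx_p, Eigen::VectorXd& dx_m) {
    const Eigen::VectorXd H_mm_inv = H_mm_diag.cwiseInverse();
    // Reduced (Schur complement) system over the camera/IMU states only.
    const Eigen::MatrixXd H_red = H_pp - H_pm * H_mm_inv.asDiagonal() * H_pm.transpose();
    const Eigen::VectorXd b_red = b_p - H_pm * H_mm_inv.cwiseProduct(b_m);
    dx_p = H_red.ldlt().solve(b_red);
    // Back-substitute for the inverse depths.
    dx_m = H_mm_inv.cwiseProduct(b_m - H_pm.transpose() * dx_p);
}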

All Jacobians, residuals, robust info matrices, Hessian blocks, and rhs blocks are computed on the GPU.

No explicit big Jacobian matrix or big residual vector is formed.

The big Hessian matrix and rhs vector are formed as follows (see the sketch after this list):

  • Form tiny Hessian blocks and rhs blocks from the tiny Jacobian blocks and residual blocks inside kernel functions.
  • Add the tiny Hessian blocks and rhs blocks into the big Hessian and rhs by calling atomicAdd() inside kernel functions.
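
As a hedged CUDA sketch of the second step (not the repo's actual kernels; the 6x6 block size, row-major storage, and names are assumptions for illustration), each thread can scatter-add its local Hessian block into the big matrix like this:

// Accumulate a per-residual 6x6 Hessian block H_local (row-major) into the
// big Hessian H_big (row-major, leading dimension ld) at block offset
// (row0, col0). atomicAdd() makes concurrent additions from different
// residuals into the same destination block safe.
__device__ void AccumulateBlock6x6(double* H_big, int ld, int row0, int col0,
                                   const double* H_local) {
    for (int i = 0; i < 6; ++i) {
        for (int j = 0; j < 6; ++j) {
            atomicAdd(&H_big[(row0 + i) * ld + (col0 + j)],
                      H_local[i * 6 + j]);  // MyAtomicAdd() on pre-6.0 GPUs
        }
    }
}

The rhs blocks can be accumulated into the big rhs vector in the same way.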

In the original implementation with Ceres, an eigendecomposition is performed during marginalization to form a MarginalizationFactor for the next frame. This is because ceres-1.14 has to build Hessians from Jacobians rather than accepting pre-formed Hessians through its API, and ceres-1.14 also re-evaluates the error of the MarginalizationFactor at new linearization points as part of its iteration strategy. In this reimplementation, Ceres is not used and the error of the MarginalizationFactor is ignored (the new linearization points are still used to update the Hessian prior and rhs prior), so the eigendecomposition of the Hessian prior (which usually takes about 5~6 ms on my Nvidia 980 Ti GPU) is not needed during marginalization.
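
The prior update at a new linearization point can be sketched as follows (an Eigen sketch under an assumed convention, not the repo's actual code: the prior cost is written as 0.5*dx^T*H_prior*dx - b_prior^T*dx with dx measured from the old linearization point x0, and the solver works with H*dx = b):

#include <Eigen/Dense>

// Shift the marginalization prior (H_prior, b_prior), linearized at x0, to a
// new linearization point x1. To first order the Hessian prior is unchanged;
// only the rhs prior moves.
void UpdatePrior(const Eigen::VectorXd& x0, const Eigen::VectorXd& x1,
                 const Eigen::MatrixXd& H_prior, Eigen::VectorXd& b_prior) {
    // In the real estimator this difference is a manifold "boxminus"
    // (e.g. for quaternions); plain subtraction keeps the sketch short.
    const Eigen::VectorXd dx = x1 - x0;
    b_prior -= H_prior * dx;  // H_prior stays the same
}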

Limitations

Currently, only scenarios where all of the following conditions are satisfied are supported:

  • STEREO == true
  • ESTIMATE_TD == false
  • ESTIMATE_EXTRINSIC == false

The data type must be FP64. An FP32 version was also implemented; however, it drifts away.

Acknowledgements

This repository is based upon VINS-Fusion. Also, many thanks to VINS-Course for its step-by-step demonstrations of how Bundle Adjustment works.

Star History

Star History Chart