@iStarLee 2019-08-01T11:30:16.000000Z

FMD Stereo SLAM

SLAM-VisualSLAM

FMD Stereo SLAM
1 Paper Reference
2 Introduction
- 2.1 SLAM Taxonomy
- 2.2 Review of Existing SLAM Methods
3 The Proposed Stereo SLAM Framework
4 Tracking
5 Mapping

1 Paper Reference

Title: FMD Stereo SLAM: Fusing MVG and Direct Formulation Towards Accurate and Fast Stereo SLAM
Author: Fulin Tang, Heping Li, Yihong Wu∗
Publication: ICRA 2019

2 Introduction

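2.1 SLAM Taxonomy

Learning SLAM
Relies on labeled data and generalizes poorly.
Geometric SLAM (keyframe-based approaches are the most popular)
- Feature-based SLAM
  Tracks key corners and key edges and applies MVG (Multiple View Geometry) theory, including epipolar geometry, triangulation, SfM, BA, and PnP, to build error terms whose optimization yields the camera pose and 3D points. The accuracy of the optimization depends on the quality of the feature matching.
- Direct method SLAM
  Uses high-gradient pixels across the whole image and performs better in poorly textured environments. Its running speed is limited by photometric error optimization, and the initial value matters because it affects how fast the optimization converges.

For geometric SLAM, the two approaches above each have strengths and weaknesses, and existing SLAM systems rarely do well on both speed and accuracy. This paper fuses the MVG and direct formulations to achieve high accuracy and high speed at the same time.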

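2.2 Review of Existing SLAM Methods

SVO (Semi-direct Visual Odometry)
Uses key features with a direct formulation and runs at 300 fps on a consumer PC. SVO estimates depth with a depth filter, which does not converge well to the true depth because of its noisy initialization, and this hurts localization accuracy.
ORB SLAM
Feature-based. Builds its optimization on MVG theory, detects FAST corners, and extracts ORB descriptors. Runs at 20 fps and is more accurate than SVO.
DSO (Direct Sparse Odometry)
Combines BA with sparse high-gradient pixels to reach good accuracy, at a speed close to that of ORB SLAM.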

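3 The Proposed Stereo SLAM Framework

front-end
Focuses on speed. The system minimizes the photometric error and uses a constant motion model to predict the initial pose, reprojects the local map onto the current frame to find 3D-2D correspondences, and finally refines the pose by minimizing the reprojection error.
back-end
Focuses on accuracy. BA maintains the global map, and a stereo constraint is used to optimize the global map and further improve accuracy.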

[Figure: image_1dh5d3ha3eo014kl1ii01li03ch9.png]

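The system consists of two threads, Tracking and Mapping.
The Tracking thread estimates the camera pose.

Step 1: estimate the initial pose of the current frame with sparse model-based image alignment; if that fails, fall back to the constant motion model.
Step 2: reproject the local map onto the current frame to find 3D-2D correspondences.
Step 3: refine the camera pose of the current frame by minimizing the reprojection error.
Step 4: decide whether the current frame is a keyframe. If it is, use stereo matching to obtain depth values and insert the keyframe into the Mapping thread.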

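The Mapping thread uses MVG to estimate the 3D structure.

Step 1: find keyframes near the current keyframe.
Step 2: use feature matching to obtain 2D-2D correspondences between keyframes.
Step 3: generate new map points by triangulation.
Step 4: optimize the local map, including the map points and the keyframe poses.

After a period, global BA refines the poses and the reconstruction globally.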

4 Tracking

4.1 Initial Pose: predict current frame camera pose ${T}_{k, w}$

[Figure: sparse model-based image alignment, image_1dh67ndvd3ck133u72e191al23m.png]
The figure illustrates sparse model-based image alignment. The variables in it are defined as follows

$\begin{aligned}{_{c} \mathbf{X}} &=\mathbf{T}_{k, w} *{_{w} \mathbf{X}} \\ \mathbf{x} &=\pi\left(_{c} \mathbf{X}\right) \end{aligned}$

$\mathbf{T}_{k, w}$ is in effect $\mathbf{T}_{cw}$
$_{c} \mathbf{X}$ is a 3D point in the camera frame
$\mathbf{x}$ is the 2D pixel coordinate of a feature in the image frame

The initial pose of the current frame is obtained from

$\mathbf{T}_{k, w} = \mathbf{T}_{k, k-1} * \mathbf{T}_{k-1, w}$

The inter-frame transform $\mathbf{T}_{k, k-1}$ is found by minimizing the photometric error between the two frames

$\delta I(\mathbf{T}_{k,k-1}, \mathbf{x})=I_{k}\left(\pi\left(\mathbf{T}_{k,k-1} *{_{c} \mathbf{X}}\right)\right)-I_{k-1}(\mathbf{x}) \quad \mathbf{x} \in \Omega$
where $\Omega$ is the set of 2D feature points in the image at time $k-1$, each described by a $4 \times 4$ patch, and ${_{c} \mathbf{X}}$ is the 3D position of a feature in the camera frame of frame $k-1$.
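
The photometric residual above can be sketched for a single pixel as follows. This is only an illustration under stated assumptions: a pinhole camera with hypothetical intrinsics, bilinear sampling, one pixel instead of the $4 \times 4$ patch, and a synthetic ramp image whose intensity equals the column index. None of the function names come from the paper.

```python
import numpy as np

def project(X_c, fx, fy, cx, cy):
    """Pinhole projection pi: 3D point in the camera frame -> pixel (u, v)."""
    return np.array([fx * X_c[0] / X_c[2] + cx, fy * X_c[1] / X_c[2] + cy])

def bilinear(img, u, v):
    """Sample a grayscale image at a sub-pixel location (u = column, v = row)."""
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * img[v0, u0] + du * (1 - dv) * img[v0, u0 + 1]
            + (1 - du) * dv * img[v0 + 1, u0] + du * dv * img[v0 + 1, u0 + 1])

def photometric_residual(I_k, I_km1, T_k_km1, X_c, x, intrinsics):
    """delta I = I_k(pi(T_{k,k-1} * X_c)) - I_{k-1}(x), evaluated at one pixel."""
    X_k = T_k_km1[:3, :3] @ X_c + T_k_km1[:3, 3]  # point moved into frame k
    u, v = project(X_k, *intrinsics)
    return bilinear(I_k, u, v) - bilinear(I_km1, x[0], x[1])

# Synthetic setup (all values hypothetical): intensity equals the column index.
image = np.tile(np.arange(32, dtype=float), (32, 1))
intrinsics = (100.0, 100.0, 16.0, 16.0)  # fx, fy, cx, cy
X_c = np.array([0.1, 0.0, 2.0])          # feature point in frame k-1
x = project(X_c, *intrinsics)            # its pixel in frame k-1

T_same = np.eye(4)
T_moved = np.eye(4)
T_moved[0, 3] = 0.02                     # small translation along x

r_same = photometric_residual(image, image, T_same, X_c, x, intrinsics)
r_moved = photometric_residual(image, image, T_moved, X_c, x, intrinsics)
print(r_same, r_moved)
```

With the identity transform the residual vanishes, while the shifted transform lands one column to the right on the ramp image. A patch-based version would sum the squared residual over all pixels of the patch.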

The optimization objective is

$\mathbf{T}_{k, k-1}=\underset{\mathbf{T}_{k, k-1}}{\arg \min } \frac{1}{2} \sum_{i \in \Omega}\left\|\delta I\left(\mathbf{T}_{k, k-1}, \mathbf{x}_{i}\right)\right\|^{2}$
The optimization variable is updated with

$\mathbf{T}_{k, k-1}=\mathbf{T}_{k, k-1} * \mathbf{T}(\xi)^{-1}$
If the sparse model-based image alignment above fails, the constant motion model is used to estimate $\mathbf{T}_{k, w}$

$\mathbf{T}_{k, w}=\mathbf{T}_{k-1, w} * \mathbf{T}_{k-2, w}^{-1} * \mathbf{T}_{k-1, w}$
This expression follows from the constant-motion assumption

$\mathbf{T}_{k,k-1} = \mathbf{T}_{k-1,k-2} = \mathbf{T}_{k-1, w} * \mathbf{T}_{k-2, w}^{-1}$
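
A minimal sketch of the constant motion model, assuming poses are stored as 4x4 homogeneous world-to-camera matrices in NumPy. The helper names and the example poses are hypothetical.

```python
import numpy as np

def inverse_se3(T):
    """Invert a 4x4 rigid-body transform using R^T and -R^T t."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv

def predict_pose_constant_motion(T_km1_w, T_km2_w):
    """Predict T_{k,w} under the assumption T_{k,k-1} = T_{k-1,k-2}."""
    T_km1_km2 = T_km1_w @ inverse_se3(T_km2_w)  # last inter-frame motion
    return T_km1_km2 @ T_km1_w                  # apply it once more

def translation(tx, ty, tz):
    """Build a pure-translation transform (helper for the example only)."""
    T = np.eye(4)
    T[:3, 3] = [tx, ty, tz]
    return T

# Hypothetical poses: the translation part grows by 0.1 along x per frame.
T_km2_w = translation(0.1, 0.0, 0.0)
T_km1_w = translation(0.2, 0.0, 0.0)
T_k_w = predict_pose_constant_motion(T_km1_w, T_km2_w)
print(T_k_w[:3, 3])
```

The predicted pose continues the previous step, so the translation part moves by another 0.1 along x.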

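4.2 Reprojection: refine the pixel positions of current frame 2d feature points

Note that the Initial Pose step only uses information from two consecutive frames: it solves a photometric error minimization to obtain the camera pose of the current frame, and this pose is not accurate. To reduce drift, information from the map is used next to refine it.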

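[Figure: image_1dh69qerc26n1eclvjkg761qct1j.png]
First find the keyframes that are closest to the current frame and share its view (covered later in this post). Then reproject the space points of these keyframes onto the image plane of the current frame, as shown in the figure above, and build the following objective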

$\mathbf{x}_{i}^{\prime}=\underset{\mathbf{x}_{i}^{\prime}}{\arg \min } \frac{1}{2}\left\|I_{k}\left(\mathbf{x}_{i}^{\prime}\right)-\mathbf{A}_{i} * I_{r}\left(\mathbf{x}_{i}\right)\right\|^{2}$
Minimizing this photometric error refines the 2D feature pixel positions in the current frame. Here $8 \times 8$ patches describe the feature points: keyframes are farther from the current frame than the previous frame is, so a larger patch size is chosen.
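
To make the feature alignment objective concrete, here is a simplified stand-in. It assumes the warp $\mathbf{A}_{i}$ is the identity and replaces the iterative sub-pixel minimization with a brute-force search over integer pixel offsets, so it illustrates the cost being minimized rather than the method used in the paper. The function name, search radius, and synthetic images are hypothetical.

```python
import numpy as np

def align_feature(I_k, ref_patch, x_init, search_radius=3):
    """Find the integer pixel (u, v) near x_init whose patch in I_k
    has the smallest 0.5 * sum of squared differences to ref_patch."""
    h = ref_patch.shape[0] // 2
    best, best_cost = x_init, np.inf
    for dv in range(-search_radius, search_radius + 1):
        for du in range(-search_radius, search_radius + 1):
            u, v = x_init[0] + du, x_init[1] + dv
            if u - h < 0 or v - h < 0:
                continue  # window starts outside the image
            patch = I_k[v - h:v + h, u - h:u + h]
            if patch.shape != ref_patch.shape:
                continue  # window ends outside the image
            cost = 0.5 * np.sum((patch - ref_patch) ** 2)
            if cost < best_cost:
                best, best_cost = (u, v), cost
    return best

# Synthetic example: the current image is the reference image shifted
# down by 1 row and right by 2 columns.
rng = np.random.default_rng(0)
I_r = rng.random((40, 40))
I_k = np.roll(I_r, shift=(1, 2), axis=(0, 1))

ref_patch = I_r[16:24, 16:24]          # 8x8 patch around (u, v) = (20, 20)
refined = align_feature(I_k, ref_patch, (20, 20))
print(refined)
```

The search starts from the reprojected position (20, 20) and recovers the shifted location of the patch in the current image.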

4.3 Pose Optimization: refine current frame camera pose ${T}_{k, w}$

The camera pose of the current frame is refined by minimizing the reprojection error.

$\mathbf{T}_{k, w}=\underset{\mathbf{T}_{k, w}}{\arg \min } \frac{1}{2} \sum_{i}\left\|\mathbf{x}_{i}-\pi\left(\mathbf{T}_{k, w} *{_{w} \mathbf{X}_{i}}\right)\right\|^{2}$
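
The sketch below only evaluates the reprojection cost for a candidate pose; a nonlinear least-squares solver would then minimize it over $\mathbf{T}_{k, w}$. The pinhole intrinsics, the points, and the poses are hypothetical values chosen for illustration.

```python
import numpy as np

def project(X_c, fx, fy, cx, cy):
    """Pinhole projection pi: 3D point in the camera frame -> pixel (u, v)."""
    return np.array([fx * X_c[0] / X_c[2] + cx, fy * X_c[1] / X_c[2] + cy])

def reprojection_cost(T_k_w, points_w, pixels, intrinsics):
    """0.5 * sum_i || x_i - pi(T_{k,w} * X_i) ||^2 for world points X_i."""
    R, t = T_k_w[:3, :3], T_k_w[:3, 3]
    cost = 0.0
    for X_w, x in zip(points_w, pixels):
        r = x - project(R @ X_w + t, *intrinsics)
        cost += 0.5 * float(r @ r)
    return cost

intrinsics = (500.0, 500.0, 320.0, 240.0)  # fx, fy, cx, cy
points_w = np.array([[0.0, 0.0, 4.0], [1.0, -0.5, 5.0], [-0.8, 0.3, 6.0]])

T_true = np.eye(4)
T_true[0, 3] = 0.1                         # pose that generated the pixels
pixels = [project(T_true[:3, :3] @ X + T_true[:3, 3], *intrinsics)
          for X in points_w]

cost_true = reprojection_cost(T_true, points_w, pixels, intrinsics)
cost_identity = reprojection_cost(np.eye(4), points_w, pixels, intrinsics)
print(cost_true, cost_identity)
```

The cost is zero at the pose that generated the observations and positive at the wrong pose, which is the signal the optimizer follows.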

4.4 New Keyframe Decision

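The current frame tracks more than 50 inlier points.
The mapping thread is idle.
The parallax angle between the current frame and the last keyframe is greater than 2 degrees.

If the current frame is selected as a keyframe, ORB features are extracted from the left and right images. Keypoints fall into two types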
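
A small sketch of the keyframe test. The notes list the three criteria without stating how they are combined, so requiring all three at once is an assumption made here, and the type and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FrameStatus:
    tracked_inliers: int   # inlier points tracked in the current frame
    mapping_idle: bool     # whether the mapping thread is idle
    parallax_deg: float    # parallax angle to the last keyframe, in degrees

def is_new_keyframe(status, min_inliers=50, min_parallax_deg=2.0):
    """Assumption: all three criteria must hold at the same time."""
    return (status.tracked_inliers > min_inliers
            and status.mapping_idle
            and status.parallax_deg > min_parallax_deg)

print(is_new_keyframe(FrameStatus(80, True, 3.5)))
print(is_new_keyframe(FrameStatus(80, False, 3.5)))
```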

Monocular Keypoints: $x_m = (u_L, v_L)$ , where $(u_L, v_L)$ are the coordinates on the left image
Stereo Keypoints: $x_s = (u_L, v_L, u_R)$ , where $(u_L, v_L)$ is on the left image and $u_R$ is the horizontal coordinate on the right image

Depth from stereo is recovered with

$d=\frac{f_{x} * b}{u_{L}-u_{R}}$
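The two keypoint types are separated by depth: if the recovered depth is less than 40 times the baseline $b$, the point is a Stereo Keypoint (near); otherwise it is a Monocular Keypoint (far).

In summary, depth for near features is recovered with the stereo model (near points are reliable), while depth for far points is recovered by MVG triangulation.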
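
The depth formula and the 40-baseline rule can be sketched as below. The focal length, baseline, and pixel coordinates are hypothetical, and treating a non-positive disparity as a monocular keypoint is an assumption of this sketch.

```python
def stereo_depth(fx, baseline, u_left, u_right):
    """d = fx * b / (u_L - u_R); returns None when the disparity is not positive."""
    disparity = u_left - u_right
    if disparity <= 0:
        return None
    return fx * baseline / disparity

def classify_keypoint(depth, baseline, ratio=40.0):
    """'stereo' if depth < ratio * baseline, otherwise 'monocular'."""
    if depth is not None and depth < ratio * baseline:
        return "stereo"
    return "monocular"

fx, b = 700.0, 0.5                          # hypothetical focal length (px) and baseline (m)
near = stereo_depth(fx, b, 400.0, 365.0)    # disparity of 35 px
far = stereo_depth(fx, b, 400.0, 390.0)     # disparity of 10 px
print(near, classify_keypoint(near, b))
print(far, classify_keypoint(far, b))
```

With these numbers the threshold is 40 * 0.5 = 20 m, so the 10 m point counts as a stereo keypoint and the 35 m point as a monocular keypoint.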

5 Mapping

5.1 Inserting New Keyframe

When a new keyframe arrives, 2D-3D correspondences are already available, but they come from the direct method used in tracking rather than from ORB descriptors.
ORB feature points are computed in order to obtain the depth of new map points.

5.2 Finding Overlap Keyframes

How are the 10 nearest overlapping keyframes of the current frame found? My understanding is that the distance between camera poses is computed directly.
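
The interpretation above can be sketched as a nearest-neighbor query on camera centers. This follows the reading given in these notes, not a confirmed detail of the paper, and the function name and data layout are hypothetical.

```python
import math

def nearest_keyframes(current_center, keyframe_centers, k=10):
    """Return the ids of the k keyframes whose camera centers are
    closest (Euclidean distance) to the current camera center."""
    ranked = sorted(keyframe_centers.items(),
                    key=lambda item: math.dist(current_center, item[1]))
    return [kf_id for kf_id, _ in ranked[:k]]

# Hypothetical camera centers in world coordinates, keyed by keyframe id.
centers = {0: (0.0, 0.0, 0.0), 1: (1.0, 0.0, 0.0),
           2: (5.0, 0.0, 0.0), 3: (0.5, 0.0, 0.0)}
print(nearest_keyframes((0.4, 0.0, 0.0), centers, k=2))
```

A distance-only criterion ignores viewing direction, so a real system would likely also check that the keyframes actually share observations with the current frame.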

5.3 Triangulating

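This step is needed because the depth values produced by a depth filter are inaccurate (due to noisy initialization and fast motion).
The system uses triangulation to estimate depth and generate new map points.

ORB features are matched between the current keyframe and the overlapping keyframes. The epipolar constraint, non-maximum suppression, and a cross check remove some outliers at this stage.
After matching, triangulation gives the depth, and further outliers are removed based on depth positivity in both cameras, parallax, and reprojection errors.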
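
A minimal two-view sketch of this step, assuming linear (DLT) triangulation, which the notes do not specify. It applies the positive-depth and reprojection-error checks and omits the parallax check for brevity. The camera matrices, the 2-pixel threshold, and all function names are hypothetical.

```python
import numpy as np

def project(P, X):
    """Project a 3D point with a 3x4 projection matrix and dehomogenize."""
    p = P @ np.append(X, 1.0)
    return p[:2] / p[2]

def triangulate_dlt(P1, P2, x1, x2):
    """Linear triangulation: solve A X = 0 for the homogeneous point X."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]                      # right singular vector of the smallest singular value
    return X_h[:3] / X_h[3]

def depth_in_camera(P, K, X):
    """z coordinate of X in the camera with projection matrix P = K [R | t]."""
    Rt = np.linalg.inv(K) @ P
    return float(Rt[2, :3] @ X + Rt[2, 3])

def accept_point(P1, P2, K, X, x1, x2, max_err_px=2.0):
    """Keep X only if it is in front of both cameras and reprojects closely."""
    positive = depth_in_camera(P1, K, X) > 0 and depth_in_camera(P2, K, X) > 0
    err = max(np.linalg.norm(project(P1, X) - x1),
              np.linalg.norm(project(P2, X) - x2))
    return positive and err < max_err_px

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

X_true = np.array([0.2, -0.1, 4.0])
x1, x2 = project(P1, X_true), project(P2, X_true)
X_est = triangulate_dlt(P1, P2, x1, x2)
print(X_est, accept_point(P1, P2, K, X_est, x1, x2))
```

With noise-free observations the triangulated point matches the ground truth and passes both checks, while a point placed behind the cameras is rejected by the depth test.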

5.4 Bundle Adjustment

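Local BA uses a sliding window of 10 keyframes.
Global BA runs after a period ends and optimizes all map points.
"stereo constraint is performed to improve the system": I did not understand this sentence.

The projection functions for the two keypoint types are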

$\pi_{m}\left(\left[\begin{array}{c}{{_{c}\mathbf{X}}(1)} \\ {{_{c}\mathbf{X}}(2)} \\ {{_{c}\mathbf{X}}(3)}\end{array}\right]\right)=\left[\begin{array}{c}{f_{x} * \frac{{_{c}\mathbf{X}}(1)}{{_{c}\mathbf{X}}(3)}+c_{x}} \\ {f_{y} * \frac{{_{c}\mathbf{X}}(2)}{{_{c}\mathbf{X}}(3)}+c_{y}}\end{array}\right]$

$\pi_{s}\left(\left[\begin{array}{c}{{_{c}\mathbf{X}}(1)} \\ {{_{c}\mathbf{X}}(2)} \\ {{_{c}\mathbf{X}}(3)}\end{array}\right]\right)=\left[\begin{array}{c}{f_{x} * \frac{{_{c}\mathbf{X}}(1)}{{_{c}\mathbf{X}}(3)}+c_{x}} \\ {f_{y} * \frac{{_{c}\mathbf{X}}(2)}{{_{c}\mathbf{X}}(3)}+c_{y}} \\ {f_{x} * \frac{{_{c}\mathbf{X}}(1)-b}{{_{c}\mathbf{X}}(3)}+c_{x}}\end{array}\right]$
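
The two projection functions can be written out directly. The sketch assumes a rectified stereo pair with shared intrinsics, and the numeric values are hypothetical.

```python
def project_mono(X_c, fx, fy, cx, cy):
    """pi_m: camera-frame point -> (u_L, v_L) on the left image."""
    x, y, z = X_c
    return (fx * x / z + cx, fy * y / z + cy)

def project_stereo(X_c, fx, fy, cx, cy, b):
    """pi_s: camera-frame point -> (u_L, v_L, u_R) for a rectified pair."""
    x, y, z = X_c
    u_l, v_l = project_mono(X_c, fx, fy, cx, cy)
    u_r = fx * (x - b) / z + cx   # same point seen from the right camera
    return (u_l, v_l, u_r)

X_c = (1.0, 0.5, 10.0)            # hypothetical point, 10 m in front of the camera
fx, fy, cx, cy, b = 700.0, 700.0, 320.0, 240.0, 0.5
u_l, v_l, u_r = project_stereo(X_c, fx, fy, cx, cy, b)
print(u_l, v_l, u_r, fx * b / (u_l - u_r))
```

The last printed value applies $d=\frac{f_{x} * b}{u_{L}-u_{R}}$ to the projected coordinates and recovers the depth of the point, which shows that $\pi_s$ is consistent with the stereo depth formula.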

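The local BA objective is
[Figure: local BA objective function, image_1dh6env78ldkc2l6gqo3oilu20.png]

$K_l$ is the set of keyframes in the local window
$P_l$ is the set of all 3D points seen by the keyframes in the local window
$\pi_m$ is the projection for monocular keypoints (far)
$\pi_s$ is the projection for stereo keypoints (near)

Global BA is similar to local BA, except that the points of the first keyframe are fixed and not optimized. The LM (Levenberg-Marquardt) algorithm is used for the optimization.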