@ArrowLLL 2018-01-24T15:48:14.000000Z 字数 8630 阅读 5877

# Study-Note : Social LSTM - Human Trajectory Prediction in Crowded Spaces


## Summary

Social LSTM handles the interactions among multiple prediction tasks (one per person).

### Pooling the hidden states

1. The hidden state $h_i^t$ of the LSTM captures the latent information about the $i$-th person at time $t$;
2. this representation is shared with neighbors by building a hidden-state tensor $H_i^t$:

Given a hidden-state dimension $D$ and a neighborhood size $N_o$, we construct an $N_o \times N_o \times D$ tensor $H_i^t$ for the $i$-th trajectory:

• $h_j^{t-1}$ is the hidden state obtained from the LSTM of the $j$-th person at time $t-1$;
• $1_{mn}[x, y]$ is an indicator function that checks whether $(x, y)$ lies in the grid cell $(m, n)$ (1 if inside, 0 otherwise);
• $\mathcal{N}_i$ is the set of people in the neighborhood of person $i$.
3. The pooled tensor is embedded into a vector $a_i^t$, and the coordinates into a vector $e_i^t$:

• $\phi(\cdot)$ is an embedding function with a ReLU nonlinearity;
• $W_e$ and $W_a$ are the embedding weights;
• the LSTM weights are denoted by $W_l$.
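Written out (equations reconstructed from the definitions above), the pooling and embedding steps are:

$$H_i^t(m, n, :) = \sum_{j \in \mathcal{N}_i} 1_{mn}\left[x_j^t - x_i^t,\ y_j^t - y_i^t\right] h_j^{t-1}$$

$$e_i^t = \phi\left(x_i^t, y_i^t;\ W_e\right), \qquad a_i^t = \phi\left(H_i^t;\ W_a\right)$$

$$h_i^t = \mathrm{LSTM}\left(h_i^{t-1},\ e_i^t,\ a_i^t;\ W_l\right)$$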

### Position estimation

The hidden state at time $t$ is used to predict the distribution of the trajectory position $(\hat{x}, \hat{y})_i^{t+1}$ at time $t+1$. A bivariate Gaussian distribution is assumed, with the following parameters:

• mean $\mu_i^{t+1} = (\mu_x, \mu_y)_i^{t+1}$
• standard deviation $\sigma_i^{t+1} = (\sigma_x, \sigma_y)_i^{t+1}$
• correlation coefficient $\rho_i^{t+1}$

The parameters of the LSTM model are learned by minimizing the negative log-likelihood loss ($L^i$ denotes the loss of the $i$-th trajectory):
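Reconstructed from the definitions above, the loss for trajectory $i$ is:

$$L^i(W_e, W_l, W_p) = -\sum_{t = T_{obs}+1}^{T_{pred}} \log P\left(x_i^t, y_i^t \mid \sigma_i^t, \mu_i^t, \rho_i^t\right)$$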

### Implementation details

• The spatial coordinates are embedded into 64-dimensional vectors before being fed to the LSTM;
• the spatial pooling size $N_o$ is set to 32, and each cell uses an $8 \times 8$ pooling window;
• the hidden-state dimension of all LSTMs is fixed at 128;
• before pooling, the hidden states are passed through an embedding layer with ReLU (the paper does not give its dimension; presumably it stays at 128 and simply adds a ReLU);
• hyper-parameters are chosen by cross-validation;
• the model is trained with RMSprop and a learning rate of 0.003;
• the paper's experiments use Theano on a single GPU.

## Abstract

• Problem of trajectory prediction can be viewed as a sequence generation task, where we are interested in predicting the future trajectory of people based on their past positions.
• We propose an LSTM model which can learn general human movement and predict their future trajectories.
• Our model outperforms state-of-the-art methods on some of these datasets.
• We analyze the trajectories predicted by our model to demonstrate the motion behaviour it has learned.

## Introduction

Most previous works are limited by the following two assumptions:

1. They use hand-crafted functions to model "interactions" for specific settings rather than inferring them in a data-driven fashion.
2. They focus on modeling interactions among people in close proximity to each other, however they do not anticipate interactions that could occur in the more distant future.

In this work, we propose an approach that can address both challenges through a novel data-driven architecture for predicting human trajectories in future instants.

We address this issue through a novel architecture which connects the LSTMs corresponding to nearby sequences. In particular, we introduce a "Social" pooling layer which allows the LSTMs of spatially proximal sequences to share their hidden states with each other. This architecture, which we refer to as the "Social-LSTM", can automatically learn typical interactions that take place among trajectories which coincide in time.

## Model

### Problem formulation

• At any time-instant $t$, the $i^{th}$ person in the scene is represented by his/her xy-coordinates $(x_i^t, y_i^t)$.
• We observe the positions of all the people from time $1$ to $T_{obs}$, and predict the positions of all the people from time $T_{obs}+1$ to $T_{pred}$.

### Social LSTM

Long Short-Term Memory (LSTM) networks have been shown to successfully learn and generalize the properties of isolated sequences. We use one LSTM for each person in a scene, with the LSTM weights shared across all the sequences, and connect neighboring LSTMs through a new pooling strategy.

#### Social pooling of hidden states

At every time-step, the LSTM cell receives pooled hidden-state information from the LSTM cells of its neighbors.

1. The hidden state $h_i^t$ of the LSTM at time $t$ captures the latent representation of the $i^{th}$ person in the scene at that instant.

2. This representation is shared with neighbors by building a "Social" hidden-state tensor $H^t_i$.

Given a hidden-state dimension D and neighborhood size $N_o$, we construct a $N_o \times N_o \times D$ tensor $H_i^t$ for the $i^{th}$ trajectory :
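Written out (reconstructed from the symbol definitions below), the tensor is built as:

$$H_i^t(m, n, :) = \sum_{j \in \mathcal{N}_i} 1_{mn}\left[x_j^t - x_i^t,\ y_j^t - y_i^t\right] h_j^{t-1}$$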

• $h_j^{t-1}$ is the hidden state of the LSTM corresponding to the $j^{th}$ person at time $t-1$
• $1_{mn}[x, y]$ is an indicator function to check if $(x, y)$ is in the $(m, n)$ cell of the grid
• $\mathcal{N}_i$ is the set of neighbors corresponding to person $i$
3. Embed the pooled social hidden-state tensor into a vector $a_i^t$ and the coordinates into $e_i^t$:

• $\phi(\cdot)$ is an embedding function with ReLU nonlinearity
• $W_e$ and $W_a$ are embedding weights.
• The LSTM weights are denoted by $W_l$
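The embedding and recurrence steps above can be written as (reconstructed from the definitions above):

$$e_i^t = \phi\left(x_i^t, y_i^t;\ W_e\right), \qquad a_i^t = \phi\left(H_i^t;\ W_a\right)$$

$$h_i^t = \mathrm{LSTM}\left(h_i^{t-1},\ e_i^t,\ a_i^t;\ W_l\right)$$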

#### Position estimation

The hidden state at time $t$ is used to predict the distribution of the trajectory position $(\hat{x}, \hat{y})_i^{t+1}$ at the next time-step $t+1$. We assume a bivariate Gaussian distribution parametrized by

• the mean $\mu_i^{t+1} = (\mu_x, \mu_y)_i^{t+1}$
• standard deviation $\sigma_i^{t+1} = (\sigma_x, \sigma_y)_i^{t+1}$
• correlation coefficient $\rho_i^{t+1}$

These parameters are predicted by a linear layer with a $5 \times D$ weight matrix $W_p$.

The predicted coordinates $(\hat{x}_i^t, \hat{y}_i^t)$ at time $t$ are given by
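Reconstructed from the definitions above:

$$\left[\mu_i^{t+1},\ \sigma_i^{t+1},\ \rho_i^{t+1}\right] = W_p\, h_i^t$$

$$(\hat{x}, \hat{y})_i^{t+1} \sim \mathcal{N}\left(\mu_i^{t+1},\ \sigma_i^{t+1},\ \rho_i^{t+1}\right)$$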

The parameters of the LSTM model are learned by minimizing the negative log-likelihood loss ($L^i$ for the $i^{th}$ trajectory):
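$$L^i(W_e, W_l, W_p) = -\sum_{t = T_{obs}+1}^{T_{pred}} \log P\left(x_i^t, y_i^t \mid \sigma_i^t, \mu_i^t, \rho_i^t\right)$$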

The model is trained by minimizing this loss for all the trajectories in the training dataset.
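As a concrete sketch (numpy; the function name is hypothetical), the per-step negative log-likelihood of a ground-truth point under the predicted bivariate Gaussian can be computed as:

```python
import numpy as np

def bivariate_nll(x, y, mux, muy, sx, sy, rho, eps=1e-12):
    """Negative log-likelihood of (x, y) under a bivariate Gaussian
    with means (mux, muy), std devs (sx, sy) and correlation rho."""
    zx = (x - mux) / sx
    zy = (y - muy) / sy
    z = zx**2 + zy**2 - 2.0 * rho * zx * zy
    denom = 2.0 * np.pi * sx * sy * np.sqrt(1.0 - rho**2)
    log_p = -z / (2.0 * (1.0 - rho**2)) - np.log(denom + eps)
    return -log_p

# The trajectory loss L^i sums this term over the prediction
# interval t = T_obs+1 ... T_pred.
```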

#### Occupancy map pooling

As a simplification, we also experiment with a model which only pools the coordinates of the neighbors (referred to as O-LSTM).

For a person $i$, we modify the definition of the tensor $H_i^t$ to be an $N_o \times N_o$ matrix at time $t$ centered at the person's position, and call it the occupancy map $O_i^t$. The positions of all the neighbors are pooled in this map. The $(m, n)$ element of the map is simply given by:
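$$O_i^t(m, n) = \sum_{j \in \mathcal{N}_i} 1_{mn}\left[x_j^t - x_i^t,\ y_j^t - y_i^t\right]$$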

The vectorized occupancy map is used in place of $H_i^t$ from the previous section while learning this simpler model.
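A minimal numpy sketch of the occupancy map (the grid geometry, cell size, and function name are assumptions for illustration; the paper does not spell out the cell layout):

```python
import numpy as np

def occupancy_map(pos_i, neighbor_pos, no=4, cell=2.0):
    """Build the no x no occupancy map O_i^t centered on person i.

    pos_i        : (x, y) of person i
    neighbor_pos : array of shape (K, 2) with neighbor coordinates
    no           : grid size (cells per side, assumed)
    cell         : side length of one grid cell (world units, assumed)
    """
    o = np.zeros((no, no))
    half = no * cell / 2.0
    for xj, yj in neighbor_pos:
        dx, dy = xj - pos_i[0], yj - pos_i[1]
        # indicator 1_mn[dx, dy]: which cell (if any) contains the neighbor
        m = int((dx + half) // cell)
        n = int((dy + half) // cell)
        if 0 <= m < no and 0 <= n < no:
            o[m, n] += 1.0   # count neighbors per cell
    return o
```

Neighbors outside the $N_o \times N_o$ neighborhood simply do not contribute, matching the indicator-function definition.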

#### Inference for path prediction

During inference, we use the predicted position $(\hat{x}_i^t, \hat{y}_i^t)$ from the previous Social-LSTM cell in place of the true coordinates $(x_i^t, y_i^t)$. The predicted positions are also used to replace the actual coordinates while constructing the Social hidden-state tensor $H_i^t$ or the occupancy map $O_i^t$.

### Implementation details

1. Use an embedding dimension of 64 for the spatial coordinates before using them as input to the LSTM.
2. Set the spatial pooling size $N_o = 32$.
3. Use an $8 \times 8$ sum-pooling window size without overlaps.
4. Fix the hidden-state dimension at 128 for all the LSTM models.
5. Use an embedding layer with ReLU on top of the pooled hidden-state features, before using them for calculating the hidden-state tensor $H_i^t$.
6. Hyper-parameters were chosen by cross-validation on a synthetic dataset.
7. This synthetic dataset was generated using a simulation that implemented the social forces model, containing trajectories for hundreds of scenes with an average crowd density of 30 people per frame.
8. Use RMSprop with a learning rate of 0.003 for training the model.
9. The model was trained on a single GPU with a Theano implementation.
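The social pooling step can be sketched in numpy as follows (a sketch only: the grid parameters and function name are assumptions; the paper uses $N_o = 32$ with $8 \times 8$ sum-pooling windows, whose exact discretization it leaves implicit):

```python
import numpy as np

def social_tensor(pos, hidden, i, no=4, cell=8.0):
    """Build H_i^t for person i: an (no, no, D) sum-pool of neighbor
    hidden states over a grid centered at person i's position.

    pos    : (N, 2) positions of all people at time t
    hidden : (N, D) LSTM hidden states from time t-1
    no, cell : grid size and cell side length (assumed values)
    """
    n_people, d = hidden.shape
    h_tensor = np.zeros((no, no, d))
    half = no * cell / 2.0
    for j in range(n_people):
        if j == i:
            continue
        dx, dy = pos[j] - pos[i]
        m = int((dx + half) // cell)
        n = int((dy + half) // cell)
        if 0 <= m < no and 0 <= n < no:
            h_tensor[m, n] += hidden[j]   # sum-pool neighbor hidden states
    return h_tensor
```

The flattened tensor would then pass through the ReLU embedding $\phi(\cdot;\ W_a)$ to give $a_i^t$.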

### Experiments

• Human-trajectory datasets

ETH and UCY

• Report the prediction error with three different metrics

1. Average displacement error - The mean squared error (MSE) between all estimated points of a trajectory and the true points.
2. Final displacement error - The distance between the predicted final destination and the true final destination at the end of the prediction period $T_{pred}$.
3. Average non-linear displacement error - The MSE at the non-linear regions of a trajectory.
• Leave-one-out approach

Train and validate this model on 4 sets and test on the remaining set. Repeat this for all the 5 sets.

• Test

Observe a trajectory for 3.2 seconds and predict the path for the next 4.8 seconds.
At a frame rate of 0.4 seconds, this corresponds to observing 8 frames and predicting the next 12 frames.

• Comparison

• Linear model
• Collision avoidance
• Social force
• Iterative Gaussian Process
• Our vanilla LSTM
• our LSTM with occupancy maps
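The first two error metrics can be sketched in numpy as follows (function names are hypothetical; the paper defines ADE as an MSE, while many later implementations report the mean Euclidean distance instead):

```python
import numpy as np

def average_displacement_error(pred, true):
    """ADE as described in the paper: mean squared displacement
    over all predicted points. pred, true: shape (T_pred, 2)."""
    return np.mean(np.sum((pred - true) ** 2, axis=1))

def final_displacement_error(pred, true):
    """FDE: Euclidean distance between the predicted and true
    positions at the final prediction step T_pred."""
    return np.linalg.norm(pred[-1] - true[-1])
```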

Vanilla LSTM outperforms the linear baseline since it can extrapolate non-linear curves. However, this simple LSTM is noticeably worse than the Social Force and IGP models, which explicitly model human-human interactions.

Social-pooling-based LSTM and O-LSTM outperform the heavily engineered Social Force and IGP models on almost all datasets.

The IGP model, which knows the true final destination during testing, achieves lower errors on parts of this dataset.

Social-LSTM outperforms O-LSTM on the more crowded UCY datasets, which shows the advantage of pooling the entire hidden state to capture complex interactions in dense crowds.

### Conclusions

Use one LSTM for each trajectory and share the information between the LSTMs through the introduction of a new Social pooling layer. We refer to the resulting model as the "Social" LSTM.

In addition, human-space interaction can be modeled in our framework by including the local static-scene image as an additional input to the LSTM. This could allow joint modeling of human-human and human-space interactions in the same framework.