@devilloser 2018-09-06T02:48:46.000000Z

Interaction-aware Spatio-temporal Pyramid Attention Networks for Action Classification

action

Interaction-aware Spatio-temporal Pyramid Attention Networks for Action Classification
attention
- hard attention
- soft attention
  - predict the attention score of the next time step with an LSTM
  - self attention
Limitations
Interaction-aware Spatial Pyramid Attention Layer
PCA
temporal aggregation
loss function
- interactive
- regularization

attention

Attention mainly helps the model separate useful information from irrelevant information.

hard attention

make hard binary choices

soft attention

uses weighted average instead of hard selection
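
To make the contrast concrete, here is a minimal NumPy sketch. The toy `features` and `scores` arrays are hypothetical, and hard attention is simplified to a deterministic `argmax` (in practice it is often a sampled choice), so treat this as an illustration rather than the method of any particular paper.

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def hard_attention(features, scores):
    # Hard attention: select exactly one location (a binary choice per location).
    return features[np.argmax(scores)]

def soft_attention(features, scores):
    # Soft attention: weighted average over all locations.
    weights = softmax(scores)
    return weights @ features

# Hypothetical toy data: 3 locations, 2-dimensional features.
features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
scores = np.array([0.0, 0.0, 0.0])

hard_out = hard_attention(features, scores)
soft_out = soft_attention(features, scores)
```

With equal scores, the soft output is simply the mean of all features, while the hard output keeps a single location and discards the rest.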

predict the attention score of the next time step with an LSTM

self attention

a special form of non-local network

Limitations

All of these methods compute attention on each frame separately, so the interaction between frames is not taken into account.

Interaction-aware Spatial Pyramid Attention Layer

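The output of layer $i$, $f_i\in R^{W_i\times H_i\times C_i}$, is flattened into $X_i\in R^{W_iH_i\times C_i}$.
The outputs of layers $i-N+1$ through $i$ are downsampled to a common spatial size, i.e. $R^{W_iH_i\times C_j}$.
[Figure: image_1cmh6vkjj1v7116n4sfn1spc1qlf9.png]
The pipeline is shown in the figure above.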
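
The flatten and downsample steps can be sketched in NumPy as follows. The notes do not say how the downsampling is done, so non-overlapping average pooling is an assumption here, and the layer sizes and the helper names `avg_pool_to` and `flatten` are hypothetical.

```python
import numpy as np

def avg_pool_to(f, out_w, out_h):
    # Downsample a (W, H, C) map to (out_w, out_h, C) with non-overlapping
    # average pooling (an assumed choice). W and H must be integer
    # multiples of the target size.
    w, h, c = f.shape
    assert w % out_w == 0 and h % out_h == 0
    sw, sh = w // out_w, h // out_h
    return f.reshape(out_w, sw, out_h, sh, c).mean(axis=(1, 3))

def flatten(f):
    # (W, H, C) -> (W*H, C): each row is one spatial location.
    w, h, c = f.shape
    return f.reshape(w * h, c)

rng = np.random.default_rng(0)
# Hypothetical pyramid with N = 3 layers; the last one is layer i (4x4 map).
layers = [rng.standard_normal((16, 16, 8)),
          rng.standard_normal((8, 8, 16)),
          rng.standard_normal((4, 4, 32))]

w_i, h_i, _ = layers[-1].shape
flat = [flatten(avg_pool_to(f, w_i, h_i)) for f in layers]
shapes = [x.shape for x in flat]
```

Every layer ends up with $W_iH_i = 16$ rows and keeps its own channel count $C_j$, which matches the $R^{W_iH_i\times C_j}$ shape above.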

PCA

$Y=PX$

[Figure: image_1cmhb9ugq1krf17mbfejriq5tem.png]

where $P_i\in \{P_1,P_2,...,P_R\}$, and each $P_i\in R^{1\times N}$ is a row vector representing the $i$-th basis vector of the space.
This projects the column vectors of $X$ onto the new basis.
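Relation to PCA:

In $M_i=A_iX_i$, $A_i$ can be viewed as a new basis found by PCA. Treating the feature maps of the different channels as samples, the projection yields the filtered feature maps.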
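
As a reference point, plain PCA with $Y=PX$ can be sketched in NumPy as below. This is ordinary eigendecomposition-based PCA, not the attention layer itself; the function name `pca_basis`, the random data, and the explicit centering step are assumptions made for the illustration.

```python
import numpy as np

def pca_basis(X, R):
    # X has shape (N, M): N dimensions, M samples stored as columns.
    Xc = X - X.mean(axis=1, keepdims=True)   # center each dimension
    cov = Xc @ Xc.T / X.shape[1]             # (N, N) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:R]      # indices of the R largest eigenvalues
    return eigvecs[:, top].T                 # (R, N): rows are the basis vectors P_i

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 200))            # hypothetical data, N = 5, M = 200
P = pca_basis(X, R=2)

# Project the (centered) column vectors of X onto the new basis: Y = P X.
Y = P @ (X - X.mean(axis=1, keepdims=True))
```

The rows of `P` are orthonormal, and the first row of `Y` carries at least as much variance as the second.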

temporal aggregation

$X_i\in R^{KWH\times C_i}$ denotes the data of $K$ frames.
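
A minimal sketch of how such a matrix could be built, assuming the per-frame feature maps are simply stacked along the row axis (the sizes below are hypothetical):

```python
import numpy as np

K, W, H, C = 4, 7, 7, 32                    # hypothetical sizes
rng = np.random.default_rng(0)
frames = rng.standard_normal((K, W, H, C))  # one (W, H, C) feature map per frame

# Stack all spatial locations of all K frames as rows: (K*W*H, C).
X = frames.reshape(K * W * H, C)
```

Because the locations of all $K$ frames live in one matrix, anything computed over the rows of `X` sees every frame at once rather than one frame at a time.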

loss function

PCA aims to maximize the variance, so

$S_i=\arg\min_{S_i} -\mathrm{tr}(S_iX_iX_i^TS_i^T),\ \text{s.t. } S_iS_i^T=I$
The Lagrangian is:

$J(S_i)=-\mathrm{tr}(S_iX_iX_i^TS_i^T)+\lambda(I-S_iS_i^T)$
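
The constrained objective can be checked numerically. The sketch below relies on the standard result that, under the constraint $S_iS_i^T=I$, the trace is maximized by the eigenvectors of $X_iX_i^T$ with the largest eigenvalues. The data, the sizes, and the helper name `trace_objective` are hypothetical.

```python
import numpy as np

def trace_objective(S, X):
    # tr(S X X^T S^T): the variance captured by the rows of S.
    return np.trace(S @ X @ X.T @ S.T)

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 50))             # hypothetical X_i
R = 2

eigvals, eigvecs = np.linalg.eigh(X @ X.T)   # eigenvalues in ascending order
S_opt = eigvecs[:, -R:].T                    # rows: the top-R eigenvectors

# A random S with orthonormal rows for comparison (QR of a random matrix).
Q, _ = np.linalg.qr(rng.standard_normal((6, R)))
S_rand = Q.T

best = trace_objective(S_opt, X)
other = trace_objective(S_rand, X)
```

Here `best` equals the sum of the $R$ largest eigenvalues, and any other feasible choice such as `S_rand` cannot exceed it.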

interactive

$l_{interactive}=-\chi(S_iX_iX_i^TS_i^T\cdot I)+\gamma((S_iS_i^T)\cdot (1-I))$

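$\chi$ is the element-wise sum and $\gamma$ is the element-wise sum of squares.
The second term takes this form because $S_i$ has been L2-normalized, so the diagonal entries of $S_iS_i^T$ are all 1 and only the off-diagonal entries need to be penalized.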
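
The interactive loss can be sketched in NumPy as follows. This reads $\cdot$ as the element-wise product and $1$ as the all-ones matrix, which is an assumption based on the formula rather than something the notes state explicitly. The function names and the random inputs are hypothetical.

```python
import numpy as np

def l2_normalize_rows(S):
    # Scale each row of S to unit L2 norm, so diag(S S^T) = 1.
    return S / np.linalg.norm(S, axis=1, keepdims=True)

def interactive_loss(S, X):
    # S: (R, N) with L2-normalized rows; X: (N, C).
    I = np.eye(S.shape[0])
    cov_term = S @ X @ X.T @ S.T               # (R, R)
    gram = S @ S.T                             # (R, R)
    chi = np.sum(cov_term * I)                 # element sum of the diagonal (the trace)
    gamma = np.sum((gram * (1.0 - I)) ** 2)    # sum of squared off-diagonal entries
    return -chi + gamma

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 10))               # hypothetical X_i
S = l2_normalize_rows(rng.standard_normal((3, 6)))
loss = interactive_loss(S, X)
```

When the rows of `S` are exactly orthonormal, the penalty term vanishes and the loss reduces to the negative trace from the PCA objective.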