@mShuaiZhao 2018-01-17T07:16:37.000000Z

YOLO

PaperReading 2017.12 ObjectDetection

You Only Look Once

CVPR 2016 (arXiv preprint 2015)
Rather than repurposing classifiers to perform detection, as prior work does, the paper frames object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation.

advantages
- prior work
  - sliding-window methods take a classifier for an object and evaluate it at various locations and scales in a test image
  - region proposal methods first generate candidate boxes, then run a classifier on them
- YOLO receives the whole image, so it sees a large context
- YOLO learns generalizable representations of objects.
Unified Detection
- Each bounding box consists of 5 predictions: $x, y, w, h$ and confidence.

  The confidence prediction represents the IOU between the predicted box and any ground truth box.
- Each grid cell also predicts $C$ conditional class probabilities.
- The image is divided into an $S \times S$ grid.
  Each grid cell predicts $B$ bounding boxes, confidence for those boxes, and $C$ class probabilities.

  The predictions are encoded as an $S\times S \times (B*5 + C)$ tensor.
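- In summary, the image is divided into $S\times S$ grid cells. The grid mostly fixes the layout of the output, $S\times S \times (B*5 + C)$, and puts little constraint on the bounding boxes themselves: a box may be larger than its cell.
  The first $S\times S \times (B*5)$ output units predict the bounding boxes ($x, y, w, h$ and confidence).
  The last $S\times S \times C$ output units predict the object classes, as class probabilities conditioned on the cell containing an object.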
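
The layout above can be made concrete with a small sketch. It assumes the paper's PASCAL VOC settings ($S=7$, $B=2$, $C=20$) and the ordering described in this note (boxes first, class probabilities last). The helper names `iou`, `split_cell` and `class_scores` are hypothetical and exist only for illustration; a real implementation may order the units differently.

```python
# Sketch of the YOLO output layout (not the authors' code).
# S, B, C follow the paper's PASCAL VOC setup; the per-cell ordering
# (B boxes first, then C class probabilities) is an assumption from this note.
S, B, C = 7, 2, 20
CELL_SIZE = B * 5 + C  # 30 numbers per grid cell


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_center, y_center, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def split_cell(cell):
    """Split one cell's B*5 + C numbers into B boxes and C class probabilities."""
    boxes = [tuple(cell[i * 5:(i + 1) * 5]) for i in range(B)]
    class_probs = list(cell[B * 5:])
    return boxes, class_probs


def class_scores(cell):
    """Class-specific confidence per box: Pr(class | object) * box confidence."""
    boxes, class_probs = split_cell(cell)
    return [[p * box[4] for p in class_probs] for box in boxes]


total_units = S * S * CELL_SIZE  # size of the whole prediction tensor
```

With these settings each cell carries 30 numbers, and the full prediction is a $7 \times 7 \times 30$ tensor.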
Training
- normalize the bounding box width and height by the image width and height so that they fall between 0 and 1.
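
A minimal sketch of how a ground-truth box might be turned into a training target under this normalization. Besides normalizing $w, h$ by the image size, it assumes (as I read the paper) that $x, y$ are expressed as offsets inside the responsible grid cell, so they also fall between 0 and 1. The function name `encode_box` and the pixel-corner input format are hypothetical choices for illustration.

```python
def encode_box(x_min, y_min, x_max, y_max, img_w, img_h, S=7):
    """Encode a pixel-space box as (row, col, x, y, w, h).

    row, col : grid cell containing the box center
    x, y     : center offset inside that cell, in [0, 1]
    w, h     : box size normalized by the image size, in [0, 1]
    """
    # Box center, normalized by the image size.
    cx = (x_min + x_max) / 2 / img_w
    cy = (y_min + y_max) / 2 / img_h
    # Cell that contains the center; clamp so a center on the far edge stays in range.
    col = min(int(cx * S), S - 1)
    row = min(int(cy * S), S - 1)
    # Offset of the center relative to the cell's top-left corner.
    x = cx * S - col
    y = cy * S - row
    # Width and height normalized by the image width and height.
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return row, col, x, y, w, h


# A box covering the whole 448x448 image: the center lands in the middle cell.
full_image = encode_box(0, 0, 448, 448, 448, 448)
```

Here `full_image` has `w = h = 1.0`, and its center sits halfway across cell `(3, 3)`.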
disadvantages
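- It struggles with small objects that appear in groups, such as flocks of birds.
  This is a consequence of the grid division, which has benefits as well as costs: a trade-off.
- It struggles to generalize to objects in new or unusual aspect ratios or configurations. After all, it predicts bounding boxes directly from data, which is hard.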
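
The first limitation can be illustrated with a toy check: if more object centers fall into one grid cell than the $B$ boxes that cell can predict, some objects cannot be assigned a box. This is only a sketch under the paper's settings ($S=7$, $B=2$). The names `cell_of` and `crowded_cells`, and the sample coordinates, are made up for illustration.

```python
from collections import defaultdict


def cell_of(cx, cy, S=7):
    """Grid cell (row, col) containing a normalized center (cx, cy) in [0, 1]."""
    return min(int(cy * S), S - 1), min(int(cx * S), S - 1)


def crowded_cells(centers, B=2, S=7):
    """Cells that hold more object centers than the B boxes a cell predicts."""
    counts = defaultdict(int)
    for cx, cy in centers:
        counts[cell_of(cx, cy, S)] += 1
    return {cell: n for cell, n in counts.items() if n > B}


# Three small, tightly grouped objects plus one isolated object (made-up data).
flock = [(0.50, 0.50), (0.52, 0.51), (0.54, 0.50), (0.90, 0.10)]
overloaded = crowded_cells(flock)
```

In this example the three grouped centers all land in cell `(3, 3)`, which can predict only two boxes, so at least one object is necessarily missed.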
comparison

I am not very familiar with the other methods, so I will not compare them here.
Notes

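This paper feels like a typical application of CNNs: design the output you want (which also fixes the form of the ground-truth labels), then rely on the CNN's strong learning capacity to learn the mapping from input to output, and so obtain the function you want.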

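The input is the whole image and the network has many layers, so it ends up learning high-level representations and generalizes fairly well. But because the input is the whole image, pixels elsewhere in the image may act as noise for a given object; perhaps that is why the paper's accuracy is relatively low.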
