@nataliecai1988 · 2017-09-18

Translation of key sections of the paper "ImageNet Training in 24 Minutes"

Submission


In addition, the captions beneath Table 1, Figure 2, Table 2, Table 3, and Figure 3 (both tables and figures) still need to be translated.

Difficulty of Large-Batch Training

Asynchronous methods using a parameter server are not
guaranteed to be stable on large-scale systems (Chen et al.
2016). As discussed in (Goyal et al. 2017), the synchronized
data-parallel approach is more stable for very large DNN
training. The idea is simple: by using a large batch size for
SGD, the work for each iteration can be easily distributed
to multiple processors. Consider the following ideal case.
ResNet-50 requires 7.72 billion single-precision operations
to process one 225×225 image. If we run 90 epochs over the ImageNet
dataset, the total number of operations is 90 × 1.28 million × 7.72 billion ≈ 10^18.
Currently, the most powerful supercomputer can finish 200 × 10^15
single-precision operations per second (Dongarra et al. 2017). If there is an algorithm
allowing us to make full use of the supercomputer, we can
finish the ResNet-50 training in 5 seconds.
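
As a sanity check, here is a minimal Python sketch that reproduces this back-of-the-envelope estimate; all constants are taken from the paragraph above, and the calculation assumes the ideal case with no overheads.

```python
# Back-of-the-envelope estimate from the text above (ideal case, ignoring all overheads).
flops_per_image = 7.72e9      # single-precision operations per 225x225 image (ResNet-50)
images_per_epoch = 1.28e6     # size of the ImageNet training set
epochs = 90

total_flops = epochs * images_per_epoch * flops_per_image
peak_flops_per_sec = 200e15   # fastest supercomputer (Dongarra et al. 2017)

print(f"total operations: {total_flops:.2e}")                            # ~8.9e17, on the order of 10^18
print(f"ideal training time: {total_flops / peak_flops_per_sec:.1f} s")  # ~4.4 s, roughly 5 seconds
```
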
To do so, we need to make the algorithm use more processors
and load more data at each iteration, which corresponds
to a large batch size in SGD. Let us use one NVIDIA
M40 GPU to illustrate the single-machine case. Within a
certain range, a larger batch size makes a single GPU faster
(Figure 2), because low-level matrix
computation libraries become more efficient. The optimal
batch size per GPU is 512 for ImageNet training with the
AlexNet model. If we want to use many GPUs and make
each GPU efficient, we need a larger batch size. For example,
if we have 16 GPUs, then we should set the batch size
to 16 × 512 = 8192. Ideally, if we fix the total number of data
accesses and grow the batch size linearly with the number of
processors, the number of SGD iterations decreases linearly
while the time cost of each iteration remains constant,
so the total time also decreases linearly with the number of
processors (Table 1).
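
To make this ideal scaling argument concrete, the following minimal Python sketch walks through it. The per-GPU batch size of 512 comes from the text above; the one-second per-iteration cost is a purely hypothetical placeholder, not a measured value.

```python
# Minimal sketch of the ideal scaling argument above. The per-iteration time
# is a hypothetical constant; only its invariance across GPU counts matters.
images_per_epoch = 1.28e6
epochs = 90
batch_per_gpu = 512              # optimal per-GPU batch size for AlexNet (from the text)
seconds_per_iteration = 1.0      # hypothetical constant per-iteration cost

for num_gpus in (1, 2, 4, 8, 16):
    batch_size = num_gpus * batch_per_gpu                 # grow the batch linearly with GPUs
    iterations = epochs * images_per_epoch / batch_size   # fixed total number of data accesses
    total_time = iterations * seconds_per_iteration       # per-iteration time stays constant
    print(f"{num_gpus:2d} GPUs: batch {batch_size:5d}, "
          f"{iterations:9.0f} iterations, ideal time {total_time:9.0f} s")
```
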
However, SGD with a large batch size usually achieves
much lower accuracy than with a small batch size when both run the
same number of epochs, and currently there is no algorithm
that allows us to effectively use a very large batch size (Keskar
et al. 2016). Table 2 shows the target accuracy of standard
benchmarks. For example, when we set the batch size of
AlexNet larger than 1024 or the batch size of ResNet-50
larger than 8192, the test accuracy decreases significantly
(Table 4 and Figure 3).
For large-batch training, we need to ensure that the large-batch
run achieves test accuracy similar to the small-batch run when
running the same number of epochs. Here we fix the number
of epochs because, statistically, one epoch means the
algorithm touches the entire dataset once, and, computationally,
fixing the number of epochs means fixing the number
of floating-point operations. State-of-the-art approaches for
large-batch training include two techniques:
(1) Linear Scaling (Krizhevsky 2014): If we increase the
batch size from B to kB, we should also increase the learning
rate from η to kη.
(2) Warmup Scheme (Goyal et al. 2017): If we use a
large learning rate η, we should start from a small η and
increase it to the large η during the first few epochs.
The intuition of linear scaling is related to the number of
iterations. Let us use B, η, and I to denote the batch size, the
learning rate, and the number of iterations. If we increase
the batch size from B to kB, then the number of iterations
is reduced from I to I/k. This means that the frequency of
weight updates is reduced by a factor of k. Thus, we make each
update k times more effective by enlarging the
learning rate by a factor of k. The purpose of the warmup scheme
is to prevent the algorithm from diverging at the beginning, since
linear scaling forces us to use a very large learning rate.
With these techniques, researchers can use relatively
large batch sizes within a certain range (Table 3). However, we
observe that state-of-the-art approaches can only scale batch
size to 1024 for AlexNet and 8192 for ResNet-50. If we increase
the batch size to 4096 for AlexNet, we only achieve 53.1%
accuracy in 100 epochs (Table 4). Our target is to achieve 58%
accuracy even when using large batch sizes.
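
As a concrete illustration of the linear scaling and warmup rules described above, here is a minimal Python sketch of the resulting learning-rate schedule. The base learning rate, the multiplier k, and the 5-epoch warmup length are illustrative assumptions, not the exact settings used in the paper.

```python
# Sketch of linear scaling (Krizhevsky 2014) plus gradual warmup (Goyal et al. 2017).
# All constants below are illustrative assumptions for this example.
base_lr = 0.01          # learning rate tuned for the small batch size B
k = 8                   # batch size increased from B to k*B
warmup_epochs = 5       # length of the warmup phase (assumed here)

def learning_rate(epoch: int) -> float:
    target_lr = k * base_lr                              # linear scaling rule: eta -> k*eta
    if epoch < warmup_epochs:
        # ramp linearly from base_lr up to the scaled rate during the first few epochs
        return base_lr + (target_lr - base_lr) * (epoch + 1) / warmup_epochs
    return target_lr                                     # afterwards, keep the scaled rate

for epoch in (0, 1, 4, 5, 89):
    print(f"epoch {epoch:2d}: lr = {learning_rate(epoch):.4f}")
```
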
