
@sambodhi 2018-08-02T22:52:48.000000Z Words: 32253 Reads: 3868

The Devil of Face Recognition is in the Noise
Toward Noise-Controlled Face Recognition Datasets

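Abstract: The growing scale of face recognition datasets lets us train powerful convolutional networks for face recognition. Although a variety of architectures and loss functions have been devised, we still have a limited understanding of the source and consequences of the label noise inherent in existing datasets. In this paper we make the following contributions:
1. We contribute cleaned subsets of popular face databases, namely the MegaFace and MS-Celeb-1M datasets, and build a new large-scale, noise-controlled IMDb-Face dataset.
2. Using the original datasets and the cleaned subsets, we profile and analyze the label noise properties of MegaFace and MS-Celeb-1M. We show that a noisy dataset needs more samples to reach the accuracy yielded by a cleaned subset.
3. We study the association between different types of noise (label flips and outliers) and the accuracy of face recognition models.
4. We investigate ways to improve data cleanliness, including a comprehensive user study on how data labeling strategies affect annotation accuracy.
The IMDb-Face dataset is available at https://github.com/fwang91/IMDb-Face.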

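1 Introduction

The development of face recognition is inseparable from datasets. From the early FERET dataset to the more recent LFW, MegaFace, and MS-Celeb-1M, face recognition datasets have played an irreplaceable role in driving new techniques. Datasets have not only become more diverse; their scale has also grown dramatically. For example, MS-Celeb-1M provides about 10 million images of 100K celebrities, far more than FERET, which has only 14,126 images of 1,199 individuals. Thanks to large-scale datasets and the advent of deep learning, face recognition has achieved great success in recent years.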

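Large-scale datasets are inevitably affected by label noise. The problem is pervasive because well-annotated large-scale datasets are expensive and time-consuming to collect, which forces researchers to resort to cheap but imperfect alternatives. A common approach is to query celebrities’ images by name on search engines and then clean the labels with automatic or semi-automatic methods. Other methods apply constrained clustering to photos from social photo-sharing sites. These approaches offer a feasible way to scale up training samples, but they also introduce label noise that adversely affects model training and performance. Figure 1 shows some samples with label noise. As can be seen, both MegaFace and MS-Celeb-1M contain a considerable number of incorrect identity labels. Some noisy labels are easy to remove, but many are hard to clean. MegaFace also contains many redundant images (see the last row).

The first goal of our study is to understand the sources of label noise and its effect on face recognition with deep convolutional neural networks (CNNs). We seek answers to the following questions: How many noisy samples are needed to match the effect of clean data? What is the relationship between noise and final performance? What is the best strategy for annotating face identities? A better understanding of these questions would help us design better data collection and cleaning strategies, avoid pitfalls in training, and formulate stronger algorithms for real-world problems. To facilitate the study, we manually cleaned subsets of the two most popular face recognition databases, MegaFace and MS-Celeb-1M. We observe that a model trained on only 32% of MegaFace or 20% of MS-Celeb-1M, taken from the cleaned subsets, performs comparably to a model trained on the corresponding full dataset. The experiments show that training a face recognition model on noisy samples requires more samples.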

The second goal of our study is to build a clean face recognition dataset for the community. The dataset could help train better models and facilitate further understanding of the relationship between noise and face recognition performance. To this end, we build a clean dataset called IMDb-Face. The dataset consists of 1.7M images of 59K celebrities collected from movie screenshots and posters on the IMDb website[1]. Due to the nature of the data source, the images exhibit large variations in scale, pose, lighting, and occlusion. We carefully clean the dataset and simulate corruption by injecting noise into the training labels. The experiments show that the accuracy of face recognition decreases rapidly and nonlinearly as label noise increases. In particular, we confirm the common belief that face recognition performance is more sensitive to label flips (an example has erroneously been given the label of another class within the dataset) than to outliers (an image does not belong to any of the classes under consideration but mistakenly has one of their labels). We also conduct an experiment to analyze the reliability of different ways of annotating a face recognition dataset. We found that label accuracy correlates with the time spent on annotation. The study helps us find the source of erroneous labels and thereafter design better strategies to balance annotation cost and accuracy.

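We hope this paper sheds light on the influence of data noise on the face recognition task and points to labeling strategies that can mitigate some of the problems. We provide the community with the new IMDb-Face dataset, which can serve as relatively clean data to facilitate future studies of noise in large-scale face recognition. It can also be used as a training data source to boost the performance of existing methods, as we will show in the experiments.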

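2 How Noisy Is Existing Data?

We first introduce some popular datasets used in face recognition research and then estimate their respective signal-to-noise ratios.

2.1 Face Recognition Datasets

Table 2.1 provides a summary of representative datasets used in face recognition research.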

LFW: Labeled Faces in the Wild (LFW) [7] is perhaps the most popular dataset to date for benchmarking face recognition approaches. The database consists of 13,000 facial images of 1,680 celebrities. Images were collected from Yahoo News by running the Viola-Jones face detector. Limited by the detector, most of the faces in LFW are frontal. The dataset is considered sufficiently clean, although some incorrectly labeled matched pairs have been reported. Errata of LFW are provided at http://vis-www.cs.umass.edu/lfw/.

CelebFaces: CelebFaces [19,20] is one of the early face recognition training databases that were made publicly available. Its first version contains 5,436 celebrities and 87,628 images, and it was upgraded to 10,177 identities and 202,599 images a year later. Images in CelebFaces were collected from search engines and manually cleaned by workers.

VGG-Face: VGG-Face [15] contains 2,622 identities and 2.6M photos. More than 2,000 images per celebrity were downloaded from search engines. The authors treat the top 50 images as positive samples and train a linear SVM to select the top 1,000 faces. To avoid extensive manual annotation, the dataset was ‘block-wise’ verified, i.e., ranked images of each identity are displayed in blocks and annotators are asked to validate blocks as a whole. In this study we did not focus on VGG-Face [15], since it should have a ‘search-engine bias’ problem similar to that of MS-Celeb-1M [5].

CASIA-WebFace: The images in CASIA-WebFace [25] were collected from the IMDb website. The dataset contains 500K photos of 10K celebrities and is semi-automatically cleaned via tag-constrained similarity clustering. The authors start with each celebrity’s main photo and those photos that contain only one face. Faces are then gradually added to the dataset, constrained by feature similarity and name tag. CASIA-WebFace uses the same source as the proposed IMDb-Face dataset. However, limited by the feature and clustering steps, CASIA-WebFace may fail to recall many challenging faces.

MS-Celeb-1M: MS-Celeb-1M [5] contains 100K celebrities who are selected from a list of 1M celebrities according to their popularity. Public search engines are then leveraged to provide approximately 100 images for each celebrity, resulting in about 10M web images. The data is deliberately left uncleaned for several reasons. Specifically, cleaning a dataset of this scale requires tremendous effort. Perhaps more importantly, leaving the data in this form encourages researchers to devise new learning methods that can naturally deal with the inherent noise.

MegaFace: Kemelmacher-Shlizerman et al. [13] clean a massive number of images published on Flickr by proposing algorithms to cluster and filter face data from the YFCC100M dataset. For each user’s albums, the authors merge face pairs whose distance is smaller than β times the average distance. Clusters that contain more than three faces are kept. Then they drop ‘garbage’ groups and clean potential outliers in each group. A total of 672K identities and 4.7M images were collected. MegaFace2 avoids the ‘search-engine bias’ found in VGG-Face [15] and MS-Celeb-1M [5]. However, we found that this cluster-based approach introduces a new bias. MegaFace prefers small groups with highly duplicated images, e.g., faces captured from the same video. Limited by the base model used for clustering, a considerable number of groups in MegaFace contain noise or mix multiple people in the same group.
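The cluster-and-filter rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the MegaFace authors' implementation: the function name `cluster_faces`, the Euclidean distance, the union-find merging, and the default multiplier `beta=0.5` are all hypothetical choices; only the "merge pairs closer than a multiple of the average distance, keep clusters with more than three faces" logic comes from the text.

```python
import numpy as np

def cluster_faces(features, beta=0.5, min_size=4):
    """Merge face pairs closer than beta * mean pairwise distance.

    `beta` and `min_size` are illustrative; min_size=4 encodes
    "more than three faces".
    """
    features = np.asarray(features, dtype=float)
    n = len(features)
    # Pairwise Euclidean distances between face features.
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(n, k=1)
    threshold = beta * dist[upper].mean()

    # Union-find over faces: merging a pair joins their clusters.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in zip(*upper):
        if dist[i, j] < threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    # Keep only clusters that contain more than three faces.
    return [c for c in clusters.values() if len(c) >= min_size]
```

A clustering of this kind explains the bias noted above: near-duplicate frames from one video merge easily, while the same person photographed under different conditions can end up in separate clusters.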

2.2 An Approximation of Signal-to-Noise Ratio

Owing to the source of data and the cleaning strategies, existing large-scale datasets invariably contain label noise. In this study, we aim to profile the noise distribution in existing datasets. Our analysis may provide a hint to future research on how one should exploit the distribution of these data.

It is infeasible to obtain the exact amount of noise due to the scale of the datasets. We bypass this difficulty by randomly selecting a subset of each dataset and manually categorizing its images into three groups: ‘correct identity assigned’, ‘doubtful’, and ‘wrong identity assigned’. We select a subset of 2.7M images from MegaFace [13] and 3.7M images from MS-Celeb-1M [5]. For CASIA-WebFace [25] and CelebFaces [19,20], we sampled 30 identities to estimate their signal-to-noise ratio. The final statistics are visualized in Figure 2(a). Due to the difficulty in estimating the exact ratio, we approximate an upper and a lower bound of noisy data during the estimation. The lower bound is more optimistic and treats doubtful labels as clean data. The upper bound is more pessimistic and treats all doubtful cases as badly labeled. We provide more details on the estimations in the supplementary material. As observed in Figure 2(a), the noise percentage increases dramatically with the scale of data. This is not surprising given the difficulty of data annotation. It is noteworthy that the proposed IMDb-Face pushes the envelope of large-scale data with a very high signal-to-noise ratio (noise is under 10% of the full data).
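The two bounds can be written down directly from the three annotation groups. The sketch below assumes the counts of each group are available for the sampled subset; the function name `noise_bounds` and the example counts are hypothetical and are not figures from the paper.

```python
def noise_bounds(n_correct, n_doubtful, n_wrong):
    """Return (lower, upper) bounds on the fraction of noisy labels."""
    total = n_correct + n_doubtful + n_wrong
    # Optimistic bound: doubtful labels are counted as clean.
    lower = n_wrong / total
    # Pessimistic bound: doubtful labels are counted as noise.
    upper = (n_wrong + n_doubtful) / total
    return lower, upper


# Illustrative counts only.
lower, upper = noise_bounds(n_correct=70, n_doubtful=10, n_wrong=20)
print(f"noise between {lower:.0%} and {upper:.0%}")
```

The gap between the bounds is exactly the share of doubtful samples, so a wide interval signals that annotators often could not decide.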

We further investigate the noise distribution of the two largest public datasets to date, MS-Celeb-1M [5] and MegaFace [13]. We first categorize the identities in a dataset by their number of images. A total of six groups (bins) are established. We then plot a histogram showing the signal-to-noise ratio of each bin along the lower and upper noise bounds. As can be seen in Figure 2(b,c), both datasets exhibit a long-tailed distribution, i.e., most identities have very few images. This phenomenon is especially obvious in the MegaFace [13] dataset, since it uses automatically formed clusters to determine identities, so the same identity may be spread across different clusters. Noise across all groups in MegaFace [13] is lower than in MS-Celeb-1M [5]. However, we found that many images in the clean portion of MegaFace [13] are duplicates. In Sec. 4.2, we will perform experiments on the MegaFace and MS-Celeb-1M datasets to quantify the effect of noise on the face recognition task.

3 Building a Noise-Controlled Face Dataset

As shown in the previous section, face recognition datasets at the scale of more than a million images typically have a noise ratio higher than 30%. What about building a large-scale, noise-controlled face dataset? It could be used to train better face recognition algorithms. More importantly, it could be used to further understand the relationship between noise and face recognition performance. To this end, we seek not only a cleaner and more diverse source from which to collect face data, but also an effective way to label the data.

3.1 Celebrity Faces from IMDb

Search engines are important sources from which we can quickly construct a large-scale dataset. The widely used ImageNet [3] was built by querying images from Google Image. Most face recognition datasets were built in the same way (except MegaFace [13]). While querying search engines makes data collection convenient, it also introduces data bias. Search engines usually operate in a high-precision regime [2]. The queried images in Figure 3 tend to have a simple background with sufficient illumination, and the subjects are often in a near-frontal pose. These data are, to a certain extent, more restricted than what we observe in reality, e.g., faces in videos (IJB-A [9] and YTF [24]) and selfie photos (the millions of distractors in MegaFace). Another pitfall of crawling images from search engines is the low recall rate. We performed a simple analysis and found that, on average, the recall rate is only 40% for the first 200 photos returned for a particular name.

In this study, we turn our data collection source to the IMDb website. IMDb is more structured. It includes a diverse range of photos under each celebrity’s profile, including official photos, lifestyle photos, and movie snapshots. Movie snapshots, we believe, provide essential data samples for training a robust face recognition model. Such screenshots are rarely returned by a search engine query. In addition, the recall rate is much higher (90% on average) when we query a name on IMDb, compared with 40% from search engines. The IMDb website lists about 300K celebrities who have official and gallery photos. By crawling the IMDb website, we finally collected and cleaned 1.7M raw images of 59K celebrities.

3.2 Data Distribution

Figure 4-a presents the distribution of yaw angle in our dataset compared with MS-Celeb-1M and MegaFace. Figures 4-c, -d and -e present the age, gender and race distributions. As can be observed, images in IMDb-Face exhibit larger pose variations, and they also show diversity in age, gender and race.

3.3 How Well Can Humans Label Identity?

The data downloaded from IMDb are noisy, as multiple celebrities may co-exist in the same image. We still need to clean the dataset before it can be used for training. We take this opportunity to study how human annotators would clean a face dataset. The study will help us identify the source of noise during annotation and design a better data cleaning strategy for the full dataset.

For the purpose of the user study, we extract a small subset of 30 identities from the IMDb raw data. For each identity, we carefully select three images with confirmed identity to serve as gallery images. The remaining images of these 30 identities are treated as query images. To make the user study more challenging and statistically more meaningful, we inject 20% outliers into the query set. Next, we prepare three annotation schemes as follows. The interface of each scheme is depicted in Figure 5.

Scheme I - Draw the box: We present the target person to a volunteer by showing the three gallery faces. We then show a query image selected from the query set. The image may contain multiple persons. If the target appears in the query image, the volunteer is asked to draw a bounding box on the target. The volunteer can either confirm the selection or assign a ‘doubt’ flag to the box if he/she is not confident about the choice. ‘No target’ is selected when he/she cannot find the target person.

Scheme II - Choose 1 in 3: Similar to Scheme I, we present the target person to a volunteer by showing the gallery images. We then randomly sample three faces detected from the query set, from which the volunteer selects a single image as the target face. We ensure that all query faces have the same gender as the target person. Again, the volunteer can choose a ‘doubt’ flag if he/she is not confident about the selection, or choose ‘no target’ at all.

Scheme III - Yes or No: A binary query is perhaps the most natural and popular way to clean a face recognition set. We first rank all faces by their similarity to the probe faces in the gallery, and then ask a volunteer to decide whether each face belongs to the target person. The volunteer is allowed to answer ‘doubt’.

Which scheme to choose? Before we can quantify the effectiveness of the different schemes, we first need to generate the ground truth for these 30 identities. We use a ‘consensus’ approach. Specifically, each of the aforementioned schemes was conducted with three different volunteers, so each query face was annotated nine times across the three schemes. If four of the annotations consistently point to the same identity, we assign the query face to the targeted identity. With this ground truth, we can measure the effectiveness of each annotation scheme.
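The consensus rule lends itself to a small vote-counting sketch. It assumes each of the nine annotations is recorded either as an identity name or as `None` for ‘doubt’ or ‘no target’; the function name `consensus_label` and this encoding are assumptions for illustration, and tie handling is not specified in the text.

```python
from collections import Counter

def consensus_label(annotations, min_votes=4):
    """Return the identity backed by at least `min_votes` annotations, else None."""
    # Ignore 'doubt' / 'no target' answers, encoded here as None.
    votes = Counter(a for a in annotations if a is not None)
    if not votes:
        return None
    identity, count = votes.most_common(1)[0]
    return identity if count >= min_votes else None


# Nine annotations: three schemes, three volunteers each (illustrative values).
print(consensus_label(["A", "A", None, "A", "B", "A", None, None, "A"]))
```

With nine annotations two identities could in principle both reach four votes; this sketch simply returns whichever `Counter.most_common` lists first.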

Figure 6 shows the receiver operating characteristic (ROC) curve of each of the three schemes[2]. Scheme I achieves the highest F1 score. It recalls more than 90% of the faces with under 10% false positive samples. Finding a face and drawing a box seems to make annotators more focused on finding the right face. Scheme II provides a high true positive rate when the false positive rate is low. The existence of distractors forces annotators to work harder to match the faces. Scheme III yields the worst true positive rate when the false positive rate is low. This is not surprising, since this task is much easier than Schemes I and II. The annotators tend to make mistakes on this relaxing task, especially after a prolonged annotation process. We observe an interesting phenomenon: the longer a volunteer spends on annotating a sample, the more accurate the annotation is. Working at full speed for one hour, each volunteer can draw 180-300 faces in Scheme I, finish around 600 selections in Scheme II, or answer over 1000 binary questions in Scheme III. We believe the most reliable way to clean a face recognition dataset is to leverage both Schemes I and II to achieve high precision and recall. Limited by our budget, we only used Scheme I to clean the IMDb-Face dataset.
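For reference, the quantities used to compare the schemes follow from the four confusion counts measured against the consensus ground truth. The sketch below uses the standard definitions; the function name `annotation_rates` and the example counts are hypothetical.

```python
def annotation_rates(tp, fp, tn, fn):
    """True positive rate, false positive rate, and F1 score from confusion counts."""
    tpr = tp / (tp + fn)          # recall: share of true target faces that were kept
    fpr = fp / (fp + tn)          # share of non-target faces wrongly accepted
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    return tpr, fpr, f1


# Illustrative counts only.
tpr, fpr, f1 = annotation_rates(tp=90, fp=10, tn=90, fn=10)
print(f"TPR={tpr:.2f} FPR={fpr:.2f} F1={f1:.2f}")
```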

During the cleaning of IMDb-Face, since multiple identities may co-exist in the same image, we first annotated gallery images to confirm the queried identity. The gallery images come from the official gallery provided by the IMDb website, and most of these official gallery images contain the true identity. We ask volunteers to look through the 10 gallery images back and forth and draw a bounding box around the face that occurs most frequently. Annotators then label the rest of the queried images, guided by the three largest labeled faces as galleries. For identities with fewer than three gallery images, the queried images may be too noisy. To save labor, we did not annotate their images.

It took 50 annotators one month to clean the IMDb-Face dataset. In the end, we obtained 1.7M clean facial images from 2M raw images. We believe that the cleaning is of high quality. We estimate the noise level of IMDb-Face as the product of the approximated noise level in the IMDb raw data (2.7 ± 4.5%) and the false positive rate (8.7%) of Scheme I. The noise level is controlled under 2%. The quality of IMDb-Face is validated in our experiments.
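The estimate can be checked with a few lines of arithmetic. This sketch assumes the raw-data noise figure is read as a mean of 2.7% with a spread of 4.5 percentage points, which is an interpretation rather than something the text states explicitly; the variable names are illustrative.

```python
# Values quoted in the text; reading "2.7 +/- 4.5%" as mean and spread is an assumption.
raw_noise_mean = 0.027
raw_noise_spread = 0.045
scheme1_false_positive_rate = 0.087

# Residual noise = noise present in the raw data that Scheme I fails to reject.
estimate = raw_noise_mean * scheme1_false_positive_rate
pessimistic = (raw_noise_mean + raw_noise_spread) * scheme1_false_positive_rate

print(f"estimate:    {estimate:.4%}")
print(f"pessimistic: {pessimistic:.4%}")
```

Under this reading even the pessimistic product stays below 1%, which is consistent with the claim that the noise level is under 2%.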

4 Experiments

We divide our experiments into a few sections. First, we conduct ablation studies by simulating noise on our proposed dataset. The studies help us observe how performance deteriorates as noise increases, or when a fixed amount of clean data is diluted with noise. Second, we perform experiments on two existing datasets to further demonstrate the effect of noise. Third, we examine the effectiveness of our dataset by comparing it with other datasets under the same training conditions. Finally, we compare the model trained on our dataset with other state-of-the-art methods. Next, we describe the experimental setting.

Evaluation Metric: We report rank-1 identification accuracy on the MegaFace benchmark [8]. It is a very challenging task that evaluates the performance of face recognition methods against distractors at the million scale. The MegaFace benchmark consists of one gallery set and one probe set. The gallery set contains more than 1 million images, and the probe set consists of two existing datasets: Facescrub [14] and FGNet. We use Facescrub [14] as the MegaFace probe dataset in our experiments. Verification performance on MegaFace (reported as TPR at FPR = 10⁻⁶) is included in the supplementary material due to the page limit. We also test on LFW [7] and YTF [24] in Section 4.4.
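To make the metric concrete, the following is a simplified sketch of rank-1 identification against a gallery that contains distractors. It is not the official MegaFace evaluation protocol: the function name `rank1_accuracy`, the use of cosine similarity as the score, and the convention that distractors carry an identity of -1 are all assumptions made for illustration.

```python
import numpy as np

def rank1_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids):
    """Fraction of probes whose most similar gallery image has the same identity."""
    probe_feats = np.asarray(probe_feats, dtype=float)
    gallery_feats = np.asarray(gallery_feats, dtype=float)
    # L2-normalise so that the dot product equals cosine similarity.
    p = probe_feats / np.linalg.norm(probe_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    best = np.argmax(p @ g.T, axis=1)          # index of the top-ranked gallery image
    hits = np.asarray(gallery_ids)[best] == np.asarray(probe_ids)
    return float(np.mean(hits))


# Toy example: two known identities plus one distractor (identity -1).
gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gallery_ids = [0, 1, -1]
probes = [[0.9, 0.1], [0.1, 0.9]]
print(rank1_accuracy(probes, [0, 1], gallery, gallery_ids))
```

A probe counts as correct only if its own identity outranks every distractor, which is why accuracy drops as the number of distractors grows.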

Architecture: To better examine the effect of noise, we use the same architecture in all experiments. After a comparison among ResNet-50, ResNet-101 and Attention-56 [22], we chose Attention-56, which achieves a good balance between computation and accuracy. As a reference, the model takes 80 hours to converge on a database on an 8-GPU server with a batch size of 256. The output of Attention-56 is a 256-dimensional feature for each input image. We use cosine similarity to compute scores between image pairs.

Pre-processing: We cropped and aligned the faces, then rigidly transformed them onto a mean shape. We then resized the cropped image to 224 × 256 and subtracted the mean value of each RGB channel.
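The last pre-processing step, per-channel mean subtraction, can be sketched as follows. The text does not say how the channel means are obtained or which dimension of 224 × 256 is the width, so the sketch assumes dataset-level means passed in by the caller and a height-by-width-by-channel array; `subtract_channel_mean` and the mean values in the example are hypothetical.

```python
import numpy as np

def subtract_channel_mean(image, channel_mean):
    """Subtract one mean value per RGB channel from an H x W x 3 image."""
    image = np.asarray(image, dtype=np.float32)
    mean = np.asarray(channel_mean, dtype=np.float32)   # shape (3,), broadcast over H and W
    return image - mean


# A dummy aligned crop; the mean values below are placeholders, not the paper's.
crop = np.full((256, 224, 3), 128, dtype=np.uint8)
centered = subtract_channel_mean(crop, channel_mean=(100.0, 110.0, 120.0))
print(centered.shape, centered[0, 0])
```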

Loss: We apply three losses: Softmax [20], Center Loss [23] and A-Softmax [12]. Our implementation is based on the public implementations of these losses.

Softmax: Softmax loss is the most commonly used loss, either for model initialization or for establishing a baseline.

Center Loss: Wen et al. [23] propose center loss, which minimizes the intra-class distance to enhance the discriminative power of the features. The authors jointly train the CNN with the center loss and the softmax loss.
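As a rough illustration of what "minimizing the intra-class distance" means, the center loss term sums the squared distance between each feature and the center of its class, halved. The sketch below computes only that term for a fixed set of centers; it is not the public implementation the authors used, it omits the center update rule, and the weighting against the softmax loss (often written as a factor λ) is left to the caller.

```python
import numpy as np

def center_loss(features, labels, centers):
    """0.5 * sum_i ||x_i - c_{y_i}||^2 for features x_i with class labels y_i."""
    features = np.asarray(features, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)
    diff = features - centers[labels]      # each feature minus its own class center
    return 0.5 * float(np.sum(diff ** 2))


# Two 2-D features and two class centers at the origin (toy values).
print(center_loss([[1.0, 0.0], [0.0, 2.0]], [0, 1], [[0.0, 0.0], [0.0, 0.0]]))
```

In joint training the total objective would be the softmax loss plus this term scaled by a small weight, so that features stay separable while each class is pulled toward its center.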

A-Softmax: Liu et al. [12] formulate A-Softmax to explicitly enforce an angular margin between different identities. The weight vector of each category is restricted to a hypersphere.

4.1 Investigating the Effect of Noise on IMDb-Face

The proposed IMDb-Face dataset enables us to investigate the effect of noise. There are two common types of noise in large-scale face recognition datasets: 1) label flips: an example has erroneously been given the label of another class within the dataset; 2) outliers: an image does not belong to any of the classes under consideration but mistakenly has one of their labels. Sometimes even non-faces may be mistakenly included. To simulate the first type of noise, we randomly perturb faces into incorrect categories. For the second type, we randomly replace faces in IMDb-Face with images from MegaFace.
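The two corruption procedures can be sketched as below. This is an illustrative reconstruction under assumptions, not the authors' code: the function names, the use of NumPy's `Generator` for sampling, and the representation of images by string identifiers are all hypothetical.

```python
import numpy as np

def inject_label_flips(labels, noise_ratio, num_classes, rng):
    """Reassign a random fraction of samples to a different, incorrect class."""
    labels = np.array(labels, copy=True)
    n_noisy = int(round(noise_ratio * len(labels)))
    idx = rng.choice(len(labels), size=n_noisy, replace=False)
    # An offset in [1, num_classes - 1] guarantees the new label differs from the old one.
    offsets = rng.integers(1, num_classes, size=n_noisy)
    labels[idx] = (labels[idx] + offsets) % num_classes
    return labels, idx

def inject_outliers(image_ids, outlier_ids, noise_ratio, rng):
    """Replace a random fraction of images with images from an external pool."""
    image_ids = list(image_ids)
    n_noisy = int(round(noise_ratio * len(image_ids)))
    idx = rng.choice(len(image_ids), size=n_noisy, replace=False)
    picks = rng.choice(len(outlier_ids), size=n_noisy, replace=False)
    for i, p in zip(idx, picks):
        image_ids[i] = outlier_ids[p]      # the label stays, the face no longer matches it
    return image_ids, idx


rng = np.random.default_rng(0)
clean = np.arange(100) % 10
noisy, flipped = inject_label_flips(clean, noise_ratio=0.2, num_classes=10, rng=rng)
print(len(flipped), int(np.sum(noisy != clean)))
```

Label flips hurt twice, removing a correct sample from one class and adding a wrong one to another, whereas an outlier only adds a face that belongs to no class in the training set.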

Here we perform two experiments: 1) We gradually contaminate our dataset with different types of noise, increasing the noise in our dataset to 10%, 20% and 50%. 2) We fix the size of the clean data and ‘dilute’ it with label flips. We do not use ensemble models in these experiments.

A-Softmax, which achieves a better result on a clean dataset, becomes worse than Center Loss and Softmax in the high-noise region. 3) Outliers seem to have a less abrupt effect on performance across all losses, matching the observations in [10] and [17].

The second experiment was inspired by a recent work from Rolnick et al. [17]. They found that if a dataset contains sufficient clean data, a deep learning model can still be properly trained on it when the data is diluted by a large amount of noise. They show that a model can still achieve a feasible accuracy on CIFAR-10, even when the ratio of noise to clean data is increased to 20 : 1. Can we transfer their conclusion to face recognition? Here we sample four subsets from IMDb-Face with 1E5, 2E5, 5E5 and 1E6 images, and we dilute each of them with label-flip noise amounting to one, two, and five times its size. Figure 7(c) shows that a large performance gap still exists against the completely clean baseline, even when we maintain the same amount of clean data. We conjecture two reasons why the cleanliness of data still plays a key role in face recognition: 1) current datasets, even when clean, are still far from sufficient to address the challenging face recognition problem, so noise matters; 2) noise is more lethal on a 10,000-class problem than on a 10-class problem.
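The dilution settings imply fixed noise fractions regardless of subset size, which a short enumeration makes explicit. The helper name `dilution_plan` and the dictionary layout are hypothetical; the subset sizes and multiples are those quoted in the text.

```python
def dilution_plan(clean_sizes=(100_000, 200_000, 500_000, 1_000_000),
                  noise_multiples=(1, 2, 5)):
    """List every (clean size, noise size) setting and its resulting noise fraction."""
    plan = []
    for n_clean in clean_sizes:
        for k in noise_multiples:
            n_noise = k * n_clean          # label-flipped samples added on top of the clean set
            plan.append({
                "clean": n_clean,
                "noise": n_noise,
                "noise_fraction": n_noise / (n_clean + n_noise),
            })
    return plan


for setting in dilution_plan()[:3]:
    print(setting)
```

Equal, double, and five-fold dilution correspond to noise fractions of 1/2, 2/3, and 5/6, still far below the 20 : 1 ratio studied on CIFAR-10.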

4.2 The Effect of Noise on MegaFace and MS-Celeb-1M

To further demonstrate the effect of noise, we perform experiments on two public datasets: MegaFace and MS-Celeb-1M. In order to quantify the effect of noise on face recognition, we sampled subsets from the two datasets and manually cleaned them. This provides us with a noisy sampled subset and a clean subset for each dataset. For a fair comparison, the noisy subset was sampled to have the same distribution of images per identity as the original dataset. We also control the scale of the noisy subsets so that the scales of the clean subsets are nearly the same. Because of the large size of the sampled subsets, we chose the third labeling method described in Sec. 3.3, which is the fastest.

Three different losses, namely Softmax, Center Loss and A-Softmax, are applied to the original datasets, the sampled subsets, and the cleaned subsets, respectively. Table 2 summarizes the results on the MegaFace recognition challenge [8]. The effect of clean datasets is tremendous. Comparing the results between cleaned and sampled datasets, the average improvement in accuracy is as large as 4.14%. The accuracies on clean subsets even surpass those on the raw datasets, which are 4 times larger on average. The results suggest the effectiveness of reducing noise in large-scale datasets. As a matter of fact, the result of this experiment is part of our motivation for collecting the IMDb-Face dataset.

It is worth pointing out that recent metric-learning-based methods such as A-Softmax [12] and Center Loss [23] also benefit from learning on clean datasets, although they already perform much better than Softmax [20]. As shown in Table 2, the improvements in accuracy on MegaFace using A-Softmax and Center Loss are over 5%. The results suggest that reducing dataset noise is still helpful, especially when metric learning is performed. Reducing noisy samples could help an algorithm focus more on learning hard examples rather than picking up meaningless noise.

4.3 Comparing IMDb-Face with other Face Datasets

In the third experiment, we wish to show the competitiveness of IMDb-Face against several well-established face recognition training datasets: 1) CelebFaces [19, 20], 2) CASIA-WebFace [25], 3) MS-Celeb-1M (v1) [5], and 4) MegaFace [13]. The latter two datasets are a few times larger than the proposed IMDb-Face. Note that MS-Celeb-1M has a larger subset (v2) containing 900,000 identities. Limited by our computational resources, we did not conduct experiments on it. We do not use ensemble models in this experiment. Table 3 summarizes the results of using different datasets as the training source across three losses. We observed that the proposed noise-controlled IMDb-Face dataset is competitive as a training source despite its smaller size, validating the effectiveness of the IMDb data source and the cleanliness of IMDb-Face.

4.4 Comparison with the State of the Art

We are interested in comparing the performance of a model trained on IMDb-Face with the state of the art. Evaluation is conducted on MegaFace [8], LFW [7], and YTF [24] following the standard protocol. For LFW [7] we compute the equal error rate (EER). For YTF [24] we report recognition accuracy. To highlight the effect of training data, we do not adopt model ensembles. The comparative results are shown in Table 4. Our single model trained on IMDb-Face (A-Softmax^, IMDb-Face) achieves state-of-the-art performance on LFW, MegaFace, and YTF against published methods. It is noteworthy that the performance of our final model is also comparable to a few private methods on MegaFace.
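For readers unfamiliar with the equal error rate used for LFW, the sketch below shows one common way to approximate it from pairwise similarity scores. It is an illustrative assumption, not the evaluation code behind Table 4: the function name `equal_error_rate` is hypothetical, thresholds are swept over the observed scores, and the EER is taken as the mean of the false accept and false reject rates at the threshold where they are closest.

```python
def equal_error_rate(scores, same):
    """Approximate the EER for face verification.

    scores : similarity score per face pair (higher means more similar)
    same   : truthy if the pair shows the same identity, falsy otherwise
    """
    genuine = [s for s, y in zip(scores, same) if y]
    impostor = [s for s, y in zip(scores, same) if not y]
    if not genuine or not impostor:
        raise ValueError("need at least one genuine and one impostor pair")

    best_gap, eer = None, None
    for threshold in sorted(set(scores)):
        # A pair is accepted as "same identity" when score >= threshold.
        frr = sum(s < threshold for s in genuine) / len(genuine)     # false reject rate
        far = sum(s >= threshold for s in impostor) / len(impostor)  # false accept rate
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With few pairs the two rates may never be exactly equal, which is why the sketch reports their mean at the closest crossing rather than interpolating.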

5 Conclusion

Beyond existing efforts to develop sophisticated losses and CNN architectures, our study has investigated the problem of face recognition from the data perspective. Specifically, we developed an understanding of the source of label noise and its consequences. We also collected a new large-scale dataset from the IMDb website, which is naturally a cleaner and wilder source than search engines. Through user studies, we have discovered an effective yet accurate way to clean our data. Extensive experiments have demonstrated that both the data source and cleaning effectively improve the accuracy of face recognition. As a result of our study, we have presented a noise-controlled IMDb-Face dataset and a state-of-the-art model trained on it. A clean dataset is important because the face recognition community has been looking for large-scale clean datasets for two practical reasons: 1) to better study the training performance of contemporary deep networks as a function of the noise level in the data. Without a clean dataset, one cannot induce controllable noise to support a systematic study. 2) to benchmark large-scale automatic data cleaning methods. Although one can use the final performance of a deep network as a yardstick, this measure can be affected by many uncontrollable factors, e.g., network hyperparameter settings. A clean and large-scale dataset enables unbiased analysis.
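The idea of inducing controllable noise in a clean dataset can be illustrated with a small sketch. It follows the two noise types defined earlier in the paper (label flips and outliers) but is not the authors' code: the function names, the `rate` parameter, and the list-based data format are assumptions made for illustration.

```python
import random


def inject_label_flips(labels, rate, seed=0):
    """Give a fraction `rate` of samples the label of another class in the dataset."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    if len(classes) < 2:
        raise ValueError("label flips need at least two classes")
    noisy = list(labels)
    n_flip = round(rate * len(labels))
    for i in rng.sample(range(len(labels)), n_flip):
        # Always pick a class different from the true one.
        noisy[i] = rng.choice([c for c in classes if c != labels[i]])
    return noisy


def inject_outliers(images, outlier_pool, rate, seed=0):
    """Replace a fraction `rate` of images with images of identities outside the dataset.

    The labels are left untouched, so each replaced slot now holds an outlier
    that mistakenly carries one of the dataset's labels.
    """
    rng = random.Random(seed)
    n_out = round(rate * len(images))
    if n_out > len(outlier_pool):
        raise ValueError("not enough outlier images")
    corrupted = list(images)
    slots = rng.sample(range(len(images)), n_out)
    fillers = rng.sample(list(outlier_pool), n_out)
    for i, filler in zip(slots, fillers):
        corrupted[i] = filler
    return corrupted
```

Because both functions start from labels assumed to be correct, the resulting noise level equals `rate` only when the underlying dataset is clean, which is the point the paragraph above makes.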

[2] We should emphasize that the curves in Figure 6 differ from actual human performance on verifying arbitrary face pairs. This is because in our study the faces from a query set are very likely to belong to the same person. The ROC thus represents human accuracy on 'verifying face pairs that likely belong to the same identity'.
