@sambodhi 2018-08-02T22:52:48.000000Z Word count: 32253 Views: 2131

The Devil of Face Recognition is in the Noise

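Abstract: The growing scale of face recognition datasets has enabled us to train strong convolutional networks for face recognition. While a variety of architectures and loss functions have been devised, we still have a limited understanding of the source and consequences of the label noise inherent in existing datasets. In this paper, we make the following contributions:
1. We provide cleaned subsets of popular face databases, namely the MegaFace and MS-Celeb-1M datasets, and build a new large-scale, noise-controlled IMDb-Face dataset.
2. Using the original datasets and the cleaned subsets, we analyze the label noise properties of MegaFace and MS-Celeb-1M. We show that noisy data needs more samples to reach the same accuracy that the cleaned subsets yield.
3. We study the association between different types of noise, i.e., label flips and outliers, and the accuracy of face recognition.
4. We investigate ways to improve data cleanliness, including a comprehensive user study on how data labeling strategies influence annotation accuracy.
The IMDb-Face dataset is available at https://github.com/fwang91/IMDb-Face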

1 Introduction

The second goal of our study is to build a clean face recognition dataset for the community. The dataset could help train better models and facilitate further understanding of the relationship between noise and face recognition performance. To this end, we build a clean dataset called IMDb-Face. The dataset consists of 1.7M images of 59K celebrities collected from movie screenshots and posters on the IMDb website[1]. Due to the nature of the data source, the images exhibit large variations in scale, pose, lighting, and occlusion. We carefully clean the dataset and simulate corruption by injecting noise into the training labels. The experiments show that the accuracy of face recognition decreases rapidly and nonlinearly as label noise increases. In particular, we confirm the common belief that the performance of face recognition is more sensitive to label flips (an example has erroneously been given the label of another class within the dataset) than to outliers (an image does not belong to any of the classes under consideration, but mistakenly has one of their labels). We also conduct an experiment to analyze the reliability of different ways of annotating a face recognition dataset. We found that label accuracy correlates with the time spent on annotation. The study helps us find the source of erroneous labels and thereafter design better strategies to balance annotation cost and accuracy.


2 How Noisy is Existing Data?

2.1 Face Recognition Datasets

Table 2.1 provides a summary of representative datasets used in face recognition research.

LFW: Labeled Faces in the Wild (LFW) [7] is perhaps the most popular dataset to date for benchmarking face recognition approaches. The database consists of 13,000 facial images of 1,680 celebrities. Images are collected from Yahoo News by running the Viola-Jones face detector. Limited by the detector, most of the faces in LFW are frontal. The dataset is considered sufficiently clean, although some incorrectly labeled matched pairs have been reported. Errata of LFW are provided at http://vis-www.cs.umass.edu/lfw/.

CelebFaces: CelebFaces [19,20] is one of the early face recognition training databases made publicly available. Its first version contains 5,436 celebrities and 87,628 images, and it was upgraded to 10,177 identities and 202,599 images a year later. Images in CelebFaces were collected from search engines and manually cleaned by workers.


VGG-Face: VGG-Face [15] contains 2,622 identities and 2.6M photos. More than 2,000 images per celebrity were downloaded from search engines. The authors treat the top 50 images as positive samples and train a linear SVM to select the top 1,000 faces. To avoid extensive manual annotation, the dataset was ‘block-wise’ verified, i.e., ranked images of each identity are displayed in blocks and annotators are asked to validate each block as a whole. In this study we did not focus on VGG-Face [15], since it is likely to share the ‘search-engine bias’ problem of MS-Celeb-1M [5].

CASIA-WebFace: The images in CASIA-WebFace [25] were collected from the IMDb website. The dataset contains 500K photos of 10K celebrities, and it is semi-automatically cleaned via tag-constrained similarity clustering. The authors start with each celebrity’s main photo and those photos that contain only one face. Faces are then gradually added to the dataset, constrained by feature similarity and name tag. CASIA-WebFace uses the same source as the proposed IMDb-Face dataset. However, limited by the feature and clustering steps, CASIA-WebFace may fail to recall many challenging faces.

MS-Celeb-1M: MS-Celeb-1M [5] contains 100K celebrities who are selected from a list of 1M celebrities according to their popularity. Public search engines are then leveraged to provide approximately 100 images for each celebrity, resulting in about 10M web images. The data is deliberately left uncleaned for several reasons. Specifically, cleaning a dataset of this scale requires tremendous effort. Perhaps more importantly, leaving the data in this form encourages researchers to devise new learning methods that can naturally deal with the inherent noise.

MegaFace: Kemelmacher-Shlizerman et al. [13] clean a massive number of images published on Flickr by proposing algorithms to cluster and filter face data from the YFCC100M dataset. For each user’s albums, the authors merge face pairs whose distance is smaller than β times the average distance. Clusters that contain more than three faces are kept. Then they drop ‘garbage’ groups and clean potential outliers in each group. A total of 672K identities and 4.7M images were collected. MegaFace2 avoids the ‘search-engine’ bias found in VGG-Face [15] and MS-Celeb-1M [5]. However, we found that this cluster-based approach introduces a new bias. MegaFace prefers small groups with highly duplicated images, e.g., faces captured from the same video. Limited by the base model used for clustering, a considerable number of groups in MegaFace contain noise, and some mix multiple people in the same group.
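The merge-and-filter rule described above can be illustrated with a minimal sketch. It is not the MegaFace authors' code: the function name `cluster_faces`, the Euclidean distance, the union-find merging, and the default `beta=0.5` are all assumptions chosen for illustration, and the ‘garbage’ group removal step is omitted.

```python
import math
from itertools import combinations


def cluster_faces(features, beta=0.5, min_size=4):
    """Merge face pairs closer than beta * mean pairwise distance.

    `features` is a list of equal-length feature vectors (hypothetical input).
    Only clusters with more than three faces (min_size=4) are kept.
    """
    n = len(features)
    dists = {(i, j): math.dist(features[i], features[j])
             for i, j in combinations(range(n), 2)}
    mean_dist = sum(dists.values()) / len(dists)

    # Union-find structure used to merge pairs that fall under the threshold.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), d in dists.items():
        if d < beta * mean_dist:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return [members for members in clusters.values() if len(members) >= min_size]
```

Because merging is transitive, a single weak feature model can chain different people into one cluster, which is consistent with the mixed groups reported above.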

2.2 An Approximation of Signal-to-Noise Ratio

Owing to the source of data and cleaning strategies, existing large-scale datasets invariably contain label noise. In this study, we aim to profile the noise distribution in existing datasets. Our analysis may provide a hint to future research on how one should exploit the distribution of these data.

It is infeasible to obtain the exact amount of noise due to the scale of the datasets. We bypass this difficulty by randomly selecting a subset of a dataset and manually categorizing its images into three groups: ‘correct identity assigned’, ‘doubtful’, and ‘wrong identity assigned’. We select a subset of 2.7M images from MegaFace [13] and 3.7M images from MS-Celeb-1M [5]. For CASIA-WebFace [25] and CelebFaces [19,20], we sampled 30 identities to estimate their signal-to-noise ratio. The final statistics are visualized in Figure 2(a). Due to the difficulty in estimating the exact ratio, we approximate an upper and a lower bound of noisy data during the estimation. The lower bound is more optimistic, treating doubtful labels as clean data. The upper bound is more pessimistic, treating all doubtful cases as badly labeled. We provide more details on the estimations in the supplementary material. As observed in Figure 2(a), the noise percentage increases dramatically with the scale of data. This is not surprising given the difficulty of data annotation. It is noteworthy that the proposed IMDb-Face pushes the envelope of large-scale data with a very high signal-to-noise ratio (noise is under 10% of the full data).
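The two bounds follow directly from the three annotation groups. A minimal sketch, assuming the per-group counts are available (the function name `noise_bounds` and the counts in the usage line are made up for illustration):

```python
def noise_bounds(n_correct, n_doubtful, n_wrong):
    """Return (lower, upper) bounds on the noise ratio of a labeled sample.

    Lower bound: doubtful labels are counted as clean (optimistic).
    Upper bound: doubtful labels are counted as noise (pessimistic).
    """
    total = n_correct + n_doubtful + n_wrong
    if total == 0:
        raise ValueError("no annotated samples")
    lower = n_wrong / total
    upper = (n_wrong + n_doubtful) / total
    return lower, upper


# Hypothetical counts, not taken from the paper.
lower, upper = noise_bounds(n_correct=70, n_doubtful=10, n_wrong=20)
```

The gap between the two bounds is exactly the share of ‘doubtful’ labels, so it also indicates how uncertain the annotators were.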

We further investigate the noise distribution of the two largest public datasets to date, MS-Celeb-1M [5] and MegaFace [13]. We first categorize identities in a dataset based on their number of images. A total of six groups/bins are established. We then plot a histogram showing the signal-to-noise ratio of each bin along the noise lower and upper bounds. As can be seen in Figure 2(b,c), both datasets exhibit a long-tailed distribution, i.e., most identities have very few images. This phenomenon is especially obvious on the MegaFace [13] dataset, since it uses automatically formed clusters to determine identities, so the same identity may be distributed across different clusters. Noise across all groups in MegaFace [13] is lower than in MS-Celeb-1M [5]. However, we found that many images in the clean portion of MegaFace [13] are duplicates. In Sec. 4.2, we will perform experiments on the MegaFace and MS-Celeb-1M datasets to quantify the effect of noise on the face recognition task.

3 Building a Noise-Controlled Face Dataset

As shown in the previous section, face recognition datasets at the million scale or larger typically have a noise ratio higher than 30%. How about building a large-scale, noise-controlled face dataset? It can be used to train better face recognition algorithms. More importantly, it can be used to further understand the relationship between noise and face recognition performance. To this end, we seek not only a cleaner and more diverse source from which to collect face data, but also an effective way to label the data.

3.1 Celebrity Faces from IMDb

Search engines are important sources from which we can quickly construct a large-scale dataset. The widely used ImageNet [3] was built by querying images from Google Image. Most of the face recognition datasets were built in the same way (except MegaFace [13]). While querying search engines makes data collection convenient, it also introduces data bias. Search engines usually operate in a high-precision regime [2]. As the queried images in Figure 3 show, they tend to have a simple background with sufficient illumination, and the subjects are often in a near-frontal posture. These data are, to a certain extent, more restricted than those we could observe in reality, e.g., faces in videos (IJB-A [9] and YTF [24]) and selfie photos (millions of distractors in MegaFace). Another pitfall of crawling images from search engines is the low recall rate. We performed a simple analysis and found that, on average, the recall rate is only 40% for the first 200 photos returned for a particular name.

In this study, we turn to the IMDb website as our data collection source. IMDb is more structured. It includes a diverse range of photos under each celebrity’s profile, including official photos, lifestyle photos, and movie snapshots. Movie snapshots, we believe, provide essential data samples for training a robust face recognition model. Such screenshots are rarely returned by a search engine. In addition, the recall rate is much higher (90% on average) when we query a name on IMDb, compared with 40% from search engines. The IMDb website lists about 300K celebrities who have official and gallery photos. By crawling the IMDb website, we finally collected and cleaned 1.7M raw images of 59K celebrities.

3.2 Data Distribution

Figure 4-a presents the distribution of yaw angle in our dataset compared with MS-Celeb-1M and MegaFace. Figures 4-c, -d and -e present the age, gender and race distributions. As can be observed, images in IMDb-Face exhibit larger pose variations, and they also show diversity in age, gender and race.

3.3 How Good can Human Label Identity?

The data downloaded from IMDb are noisy, as multiple celebrities may co-exist in the same image. We still need to clean the dataset before it can be used for training. We take this opportunity to study how human annotators clean face data. The study will help us identify the source of noise during annotation and design a better data cleaning strategy for the full dataset.

For the purpose of the user study, we extract a small subset of 30 identities from the IMDb raw data. We carefully select three images with confirmed identity to serve as gallery images. The remaining images of these 30 identities are treated as query images. To make the user study more challenging and statistically more meaningful, we inject 20% outliers into the query set. Next, we prepare three annotation schemes as follows. The interface of each scheme is depicted in Figure 5.

Scheme I - Draw the box: We present the target person to a volunteer by showing the three gallery faces. We then show a query image selected from the query set. The image may contain multiple persons. If the target appears in the query image, the volunteer is asked to draw a bounding box on the target. The volunteer can either confirm the selection or assign a ‘doubt’ flag on the box if he/she is not confident about the choice. ‘No target’ is selected when he/she cannot find the target person.

Scheme II - Choose 1 in 3: Similar to Scheme I, we present the target person to a volunteer by showing the gallery images. We then randomly sample three faces detected from the query set, from which the volunteer will select a single image as the target face. We ensure that all query faces have the same gender as the target person. Again, the volunteer can choose a ‘doubt’ flag if he/she is not confident about the selection, or choose ‘no target’ at all.

Scheme III - Yes or No: Binary query is perhaps the most natural and popular way to clean a face recognition set. We first rank all faces based on their similarity to probe faces in the gallery, and then ask a volunteer to decide whether each face belongs to the target person. The volunteer is allowed to answer ‘doubt’.

Which scheme to choose?: Before we can quantify the effectiveness of the different schemes, we first need to generate the ground truth of these 30 identities. We use a ‘consensus’ approach. Specifically, each of the aforementioned schemes was conducted with three different volunteers. We ensure that each query face was annotated nine times across the three schemes. If four of the annotations consistently point to the same identity, we assign the query face to the targeted identity. With this ground truth, we can measure the effectiveness of each annotation scheme.
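The consensus rule can be sketched as a simple vote count. This is an illustrative reading of the rule, not the authors' tooling: the function name `consensus_label`, the string markers `"doubt"` and `"no_target"`, and the interpretation of "four" as "at least four" are assumptions.

```python
from collections import Counter

# Answers that do not vote for any identity (assumed encoding).
NON_VOTES = {"doubt", "no_target"}


def consensus_label(annotations, min_votes=4):
    """Return the identity backed by at least `min_votes` annotations, else None.

    `annotations` holds the nine answers collected for one query face
    (three schemes, three volunteers each).
    """
    votes = Counter(a for a in annotations if a not in NON_VOTES)
    if not votes:
        return None
    # On a tie, Counter.most_common keeps the identity that was seen first.
    identity, count = votes.most_common(1)[0]
    return identity if count >= min_votes else None


answers = ["id_7", "id_7", "doubt", "id_7", "no_target",
           "id_7", "doubt", "no_target", "id_7"]
label = consensus_label(answers)
```

A threshold of four out of nine is lenient on purpose: ‘doubt’ and ‘no target’ answers remove votes, so requiring a strict majority would discard many hard but correct faces.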

Figure 6 shows the receiver operating characteristic (ROC) curve of each of the three schemes[2]. Scheme I achieves the highest F1 score. It recalls more than 90% of the faces with under 10% false positive samples. Finding a face and drawing a box seems to make annotators more focused on finding the right face. Scheme II provides a high true positive rate when the false positive rate is low. The existence of distractors forces annotators to work harder to match the faces. Scheme III yields the worst true positive rate when the false positive rate is low. This is not surprising, since this task is much easier than Schemes I and II. The annotators tend to make mistakes given this relaxing task, especially after a prolonged annotation process. We observe an interesting phenomenon: the longer a volunteer spends on annotating a sample, the more accurate the annotation is. Working at full speed for one hour, each volunteer can draw 180-300 faces in Scheme I, finish around 600 selections in Scheme II, or answer over 1000 binary questions in Scheme III. We believe the most reliable way to clean a face recognition dataset is to leverage both Schemes I and II to achieve high precision and recall. Limited by our budget, we only used Scheme I to clean the IMDb-Face dataset.

During the cleaning of IMDb-Face, since multiple identities may co-exist in the same image, we first annotated gallery images to confirm the queried identity. The gallery images come from the official gallery provided by the IMDb website, and most of these official gallery images contain the true identity. We ask volunteers to look through the 10 gallery images back and forth and draw a bounding box around the face that occurs most frequently. Then, annotators label the rest of the queried images, guided by the three largest labeled faces as galleries. For identities with fewer than three gallery images, the queried images may contain too much noise. To save labor, we did not annotate their images.

It took 50 annotators one month to clean the IMDb-Face dataset. Finally, we obtained 1.7M clean facial images from 2M raw images. We believe that the cleaning is of high quality. We estimate the noise level of IMDb-Face as the product of the approximated noise level in the IMDb raw data (2.7 ± 4.5%) and the false positive rate (8.7%) of Scheme I. The noise level is controlled under 2%. The quality of IMDb-Face is validated in our experiments.

4 Experiments

We divide our experiments into a few sections. First, we conduct ablation studies by simulating noise on our proposed dataset. The studies help us observe the deterioration of performance in the presence of increasing noise, or when a fixed amount of clean data is diluted with noise. Second, we perform experiments on two existing datasets to further demonstrate the effect of noise. Third, we examine the effectiveness of our dataset by comparing it to other datasets under the same training conditions. Finally, we compare the model trained on our dataset with other state-of-the-art methods. Next, we describe the experimental setting.

Evaluation Metric: We report rank-1 identification accuracy on the MegaFace benchmark [8]. Evaluating face recognition methods against distractors at the million scale is a very challenging task. The MegaFace benchmark consists of one gallery set and one probe set. The gallery set contains more than 1 million images, and the probe set consists of two existing datasets: Facescrub [14] and FGNet. We use Facescrub [14] as the MegaFace probe dataset in our experiments. Verification performance on MegaFace (reported as TPR at FPR = 10^-6) is included in the supplementary material due to the page limit. We also test on LFW [7] and YTF [24] in Section 4.4.

Architecture: To better examine the effect of noise, we use the same architecture in all experiments. After a comparison among ResNet-50, ResNet-101 and Attention-56 [22], we finally chose Attention-56, which achieves a good balance between computation and accuracy. As a reference, the model converges on a database in 80 hours on an 8-GPU server with a batch size of 256. The output of Attention-56 is a 256-dimensional feature for each input image. We use cosine similarity to compute scores between image pairs.
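To make the scoring and the rank-1 metric concrete, here is a minimal sketch in plain Python. It is a simplified view of identification against distractors, not the official MegaFace evaluation code: the function names and the `(probe, mate)` pairing are assumptions, and real features would be 256-dimensional rather than the toy vectors shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank1_accuracy(probes, distractors):
    """Fraction of probes whose true mate outscores every distractor.

    `probes` is a list of (probe_feature, mate_feature) pairs, where the mate
    is another image of the same identity placed in the gallery.
    """
    hits = 0
    for probe, mate in probes:
        mate_score = cosine_similarity(probe, mate)
        best_distractor = max(cosine_similarity(probe, d) for d in distractors)
        if mate_score > best_distractor:
            hits += 1
    return hits / len(probes)


# Toy 2-D features for illustration only.
distractors = [[1.0, 0.0], [0.0, 1.0]]
probes = [([1.0, 1.0], [1.0, 0.9]), ([1.0, 0.0], [0.0, 1.0])]
accuracy = rank1_accuracy(probes, distractors)
```

With a million distractors, a single distractor scoring above the mate counts as a miss, which is why the metric is sensitive to small changes in feature quality.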

Pre-processing: We cropped and aligned the faces, then rigidly transformed them onto a mean shape. We then resized each cropped image to 224 × 256 and subtracted the mean value of each RGB channel.

Loss: We apply three losses: SoftMax [20], Center Loss [23] and A-Softmax [12]. Our implementation is based on the public implementations of these losses:

Softmax: Softmax loss is the most commonly used loss, either for model initialization or for establishing a baseline.

Center Loss: Wen et al. [23] propose center loss, which minimizes the intra-class distance to enhance the discriminative power of features. The authors jointly trained a CNN with the center loss and the softmax loss.

A-Softmax: Liu et al. [12] formulate A-Softmax to explicitly enforce the angle margin between different identities. The weight vector of each category was restricted to a hypersphere.


4.1 Investigating the Effect of Noise on IMDb-Face

The proposed IMDb-Face dataset enables us to investigate the effect of noise. There are two common types of noise in large-scale face recognition datasets: 1) label flips: an example has erroneously been given the label of another class within the dataset; 2) outliers: an image does not belong to any of the classes under consideration, but mistakenly has one of their labels. Sometimes even non-faces may be mistakenly included. To simulate the first type of noise, we randomly reassign faces to incorrect categories. For the second type, we randomly replace faces in IMDb-Face with images from MegaFace.
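The two simulations can be sketched as follows. This is an assumed implementation for illustration, not the authors' script: the function names, the uniform sampling, and the use of image identifiers in place of pixel data are all hypothetical.

```python
import random


def inject_label_flips(labels, ratio, rng):
    """Give a `ratio` share of samples the label of another class in the dataset."""
    classes = sorted(set(labels))
    noisy = list(labels)
    n_noisy = round(ratio * len(labels))
    for i in rng.sample(range(len(labels)), n_noisy):
        # Always pick a different class so every selected sample is truly flipped.
        noisy[i] = rng.choice([c for c in classes if c != labels[i]])
    return noisy


def inject_outliers(images, ratio, outlier_pool, rng):
    """Replace a `ratio` share of images with out-of-dataset images, keeping labels."""
    noisy = list(images)
    n_noisy = round(ratio * len(images))
    for i in rng.sample(range(len(images)), n_noisy):
        noisy[i] = rng.choice(outlier_pool)
    return noisy


rng = random.Random(0)  # fixed seed so the corruption is reproducible
labels = [0] * 5 + [1] * 5 + [2] * 5
flipped = inject_label_flips(labels, ratio=0.2, rng=rng)
```

The difference between the two matters for training: a flipped sample pulls one class toward another existing class, while an outlier only adds an unrelated face to a single class.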

Here we perform two experiments: 1) We gradually contaminate our dataset with different types of noise, at noise ratios of 10%, 20% and 50%. 2) We fix the size of clean data and ‘dilute’ it with label flips. We do not use ensemble models in these experiments.

A-Softmax, which used to achieve a better result on a clean dataset, becomes worse than Center loss and Softmax in the high-noise region. 3) Outliers seem to have a less abrupt effect on the performance across all losses, matching the observation in [10] and [17].

The second experiment was inspired by a recent work from Rolnick et al. [17]. They found that if a dataset contains sufficient clean data, a deep learning model can still be properly trained on it when the data is diluted by a large amount of noise. They show that a model can still achieve a feasible accuracy on CIFAR-10, even when the ratio of noise to clean data is increased to 20 : 1. Can we transfer their conclusion to face recognition? Here we sample four subsets from IMDb-Face with 1E5, 2E5, 5E5 and 1E6 images, and we dilute them with an equal amount, double, and five times as much label flip noise. Figure 7(c) shows that a large performance gap still exists against the completely clean baseline, even when we maintain the same amount of clean data. We conjecture two reasons why the cleanliness of data still plays a key role in face recognition: 1) Current datasets, even when clean, are still far from sufficient to address the challenging face recognition problem, so noise matters. 2) Noise is more lethal on a 10,000-class problem than on a 10-class problem.
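The dilution settings translate into the following dataset compositions. This is a small arithmetic sketch; the function name `dilution_plan` is hypothetical, and the only inputs are the subset sizes and dilution factors stated above.

```python
def dilution_plan(clean_sizes, factors):
    """List (clean, factor, noisy, noise_fraction) for each dilution setting.

    With k times as much noise as clean data, the noise fraction of the
    training set is k / (1 + k), regardless of the clean subset size.
    """
    return [(n, k, n * k, k / (1 + k)) for n in clean_sizes for k in factors]


plan = dilution_plan([100_000, 200_000, 500_000, 1_000_000], [1, 2, 5])
for clean, factor, noisy, fraction in plan:
    print(f"clean={clean:>9,} x{factor} noisy={noisy:>9,} noise={fraction:.1%}")
```

Even the mildest setting is already half noise, and the fivefold setting is over 83% noise, which is still far below the 20 : 1 ratio studied on CIFAR-10.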

4.2 The Effect of Noise on MegaFace and MS-Celeb-1M

To further demonstrate the effect of noise, we perform experiments on two public datasets: MegaFace and MS-Celeb-1M. In order to quantify the effect of noise on face recognition, we sampled subsets from the two datasets and manually cleaned them. This provides us with a noisy sampled subset and a clean subset for each dataset. For a fair comparison, the noisy subset was sampled to have the same distribution of images per identity as the original dataset. We also control the scale of the noisy subsets so that the scales of the clean subsets are nearly the same. Because of the large size of the sampled subsets, we chose the third labeling scheme described in Sec. 3.3 (Scheme III), which is the fastest.

Three different losses, namely SoftMax, Center Loss and A-Softmax, are each applied to the original datasets, the sampled subsets, and the cleaned subsets. Table 2 summarizes the results on the MegaFace recognition challenge [8]. The effect of clean datasets is tremendous. Comparing the results between cleaned and sampled datasets, the average improvement in accuracy is as large as 4.14%. The accuracies on the clean subsets even surpass those on the raw datasets, which are 4 times larger on average. The results suggest the effectiveness of reducing noise in large-scale datasets. As a matter of fact, the result of this experiment is part of our motivation to collect the IMDb-Face dataset.

It is worth pointing out that recent metric-learning-based methods such as A-Softmax [12] and Center Loss [23] also benefit from learning on clean datasets, although they already perform much better than Softmax [20]. As shown in Table 2, the improvements in accuracy on MegaFace using A-Softmax and Center Loss are over 5%. The results suggest that reducing dataset noise is still helpful, especially when metric learning is performed. Reducing noisy samples could help an algorithm focus more on learning hard examples, rather than picking up meaningless noise.

4.3 Comparing IMDb-Face with other Face Datasets

In the third experiment, we wish to show the competitiveness of IMDb-Face against several well-established face recognition training datasets, including: 1) CelebFaces [19,20], 2) CASIA-WebFace [25], 3) MS-Celeb-1M (v1) [5], and 4) MegaFace [13]. The latter two datasets are a few times larger than the proposed IMDb-Face. Note that MS-Celeb-1M has a larger subset (v2), containing 900,000 identities. Limited by our computational resources, we did not conduct experiments on it. We do not use ensemble models in this experiment. Table 3 summarizes the results of using different datasets as the training source across three losses. We observed that the proposed noise-controlled IMDb-Face dataset is competitive as a training source despite its smaller size, validating the effectiveness of the IMDb data source and the cleanliness of IMDb-Face.

4.4 Comparisons with State-of-the-Arts

We compare the performance of the model trained on IMDb-Face with state-of-the-art methods. Evaluation is conducted on MegaFace [8], LFW [7], and YTF [24] following the standard protocol. For LFW [7] we compute the equal error rate (EER). For YTF [24] we report recognition accuracy. To highlight the effect of training data, we do not adopt model ensembles. The comparative results are shown in Table 4. Our single model trained on IMDb-Face (A-Softmax, IMDb-Face) achieves state-of-the-art performance on LFW, MegaFace, and YTF against published methods. It is noteworthy that the performance of our final model is also comparable to a few private methods on MegaFace.

5 Conclusion

Beyond existing efforts to develop sophisticated losses and CNN architectures, our study has investigated the problem of face recognition from the data perspective. Specifically, we developed an understanding of the source of label noise and its consequences. We also collected new large-scale data from the IMDb website, which is naturally a cleaner and wilder source than search engines. Through user studies, we have discovered an effective yet accurate way to clean our data. Extensive experiments have demonstrated that both the data source and cleaning effectively improve the accuracy of face recognition. As a result of our study, we have presented a noise-controlled IMDb-Face dataset and a state-of-the-art model trained on it. A clean dataset is important, as the face recognition community has been looking for large-scale clean datasets for two practical reasons: 1) to better study the training performance of contemporary deep networks as a function of the noise level in data. Without a clean dataset, one cannot inject controllable noise to support a systematic study. 2) to benchmark large-scale automatic data cleaning methods. Although one can use the final performance of a deep network as a yardstick, this measure can be affected by many uncontrollable factors, e.g., network hyperparameter settings. A clean and large-scale dataset enables unbiased analysis.

[2] We should emphasize that the curves in Figure 6 are different from actual human performance on verifying arbitrary face pairs. This is because in our study the faces from a query set are very likely to belong to the same person. The ROC thus represents human accuracy on ‘verifying face pairs that likely belong to the same identity’.
