@gekeshi 2017-02-28T04:00:04.000000Z Words 16579 Views 288

Cryo-EM Single-Particle Image Classification: Thesis Proposal

Cryo-EM

Rationale for the topic

Major advances of cryo-EM in determining biomolecular structures

Recent instrumental and methodological developments in cryo-electron
microscopy (cryo-EM) [1–19] mean that the structures of macromolecular
complexes are now often determined at subnanometer and near-atomic resolutions [20–41]. The most exciting results in terms of resolution and size of solved structures are currently being obtained with the latest-generation cryo-electron microscopes equipped with direct electron detectors and software for automated collection of images, in combination with the use of advanced image analysis methods and high-performance computing platforms.

The bottleneck for further improving structural resolution is the classification of heterogeneous macromolecules

The achievement of such a homogeneous set of
particles is hindered by several problems. First, differences
among the images may be genuine or due to
positional factors such as rotational and/or translational
misalignment. The intrinsic structural heterogeneity
of a biochemically homogeneous population
of particles of the same biological specimen is another
important cause of differences in the projection
images. Finally, the characteristically low signal-to-noise
ratio of EM images renders this kind of
analysis very complex and difficult.
In this context, image classification is paramount
as a preprocessing step. It aims at sorting the original
population of images into different homogeneous
subpopulations in an attempt to help in the
comprehension of the specimen under study. These
different groups can later be used or discarded for
the three-dimensional reconstruction process. Because
in most of the instances no prior information
on the macromolecular structure is available, the
classification procedure can be more complicated.
Therefore, new powerful, noise-tolerant, and robust
classification techniques would be more than welcome.
Bartesaghi and collaborators have pointed out
that, rather than imaging technologies or image-processing methods,
the major bottleneck to a routine cryo-EM determination of structures
at resolutions close to 2 Å is currently the preparation of specimens of adequate quality that takes into account intrinsic protein flexibility [27].
Three-dimensional (3D) reconstruction from heterogeneous sets of
images normally results in low-resolution density maps. Thus, data heterogeneity analysis to isolate images of complexes of similar molecular compositions and conformations is a usual prerequisite to structural determination at high resolution. Biochemical procedures can usually be optimized so that the majority of complexes in the specimen, if not all of them, have the same molecular composition. However, the same composition rarely means the same conformation, due to the flexibility of complexes. Thus, conformational heterogeneity of specimens is usually analyzed by image analysis and classification methods. The reconstruction of different coexisting structures from the same sample will here be referred to as multiconformation reconstruction. It involves a classification strategy that assigns the particles having similar structures (similar molecular compositions and similar conformations) to the same class of particles. Multiconformation reconstruction is used to obtain high-resolution structures and provides insights into conformational dynamics of macromolecular complexes.

Unsupervised methods for determining particle orientation and heterogeneity from 2D images

The orientation of images can be determined based on the central
section theorem [45]. This theorem states that the Fourier transform
of a 2D projection is a plane intersecting the origin of the 3D object's
Fourier transform and that this plane is parallel to the projection plane
[45,46]. Any two non-parallel 2D projections of the same 3D object
will therefore share a common line in Fourier space. Thus, the orientation
of images can be determined by determining the relative orientation
of common lines between the 2D Fourier transforms of images
[47,48]. The 3D model of the object obtained using images and the determined
orientation is referred to as ab initio 3D model.
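The central section theorem quoted above can be checked numerically in its simplest case. The following is a minimal sketch, assuming a random test volume and a projection along the first array axis (both are illustrative choices, not part of the cited work): the 2D Fourier transform of the projection should equal the plane of the 3D Fourier transform that passes through the origin and is perpendicular to the projection direction.

```python
import numpy as np

def central_slice_check(volume: np.ndarray) -> float:
    """Largest absolute difference between the 2D FFT of a projection
    and the matching central plane of the volume's 3D FFT."""
    # Project (integrate) the volume along axis 0.
    projection = volume.sum(axis=0)
    # The plane with zero frequency along axis 0 passes through the origin.
    central_plane = np.fft.fftn(volume)[0]
    return float(np.max(np.abs(np.fft.fft2(projection) - central_plane)))

# Hypothetical test volume: Gaussian noise on a 16^3 grid.
rng = np.random.default_rng(0)
volume = rng.normal(size=(16, 16, 16))
error = central_slice_check(volume)  # expected to be at floating-point rounding level
```

Projections along other directions correspond to tilted central planes, which is why any two non-parallel projections share a common line; handling tilted planes requires interpolation and is outside this sketch.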
If the given set of images is heterogeneous, the images have to be
sorted into structurally homogeneous subsets (image sorting) and 3D
geometrical relationships among the images have to be determined
(image orienting). When using no prior 3D model, image sorting and
orienting can be performed in two separate steps or simultaneously.
In the two-step approach proposed in [49], image orienting is preceded
by a classification of images in classes of similar orientations (orientation
classes) and a classification of each orientation class in classes of
similar structures (image sorting), and both classifications are based
on 2D multivariate statistical analysis (MSA) [50,51].
The main problem with the methods in this group is their low robustness to noise. They are thus usually used with 2D average images
that have a higher signal-to-noise ratio (SNR) than individual images
[53,55]. Also, their applications in studies with more than two conformational
states have not yet been demonstrated.
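Open question: MSA-based methods have only been shown to separate two conformers. What about other methods? Is 3D classification the only way to handle more than two conformations?
Open question: how is the 2D average image obtained, and how is it used during alignment and clustering? Is the reference image this 2D average image?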

2D classification is the method for obtaining class average images

In the absence of an initial model, the noisy particle images are
first classified into groups, and a representative image, called a class average image, is built for each group from its members. These
class average images are assumed to capture the more frequently
observed views of the structure.

The 2D classification workflow

PCA

(A Bayesian method for classification of images from
electron micrographs)
In most single-particle classification techniques, the
images first undergo a decomposition of the total variance
by techniques such as correspondence analysis
(CA) or principal components analysis (PCA) (Frank
and van Heel, 1982; van Heel and Frank, 1981). These
techniques find orthogonal eigenvectors (factors) that
define the main components of interimage variation and describe every image in a new reduced-dimension coordinate
system. The factors produced by CA and PCA
are prioritized according to the eigenvalue weights that
account for the variance contribution of each factor.
Not all factors carry meaningful or signal-related information,
since noise or artifacts that are unrelated to
the shape of the macromolecule can also contribute to
the variance associated with a factor. Previous work has
addressed this and related issues. Often, factors that are
unrelated to the shape of the macromolecule (e.g., stain
depth) can be identified and have been eliminated from
the classification (Frank et al., 1982; Verschoor et al.,
1984). The effect of switching the order of factors on
classification has also been described (Frank, 1996).
These methods necessitate an extensive knowledge of the
system, tend to be time-consuming, are somewhat subjective,
and tend to break down as the signal-to-noise
ratio (SNR) decreases.
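The variance decomposition described above can be sketched with a plain PCA computed through a singular value decomposition. This is an illustration under stated assumptions, not the procedure of the cited papers: the function name `pca_features` and the synthetic image stack are hypothetical, and correspondence analysis (which uses a different metric) is not covered.

```python
import numpy as np

def pca_features(images: np.ndarray, n_factors: int):
    """Project an image stack of shape (n, h, w) onto its leading factors.

    Returns (features, factors, eigenvalues): the reduced coordinates of
    every image, the orthonormal factors (eigenimages, flattened), and the
    variance accounted for by each factor, in decreasing order.
    """
    n = images.shape[0]
    data = images.reshape(n, -1).astype(float)
    centered = data - data.mean(axis=0)          # remove the mean image
    # Rows of vt are the orthonormal eigenvectors of the covariance matrix.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    factors = vt[:n_factors]
    features = centered @ factors.T              # reduced-dimension coordinates
    eigenvalues = singular_values[:n_factors] ** 2 / (n - 1)
    return features, factors, eigenvalues

# Hypothetical stack of 50 noisy 8x8 images.
rng = np.random.default_rng(1)
stack = rng.normal(size=(50, 8, 8))
features, factors, eigenvalues = pca_features(stack, n_factors=5)
```

The eigenvalues only rank factors by variance; as the passage notes, a high-variance factor may still reflect noise or artifacts rather than the shape of the macromolecule.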

alignment

(Unsupervised Cryo-EM Data Clustering
through Adaptively Constrained K-Means
Algorithm)
If ignoring conformational dynamics of imaged macromolecules, the intrinsic difference
among projection images mainly comes from two sources: projection direction and in-plane
rotation. Prior to classification, single-particle images must be aligned to minimize the differences
in their translation and in-plane rotation. There are two popular approaches for initial
classification of 2D projection images, namely, multi-reference alignment (MRA) [19] and reference-
free alignment (RFA) [20]. In MRA, a 2D image alignment step and a data-clustering
step are performed iteratively until convergence. In the 2D image alignment step, each image
is rotated and shifted incrementally with respect to each reference. All possible correlations
between a rotated, translated image and a reference are computed. The distance between an
image and a reference is defined as the minimum of all correlation values between them. Based
on these distances, in the data-clustering step, traditional K-means clustering is used to classify
all images into many classes. An implementation of the MRA strategy can be found in SPARX
[21]. In RFA, all images are first aligned globally, which attempts to find rotations and translations
for all images that minimize the sum of squared deviation from their mean. These aligned
images are used as the input for data-clustering algorithms. This strategy was implemented in
SPIDER.
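The MRA loop described above (align every image to every reference, assign it to the best match, update the references) can be sketched as follows. This is a simplified illustration, not the SPARX implementation: it assumes square images, searches only in-plane rotations in 90-degree steps with `np.rot90`, treats translations as circular shifts found by FFT cross-correlation, and uses the maximum normalized cross-correlation as the similarity score. The function names are hypothetical.

```python
import numpy as np

def best_alignment(image, reference):
    """Return (score, aligned_image) over 4 rotations and all circular shifts."""
    ref_ft = np.fft.fft2(reference)
    norm = np.linalg.norm(image) * np.linalg.norm(reference) + 1e-12
    best_score, best_aligned = -np.inf, image
    for k in range(4):                            # 0, 90, 180, 270 degrees
        rotated = np.rot90(image, k)
        # Circular cross-correlation for every shift at once.
        cc = np.fft.ifft2(ref_ft * np.conj(np.fft.fft2(rotated))).real
        dy, dx = np.unravel_index(np.argmax(cc), cc.shape)
        score = cc[dy, dx] / norm
        if score > best_score:
            best_score = score
            best_aligned = np.roll(rotated, (dy, dx), axis=(0, 1))
    return best_score, best_aligned

def multi_reference_alignment(images, references, n_iter=5):
    """Alternate alignment/assignment and reference averaging (K-means-like)."""
    refs = [np.asarray(r, dtype=float) for r in references]
    labels = np.zeros(len(images), dtype=int)
    for _ in range(n_iter):
        aligned = []
        for i, img in enumerate(images):
            results = [best_alignment(img, ref) for ref in refs]
            labels[i] = int(np.argmax([score for score, _ in results]))
            aligned.append(results[labels[i]][1])
        # Each reference becomes the average of its aligned members.
        for c in range(len(refs)):
            members = [a for a, lab in zip(aligned, labels) if lab == c]
            if members:
                refs[c] = np.mean(members, axis=0)
    return labels, refs
```

A real implementation would search fine rotation angles with interpolation and would typically stop when the assignments no longer change rather than after a fixed number of iterations.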
(A Bayesian method for classification of images from
electron micrographs)
Because particles within a class are still randomly oriented
in the plane, they must also be aligned. Information
about the main orientations can be obtained by a
two-step process: alignment and classification. In most
common approaches, the particles are initially aligned
using one of several methods that include alignment to a
common template (Frank et al., 1978; Radermacher,
2001), reference-free alignment (Penczek et al., 1992;
Marco et al., 1996), and maximum-likelihood alignment
(Sigworth, 1998). In an alternative approach the particles
are transformed into a representation that is independent
of rotations and translations, classified, and
subsequently aligned within their classes (Schatz and
van Heel, 1990). Alignment and classification have also
been performed simultaneously, e.g., using neural networks
(Marabini and Carazo, 1994). This paper focuses
on the classification of previously aligned images. The
goals are to identify classes and to infer average images
using differences in the appearance of particles.
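The idea of classifying in a representation that is independent of rotations and translations can be illustrated with a much simpler descriptor than the one used by Schatz and van Heel (1990). The sketch below is a stand-in chosen for brevity, not their method: the Fourier amplitude spectrum is unchanged by circular shifts, and averaging it over rings of equal frequency radius removes the dependence on in-plane rotation (exactly for 90-degree rotations of a square image, approximately otherwise). The function name `radial_amplitude_profile` is hypothetical.

```python
import numpy as np

def radial_amplitude_profile(image: np.ndarray) -> np.ndarray:
    """Ring-averaged Fourier amplitude of a square image.

    Shifting the image only changes Fourier phases, so the amplitude is
    shift-invariant; averaging over rings discards the rotation angle.
    """
    n = image.shape[0]
    amplitude = np.abs(np.fft.fft2(image))
    freq = np.fft.fftfreq(n) * n                       # integer frequencies
    radius = np.rint(np.hypot(freq[:, None], freq[None, :])).astype(int)
    sums = np.bincount(radius.ravel(), weights=amplitude.ravel())
    counts = np.bincount(radius.ravel())
    return sums / np.maximum(counts, 1)

# Hypothetical check: a rotated and shifted copy gives the same descriptor.
rng = np.random.default_rng(2)
particle = rng.normal(size=(16, 16))
moved = np.roll(np.rot90(particle), (3, 5), axis=(0, 1))
same = np.allclose(radial_amplitude_profile(particle),
                   radial_amplitude_profile(moved))
```

Such a descriptor throws away a great deal of information (all phases and all angular structure), which is one reason more elaborate invariants are used in practice.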

clustering

(Unsupervised Cryo-EM Data Clustering
through Adaptively Constrained K-Means
Algorithm)
MSA reduces the dimensionality of images by
projecting them into a subspace spanned by several eigenvectors, which are also called features.
Reducing dimensionality not only accelerates computing but also denoises projection images.
The resulting features can also be used as the references for image alignment. For example,
EMAN2 combines MSA with MRA (MSA/MRA) (see its script e2refine2d.py) [25]. It first generates
translational and rotational invariants for initial classification. Then, an MSA step is iterated
with an MRA step, in which images are aligned to those features and classified by the K-means
algorithm, until a pre-defined number of iterations is reached.
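The clustering step applied to the reduced features can be sketched with a basic K-means in NumPy. This is a generic textbook version under assumed inputs (a feature matrix with one row per image), not the code used by EMAN2; the function name and defaults are illustrative.

```python
import numpy as np

def kmeans(features: np.ndarray, k: int, n_iter: int = 50, seed: int = 0):
    """Plain K-means on a (n_images, n_features) matrix.

    Returns (labels, centers). Centers are initialized from k distinct
    data points chosen at random.
    """
    rng = np.random.default_rng(seed)
    start = rng.choice(len(features), size=k, replace=False)
    centers = features[start].astype(float)
    labels = None
    for _ in range(n_iter):
        # Distance of every image to every cluster center.
        distances = np.linalg.norm(
            features[:, None, :] - centers[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = features[labels == c]
            if len(members):                       # keep old center if empty
                centers[c] = members.mean(axis=0)
    return labels, centers
```

In an MSA/MRA-style pipeline the rows of `features` would be the factor coordinates of aligned images, and the class averages would be computed from the images sharing a label.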

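What convolutional neural networks can do for 2D classification that earlier algorithms cannot

What are the shortcomings of earlier classification methods...

Denoising and feature extraction

Rotation and translation invariance

Image classification without aligning the particles

High classification accuracy

Related work

Algorithms related to 2D classification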

Over 25 years of research has
gone into methods for creating high signal-to-noise class average
images. These include multivariate statistical analysis (MSA; van
Heel and Frank, 1981; van Heel, 1984), the ‘alignment through
classification’ method (Harauz et al., 1988), Bayesian modeling
with Gibbs sampling (Samsó et al., 2002), and maximum likelihood
methods using single (Sigworth, 1998) and multiple references
(Scheres et al., 2005b). The higher signal-to-noise class average
images can then be oriented relative to each other by the common
lines method (Penczek et al., 1996) or by angular reconstitution (van
Heel et al., 1997; van Heel, 1987), and 3D structure determination
can be performed using the Fourier Slice Theorem.

（A Bayesian method for classification of images from
electron micrographs）
Because particles within a class are still randomly oriented
in the plane, they must also be aligned. Information
about the main orientations can be obtained by a
two-step process: alignment and classification. In most
common approaches, the particles are initially aligned
using one of several methods that include alignment to a
common template (Frank et al., 1978; Radermacher,
2001), reference-free alignment (Penczek et al., 1992;
Marco et al., 1996), and maximum-likelihood alignment
(Sigworth, 1998). In an alternative approach the particles
are transformed into a representation that is independent
of rotations and translations, classified, and
subsequently aligned within their classes (Schatz and
van Heel, 1990). Alignment and classification have also
been performed simultaneously, e.g., using neural networks
(Marabini and Carazo, 1994).

Improvements to MSA

《A Bayesian method for classification of images from electron micrographs》

In this work, we introduce a new classification algorithm
based on Bayesian statistics. The technique employs
a Gibbs sampling algorithm, a form of a Markov
chain Monte Carlo (MCMC) sampling. The algorithm is
conceptually related to algorithms developed for the
alignment of nucleic acid/protein sequences (Lawrence
et al., 1993; Liu et al., 1995, 1999). The algorithm is able
to identify nonisotropic (elliptically shaped) clusters
arising from differences in variances of factors within
classes, where the orientation of such clusters is dependent
on the correlation between factors. It also employs
a novel Bayesian approach to distinguish factors useful
for classification from those that are not. The goal of
this work is to improve classification of image series with
low SNR.

《Unsupervised Cryo-EM Data Clustering through Adaptively Constrained K-Means Algorithm》

（Structural Study of Heterogeneous Biological Samples by
Cryoelectron Microscopy and Image Processing）
More recently, new approaches have been proposed in which a distance metric
learned from training data is used to improve the prediction
performance of K-means clustering methods [70]. Recently, an
Extended Nearest Neighbour (ENN) method for pattern
recognition has been described in which a distance-weighted
approach is used. Improvement of the efficiency in ENN is
achieved by a preprocessing step where a subset (randomly
selected) of the dataset is used to make a classification decision.
Then all elements in the dataset are ranked according to
the distances from the initial classes and assignment to a class
is done to maximize the intraclass coherence [104].

《Schatz, M., van Heel, M., 1990. Invariant classification of molecular views in electron micrographs. Ultramicroscopy 32, 255–264.》

Used to distinguish which features are orientation-related

SOM

Alignment and classification have also
been performed simultaneously, e.g., using neural networks
(Marabini and Carazo, 1994).

《A Novel Neural Network Technique for Analysis and Classification of EM Single-Particle Images》

SOM is such a method: It takes the original
large set of data and produces a reduced set of
good-quality “representatives” of the same reality.
These representatives, usually called “code vectors”
in neural network terminology, have the property of
being ordered over the grid, and in this way the map
tends to preserve the topological characteristics of
the input data. Another advantage of this approach
is that it requires no prior knowledge of the data set
under analysis, something that is generally common
in EM.
The main idea is to use SOM to create this
reduced set of representatives that, due to the training
technique used by the method, have improved
their signal-to-noise ratio. This set of code vectors
also comprises all the information about the original
data that they represent. In this way a posteriori
classification is possible.
The algorithm we present here possesses a number of interesting
properties that may solve some of the problems
encountered when following the procedures described
above. As we will show, if the image data set is homogeneous,
then the method is able to directly analyze sets of
images that have not been rotationally aligned, by performing
a classification based on the particle rotation angle, therefore
eliminating the reference problem. Alternatively, if the
population of images is heterogeneous but they are known to
be correctly aligned, then the algorithm concentrates on
genuine particle differences by performing a classification
without any prior data reduction step.
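The way a self-organizing map produces an ordered grid of code vectors can be sketched as below. This is a generic online SOM with a Gaussian neighborhood and linearly decaying learning rate and radius, written under those assumptions; it is not the specific algorithm of the cited paper, and all names and defaults are illustrative.

```python
import numpy as np

def train_som(data, grid_shape=(4, 4), n_epochs=20, lr0=0.5, seed=0):
    """Train a small SOM on data of shape (n_samples, dim).

    Returns the code vectors as an array of shape (rows, cols, dim).
    """
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    n, dim = data.shape
    # Grid position of every node, used by the neighborhood function.
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)],
                      dtype=float)
    # Initialize code vectors from randomly chosen samples.
    codebook = data[rng.choice(n, size=rows * cols, replace=True)].astype(float)
    sigma0 = max(rows, cols) / 2.0
    total_steps, step = n_epochs * n, 0
    for _ in range(n_epochs):
        for idx in rng.permutation(n):
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                    # decaying learning rate
            sigma = max(sigma0 * (1.0 - frac), 0.5)    # shrinking neighborhood
            x = data[idx]
            winner = np.argmin(np.sum((codebook - x) ** 2, axis=1))
            grid_dist2 = np.sum((coords - coords[winner]) ** 2, axis=1)
            influence = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            # Pull the winner and its grid neighbors toward the sample.
            codebook += lr * influence[:, None] * (x - codebook)
            step += 1
    return codebook.reshape(rows, cols, dim)

def map_to_nodes(data, som):
    """Index of the best-matching code vector for every sample."""
    flat = som.reshape(-1, som.shape[-1])
    d2 = np.sum((data[:, None, :] - flat[None, :, :]) ** 2, axis=2)
    return d2.argmin(axis=1)
```

For images, each row of `data` would be a flattened particle image; because every code vector is a running weighted average of many images, it tends to be less noisy than any single image, which is what makes an a posteriori classification of the code vectors attractive.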

maximum likelihood

(Maximum-likelihood multireference
refinement for electron microscopy images)
(A maximum-likelihood approach to single-particle image refinement. J. Struct. Biol.
122:328–39)
（Structural Study of Heterogeneous Biological Samples by
Cryoelectron Microscopy and Image Processing）

Applications of convolutional neural networks in biomedicine and in cryo-EM

3d classification

multireference classification

Unsupervised multireference classification methods first assign each image to
the best-matching reference from the set of given 3D references (by projection
matching) and then compute 3D reconstruction from images
assigned to the same 3D reference. However, contrary to supervised
multireference classification methods, these methods use the 3D reconstructions
obtained in the first iteration to update the 3D references for
the next iteration. The iterations consisting of alignment, classification,
and reconstruction steps are repeated until obtaining stable 3D references,
which allows obtaining new structural features (on the 3D
references) with new iterations. An iterative procedure consisting of
alignment-classification rounds was initially used in 2D work (using 2D
class averaging instead of 3D reconstruction from image classes), where
it was referred to as multi-reference alignment [71]. This approach is a
version of K-means clustering algorithm, which estimates the unknown
cluster centers based on the data and assigns the data to the nearest cluster
(the nearest cluster center). In the EM context, the cluster centers are
the reference structures (at a current iteration) and the usual measure of
distance between images and the centers of clusters is the correlation between
the images and the projections of the reference structures. Another
version of this approach adds new 3D references progressively [72]. This
approach, referred to as incremental K-means-like approach [72], starts
by aligning the entire set of images using only one initial reference.
After stabilizing the particle orientation parameters, it adds a new reference;
if the two references are not too similar, the second reference will
attract the particles that do not fit well the first reference and these particles
will be used to update this second reference. After a refinement of the
two-reference alignment (using a decreasing angular step size and a decreasing
search range) and a stabilization of the orientation parameters,
a new reference may be added and the process repeated until the number of references starts to exceed the number of intrinsic divisions within the
dataset, which can be observed as a poor reconstruction (usually from a
very small subpopulation of particles) after adding a new reference [72].
The particle images assigned to corrupt reconstructions are assumed to
contain degraded particles and are removed.
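One approach first fixes several (e.g., two) initial models, then aligns, clusters, and reconstructs against each of them, and iterates.
The other starts from a single 3D reference model, performs 3D alignment and reconstruction on the whole dataset until the model stabilizes, then adds another model, re-clusters, realigns, and reconstructs the data, and keeps adding models until the reconstruction quality degrades.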
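The second strategy (start with one reference and add references one at a time) can be sketched in feature space, leaving out 3D alignment and reconstruction entirely. This is a loose analogue under strong assumptions, not the method of [72]: cluster centers stand in for 3D references, the new reference is seeded with the worst-fitting data point, and "reconstruction quality degrades" is replaced by a hypothetical stopping rule (`min_gain`) that stops when an extra center no longer reduces the total squared error by a meaningful fraction.

```python
import numpy as np

def assign(features, centers):
    """Nearest-center label and distance for every sample."""
    d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1), d.min(axis=1)

def refine(features, centers, n_iter=20):
    """K-means updates starting from the given centers."""
    centers = centers.astype(float).copy()
    for _ in range(n_iter):
        labels, _ = assign(features, centers)
        for c in range(len(centers)):
            if np.any(labels == c):
                centers[c] = features[labels == c].mean(axis=0)
    return centers

def incremental_kmeans(features, max_refs=8, min_gain=0.3):
    """Add centers one by one until an extra center stops paying off."""
    centers = features.mean(axis=0, keepdims=True)   # single initial reference
    _, dist = assign(features, centers)
    sse = float(np.sum(dist ** 2))
    while len(centers) < max_refs:
        # Seed the new reference with the sample that fits worst so far.
        seed = features[dist.argmax()]
        candidate = refine(features, np.vstack([centers, seed]))
        _, cand_dist = assign(features, candidate)
        cand_sse = float(np.sum(cand_dist ** 2))
        # Assumed stopping rule: too little improvement means too many classes.
        if sse == 0.0 or sse - cand_sse < min_gain * sse:
            break
        centers, dist, sse = candidate, cand_dist, cand_sse
    labels, _ = assign(features, centers)
    return labels, centers
```

In the procedure described above the decision to stop is based on inspecting the reconstructions themselves, so the error-based rule here should be read only as a placeholder for that judgment.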

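Research content

Research objectives

This project studies the classification of cryo-EM single-particle images and uses convolutional neural networks to improve 2D classification.
Compared with existing classification methods, convolutional neural networks have advantages in image denoising, in simplifying the image-processing pipeline, and in improving processing speed and classification quality.

Main research content

This project focuses on how to apply relatively mature convolutional neural network classification methods to the classification of cryo-EM images.
Cryo-EM single-particle images have a low signal-to-noise ratio and are subject to translation and in-plane rotation; traditional classification methods have shortcomings (which ones?).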

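Open questions and approaches

Obtain a training set by applying a traditional clustering method to one kind of particle

The training set is too small; train iteratively?

Is the sole purpose of alignment to remove translation and in-plane rotation so that K-means clustering can work? If the classification method (feature engineering) is rotation- and translation-invariant, does that mean alignment is only needed after classification, to produce the 2D average images?

What are the shortcomings of the existing MSA, SOM, and maximum-likelihood methods? Are they slow, and if so, how slow? Do they give poor results, and if so, why?

What is the purpose of 2D classification in RELION? Does the ML3D method need 2D average images? Is it to obtain an initial model, or to compute orientation angles so as to reduce computation? If so, does improving 2D classification help improve the reconstruction?

Software such as RELION currently uses the ML3D method; is improving 2D classification still worthwhile in that setting?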
