@floatsd 2016-05-02

Learning in higher dimensions

translation_tech


Link to the original article

Key takeaways

David Beyer: Let’s start with your background.

Anima Anandkumar: I have been fascinated with mathematics since my childhood—its uncanny ability to explain the complex world we live in. During my college days, I realized the power of algorithmic thinking in computer science and engineering. Combining these, I went on to complete a Ph.D. at Cornell University, then a short postdoc at MIT before moving to the faculty at UC Irvine, where I’ve spent the past six years.

During my Ph.D., I worked on the problem of designing efficient algorithms for distributed learning. More specifically, when multiple devices or sensors are collecting data, can we design communication and routing schemes that perform “in-network” aggregation to reduce the amount of data transported, and yet, simultaneously, preserve the information required for certain tasks, such as detecting an anomaly? I investigated these questions from a statistical viewpoint, incorporated probabilistic graphical models, and designed algorithms that significantly reduce communication requirements. Ever since, I have been interested in a range of machine learning problems.

Modern machine learning naturally occurs in a world of higher dimensions, generating a lot of multivariate data in the process, including a large amount of noise. Searching for useful information hidden in this noise is challenging; it is like the proverbial needle in a haystack.

The first step involves modeling the relationships between the hidden information and the observed data. Let me explain this with an example. In a recommender system, the hidden information represents users’ unknown interests, and the observed data consist of products they have purchased thus far. If a user recently bought a bike, she is interested in biking/outdoors, and is more likely to buy biking accessories in the near future. We can model her interest as a hidden variable and infer it from her buying pattern. To discover such relationships, however, we need to observe a whole lot of buying patterns from a lot of users—making this problem a big data one.
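
To make the hidden-variable picture concrete, here is a minimal, hypothetical numpy sketch of the generative story described above: an unobserved "interest" variable drives which products a user is likely to buy. The interest names, products, and probabilities are invented for illustration and are not taken from Anandkumar's work; learning runs this process in reverse, inferring the hidden interests from many observed purchase patterns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical generative view: a hidden "interest" drives the observed purchases.
interests = ["outdoors", "cooking"]                     # values of the hidden variable
products = ["bike", "helmet", "oven", "flour"]
purchase_prob = np.array([
    [0.8, 0.7, 0.1, 0.1],   # P(product bought | interest = outdoors)
    [0.1, 0.1, 0.8, 0.7],   # P(product bought | interest = cooking)
])

z = rng.integers(len(interests))                        # hidden interest, never observed directly
bought = rng.random(len(products)) < purchase_prob[z]   # observed buying pattern

print("hidden interest:", interests[z])
print("observed purchases:", dict(zip(products, bought.tolist())))
```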

My work currently focuses on the problem of efficiently training such hidden variable models on a large scale. In such an unsupervised approach, the algorithm automatically seeks out hidden factors that drive the observed data. Machine learning researchers, by and large, agree this represents one of the key unsolved challenges in our field.

I take a novel approach to this challenge and demonstrate how tensor algebra can unravel these hidden, structured patterns without external supervision. Tensors are higher dimensional extensions of matrices. Just as matrices can represent pairwise correlations, tensors can represent higher order correlations. My research reveals that operations on higher order tensors can be used to learn a wide range of probabilistic latent variable models efficiently.
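
A small numpy sketch may help fix the terminology: the empirical second-order moment of a dataset is a matrix of pairwise correlations, while the third-order moment is a 3-way tensor of triple correlations. The data here are random toy numbers, used only to show the objects and their shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))   # 1000 samples of a 5-dimensional variable (toy data)

# Second-order moment E[x ⊗ x]: a matrix of pairwise correlations.
M2 = np.einsum('ni,nj->ij', X, X) / len(X)

# Third-order moment E[x ⊗ x ⊗ x]: a 3-way tensor of triple correlations.
M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / len(X)

print(M2.shape)   # (5, 5)
print(M3.shape)   # (5, 5, 5)
```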

DB: What are the applications of your method?

AA: We have shown applications in a number of settings. For example, consider the task of categorizing text documents automatically without knowing the topics a priori. In such a scenario, the topics themselves constitute hidden variables that must be gleaned from the observed text. A possible solution might be to learn the topics using word frequency, but this naive approach doesn’t account for the same word appearing in multiple contexts.

What if, instead, we look at the co-occurrence of pairs of words? This is a more robust strategy than single word frequencies. But why stop at pairs? Why not examine the co-occurrences of triplets of words, and so on, into higher dimensions? What additional information might these higher order relationships reveal? Our work has demonstrated that uncovering hidden topics using the popular Latent Dirichlet allocation (LDA) requires third-order relationships; pairwise relationships are insufficient.
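
The sketch below counts pairwise and triple word co-occurrences on a made-up three-document corpus, just to show what "pairs versus triples" means in code. The actual spectral LDA estimator works with properly weighted and centered versions of these moments, so treat this as an illustration rather than the method itself.

```python
import numpy as np

# Made-up corpus; a real application would use bag-of-words counts over a large vocabulary.
docs = [["bike", "ride", "helmet"],
        ["bike", "ride", "trail"],
        ["flour", "oven", "bread"]]
vocab = sorted({word for doc in docs for word in doc})
index = {word: i for i, word in enumerate(vocab)}
V = len(vocab)

pairs = np.zeros((V, V))        # second-order co-occurrence (a matrix)
triples = np.zeros((V, V, V))   # third-order co-occurrence (a 3-way tensor)

for doc in docs:
    ids = [index[word] for word in doc]
    for i in ids:
        for j in ids:
            if i != j:
                pairs[i, j] += 1
            for k in ids:
                if len({i, j, k}) == 3:
                    triples[i, j, k] += 1

print(pairs.shape, triples.shape)   # (7, 7) (7, 7, 7)
```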

The above intuition is broadly applicable. Take networks, for example. You might try to discern hidden communities by observing the interaction of their members, examples of which include friendship connections in social networks, buying patterns in recommender systems, or neuronal connections in the brain. My research reveals the need to investigate at least at the level of “friends of friends” or higher order relationships to uncover hidden communities. Although such functions have been used widely before, we were the first to show the precise information they contain and how to extract them in a computationally efficient manner.
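
As a toy illustration of why "friends of friends" carries extra information, take the adjacency matrix A of a small invented friendship graph: A squared counts length-two paths (friends of friends), and the trace of A cubed counts triangles. The community-recovery guarantees discussed here rest on statistics of exactly this higher-order kind.

```python
import numpy as np

# Adjacency matrix of a small, invented friendship graph with 4 people.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

friends_of_friends = A @ A             # entry (i, j): number of length-2 paths from i to j
triangles = np.trace(A @ A @ A) // 6   # each triangle is counted 6 times in trace(A^3)

print(friends_of_friends)
print("triangles:", triangles)         # 1 triangle here, among people 0, 1, and 2
```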

We can extend the notion of hidden variable models even further. Instead of trying to discover one hidden layer, we look to construct a hierarchy of hidden variables instead. This approach is better suited to a certain class of applications, including, for example, modeling the evolutionary tree of species or understanding the hierarchy of disease occurrence in humans. The goal in this case is to learn both the hierarchical structure of the latent variables as well as the parameters that quantify the effect of the hidden variables on the given observed data.

The resulting structure reveals the hierarchical groupings of the observed variables at the leaves and the parameters quantify the “strength” of the group effect on the observations at the leaf nodes. We then simplify this to finding a hierarchical tensor decomposition, for which we have developed efficient algorithms.

DB: So, why are tensors themselves crucial in these applications?

AA: First, I should note these tensor methods aren’t just a matter of theoretical interest; they can provide enormous speedups in practice and even better accuracy, evidence of which we’re seeing already. Kevin Chen from Rutgers University gave a compelling talk at the recent NIPS workshop on the superiority of these tensor methods in genomics: it offered better biological interpretation and yielded a 100x speedup when compared to the traditional expectation maximization (EM) method.

Tensor methods are so effective because they draw on highly optimized linear algebra libraries and can run on modern systems for large-scale computation. In this vein, my student, Furong Huang, has deployed tensor methods on Spark, and it runs much faster than the variational inference algorithm, the default for training probabilistic models. All in all, tensor methods are now embarrassingly parallel and easy to run at large scale on multiple hardware platforms.
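
The reason these estimators parallelize so well is that an empirical moment tensor is a plain sum over samples, so each worker can process its own slice of the data and the partial results are simply added. The minimal numpy sketch below mimics that map-and-reduce structure on toy data; the Spark deployment mentioned above distributes the same kind of computation across partitions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))   # toy data: 10,000 samples, 5 variables

# The empirical third-order moment is a sum over samples, so it splits cleanly
# across chunks of the data ("map") and the partial tensors are added up ("reduce").
chunks = np.array_split(X, 8)
partials = [np.einsum('ni,nj,nk->ijk', c, c, c) for c in chunks]   # map step
M3 = sum(partials) / len(X)                                        # reduce step

print(M3.shape)   # (5, 5, 5)
```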

DB: Is there something about tensor math that makes it so useful for these high-dimensional problems?

AA: Tensors model a much richer class of data, allowing us to grapple with multirelational data—both spatial and temporal. The different modes of the tensor, or the different directions in the tensor, represent different kinds of data.

At its core, the tensor describes a richer algebraic structure than the matrix and can thereby encode more information. For context, think of matrices as representing rows and columns—a two-dimensional array, in other words. Tensors extend this idea to multi‐dimensional arrays.
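
In code, the jump from matrix to tensor is simply the jump from a two-dimensional to a multi-dimensional array, with each axis (mode) free to index a different kind of data. The users/products/weeks labels below are hypothetical.

```python
import numpy as np

# A 3-way array: mode 0 = users, mode 1 = products, mode 2 = weeks (hypothetical labels).
T = np.arange(4 * 3 * 2).reshape(4, 3, 2)
print(T.ndim, T.shape)   # 3 (4, 3, 2)

# A matrix is the special two-mode case, e.g. one slice of the tensor.
M = T[:, :, 0]
print(M.ndim, M.shape)   # 2 (4, 3)
```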

A matrix, for its part, is more than just columns and rows. You can sculpt it to your purposes through the math of linear operations, the study of which is called linear algebra. Tensors build on these malleable forms and their study, by extension, is termed multilinear algebra.
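
Where linear algebra reshapes a matrix by multiplying along its rows or columns, multilinear algebra applies a linear map along any chosen mode of a tensor. A minimal sketch of a mode-0 product, the tensor analogue of the matrix product A @ M, follows; the shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.normal(size=(4, 3, 2))   # a 3-way tensor
A = rng.normal(size=(5, 4))      # a linear map applied along mode 0

# Mode-0 product: contract A with the first mode of T, leaving the other modes untouched.
T_new = np.einsum('ia,ajk->ijk', A, T)
print(T_new.shape)   # (5, 3, 2)
```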

Given such useful mathematical structures, how can we squeeze them for information? Can we design and analyze algorithms for tensor operations? Such questions require a new breed of proof techniques built around non-convex optimization.

DB: What do you mean by convex and non-convex optimization?

AA: The last few decades have delivered impressive progress in convex optimization theory and technique. The problem, unfortunately, is that most optimization problems are not by their nature convex.

Let me expand on the issue of convexity by example. Let’s say you’re minimizing a parabolic function in one dimension: if you make a series of local improvements (at any starting point in the parabola) you are guaranteed to reach the best possible value. Thus, local improvements lead to global improvements. This property even holds for convex problems in higher dimensions. Computing local improvements is relatively easy using techniques such as gradient descent.
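
A few lines of code show the convex case being described: on a parabola, plain gradient descent (repeated local improvements) reaches the global minimum from any starting point. The particular function, step size, and iteration count are arbitrary choices for the illustration.

```python
# Minimize the convex parabola f(x) = (x - 3)^2 by repeated local improvements.
grad = lambda x: 2.0 * (x - 3.0)

x = -10.0                  # any starting point works for a convex function
for _ in range(200):
    x -= 0.1 * grad(x)     # one local improvement (a gradient-descent step)

print(round(x, 4))         # ~3.0: local steps reach the global minimum
```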

The real world, by contrast, is more complicated than any parabola. It contains a veritable zoo of shapes and forms. This translates to parabolas far messier than their ideal counterparts: any optimization algorithm that makes local improvements will inevitably encounter ridges, valleys, and flat surfaces; it is constantly at risk of getting stuck in a valley or some other roadblock—never reaching its global optimum.

As the number of variables increases, the complexity of these ridges and valleys explodes. In fact, there can be an exponential number of points where algorithms based on local steps, such as gradient descent, become stuck. Most problems, including the ones on which I am working, encounter this hardness barrier.
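
The contrast drawn in the last two paragraphs can be seen on a one-dimensional toy function with two valleys: gradient descent ends in a different valley depending on where it starts, and only one of them is the global minimum. In higher dimensions the number of such traps can grow exponentially, which is the hardness barrier being described.

```python
# A non-convex toy function with two valleys: f(x) = x**4 - 3*x**2 + x.
grad = lambda x: 4 * x**3 - 6 * x + 1

for x0 in (-2.0, 2.0):               # two different starting points
    x = x0
    for _ in range(2000):
        x -= 0.01 * grad(x)          # local improvements only
    print(f"start {x0:+.1f} -> ends near {x:+.3f}")

# One run reaches the global minimum (near -1.30); the other gets stuck in the
# shallower valley (near +1.13) and never escapes it.
```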

DB: How does your work address the challenge of non-convex optimization?

AA: The traditional approach to machine learning has been to first define learning objectives and then to use standard optimization frameworks to solve them. For instance, when learning probabilistic latent variable models, the standard objective is to maximize likelihood, and then to use the expectation maximization (EM) algorithm, which conducts a local search over the objective function. However, there is no guarantee that EM will arrive at a good solution. As it searches over the objective function, what may seem like a global optimum might merely be a spurious local one. This point touches on the broader difficulty with machine learning algorithm analysis, including backpropagation in neural networks: we cannot guarantee where the algorithm will end up or if it will arrive at a good solution.
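
To make the EM discussion concrete, here is a deliberately simplified EM loop for a two-component, one-dimensional Gaussian mixture, assuming known unit variances and equal mixing weights so that only the two means are estimated. It is a toy sketch rather than the general algorithm, and the outcome depends entirely on where the local search starts, which is exactly the weakness being described.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data drawn from a mixture of two 1-D Gaussians with means -2 and 3 (unit variances).
x = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])

mu = np.array([-0.5, 0.5])   # initial guesses for the two means
for _ in range(100):
    # E-step: responsibility of each component for each point (equal weights, unit variances).
    dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate each mean as a responsibility-weighted average of the data.
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu.round(2))   # close to [-2, 3] from this start; a symmetric start such as [0, 0]
                     # never separates the components (a spurious fixed point of the search)
```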

To address such concerns, my approach looks for alternative, easy-to-optimize, objective functions for any given task. For instance, when learning latent variable models, instead of maximizing the likelihood function, I have focused on the objective of finding a good spectral decomposition of matrices and tensors, a more tractable problem given the existing toolset. That is to say, the spectral decomposition of the matrix is the standard singular-value decomposition (SVD), and we already possess efficient algorithms to compute the best such decomposition.
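
For reference, the matrix case appealed to here is a single library call, and truncating the decomposition gives the best low-rank approximation of the matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))

# Spectral decomposition of a matrix: the singular value decomposition M = U diag(s) V^T.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Keeping the top-2 singular triplets yields the best rank-2 approximation of M.
M_rank2 = (U[:, :2] * s[:2]) @ Vt[:2]
print(np.allclose(M, U @ np.diag(s) @ Vt))   # exact reconstruction from all triplets
print(np.linalg.norm(M - M_rank2))           # error of the rank-2 approximation
```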

Since matrix problems can be solved efficiently despite being non-convex, and given matrices are special cases of tensors, we decided on a new research direction: can we design similar algorithms to solve the decomposition of tensors? It turns out that tensors are much more difficult to analyze and can be NP-hard. Given that, we took a different route and sought to characterize the set of conditions under which such a decomposition can be solved optimally. Luckily, these conditions turn out to be fairly mild in the context of machine learning.
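
One workhorse routine in this line of work is the tensor power method: for a tensor whose components are (close to) orthogonal, repeatedly applying the tensor to a vector and renormalizing recovers one hidden component at a time. The sketch below builds a synthetic orthogonally decomposable tensor and runs the iteration; a full method-of-moments pipeline would add whitening and deflation steps that are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3-way tensor T = sum_k w_k * a_k ⊗ a_k ⊗ a_k with orthonormal components a_k.
A = np.linalg.qr(rng.normal(size=(8, 3)))[0]   # three orthonormal columns (the hidden components)
w = np.array([3.0, 2.0, 1.0])                  # positive component weights
T = np.einsum('k,ik,jk,lk->ijl', w, A, A, A)

# Tensor power iteration: map v to T(I, v, v), then normalize.
v = rng.normal(size=8)
v /= np.linalg.norm(v)
for _ in range(50):
    v = np.einsum('ijl,j,l->i', T, v, v)
    v /= np.linalg.norm(v)

# Under these orthogonality conditions, v aligns with one of the hidden components.
print(np.abs(A.T @ v).round(3))   # one entry close to 1.0, the others close to 0.0
```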

How do these tensor methods actually help solve machine learning problems? At first glance, tensors may appear irrelevant to such tasks. Making the connection to machine learning demands one additional idea, that of relationships (or moments). As I noted earlier, we can use tensors to represent higher order relationships among variables. And by looking at these relationships, we can learn the parameters of the latent variable models efficiently.

DB: So, you’re able to bring a more elegant representation to modeling higher dimensional data. Is this generally applicable in any form of machine learning?

AA: I feel like we have only explored the tip of the iceberg. We can use tensor methods for training a wide class of latent variable models, such as modeling topics in documents, communities in networks, Gaussian mixtures, mixtures of ranking models, and so on. These models, on their face, seem unrelated. Yet, they are unified by the ability to translate statistical properties, such as the conditional independence of variables, into algebraic constraints on tensors. In all these models, suitable moment tensors (usually the third- or fourth-order correlations) are decomposed to estimate the model parameters consistently. Moreover, we can prove that this requires only a small (precisely, a low-order polynomial) amount of samples and computation to work well.

So far, I discussed using tensors for unsupervised learning. We have also demonstrated that tensor methods provide guarantees for training neural networks, which sit in the supervised domain. We are currently tackling even harder questions such as reinforcement learning, where the learner interacts with and possibly changes the environment he/she is trying to understand. In general, I believe using higher order relationships and tensor algebraic techniques holds promise across a range of challenging learning problems.

DB: What’s next on the theoretical side of machine learning research?

AA: These are exciting times to be a researcher in machine learning. There is a whole spectrum of problems, ranging from fundamental research to real-world at-scale deployment. I have been pursuing research from an interdisciplinary lens; by combining tensor algebra with probabilistic modeling, we have developed a completely new set of learning algorithms with strong theoretical guarantees. I believe making such non-obvious connections is crucial toward breaking the hardness barriers in machine learning.
