@zhengyuhong 2015-05-06

Tomas Mikolov Paper Notes

Notes on Mikolov's embedding papers


Distributed Representations of Sentences and Documents. ICML 2014
Within the original word2vec framework, the paper proposes a paragraph2vec model: a paragraph vector is added to the context used to predict the next word. Another important technical point is hierarchical softmax, which reduces the training time complexity.
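A minimal numpy sketch of that idea (hypothetical shapes and names, full softmax instead of the hierarchical softmax used in the paper): the paragraph vector is combined with the context word vectors before predicting the next word.

```python
import numpy as np

rng = np.random.default_rng(0)
V, P, D = 1000, 50, 100                        # vocab size, number of paragraphs, embedding dim

W_words = rng.normal(scale=0.1, size=(V, D))   # word embedding matrix
W_paras = rng.normal(scale=0.1, size=(P, D))   # paragraph embedding matrix
U = rng.normal(scale=0.1, size=(D, V))         # output (softmax) weights

def pv_dm_probs(paragraph_id, context_word_ids):
    """Predict the next word from the paragraph vector plus the context word vectors."""
    # average the paragraph vector together with the context vectors
    h = (W_paras[paragraph_id] + W_words[context_word_ids].sum(axis=0)) / (1 + len(context_word_ids))
    logits = h @ U
    e = np.exp(logits - logits.max())          # full softmax here; the paper uses hierarchical softmax
    return e / e.sum()

probs = pv_dm_probs(paragraph_id=3, context_word_ids=[10, 42, 7])
print(probs.shape, probs.sum())                # (1000,) 1.0
```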

Recurrent neural network based language model. INTERSPEECH 2010
I did not read this paper very closely; one trick worth noting: "we merge all words that occur less often than a threshold (in the training text) into a special rare token. All rare words are thus treated equally, i.e. probability is distributed uniformly between them."
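A small sketch of that trick (hypothetical helpers and threshold): words below a count threshold are replaced by a single `<rare>` token, and at query time the `<rare>` probability mass is split uniformly among the merged words.

```python
from collections import Counter

def build_vocab(tokens, min_count=5, rare_token="<rare>"):
    """Map infrequent words to a single rare token, as described in the quote above."""
    counts = Counter(tokens)
    rare_words = {w for w, c in counts.items() if c < min_count}
    mapped = [rare_token if w in rare_words else w for w in tokens]
    return mapped, rare_words

def word_prob(model_prob_of_token, word, rare_words, rare_token="<rare>"):
    """P(word) = P(<rare>) / |rare words| when the word was merged into the rare class."""
    if word in rare_words:
        return model_prob_of_token(rare_token) / len(rare_words)
    return model_prob_of_token(word)

tokens = "a a a b b c d e".split()
mapped, rare = build_vocab(tokens, min_count=2)
print(mapped)                                  # ['a', 'a', 'a', 'b', 'b', '<rare>', '<rare>', '<rare>']
print(word_prob(lambda tok: 0.3, "c", rare))   # 0.3 / 3 = 0.1, with a dummy model probability
```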

Context dependent recurrent neural network language model. SLT 2012
Unlike Bengio's feed-forward network, which conditions only on the previous n context words, a recurrent network's context can in principle reach much further back: each time a new word is used to predict the next one, the accumulated context (the previous hidden-layer state) is fed in as well. On top of the recurrent network, this paper also injects topic information about the context, with the topics estimated by an LDA model.

Statistical Language Models Based on Neural Networks. PhD thesis
Mikolov's PhD thesis. In the n-gram chapters he mentions a dynamic n-gram model that stores only local statistics (the most recent K words of history). The motivation is an empirical observation: some otherwise infrequent words appear with locally high frequency, so a dynamic n-gram model can capture that co-occurrence information. The dynamic model only assists the static model (the original global n-gram model).
Class Based Models: In the simplest case, each word is mapped to a single class, which usually represents several words. Next, an n-gram model is trained on these classes. This allows better generalization to novel patterns which were not seen in the training data. A drawback is that it relies on expert knowledge (for manually assigned classes).
The thesis also discusses the BPTT algorithm, saying that plain BP fails to capture long context. I did not see exactly why it fails (probably because the plain BP gradient contains no long-context information, whereas BPTT unrolls the long context into a deep feed-forward network so the gradient can still be computed).
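A rough numpy sketch of that unrolling intuition (my reading, hypothetical shapes): truncated BPTT runs the recurrence forward for T steps and then walks back through the same T copies of the recurrent weights, so earlier context still receives gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, T = 8, 16, 10                           # input dim, hidden dim, truncation length

W_xh = rng.normal(scale=0.1, size=(D, H))
W_hh = rng.normal(scale=0.1, size=(H, H))

def bptt_grads(xs, d_hs):
    """xs: T input vectors; d_hs: gradient arriving at each hidden state from the loss."""
    # forward: unroll the recurrence, keeping every hidden state
    hs = [np.zeros(H)]
    for x in xs:
        hs.append(np.tanh(x @ W_xh + hs[-1] @ W_hh))
    # backward: traverse the unrolled network from t = T-1 down to t = 0
    dW_xh, dW_hh = np.zeros_like(W_xh), np.zeros_like(W_hh)
    dh_next = np.zeros(H)
    for t in reversed(range(T)):
        dh = d_hs[t] + dh_next                # gradient from the loss + from the future steps
        dpre = dh * (1.0 - hs[t + 1] ** 2)    # through the tanh non-linearity
        dW_xh += np.outer(xs[t], dpre)
        dW_hh += np.outer(hs[t], dpre)
        dh_next = dpre @ W_hh.T               # pass gradient one step further back in time
    return dW_xh, dW_hh

xs = rng.normal(size=(T, D))
d_hs = rng.normal(size=(T, H))
gx, gh = bptt_grads(xs, d_hs)
print(gx.shape, gh.shape)                     # (8, 16) (16, 16)
```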
Practical Advice for Training: "it seems to be the best practice to update recurrent weights in mini-batches (such as after processing 10-20 training examples)".
Vocabulary Truncation: The simplest solution is to reduce the size of the output vocabulary V. Originally, Bengio merged all infrequent words into a special class that represents the probability of all rare words; the rare words within the class have probability estimated from their unigram frequency. This approach was later improved by Schwenk, who redistributed the probabilities of rare words using an n-gram model. Mikolov proposes an algorithm that assigns words to classes based only on their unigram frequency: every word w_i from the vocabulary V is assigned to a single class c_i before training starts, based just on the relative frequency of words. The approach is commonly referred to as frequency binning. It results in few frequent words per class, so for frequent words the class is small; for rare words the class can still be huge, but rare words are processed infrequently.
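A sketch of frequency binning as I understand it (hypothetical code): sort words by unigram frequency and cut the sorted list into classes of roughly equal probability mass, so the most frequent words end up alone or in tiny classes.

```python
from collections import Counter

def frequency_binning(tokens, num_classes=20):
    """Assign each word to a class so that classes cover roughly equal unigram mass."""
    counts = Counter(tokens)
    total = sum(counts.values())
    word_class, mass, cls = {}, 0.0, 0
    # most frequent words first: they fill up a class quickly and therefore sit in small classes
    for word, count in counts.most_common():
        word_class[word] = cls
        mass += count / total
        if mass > (cls + 1) / num_classes and cls < num_classes - 1:
            cls += 1
    return word_class

word_class = frequency_binning("the cat sat on the mat the end".split(), num_classes=3)
print(word_class)   # 'the' lands in class 0 on its own; the rarer words share the later classes
```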
Dynamic Model: The simplest way to overcome this problem is to use dynamic models, which had already been proposed by Jelinek. In the case of n-gram models, we can simply train another n-gram model on the recent history while processing the test data, and interpolate it with the static one; such a dynamic model is usually called a cache model. Another approach is to maintain just a single model and update its parameters online during processing of the test data, which is easy to do with neural network models. In [38], we have shown that adaptation of RNN models works better in some cases if the model is retrained separately on subsets of the test data.
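A toy sketch of the cache idea (hypothetical interpolation weight and dummy static model): a unigram cache over the recent history is linearly interpolated with the static model's probability.

```python
from collections import deque, Counter

class CacheLM:
    """Interpolate a static model with a unigram cache built from recent test-time history."""
    def __init__(self, static_prob, cache_size=200, lam=0.8):
        self.static_prob = static_prob        # callable: word -> probability under the static model
        self.history = deque(maxlen=cache_size)
        self.lam = lam                        # interpolation weight, tuned on held-out data

    def prob(self, word):
        cache = Counter(self.history)
        p_cache = cache[word] / len(self.history) if self.history else 0.0
        return self.lam * self.static_prob(word) + (1.0 - self.lam) * p_cache

    def observe(self, word):
        self.history.append(word)             # the dynamic part adapts as test data streams by

lm = CacheLM(static_prob=lambda w: 1e-4)      # dummy static model, for illustration only
for w in "the model adapts to the recent text".split():
    lm.observe(w)
print(lm.prob("the"))
```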

Statistical Language Models Based on Neural Networks. PhD thesis speech 2012
The content covers RNNs and n-gram models. A hash function is used to cluster the contexts in p(w(t)|context), concentrating on the most frequent ones, which greatly reduces the complexity. Three experiments are reported, but I did not read them in detail, since they are mostly data tables and I tend to skip over those.

Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013)
Continuous Bag-of-Words Model: The first proposed architecture is similar to the feedforward NNLM, where the non-linear hidden layer is removed and the projection layer is shared for all words (not just the projection matrix); thus, all words get projected into the same position (their vectors are averaged).
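A minimal sketch of that shared projection (hypothetical shapes, no training loop): the context word vectors are averaged and scored linearly, with no non-linear hidden layer.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 1000, 100
W_in = rng.normal(scale=0.1, size=(V, D))     # shared input projection
W_out = rng.normal(scale=0.1, size=(D, V))    # output weights

def cbow_logits(context_ids):
    """CBOW: average the context vectors (same projected position for all), then score linearly."""
    h = W_in[context_ids].mean(axis=0)        # no non-linear hidden layer
    return h @ W_out

print(cbow_logits([3, 17, 256, 42]).shape)    # (1000,)
```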
Continuous Skip-gram Model: More precisely, we use each current word as an input to a log-linear classifier with continuous projection layer, and predict words within a certain range before and after the current word. We found that increasing the range improves quality of the resulting word vectors, but it also increases the computational complexity. Since the more distant words are usually less related to the current word than those close to it, we give less weight to the distant words by sampling less from those words in our training examples.
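The "sample less from distant words" part is commonly implemented by drawing a reduced window size per position; a sketch with hypothetical function names:

```python
import random

def skipgram_pairs(tokens, max_window=5, seed=0):
    """Generate (center, context) training pairs; drawing a random window <= max_window per
    position means distant words are sampled less often than nearby ones."""
    rng = random.Random(seed)
    for i, center in enumerate(tokens):
        window = rng.randint(1, max_window)            # dynamic window size
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

pairs = list(skipgram_pairs("the quick brown fox jumps over the lazy dog".split()))
print(pairs[:5])
```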
This paper mainly analyzes computational complexity and how to reduce it, and on that basis compares the proposed models with other models.

Distributed Representations of Words and Phrases and their Compositionality. NIPS 2013
Word representations are limited by their inability to represent idiomatic phrases that are not compositions of the individual words. For example, "Boston Globe" is a newspaper, and so it is not a natural combination of the meanings of "Boston" and "Globe". Therefore, using vectors to represent the whole phrases makes the Skip-gram model considerably more expressive. Other techniques that aim to represent meaning of sentences by composing the word vectors, such as the recursive autoencoders [15], would also benefit from using phrase vectors instead of the word vectors.
Skip-gram: introduces the skip-gram model.
Hierarchical Softmax: training uses hierarchical softmax. "In the context of neural network language models, it was first introduced by Morin and Bengio. The main advantage is that instead of evaluating W output nodes in the neural network to obtain the probability distribution, it is needed to evaluate only about log2(W) nodes."
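A toy sketch of the path-probability computation (hypothetical tree encoding): the probability of a word is a product of sigmoids along its ~log2(W) inner tree nodes instead of a softmax over all W outputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_word_prob(h, path_nodes, path_codes, node_vecs):
    """P(word | h) = product over the word's tree path of sigma(+/- node . h).
    path_nodes: inner-node indices from the root to the word's leaf.
    path_codes: 0/1 branch taken at each node (e.g. from a Huffman tree)."""
    p = 1.0
    for node, code in zip(path_nodes, path_codes):
        sign = 1.0 if code == 0 else -1.0
        p *= sigmoid(sign * (node_vecs[node] @ h))
    return p

rng = np.random.default_rng(0)
D, inner = 100, 999                            # roughly V - 1 inner nodes for a vocab of size V
node_vecs = rng.normal(scale=0.1, size=(inner, D))
h = rng.normal(size=D)
print(hs_word_prob(h, path_nodes=[0, 5, 31], path_codes=[0, 1, 1], node_vecs=node_vecs))
```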
Negative Sampling: Our experiments indicate that values of k in the range 5–20 are useful for small training datasets, while for large datasets the k can be as small as 2–5.
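A sketch of the negative-sampling objective for one (center, context) pair with k noise words (drawing the noise words from the unigram^(3/4) distribution is assumed and omitted here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_center, u_context, u_negatives):
    """-log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c); u_negatives has shape (k, D)."""
    pos = np.log(sigmoid(u_context @ v_center))
    neg = np.log(sigmoid(-u_negatives @ v_center)).sum()
    return -(pos + neg)

rng = np.random.default_rng(0)
D, k = 100, 5                                   # k = 5-20 for small data, 2-5 for large data
print(neg_sampling_loss(rng.normal(size=D) * 0.1,
                        rng.normal(size=D) * 0.1,
                        rng.normal(size=(k, D)) * 0.1))
```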
Subsampling of Frequent Words: In very large corpora, the most frequent words can easily occur hundreds of millions of times (e.g., "in", "the", and "a"). Such words usually provide less information value than the rare words. For example, while the Skip-gram model benefits from observing the co-occurrences of "France" and "Paris", it benefits much less from observing the frequent co-occurrences of "France" and "the", as nearly every word co-occurs frequently within a sentence with "the". This idea can also be applied in the opposite direction; the vector representations of frequent words do not change significantly after training on several million examples. Although this subsampling formula was chosen heuristically, we found it to work well in practice. It accelerates learning and even significantly improves the accuracy of the learned vectors of the rare words, as will be shown in the following sections.
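The subsampling formula from the paper, as a sketch: each occurrence of word w is discarded with probability P(w) = 1 - sqrt(t / f(w)), where f(w) is its relative frequency and t is a small threshold (around 1e-5 in the paper; a larger value is used below for the toy corpus).

```python
import random
from collections import Counter

def subsample(tokens, t=1e-5, seed=0):
    """Randomly drop occurrences of very frequent words: P(discard w) = 1 - sqrt(t / f(w))."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for w in tokens:
        f = counts[w] / total
        p_discard = max(0.0, 1.0 - (t / f) ** 0.5)
        if rng.random() >= p_discard:
            kept.append(w)
    return kept

corpus = ["the"] * 9000 + [f"word{i}" for i in range(1000)]
out = subsample(corpus, t=1e-3)
kept_the = Counter(out)["the"]
print(kept_the, len(out) - kept_the)   # "the" is heavily thinned, rare words are all kept
```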

Linguistic Regularities in Continuous Space Word Representations. HLT-NAACL 2013
By training a neural network language model, one obtains not just the model itself, but also the learned word representations, which may be used for other, potentially unrelated, tasks.
Training method: These models were first studied in the context of feed-forward networks (Bengio et al., 2003; Bengio et al., 2006), and later in the context of recurrent neural network models (Mikolov et al., 2010; Mikolov et al., 2011b).
The RNN is trained with back-propagation to maximize the data log-likelihood under the model
Context dependent recurrent neural network language model. SLT 2012
LDA: By performing Latent Dirichlet Allocation using a block of preceding text, we achieve a topic-conditioned RNNLM.
The feature layer represents an external input vector that should contain complementary information to the input word vector w(t). In the rest of this paper, we will be using features that represent topic information.
In this paper, we study the use of long-span context in RNNLMs. One approach to increasing the effective context is to improve the learning algorithm to avoid the problem of vanishing gradients identified in [10].
In contrast to these approaches, we have chosen to explicitly compute a context vector based on the sentence history, and provide it directly to the network as an additional input. This has the advantage of allowing us to bring sophisticated and pre-existing topic modeling techniques to bear with little overhead, specifically Latent Dirichlet Allocation (LDA) [24].
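A sketch of that feature-layer idea (hypothetical shapes; the LDA inference over the preceding block of text is assumed given): the topic distribution is fed to the hidden layer alongside the current word.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, K = 1000, 200, 40                     # vocab size, hidden units, LDA topics

W_in = rng.normal(scale=0.1, size=(V, H))   # input word -> hidden
W_rec = rng.normal(scale=0.1, size=(H, H))  # recurrent weights
W_feat = rng.normal(scale=0.1, size=(K, H)) # feature (topic) vector -> hidden
W_out = rng.normal(scale=0.1, size=(H, V))  # hidden -> output vocabulary

def step(word_id, h_prev, topic_vec):
    """One RNN step with an extra feature input: the topic posterior of the preceding text."""
    h = np.tanh(W_in[word_id] + h_prev @ W_rec + topic_vec @ W_feat)
    logits = h @ W_out
    e = np.exp(logits - logits.max())
    return h, e / e.sum()

topic_vec = rng.dirichlet(np.ones(K))       # stand-in for the LDA topic posterior
h, probs = step(word_id=7, h_prev=np.zeros(H), topic_vec=topic_vec)
print(probs.shape)                          # (1000,)
```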
Learning Longer Memory in Recurrent Neural Networks. CoRR abs/1412.7753 (2014)
Feedforward architectures such as time-delayed neural networks usually represent time explicitly with a fixed-length window of the recent history (Rumelhart et al., 1985). While this type of models work well in practice, fixing the window size makes long-term dependency harder to learn and can only be done at the cost of a linear increase of the number of parameters.
In this section, we propose an extension of SRN by adding a hidden layer specifically designed to capture long term dependencies. We design this layer following two observations: (1) the non-linearity can cause gradients to vanish, (2) a fully connected hidden layer changes its state completely at every time step.
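A sketch of the proposed constrained layer as I read it (hypothetical shapes): an extra state with no non-linearity that is forced to change slowly, s(t) = (1 - alpha) B x(t) + alpha s(t-1), running alongside the usual fully connected hidden layer.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, S = 50, 100, 40                    # input dim, fast hidden units, slow context units
alpha = 0.95                             # how slowly the context layer is allowed to change

A = rng.normal(scale=0.1, size=(D, H))   # input -> fast hidden
R = rng.normal(scale=0.1, size=(H, H))   # fast recurrent weights
B = rng.normal(scale=0.1, size=(D, S))   # input -> slow context
P = rng.normal(scale=0.1, size=(S, H))   # slow context -> fast hidden

def scrn_step(x, h_prev, s_prev):
    """One step of the constrained recurrence: s has no non-linearity and changes slowly,
    so gradients through it decay only geometrically with alpha."""
    s = (1.0 - alpha) * (x @ B) + alpha * s_prev      # slow, linear context state
    h = np.tanh(x @ A + s @ P + h_prev @ R)           # ordinary fast hidden state
    return h, s

h, s = np.zeros(H), np.zeros(S)
for x in rng.normal(size=(20, D)):
    h, s = scrn_step(x, h, s)
print(h.shape, s.shape)                  # (100,) (40,)
```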
