@knight 2016-01-14T14:03:17.000000Z

[Translated] Recurrent Neural Networks Tutorial, Part 2

Neural Networks

4. Building the Training Data Matrices

The input to our RNN is vectors, not strings, so we create mappings between words and indices: index_to_word and word_to_index. For example, the word "friendly" might be at index 2001. A training example x may look like [0, 179, 341, 416], where 0 corresponds to SENTENCE_START. The corresponding label y would be [179, 341, 416, 1]. Remember that our goal is to predict the next word, so y is just the vector x shifted by one position, with the final element 1 being the index of SENTENCE_END. In other words, the correct prediction for word 179 above would be 341, the actual next word.
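As a minimal sketch of this shift, using the example indices from the paragraph above (0 = SENTENCE_START, 1 = SENTENCE_END; 179, 341, 416 stand in for word indices):

```python
# Toy illustration of how x and y are built from one tokenized sentence.
tokens = [0, 179, 341, 416, 1]   # full sentence as vocabulary indices
x = tokens[:-1]                  # input:  everything except SENTENCE_END
y = tokens[1:]                   # labels: everything except SENTENCE_START

assert x == [0, 179, 341, 416]
assert y == [179, 341, 416, 1]
```

At every position, y holds the word that follows the word at the same position in x, which is exactly what the network is trained to predict.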

```python
# Python 2 code; imports added for completeness.
import csv
import itertools
import nltk
import numpy as np

vocabulary_size = 8000
unknown_token = "UNKNOWN_TOKEN"
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"

# Read the data and append SENTENCE_START and SENTENCE_END tokens
print "Reading CSV file..."
with open('data/reddit-comments-2015-08.csv', 'rb') as f:
    reader = csv.reader(f, skipinitialspace=True)
    reader.next()
    # Split full comments into sentences
    sentences = itertools.chain(*[nltk.sent_tokenize(x[0].decode('utf-8').lower()) for x in reader])
    # Append SENTENCE_START and SENTENCE_END
    sentences = ["%s %s %s" % (sentence_start_token, x, sentence_end_token) for x in sentences]
print "Parsed %d sentences." % (len(sentences))

# Tokenize the sentences into words
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]

# Count the word frequencies
word_freq = nltk.FreqDist(itertools.chain(*tokenized_sentences))
print "Found %d unique words tokens." % len(word_freq.items())

# Get the most common words and build index_to_word and word_to_index vectors
vocab = word_freq.most_common(vocabulary_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict([(w,i) for i,w in enumerate(index_to_word)])

print "Using vocabulary size %d." % vocabulary_size
print "The least frequent word in our vocabulary is '%s' and appeared %d times." % (vocab[-1][0], vocab[-1][1])

# Replace all words not in our vocabulary with the unknown token
for i, sent in enumerate(tokenized_sentences):
    tokenized_sentences[i] = [w if w in word_to_index else unknown_token for w in sent]

print "\nExample sentence: '%s'" % sentences[0]
print "\nExample sentence after Pre-processing: '%s'" % tokenized_sentences[0]

# Create the training data
X_train = np.asarray([[word_to_index[w] for w in sent[:-1]] for sent in tokenized_sentences])
y_train = np.asarray([[word_to_index[w] for w in sent[1:]] for sent in tokenized_sentences])
```

x:
[0, 51, 27, 16, 10, 856, 53, 25, 34, 69]

y:
[51, 27, 16, 10, 856, 53, 25, 34, 69, 1]