@Frankchen
2016-05-17T02:25:21.000000Z
Tags: Tutorial, Word2Vec, Project
First, convert the original dump into a plain-text file. On my machine (Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz, 16 GB RAM) this took about 8 minutes.
python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.cn.text
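The post does not show process_wiki.py itself. A minimal sketch of what such a script might look like, using gensim's WikiCorpus to stream articles out of the compressed dump (the structure and logging here are assumptions of this sketch, not the author's original code):

import sys
import logging
from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)
    inp, outp = sys.argv[1], sys.argv[2]  # e.g. zhwiki-latest-pages-articles.xml.bz2  wiki.cn.text
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})  # stream articles from the .bz2 dump
    with open(outp, 'w', encoding='utf-8') as out:
        for i, tokens in enumerate(wiki.get_texts(), 1):
            # older gensim versions yield bytes tokens, newer ones yield str
            words = [t if isinstance(t, str) else t.decode('utf-8') for t in tokens]
            out.write(' '.join(words) + '\n')  # one article per line
            if i % 10000 == 0:
                logging.info('Saved %d articles', i)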
Use OpenCC to convert Traditional Chinese to Simplified Chinese. Note that I installed OpenCC by downloading and building it from source; because the JSON configuration file was missing, I downloaded t2s.json from the project's GitHub repository.
opencc -i wiki.cn.text -o wiki.cn.text.simplified -c t2s.json
Use jieba for word segmentation. Because Python 2.x handles Chinese poorly, Python 3 is used here.
python3 separate_words.py wiki.cn.text.simplified wiki.cn.text.simplified.seperated
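separate_words.py is not shown in the post either; a minimal sketch of the segmentation step with jieba (script structure and variable names are assumptions of this sketch):

import sys
import jieba

if __name__ == '__main__':
    inp, outp = sys.argv[1], sys.argv[2]
    with open(inp, 'r', encoding='utf-8') as fin, \
         open(outp, 'w', encoding='utf-8') as fout:
        for line in fin:
            # jieba.cut returns a generator of word segments; re-join them with spaces
            fout.write(' '.join(jieba.cut(line.strip())) + '\n')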
Next, strip irrelevant characters with a regular expression.
python3 remove_words.py wiki.cn.text.simplified.seperated wiki.cn.text.simplified.seperated.removed
I noticed that running this step with Python 2.x produced the English text instead (i.e., the Chinese was removed); the cause remains to be investigated.
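remove_words.py is not shown in the post; here is a minimal sketch, assuming it simply keeps Chinese characters and spaces and drops everything else. One plausible explanation for the Python 2 behaviour noted above is that, on byte strings, the \uXXXX escapes in such a pattern are not interpreted as single Chinese characters, so the substitution ends up stripping the multi-byte Chinese text instead.

import re
import sys

# \u4e00-\u9fa5 covers the common CJK Unified Ideographs range
NON_CHINESE = re.compile(r'[^\u4e00-\u9fa5 ]+')

if __name__ == '__main__':
    inp, outp = sys.argv[1], sys.argv[2]
    with open(inp, 'r', encoding='utf-8') as fin, \
         open(outp, 'w', encoding='utf-8') as fout:
        for line in fin:
            cleaned = NON_CHINESE.sub(' ', line)      # replace non-Chinese runs with a space
            fout.write(' '.join(cleaned.split()) + '\n')  # normalize whitespace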
Run the training:
python train_word2vec_model.py wiki.cn.text.simplified.seperated.removed wiki.cn.text.simplified.seperated.removed.model wiki.cn.text.simplified.seperated.removed.vector
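train_word2vec_model.py is only quoted in part later in this post; based on the parameters quoted there (size=400, window=5, min_count=5, workers=cpu_count), a sketch of the whole script might look like this (the logging and save calls are assumptions of the sketch):

import sys
import logging
import multiprocessing
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)
    inp, outp_model, outp_vector = sys.argv[1], sys.argv[2], sys.argv[3]
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())
    model.save(outp_model)                                  # gensim's native format, for Word2Vec.load()
    model.save_word2vec_format(outp_vector, binary=False)   # plain-text .vector file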
Training speed:
training on 674351690 raw words (621970843 effective words) took 1535.2s, 405140 effective words/s
6. Testing the results
In [1]: import gensim
In [2]: model = gensim.models.Word2Vec.load('wiki.cn.text.simplified.seperated.removed.model')
In [3]: model.most_similar(u'足球')
Out[3]:
[(u'\u56fd\u9645\u8db3\u7403', 0.5941175818443298),
(u'\u7bee\u7403', 0.5309327244758606),
(u'\u8db3\u7403\u8fd0\u52a8', 0.5207462906837463),
(u'\u7537\u5b50\u7bee\u7403', 0.5116416811943054),
(u'\u4e16\u754c\u8db3\u7403', 0.5084630250930786),
(u'\u56fd\u5bb6\u8db3\u7403\u961f', 0.5042006969451904),
(u'\u8db3\u7403\u961f', 0.5040441751480103),
(u'\u8db3\u7403\u8054\u8d5b', 0.5007984042167664),
(u'\u677f\u7403', 0.49249252676963806),
(u'\u6392\u7403', 0.48762720823287964)]
In [4]: for e in model.most_similar(u'足球'):print e[0],e[1]
国际足球 0.594117581844
篮球 0.530932724476
足球运动 0.520746290684
男子篮球 0.511641681194
世界足球 0.508463025093
国家足球队 0.504200696945
足球队 0.504044175148
足球联赛 0.500798404217
板球 0.49249252677
排球 0.487627208233
Python 2 is somewhat inconvenient to use here, but working with the resulting model in Python 3 raised errors. My preliminary guess is that this was because I alternated between Python 2 and Python 3 across the processing steps, so I redid segmentation, character removal, and training from scratch, using Python 3 commands throughout.
7. Retraining the model and word vectors
training on 674351690 raw words (621967454 effective words) took 3142.9s, 197898 effective words/s
(The speed dropped because the machine was training on the English corpus at the same time.)
8. Testing
In [1]: import gensim
In [2]: model = gensim.models.Word2Vec.load('wiki.cn.text.simplified.seperated.removed.model')
In [3]: model.most_similar('深圳')
Out[3]:
[('深圳市', 0.6228553652763367),
('珠海', 0.5962932109832764),
('东莞', 0.5813443064689636),
('蛇口', 0.5745633840560913),
('广州', 0.5641505122184753),
('深圳湾', 0.536582350730896),
('福田区', 0.5346789360046387),
('珠三角', 0.5253632068634033),
('南山区', 0.5130573511123657),
('佛山', 0.5102678537368774)]
In [4]: model.similarity('编程','程序员')
Out[4]: 0.66386519296107205
In [5]: model.similarity('编程','火锅')
Out[5]: -0.053431331280979134
In [6]: model.similarity('编程','小说')
Out[6]: 0.14391657395869928
In [7]: model.similarity('编程','护发素')
Out[7]: 0.035860913136417573
In [8]: model.doesnt_match('深圳 广州 北京 纽约'.split())
Out[8]: '纽约'
As you can see, Python 3 has a clear advantage over Python 2 when it comes to handling text encoding.
First, test with the roughly 100 MB English corpus downloaded by the original word2vec demo script:
python3 train_word2vec_model.py ../trunk/text8 text8.model text8.vector
The corpus is small, so training is fast:
INFO: training on 85026035 raw words (62534778 effective words) took 915.3s, 68322 effective words/s
Test:
In [1]: import gensim
In [2]: model = gensim.models.Word2Vec.load('text8.model')
In [3]: model.most_similar('one')
Out[3]:
[('seven', 0.8102973699569702),
('eight', 0.7997660040855408),
('six', 0.7932437062263489),
('four', 0.7776107788085938),
('five', 0.7714489102363586),
('three', 0.7500687837600708),
('nine', 0.7399319410324097),
('two', 0.644408106803894),
('oct', 0.611247181892395),
('june', 0.6030701398849487)]
In [8]: model.similarity('cat','dog')
Out[8]: 0.80140867240706526
In [9]: model.similarity('cat','human')
Out[9]: 0.19338057325537722
In [10]: model.doesnt_match('run walk sit lunch'.split())
Out[10]: 'lunch'
First, process the English dump into a plain-text file; this took about 2 hours.
python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
Training took about 13 hours:
training on 10992637830 raw words (8908362214 effective words) took 49625.3s, 179513 effective words/s
Testing the results
An error occurred when loading the model:
model2 = gensim.models.Word2Vec.load('wiki.en.text.model')
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-5-a220f0ff4b85> in <module>()
----> 1 model2 = gensim.models.Word2Vec.load('wiki.en.text.model')
/home/frank/.local/lib/python3.4/site-packages/gensim/models/word2vec.py in load(cls, *args, **kwargs)
1483 @classmethod
1484 def load(cls, *args, **kwargs):
-> 1485 model = super(Word2Vec, cls).load(*args, **kwargs)
1486 # update older models
1487 if hasattr(model, 'table'):
/home/frank/.local/lib/python3.4/site-packages/gensim/utils.py in load(cls, fname, mmap)
246 compress, subname = SaveLoad._adapt_by_suffix(fname)
247
--> 248 obj = unpickle(fname)
249 obj._load_specials(fname, mmap, compress, subname)
250 return obj
/home/frank/.local/lib/python3.4/site-packages/gensim/utils.py in unpickle(fname)
910 with smart_open(fname) as f:
911 # Because of loading from S3 load can't be used (missing readline in smart_open)
--> 912 return _pickle.loads(f.read())
913
914
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 0: ordinal not in range(128)
I looked this up, but the cause of the error is unclear; instead, test by loading the vectors saved in the original word2vec text format:
model2 = gensim.models.Word2Vec.load_word2vec_format('wiki.en.text.vector', binary=False)
The model is large, so loading takes a long time.
It finally finished loading:
In [8]: model2.most_similar('one')
Out[8]:
[('each', 0.5213934779167175),
('two', 0.5099077820777893),
('five', 0.5021112561225891),
('three', 0.4996245503425598),
('four', 0.49141108989715576),
('the', 0.457046240568161),
('six', 0.4431409537792206),
('only', 0.4366394579410553),
('seven', 0.4334326684474945),
('number', 0.42605626583099365)]
In [9]: model2.similarity('cat','dog')
Out[9]: 0.70010628594820035
In [10]: model2.similarity('cat','human')
Out[10]: 0.10438150155484917
In [11]: model2.doesnt_match('run walk sit lunch'.split())
Out[11]: 'lunch'
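As an aside (not part of the original workflow): since loading the plain-text .vector file is slow, one option is to re-save it once in binary word2vec format so that later loads are faster.

# re-save the vectors in binary format (assumes the old gensim Word2Vec API used throughout this post)
model2.save_word2vec_format('wiki.en.text.vector.bin', binary=True)
# afterwards it can be reloaded with:
# model2 = gensim.models.Word2Vec.load_word2vec_format('wiki.en.text.vector.bin', binary=True)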
From this line in train_word2vec_model.py:
model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5, workers=multiprocessing.cpu_count())
we can see that the gensim training used 400-dimensional word vectors, a window of 5, and a min_count (the minimum frequency below which words are discarded) of 5, so the corresponding parameters need to be changed when training with the original word2vec implementation as well.
Train with the following command:
./word2vec -train ../word2vec-for-wiki/wiki.cn.text.simplified.seperated.removed -output wiki.cn.text.simplified.seperated.removed.txt -size 400 -window 5
Training took about half an hour and produced a 2.1 GB file, wiki.cn.text.simplified.seperated.removed.txt.
It turned out that this file could not be loaded either with gensim's model loader or with the distance tool from the original code, so I converted the output to a binary file instead:
./word2vec -train ../word2vec-for-wiki/wiki.cn.text.simplified.seperated.removed -output wiki.cn.text.simplified.removed.bin -cbow 0 -size 400 -window 5 -negative 0 -hs 1 -sample 1e-4 -threads 20 -binary 1 -iter 100
Here -min-count sets the minimum word frequency; the default is 5.
Training started at 05-10 09:34.
Training turned out to be extremely slow: after nearly 3 hours on this roughly 1 GB Chinese corpus, only about 5% had been processed, which is not practical. The exact cause remains to be investigated, though one likely factor is that the command above requests -iter 100 (100 passes over the corpus), far more passes than the gensim run used by default.