@Frankchen
2016-05-17T02:25:21.000000Z
Tutorial Word2Vec Project
First, convert the raw dump into a plain-text file. On my machine (Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz, 16 GB RAM) this took about 8 minutes.
python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.cn.text
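For reference, here is a minimal sketch of what process_wiki.py might look like (an assumed reconstruction built on gensim's WikiCorpus; the script name comes from the command above, the internals are my guess):

# process_wiki.py -- minimal sketch: stream a Wikipedia .bz2 dump and
# write one article of plain text per line.
import logging
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s',
                        level=logging.INFO)
    inp, outp = sys.argv[1], sys.argv[2]

    # Passing dictionary={} skips building a Dictionary we don't need.
    # get_texts() yields each article as a list of tokens (depending on
    # the gensim version, tokens may be bytes and need decoding first).
    wiki = WikiCorpus(inp, dictionary={})
    with open(outp, 'w', encoding='utf-8') as output:
        for i, text in enumerate(wiki.get_texts()):
            output.write(' '.join(text) + '\n')
            if (i + 1) % 10000 == 0:
                logging.info('Saved %d articles', i + 1)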
Use OpenCC to convert Traditional Chinese to Simplified. Note that I installed OpenCC by downloading and compiling it from source; because the json configuration file was missing, I downloaded t2s.json from the project's GitHub.
opencc -i wiki.cn.text -o wiki.cn.text.simplified -c t2s.json
Use jieba for word segmentation. Because Python 2.x handles Chinese poorly, Python 3 is used here.
python3 separate_words.py wiki.cn.text.simplified wiki.cn.text.simplified.seperated
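The segmentation script itself is simple; a minimal sketch (an assumed reconstruction, not necessarily the exact file used here) looks like this:

# separate_words.py -- minimal sketch: segment each line of simplified
# Chinese into space-separated words with jieba.
import sys

import jieba

if __name__ == '__main__':
    inp, outp = sys.argv[1], sys.argv[2]
    with open(inp, encoding='utf-8') as fin, \
            open(outp, 'w', encoding='utf-8') as fout:
        for line in fin:
            # jieba.cut returns a generator of word strings
            fout.write(' '.join(jieba.cut(line.strip())) + '\n')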
Next, remove irrelevant characters with a regular expression.
python3 remove_words.py wiki.cn.text.simplified.seperated wiki.cn.text.simplified.seperated.removed
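The removal script can be as simple as a whitelist regular expression over the CJK character range; a minimal sketch (again an assumed reconstruction) follows. Note that applying such a pattern to byte strings under Python 2 behaves very differently, which may relate to the issue described next:

# remove_words.py -- minimal sketch: keep only Chinese characters and
# spaces, dropping punctuation, Latin letters and digits.
import re
import sys

if __name__ == '__main__':
    inp, outp = sys.argv[1], sys.argv[2]
    # \u4e00-\u9fa5 is the common CJK Unified Ideographs range
    pattern = re.compile(r'[^\u4e00-\u9fa5 ]')
    with open(inp, encoding='utf-8') as fin, \
            open(outp, 'w', encoding='utf-8') as fout:
        for line in fin:
            fout.write(pattern.sub('', line) + '\n')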
I noticed that when this step is run under Python 2.x, the result is the English text instead (i.e., the Chinese is what gets removed); the cause remains to be investigated.
Run the training:
python train_word2vec_model.py wiki.cn.text.simplified.seperated.removed wiki.cn.text.simplified.seperated.removed.model wiki.cn.text.simplified.seperated.removed.vector
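The training script is a thin wrapper around gensim; a minimal sketch (an assumed reconstruction, using the parameters quoted further below in this post) is:

# train_word2vec_model.py -- minimal sketch: train word2vec with gensim
# and save both the native model and the plain-text vectors.
# (Newer gensim renames size to vector_size and moves
# save_word2vec_format to model.wv.)
import logging
import multiprocessing
import sys

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s',
                        level=logging.INFO)
    inp, outp_model, outp_vector = sys.argv[1:4]

    # LineSentence streams the corpus line by line, one sentence per line
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    model.save(outp_model)                                  # native gensim format
    model.save_word2vec_format(outp_vector, binary=False)   # plain-text vectors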
Training throughput:
training on 674351690 raw words (621970843 effective words) took 1535.2s, 405140 effective words/s
6. Testing the results
In [1]: import gensim

In [2]: model = gensim.models.Word2Vec.load('wiki.cn.text.simplified.seperated.removed.model')

In [3]: model.most_similar(u'足球')
Out[3]:
[(u'\u56fd\u9645\u8db3\u7403', 0.5941175818443298),
 (u'\u7bee\u7403', 0.5309327244758606),
 (u'\u8db3\u7403\u8fd0\u52a8', 0.5207462906837463),
 (u'\u7537\u5b50\u7bee\u7403', 0.5116416811943054),
 (u'\u4e16\u754c\u8db3\u7403', 0.5084630250930786),
 (u'\u56fd\u5bb6\u8db3\u7403\u961f', 0.5042006969451904),
 (u'\u8db3\u7403\u961f', 0.5040441751480103),
 (u'\u8db3\u7403\u8054\u8d5b', 0.5007984042167664),
 (u'\u677f\u7403', 0.49249252676963806),
 (u'\u6392\u7403', 0.48762720823287964)]

In [4]: for e in model.most_similar(u'足球'):
   ...:     print e[0],e[1]
   ...:
国际足球 0.594117581844
篮球 0.530932724476
足球运动 0.520746290684
男子篮球 0.511641681194
世界足球 0.508463025093
国家足球队 0.504200696945
足球队 0.504044175148
足球联赛 0.500798404217
板球 0.49249252677
排球 0.487627208233
Python 2 turned out to be inconvenient here, and the model produced with Python 3 raised errors when loaded. My initial guess was that this came from alternating between Python 2 and Python 3 across steps, so I redid the segmentation, cleaning, and training entirely with Python 3 commands.
7. Regenerating the model and word vectors
training on 674351690 raw words (621967454 effective words) took 3142.9s, 197898 effective words/s
(Training was slower this time because the machine was simultaneously training on the English corpus.)
8. Testing
In [1]: import gensim

In [2]: model = gensim.models.Word2Vec.load('wiki.cn.text.simplified.seperated.removed.model')

In [3]: model.most_similar('深圳')
Out[3]:
[('深圳市', 0.6228553652763367),
 ('珠海', 0.5962932109832764),
 ('东莞', 0.5813443064689636),
 ('蛇口', 0.5745633840560913),
 ('广州', 0.5641505122184753),
 ('深圳湾', 0.536582350730896),
 ('福田区', 0.5346789360046387),
 ('珠三角', 0.5253632068634033),
 ('南山区', 0.5130573511123657),
 ('佛山', 0.5102678537368774)]

In [4]: model.similarity('编程','程序员')
Out[4]: 0.66386519296107205

In [5]: model.similarity('编程','火锅')
Out[5]: -0.053431331280979134

In [6]: model.similarity('编程','小说')
Out[6]: 0.14391657395869928

In [7]: model.similarity('编程','护发素')
Out[7]: 0.035860913136417573

In [8]: model.doesnt_match('深圳 广州 北京 纽约'.split())
Out[8]: '纽约'
As you can see, Python 3 has a considerable advantage over Python 2 when it comes to handling text encodings.
First, I tested with the roughly 100 MB English corpus (text8) downloaded by the original code's demo script:
python3 train_word2vec_model.py ../trunk/text8 text8.model text8.vector
The corpus is small, so training finishes quickly:
INFO: training on 85026035 raw words (62534778 effective words) took 915.3s, 68322 effective words/s
Testing:
In [1]: import gensim

In [2]: model = gensim.models.Word2Vec.load('text8.model')

In [3]: model.most_similar('one')
Out[3]:
[('seven', 0.8102973699569702),
 ('eight', 0.7997660040855408),
 ('six', 0.7932437062263489),
 ('four', 0.7776107788085938),
 ('five', 0.7714489102363586),
 ('three', 0.7500687837600708),
 ('nine', 0.7399319410324097),
 ('two', 0.644408106803894),
 ('oct', 0.611247181892395),
 ('june', 0.6030701398849487)]

In [8]: model.similarity('cat','dog')
Out[8]: 0.80140867240706526

In [9]: model.similarity('cat','human')
Out[9]: 0.19338057325537722

In [10]: model.doesnt_match('run walk sit lunch'.split())
Out[10]: 'lunch'
First, convert the source dump into a plain-text file; this took about 2 hours.
python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
Training took about 13 hours:
training on 10992637830 raw words (8908362214 effective words) took 49625.3s, 179513 effective words/s
Testing the results.
An error occurred when loading the model:
model2 = gensim.models.Word2Vec.load('wiki.en.text.model')

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-5-a220f0ff4b85> in <module>()
----> 1 model2 = gensim.models.Word2Vec.load('wiki.en.text.model')

/home/frank/.local/lib/python3.4/site-packages/gensim/models/word2vec.py in load(cls, *args, **kwargs)
   1483     @classmethod
   1484     def load(cls, *args, **kwargs):
-> 1485         model = super(Word2Vec, cls).load(*args, **kwargs)
   1486         # update older models
   1487         if hasattr(model, 'table'):

/home/frank/.local/lib/python3.4/site-packages/gensim/utils.py in load(cls, fname, mmap)
    246         compress, subname = SaveLoad._adapt_by_suffix(fname)
    247
--> 248         obj = unpickle(fname)
    249         obj._load_specials(fname, mmap, compress, subname)
    250         return obj

/home/frank/.local/lib/python3.4/site-packages/gensim/utils.py in unpickle(fname)
    910     with smart_open(fname) as f:
    911         # Because of loading from S3 load can't be used (missing readline in smart_open)
--> 912         return _pickle.loads(f.read())
    913
    914

UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 0: ordinal not in range(128)
I looked into it, but the cause of this error remains unclear; the model can instead be tested by loading the vectors in the original word2vec text format:
model2 = gensim.models.Word2Vec.load_word2vec_format('wiki.en.text.vector', binary=False)
The model is large, so loading takes a long time.
It finally finished loading:
In [8]: model2.most_similar('one')
Out[8]:
[('each', 0.5213934779167175),
 ('two', 0.5099077820777893),
 ('five', 0.5021112561225891),
 ('three', 0.4996245503425598),
 ('four', 0.49141108989715576),
 ('the', 0.457046240568161),
 ('six', 0.4431409537792206),
 ('only', 0.4366394579410553),
 ('seven', 0.4334326684474945),
 ('number', 0.42605626583099365)]

In [9]: model2.similarity('cat','dog')
Out[9]: 0.70010628594820035

In [10]: model2.similarity('cat','human')
Out[10]: 0.10438150155484917

In [11]: model2.doesnt_match('run walk sit lunch'.split())
Out[11]: 'lunch'
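Since loading the plain-text .vector file is slow every time, one option (my own suggestion, not part of the original workflow; the filename wiki.en.text.model.native is hypothetical) is to re-save the loaded model once in gensim's native format, which loads much faster in later sessions:

# One-time conversion after the slow text-format load above
model2.save('wiki.en.text.model.native')

# Later sessions can then load it quickly, optionally memory-mapped
model2 = gensim.models.Word2Vec.load('wiki.en.text.model.native', mmap='r')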
From the line model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5, workers=multiprocessing.cpu_count()) in train_word2vec_model.py, we can see that the gensim training used 400-dimensional word vectors, a context window of 5, and a min_count (the minimum frequency below which rare words are discarded) of 5; the training run with the original word2vec code therefore needs matching parameters.
Train with the following command:
./word2vec -train ../word2vec-for-wiki/wiki.cn.text.simplified.seperated.removed -output wiki.cn.text.simplified.seperated.removed.txt -size 400 -window 5
Training took about half an hour and produced a 2.1 GB file, wiki.cn.text.simplified.seperated.removed.txt.
I found that this file could be loaded successfully neither with gensim's model loader nor with the distance tool from the original code, so I tried converting the output to a binary file:
./word2vec -train ../word2vec-for-wiki/wiki.cn.text.simplified.seperated.removed -output wiki.cn.text.simplified.removed.bin -cbow 0 -size 400 -window 5 -negative 0 -hs 1 -sample 1e-4 -threads 20 -binary 1 -iter 100
Here -cbow 0 selects the skip-gram architecture, -hs 1 enables hierarchical softmax while -negative 0 disables negative sampling, -binary 1 writes binary output, and -iter 100 requests 100 passes over the corpus; -min-count sets the minimum word frequency and defaults to 5.
Training started at 05-10 09:34.
Training turned out to be extremely slow: after nearly 3 hours, only about 5% of this roughly 1 GB Chinese corpus had been processed, which is impractical. The exact cause remains to be investigated, though one plausible factor is that -iter 100 asks for 100 passes over the corpus, whereas the gensim run above used the default of 5.