Testing jieba's parallel processing. Note: parallel segmentation only supports the default tokenizers jieba.dt
and jieba.posseg.dt.
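Parallel mode can also be tuned and switched off at runtime. A minimal sketch (the process count of 4 and the sample sentence are arbitrary choices for illustration; parallel mode only works on POSIX systems, not Windows):

import jieba

# Enable parallel segmentation; the optional argument is the number of
# worker processes (defaults to the machine's CPU count if omitted).
jieba.enable_parallel(4)

print("/ ".join(jieba.cut("我爱北京天安门")))

# Switch back to normal single-process segmentation.
jieba.disable_parallel()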
import sys
import time
import jieba

jieba.enable_parallel()

#url = sys.argv[1]
content = open("/ssd/ailab-dataset/THUCNewsSubset/cnews.train.txt", "rb").read()

t1 = time.time()
words = "/ ".join(jieba.cut(content))
t2 = time.time()
tm_cost = t2 - t1

log_f = open("1.log", "wb")
log_f.write(words.encode('utf-8'))

print('speed %s bytes/second' % (len(content)/tm_cost))
Test results:
# With jieba.enable_parallel() commented out
[root@n6 jieba-parallel-test]# python test.py
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.289 seconds.
Prefix dict has been built succesfully.
speed 259919.622884 bytes/second

# With jieba.enable_parallel() enabled
[root@n6 jieba-parallel-test]# vi test.py
[root@n6 jieba-parallel-test]# vi test.py
[root@n6 jieba-parallel-test]# python test.py
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.263 seconds.
Prefix dict has been built succesfully.
speed 2215307.40079 bytes/second
With parallel mode enabled, it is much faster: roughly 8.5x, from about 260 KB/s to about 2.2 MB/s!
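Since the note at the top says jieba.posseg.dt also supports parallel mode, the same timing approach could be applied to POS tagging. A minimal sketch, reusing the file path from the script above (the "word/flag" output format here is my own choice, not part of the original test, and its throughput would differ from the plain-segmentation numbers):

import time
import jieba
import jieba.posseg as pseg

jieba.enable_parallel()

content = open("/ssd/ailab-dataset/THUCNewsSubset/cnews.train.txt", "rb").read()

t1 = time.time()
# pseg.cut yields pairs carrying .word and .flag; jieba decodes the raw
# bytes internally, as in the plain-segmentation test above.
words = " ".join("%s/%s" % (w.word, w.flag) for w in pseg.cut(content))
t2 = time.time()

print('posseg speed %s bytes/second' % (len(content) / (t2 - t1)))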