zoukankan      html  css  js  c++  java
  • 利用搜狐新闻语料库训练100维的word2vec——使用python中的gensim模块

      关于word2vec的原理知识参考文章https://www.cnblogs.com/Micang/p/10235783.html

      语料数据来自搜狐新闻2012年6月—7月期间国内,国际,体育,社会,娱乐等18个频道的新闻数据
      数据处理参考这篇文章

      模型训练:

    # -*- coding: utf-8-*-
    from gensim.models.word2vec import Word2Vec 
    sentences = [['A1','A2'],['A1','A3','A2']] 
    
    num=0
    with open('sohu_train.txt') as trainText:  #, encoding='utf-8'
        for line in trainText:
            id,catgre,body= line.split('^_^')
            words=body.replace('
    ','').split('    ')
            sentences.append(words)
            # if num>1000:break
            num+=1
            # print(sentences)
    
    model= Word2Vec(min_count=1)
    print("start train ...")
    model.build_vocab(sentences)
    model.train(sentences,total_examples = model.corpus_count,epochs = model.iter)
    print("train finished!",num)
    
    model.save('./sohu_model/Model')
    #model.save_word2vec_format('/tmp/mymodel.txt',binary = False)
    #model.save_word2vec_format('/tmp/mymodel.bin.gz',binary = True)
    #前一组方法保存的文件不能利用文本编辑器查看但是保存了训练的全部信息,可以在读取后追加训练
    #后一组方法保存为word2vec文本格式但是保存时丢失了词汇树等部分信息,不能追加训练
    print("save finished!")
    

      模型使用:

    # #模型使用
    model = Word2Vec.load('./sohu_model/Model')
    print("load model sesuess!")
    # model.most_similar(['北京'])
    
    print u'most similar with 北京:'
    for i in model.most_similar("北京"): #计算余弦距离最接近“北京”的10个词
        print i[0].decode('utf-8'),i[1]
    
    print u'皇帝+女性-男性:'
    for i in model.most_similar(positive = ['皇帝','女性'],negative = ['男性'],topn = 3):print i[0].decode('utf-8'),i[1]
    
    print u'手机+移动-智能:'
    for i in model.most_similar(positive = ['手机','移动'],negative = ['智能'],topn = 3):print i[0].decode('utf-8'),i[1]
    
    print u'电影+科幻-剧情:'
    for i in model.most_similar(positive = ['电影','科幻'],negative = ['剧情'],topn = 3):print i[0].decode('utf-8'),i[1]
    
    print u'北京 vector:'
    print model['北京']

      输出:

    C:Python27libsite-packagesgensimutils.py:1212: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
      warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
    load model sesuess!
    most similar with 北京:
    C:Python27libsite-packagesgensimmatutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int32 == np.dtype(int).type`.
      if np.issubdtype(vec.dtype, np.int):
    南京 0.670382142067
    上海 0.661236405373
    成都 0.639219224453
    杭州 0.63784122467
    广州 0.631313323975
    深圳 0.624626278877
    武汉 0.624594151974
    昆明 0.620243370533
    长春 0.61394149065
    长沙 0.60389906168
    皇帝+女性-男性:
    哥 0.60431176424
    魔术师 0.586149096489
    魔女 0.581812143326
    手机+移动-智能:
    智能手机 0.605030536652
    互联网 0.54615008831
    苹果 0.539426982403
    电影+科幻-剧情:
    纪录片 0.648482918739
    动画 0.639703273773
    迪斯尼 0.61851131916
    北京 vector:
    [-0.08981118  0.18538047 -4.7453156  -1.7730242   2.0390635   2.6085184
      5.088326    2.8057106   2.6798103  -1.4660915   2.778077    2.4279277
      0.69682086 -3.0003173   2.1341784   0.32419717 -5.2817945   0.18809023
     -1.3016417   3.8344557  -0.87402123 -0.26100433  2.8857462  -2.725345
     -2.5024219  -0.70686543 -0.4838663  -2.2535524   0.23617841  3.329134
      3.9053504  -1.9609474  -3.4581995   1.2530506  -2.079397    1.6266809
      0.23296945  1.4600109  -1.9104419   0.80835503 -0.13650164  3.355157
      2.4561696   0.6016032  -1.0312346   1.6474588   1.320931    1.4579619
      1.8017172  -3.5526018   1.2293625   4.798621   -3.5554793   0.5800354
      3.7429204  -0.4906999  -1.3069346  -1.0603447  -0.95469594 -0.35445935
     -1.7658769  -3.2370284  -2.2224278  -0.56134427 -0.46095294  2.8492029
      2.7202766  -3.3692176   1.1739812  -1.9770668   0.37050596  1.1764477
     -0.27834406  5.033905    0.09570877 -0.5670941  -2.1803875  -0.9094422
      1.0485793   0.03497482 -2.07145    -0.8045679  -1.8192968   2.6160874
      0.5630188  -0.45463613 -0.22750562  2.2233796   3.4276621  -0.8689221
      1.5558586  -0.39026013 -1.1843458  -3.378433   -4.2200727   1.6359595
      2.27458    -1.6011585  -0.89109504  2.3993087 ]
  • 相关阅读:
    HDU——T 3342 Legal or Not
    Web框架本质
    February 5 2017 Week 6 Sunday
    February 4 2017 Week 5 Saturday
    February 3 2017 Week 5 Friday
    February 2 2017 Week 5 Thursday
    February 1 2017 Week 5 Wednesday
    January 31 2017 Week 5 Tuesday
    January 30 2017 Week 5 Monday
    January 29 2017 Week 5 Sunday
  • 原文地址:https://www.cnblogs.com/Micang/p/10367603.html
Copyright © 2011-2022 走看看