  • gensim word2vec in practice

    Corpus download link

    # -*- coding: utf-8 -*-
    
    import jieba
    import jieba.analyse
    
    # suggest_freq adjusts the frequency of a single word so that it can (or cannot) be segmented out as one token
    jieba.suggest_freq('沙瑞金', True)
    jieba.suggest_freq('田国富', True)
    jieba.suggest_freq('高育良', True)
    jieba.suggest_freq('侯亮平', True)
    jieba.suggest_freq('钟小艾', True)
    jieba.suggest_freq('陈岩石', True)
    jieba.suggest_freq('欧阳菁', True)
    jieba.suggest_freq('易学习', True)
    jieba.suggest_freq('王大路', True)
    jieba.suggest_freq('蔡成功', True)
    jieba.suggest_freq('孙连城', True)
    jieba.suggest_freq('季昌明', True)
    jieba.suggest_freq('丁义珍', True)
    jieba.suggest_freq('郑西坡', True)
    jieba.suggest_freq('赵东来', True)
    jieba.suggest_freq('高小琴', True)
    jieba.suggest_freq('赵瑞龙', True)
    jieba.suggest_freq('林华华', True)
    jieba.suggest_freq('陆亦可', True)
    jieba.suggest_freq('刘新建', True)
    jieba.suggest_freq('刘庆祝', True)
    
    with open('./in_the_name_of_people.txt', 'rb') as f:
        document = f.read()
    
    # jieba.cut returns a generator of tokens; join them with single spaces
    document_cut = jieba.cut(document)
    result = ' '.join(document_cut)
    
    # the with blocks close both files, so no explicit close() calls are needed
    with open('./in_the_name_of_people_segment.txt', 'wb') as f2:
        f2.write(result.encode('utf-8'))
    
    

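    Read the segmented file into memory. Here the LineSentence class provided by word2vec reads the file, and the result is then used to train a word2vec model. The main parameters are: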

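    • min_count: ignore all words whose total frequency is lower than this value
    • size: the dimensionality of the word vectors used in training, 100 by default (renamed vector_size in gensim 4.0)
    • window: the maximum distance between the current word and the predicted word within a sentence
    • hs: if 1, hierarchical softmax is used; if 0, negative sampling is used (provided negative is non-zero)

    # import modules & set up logging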
    import logging
    import os
    from gensim.models import word2vec
    
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    
    sentences = word2vec.LineSentence('./in_the_name_of_people_segment.txt')
    
    model = word2vec.Word2Vec(sentences, hs=1, min_count=1, window=3, size=100)
    
    2019-05-14 17:13:22,538 : INFO : collecting all words and their counts
    2019-05-14 17:13:22,540 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
    2019-05-14 17:13:22,593 : INFO : collected 17878 word types from a corpus of 161343 raw words and 2311 sentences
    2019-05-14 17:13:22,594 : INFO : Loading a fresh vocabulary
    2019-05-14 17:13:22,673 : INFO : effective_min_count=1 retains 17878 unique words (100% of original 17878, drops 0)
    2019-05-14 17:13:22,674 : INFO : effective_min_count=1 leaves 161343 word corpus (100% of original 161343, drops 0)
    2019-05-14 17:13:22,724 : INFO : deleting the raw counts dictionary of 17878 items
    2019-05-14 17:13:22,724 : INFO : sample=0.001 downsamples 38 most-common words
    2019-05-14 17:13:22,725 : INFO : downsampling leaves estimated 120578 word corpus (74.7% of prior 161343)
    2019-05-14 17:13:22,738 : INFO : constructing a huffman tree from 17878 words
    2019-05-14 17:13:23,069 : INFO : built huffman tree with maximum node depth 17
    2019-05-14 17:13:23,097 : INFO : estimated required memory for 17878 words and 100 dimensions: 33968200 bytes
    2019-05-14 17:13:23,098 : INFO : resetting layer weights
    2019-05-14 17:13:23,271 : INFO : training model with 3 workers on 17878 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=3
    2019-05-14 17:13:23,457 : INFO : worker thread finished; awaiting finish of 2 more threads
    2019-05-14 17:13:23,458 : INFO : worker thread finished; awaiting finish of 1 more threads
    2019-05-14 17:13:23,470 : INFO : worker thread finished; awaiting finish of 0 more threads
    2019-05-14 17:13:23,471 : INFO : EPOCH - 1 : training on 161343 raw words (120329 effective words) took 0.2s, 613072 effective words/s
    2019-05-14 17:13:23,655 : INFO : worker thread finished; awaiting finish of 2 more threads
    2019-05-14 17:13:23,658 : INFO : worker thread finished; awaiting finish of 1 more threads
    2019-05-14 17:13:23,676 : INFO : worker thread finished; awaiting finish of 0 more threads
    2019-05-14 17:13:23,677 : INFO : EPOCH - 2 : training on 161343 raw words (120484 effective words) took 0.2s, 592001 effective words/s
    2019-05-14 17:13:23,865 : INFO : worker thread finished; awaiting finish of 2 more threads
    2019-05-14 17:13:23,866 : INFO : worker thread finished; awaiting finish of 1 more threads
    2019-05-14 17:13:23,882 : INFO : worker thread finished; awaiting finish of 0 more threads
    2019-05-14 17:13:23,883 : INFO : EPOCH - 3 : training on 161343 raw words (120571 effective words) took 0.2s, 589983 effective words/s
    2019-05-14 17:13:24,065 : INFO : worker thread finished; awaiting finish of 2 more threads
    2019-05-14 17:13:24,075 : INFO : worker thread finished; awaiting finish of 1 more threads
    2019-05-14 17:13:24,084 : INFO : worker thread finished; awaiting finish of 0 more threads
    2019-05-14 17:13:24,085 : INFO : EPOCH - 4 : training on 161343 raw words (120615 effective words) took 0.2s, 600460 effective words/s
    2019-05-14 17:13:24,273 : INFO : worker thread finished; awaiting finish of 2 more threads
    2019-05-14 17:13:24,274 : INFO : worker thread finished; awaiting finish of 1 more threads
    2019-05-14 17:13:24,277 : INFO : worker thread finished; awaiting finish of 0 more threads
    2019-05-14 17:13:24,279 : INFO : EPOCH - 5 : training on 161343 raw words (120605 effective words) took 0.2s, 631944 effective words/s
    2019-05-14 17:13:24,279 : INFO : training on a 806715 raw words (602604 effective words) took 1.0s, 598553 effective words/s
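
    To make min_count and window more concrete, here is a minimal pure-Python sketch of what they control. It is not gensim's implementation: the helper names and the toy corpus are hypothetical, and it leaves out everything else the real trainer does, such as the sample downsampling visible in the log above.

```python
from collections import Counter


def read_sentences(lines):
    """Split each non-empty line on whitespace: one sentence per line,
    the same shape of data that LineSentence yields."""
    return [line.split() for line in lines if line.strip()]


def filter_by_min_count(sentences, min_count):
    """Drop tokens whose total corpus frequency is below min_count."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return [[t for t in s if counts[t] >= min_count] for s in sentences]


def context_pairs(sentence, window):
    """Yield (center, context) pairs that are at most `window` positions apart."""
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, sentence[j]


# Hypothetical toy corpus: space-separated tokens, one sentence per line
lines = ["a b c d e", "a b x"]
sentences = read_sentences(lines)

# With min_count=2 only 'a' and 'b' survive (each occurs twice)
print(filter_by_min_count(sentences, min_count=2))

# With window=3 the word 'a' is paired with 'b', 'c' and 'd', but not 'e'
print([ctx for center, ctx in context_pairs(sentences[0], window=3) if center == 'a'])
```

    The post trains with min_count=1, which keeps every token, so the log reports that 100% of the 17878 word types are retained.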
    

    Three-character words most similar to a given word

    req_count = 5
    for key in model.wv.similar_by_word('李达康', topn=100):
        if len(key[0]) == 3:
            req_count -= 1
            print(key[0], key[1])
            if req_count == 0:
                break
    
    2019-05-14 17:13:27,276 : INFO : precomputing L2-norms of word weight vectors
    
    
    赵东来 0.9634759426116943
    陆亦可 0.9602197408676147
    蔡成功 0.9589439034461975
    王大路 0.9569779634475708
    祁同伟 0.9561013579368591
    
    req_count = 5
    for key in model.wv.similar_by_word('赵东来', topn=100):
        if len(key[0]) == 3:
            req_count -= 1
            print(key[0], key[1])
            if req_count == 0:
                break
    
    李达康 0.9634760618209839
    陆亦可 0.9614400863647461
    易学习 0.9584609866142273
    祁同伟 0.9565587639808655
    王大路 0.9549983739852905
    
    req_count = 5
    for key in model.wv.similar_by_word('高育良', topn=100):
        if len(key[0]) == 3:
            req_count -= 1
            print(key[0], key[1])
            if req_count == 0:
                break
    
    沙瑞金 0.9721000790596008
    侯亮平 0.9408242702484131
    祁同伟 0.9268442392349243
    李达康 0.9241408705711365
    季昌明 0.913619339466095
    
    req_count = 5
    for key in model.wv.similar_by_word('沙瑞金', topn=100):
        if len(key[0]) == 3:
            req_count -= 1
            print(key[0], key[1])
            if req_count == 0:
                break
    
    高育良 0.9721001386642456
    李达康 0.9424692392349243
    易学习 0.9424353241920471
    无表情 0.9378770589828491
    祁同伟 0.9351213574409485
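
    The same filtering loop is repeated for each name above. One way to factor it out is sketched below. The helper name top_similar_of_length is hypothetical, and the sample pairs are copied from the 侯亮平 output that appears later in this post so that the snippet runs without a trained model.

```python
from itertools import islice


def top_similar_of_length(pairs, length=3, count=5):
    """Return the first `count` (word, score) pairs whose word has `length` characters."""
    matches = (pair for pair in pairs if len(pair[0]) == length)
    return list(islice(matches, count))


# (word, score) pairs in the format returned by similar_by_word;
# with a trained model one would pass model.wv.similar_by_word(word, topn=100).
sample = [
    ('祁同伟', 0.9691112041473389),
    ('陆亦可', 0.9684256911277771),
    ('季昌明', 0.9582957625389099),
    ('李达康', 0.952505886554718),
    ('她', 0.9482855200767517),
    ('他们', 0.9475176334381104),
    ('易学习', 0.9456426501274109),
    ('陈岩石', 0.9433715343475342),
]

for word, score in top_similar_of_length(sample, length=3, count=5):
    print(word, score)
```

    Because the input is already sorted by similarity, taking the first five matches gives the same result as the req_count countdown in the loops above.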
    

    Compute the similarity between two word vectors

    print(model.wv.similarity('沙瑞金', '高育良'))
    print(model.wv.similarity('李达康', '王大路'))
    
    0.9721002
    0.95697814
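
    The value returned by similarity is the cosine similarity of the two word vectors. A minimal sketch of that formula in plain Python follows; the three vectors are hypothetical toy values, not vectors taken from the trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional vectors
u = [1.0, 2.0, 2.0]
v = [2.0, 4.0, 4.0]   # same direction as u
w = [2.0, -1.0, 0.0]  # orthogonal to u

print(cosine_similarity(u, v))
print(cosine_similarity(u, w))
```

    Vectors pointing the same way score close to 1 and orthogonal vectors score 0, which is why the closely related names above all score above 0.9.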
    

    List the words most related to a given word

    try:
        # model.most_similar is deprecated; query the word vectors through model.wv
        sim3 = model.wv.most_similar('侯亮平', topn=20)
        print('Words related to 侯亮平:\n')
        for key in sim3:
            print(key[0], key[1])
    except KeyError:
        # raised when the word is not in the vocabulary
        print('error')
    
    Words related to 侯亮平:
    
    祁同伟 0.9691112041473389
    陆亦可 0.9684256911277771
    季昌明 0.9582957625389099
    李达康 0.952505886554718
    她 0.9482855200767517
    他们 0.9475176334381104
    易学习 0.9456426501274109
    陈岩石 0.9433715343475342
    马上 0.941593587398529
    高育良 0.9408242702484131
    郑西坡 0.9396289587020874
    王大路 0.9381627440452576
    沙瑞金 0.9350594282150269
    赵东来 0.9322312474250793
    陈海 0.9311630725860596
    司机 0.9282065033912659
    蔡成功 0.9281994104385376
    他 0.92684006690979
    组织 0.9237431287765503
    大家 0.9234919548034668
    

    Find the word that does not belong with the others

    print(model.wv.doesnt_match(u"沙瑞金 高育良 李达康 刘庆祝".split()))
    
    刘庆祝
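
    As far as I understand it, doesnt_match normalizes each word vector to unit length, averages them, and returns the word that is least similar to that average direction. The sketch below shows the idea in plain Python under that assumption. The words w1 to w4 and their 2-dimensional vectors are hypothetical, and this is not gensim's code.

```python
import math


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def doesnt_match(word_vectors):
    """Return the word whose unit vector has the lowest dot product
    with the mean direction of the whole group."""
    units = {word: unit(vec) for word, vec in word_vectors.items()}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dims)])
    return min(units, key=lambda word: sum(a * b for a, b in zip(units[word], mean)))


# Hypothetical 2-D vectors: three point roughly along the x axis, one along the y axis
toy = {
    'w1': [1.0, 0.1],
    'w2': [0.9, 0.2],
    'w3': [1.0, 0.0],
    'w4': [0.1, 1.0],
}

print(doesnt_match(toy))
```

    Here w4 is reported as the odd one out because it points away from the direction shared by the other three.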
    

    Save the model for later reuse

    model.save(u'人民的名义.model')
    
    2019-05-14 17:13:39,338 : INFO : saving Word2Vec object under 人民的名义.model, separately None
    2019-05-14 17:13:39,338 : INFO : not storing attribute vectors_norm
    2019-05-14 17:13:39,339 : INFO : not storing attribute cum_table
    2019-05-14 17:13:39,906 : INFO : saved 人民的名义.model
    

    Load the model

    model_2 = word2vec.Word2Vec.load('人民的名义.model')
    
    2019-05-14 17:13:42,714 : INFO : loading Word2Vec object from 人民的名义.model
    2019-05-14 17:13:42,942 : INFO : loading wv recursively from 人民的名义.model.wv.* with mmap=None
    2019-05-14 17:13:42,943 : INFO : setting ignored attribute vectors_norm to None
    2019-05-14 17:13:42,943 : INFO : loading vocabulary recursively from 人民的名义.model.vocabulary.* with mmap=None
    2019-05-14 17:13:42,944 : INFO : loading trainables recursively from 人民的名义.model.trainables.* with mmap=None
    2019-05-14 17:13:42,944 : INFO : setting ignored attribute cum_table to None
    2019-05-14 17:13:42,945 : INFO : loaded 人民的名义.model
    
    try:
        # the reloaded model answers the same query through model_2.wv
        sim3 = model_2.wv.most_similar('侯亮平', topn=20)
        print('Words related to 侯亮平:\n')
        for key in sim3:
            print(key[0], key[1])
    except KeyError:
        print('error')
    
    2019-05-14 17:14:02,083 : INFO : precomputing L2-norms of word weight vectors
    
    Words related to 侯亮平:
    
    祁同伟 0.9691112041473389
    陆亦可 0.9684256911277771
    季昌明 0.9582957625389099
    李达康 0.952505886554718
    她 0.9482855200767517
    他们 0.9475176334381104
    易学习 0.9456426501274109
    陈岩石 0.9433715343475342
    马上 0.941593587398529
    高育良 0.9408242702484131
    郑西坡 0.9396289587020874
    王大路 0.9381627440452576
    沙瑞金 0.9350594282150269
    赵东来 0.9322312474250793
    陈海 0.9311630725860596
    司机 0.9282065033912659
    蔡成功 0.9281994104385376
    他 0.92684006690979
    组织 0.9237431287765503
    大家 0.9234919548034668
    
    
    
  • Original article: https://www.cnblogs.com/chenxiangzhen/p/10863344.html