  • Training Chinese Word Vectors

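    1. Good pretrained English word vectors are available from GloVe: https://nlp.stanford.edu/projects/glove/

    To use them, prepend a header line giving the number of rows and the vector dimension; gensim can then load the file.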

    # Prepend the header "<rows> <dim>" from the shell first (GNU sed):
    # sed -i '1i 400000 300' glove.6b.300d.txt

    from gensim.models.keyedvectors import KeyedVectors

    model = KeyedVectors.load_word2vec_format('glove.6b.300d.txt', binary=False)

    # Get the most similar words
    for w, s in model.most_similar('apple', topn=5):
        print(w, s)

    # Get the vector for a word
    print(model['apple'])
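    The hard-coded `400000 300` only fits that one file. A minimal sketch of computing the header instead, using only the standard library, might look like the following. The function name `add_word2vec_header` and the toy file names are assumptions made for illustration; they are not part of gensim or GloVe.

```python
import os
import tempfile


def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe text file to out_path, prepending a '<rows> <dim>' header line."""
    rows, dim = 0, 0
    # First pass: count non-blank lines and read the dimension from the first one.
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if not line.strip():
                continue
            if rows == 0:
                # Each line looks like: token v1 v2 ... vN
                dim = len(line.rstrip("\n").split(" ")) - 1
            rows += 1
    # Second pass: write the header, then copy the vectors unchanged.
    with open(glove_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        dst.write("%d %d\n" % (rows, dim))
        for line in src:
            if line.strip():
                dst.write(line)
    return rows, dim


# Tiny made-up file standing in for the real GloVe download.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "toy_glove.txt")
dst_path = os.path.join(tmp_dir, "toy_glove.w2v.txt")
with open(src_path, "w", encoding="utf-8") as f:
    f.write("apple 0.1 0.2 0.3\nbanana 0.4 0.5 0.6\n")

print(add_word2vec_header(src_path, dst_path))  # (2, 3)
```

    Unlike `sed -i`, this writes a new file and leaves the original download untouched, so the step can be rerun safely.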

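    2. I looked at many of the Chinese word vectors available online and none were satisfactory, so I trained my own. It is not very complicated.

    2.1 Build a Chinese corpus. Recommended download: http://www.sogou.com/labs/resource/list_news.php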

    # Sohu news, 2.1G
    # Assumes the archive unpacks to news_sohusite_xml.dat; adjust the name if yours differs.
    tar -zxvf news_sohusite_xml.full.tar.gz
    cat news_sohusite_xml.dat | iconv -f gb18030 -t utf-8 | grep "<content>" > news_sohusite.txt
    # Use '#' as the sed delimiter because the closing tag contains '/'.
    sed -i 's#<content>##g' news_sohusite.txt
    sed -i 's#</content>##g' news_sohusite.txt
    python -m jieba -d ' ' news_sohusite.txt > news_sohusite_cutword.txt

    # Whole-web news, 1.8G
    # Assumes the archive unpacks to news_tensite_xml.dat; adjust the name if yours differs.
    tar -zxvf news_tensites_xml.full.tar.gz
    cat news_tensite_xml.dat | iconv -f gb18030 -t utf-8 | grep "<content>" > news_tensite.txt
    sed -i 's#<content>##g' news_tensite.txt
    sed -i 's#</content>##g' news_tensite.txt
    python -m jieba -d ' ' news_tensite.txt > news_tensite_cutword.txt

    # Any other corpus that matches your own business needs, e.g. company profiles
    python -m jieba -d ' ' other_entdesc.txt > other_entdesc_cutword.txt

    # Merge the segmented corpora
    cat news_sohusite_cutword.txt news_tensite_cutword.txt other_entdesc_cutword.txt > w2v_chisim_corpus.txt
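    The iconv, grep and sed steps above can also be sketched in Python, which avoids editing files in place. This is only an illustration under assumptions: the function name `extract_content` and the toy `<doc>` layout are made up for the demo, and, unlike the shell pipeline, the sketch drops articles whose `<content>` is empty. Word segmentation with jieba would still run afterwards as in the commands above.

```python
import os
import re
import tempfile

# Matches the article body on a single line; <contenttitle> does not match.
CONTENT_RE = re.compile(r"<content>(.*?)</content>")


def extract_content(dat_path, out_path, encoding="gb18030"):
    """Write the text inside each <content>...</content> line to out_path as UTF-8."""
    kept = 0
    # errors="ignore" skips bytes that cannot be decoded instead of aborting.
    with open(dat_path, encoding=encoding, errors="ignore") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            match = CONTENT_RE.search(line)
            if match and match.group(1).strip():
                dst.write(match.group(1).strip() + "\n")
                kept += 1
    return kept


# Tiny made-up input: one article with text, one with an empty body.
tmp_dir = tempfile.mkdtemp()
dat_path = os.path.join(tmp_dir, "toy_news.dat")
txt_path = os.path.join(tmp_dir, "toy_news.txt")
with open(dat_path, "w", encoding="gb18030") as f:
    f.write("<doc>\n<url>http://example.com/1</url>\n")
    f.write("<content>今天天气很好</content>\n</doc>\n")
    f.write("<doc>\n<content></content>\n</doc>\n")

print(extract_content(dat_path, txt_path))  # 1
```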

    2.2 Train with the gensim library

    #!/usr/bin/env python

    from gensim.models.word2vec import Word2Vec
    from gensim.models.word2vec import LineSentence

    sentences = LineSentence('w2v_chisim_corpus.txt')
    # sg=0 trains with CBOW; sg=1 (skip-gram) is more sensitive to low-frequency words.
    # gensim >= 4.0 names the dimension vector_size; older versions call it size.
    model = Word2Vec(sentences, vector_size=300, window=8, min_count=10, sg=1, workers=4)
    model.save('w2v_chisim.300d.txt')

    # Query words: apple, China, Sun Yat-sen University
    for word in ('苹果', '中国', '中山大学'):
        for w, s in model.wv.most_similar(word):
            print(w, s)
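    To make the `sg` and `window` parameters more concrete, here is a simplified sketch of the training examples each mode builds from one segmented sentence. The helper names `skipgram_pairs` and `cbow_examples` are made up for illustration and are not gensim functions; real implementations add details such as random window shrinking, subsampling and negative sampling, all omitted here.

```python
def skipgram_pairs(tokens, window):
    """sg=1: one (center, context) pair per neighbor within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """sg=0: one (context_words, center) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


# A made-up segmented sentence: "I like eating apples".
sentence = ["我", "喜欢", "吃", "苹果"]
print(len(skipgram_pairs(sentence, window=1)))  # 6
print(cbow_examples(sentence, window=1)[1])     # (['我', '吃'], '喜欢')
```

    In skip-gram every occurrence of a word produces several separate pairs, while CBOW pools the context into a single example per position. That is one common explanation for why `sg=1` tends to treat rare words better, at the cost of slower training.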

    See, pretty simple, isn't it? It's your show time now, good luck!

  • Original article: https://www.cnblogs.com/jkmiao/p/7007763.html