  • Training word vectors with gensim word2vec

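    A quick note to myself so I don't have to hunt for this again.

    The corpus is an NER corpus of ancient Chinese texts; the code is adapted from other blog posts.

    I first extract the annotated entities and add them to jieba as a custom dictionary, then load the stop words, segment the text, and finally train the word vectors. I don't know yet how good the result is; more on that later.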
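
    The entity-extraction step itself is not shown in this note. Below is a minimal sketch of it, assuming the annotated corpus is character-level BIO data (one "char tag" pair per line, blank line between sentences). The real corpus format is not stated here, so the format, the helper names extract_entities and write_userdict, and the sample tags are all assumptions for illustration.

```python
def extract_entities(lines):
    """Collect entity strings from character-level BIO-tagged lines.

    Assumed (hypothetical) format: one "char tag" pair per line,
    tags such as B-PER / I-PER / O, blank line between sentences.
    """
    entities = []
    current = []  # characters of the entity being built
    for line in lines:
        parts = line.split()
        # Blank or malformed lines are treated as "outside any entity"
        tag = parts[1] if len(parts) == 2 else "O"
        if tag.startswith("B-"):
            if current:
                entities.append("".join(current))
            current = [parts[0]]
        elif tag.startswith("I-") and current:
            current.append(parts[0])
        else:
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    # Deduplicate while keeping first-seen order
    return list(dict.fromkeys(entities))


def write_userdict(entities, path="cidian.txt"):
    # jieba's user dictionary takes one word per line
    # (frequency and POS tag are optional extra columns)
    with open(path, "w", encoding="utf8") as f:
        for word in entities:
            f.write(word + "\n")


# Made-up sample data, only to show the expected input shape
sample = ["孔 B-PER", "子 I-PER", "曰 O", "",
          "鲁 B-LOC", "国 I-LOC", "孔 B-PER", "子 I-PER"]
entities = extract_entities(sample)
```

    Calling write_userdict(entities) would then produce a file that jieba.load_userdict can read.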

    # Load the custom dictionary (entities extracted from the NER annotations)
    import jieba

    jieba.load_userdict("cidian.txt")
    
    # Load the stop words into a set for fast membership tests
    def getStopwords():
        with open("stopwords.txt", "r", encoding='utf8') as f:
            return {line.strip() for line in f}
    stopwords = getStopwords()
    
    # Segment the corpus, one sentence per line
    def segment():
        # 'w' rather than 'a', so that re-running does not append duplicate text
        with open('guwen.txt', encoding='utf8') as f, \
                open('fenci.txt', 'w', encoding='utf8') as segment_file:
            for sentence in f:
                words = jieba.cut(sentence.strip())
                # Drop stop words and whitespace-only tokens
                sentence_segment = [w for w in words
                                    if w.strip() and w not in stopwords]
                # Write one space-separated sentence per line
                segment_file.write(" ".join(sentence_segment) + "\n")

    segment()
    
    # Train the word vectors
    import logging
    import os.path
    import sys

    from gensim.models import Word2Vec
    from gensim.models.word2vec import PathLineSentences

    if __name__ == '__main__':
        program = os.path.basename(sys.argv[0])
        logger = logging.getLogger(program)
        logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
        logging.root.setLevel(level=logging.INFO)
        logger.info("running %s" % ' '.join(sys.argv))
        input_file = 'fenci.txt'         # segmented corpus written by segment()
        outp1 = 'guwen.model'            # full gensim model
        outp2 = 'guwen_word2vec_format'  # plain-text word2vec vectors
        # Train the model: embedding size 256, context window 10,
        # drop words that occur fewer than 5 times, 10 training epochs.
        # PathLineSentences accepts a single file or a directory of files.
        # Note: size= and iter= are the gensim 3.x names; gensim 4.0+
        # renamed them to vector_size= and epochs=.
        model = Word2Vec(PathLineSentences(input_file),
                         size=256, window=10, min_count=5,
                         iter=10)
        model.save(outp1)
        model.wv.save_word2vec_format(outp2, binary=False)
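
    Since the quality of the vectors is still unknown, one quick sanity check is to look at a word's nearest neighbors. The sketch below reads the text file written by save_word2vec_format(..., binary=False) using only the standard library. The helper names load_vectors, cosine and nearest are hypothetical, not part of gensim; with gensim itself, model.wv.most_similar(word) does the same job.

```python
import math


def load_vectors(path):
    """Read word2vec text format: a 'vocab_size dim' header line,
    then one 'word v1 ... vdim' line per word."""
    vectors = {}
    with open(path, encoding="utf8") as f:
        vocab_size, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            if len(values) != dim:
                continue  # skip malformed lines
            vectors[word] = [float(v) for v in values]
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(vectors, word, topn=5):
    """Return the topn (word, similarity) pairs closest to `word`."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]
```

    For example, vectors = load_vectors('guwen_word2vec_format') followed by nearest(vectors, w) for some word w that survived the min_count filter; the word has to be in the vocabulary, otherwise the lookup raises KeyError.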
  • Original article: https://www.cnblogs.com/harbin-ho/p/13311600.html