  • word2vec: Basic installation and usage

    The official word2vec code can be downloaded from GitHub: https://github.com/svn2github/word2vec

    Environment: Linux (Ubuntu 14.04 LTS), with git installed and gcc 4.8.4.

    Installation on Linux:

    % git clone https://github.com/svn2github/word2vec.git

    % cd word2vec

    % make

    Command-line options:

    -train <file>
      Use text data from <file> to train the model
    -output <file>
      Use <file> to save the resulting word vectors / word clusters
    -size <int>
      Set size of word vectors; default is 100
    -window <int>
      Set max skip length between words; default is 5
    -sample <float>
      Set threshold for occurrence of words. Those that appear with higher frequency in the training data
      will be randomly down-sampled; default is 1e-3, useful range is (0, 1e-5)
    -hs <int>
      Use Hierarchical Softmax; default is 0 (not used)
    -negative <int>
      Number of negative examples; default is 5, common values are 3 - 10 (0 = not used)
    -threads <int>
      Use <int> threads (default 12)
    -iter <int>
      Run more training iterations (default 5)
    -min-count <int>
      This will discard words that appear less than <int> times; default is 5
    -alpha <float>
      Set the starting learning rate; default is 0.025 for skip-gram and 0.05 for CBOW
    -classes <int>
      Output word classes rather than word vectors; default number of classes is 0 (vectors are written)
    -debug <int>
      Set the debug mode (default = 2 = more info during training)
    -binary <int>
      Save the resulting vectors in binary moded; default is 0 (off)
    -save-vocab <file>
      The vocabulary will be saved to <file>
    -read-vocab <file>
      The vocabulary will be read from <file>, not constructed from the training data
    -cbow <int>
      Use the continuous bag of words model; default is 1 (use 0 for skip-gram model)
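
    To make the -sample threshold more concrete, here is a minimal Python sketch of the down-sampling rule as I understand it from word2vec.c. The function name keep_probability and its parameters are hypothetical and are not part of the tool; treat the formula as an illustration, not a reference.

```python
import math


def keep_probability(count, total_words, sample=1e-3):
    """Approximate probability that one occurrence of a word is kept.

    count       -- occurrences of the word in the corpus (hypothetical name)
    total_words -- total number of training words (hypothetical name)
    sample      -- value passed to -sample; 0 disables down-sampling
    """
    if sample <= 0:
        return 1.0
    # Ratio of the word's relative frequency to the threshold.
    ratio = count / (sample * total_words)
    # Same shape as the check in word2vec.c: (sqrt(ratio) + 1) / ratio.
    prob = (math.sqrt(ratio) + 1.0) / ratio
    # Rare words give values above 1, which means "always keep".
    return min(1.0, prob)


if __name__ == "__main__":
    # A word making up 10% of a 1M-word corpus is kept only sometimes,
    # while a word seen 10 times is always kept.
    print(keep_probability(100000, 1000000, 1e-3))
    print(keep_probability(10, 1000000, 1e-3))
```

    With a smaller -sample value (for example 1e-4, as in the training command in this post), frequent words are dropped more aggressively.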

    Next, prepare the training corpus: concatenate the word-segmented text into a single line, then train:

    ./word2vec -train fudan_corpus_final -output fudan_100_skip.bin -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -binary 1 -sample 1e-4 -threads 20 -iter 15
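
    The corpus preparation step can be sketched in Python as follows. This is only one possible way to do it: it assumes the segmented text is already space-separated, one sentence or document per line. The function names and the input file name are hypothetical.

```python
def join_segmented_lines(lines):
    """Collapse segmented lines into one space-separated string."""
    tokens = []
    for line in lines:
        # split() drops newlines, blank lines and repeated spaces.
        tokens.extend(line.split())
    return " ".join(tokens)


def write_single_line_corpus(src_path, dst_path, encoding="utf-8"):
    """Stream src_path into dst_path as a single line of tokens."""
    with open(src_path, encoding=encoding) as src, \
            open(dst_path, "w", encoding=encoding) as dst:
        first = True
        for line in src:
            for token in line.split():
                if not first:
                    dst.write(" ")
                dst.write(token)
                first = False
        dst.write("\n")


# Example call (the input name is an assumption; the output name matches
# the -train argument used in this post):
# write_single_line_corpus("fudan_corpus_segmented.txt", "fudan_corpus_final")
```

    The streaming version avoids holding the whole corpus in memory, which matters for large corpora.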

    The generated "fudan_100_skip.bin" file can be converted to plain text with gensim:

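    from gensim.models import KeyedVectors

    # Newer gensim releases expose load_word2vec_format on KeyedVectors
    # rather than on Word2Vec. Replace the placeholder paths with your own
    # files, e.g. fudan_100_skip.bin.
    model = KeyedVectors.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)
    # binary=False writes the vectors in plain-text form.
    model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)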

    Note: on Windows, first switch to the gensim environment (activate gensim) before running the code.

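    However, reading the file with gensim as shown above did not work for me, so I used a native approach instead, following http://stackoverflow.com/questions/27324292/convert-word2vec-bin-file-to-text

    Copy the C code from that link and save it as readbin.c. Because the code uses the math library, compile it with -lm:

    % gcc -o readbin readbin.c -lm

    Then convert the bin file to a txt file:

    % ./readbin fudan_100_skip.bin fudan_100.txt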
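
    For readers who prefer not to compile C, the same conversion can be sketched in plain Python. This is not the code from the Stack Overflow answer. It assumes the usual layout written with -binary 1: a text header "vocab_size dim", then for each word its bytes, a space, dim 32-bit floats, and a newline. It also assumes little-endian floats, as produced on x86. The function names are hypothetical.

```python
import struct


def read_word2vec_bin(path):
    """Yield (word, vector) pairs from a word2vec binary file."""
    with open(path, "rb") as f:
        vocab_size, dim = (int(x) for x in f.readline().split())
        for _ in range(vocab_size):
            chars = bytearray()
            while True:
                ch = f.read(1)
                if not ch:
                    raise EOFError("file ended before all words were read")
                if ch == b" ":
                    break
                if ch != b"\n":  # skip the newline left after the previous vector
                    chars.extend(ch)
            # Assumption: little-endian 4-byte floats.
            vector = struct.unpack("<%df" % dim, f.read(4 * dim))
            yield chars.decode("utf-8", errors="replace"), vector


def convert_bin_to_txt(bin_path, txt_path):
    """Write one 'word v1 v2 ...' line per word (no header line)."""
    with open(txt_path, "w", encoding="utf-8") as out:
        for word, vector in read_word2vec_bin(bin_path):
            out.write(word + " " + " ".join("%.6f" % x for x in vector) + "\n")


# Example call with the file names used in this post:
# convert_bin_to_txt("fudan_100_skip.bin", "fudan_100.txt")
```

    The output omits the "vocab_size dim" header line, so add it back if another tool expects the standard word2vec text format.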
  • Original article: https://www.cnblogs.com/ooon/p/6413065.html