  • A first look at word2vec



    Installing gensim is simple: just run pip install gensim.


    class gensim.models.word2vec.Word2Vec(sentences=None, corpus_file=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000, compute_loss=False, callbacks=(), max_final_vocab=None)

    • sentences (iterable of iterables, optional) – The sentences iterable can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in word2vec module for such examples. See also the tutorial on data streaming in Python. If you don’t supply sentences, the model is left uninitialized – use if you plan to initialize it in some other way.
    • corpus_file (str, optional) – Path to a corpus file in LineSentence format. You may use this argument instead of sentences to get performance boost. Only one of sentences or corpus_file arguments need to be passed (or none of them).
    • size (int, optional) – Dimensionality of the word vectors.
    • window (int, optional) – Maximum distance between the current and predicted word within a sentence.
    • min_count (int, optional) – Ignores all words with total frequency lower than this.
    • workers (int, optional) – Use these many worker threads to train the model (=faster training with multicore machines).
    • sg ({0, 1}, optional) – Training algorithm: 1 for skip-gram; otherwise CBOW.
    • hs ({0, 1}, optional) – If 1, hierarchical softmax will be used for model training. If 0, and negative is non-zero, negative sampling will be used.
    • negative (int, optional) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.
    • ns_exponent (float, optional) – The exponent used to shape the negative sampling distribution. A value of 1.0 samples exactly in proportion to the frequencies, 0.0 samples all words equally, while a negative value samples low-frequency words more than high-frequency words. The popular default value of 0.75 was chosen by the original Word2Vec paper. More recently, in https://arxiv.org/abs/1804.04212, Caselles-Dupré, Lesaint, & Royo-Letelier suggest that other values may perform better for recommendation applications.
    • cbow_mean ({0, 1}, optional) – If 0, use the sum of the context word vectors. If 1, use the mean, only applies when cbow is used.
    • alpha (float, optional) – The initial learning rate.
    • min_alpha (float, optional) – Learning rate will linearly drop to min_alpha as training progresses.
    • seed (int, optional) – Seed for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread (workers=1), to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization).
    • max_vocab_size (int, optional) – Limits the RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit.
    • max_final_vocab (int, optional) – Limits the vocab to a target vocab size by automatically picking a matching min_count. If the specified min_count is more than the calculated min_count, the specified min_count will be used. Set to None if not required.
    • sample (float, optional) – The threshold for configuring which higher-frequency words are randomly downsampled, useful range is (0, 1e-5).
    • hashfxn (function, optional) – Hash function to use to randomly initialize weights, for increased training reproducibility.
    • iter (int, optional) – Number of iterations (epochs) over the corpus.
    • trim_rule (function, optional) –

      Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, look to keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model.

      The input parameters are of the following types:
      • word (str) - the word we are examining
      • count (int) - the word’s frequency count in the corpus
      • min_count (int) - the minimum count threshold.
    • sorted_vocab ({0, 1}, optional) – If 1, sort the vocabulary by descending frequency before assigning word indexes. See sort_vocab().
    • batch_words (int, optional) – Target size (in words) for batches of examples passed to worker threads (and thus cython routines). (Larger batches will be passed if individual texts are longer than 10000 words, but the standard cython code truncates to that maximum.)
    • compute_loss (bool, optional) – If True, computes and stores loss value which can be retrieved using get_latest_training_loss().
    • callbacks (iterable of CallbackAny2Vec, optional) – Sequence of callbacks to be executed at specific stages during training.
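
    To make the ns_exponent description concrete, here is a minimal sketch of how raising word counts to an exponent reshapes a sampling distribution. It is plain Python, not gensim's internal code; the function name noise_distribution and the toy counts are made up for illustration.

```python
def noise_distribution(counts, ns_exponent=0.75):
    """Return word -> sampling probability, proportional to count ** ns_exponent."""
    weights = {word: count ** ns_exponent for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


# Toy frequency counts (made up for illustration).
counts = {"the": 900, "family": 90, "zeugma": 10}

proportional = noise_distribution(counts, ns_exponent=1.0)  # follows raw frequency
uniform = noise_distribution(counts, ns_exponent=0.0)       # every word equally likely
default = noise_distribution(counts, ns_exponent=0.75)      # somewhere in between

for name, dist in [("1.0", proportional), ("0.0", uniform), ("0.75", default)]:
    print(name, {w: round(p, 3) for w, p in dist.items()})
```

    With an exponent below 1.0, rare words such as 'zeugma' are drawn more often than their raw frequency would suggest, while frequent words are drawn less often.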

    sentences: the corpus we want to analyze.



    window: the maximum context distance for a word vector. The default is 5.
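
    As a rough illustration of what window controls, the sketch below lists the (center, context) pairs that fall within a fixed window in one sentence. It is plain Python with a hypothetical helper, context_pairs; gensim's actual training code is different, and since window is a maximum distance, the real implementation may use a smaller effective window for some positions.

```python
def context_pairs(sentence, window=5):
    """Yield (center, context) pairs whose positions are at most `window` apart."""
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                yield center, sentence[j]


sentence = ["do", "you", "know", "word2vec"]
pairs = list(context_pairs(sentence, window=1))
print(pairs)
```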



    Word2VecKeyedVectors – This object essentially contains the mapping between words and embeddings. After training, it can be used directly to query those embeddings in various ways. See the module level docstring for examples.

    Word2VecVocab – This object represents the vocabulary (sometimes called Dictionary in gensim) of the model. Besides keeping track of all unique words, this object provides extra functionality, such as constructing a huffman tree (frequent words are closer to the root), or discarding extremely rare words.


    Word2VecTrainables – This object represents the inner shallow neural network used to train the embeddings. The semantics of the network differ slightly in the two available training modes (CBOW or SG) but you can think of it as a NN with a single projection and hidden layer which we train on the corpus. The weights are then used as our embeddings (which means that the size of the hidden layer is equal to the number of features self.size).
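
    The remark that a Huffman tree keeps frequent words closer to the root can be checked with a small standalone sketch. This is ordinary Huffman construction with Python's heapq, not gensim's own code (which may differ in details); huffman_depths and the toy frequencies are made up for illustration.

```python
import heapq
from itertools import count


def huffman_depths(frequencies):
    """Return word -> depth in a Huffman tree built from word frequencies."""
    tie_breaker = count()  # keeps heap entries comparable when frequencies tie
    heap = [(freq, next(tie_breaker), {word: 0}) for word, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        freq_a, _, depths_a = heapq.heappop(heap)
        freq_b, _, depths_b = heapq.heappop(heap)
        # Merging two subtrees pushes every word in them one level deeper.
        merged = {w: d + 1 for w, d in {**depths_a, **depths_b}.items()}
        heapq.heappush(heap, (freq_a + freq_b, next(tie_breaker), merged))
    return heap[0][2]


# Toy frequencies (made up for illustration).
frequencies = {"the": 50, "family": 20, "parents": 15, "zeugma": 5}
depths = huffman_depths(frequencies)
print(depths)
```

    The most frequent word ends up with the smallest depth, which is why hierarchical softmax needs fewer steps for common words.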

    Watch out for the following issue. I fell into this trap the first time I used the word2vec API.


    For example, given sentences = [['first', 'sentence'], ['second', 'sentence']],

    word2vec produces word vectors for the words 'first', 'sentence', and 'second'.

    With sentences = [['first sentence'], ['second sentence']],

    word2vec produces word vectors for 'first sentence' and 'second sentence'. Here word2vec treats 'first sentence' and 'second sentence' each as a single word.

    With sentences = ['first', 'sentence'], 'first' is treated as one sentence and 'sentence' as another. The words of 'first' are then 'f', 'i', 'r', 's', 't', so the resulting vocabulary contains 'f', 'i', 'r', 's', 't', ... and not 'first'. See this Stack Overflow answer for details.
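
    The three cases differ only in what Python yields when the input is iterated twice: once for sentences, once for words. The sketch below mimics that with plain Python (no gensim involved; vocabulary_of is a hypothetical helper) to show which "words" each input shape would produce.

```python
def vocabulary_of(sentences):
    """Collect tokens the way an 'iterable of iterables of tokens' is read:
    the outer loop yields sentences, the inner loop yields words."""
    vocab = set()
    for sentence in sentences:
        for word in sentence:
            vocab.add(word)
    return vocab


tokenized = [["first", "sentence"], ["second", "sentence"]]
untokenized = [["first sentence"], ["second sentence"]]
flat = ["first", "sentence"]  # a str is itself iterable, character by character

print(sorted(vocabulary_of(tokenized)))    # ['first', 'second', 'sentence']
print(sorted(vocabulary_of(untokenized)))  # ['first sentence', 'second sentence']
print(sorted(vocabulary_of(flat)))         # ['c', 'e', 'f', 'i', 'n', 'r', 's', 't']
```

    If the corpus starts out as raw strings, splitting each one into tokens (for example with str.split()) before passing it in avoids the second and third cases.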


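    from gensim.models import word2vec

    X_all = train_words + test_words
    # size= is the gensim 3.x name; gensim 4.0 renamed it to vector_size=
    model = word2vec.Word2Vec(X_all, min_count=1, window=5, size=100)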

    Here X_all looks like [ ['i','love','you'], ['do','you','know'] ]. This turns the words that appear in X_all into their corresponding vectors.

    We can get the vector for a word with model.wv['love']. wv is a key-value structure mapping word --> vector.


    print(model.wv.similar_by_word('family'))       # the 10 words most similar to 'family'
    print(model.wv.similarity('family', 'parents')) # similarity between the two words
    [('parents', 0.6177123785018921), ('father', 0.5987046957015991), ('families', 0.5883874297142029), ('mother', 0.5699872970581055), ('children', 0.5613149404525757), ('parent', 0.5575612783432007), ('community', 0.5537818074226379), ('friendship', 0.5431720018386841), ('life', 0.5359925627708435), ('wife', 0.5311812162399292)]
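
    The similarity scores above are cosine similarities between word vectors. A minimal plain-Python sketch of that measure, with made-up 3-dimensional vectors standing in for real embeddings (cosine_similarity here is an illustrative helper, not a gensim function):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors, not taken from a trained model.
family = [1.0, 2.0, 0.0]
parents = [2.0, 4.0, 0.0]    # same direction as `family`
unrelated = [0.0, 0.0, 3.0]  # orthogonal to `family`

print(cosine_similarity(family, parents))
print(cosine_similarity(family, unrelated))
```

    Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.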


    model = word2vec.Word2Vec.load('./words.model')   # load a saved word-vector model



    model = gensim.models.Word2Vec.load('/tmp/mymodel')



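    With word2vec, however, suppose a sentence has 50 words and each word becomes a 100-dimensional vector. Substituting the vectors in directly would give every sentence 5000 features, so M samples would form an M*5000 matrix. That dimensionality is too high and would slow down model training considerably, so this is clearly not the way to do it. One alternative is to average the word vectors of each sentence, which gives one 100-dimensional vector per sentence: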

    import numpy as np

    X_all_new = []
    for sent in X_all:
      # average the vectors of the words that are in the model's vocabulary
      X_all_new.append(np.mean([model.wv[w] for w in sent if w in model.wv], axis=0))
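
    One caveat with the loop above: a sentence with no in-vocabulary words hands np.mean an empty list. The sketch below is one possible variant that falls back to a zero vector in that case. It assumes NumPy is installed and uses a plain dict of made-up 3-dimensional vectors in place of model.wv; sentence_vector is a hypothetical helper.

```python
import numpy as np


def sentence_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[w] for w in tokens if w in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Made-up toy embeddings standing in for model.wv.
vectors = {
    "i": np.array([1.0, 0.0, 0.0]),
    "love": np.array([0.0, 1.0, 0.0]),
    "you": np.array([0.0, 0.0, 1.0]),
}

X_all = [["i", "love", "you"], ["unknown", "words"]]
X_all_new = np.vstack([sentence_vector(sent, vectors, dim=3) for sent in X_all])
print(X_all_new.shape)  # (2, 3): one row per sentence, one column per dimension
```

    Whether a zero vector is the right fallback depends on the downstream model; dropping such sentences is another option.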




  • Original article: https://www.cnblogs.com/sdu20112013/p/10212858.html
