    Word2vec Tutorial

     RADIM ŘEHŮŘEK 2014-02-02 GENSIM PROGRAMMING 157 COMMENTS

    I never got round to writing a tutorial on how to use word2vec in gensim. It’s simple enough and the API docs are straightforward, but I know some people prefer more verbose formats. Let this post be a tutorial and a reference example.

    UPDATE: the complete HTTP server code for the interactive word2vec demo below is now open sourced on Github. For a high-performance similarity server for documents, see ScaleText.com.

    Preparing the Input

    Starting from the beginning, gensim’s word2vec expects a sequence of sentences as its input. Each sentence is a list of words (utf8 strings):

    # import modules & set up logging
    import gensim, logging
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
     
    sentences = [['first', 'sentence'], ['second', 'sentence']]
    # train word2vec on the two sentences
    model = gensim.models.Word2Vec(sentences, min_count=1)

    Keeping the input as a Python built-in list is convenient, but can use up a lot of RAM when the input is large.

    Gensim only requires that the input must provide sentences sequentially, when iterated over. No need to keep everything in RAM: we can provide one sentence, process it, forget it, load another sentence…

    For example, if our input is strewn across several files on disk, with one sentence per line, then instead of loading everything into an in-memory list, we can process the input file by file, line by line:

    import os
    import gensim

    class MySentences(object):
        def __init__(self, dirname):
            self.dirname = dirname

        def __iter__(self):
            # stream every file in the directory, one line (= one sentence) at a time
            for fname in os.listdir(self.dirname):
                with open(os.path.join(self.dirname, fname)) as infile:
                    for line in infile:
                        yield line.split()

    sentences = MySentences('/some/directory') # a memory-friendly iterator
    model = gensim.models.Word2Vec(sentences)

    Say we want to further preprocess the words from the files: convert to unicode, lowercase, remove numbers, extract named entities… All of this can be done inside the MySentences iterator, and word2vec doesn’t need to know. All that is required is that the input yields one sentence (a list of utf8 words) after another.
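As a rough illustration of preprocessing inside the iterator, here is a minimal sketch that lowercases each line and drops every token that is not purely alphabetic (which removes numbers, but also punctuation-bearing tokens). The class name PreprocessedSentences and the demo file are made up for this example, and Python 3 is assumed.

```python
import os
import tempfile


class PreprocessedSentences(object):
    """Stream preprocessed sentences from every file in a directory."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in sorted(os.listdir(self.dirname)):
            path = os.path.join(self.dirname, fname)
            with open(path, encoding='utf8') as infile:
                for line in infile:
                    # lowercase, then keep only purely alphabetic tokens
                    words = [w for w in line.lower().split() if w.isalpha()]
                    if words:
                        yield words


# Tiny demo on a throwaway directory (hypothetical data).
with tempfile.TemporaryDirectory() as tmpdir:
    with open(os.path.join(tmpdir, 'a.txt'), 'w', encoding='utf8') as f:
        f.write('The first Sentence\n2014 was a YEAR\n')
    demo = list(PreprocessedSentences(tmpdir))
```

An object like this can be passed wherever MySentences is used, since it also yields one list of words per sentence.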

    Note to advanced users: calling Word2Vec(sentences, iter=1) will run two passes over the sentences iterator (or, in general iter+1 passes; default iter=5). The first pass collects words and their frequencies to build an internal dictionary tree structure. The second and subsequent passes train the neural model. These two (or, iter+1) passes can also be initiated manually, in case your input stream is non-repeatable (you can only afford one pass), and you’re able to initialize the vocabulary some other way:

    model = gensim.models.Word2Vec(iter=1)  # an empty model, no training yet
    model.build_vocab(some_sentences)  # can be a non-repeatable, 1-pass generator
    model.train(other_sentences)  # can be a non-repeatable, 1-pass generator

    In case you’re confused about iterators, iterables and generators in Python, check out our tutorial on Data Streaming in Python.
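The practical difference matters here because word2vec makes several passes over its input. The sketch below (names one_shot and Repeatable are invented for illustration) contrasts a generator, which is exhausted after one pass, with an iterable object that starts afresh on every pass.

```python
def one_shot(sentences):
    """A generator function: the generator it returns can be consumed once."""
    for sentence in sentences:
        yield sentence


class Repeatable(object):
    """An iterable: every call to __iter__ starts a fresh pass."""

    def __init__(self, sentences):
        self.sentences = sentences

    def __iter__(self):
        for sentence in self.sentences:
            yield sentence


data = [['first', 'sentence'], ['second', 'sentence']]

gen = one_shot(data)
gen_passes = [len(list(gen)), len(list(gen))]  # the second pass finds nothing

stream = Repeatable(data)
stream_passes = [len(list(stream)), len(list(stream))]  # both passes see all data
```

This is why a plain generator only suits the manual build_vocab() / train() route, where each stream is read once.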

    Training

    Word2vec accepts several parameters that affect both training speed and quality.

    One of them is for pruning the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage. In addition, there’s not enough data to make any meaningful training on those words, so it’s best to ignore them:

    model = Word2Vec(sentences, min_count=10)  # default value is 5

    A reasonable value for min_count is between 0 and 100, depending on the size of your dataset.
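To get a feel for what a given min_count would throw away before training, one can count word frequencies directly. This is only a sketch of the pruning effect, not gensim's internal code; the helper name surviving_vocab and the toy corpus are assumptions.

```python
from collections import Counter


def surviving_vocab(sentences, min_count=5):
    """Return the set of words that occur at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}


# Toy corpus with one typo ('dgo') that appears only once.
corpus = [['the', 'cat', 'sat'], ['the', 'cat', 'ran'], ['the', 'dgo', 'ran']]
kept = surviving_vocab(corpus, min_count=2)
```

Here the typo and the other one-off word fall below the threshold, while the repeated words stay in the vocabulary.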

    Another parameter is the size of the NN layers, which corresponds to the “degrees of freedom” the training algorithm has:

    model = Word2Vec(sentences, size=200)  # default value is 100

    Bigger size values require more training data, but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds.

    The last of the major parameters (full list here) is for training parallelization, to speed up training:

    model = Word2Vec(sentences, workers=4) # default = 1 worker = no parallelization

    The workers parameter only has an effect if you have Cython installed. Without Cython, you’ll only be able to use one core because of the GIL (and word2vec training will be miserably slow).

    Memory

    At its core, word2vec model parameters are stored as matrices (NumPy arrays). Each array is #vocabulary (controlled by the min_count parameter) times #size (the size parameter) of floats (single precision, aka 4 bytes).

    Three such matrices are held in RAM (work is underway to reduce that number to two, or even one). So if your input contains 100,000 unique words, and you asked for layer size=200, the model will require approx. 100,000*200*4*3 bytes = ~229MB.

    There’s a little extra memory needed for storing the vocabulary tree (100,000 words would take a few megabytes), but unless your words are extremely loooong strings, memory footprint will be dominated by the three matrices above.
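The arithmetic above is easy to reproduce for other vocabulary and layer sizes. The function name below is made up, and the defaults (three matrices, 4-byte floats) simply restate the figures from this section. Note that the ~229MB figure counts megabytes as 2^20 bytes; in decimal units the same total is 240 MB.

```python
def estimate_matrix_bytes(vocab_size, layer_size, num_matrices=3, bytes_per_float=4):
    """Rough RAM needed for the model matrices, ignoring the vocabulary tree."""
    return vocab_size * layer_size * bytes_per_float * num_matrices


total_bytes = estimate_matrix_bytes(100000, 200)
total_mib = total_bytes / 2.0 ** 20  # convert bytes to units of 2**20 bytes
```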

    Evaluating

    Word2vec training is an unsupervised task, so there’s no good way to objectively evaluate the result. Evaluation depends on your end application.

    Google have released their testing set of about 20,000 syntactic and semantic test examples, following the “A is to B as C is to D” task: https://raw.githubusercontent.com/RaRe-Technologies/gensim/develop/gensim/test/test_data/questions-words.txt.

    Gensim supports the same evaluation set, in exactly the same format:

    model.accuracy('/tmp/questions-words.txt')
    2014-02-01 22:14:28,387 : INFO : family: 88.9% (304/342)
    2014-02-01 22:29:24,006 : INFO : gram1-adjective-to-adverb: 32.4% (263/812)
    2014-02-01 22:36:26,528 : INFO : gram2-opposite: 50.3% (191/380)
    2014-02-01 23:00:52,406 : INFO : gram3-comparative: 91.7% (1222/1332)
    2014-02-01 23:13:48,243 : INFO : gram4-superlative: 87.9% (617/702)
    2014-02-01 23:29:52,268 : INFO : gram5-present-participle: 79.4% (691/870)
    2014-02-01 23:57:04,965 : INFO : gram7-past-tense: 67.1% (995/1482)
    2014-02-02 00:15:18,525 : INFO : gram8-plural: 89.6% (889/992)
    2014-02-02 00:28:18,140 : INFO : gram9-plural-verbs: 68.7% (482/702)
    2014-02-02 00:28:18,140 : INFO : total: 74.3% (5654/7614)

    The accuracy() method takes an optional parameter restrict_vocab which limits which test examples are to be considered.

    Once again, good performance on this test set doesn’t mean word2vec will work well in your application, or vice versa. It’s always best to evaluate directly on your intended task.
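For readers who want to run the same kind of test against their own task, here is a minimal sketch of the file format: lines starting with a colon name a section, and every other line holds four words, “A B C D”. The functions parse_analogies and section_accuracy, the sample lines and the lookup-table “model” are all invented for illustration; this is not gensim's implementation.

```python
def parse_analogies(lines):
    """Group 'A B C D' analogy lines under their ': section' headers."""
    sections = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(':'):
            current = line.lstrip(':').strip()
            sections.setdefault(current, [])
        else:
            a, b, c, d = line.split()
            sections.setdefault(current, []).append((a, b, c, d))
    return sections


def section_accuracy(sections, predict):
    """Score a predict(a, b, c) -> word callable; returns (correct, total) per section."""
    report = {}
    for name, questions in sections.items():
        correct = sum(1 for a, b, c, d in questions if predict(a, b, c) == d)
        report[name] = (correct, len(questions))
    return report


# Made-up sample in the same layout as the test file.
sample = [
    ': family',
    'boy girl brother sister',
    'king queen man woman',
    ': gram8-plural',
    'banana bananas bird birds',
]

# Hypothetical stand-in for a trained model: a lookup that knows two answers.
answers = {('boy', 'girl', 'brother'): 'sister',
           ('banana', 'bananas', 'bird'): 'birds'}
report = section_accuracy(parse_analogies(sample),
                          lambda a, b, c: answers.get((a, b, c)))
```

Swapping the lookup for a function that queries your own model, and the sample for examples from your domain, gives a per-section report similar in spirit to the log above.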

    Storing and loading models

    You can store/load models using the standard gensim methods:

    model.save('/tmp/mymodel')
    new_model = gensim.models.Word2Vec.load('/tmp/mymodel')

    which uses pickle internally, optionally mmap’ing the model’s internal large NumPy matrices into virtual memory directly from disk files, for inter-process memory sharing.

    In addition, you can load models created by the original C tool, both using its text and binary formats:

    model = Word2Vec.load_word2vec_format('/tmp/vectors.txt', binary=False)
    # using gzipped/bz2 input works too, no need to unzip:
    model = Word2Vec.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)
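The text format is simple enough to inspect by hand: a header line with the vocabulary size and the vector size, then one line per word holding the word followed by its vector components. The parser below is a sketch for understanding the layout, not a replacement for load_word2vec_format(); the function name and the sample data are made up.

```python
def parse_word2vec_text(lines):
    """Parse word2vec text-format lines into a {word: [floats]} dict."""
    lines = iter(lines)
    # header: "<vocabulary size> <vector size>"
    vocab_size, vector_size = (int(x) for x in next(lines).split())
    vectors = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != vector_size:
            raise ValueError('expected %d values for %r' % (vector_size, word))
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError('header promised %d words, found %d'
                         % (vocab_size, len(vectors)))
    return vectors


# Made-up two-word, three-dimensional sample.
sample = ['2 3', 'king 0.1 0.2 0.3', 'queen 0.4 0.5 0.6']
vectors = parse_word2vec_text(sample)
```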

    Online training / Resuming training

    Advanced users can load a model and continue training it with more sentences:

    model = gensim.models.Word2Vec.load('/tmp/mymodel')
    model.train(more_sentences)

    You may need to tweak the total_words parameter to train(), depending on what learning rate decay you want to simulate.
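To see why total_words matters, consider a learning rate that decays linearly with training progress, as in the original word2vec C tool. The sketch below assumes that linear schedule with a floor; the function name and the default rates are illustrative, and the exact formula in your gensim version may differ.

```python
def decayed_alpha(words_done, total_words, start_alpha=0.025, min_alpha=0.0001):
    """Linearly decay the learning rate with progress, never below min_alpha."""
    progress = min(1.0, float(words_done) / total_words)
    return max(min_alpha, start_alpha * (1.0 - progress))


# The same 500 processed words, under two different claims about the total.
halfway_small = decayed_alpha(500, 1000)    # rate has already dropped by half
halfway_large = decayed_alpha(500, 10000)   # rate has barely moved
finished = decayed_alpha(1000, 1000)        # decay has hit the floor
```

Under this assumption, a larger total_words keeps the learning rate high for longer, while a smaller one makes the new sentences count for less and less as training proceeds.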

    Note that it’s not possible to resume training with models generated by the C tool and loaded with load_word2vec_format(). You can still use them for querying/similarity, but information vital for training (the vocab tree) is missing there.

    Using the model

    Word2vec supports several word similarity tasks out of the box:

    model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
    [('queen', 0.50882536)]
    model.doesnt_match("breakfast cereal dinner lunch".split())
    'cereal'
    model.similarity('woman', 'man')
    0.73723527
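The idea behind these queries can be shown with plain vector arithmetic and cosine similarity. The following is a toy sketch in pure Python with hand-made three-dimensional vectors; the function names mirror the gensim calls only for readability, and gensim's own implementation may normalize and weight the vectors differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, positive, negative, topn=1):
    """Add the positive vectors, subtract the negative ones, rank the rest."""
    size = len(next(iter(vectors.values())))
    query = [0.0] * size
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    exclude = set(positive) | set(negative)  # never return the query words
    scored = [(word, cosine(query, vec))
              for word, vec in vectors.items() if word not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hand-made toy vectors: axis 0 = "person", axis 1 = "female", axis 2 = "royal".
toy = {
    'man': [1.0, 0.0, 0.0],
    'woman': [1.0, 1.0, 0.0],
    'king': [1.0, 0.0, 1.0],
    'queen': [1.0, 1.0, 1.0],
    'lunch': [0.0, 0.5, -1.0],
}
best = most_similar(toy, positive=['woman', 'king'], negative=['man'])
```

With these toy vectors, woman + king - man lands exactly on the direction of “queen”, so it comes out on top. Real word vectors are only approximately this well behaved.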

    If you need the raw output vectors in your application, you can access these either on a word-by-word basis

    model['computer']  # raw NumPy vector of a word
    array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)

    …or en-masse as a 2D NumPy matrix from model.syn0.

    Bonus app

    As before with finding similar articles in the English Wikipedia with Latent Semantic Analysis, here’s a bonus web app for those who managed to read this far. It uses the word2vec model trained by Google on the Google News dataset, on about 100 billion words:

    If you don’t get “queen” back, something went wrong and baby SkyNet cries.
    Try more examples too: “he” is to “his” as “she” is to ?; “Berlin” is to “Germany” as “Paris” is to ?

    Try: U.S.A., Monty_Python, PHP, Madiba.

    Also try: “monkey ape baboon human chimp gorilla”, “blue red green crimson transparent”.

    The model contains 3,000,000 unique phrases built with layer size of 300.

    Note that the similarities were trained on a news dataset, and that Google did very little preprocessing there. So the phrases are case sensitive: watch out! Especially with proper nouns.

    On a related note, I noticed about half the queries people entered into the LSA@Wiki demo contained typos/spelling errors, so they found nothing. Ouch.

    To make it a little less challenging this time, I added phrase suggestions to the forms above. Start typing to see a list of valid phrases from the actual vocabulary of Google News’ word2vec model.

    The “suggested” phrases are simply ten phrases starting from whatever bisect_left(all_model_phrases_alphabetically_sorted, prefix_you_typed_so_far) from Python’s built-in bisect module returns.
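The suggestion lookup described above can be sketched in a few lines. The function name suggest and the tiny phrase list are made up for illustration; the real app runs over the model's full vocabulary.

```python
from bisect import bisect_left


def suggest(sorted_phrases, prefix, limit=10):
    """Return up to `limit` phrases starting at the prefix's insertion point."""
    start = bisect_left(sorted_phrases, prefix)
    return sorted_phrases[start:start + limit]


# A tiny, alphabetically sorted stand-in for the model vocabulary.
phrases = sorted(['Berlin', 'Germany', 'Monty_Python', 'PHP',
                  'Paris', 'Parisian', 'queen'])
suggestions = suggest(phrases, 'Par', limit=2)
overshoot = suggest(phrases, 'Par', limit=3)
```

Because the slice is taken purely by position, the tail of the list can contain phrases that do not share the prefix (as in overshoot here). Sorting is case sensitive, which matches the case-sensitive vocabulary.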

    See the complete HTTP server code for this “bonus app” on github (using CherryPy).

    Outro

    Full word2vec API docs here; get gensim here. Original C toolkit and word2vec papers by Google here.
