  • word2vec study notes

    Preface

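    The past month has been busy and exhausting, and in the last few days before the Lunar New Year I slipped into a state of not wanting to do anything. The best way out of that state is to start reading and writing right away, to give myself some positive feedback and break the negative loop. After the holiday I will be working on NLP-related things, so I need a rough understanding of the area, and the first topic I want to study in depth is word2vec. It is a word embedding model whose main purpose is to find a reasonable vector representation for the words of a language: on one hand the vectors should preserve some semantic properties of the words (such as similarity), and on the other hand their dimensionality should stay reasonable. word2vec is an embedding that has both properties. Of course there is no free lunch: the representation has to be obtained by training. NLP and CV treat features in very different ways, which is also my first impression on entering NLP.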

    word2vec theory

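    Writing this part carefully would be painful, and there are plenty of similar tutorials online, so I will not go into detail and only give an outline here. Most of what follows comes from Stanford CS224d, lecture 1. NLP first segments a document into tokens and then encodes the tokens. The simplest encoding is the one-hot vector, where each word gets its own slot, but then every word vector has as many dimensions as the vocabulary has words, and the vectors cannot express any relationship between words. word2vec has a front end and a back end: the front end is one of two models, CBOW or Skip-gram, and the back end is one of two training schemes, negative sampling or a Huffman tree (hierarchical softmax). Front ends and back ends can be combined freely, but the commonly used efficient implementations typically pair Skip-gram with negative sampling.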
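    To make the one-hot limitation concrete, here is a minimal NumPy sketch. The three-word vocabulary and the dense vectors are made up for illustration; they are not trained embeddings.

```python
import numpy as np

vocab = ["cat", "dog", "puddle"]  # hypothetical toy vocabulary

# One-hot encoding: one dimension per word, so the vector length equals |V|.
one_hot = np.eye(len(vocab))

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Any two distinct one-hot vectors are orthogonal, so their similarity is 0.
sim_one_hot = cosine(one_hot[0], one_hot[1])

# Hand-picked dense vectors (an assumption, not learned values) can place
# "cat" closer to "dog" than to "puddle".
dense = np.array([[0.9, 0.1],
                  [0.8, 0.2],
                  [0.1, 0.9]])
sim_cat_dog = cosine(dense[0], dense[1])
sim_cat_puddle = cosine(dense[0], dense[2])

print(sim_one_hot, sim_cat_dog, sim_cat_puddle)
```

    With one-hot vectors every pair of distinct words looks equally unrelated, while even a two-dimensional dense vector can encode that some words are closer than others.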

    Skip-gram

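    Skip-gram predicts the context of an input word. Take the sentence {"The", "cat", "jumped", "over", "the", "puddle"}: given the center word "jumped", the Skip-gram model predicts its context "The", "cat", "over", "the", "puddle", which sounds rather magical. [Figure: the Skip-gram model in operation.] At its core, Skip-gram is just a logistic regression.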
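    The center-word/context split for the example sentence can be sketched in a few lines of plain Python. The helper name `context_of` and the window size of 3 are assumptions made for this illustration; the text does not fix a window size.

```python
sentence = ["The", "cat", "jumped", "over", "the", "puddle"]

def context_of(tokens, center, window):
    """Return the words within `window` positions of tokens[center]."""
    left = tokens[max(0, center - window):center]
    right = tokens[center + 1:center + 1 + window]
    return left + right

center = sentence.index("jumped")
# A window of 3 covers every other word of this six-word sentence.
context = context_of(sentence, center, window=3)
print(context)
```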

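    Skip-gram runs in the following steps:

    1. Generate the one-hot input vector \(x\) of the center word
    2. Look up the embedded vector of the center word, \(v_c = Vx\)
    3. Use \(u = Uv_c\) to produce the 2m score vectors \(u_{c-m},...,u_{c-1},u_{c+1},...,u_{c+m}\)
    4. Turn the scores into a probability distribution, \(y=softmax(u)\)
    5. Finally, match the generated probabilities against the true distribution, i.e. the one-hot vectors of the actual context words

    The Skip-gram objective/loss function is: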

    \[\begin{eqnarray}
    \text{minimize}\ L &=& -\log P(w_{c-m},...,w_{c-1},w_{c+1},...,w_{c+m}|w_c) \\
    &=& -\log\prod_{j=0,j\neq m}^{2m}P(w_{c-m+j}|w_c)\\
    &=& -\log\prod_{j=0,j\neq m}^{2m}P(u_{c-m+j}|v_c)\\
    &=& -\log\prod_{j=0,j\neq m}^{2m}\frac{\exp(u^T_{c-m+j}v_c)}{\sum^{|V|}_{k=1}\exp(u_k^Tv_c)}\\
    &=& -\sum_{j=0,j\neq m}^{2m}u^T_{c-m+j}v_c + 2m\log\sum_{k=1}^{|V|}\exp(u_k^Tv_c)
    \end{eqnarray} \]
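    As a sanity check, the steps and the loss can be written out in NumPy. This is only a sketch with arbitrary toy sizes and random matrices standing in for V and U; it confirms that the softmax form of the loss and the last line of the derivation give the same number.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, m = 10, 4, 2            # toy sizes, chosen arbitrarily

V = rng.normal(size=(dim, vocab_size))   # center-word embeddings, one column per word
U = rng.normal(size=(vocab_size, dim))   # context-word embeddings, one row per word

center_id = 3
context_ids = [1, 2, 4, 5]               # IDs of the 2m words around the center

x = np.zeros(vocab_size)
x[center_id] = 1.0                       # one-hot input
v_c = V @ x                              # embedding of the center word
u = U @ v_c                              # one score per vocabulary word
y = np.exp(u - u.max())
y /= y.sum()                             # softmax over the vocabulary

# Negative log-likelihood of the true context words under the softmax.
loss_from_probs = -np.sum(np.log(y[context_ids]))

# Last line of the derivation: -sum_j u_j^T v_c + 2m * log sum_k exp(u_k^T v_c).
loss_closed_form = -np.sum(u[context_ids]) + 2 * m * np.log(np.sum(np.exp(u)))

print(loss_from_probs, loss_closed_form)
```

    Note that the normaliser sums over all |V| scores, which is the cost that grows with the vocabulary.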

    Negative sampling

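    The objective/loss above has to be computed over the whole vocabulary \(|V|\), which is very expensive, so negative sampling is introduced. The idea is that instead of looping over the entire vocabulary we only sample a few negative examples, drawn from a distribution that matches the word frequencies in the vocabulary. Consider a word-context pair \((w,c)\). Let \(P(D=1|w,c)\) be the probability that \((w,c)\) came from the corpus, and \(P(D=0|w,c)\) the probability that it did not. We have: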

    \[P(D=1|w,c,\theta)=\frac{1}{1+e^{-u^T_wv_c}} \]

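    We need a new objective function. It should maximize \(P(D=1|w,c)\) when \((w,c)\) really comes from the corpus, and \(P(D=0|w,c)\) when it does not. The model parameters can then be obtained by maximum likelihood estimation.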

    \[\begin{eqnarray}
    \theta &=&\mathop{argmax}_{\theta}\prod_{(w,c)\in D}P(D=1|w,c,\theta)\prod_{(w,c)\in \tilde{D}}P(D=0|w,c,\theta)\\
    &=&\mathop{argmax}_{\theta}\prod_{(w,c)\in D}P(D=1|w,c,\theta)\prod_{(w,c)\in \tilde{D}}(1-P(D=1|w,c,\theta))\\
    &=&\mathop{argmax}_{\theta}\sum_{(w,c)\in D}\log\frac{1}{1+\exp(-u^T_wv_c)}+\sum_{(w,c)\in \tilde{D}}\log\left(1-\frac{1}{1+\exp(-u^T_wv_c)}\right) \\
    &=&\mathop{argmax}_{\theta}\sum_{(w,c)\in D}\log\frac{1}{1+\exp(-u^T_wv_c)}+\sum_{(w,c)\in \tilde{D}}\log\frac{1}{1+\exp(u^T_wv_c)} \\
    &=&\mathop{argmax}_{\theta}\sum_{(w,c)\in D}\log\sigma(u^T_wv_c)+\sum_{(w,c)\in \tilde{D}}\log\sigma(-u^T_wv_c)
    \end{eqnarray} \]
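    The rewriting of both sums in terms of the sigmoid rests on two identities: \(\frac{1}{1+\exp(-x)}=\sigma(x)\) and \(1-\sigma(x)=\frac{1}{1+\exp(x)}=\sigma(-x)\). A quick numerical check, with arbitrary test values for the score x:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)   # arbitrary test scores

# Positive-pair term: log(1 / (1 + exp(-x))) is log sigma(x).
pos_lhs = np.log(1.0 / (1.0 + np.exp(-x)))
pos_rhs = np.log(sigmoid(x))

# Negative-pair term: log(1 - sigma(x)) = log(1 / (1 + exp(x))) = log sigma(-x).
neg_lhs = np.log(1.0 - sigmoid(x))
neg_mid = np.log(1.0 / (1.0 + np.exp(x)))
neg_rhs = np.log(sigmoid(-x))

print(np.allclose(pos_lhs, pos_rhs), np.allclose(neg_lhs, neg_rhs))
```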

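    Here \(\theta\) can be read as the \((U,V)\) above, and \(\tilde{D}\) denotes the "negative" corpus. For a single center word and one of its context words, we can further write the objective as: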

    \[\begin{eqnarray} \log\sigma(u_{c-m+j}^Tv_c) + \sum^{K}_{k=1}\log\sigma(-\tilde{u}^T_kv_c) \end{eqnarray} \]

    Here the \(\tilde{u}_k\) are obtained by negative sampling.
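    A minimal NumPy sketch of this per-pair objective, written as \(\log\sigma(u^Tv_c)\) for the true context word plus \(\sum_k\log\sigma(-\tilde{u}_k^Tv_c)\) for the K sampled words. The function names, the sizes and the random vectors are assumptions for illustration; nothing here is trained.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def neg_sampling_objective(v_c, u_pos, u_neg):
    """log sigma(u_pos . v_c) + sum_k log sigma(-u_neg[k] . v_c)."""
    return log_sigmoid(u_pos @ v_c) + np.sum(log_sigmoid(-(u_neg @ v_c)))

rng = np.random.default_rng(0)
dim, K = 8, 5                         # toy embedding size and number of negatives
v_c = rng.normal(size=dim)            # center-word vector
u_pos = rng.normal(size=dim)          # vector of the true context word
u_neg = rng.normal(size=(K, dim))     # vectors of the K sampled words

base = neg_sampling_objective(v_c, u_pos, u_neg)
# Moving the true context vector toward v_c raises the objective,
aligned = neg_sampling_objective(v_c, u_pos + v_c, u_neg)
# and moving the sampled vectors away from v_c raises it as well.
repelled = neg_sampling_objective(v_c, u_pos, u_neg - v_c)

print(base, aligned, repelled)
```

    Each evaluation touches only K + 1 output vectors instead of all |V|, which is where the saving comes from.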

    word2vec implementation in TensorFlow

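    The above is a rough and very brief introduction to how word2vec works. For the details, read the article 《word2vec的数学原理》 available online. Below I walk through the word2vec example that ships with TensorFlow.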

    # These are all the modules we'll be using later. Make sure you can import them
    # before proceeding further.
    %matplotlib inline
    from __future__ import print_function
    import collections
    import math
    import numpy as np
    import os
    import random
    import tensorflow as tf
    import zipfile
    import seaborn as sbn
    from matplotlib import pylab
    %config InlineBackend.figure_format = 'svg'
    from six.moves import range
    from six.moves.urllib.request import urlretrieve
    from sklearn.manifold import TSNE
    
    url = 'http://mattmahoney.net/dc/'
    
    def maybe_download(filename, expected_bytes):
      """Download a file if not present, and make sure it's the right size."""
      if not os.path.exists(filename):
        filename, _ = urlretrieve(url + filename, filename)
      statinfo = os.stat(filename)
      if statinfo.st_size == expected_bytes:
        print('Found and verified %s' % filename)
      else:
        print(statinfo.st_size)
        raise Exception(
          'Failed to verify ' + filename + '. Can you get to it with a browser?')
      return filename
    
    filename = maybe_download('text8.zip', 31344016)
    
    def read_data(filename):
      """Extract the first file enclosed in a zip file as a list of words"""
      with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0])).split()
      return data
      
    words = read_data(filename)
    print('Data size %d' % len(words))
    

    The code above downloads the dataset and reads it; what gets loaded into memory is one very long sequence of words.

    vocabulary_size = 50000
    
    def build_dataset(words):
      count = [['UNK', -1]]
      count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
      dictionary = dict()
      for word, _ in count:
        dictionary[word] = len(dictionary)
      data = list()
      unk_count = 0
      for word in words:
        if word in dictionary:
          index = dictionary[word]
        else:
          index = 0  # dictionary['UNK']
          unk_count = unk_count + 1
        data.append(index)
      count[0][1] = unk_count
      reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys())) 
      return data, count, dictionary, reverse_dictionary
    
    data, count, dictionary, reverse_dictionary = build_dataset(words)
    print('Most common words (+UNK)', count[:5])
    print('Sample data', data[:10])
    del words  # Hint to reduce memory.
    

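    The code above encodes the dataset. Because it uses most_common, words are numbered by how often they occur in the text: the more frequent a word, the smaller its ID. This ordering is used later by negative sampling.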
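    The effect of most_common on the IDs is easy to see on a tiny corpus. The sentence and the vocabulary size below are made up for illustration; the logic mirrors build_dataset.

```python
import collections

words = "the cat sat on the mat the cat ran".split()   # toy corpus
vocabulary_size = 4                                    # 'UNK' plus the 3 most common words

count = [["UNK", -1]]
count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
dictionary = {word: idx for idx, (word, _) in enumerate(count)}

# Words outside the vocabulary map to 0, the ID of 'UNK'.
data = [dictionary.get(word, 0) for word in words]
print(dictionary)
print(data)
```

    "the" occurs three times and "cat" twice, so they receive IDs 1 and 2; only one of the four words that occur once fits into the vocabulary, and the other three become UNK.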

    data_index = 0
    
    def generate_batch(batch_size, num_skips, skip_window):
      global data_index
      assert batch_size % num_skips == 0
      assert num_skips <= 2 * skip_window
      batch = np.ndarray(shape=(batch_size), dtype=np.int32)
      labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
      span = 2 * skip_window + 1 # [ skip_window target skip_window ]
      buffer = collections.deque(maxlen=span) # sliding deque window of size 2*skip_window + 1
      for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
      # Two nested loops: a batch holds batch_size // num_skips center words,
      # and each center word is paired with num_skips labels.
      for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [ skip_window ]
        for j in range(num_skips):
          while target in targets_to_avoid:
            target = random.randint(0, span - 1)
          targets_to_avoid.append(target)
          batch[i * num_skips + j] = buffer[skip_window]
          labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
      return batch, labels
    
    print('data:', [reverse_dictionary[di] for di in data[:8]])
    
    for num_skips, skip_window in [(2, 1), (4, 2)]:
        data_index = 0
        batch, labels = generate_batch(batch_size=8, num_skips=num_skips, skip_window=skip_window)
        print('\nwith num_skips = %d and skip_window = %d:' % (num_skips, skip_window))
        print(batch)
        print('    batch:', [reverse_dictionary[bi] for bi in batch])
        print('    labels:', [reverse_dictionary[li] for li in labels.reshape(8)])
    

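    For data: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first'], the code above produces output like the following. The batch stores word IDs. Suppose we take num_skips = 4 and skip_window = 2: the word "as" then has 4 context words, so "as" appears 4 times in the batch, and the corresponding labels are the words in its context: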
    12 as -> 195 term
    12 as -> 5239 anarchism
    12 as -> 6 a
    12 as -> 3084 originated
    6 a -> 12 as
    6 a -> 3084 originated
    6 a -> 2 of
    6 a -> 195 term
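    generate_batch picks the context words of each center word in random order. When num_skips equals 2 * skip_window it ends up using the whole window, so the same pairs can be listed deterministically. The helper `skipgram_pairs` below is a hypothetical illustration working on words rather than IDs; it is not part of the TensorFlow example.

```python
def skipgram_pairs(tokens, skip_window):
    """Return every (center, context) pair within skip_window positions."""
    pairs = []
    # Only positions with a full window on both sides act as center words.
    for i in range(skip_window, len(tokens) - skip_window):
        for offset in range(-skip_window, skip_window + 1):
            if offset != 0:
                pairs.append((tokens[i], tokens[i + offset]))
    return pairs

tokens = ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first']
pairs = skipgram_pairs(tokens, skip_window=2)
for center, context in pairs[:8]:
    print(center, '->', context)
```

    The first eight pairs are the same ones listed above for "as" and "a", only in a fixed left-to-right order.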

    batch_size = 128
    embedding_size = 128 # Dimension of the embedding vector.
    skip_window = 1 # How many words to consider left and right.
    num_skips = 2 # How many times to reuse an input to generate a label.
    # We pick a random validation set to sample nearest neighbors. here we limit the
    # validation samples to the words that have a low numeric ID, which by
    # construction are also the most frequent. 
    valid_size = 16 # Random set of words to evaluate similarity on.
    valid_window = 100 # Only pick dev samples in the head of the distribution.
    valid_examples = np.array(random.sample(range(valid_window), valid_size))
    num_sampled = 64 # Number of negative examples to sample.
    
    graph = tf.Graph()
    
    with graph.as_default(), tf.device('/cpu:0'):
    
      # Input data.
      train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
      train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
      valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
      
      # Variables.
      embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
      softmax_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                             stddev=1.0 / math.sqrt(embedding_size)))
      softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
      
      # Model.
      # Look up embeddings for inputs.
      # embedding_lookup returns the rows of `embeddings` indexed by train_dataset, in order.
      embed = tf.nn.embedding_lookup(embeddings, train_dataset)
      # Compute the softmax loss, using a sample of the negative labels each time.
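      # nce_loss speeds up the loss computation when there are very many classes;
      # see the TensorFlow documentation. Keyword arguments are used because the
      # positional order of `inputs` and `labels` changed in TensorFlow 1.0.
      loss = tf.reduce_mean(
        tf.nn.nce_loss(weights=softmax_weights, biases=softmax_biases,
                       inputs=embed, labels=train_labels,
                       num_sampled=num_sampled, num_classes=vocabulary_size))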
    
      # Optimizer.
      # Note: The optimizer will optimize the softmax_weights AND the embeddings.
      # This is because the embeddings are defined as a variable quantity and the
      # optimizer's `minimize` method will by default modify all variable quantities 
      # that contribute to the tensor it is passed.
      # See docs on `tf.train.Optimizer.minimize()` for more details.
      optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
      
      # Compute the similarity between minibatch examples and all embeddings.
      # We use the cosine distance:
      norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
      normalized_embeddings = embeddings / norm
      valid_embeddings = tf.nn.embedding_lookup(
        normalized_embeddings, valid_dataset)
      similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
    

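    The code above is TensorFlow's implementation of the word2vec Skip-gram model. At heart it is a logistic regression, and it does differ from the theory above: it uses nce_loss, a function that includes negative sampling, which is described in detail later.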
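    The similarity computation at the end of the graph, and the nearest-neighbour lookup used during training, can be reproduced in NumPy. The 4-word embedding table below is invented for illustration; only the sequence of operations follows the TensorFlow code.

```python
import numpy as np

# Toy embedding table: 4 words, 2 dimensions (values are made up).
embeddings = np.array([[1.0, 0.0],
                       [2.0, 0.1],
                       [0.0, 3.0],
                       [0.1, 1.0]])

# Normalise each row to unit length, as the graph does with `norm`.
norm = np.sqrt(np.sum(np.square(embeddings), axis=1, keepdims=True))
normalized = embeddings / norm

valid_ids = np.array([0, 2])                        # words whose neighbours we inspect
similarity = normalized[valid_ids] @ normalized.T   # cosine similarities, shape [2, 4]

top_k = 1
# argsort on the negated scores sorts by decreasing similarity; position 0 is
# the word itself (cosine 1), so the slice starts at 1.
nearest = np.array([(-row).argsort()[1:top_k + 1] for row in similarity])
print(nearest)
```

    Because the rows are unit vectors, the matrix product gives cosine similarities directly, and word 0 is matched with word 1 while word 2 is matched with word 3.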

    num_steps = 100001
    
    with tf.Session(graph=graph) as session:
      tf.initialize_all_variables().run()
      print('Initialized')
      average_loss = 0
      for step in range(num_steps):
        batch_data, batch_labels = generate_batch(
          batch_size, num_skips, skip_window)
        feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
        _, l = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += l
        if step % 2000 == 0:
          if step > 0:
            average_loss = average_loss / 2000
          # The average loss is an estimate of the loss over the last 2000 batches.
          print('Average loss at step %d: %f' % (step, average_loss))
          average_loss = 0
        # note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 10000 == 0:
          sim = similarity.eval()
          for i in range(valid_size):
            valid_word = reverse_dictionary[valid_examples[i]]
            top_k = 8 # number of nearest neighbors
            nearest = (-sim[i, :]).argsort()[1:top_k+1]
            log = 'Nearest to %s:' % valid_word
            for k in range(top_k):
              close_word = reverse_dictionary[nearest[k]]
              log = '%s %s,' % (log, close_word)
            print(log)
      final_embeddings = normalized_embeddings.eval()
    
    
    num_points = 400
    
    tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
    two_d_embeddings = tsne.fit_transform(final_embeddings[1:num_points+1, :])
    
    def plot(embeddings, labels):
      assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
      pylab.figure(figsize=(15,15))  # in inches
      for i, label in enumerate(labels):
        x, y = embeddings[i,:]
        pylab.scatter(x, y)
        pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                       ha='right', va='bottom')
      pylab.savefig('softmax_loss.svg', format='svg')
      pylab.show()
      
    
    words = [reverse_dictionary[i] for i in range(1, num_points+1)]
    plot(two_d_embeddings, words)
    

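    The final result, a t-SNE plot of the first 400 word embeddings (saved by the code above as softmax_loss.svg), is as follows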

    nce_loss

    The source code of nce_loss is as follows:

    def nce_loss(weights,      # [num_classes, dim]; dim is embedding_size
                 biases,       # [num_classes]; num_classes is the number of distinct words
                 inputs,       # [batch_size, dim]
                 labels,       # [batch_size, num_true]; num_true is 1 here: one output per input
                 num_sampled,  # number of negative samples to draw (per batch)
                 num_classes,  # number of classes (here, the number of distinct words)
                 num_true=1,
                 sampled_values=None,
                 remove_accidental_hits=False,
                 partition_strategy="mod",
                 name="nce_loss"):
      logits, labels = _compute_sampled_logits(
          weights,
          biases,
          inputs,
          labels,
          num_sampled,
          num_classes,
          num_true=num_true,
          sampled_values=sampled_values,
          subtract_log_q=True,
          remove_accidental_hits=remove_accidental_hits,
          partition_strategy=partition_strategy,
          name=name)
      # sigmoid_cross_entropy_with_logits returns a tensor with the same shape as
      # logits; after _sum_rows we have the cross entropy of each example.
      sampled_losses = sigmoid_cross_entropy_with_logits(
          logits, labels, name="sampled_losses")
      # sampled_losses is batch_size x {true_loss, sampled_losses...}
      # We sum out true and sampled losses.
      # The word2vec example calls reduce_mean() on the returned value to get the
      # average cross entropy.
      return _sum_rows(sampled_losses)
    
    # Source of _compute_sampled_logits:
    def _compute_sampled_logits(weights,
                                biases,
                                inputs,
                                labels,
                                num_sampled,
                                num_classes,
                                num_true=1,
                                sampled_values=None,
                                subtract_log_q=True,
                                remove_accidental_hits=False,
                                partition_strategy="mod",
                                name=None):
      if not isinstance(weights, list):
        weights = [weights]
    
      with ops.op_scope(weights + [biases, inputs, labels], name,
                        "compute_sampled_logits"):
        if labels.dtype != dtypes.int64:
          labels = math_ops.cast(labels, dtypes.int64)
        labels_flat = array_ops.reshape(labels, [-1])
    
        # Sample the negative labels.
        #   sampled shape: [num_sampled] tensor
        #   true_expected_count shape = [batch_size, 1] tensor
        #   sampled_expected_count shape = [num_sampled] tensor
        if sampled_values is None:
          sampled_values = candidate_sampling_ops.log_uniform_candidate_sampler(
              true_classes=labels,
              num_true=num_true,
              num_sampled=num_sampled,
              unique=True,
              range_max=num_classes)
    

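    NOTE: this function samples from a log-uniform distribution, \(P(class)=\frac{\log(class+2)-\log(class+1)}{\log(range\_max+1)}\), over the range [0, range_max). Sampling this way requires the words to be ordered by decreasing frequency. The earlier preprocessing does exactly that, so the smaller the class ID, the more likely it is to be drawn.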
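    The formula can be evaluated directly to see how strongly it favours small IDs. The sketch below only computes the probabilities with NumPy; it does not call the TensorFlow sampler.

```python
import numpy as np

range_max = 50000                  # same value as vocabulary_size in this post
classes = np.arange(range_max)     # class IDs 0 .. range_max - 1

# P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1)
p = (np.log(classes + 2) - np.log(classes + 1)) / np.log(range_max + 1)

print(p[:3], p[-1], p.sum())
```

    The numerators telescope to log(range_max + 1), so the probabilities sum to 1, and each ID is less likely than the one before it.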

    sampled_softmax_loss

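    Some versions of TensorFlow's word2vec example use sampled_softmax_loss as the loss function. It is very similar to nce_loss and takes the same set of parameters (in the two listings here only the order of labels and inputs differs).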

    def sampled_softmax_loss(weights,
                             biases,
                             labels,
                             inputs,
                             num_sampled,
                             num_classes,
                             num_true=1,
                             sampled_values=None,
                             remove_accidental_hits=True,
                             partition_strategy="mod",
                             name="sampled_softmax_loss"):
      logits, labels = _compute_sampled_logits(
          weights=weights,
          biases=biases,
          labels=labels,
          inputs=inputs,
          num_sampled=num_sampled,
          num_classes=num_classes,
          num_true=num_true,
          sampled_values=sampled_values,
          subtract_log_q=True,
          remove_accidental_hits=remove_accidental_hits,
          partition_strategy=partition_strategy,
          name=name)
      sampled_losses = nn_ops.softmax_cross_entropy_with_logits(labels=labels,
                                                                logits=logits)
      # sampled_losses is a [batch_size] tensor.
      return sampled_losses
    

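    The main difference is sigmoid_cross_entropy_with_logits versus softmax_cross_entropy_with_logits: the former does not require the classes to be mutually exclusive, while the latter does. The results obtained with nce_loss are somewhat smoother. [Figure: result obtained with sampled_softmax_loss.]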
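    The difference between the two cross entropies can be illustrated with plain NumPy re-implementations. These are sketches of the standard formulas, not the TensorFlow functions themselves, and the logits are made-up numbers.

```python
import numpy as np

def sigmoid_cross_entropy(logits, labels):
    # Element-wise binary cross entropy in the stable form
    # max(x, 0) - x * z + log(1 + exp(-|x|)); every column is its own yes/no problem.
    return np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))

def softmax_cross_entropy(logits, labels):
    # One loss per row; the classes of a row compete through a shared normaliser.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))
    return -np.sum(labels * log_probs, axis=1)

# One example: column 0 is the true class, columns 1..3 are sampled classes.
logits = np.array([[2.0, -1.0, 0.5, -0.5]])
labels = np.array([[1.0, 0.0, 0.0, 0.0]])

nce_style = sigmoid_cross_entropy(logits, labels).sum(axis=1)   # summed per row, like _sum_rows
softmax_style = softmax_cross_entropy(logits, labels)

# Raise the logit of one sampled class and compare the effect on both losses.
bumped = logits.copy()
bumped[0, 1] += 3.0
sig_before = sigmoid_cross_entropy(logits, labels)
sig_after = sigmoid_cross_entropy(bumped, labels)
softmax_bumped = softmax_cross_entropy(bumped, labels)

print(nce_style, softmax_style, softmax_bumped)
```

    Raising the logit of one sampled class changes only that column's sigmoid term, whereas under softmax it also lowers the probability of the true class, because the classes are treated as mutually exclusive.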

    References

    Omitted for now.

  • Original post: https://www.cnblogs.com/liujshi/p/6351520.html