  • Word2Vec in Python TensorFlow: a code walkthrough



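    I am splitting this first post into two parts. This part explains how the basic Word2Vec version officially provided with TensorFlow is put together, and how to build a CBOW model to fill in the architecture that the provided version lacks. In the next part, I will focus on comparing the results of three Word2Vec versions: TensorFlow's basic and optimised versions, and gensim.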




    import collections

    def build_dataset(words, min_cut_freq):
      count_org = [['UNK', -1]]
      count_org.extend(collections.Counter(words).most_common())  # collect the frequency of every word
      count = [['UNK', -1]]
      for word, c in count_org:
        word_tuple = [word, c]
        if word == 'UNK':  # keep the UNK slot at index 0 for later use
          count[0][1] = c
          continue
        if c > min_cut_freq:  # min_cut_freq is a parameter: words that do not occur more often than this are cut
          count.append(word_tuple)
      dictionary = dict()
      for word, _ in count:
        dictionary[word] = len(dictionary)
      data = list()
      unk_count = 0
      for word in words:
        if word in dictionary:
          index = dictionary[word]
        else:
          index = 0  # dictionary['UNK']
          unk_count += 1
        data.append(index)
      count[0][1] = unk_count
      reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
      return data, count, dictionary, reverse_dictionary
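To see what the conversion above produces, here is a minimal sketch on a made-up toy corpus. It is a condensed restatement of the same idea (count words, keep those above a frequency cutoff, map everything else to UNK at index 0), not the blog's exact function; the corpus and the cutoff value are assumptions chosen for illustration.

```python
import collections

def build_dataset(words, min_cut_freq):
    """Map words to integer ids; words seen <= min_cut_freq times become UNK (id 0)."""
    count = [['UNK', -1]]
    for word, c in collections.Counter(words).most_common():
        if word != 'UNK' and c > min_cut_freq:
            count.append([word, c])
    # Ids follow frequency order, with UNK fixed at 0.
    dictionary = {word: i for i, (word, _) in enumerate(count)}
    data = [dictionary.get(word, 0) for word in words]
    count[0][1] = data.count(0)  # number of tokens that were mapped to UNK
    reverse_dictionary = {i: word for word, i in dictionary.items()}
    return data, count, dictionary, reverse_dictionary

toy_words = "the cat sat on the mat the cat ran".split()  # hypothetical corpus
data, count, dictionary, reverse_dictionary = build_dataset(toy_words, min_cut_freq=1)
print(data)   # [1, 2, 0, 0, 1, 0, 1, 2, 0]
print(count)  # [['UNK', 4], ['the', 3], ['cat', 2]]
```

Only "the" and "cat" occur more than once, so every other token collapses to index 0.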

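    Next, generate_batch at line 91 of the source code is the real entry point for building the skip-gram model, not the framework that starts at line 137 with with graph.as_default(). The code after line 137 only builds a simple MLP so that tensors can flow through the model; what actually defines the model is the form of the input tensor and its targets.

    Take the input sentence "蝙蝠侠战胜了超人,美国队长却被钢铁侠暴打" ("Batman defeated Superman, but Captain America was beaten up by Iron Man"). After build_dataset, 蝙蝠侠 (Batman) might be replaced by its dictionary index 3, 战胜了 (defeated) by 90, 超人 (Superman) by 600, 美国队长 (Captain America) by 58, 被 (the passive marker) by 77, 钢铁侠 (Iron Man) by 888 and 暴打 (beat up) by 965, so the sentence becomes [3,90,600,58,77,888,965]. Assume a window size of 3 (that is, skip_window = 1 and num_skips = 2). With the skip-gram model, generate_batch starts from 90 and outputs the batch [90,90,600,600,58,58,77,77,888,888] with the targets [3,600,90,58,600,77,58,888,77,965]. The two targets of each centre word are sampled at random, so their order within a pair may vary.

    So how do we build a CBOW model? It is quite simple: the input and the prediction of CBOW are exactly the reverse of skip-gram, so we can just swap the batch on line 109 with the labels on line 110. The code is as follows: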

    def generate_cbow_batch(batch_size, num_skips, skip_window):
      global data_index
      assert batch_size % num_skips == 0
      assert num_skips <= 2 * skip_window
      batch = np.ndarray(shape=(batch_size), dtype=np.int32)
      labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
      span = 2 * skip_window + 1  # [ skip_window target skip_window ]
      buffer = collections.deque(maxlen=span)
      for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
      for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
          while target in targets_to_avoid:
            target = random.randint(0, span - 1)
          targets_to_avoid.append(target)
          # skip-gram version of batch and labels:
          # batch[i * num_skips + j] = buffer[skip_window]
          # labels[i * num_skips + j, 0] = buffer[target]
          # CBOW version: swap the two skip-gram lines above.
          batch[i * num_skips + j] = buffer[target]
          labels[i * num_skips + j, 0] = buffer[skip_window]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
      return batch, labels

    With that in place, we only need to change the later call batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window) so that it calls your CBOW function instead.
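To make the swap concrete, the following sketch enumerates the skip-gram pairs for the example sentence [3,90,600,58,77,888,965] and then exchanges inputs and targets. The helper skipgram_pairs is a hypothetical, deterministic stand-in: unlike generate_batch it does not sample targets at random or wrap around the data, it simply lists every neighbour of each centre word in order.

```python
def skipgram_pairs(ids, skip_window):
    """Pair each centre word with every neighbour inside its window (hypothetical helper)."""
    inputs, targets = [], []
    for center in range(skip_window, len(ids) - skip_window):
        for offset in range(-skip_window, skip_window + 1):
            if offset == 0:
                continue  # the centre word is not its own target
            inputs.append(ids[center])
            targets.append(ids[center + offset])
    return inputs, targets

sentence = [3, 90, 600, 58, 77, 888, 965]
batch, labels = skipgram_pairs(sentence, skip_window=1)
print(batch)   # [90, 90, 600, 600, 58, 58, 77, 77, 888, 888]
print(labels)  # [3, 600, 90, 58, 600, 77, 58, 888, 77, 965]

# The naive CBOW variant described above just swaps the two lists.
cbow_batch, cbow_labels = labels, batch
print(cbow_batch[:2], cbow_labels[:2])  # [3, 600] [90, 90]
```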


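    Thanks to Teacher Chen of Shenzhen University for recommending the word embedding paper How to Generate a Good Word Embedding. The paper not only explains how to evaluate the quality of word vectors, it also describes the differences between the models in detail. While reading it I found that Skip-Gram and CBOW differ in more than having their inputs and outputs reversed. There is another notable difference in the model itself: the input layer of CBOW is a sum function whose result is the average of the input vectors, whereas Skip-gram represents the context with a single word, i.e. one of the context words as the representation of the context.

    Taking this into account and comparing with the generate_cbow_batch code above, the problem is that the expected batch and labels should not be [3,600,90,58,600,77,58,888,77,965] and [90,90,600,600,58,58,77,77,888,888]. Instead the input should be [[3,600], [90,58], [600,77], [58,888], [77,965]] and the output [90, 600, 58, 77, 888]. How do we change generate_cbow_batch to do this? The change is simple, as follows: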

    def generate_cbow_batch(batch_size, num_skips, skip_window):
      global data_index
      assert batch_size % num_skips == 0
      assert num_skips <= 2 * skip_window
      # One row of num_skips context words per example; to match the graph's
      # input placeholder, num_skips should equal 2 * skip_window.
      batch = np.ndarray(shape=(batch_size, num_skips), dtype=np.int32)
      labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
      span = 2 * skip_window + 1  # [ skip_window target skip_window ]
      buffer = collections.deque(maxlen=span)
      for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
      for i in range(batch_size):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [skip_window]
        # Temporary array that holds the context words; copied into batch once filled.
        batch_temp = np.ndarray(shape=(num_skips), dtype=np.int32)
        for j in range(num_skips):
          while target in targets_to_avoid:
            target = random.randint(0, span - 1)
          targets_to_avoid.append(target)
          batch_temp[j] = buffer[target]
        batch[i] = batch_temp
        labels[i, 0] = buffer[skip_window]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
      return batch, labels
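As a quick check of the shapes this version is meant to produce, here is a small deterministic sketch for the same example sentence. The helper cbow_examples is hypothetical and only illustrates the grouping: it keeps the context words in left-to-right order, whereas generate_cbow_batch picks them from the window at random and wraps around the data.

```python
def cbow_examples(ids, skip_window):
    """Group the neighbours of each centre word into one context row (hypothetical helper)."""
    contexts, centers = [], []
    for center in range(skip_window, len(ids) - skip_window):
        window = ids[center - skip_window:center + skip_window + 1]
        # Drop the middle element: what remains is the context.
        contexts.append(window[:skip_window] + window[skip_window + 1:])
        centers.append(ids[center])
    return contexts, centers

sentence = [3, 90, 600, 58, 77, 888, 965]
contexts, centers = cbow_examples(sentence, skip_window=1)
print(contexts)  # [[3, 600], [90, 58], [600, 77], [58, 888], [77, 965]]
print(centers)   # [90, 600, 58, 77, 888]
```

Each row of contexts is one training input and the matching entry of centers is its label, which is the layout the model below expects.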


    graph = tf.Graph()
    with graph.as_default():
      # Input data.
      # For skip-gram the input holds one word id per example (batch_size x 1):
      #train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
      # Here every example comes with a whole context as input, so the input
      # has shape batch_size x context size.
      train_inputs = tf.placeholder(tf.int32, shape=[batch_size, skip_window * 2])
      train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
      valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
      # Ops and variables pinned to the CPU because of missing GPU implementation
      with tf.device('/cpu:0'):
        # Look up embeddings for inputs.
        embeddings = tf.Variable(
            tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
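        # The shape of embed is shape(train_inputs) + shape(embeddings)[1:]
        embed = tf.nn.embedding_lookup(embeddings, train_inputs)
        # embed has shape batch_size x (skip_window * 2) x embedding_size. For example, with
        # batch_size 200, 4 context words and an embedding size of 200 it is 200 x 4 x 200,
        # so we average over axis 1 to get one batch_size x embedding_size vector per example.
        reduced_embed = tf.div(tf.reduce_sum(embed, 1), skip_window*2)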
        # Construct the variables for the NCE loss
        nce_weights = tf.Variable(
            tf.truncated_normal([vocabulary_size, embedding_size],
                                stddev=1.0 / math.sqrt(embedding_size)))
        nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
      # Compute the average NCE loss for the batch.
      # tf.nce_loss automatically draws a new sample of the negative labels each
      # time we evaluate the loss.
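      # Keyword arguments are used because the positional order of inputs and
      # labels differs between TensorFlow releases.
      loss = tf.reduce_mean(
          tf.nn.nce_loss(weights=nce_weights, biases=nce_biases,
                         inputs=reduced_embed, labels=train_labels,
                         num_sampled=num_sampled, num_classes=vocabulary_size))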
      # Construct the SGD optimizer using a learning rate of 1.0.
      optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
      # Compute the cosine similarity between minibatch examples and all embeddings.
      norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
      normalized_embeddings = embeddings / norm
      valid_embeddings = tf.nn.embedding_lookup(
          normalized_embeddings, valid_dataset)
      similarity = tf.matmul(
          valid_embeddings, normalized_embeddings, transpose_b=True)
      # Add variable initializer.
      init = tf.initialize_all_variables()
    # Step 5: Begin training.
    num_steps = 100001
    with tf.Session(graph=graph) as session:
      # We must initialize all variables before we use them.
      init.run()
      average_loss = 0
      for step in range(num_steps):
        batch_inputs, batch_labels = generate_cbow_batch(
            batch_size, num_skips, skip_window)
        feed_dict = {train_inputs : batch_inputs, train_labels : batch_labels}
        # We perform one update step by evaluating the optimizer op (including it
        # in the list of returned values for session.run()
        _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val
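The step that distinguishes this graph from the skip-gram one is the lookup followed by an average over the context axis. The NumPy sketch below mirrors only that step outside of TensorFlow, so the shapes are easy to inspect. The vocabulary size, embedding size and input ids are illustrative assumptions, not values from the post.

```python
import numpy as np

rng = np.random.default_rng(0)
vocabulary_size, embedding_size = 1000, 8  # illustrative sizes, not the post's settings
embeddings = rng.uniform(-1.0, 1.0, size=(vocabulary_size, embedding_size))

# batch_size x context: each row holds the context word ids of one example.
batch_inputs = np.array([[3, 600], [90, 58], [600, 77]])

embed = embeddings[batch_inputs]    # lookup: batch_size x context x embedding_size
reduced_embed = embed.mean(axis=1)  # average over the context: batch_size x embedding_size
print(embed.shape, reduced_embed.shape)  # (3, 2, 8) (3, 8)
```

The row for the first example is simply the mean of the vectors for ids 3 and 600, which is what reduce_sum divided by skip_window*2 computes in the graph.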


     Nearest to to: cruel, must, would, should, will, could, nigeria, captive,

     Nearest to may: can, would, could, will, might, must, should, cannot,

     Nearest to was: is, had, has, were, became, be, been, perceive,

     Nearest to into: through, delicious, from, comrades, reflexive, pellets, awarding, slowly,

     Nearest to some: many, these, any, various, several, both, their, wise,

     Nearest to that: which, meadow, how, battlefront, however, powell, animism, this,

     Nearest to also: never, still, often, actually, sometimes, usually, originally, below,

     Nearest to are: were, have, is, be, include, do, sprites, been,

     Nearest to new: nominally, dns, fermentable, final, proprietorships, aloe, junior, reservoirs,

     Nearest to their: its, his, her, the, your, some, my, whose,

     Nearest to years: decades, year, history, times, days, months, marmoset, wrangler,

     Nearest to there: they, it, she, he, these, generally, lemon, we,

     Nearest to th: eight, zero, nine, plasticizers, fairies, characteristic, documentation, anecdotes,

     Nearest to many: some, several, these, such, most, various, wise, other,

     Nearest to but: however, and, although, while, pursuing, marmoset, glowing, components,

     Nearest to see: wants, atomic, charlotte, crimson, tanaka, caius, maine, scuttled,
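Lists like the ones above come from ranking all words by cosine similarity to a query word, as the similarity op in the graph does. Here is a minimal NumPy sketch of that ranking. The function nearest_words, the four toy vectors and the word table are all made up for illustration and are not the trained embeddings.

```python
import numpy as np

def nearest_words(embeddings, reverse_dictionary, word_id, top_k=8):
    """Return the top_k words most cosine-similar to word_id (hypothetical helper)."""
    norm = np.sqrt(np.sum(np.square(embeddings), axis=1, keepdims=True))
    normalized = embeddings / norm
    sim = normalized @ normalized[word_id]  # cosine similarity to every word
    order = np.argsort(-sim)                # most similar first
    neighbours = [int(i) for i in order if i != word_id][:top_k]
    return [reverse_dictionary[i] for i in neighbours]

# Made-up 2-d vectors, chosen only so the ranking is easy to follow.
toy_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
toy_words = {0: 'are', 1: 'were', 2: 'lemon', 3: 'see'}
print(nearest_words(toy_embeddings, toy_words, word_id=0, top_k=2))  # ['were', 'lemon']
```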

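    As the nearest-neighbour lists show, the system works reasonably well. For example, the words nearest to are include were, have, is, be, include and do; anyone with some English will recognise that these words really are similar to are in usage and meaning. Many other words, their among them, also look quite good. If you are interested, you are welcome to read my source code.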

