  • TensorFlow Study Notes 12: The word2vec Model

    Why learn the word2vec model?

    This model learns vector representations of words. Images and audio can be handled directly as raw
    pixel values or as intensity values of the audio's power spectral density, so they can be encoded
    straight into vector datasets. In natural language processing, however, encoding the words of a
    sentence as isolated symbols carries no information about the relationships between different words.
    Such "independent, discrete" symbols lead to data sparsity, which forces us to collect far more data
    to train a model. word2vec is designed to overcome this problem.

    Vector space models (VSMs) map semantically similar words to nearby points, under the assumption
    that words appearing in similar contexts share similar meanings. Methods built on this assumption
    fall into two categories: 1. count-based methods (count how often each word co-occurs with its
    neighboring words in a corpus, then map these counts down to small, dense vectors); 2. predictive
    methods (directly predict a word from its neighbors, using learned small, dense embedding vectors).
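
    To make the count-based idea concrete, here is a small sketch (my own illustration, not part of the
    original tutorial) that builds a word-by-word co-occurrence count table with a window of 1; a real
    count-based method would then reduce these counts to small, dense vectors, e.g. with an SVD.

    import collections

    corpus = "the quick brown fox jumped over the lazy dog".split()
    vocab = sorted(set(corpus))
    index = {w: i for i, w in enumerate(vocab)}

    # Count how often each ordered pair of words occurs within a window of 1.
    cooc = collections.Counter()
    for i, word in enumerate(corpus):
      for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
          cooc[(index[word], index[corpus[j]])] += 1

    print(cooc[(index['the'], index['quick'])])  # -> 1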

    Word2vec is a computationally efficient predictive model for learning word embeddings. It comes in
    two flavors: the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model. Algorithmically the
    two are very similar; the difference is that CBOW predicts the target word (e.g. 'mat') from the
    source context words ('the cat sits on the'), whereas Skip-Gram does the inverse and predicts the
    source context words from the target word. The motivation for inverting CBOW is that CBOW smooths
    over a lot of the distributional information (for instance, it treats an entire context as a single
    observation), which often helps on smaller datasets. Skip-Gram, by contrast, treats each
    context-target pair as a new observation, which tends to work better on large datasets. The rest of
    this tutorial focuses on the Skip-Gram model.

    Probabilistic language models

    Such models are usually trained by maximum likelihood (ML): a softmax function is used to maximize
    the probability of the next word \(w_t\) (for "target") given the previous word or words as the
    context \(h\) (for "history"):

    \[
    \begin{aligned}
    P(w_t \mid h) &= \text{softmax}(\text{score}(w_t, h)) \\
    &= \frac{\exp(\text{score}(w_t, h))}{\sum_{w' \in \text{Vocab}} \exp(\text{score}(w', h))}
    \end{aligned}
    \]
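
    As a quick numerical illustration (my own, with made-up scores rather than a learned score
    function), the softmax above simply exponentiates the scores and normalizes them over the
    vocabulary:

    import numpy as np

    # Hypothetical unnormalized scores score(w', h) for a toy 4-word vocabulary.
    scores = np.array([2.0, 1.0, 0.1, -1.0])

    probs = np.exp(scores) / np.sum(np.exp(scores))
    print(probs)        # roughly [0.64 0.23 0.10 0.03]
    print(probs.sum())  # 1.0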

    In practice this is very expensive, because at every training step we would have to compute and
    normalize the scores of all other words \(w'\) in the vocabulary for the current context \(h\). To
    avoid building a full probability model over the whole vocabulary, a binary classifier (logistic
    regression) is trained to distinguish the real target word \(w_t\) from \(k\) imaginary (noise)
    words \(w'\) in the same context \(h\).

    So the objective (loss) function is

    \[
    J = \log Q_\theta(D=1 \mid w_t, h) + k \, \mathbb{E}_{w' \sim P_{\text{noise}}}\left[\log Q_\theta(D=0 \mid w', h)\right]
    \]

    where \(Q_\theta(D=1 \mid w_t, h)\) is the probability, computed by logistic regression from the
    learned embedding vectors \(\theta\), that the target word \(w_t\) seen in context \(h\) came from
    the real data. The objective is maximized when the model assigns high probability to the real target
    word and low probability to the noise words.
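
    A minimal NumPy sketch of this objective (my own illustration with made-up logits standing in for
    the learned score of each word in context \(h\)): one logistic term for the true word plus \(k\)
    terms for sampled noise words, and an optimizer would minimize the negative of the objective.

    import numpy as np

    def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

    # Hypothetical logits: score(w, h) for the true target and k = 3 sampled noise words.
    true_logit = 2.5
    noise_logits = np.array([0.3, -1.2, 0.8])

    # Monte-Carlo version of the objective above; higher is better.
    objective = np.log(sigmoid(true_logit)) + np.sum(np.log(1.0 - sigmoid(noise_logits)))
    loss = -objective  # what SGD would minimize
    print(loss)        # about 2.37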

    The Skip-Gram model

    Now let's see it in practice.

    • Dataset: the quick brown fox jumped over the lazy dog
    • Define the context as "the word immediately before and after the target word" (a window of 1), which gives the dataset: ([the, brown], quick), ([quick, fox], brown), ([brown, jumped], fox), ...
    • The Skip-Gram model swaps targets and contexts, giving the dataset: (quick, the), (quick, brown), (brown, quick), (brown, fox), ... (see the sketch after this list).
    • In this example, stochastic gradient descent (SGD) is run on each sample, or on a very small set of samples with 16 <= batch_size <= 512 (a sentence or a few sentences).
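
    A standalone sketch of this (target, context) pair generation with a window of 1 (my own
    simplification; the generate_batch function in the full code below additionally handles batching and
    samples num_skips context words per target):

    corpus = "the quick brown fox jumped over the lazy dog".split()
    skip_window = 1  # number of words to the left and right of the target

    pairs = []
    for i, target in enumerate(corpus):
      for j in range(max(0, i - skip_window), min(len(corpus), i + skip_window + 1)):
        if j != i:
          pairs.append((target, corpus[j]))

    print(pairs[:4])
    # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick')]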

    For example, when predicting the from quick, suppose the noise word sheep is drawn at random; the loss for this step is then

    \[
    J_t = \log Q_\theta(D=1 \mid \text{the}, \text{quick}) + \log Q_\theta(D=0 \mid \text{sheep}, \text{quick})
    \]

    We compute \(\frac{\partial J}{\partial \theta}\) and update the embedding parameters \(\theta\) so
    as to maximize \(J\), until the model cleanly separates real words from noise words.

    Full code:

    # Copyright 2015 The TensorFlow Authors. All Rights Reserved.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    # ==============================================================================
    """Basic word2vec example."""
    
    from __future__ import absolute_import
    from __future__ import division
    from __future__ import print_function
    
    import collections
    import math
    import os
    import random
    import zipfile
    import sys
    
    import numpy as np
    from six.moves import urllib
    from six.moves import xrange  # pylint: disable=redefined-builtin
    import tensorflow as tf
    
    # Step 1: Download the data.
    url = 'http://mattmahoney.net/dc/'
    
    # Download the data file if it is not already present
    def maybe_download(filename, expected_bytes):
      """Download a file if not present, and make sure it's the right size."""
      if not os.path.exists(filename):
        def _progress(count, block_size, total_size):
          sys.stdout.write('\r>> Downloading %s %.1f%%' %
                           (filename, float(count * block_size) / float(total_size) * 100.0))
          sys.stdout.flush()
        filename, _ = urllib.request.urlretrieve(url + filename, filename,_progress)
      print()
      statinfo = os.stat(filename)
      if statinfo.st_size == expected_bytes:
        print('Found and verified', filename)
      else:
        print(statinfo.st_size)
        raise Exception(
            'Failed to verify ' + filename + '. Can you get to it with a browser?')
      return filename
    
    filename = maybe_download('text8.zip', 31344016)
    
    
    # Unzip the archive and read its contents
    def read_data(filename):
      """Extract the first file enclosed in a zip file as a list of words."""
      with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0])).split()
      return data
    
    vocabulary = read_data(filename)
    print('Data size', len(vocabulary))
    
    # Step 2: Build the dictionary and replace rare words with UNK token.
    vocabulary_size = 50000
    
    # Build the dataset: words is the list of all words, n_words is the desired vocabulary size
    def build_dataset(words, n_words):
      """Process raw inputs into a dataset."""
      # Map all rare words to UNK; its count is set to -1 for now
      count = [['UNK', -1]]
      # Sort words by frequency and put the n_words - 1 most common words, with their counts, into count;
      # anything ranked below n_words - 1 becomes UNK. The smaller the index in count, the more frequent the word.
      count.extend(collections.Counter(words).most_common(n_words - 1))
      # Create a dictionary mapping word -> id
      dictionary = dict()
      for word, _ in count:
        # Assign each word in count an integer id, starting from 0, and store it in the dictionary
        dictionary[word] = len(dictionary)
      # Create a list for the encoded corpus
      data = list()
      unk_count = 0
    
      # Replace every word in the original words list with its dictionary id, i.e. convert the words to
      # integers and store them in data; count the occurrences of UNK at the same time
      for word in words:
        if word in dictionary:
          index = dictionary[word]
        else:
          index = 0  # dictionary['UNK']
          unk_count += 1
        data.append(index)
      # Record the number of UNK tokens
      count[0][1] = unk_count
      # Invert the dictionary so a word can be looked up by its id; store it in reversed_dictionary
      reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
      return data, count, dictionary, reversed_dictionary
    
    data, count, dictionary, reverse_dictionary = build_dataset(vocabulary,
                                                                vocabulary_size)
    del vocabulary  # Hint to reduce memory.
    
    # Print the 5 most frequent words
    print('Most common words (+UNK)', count[:5])
    
    print('Sample data', data[:10], [reverse_dictionary[i] for i in data[:10]])
    
    data_index = 0
    
    # Step 3: Function to generate a training batch for the skip-gram model.
    
    # For each word in data, this function pairs it with the word before and the word after it, e.g.
    # [data[1], data[0]] and [data[1], data[2]]: the current word data[1] goes into batch, and the
    # surrounding words go into labels
    def generate_batch(batch_size, num_skips, skip_window):
      global data_index                       # global index: current position in data
      assert batch_size % num_skips == 0
      assert num_skips <= 2 * skip_window
      batch = np.ndarray(shape=(batch_size), dtype=np.int32)   # array of batch_size target word ids
      labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)  # (batch_size, 1) array holding, for each target, one word before or after it, so each row forms a pair
      span = 2 * skip_window + 1  # window size, here 3: [ skip_window target skip_window ]
      buffer = collections.deque(maxlen=span)  # double-ended queue used as a sliding buffer of at most span words
      if data_index + span > len(data):  # if the index would run past the end of the data, restart at the beginning
        data_index = 0
      buffer.extend(data[data_index:data_index + span])   # fill the buffer with data[data_index:data_index + span], exactly span words
      data_index += span  # advance the index by span
      for i in range(batch_size // num_skips):   # e.g. 128 // 2 = 64 iterations (integer division)
        target = skip_window  # set target to skip_window, i.e. the current word in the middle of the buffer
        targets_to_avoid = [skip_window]      # remember indices already used so no context is picked twice
        for j in range(num_skips):
          while target in targets_to_avoid:            # pick a context index that has not been used yet
            target = random.randint(0, span - 1)
          targets_to_avoid.append(target)               # mark it as used
          batch[i * num_skips + j] = buffer[skip_window]    # store the current (target) word in batch
          labels[i * num_skips + j, 0] = buffer[target]      # store one of the words before or after it in labels
        if data_index == len(data):          # if we have reached the end of the data
          buffer.extend(data[:span])         # start over: refill the buffer with the first span words of data
          data_index = span
        else:
          buffer.append(data[data_index])       # otherwise append the next word, pushing the oldest one out, i.e. slide the window one position
          data_index += 1              # advance the index by one
      # Backtrack a little bit to avoid skipping words in the end of a batch
      data_index = (data_index + len(data) - span) % len(data)  # avoid stopping exactly at the end of data, so the next call does not run past it
      return batch, labels
    
    batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
    for i in range(8):
      print(batch[i], reverse_dictionary[batch[i]],
            '->', labels[i, 0], reverse_dictionary[labels[i, 0]])
    
    # Step 4: Build and train a skip-gram model.
    
    batch_size = 128
    embedding_size = 128  # Dimension of the embedding vector.
    skip_window = 1       # How many words to consider left and right.
    num_skips = 2         # How many times to reuse an input to generate a label.
    
    # We pick a random validation set to sample nearest neighbors. Here we limit the
    # validation samples to the words that have a low numeric ID, which by
    # construction are also the most frequent.
    valid_size = 16     # Random set of words to evaluate similarity on.
    valid_window = 100  # Only pick dev samples in the head of the distribution.
    valid_examples = np.random.choice(valid_window, valid_size, replace=False)
    num_sampled = 64    # Number of negative examples to sample.
    
    graph = tf.Graph()
    
    with graph.as_default():
    
      # Input data.
      # A batch of training inputs: the dictionary ids of the current (target) words
      train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
      # The labels for the batch: the dictionary ids of a word before or after each target word
      train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
      # The ids of 16 words sampled from the 100 most frequent words in the dictionary, used as a validation set
      valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
    
      # Ops and variables pinned to the CPU because of missing GPU implementation
      with tf.device('/cpu:0'):
        # Look up embeddings for inputs.
        # Initialize an embedding for every word in the dictionary, uniformly distributed in [-1, 1)
        embeddings = tf.Variable(
            tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        # Look up the embeddings of the words in the training batch
        embed = tf.nn.embedding_lookup(embeddings, train_inputs)
    
        # Construct the variables for the NCE loss
        # Initialize the NCE weights
        nce_weights = tf.Variable(
            tf.truncated_normal([vocabulary_size, embedding_size],
                                stddev=1.0 / math.sqrt(embedding_size)))
        # Initialize the NCE biases
        nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
    
      # Compute the average NCE loss for the batch.
      # tf.nce_loss automatically draws a new sample of the negative labels each
      # time we evaluate the loss.
      '''
      The sampling is simple: 64 negative samples v are drawn according to word frequency (or a similar
      noise distribution) and, together with the correct input w (all of them word ids), their rows in
      nce_weights form a training subset mu.
      For each element mu(i) of the subset: if mu(i) corresponds to w (where w here is the embedding of
      the input), loss(i) = log(sigmoid(w * mu(i)));
      if mu(i) is a negative sample, loss(i) = log(1 - sigmoid(w * mu(i))).
      All the loss(i) are summed to give the total loss; the smaller the loss, the more similar the
      vectors (cosine similarity).
      The gradients of the total loss with respect to the parameters are used to update nce_weights and
      the input embeddings.
      '''
      loss = tf.reduce_mean(
          tf.nn.nce_loss(weights=nce_weights,
                         biases=nce_biases,
                         labels=train_labels,
                         inputs=embed,
                         num_sampled=num_sampled,
                         num_classes=vocabulary_size))
    
      # Construct the SGD optimizer using a learning rate of 1.0.
      optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
    
      # Compute the cosine similarity between minibatch examples and all embeddings.
      # Normalize the embeddings to unit length
      norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
      normalized_embeddings = embeddings / norm
      # Look up the normalized embeddings of the validation-set ids
      valid_embeddings = tf.nn.embedding_lookup(
          normalized_embeddings, valid_dataset)
      # Cosine similarity between the validation words and all normalized embeddings
      similarity = tf.matmul(
          valid_embeddings, normalized_embeddings, transpose_b=True)
    
      # Add variable initializer.
      init = tf.global_variables_initializer()
    
    # Step 5: Begin training.
    num_steps = 100001
    
    with tf.Session(graph=graph) as session:
      # We must initialize all variables before we use them.
      init.run()
      print('Initialized')
    
      average_loss = 0
      for step in xrange(num_steps):
        # Generate one batch of training data
        batch_inputs, batch_labels = generate_batch(
            batch_size, num_skips, skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
    
        # We perform one update step by evaluating the optimizer op (including it
        # in the list of returned values for session.run()
        _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val
        # Report the running average loss
        if step % 2000 == 0:
          if step > 0:
            average_loss /= 2000
          # The average loss is an estimate of the loss over the last 2000 batches.
          print('Average loss at step ', step, ': ', average_loss)
          average_loss = 0
    
        # Note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 10000 == 0:
          # Every 10000 steps, evaluate the similarity between the validation set and all embeddings;
          # the result is the similarity of each validation word to every word in the dictionary
          sim = similarity.eval()
          # For each word in the validation set
          for i in xrange(valid_size):
            # Recover the word from its id
            valid_word = reverse_dictionary[valid_examples[i]]
            # For normalized vectors a larger dot product means a more similar word (cosine similarity);
            # here we pick the 8 most similar words
            top_k = 8  # number of nearest neighbors
            # Sort by descending similarity and take the ids of the top 8 neighbors, skipping index 0 (the word itself)
            nearest = (-sim[i, :]).argsort()[1:top_k + 1]
            log_str = 'Nearest to %s:' % valid_word
            for k in xrange(top_k):
              # Map the id back to its word
              close_word = reverse_dictionary[nearest[k]]
              log_str = '%s %s,' % (log_str, close_word)
            print(log_str)
      final_embeddings = normalized_embeddings.eval()
    
    # Step 6: Visualize the embeddings.
    
    
    def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
      assert low_dim_embs.shape[0] >= len(labels), 'More labels than embeddings'
      plt.figure(figsize=(18, 18))  # in inches
      for i, label in enumerate(labels):
        x, y = low_dim_embs[i, :]
        plt.scatter(x, y)
        plt.annotate(label,
                     xy=(x, y),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    
      plt.savefig(filename)
    
    try:
      # pylint: disable=g-import-not-at-top
      from sklearn.manifold import TSNE
      import matplotlib.pyplot as plt
    
      tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000, method='exact')
      plot_only = 500
      low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
      labels = [reverse_dictionary[i] for i in xrange(plot_only)]
      plot_with_labels(low_dim_embs, labels)
    
    except ImportError:
      print('Please install sklearn, matplotlib, and scipy to show embeddings.')
    

    The latest version of the tutorial has since been updated (I have not run the new version yet; you can give it a try).
