    cs224n-word2vec

    Summary:

    • Where the original idea behind word2vec comes from: the distributional hypothesis;
    • The original word2vec objective, built from a likelihood function;
    • How the probability is computed with softmax, and a brief interpretation of softmax;
    • The two word2vec variants, and an introduction to the speed-up methods;
    • How sampling is done in HW2, and why this operation is needed;
    • Methods based on the co-occurrence matrix, mainly LSI (LSA);
    • Some improvements to LSI;

    Description

    The goal is to encode the meaning of a word into a vector representation.

    Distributional semantics

    A word’s meaning is given by the words that frequently appear close-by

    The idea behind word2vec is simple: assume that a word's meaning is related to its context, so the context around a word can be used to represent that word's meaning.

    Note: this is a reasonable way to think about it to some extent, but is there a better assumption? Is the surrounding context always a good representation of the current word?

    Introduction to word2vec

    Framework:

    • A sufficiently large corpus of text;
    • Every word in the vocabulary is represented by a vector;
    • Go through each position \(t\) in the text, which gives a center word \(w_t\) and its context \(c_t\);
    • Use the similarity of the word vectors to compute the conditional probability \(P(c | w_t)\);
    • Keep adjusting the word vectors to maximize this probability (a pair-extraction sketch follows this list).
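
    For concreteness, here is a minimal sketch of how such (center word, context word) training pairs can be extracted with a fixed window; the toy corpus and window size are invented for illustration.

    ```python
    # A minimal sketch: extract (center, context) pairs with a fixed window m = 2.
    # The toy corpus is invented for illustration.
    corpus = "the quick brown fox jumps over the lazy dog".split()
    window = 2

    pairs = []
    for t, center in enumerate(corpus):
        for j in range(-window, window + 1):
            if j == 0:
                continue                      # skip the center word itself
            if 0 <= t + j < len(corpus):
                pairs.append((center, corpus[t + j]))

    print(pairs[:4])
    # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
    ```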

    Objective function:

    For each position \(t = 1, \ldots, T\), predict context words within a
    window of fixed size \(m\), given the center word \(w_t\):

    \[ \text{Likelihood} = L(\theta)=\prod_{t=1}^{T} \prod_{-m \leq j \leq m \atop j \neq 0} P\left(w_{t+j} | w_{t} ; \theta\right) \]

    The objective function \(J(\theta)\) is the (average) negative log likelihood:

    \[ J(\theta)=-\frac{1}{T} \log L(\theta)=-\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \leq j \leq m \atop j \neq 0} \log P\left(w_{t+j} | w_{t} ; \theta\right) \]

    Note: this is a routine step; first write down the likelihood, then turn it into a loss function.
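
    To make the likelihood-to-loss step concrete, here is a tiny sketch with made-up probabilities (in practice they come from the softmax discussed next):

    ```python
    import numpy as np

    # Made-up values of P(w_{t+j} | w_t) for the context words around T = 3 positions.
    probs_per_position = [
        [0.20, 0.05],   # position t = 1
        [0.10, 0.30],   # position t = 2
        [0.25, 0.15],   # position t = 3
    ]

    T = len(probs_per_position)
    J = -sum(np.log(p) for ps in probs_per_position for p in ps) / T
    print(J)   # the average negative log likelihood J(theta)
    ```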

    Question: how do we compute \(P\left(w_{t+j} | w_{t} ; \theta\right)\)?

    For a center word \(c\) and a context word \(o\):

    \[ P(o | c)=\frac{\exp\left(u_{o}^{T} v_{c}\right)}{\sum_{w \in V} \exp\left(u_{w}^{T} v_{c}\right)} \]

    This question puzzled me for a long time; I never quite saw how the probability could be computed. Seeing this formula suddenly made it click: it looks a lot like \(p(y | x)=\frac{p(x, y)}{p(x)}\).

    The softmax function:

    The probability formula above is essentially a softmax function:

    \[ \operatorname{softmax}\left(x_{i}\right)=\frac{\exp\left(x_{i}\right)}{\sum_{j=1}^{n} \exp\left(x_{j}\right)}=p_{i} \]

    The softmax function maps arbitrary values \(x_i\) to a probability distribution \(p_i\):

    • "max" because it amplifies the probability of the largest \(x_i\)
    • "soft" because it still assigns some probability to smaller \(x_i\) (a small numpy sketch follows this list)
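
    A minimal numpy sketch of this softmax, applied to the scores \(u_w^T v_c\) to get \(P(o | c)\); the vocabulary size, dimension, and vectors are toy values, and the max-subtraction is only for numerical stability:

    ```python
    import numpy as np

    def softmax(x):
        # Subtract the max for numerical stability; the resulting probabilities are unchanged.
        e = np.exp(x - np.max(x))
        return e / e.sum()

    rng = np.random.default_rng(0)
    V, d = 10, 4                     # toy vocabulary size and vector dimension
    U = rng.normal(size=(V, d))      # "outside" (context) vectors u_w
    Vc = rng.normal(size=(V, d))     # "center" vectors v_w

    c, o = 3, 7                      # indices of a center word and a context word
    p = softmax(U @ Vc[c])           # softmax over u_w^T v_c for every word w
    print(p[o], p.sum())             # P(o | c), and a check that the distribution sums to 1
    ```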

    Parameter optimization

    \(\theta\) represents all model parameters in one long vector. In our case, with \(d\)-dimensional vectors and \(V\)-many words:

    \[ \theta=\left[\begin{array}{l}{v_{\text{aardvark}}} \\ {v_{a}} \\ {\vdots} \\ {v_{\text{zebra}}} \\ {u_{\text{aardvark}}} \\ {u_{a}} \\ {\vdots} \\ {u_{\text{zebra}}}\end{array}\right] \in \mathbb{R}^{2 d V} \]

    Remember: every word has two vectors.
    Why two vectors? Easier optimization. Average both at the end.
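
    A small numpy sketch (sizes and array names invented here) of this parameter layout and of the final averaging:

    ```python
    import numpy as np

    V, d = 10, 4                 # toy vocabulary size and vector dimension
    U = np.random.randn(V, d)    # "outside" vectors u_w, one row per word
    Vc = np.random.randn(V, d)   # "center" vectors v_w, one row per word

    theta = np.concatenate([Vc.ravel(), U.ravel()])   # one long vector in R^{2dV}
    assert theta.shape == (2 * d * V,)

    embeddings = (U + Vc) / 2    # "average both at the end" to get one vector per word
    ```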

    Gradients of the word2vec parameters

    • Useful basics:

    \[ \frac{\partial \mathbf{x}^{T} \mathbf{a}}{\partial \mathbf{x}}=\frac{\partial \mathbf{a}^{T} \mathbf{x}}{\partial \mathbf{x}}=\mathbf{a} \]

    • Chain rule: if \(y=f(u)\) and \(u=g(x)\), i.e. \(y=f(g(x))\), then:

    \[ \frac{d y}{d x}=\frac{d y}{d u} \frac{d u}{d x} \]
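
    Putting these two facts together gives, for example, the gradient of \(\log P(o | c)\) with respect to the center vector \(v_c\); this short derivation is not spelled out in the notes above but follows directly from the softmax formula:

    \[ \begin{aligned} \frac{\partial}{\partial v_{c}} \log P(o | c) &=\frac{\partial}{\partial v_{c}}\left[u_{o}^{T} v_{c}-\log \sum_{w \in V} \exp\left(u_{w}^{T} v_{c}\right)\right] \\ &=u_{o}-\sum_{w \in V} \frac{\exp\left(u_{w}^{T} v_{c}\right)}{\sum_{x \in V} \exp\left(u_{x}^{T} v_{c}\right)} u_{w} \\ &=u_{o}-\sum_{w \in V} P(w | c)\, u_{w} \end{aligned} \]

    In other words, the gradient is the observed context vector \(u_o\) minus the expected context vector under the current model.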

    Two model variants

    • Skip-grams (SG)
      Predict context ("outside") words (position independent) given the center word
    • Continuous Bag of Words (CBOW)
      Predict center word from (bag of) context words

    Speed-up methods

    • Negative sampling
    • Hierarchical softmax

    The skip-gram model with negative sampling (HW2)

    Problem: the normalization factor is too computationally expensive.

    Main idea: train binary logistic regressions for a true pair (the center word and a word in its context window) versus several noise pairs (the center word paired with a random word).

    Objective function:

    \[ J(\theta)=\frac{1}{T} \sum_{t=1}^{T} J_{t}(\theta) \]

    \[ J_{t}(\theta)=\log \sigma\left(u_{o}^{T} v_{c}\right)+\sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)}\left[\log \sigma\left(-u_{j}^{T} v_{c}\right)\right] \]

    Notes:

    • Take k negative samples, drawn according to word probabilities;
    • \(P(w)=U(w)^{3/4} / Z\), where \(U(w)\) is the unigram distribution and \(Z\) is a normalizing constant;
    • The 3/4 power makes less frequent words be sampled more often (a sampling sketch follows this list).
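
    Here is a minimal sketch of this sampling and of a single term \(J_t(\theta)\); the vocabulary size, counts, and word indices are toy values invented for illustration:

    ```python
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    V, d, k = 10, 4, 5                     # toy vocabulary size, dimension, number of negatives
    U = rng.normal(size=(V, d))            # "outside" vectors
    Vc = rng.normal(size=(V, d))           # "center" vectors

    counts = rng.integers(1, 100, size=V)  # toy unigram counts, i.e. U(w) up to normalization
    probs = counts ** 0.75
    probs = probs / probs.sum()            # P(w) = U(w)^{3/4} / Z

    c, o = 3, 7                            # a true (center, outside) pair
    neg = rng.choice(V, size=k, p=probs)   # k negative samples drawn from P(w)

    J_t = np.log(sigmoid(U[o] @ Vc[c])) + np.sum(np.log(sigmoid(-U[neg] @ Vc[c])))
    print(J_t)                             # this term is maximized (its negation is the loss)
    ```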

    But why not capture co-occurrence counts directly?

    • 2 options: windows vs. full document
    • Window: similar to word2vec, use a window around each word -> captures both syntactic (POS) and semantic information (a small counting sketch follows this list)
    • A word-document co-occurrence matrix will give general topics (all sports terms will have similar entries), leading to "Latent Semantic Analysis"
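
    A counting sketch for the window option; the toy corpus and the window size of 1 are invented for illustration:

    ```python
    import numpy as np

    corpus = "I like deep learning . I like NLP . I enjoy flying .".split()
    vocab = sorted(set(corpus))
    idx = {w: i for i, w in enumerate(vocab)}
    window = 1

    X = np.zeros((len(vocab), len(vocab)), dtype=np.int64)
    for t, w in enumerate(corpus):
        for j in range(max(0, t - window), min(len(corpus), t + window + 1)):
            if j != t:
                X[idx[w], idx[corpus[j]]] += 1

    print(vocab)
    print(X)   # symmetric counts of how often each pair co-occurs within the window
    ```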

    Problems with simple co-occurrence vectors

    • Increase in size with vocabulary
    • Very high dimensional: requires a lot of storage
    • Subsequent classification models have sparsity issues -> Models are less robust

    Solution: Low dimensional vectors

    This question amounts to asking: how do we reduce the dimensionality?

    Method 1: Dimensionality Reduction on X (HW1)

    • Singular Value Decomposition of the co-occurrence matrix X
      Factorizes X into \(U \Sigma V^{T}\), where U and V are orthonormal (a small sketch follows)
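
    A minimal sketch of this factorization with numpy; X can be a co-occurrence matrix such as the toy one built earlier, and k is an arbitrary target dimensionality:

    ```python
    import numpy as np

    # X: a |V| x |V| co-occurrence matrix (e.g. the toy one from the sketch above).
    U_svd, S, Vt = np.linalg.svd(X.astype(float), full_matrices=False)

    k = 2                                 # target dimensionality (arbitrary here)
    word_vectors = U_svd[:, :k] * S[:k]   # keep only the top-k singular directions
    print(word_vectors.shape)             # (|V|, k) low-dimensional word vectors
    ```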

    Hacks to X

    • Scaling the counts in the cells can help a lot (see the one-line sketch after this list)
    • Problem: function words (the, he, has) are too frequent -> syntax has too much impact. Some fixes:
      • \(\min(X, t)\), with \(t \approx 100\)
      • Ignore them all
      • Use Pearson correlations instead of counts, then set negative values to 0
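
    For instance, the count cap is a one-liner; the log scaling shown alongside is another common choice, added here as an assumption rather than something stated above:

    ```python
    import numpy as np

    X_capped = np.minimum(X, 100)   # min(X, t) with t = 100: cap overly frequent pairs
    X_logged = np.log1p(X)          # log(1 + count): a common alternative scaling (assumption)
    ```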

    Count-based vs. direct prediction:

    • LSA, HAL, COALS, Hellinger-PCA
      • Fast training
      • Efficient usage of statistics
      • Primarily used to capture word similarity
      • Disproportionate importance given to large counts
    • Skip-gram/CBOW, NNLM, HLBL, RNN
      • Generate improved performance on other tasks
      • Can capture complex patterns beyond word similarity
      • Scales with corpus size
      • Inefficient usage of statistics

    Global Vectors for Word Representation (GloVe)

    • The first set are count-based and rely on matrix factorization (e.g. LSA, HAL). While these methods effectively leverage global statistical information, they are primarily used to capture word similarities and do poorly on tasks such as word analogy, indicating a sub-optimal vector space structure.
    • The other set of methods are shallow window-based (e.g. the skip-gram and the CBOW models), which learn word embeddings by making predictions in local context windows. These models demonstrate the capacity to capture complex linguistic patterns beyond word similarity, but fail to make use of the global co-occurrence statistics.

    Main idea:

    GloVe consists of a weighted least-squares model that trains on global word-word co-occurrence counts and thus makes efficient use of statistics.

    Algorithm:

    Let \(X\) denote the word-word co-occurrence matrix, where \(X_{ij}\) indicates the number of times word \(j\) occurs in the context of word \(i\), and let \(P_{i j}=P\left(w_{j} | w_{i}\right)=\frac{X_{i j}}{X_{i}}\) be the probability of word \(j\) appearing in the context of word \(i\).

    The objective of the skip-gram model is:

    \[ J=-\sum_{i \in \text{corpus}} \sum_{j \in \text{context}(i)} \log Q_{i j} \]

    \[ Q_{i j}=\frac{\exp\left(\vec{u}_{j}^{T} \vec{v}_{i}\right)}{\sum_{w=1}^{W} \exp\left(\vec{u}_{w}^{T} \vec{v}_{i}\right)} \]

    One significant drawback of the cross-entropy loss is that it requires the distribution Q to be properly normalized, which involves the expensive summation over the entire vocabulary. Instead, we use a least square objective in which the normalization factors in P and Q are discarded:

    \[ \hat{J}=\sum_{i=1}^{W} \sum_{j=1}^{W} X_{i}\left(\hat{P}_{i j}-\hat{Q}_{i j}\right)^{2} \]

    where \(\hat{P}_{i j}=X_{i j}\) and \(\hat{Q}_{i j}=\exp\left(\vec{u}_{j}^{T} \vec{v}_{i}\right)\) are the unnormalized distributions.

    This formulation introduces a new problem: \(X_{ij}\) often takes on very large values, which makes the optimization difficult. An effective fix is to minimize the squared error of the logarithms of \(\hat{P}\) and \(\hat{Q}\) instead:

    \[ \begin{aligned} \hat{J} &=\sum_{i=1}^{W} \sum_{j=1}^{W} X_{i}\left(\log \hat{P}_{i j}-\log \hat{Q}_{i j}\right)^{2} \\ &=\sum_{i=1}^{W} \sum_{j=1}^{W} X_{i}\left(\vec{u}_{j}^{T} \vec{v}_{i}-\log X_{i j}\right)^{2} \end{aligned} \]

    Another observation is that the weighting factor \(X_i\) is not guaranteed to be optimal. Instead, we introduce a more general weighting function, which we are free to take to depend on the context word as well:

    \[ \hat{J}=\sum_{i=1}^{W} \sum_{j=1}^{W} f\left(X_{i j}\right)\left(\vec{u}_{j}^{T} \vec{v}_{i}-\log X_{i j}\right)^{2} \]
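
    A small sketch of this weighted least-squares objective; the particular weighting function \(f(x)=\min\left((x / x_{\max})^{\alpha}, 1\right)\) with \(x_{\max}=100\) and \(\alpha=3/4\) is the one used in the GloVe paper, filled in here as a concrete choice:

    ```python
    import numpy as np

    def glove_weight(x, x_max=100.0, alpha=0.75):
        # Weighting f(X_ij): down-weights rare pairs and caps very frequent ones.
        return min((x / x_max) ** alpha, 1.0)

    def glove_loss(X, U_out, V_center):
        # X: W x W co-occurrence counts; U_out, V_center: the two W x d embedding matrices.
        # Only nonzero entries contribute, since log(0) is undefined.
        loss = 0.0
        for i, j in zip(*np.nonzero(X)):
            diff = U_out[j] @ V_center[i] - np.log(X[i, j])
            loss += glove_weight(X[i, j]) * diff ** 2
        return loss
    ```

    The published GloVe model also adds a bias term for each word; those are omitted here to match the formula above.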

    Evaluating word2vec

