    cs224n-word2vec

    Summary:

    • Where the original idea behind word2vec comes from: the distributional hypothesis;
    • The original word2vec objective, built from a likelihood function;
    • How the probability is computed with softmax, and a brief interpretation of softmax;
    • The two word2vec variants, and an introduction to the speed-up methods;
    • How sampling is done in HW2, and why this operation is needed;
    • Methods based on the co-occurrence matrix, mainly LSI (LSA);
    • Some improvements to LSI;

    Description

    The goal is to encode the meaning of a word into a vector representation.

    Distributional semantics

    A word’s meaning is given by the words that frequently appear close-by

    The idea behind word2vec is simple: assume that a word's meaning is related to its context, so the context around a word can be used to represent that word's meaning.

    Note: this is a reasonable way to think about it to some extent, but is there a better assumption? Is the surrounding context always a good representation of the current word?

    Introduction to word2vec

    Framework:

    • A sufficiently large corpus of text;
    • Every word in the vocabulary is represented by a vector;
    • Go through each position \(t\) in the text, which gives a center word \(w_t\) and its context \(c_t\);
    • Use the similarity of the word vectors to compute the conditional probability \(P(c | w_t)\);
    • Keep adjusting the word vectors to maximize this probability (a pair-extraction sketch follows this list).
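
    For concreteness, here is a minimal sketch of how such (center word, context word) training pairs can be extracted with a fixed window; the toy corpus and window size are invented for illustration.

    ```python
    # A minimal sketch: extract (center, context) pairs with a fixed window m = 2.
    # The toy corpus is invented for illustration.
    corpus = "the quick brown fox jumps over the lazy dog".split()
    window = 2

    pairs = []
    for t, center in enumerate(corpus):
        for j in range(-window, window + 1):
            if j == 0:
                continue                      # skip the center word itself
            if 0 <= t + j < len(corpus):
                pairs.append((center, corpus[t + j]))

    print(pairs[:4])
    # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
    ```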

    Objective function:

    For each position \(t = 1, \ldots, T\), predict context words within a
    window of fixed size \(m\), given the center word \(w_t\):

    \[ \text{Likelihood} = L(\theta)=\prod_{t=1}^{T} \prod_{-m \leq j \leq m \atop j \neq 0} P\left(w_{t+j} | w_{t} ; \theta\right) \]

    The objective function \(J(\theta)\) is the (average) negative log likelihood:

    \[ J(\theta)=-\frac{1}{T} \log L(\theta)=-\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \leq j \leq m \atop j \neq 0} \log P\left(w_{t+j} | w_{t} ; \theta\right) \]

    Note: this is a routine step; first write down the likelihood, then turn it into a loss function.
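
    To make the likelihood-to-loss step concrete, here is a tiny sketch with made-up probabilities (in practice they come from the softmax discussed next):

    ```python
    import numpy as np

    # Made-up values of P(w_{t+j} | w_t) for the context words around T = 3 positions.
    probs_per_position = [
        [0.20, 0.05],   # position t = 1
        [0.10, 0.30],   # position t = 2
        [0.25, 0.15],   # position t = 3
    ]

    T = len(probs_per_position)
    J = -sum(np.log(p) for ps in probs_per_position for p in ps) / T
    print(J)   # the average negative log likelihood J(theta)
    ```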

    Question: how do we compute \(P\left(w_{t+j} | w_{t} ; \theta\right)\)?

    For a center word \(c\) and a context word \(o\):

    \[ P(o | c)=\frac{\exp\left(u_{o}^{T} v_{c}\right)}{\sum_{w \in V} \exp\left(u_{w}^{T} v_{c}\right)} \]

    This question puzzled me for a long time; I never quite saw how the probability could be computed. Seeing this formula suddenly made it click: it looks a lot like \(p(y | x)=\frac{p(x, y)}{p(x)}\).

    The softmax function:

    The probability formula above is essentially a softmax function:

    \[ \operatorname{softmax}\left(x_{i}\right)=\frac{\exp\left(x_{i}\right)}{\sum_{j=1}^{n} \exp\left(x_{j}\right)}=p_{i} \]

    The softmax function maps arbitrary values \(x_i\) to a probability distribution \(p_i\):

    • "max" because it amplifies the probability of the largest \(x_i\)
    • "soft" because it still assigns some probability to smaller \(x_i\) (a small numpy sketch follows this list)
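
    A minimal numpy sketch of this softmax, applied to the scores \(u_w^T v_c\) to get \(P(o | c)\); the vocabulary size, dimension, and vectors are toy values, and the max-subtraction is only for numerical stability:

    ```python
    import numpy as np

    def softmax(x):
        # Subtract the max for numerical stability; the resulting probabilities are unchanged.
        e = np.exp(x - np.max(x))
        return e / e.sum()

    rng = np.random.default_rng(0)
    V, d = 10, 4                     # toy vocabulary size and vector dimension
    U = rng.normal(size=(V, d))      # "outside" (context) vectors u_w
    Vc = rng.normal(size=(V, d))     # "center" vectors v_w

    c, o = 3, 7                      # indices of a center word and a context word
    p = softmax(U @ Vc[c])           # softmax over u_w^T v_c for every word w
    print(p[o], p.sum())             # P(o | c), and a check that the distribution sums to 1
    ```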

    Parameter optimization

    \(\theta\) represents all model parameters in one long vector. In our case, with \(d\)-dimensional vectors and \(V\)-many words:

    \[ \theta=\left[\begin{array}{l}{v_{\text{aardvark}}} \\ {v_{a}} \\ {\vdots} \\ {v_{\text{zebra}}} \\ {u_{\text{aardvark}}} \\ {u_{a}} \\ {\vdots} \\ {u_{\text{zebra}}}\end{array}\right] \in \mathbb{R}^{2 d V} \]

    Remember: every word has two vectors.
    Why two vectors? Easier optimization. Average both at the end.
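
    A small numpy sketch (sizes and array names invented here) of this parameter layout and of the final averaging:

    ```python
    import numpy as np

    V, d = 10, 4                 # toy vocabulary size and vector dimension
    U = np.random.randn(V, d)    # "outside" vectors u_w, one row per word
    Vc = np.random.randn(V, d)   # "center" vectors v_w, one row per word

    theta = np.concatenate([Vc.ravel(), U.ravel()])   # one long vector in R^{2dV}
    assert theta.shape == (2 * d * V,)

    embeddings = (U + Vc) / 2    # "average both at the end" to get one vector per word
    ```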

    Gradients of the word2vec parameters

    • Useful basics:

    \[ \frac{\partial \mathbf{x}^{T} \mathbf{a}}{\partial \mathbf{x}}=\frac{\partial \mathbf{a}^{T} \mathbf{x}}{\partial \mathbf{x}}=\mathbf{a} \]

    • Chain rule: if \(y=f(u)\) and \(u=g(x)\), i.e. \(y=f(g(x))\), then:

    \[ \frac{d y}{d x}=\frac{d y}{d u} \frac{d u}{d x} \]
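
    Putting these two facts together gives, for example, the gradient of \(\log P(o | c)\) with respect to the center vector \(v_c\); this short derivation is not spelled out in the notes above but follows directly from the softmax formula:

    \[ \begin{aligned} \frac{\partial}{\partial v_{c}} \log P(o | c) &=\frac{\partial}{\partial v_{c}}\left[u_{o}^{T} v_{c}-\log \sum_{w \in V} \exp\left(u_{w}^{T} v_{c}\right)\right] \\ &=u_{o}-\sum_{w \in V} \frac{\exp\left(u_{w}^{T} v_{c}\right)}{\sum_{x \in V} \exp\left(u_{x}^{T} v_{c}\right)} u_{w} \\ &=u_{o}-\sum_{w \in V} P(w | c)\, u_{w} \end{aligned} \]

    In other words, the gradient is the observed context vector \(u_o\) minus the expected context vector under the current model.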

    Two model variants

    • Skip-grams (SG)
      Predict context ("outside") words (position independent) given the center word
    • Continuous Bag of Words (CBOW)
      Predict center word from (bag of) context words

    Speed-up methods

    • Negative sampling
    • Hierarchical softmax

    The skip-gram model with negative sampling (HW2)

    Problem: the normalization factor is too computationally expensive.

    Main idea: train binary logistic regressions for a true pair (the center word and a word in its context window) versus several noise pairs (the center word paired with a random word).

    Objective function:

    \[ J(\theta)=\frac{1}{T} \sum_{t=1}^{T} J_{t}(\theta) \]

    \[ J_{t}(\theta)=\log \sigma\left(u_{o}^{T} v_{c}\right)+\sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)}\left[\log \sigma\left(-u_{j}^{T} v_{c}\right)\right] \]

    Notes:

    • Take k negative samples, drawn according to word probabilities;
    • \(P(w)=U(w)^{3/4} / Z\), where \(U(w)\) is the unigram distribution and \(Z\) is a normalizing constant;
    • The 3/4 power makes less frequent words be sampled more often (a sampling sketch follows this list).
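
    Here is a minimal sketch of this sampling and of a single term \(J_t(\theta)\); the vocabulary size, counts, and word indices are toy values invented for illustration:

    ```python
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    V, d, k = 10, 4, 5                     # toy vocabulary size, dimension, number of negatives
    U = rng.normal(size=(V, d))            # "outside" vectors
    Vc = rng.normal(size=(V, d))           # "center" vectors

    counts = rng.integers(1, 100, size=V)  # toy unigram counts, i.e. U(w) up to normalization
    probs = counts ** 0.75
    probs = probs / probs.sum()            # P(w) = U(w)^{3/4} / Z

    c, o = 3, 7                            # a true (center, outside) pair
    neg = rng.choice(V, size=k, p=probs)   # k negative samples drawn from P(w)

    J_t = np.log(sigmoid(U[o] @ Vc[c])) + np.sum(np.log(sigmoid(-U[neg] @ Vc[c])))
    print(J_t)                             # this term is maximized (its negation is the loss)
    ```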

    But why not capture co-occurrence counts directly?

    • 2 options: windows vs. full document
    • Window: similar to word2vec, use a window around each word -> captures both syntactic (POS) and semantic information (a small counting sketch follows this list)
    • A word-document co-occurrence matrix will give general topics (all sports terms will have similar entries), leading to "Latent Semantic Analysis"
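
    A counting sketch for the window option; the toy corpus and the window size of 1 are invented for illustration:

    ```python
    import numpy as np

    corpus = "I like deep learning . I like NLP . I enjoy flying .".split()
    vocab = sorted(set(corpus))
    idx = {w: i for i, w in enumerate(vocab)}
    window = 1

    X = np.zeros((len(vocab), len(vocab)), dtype=np.int64)
    for t, w in enumerate(corpus):
        for j in range(max(0, t - window), min(len(corpus), t + window + 1)):
            if j != t:
                X[idx[w], idx[corpus[j]]] += 1

    print(vocab)
    print(X)   # symmetric counts of how often each pair co-occurs within the window
    ```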

    Problems with simple co-occurrence vectors

    • Increase in size with vocabulary
    • Very high dimensional: requires a lot of storage
    • Subsequent classification models have sparsity issues -> Models are less robust

    Solution: Low dimensional vectors

    This question amounts to asking: how do we reduce the dimensionality?

    Method 1: Dimensionality Reduction on X (HW1)

    • Singular Value Decomposition of the co-occurrence matrix X
      Factorizes X into \(U \Sigma V^{T}\), where U and V are orthonormal (a small sketch follows)
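
    A minimal sketch of this factorization with numpy; X can be a co-occurrence matrix such as the toy one built earlier, and k is an arbitrary target dimensionality:

    ```python
    import numpy as np

    # X: a |V| x |V| co-occurrence matrix (e.g. the toy one from the sketch above).
    U_svd, S, Vt = np.linalg.svd(X.astype(float), full_matrices=False)

    k = 2                                 # target dimensionality (arbitrary here)
    word_vectors = U_svd[:, :k] * S[:k]   # keep only the top-k singular directions
    print(word_vectors.shape)             # (|V|, k) low-dimensional word vectors
    ```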

    Hacks to X

    • Scaling the counts in the cells can help a lot (see the one-line sketch after this list)
    • Problem: function words (the, he, has) are too frequent -> syntax has too much impact. Some fixes:
      • \(\min(X, t)\), with \(t \approx 100\)
      • Ignore them all
      • Use Pearson correlations instead of counts, then set negative values to 0
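
    For instance, the count cap is a one-liner; the log scaling shown alongside is another common choice, added here as an assumption rather than something stated above:

    ```python
    import numpy as np

    X_capped = np.minimum(X, 100)   # min(X, t) with t = 100: cap overly frequent pairs
    X_logged = np.log1p(X)          # log(1 + count): a common alternative scaling (assumption)
    ```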

    Count-based vs. direct prediction:

    • LSA, HAL, COALS, Hellinger-PCA
      • Fast training
      • Efficient usage of statistics
      • Primarily used to capture word similarity
      • Disproportionate importance given to large counts
    • Skip-gram/CBOW, NNLM, HLBL, RNN
      • Generate improved performance on other tasks
      • Can capture complex patterns beyond word similarity
      • Scales with corpus size
      • Inefficient usage of statistics

    Global Vectors for Word Representation (GloVe)

    • The first set are count-based and rely on matrix factorization (e.g. LSA, HAL). While these methods effectively leverage global statistical information, they are primarily used to capture word similarities and do poorly on tasks such as word analogy, indicating a sub-optimal vector space structure.
    • The other set of methods are shallow window-based (e.g. the skip-gram and the CBOW models), which learn word embeddings by making predictions in local context windows. These models demonstrate the capacity to capture complex linguistic patterns beyond word similarity, but fail to make use of the global co-occurrence statistics.

    Main idea:

    GloVe consists of a weighted least-squares model that trains on global word-word co-occurrence counts and thus makes efficient use of statistics.

    Algorithm:

    Let \(X\) denote the word-word co-occurrence matrix, where \(X_{ij}\) indicates the number of times word \(j\) occurs in the context of word \(i\), and let \(P_{i j}=P\left(w_{j} | w_{i}\right)=\frac{X_{i j}}{X_{i}}\) be the probability of word \(j\) appearing in the context of word \(i\).

    The objective of the skip-gram model is:

    \[ J=-\sum_{i \in \text{corpus}} \sum_{j \in \text{context}(i)} \log Q_{i j} \]

    \[ Q_{i j}=\frac{\exp\left(\vec{u}_{j}^{T} \vec{v}_{i}\right)}{\sum_{w=1}^{W} \exp\left(\vec{u}_{w}^{T} \vec{v}_{i}\right)} \]

    One significant drawback of the cross-entropy loss is that it requires the distribution Q to be properly normalized, which involves the expensive summation over the entire vocabulary. Instead, we use a least square objective in which the normalization factors in P and Q are discarded:

    \[ \hat{J}=\sum_{i=1}^{W} \sum_{j=1}^{W} X_{i}\left(\hat{P}_{i j}-\hat{Q}_{i j}\right)^{2} \]

    where \(\hat{P}_{i j}=X_{i j}\) and \(\hat{Q}_{i j}=\exp\left(\vec{u}_{j}^{T} \vec{v}_{i}\right)\) are the unnormalized distributions.

    This formulation introduces a new problem: \(X_{ij}\) often takes on very large values, which makes the optimization difficult. An effective fix is to minimize the squared error of the logarithms of \(\hat{P}\) and \(\hat{Q}\) instead:

    \[ \begin{aligned} \hat{J} &=\sum_{i=1}^{W} \sum_{j=1}^{W} X_{i}\left(\log \hat{P}_{i j}-\log \hat{Q}_{i j}\right)^{2} \\ &=\sum_{i=1}^{W} \sum_{j=1}^{W} X_{i}\left(\vec{u}_{j}^{T} \vec{v}_{i}-\log X_{i j}\right)^{2} \end{aligned} \]

    Another observation is that the weighting factor \(X_i\) is not guaranteed to be optimal. Instead, we introduce a more general weighting function, which we are free to take to depend on the context word as well:

    \[ \hat{J}=\sum_{i=1}^{W} \sum_{j=1}^{W} f\left(X_{i j}\right)\left(\vec{u}_{j}^{T} \vec{v}_{i}-\log X_{i j}\right)^{2} \]
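
    A small sketch of this weighted least-squares objective; the particular weighting function \(f(x)=\min\left((x / x_{\max})^{\alpha}, 1\right)\) with \(x_{\max}=100\) and \(\alpha=3/4\) is the one used in the GloVe paper, filled in here as a concrete choice:

    ```python
    import numpy as np

    def glove_weight(x, x_max=100.0, alpha=0.75):
        # Weighting f(X_ij): down-weights rare pairs and caps very frequent ones.
        return min((x / x_max) ** alpha, 1.0)

    def glove_loss(X, U_out, V_center):
        # X: W x W co-occurrence counts; U_out, V_center: the two W x d embedding matrices.
        # Only nonzero entries contribute, since log(0) is undefined.
        loss = 0.0
        for i, j in zip(*np.nonzero(X)):
            diff = U_out[j] @ V_center[i] - np.log(X[i, j])
            loss += glove_weight(X[i, j]) * diff ** 2
        return loss
    ```

    The published GloVe model also adds a bias term for each word; those are omitted here to match the formula above.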

    Evaluating word2vec

