Notes from Notes on Noise Contrastive Estimation and Negative Sampling
one sample:
\[x_i \to [y_i^0,\cdots,y_i^{k}]
\]
where \(y_i^0\) is the true labeled word and \(y_i^1,\cdots,y_i^{k}\) are the word indices of the noise samples, which are drawn from the unigram distribution \(q(w)\) of the dataset.
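A minimal sketch of how the noise words could be drawn, assuming \(q(w)\) is estimated from raw token counts. The helper names `unigram_distribution` and `sample_noise` and the toy corpus are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter


def unigram_distribution(tokens):
    """Estimate q(w) as the relative frequency of each word (assumption)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def sample_noise(q, k, rng):
    """Draw k noise words independently from the unigram distribution q."""
    words = list(q)
    weights = [q[word] for word in words]
    return rng.choices(words, weights=weights, k=k)


# Toy corpus, purely illustrative.
tokens = "the cat sat on the mat the end".split()
q = unigram_distribution(tokens)
noise = sample_noise(q, k=5, rng=random.Random(0))
```

Frequent words such as "the" are drawn as noise more often, which is what sampling from the unigram distribution implies.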
the probability of true data:
\[p(y_i^0=1|x_i,\theta)=\frac{\exp(y_i^0 h_\theta)}{\exp(y_i^0 h_\theta) + k*q(y_i^0)}
\]
the noise sample probability:
\[p(y_i^t=0|x_i,\theta)=\frac{k*q(y_i^t)}{\exp(y_i^t h_\theta) + k*q(y_i^t)},\quad t=1,\cdots,k
\]
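The two probabilities above can be sketched as follows. The argument `score` stands for the unnormalized model score of a word (the term inside \(\exp(\cdot)\)), and `q_word` for its unigram probability; both names and the function names are assumptions made for this sketch.

```python
import math


def p_true(score, q_word, k):
    """Probability that a word with this score came from the data."""
    u = math.exp(score)
    return u / (u + k * q_word)


def p_noise(score, q_word, k):
    """Probability that a word with this score came from the noise distribution."""
    u = math.exp(score)
    return k * q_word / (u + k * q_word)


# With score 0 the model term is exp(0) = 1; with k * q = 1 both outcomes are equally likely.
example_true = p_true(0.0, 0.1, 10)
example_noise = p_noise(0.0, 0.1, 10)
```

For any single word the two values sum to 1, since they share the same denominator.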
the cost function of this sample (a log-likelihood, so it is maximized):
\[l_{nce}=\log p(y_i^0|x_i,\theta)+\sum_{t=1}^k{\log p(y_i^t|x_i,\theta)}
\]
the overall cost function of the dataset:
\[\mathcal{L}_{nce}=\frac{1}{N}\sum_i^N{\left\{\log p(y_i^0|x_i,\theta)+\sum_{t=1}^k{\log p(y_i^t|x_i,\theta)}\right\}}
\]
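A small sketch of the per-sample and dataset-level objectives, under the assumption that each sample is given as a tuple of precomputed scores and unigram probabilities. The tuple layout and the function names are hypothetical. Since the quantity is a sum of log-probabilities, a training loop would typically minimize its negative.

```python
import math


def nce_sample_objective(true_score, true_q, noise_scores, noise_qs):
    """Log-probability of the true word plus those of the k noise words."""
    k = len(noise_scores)
    u = math.exp(true_score)
    total = math.log(u / (u + k * true_q))
    for score, q_word in zip(noise_scores, noise_qs):
        total += math.log(k * q_word / (math.exp(score) + k * q_word))
    return total


def nce_dataset_objective(samples):
    """Average of the per-sample objective over all N samples."""
    return sum(nce_sample_objective(*s) for s in samples) / len(samples)


# Two identical toy samples: k = 2, every score is 0, every q is 0.5,
# so each of the three probabilities per sample equals 0.5.
samples = [(0.0, 0.5, [0.0, 0.0], [0.5, 0.5])] * 2
value = nce_dataset_objective(samples)
```

In this toy case the result is `3 * log(0.5)`, because every term in each sample contributes `log(0.5)`.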
Related Papers
[Noise-Contrastive Estimation of Unnormalized Statistical Models with Applications to Natural Image Statistics]
[Word2vec Parameter Learning Explained]
[Efficient Estimation of Word Representations in Vector Space]
[Distributed Representations of Words and Phrases and their Compositionality]
[Notes on Noise Contrastive Estimation and Negative Sampling]