Bag-of-words Model
Earlier state-of-the-art document representations were based on the bag-of-words model, which represents each input document as a fixed-length vector. For example, borrowing from the Wikipedia article, the two documents
(1) John likes to watch movies. Mary likes movies too.
(2) John also likes to watch football games.
are used to construct a length-10 list of words: ["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games"]
We can then represent the two documents as fixed-length vectors whose elements are the frequencies of the corresponding words in our list:
(1) [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
(2) [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
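The construction above can be reproduced in a few lines of plain Python. This is a minimal sketch, not a production tokenizer: the helper names `tokenize` and `bag_of_words` are made up for illustration, and the vocabulary is simply ordered by first appearance so that it matches the word list shown above.

```python
import re
from collections import Counter

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]


def tokenize(text):
    # Keep runs of letters only; casing is preserved as in the example.
    return re.findall(r"[A-Za-z]+", text)


# Build the vocabulary in order of first appearance across all documents.
vocab = []
for doc in docs:
    for word in tokenize(doc):
        if word not in vocab:
            vocab.append(word)


def bag_of_words(text, vocab):
    # Count each token, then read the counts off in vocabulary order.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]


vectors = [bag_of_words(doc, vocab) for doc in docs]
for vec in vectors:
    print(vec)
```

Every document maps to a vector of the same length (the vocabulary size), regardless of how many words the document contains.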
Bag-of-words models are surprisingly effective but discard all information about word order. Bag-of-n-grams models represent documents as fixed-length vectors over phrases of n consecutive words, which captures local word order but suffers from data sparsity and high dimensionality.
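To make the dimensionality point concrete, here is a rough sketch of a bag of bigrams (n = 2) over the same two documents. The helper names are hypothetical, and for simplicity the bigrams run across sentence boundaries, which a real pipeline would probably avoid.

```python
import re
from collections import Counter

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]


def tokenize(text):
    return re.findall(r"[A-Za-z]+", text)


def ngrams(tokens, n):
    # Slide a window of n consecutive tokens over the sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Vocabulary of bigrams, ordered by first appearance.
bigram_vocab = []
for doc in docs:
    for gram in ngrams(tokenize(doc), 2):
        if gram not in bigram_vocab:
            bigram_vocab.append(gram)


def bag_of_ngrams(text, vocab, n):
    counts = Counter(ngrams(tokenize(text), n))
    return [counts[gram] for gram in vocab]


bigram_vectors = [bag_of_ngrams(doc, bigram_vocab, 2) for doc in docs]
print(len(bigram_vocab))
for vec in bigram_vectors:
    print(vec)
```

Even on this tiny corpus the bigram vocabulary (12 entries) is larger than the unigram vocabulary (10 entries), and the two documents share only the bigrams ("likes", "to") and ("to", "watch"), so the vectors are mostly zeros.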
A Detailed Explanation of the Mathematics Behind word2vec
Writing word2vec Yourself (1): Main Concepts and Workflow
1. Sparse vectors, also known as one-hot representations
2. Dense vectors, also known as distributed representations
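The difference between the two representations can be sketched as follows. The dense vectors here are random placeholders with an arbitrary dimensionality of 4, chosen only to show the shape of the data; in word2vec they would be learned from a corpus.

```python
import random

vocab = ["John", "likes", "to", "watch", "movies",
         "Mary", "too", "also", "football", "games"]


def one_hot(word):
    # Sparse: as long as the vocabulary, with a single 1 at the word's index.
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Dense: a short real-valued vector per word (random stand-ins, not trained).
rng = random.Random(0)
dim = 4
dense = {w: [rng.uniform(-1, 1) for _ in range(dim)] for w in vocab}

print(one_hot("movies"))
print(dot(one_hot("movies"), one_hot("football")))
print(dense["movies"])
```

Any two distinct one-hot vectors have a dot product of 0, so the representation cannot express that "movies" is closer to "football" than to "also". Dense vectors have no such restriction, which is what lets trained embeddings encode similarity.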
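word2vec adopts an n-gram model: it assumes that a word depends only on the n words around it and is independent of the other words in the text. Such a model is simple and direct to build, and a variety of smoothing methods have since been developed for it.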
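The locality assumption can be illustrated by listing, for each word, the neighbours that fall inside a window of n words on either side. This is only a sketch of the idea: `context_pairs` is a hypothetical helper, and actual word2vec training adds steps such as subsampling and negative sampling that are not shown.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["john", "likes", "to", "watch", "movies"]
for pair in context_pairs(tokens, window=1):
    print(pair)
```

With `window=1`, "to" is paired only with "likes" and "watch"; words further away contribute nothing, which is exactly the independence assumption described above.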
# -*- coding: utf-8 -*-
import re

import nltk
from bs4 import BeautifulSoup
from gensim.models import word2vec
from sklearn.datasets import fetch_20newsgroups

news = fetch_20newsgroups(subset='all')
# Raw post texts and their newsgroup labels
X, y = news.data, news.target


def news_to_sentences(news):
    """Split one news post into sentences, each a list of lowercase words."""
    news_text = BeautifulSoup(news, 'html.parser').get_text()
    # Split into sentences (requires the NLTK punkt data)
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    raw_sentences = tokenizer.tokenize(news_text)
    # Split into words, replacing every non-letter character with a space
    sentences = []
    for sent in raw_sentences:
        sentences.append(re.sub('[^a-zA-Z]', ' ', sent.lower().strip()).split())
    return sentences


sentences = []
for x in X:
    sentences += news_to_sentences(x)

# Set values for various parameters
num_features = 300    # Word vector dimensionality
min_word_count = 20   # Minimum word count
num_workers = 2       # Number of threads to run in parallel
context = 5           # Context window size
downsampling = 1e-3   # Downsample setting for frequent words

# Written against older gensim releases; in gensim 4+ the `size` argument
# is named `vector_size` and similarity queries go through `model.wv`.
model = word2vec.Word2Vec(sentences, workers=num_workers, size=num_features,
                          min_count=min_word_count, window=context,
                          sample=downsampling)
model.init_sims(replace=True)
print(model.most_similar('morning'))