(1)首先下载word2vec,地址:https://code.google.com/p/word2vec/,可能下载的时候有问题,google上不去,那么可以从csdn上面下载。
解压后目录如下:
w2v/ `-- trunk |-- LICENSE |-- README.txt |-- compute-accuracy.c |-- demo-analogy.sh |-- demo-classes.sh |-- demo-phrase-accuracy.sh |-- demo-phrases.sh |-- demo-train-big-model-v1.sh |-- demo-word-accuracy.sh |-- demo-word.sh |-- distance.c |-- makefile |-- questions-phrases.txt |-- questions-words.txt |-- word-analogy.c |-- word2phrase.c `-- word2vec.c
(2) 进入w2c/trunk文件夹,运行make,编辑文件。从makefile中可以看到,需要编译的文件,主要有两个word2vec.c和distance.c,编译后生成word2vec和distance。但是在编译的时候可能出现问题,参照http://blog.csdn.net/zshunmiao/article/details/15339105,可以对问题进行解决。
makefile内容如下:
(3)然后就可以跑个demo了,运行./demo-word.sh。
demo-word.sh内代码如下:
CC = gcc #Using -Ofast instead of -O3 might result in faster code, but is supported only by newer GCC versions CFLAGS = -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result all: word2vec word2phrase distance word-analogy compute-accuracy word2vec : word2vec.c $(CC) word2vec.c -o word2vec $(CFLAGS) word2phrase : word2phrase.c $(CC) word2phrase.c -o word2phrase $(CFLAGS) distance : distance.c $(CC) distance.c -o distance $(CFLAGS) word-analogy : word-analogy.c $(CC) word-analogy.c -o word-analogy $(CFLAGS) compute-accuracy : compute-accuracy.c $(CC) compute-accuracy.c -o compute-accuracy $(CFLAGS) chmod +x *.sh clean: rm -rf word2vec word2phrase distance word-analogy compute-accuracy
然后输入单词,就可以计算其近义词,并按照顺序排列。
Enter word or sentence (EXIT to break): china Word: china Position in vocabulary: 486 Word Cosine distance ------------------------------------------------------------------------ japan 0.648631 taiwan 0.630534 manchuria 0.599535 tibet 0.583566 prc 0.560898 kalmykia 0.558937 xiamen 0.556037 jiang 0.553501 chinese 0.547065 liao 0.543676 india 0.536273 korea 0.534758 roc 0.530741 thailand 0.529334 hunan 0.527629 liang 0.527374 shanghai 0.526314 chongqing 0.525559 nanjing 0.521342 yunnan 0.518669 wuhan 0.516914 zhao 0.513246 xinjiang 0.509939 tuva 0.507322 guangdong 0.507288 hubei 0.505540 guangxi 0.501068 taipei 0.497673 macao 0.497303 hainan 0.494808 shandong 0.493323 shenzhen 0.491871 hangzhou 0.489323 balhae 0.488846 guangzhou 0.486907 fujian 0.485473 zhejiang 0.485011 harbin 0.483171