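1. Building word vectors from text
Use Python to split each text into words and build a word vector from them. This first requires a vocabulary; to keep things simple, we build it directly from all the words that appear in the given texts.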
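As a minimal sketch of this step, the snippet below builds a vocabulary from two toy documents and turns one of them into a word vector. The documents and the names `build_vocab` and `to_vector` are made up for illustration, and the vocabulary is sorted only so that the vector positions are reproducible.

```python
def build_vocab(documents):
    """Collect every distinct word across the documents (hypothetical helper)."""
    vocab = set()
    for doc in documents:
        vocab |= set(doc)
    return sorted(vocab)  # sorted so that vector positions are stable


def to_vector(vocab, doc):
    """Mark each vocabulary word with 1 if it occurs in doc, else 0."""
    present = set(doc)
    return [1 if word in present else 0 for word in vocab]


# Toy documents, already split into lowercase tokens (illustrative data)
docs = [
    "my dog has flea problems".split(),
    "stop posting stupid dog food".split(),
]
vocab = build_vocab(docs)
vector = to_vector(vocab, docs[0])
print(vocab)
print(vector)
```

Each position in `vector` corresponds to the word at the same position in `vocab`, so every document maps to a vector of the same length.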
2. Using word vectors to compute the probability p(x|y)
To classify a document, we multiply many probabilities together to get the probability that the document belongs to a given class, for example p(w0|1)p(w1|1)p(w2|1). If any one of these factors is 0, the whole product is 0. To lessen the impact of this, we initialize every occurrence count to 1 and every denominator to 2.

Another problem is underflow, which comes from multiplying too many small numbers. When we calculate the product p(w0|ci)p(w1|ci)p(w2|ci)...p(wN|ci) and many of the factors are very small, the result underflows and gives an incorrect answer. (Try multiplying many small numbers in Python; eventually the result rounds off to 0.) One solution is to take the natural logarithm of the product. Recall from algebra that ln(a*b) = ln(a)+ln(b), so the product becomes a sum of logarithms, which avoids the underflow or round-off problem. Do we lose anything by using the natural log of a number rather than the number itself? No: ln is monotonically increasing, so whichever class has the larger probability also has the larger log probability.
3. Using the bag-of-words model
Up until this point we've treated the presence or absence of a word as a feature. This could be described as a set-of-words model. If a word appears more than once in a document, that might convey information that the mere presence or absence of the word does not. Counting every occurrence of each word instead is known as a bag-of-words model.
4. Code
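Before the full listing, here is a small sketch of the difference between the set-of-words and bag-of-words models described in section 3. The vocabulary, the sample document, and the function names are illustrative assumptions and are not part of the program that follows.

```python
from collections import Counter


def set_of_words(vocab, doc):
    """1 if the word occurs at all in doc, 0 otherwise."""
    present = set(doc)
    return [int(word in present) for word in vocab]


def bag_of_words(vocab, doc):
    """Number of times each vocabulary word occurs in doc."""
    counts = Counter(doc)
    return [counts[word] for word in vocab]


vocab = ["buy", "cheap", "hello", "now"]         # illustrative vocabulary
doc = ["buy", "cheap", "cheap", "cheap", "now"]  # "cheap" appears three times

print(set_of_words(vocab, doc))  # [1, 1, 0, 1]
print(bag_of_words(vocab, doc))  # [1, 3, 0, 1]
```

The two vectors differ only at the position of the repeated word, which is exactly the extra information the bag-of-words model keeps.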
# -*- coding: utf-8 -*-
"""
Created on Tue Mar 28 17:22:48 2017

@author: MyHome

Split text into words with Python, build word vectors, and use a
naive Bayes classifier to classify the text probabilistically.
"""
import re
from random import shuffle

import numpy as np


def createVocabList(Dataset):
    """Build the vocabulary: every distinct word in the dataset."""
    vocabSet = set()
    for document in Dataset:
        vocabSet |= set(document)
    return list(vocabSet)


def setOfWords2Vec(vocabList, inputSet):
    """Convert a list of tokens into a word vector over vocabList."""
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            # returnVec[vocabList.index(word)] = 1  # set-of-words model
            returnVec[vocabList.index(word)] += 1   # bag-of-words model
        else:
            print("the word: %s is not in VocabList" % word)
    return returnVec


def trainNB(trainMatrix, trainCategory):
    """Estimate log p(w|c) for both classes and the prior of class 1."""
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    p = sum(trainCategory) / float(numTrainDocs)  # probability of class 1
    # Start the counts at 1 and the denominators at 2 (as described in
    # section 2) so that no estimated probability is 0.
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Store logs so the classifier can add instead of multiply
    p1_vec = np.log(p1Num / p1Denom)
    p0_vec = np.log(p0Num / p0Denom)
    return p0_vec, p1_vec, p


def classifyNB(Input, p0, p1, p):
    """Return 1 if class 1 has the larger log score, otherwise 0."""
    score1 = np.sum(Input * p1) + np.log(p)
    score0 = np.sum(Input * p0) + np.log(1.0 - p)
    return 1 if score1 > score0 else 0


def textParse(bigString):
    """Split on non-word characters; keep lowercased tokens longer than 2."""
    listOfTokens = re.split(r"\W+", bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]


def spamTest():
    """Spam classification: train on 40 emails, test on the remaining 10."""
    docList = []
    classList = []
    for i in range(1, 26):
        # errors="ignore" skips any bytes that cannot be decoded
        with open("email/spam/%d.txt" % i, errors="ignore") as f:
            docList.append(textParse(f.read()))
        classList.append(1)
        with open("email/ham/%d.txt" % i, errors="ignore") as f:
            docList.append(textParse(f.read()))
        classList.append(0)

    vocabList = createVocabList(docList)
    DataSet = list(zip(docList, classList))
    shuffle(DataSet)  # shuffles in place and returns None
    Data, Y = zip(*DataSet)
    trainMat = []
    trainClass = []
    testData = Data[40:]
    test_label = Y[40:]
    for index in range(len(Data[:40])):
        trainMat.append(setOfWords2Vec(vocabList, Data[index]))
        trainClass.append(Y[index])

    p0, p1, p = trainNB(np.array(trainMat), np.array(trainClass))
    errorCount = 0
    for index in range(len(testData)):
        wordVector = setOfWords2Vec(vocabList, testData[index])
        if classifyNB(np.array(wordVector), p0, p1, p) != test_label[index]:
            errorCount += 1
    print("the error rate is : ", float(errorCount) / len(testData))


if __name__ == "__main__":
    spamTest()
5. Summary
Using probabilities can sometimes be more effective than using hard rules for classification.
Bayesian probability and Bayes’ rule give us a way to estimate unknown probabilities
from known values.
You can reduce the need for a lot of data by assuming conditional independence
among the features in your data. The assumption we make is that the probability of
one word doesn’t depend on any other words in the document. We know this assumption
is a little simple. That’s why it’s known as naïve Bayes. Despite its incorrect
assumptions, naïve Bayes is effective at classification.
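To make the independence assumption concrete, the sketch below applies Bayes' rule to a two-word document, treating the probability of the document given a class as the product of per-word probabilities. All probabilities, the class names, and the helper names `joint` and `posterior` are made up for illustration; `math.prod` requires Python 3.8 or later.

```python
import math

# Hypothetical per-word conditional probabilities p(word | class) and priors;
# every number here is made up for illustration.
p_word = {
    "spam": {"cheap": 0.05, "buy": 0.04, "hello": 0.001},
    "ham": {"cheap": 0.002, "buy": 0.005, "hello": 0.03},
}
prior = {"spam": 0.5, "ham": 0.5}


def joint(doc, label):
    """p(label) times the product of p(word | label) over the document."""
    return prior[label] * math.prod(p_word[label][w] for w in doc)


def posterior(doc):
    """Bayes' rule: normalize the joint probabilities over both classes."""
    joints = {label: joint(doc, label) for label in prior}
    total = sum(joints.values())
    return {label: value / total for label, value in joints.items()}


result = posterior(["cheap", "buy"])
print(result)
```

Because of the independence assumption, only one probability per word and class has to be estimated, rather than one per combination of words.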
There are a number of practical considerations when implementing naïve Bayes in
a modern programming language. Underflow is one problem that can be addressed
by using the logarithm of probabilities in your calculations. The bag-of-words model is
an improvement on the set-of-words model when approaching document classification.
There are a number of other improvements, such as removing stop words, and
you can spend a long time optimizing a tokenizer.
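The underflow problem mentioned above is easy to reproduce. In this small sketch, the 500 factors of 0.01 are an arbitrary illustrative choice: the direct product collapses to zero, while the sum of logarithms stays a usable finite number.

```python
import math

probs = [0.01] * 500  # 500 small illustrative probabilities

product = 1.0
for p in probs:
    product *= p  # shrinks toward 0 and eventually underflows

log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # about -2302.6
```

Two classes whose products both underflow to 0.0 cannot be compared, but their log sums can.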