  • Filtering spam with a naive Bayes classifier

    1. Building word vectors from text

    Each document is split into words with Python and converted into a word vector. This requires a vocabulary; to keep things simple, we build it directly from all the words that appear in the given documents.
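
The vocabulary-building step can be sketched in a few lines (a minimal sketch, assuming the documents are already tokenized into lists of words):

```python
# Build a vocabulary from tokenized documents by taking the set union
# of every document's words (minimal sketch; documents are pre-tokenized).
def create_vocab_list(dataset):
    vocab = set()
    for document in dataset:
        vocab |= set(document)   # union with this document's words
    return sorted(vocab)         # sorted for a stable word order

docs = [["my", "dog", "has", "fleas"],
        ["stop", "posting", "stupid", "garbage"]]
print(create_vocab_list(docs))
```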

    2. Computing the probability p(x|y) from word vectors

    When we attempt to classify a document, we multiply a lot of probabilities together to
    get the probability that a document belongs to a given class. This will look something
    like p(w0|1)p(w1|1)p(w2|1). If any of these numbers are 0, then when we multiply
    them together we get 0. To lessen the impact of this, we’ll initialize all of our occurrence
    counts to 1, and we’ll initialize the denominators to 2.
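
This fix is Laplace (add-one) smoothing. A small sketch of the effect, using a hypothetical per-word count vector for one class:

```python
import numpy as np

# Laplace smoothing sketch: start word counts at 1 and the denominator
# at 2 so that no conditional probability is ever exactly zero.
# word_counts is a hypothetical occurrence vector for one class.
word_counts = np.array([3, 0, 5, 0])
total = word_counts.sum()

unsmoothed = word_counts / total              # contains zeros
smoothed = (word_counts + 1) / (total + 2.0)  # never zero

print(unsmoothed)
print(smoothed)
```

Any word with a raw count of zero would otherwise wipe out the whole product of probabilities.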

    Another problem is underflow: doing too many multiplications of small numbers.
    When we go to calculate the product p(w0|ci)p(w1|ci)p(w2|ci)...p(wN|ci) and many
    of these numbers are very small, we’ll get underflow, or an incorrect answer. (Try to
    multiply many small numbers in Python. Eventually it rounds off to 0.) One solution
    to this is to take the natural logarithm of this product. If you recall from algebra,
    ln(a*b) = ln(a)+ln(b). Doing this allows us to avoid the underflow or round-off
    error problem. Do we lose anything by using the natural log of a number rather than
    the number itself? The answer is no.

    3. Using the bag-of-words model

    Up until this point we’ve treated the presence or absence of a word as a feature. This
    could be described as a set-of-words model. If a word appears more than once in a
    document, that might convey some sort of information about the document over just
    the word occurring in the document or not. This approach is known as a bag-of-words
    model.
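
The difference between the two models is a single line: set the index to 1, or increment it. A small sketch against a fixed toy vocabulary:

```python
# Set-of-words vs. bag-of-words vectorization of one document
# against a fixed vocabulary (minimal sketch with toy data).
vocab = ["buy", "cheap", "now", "hello"]
doc = ["buy", "cheap", "cheap", "now"]

set_vec = [0] * len(vocab)
bag_vec = [0] * len(vocab)
for word in doc:
    if word in vocab:
        set_vec[vocab.index(word)] = 1    # presence/absence only
        bag_vec[vocab.index(word)] += 1   # count every occurrence

print(set_vec)   # [1, 1, 1, 0]
print(bag_vec)   # [1, 2, 1, 0]
```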

    4. Code

    # -*- coding: utf-8 -*-
    """
    Created on Tue Mar 28 17:22:48 2017

    @author: MyHome
    """
    '''Split documents into words with Python and build word vectors,
    then use naive Bayes to classify documents probabilistically.'''
    import numpy as np
    import re
    from random import shuffle

    '''Build the vocabulary list'''
    def createVocabList(Dataset):
        vocabSet = set([])
        for document in Dataset:
            vocabSet = vocabSet | set(document)
        return list(vocabSet)


    '''Convert a document into a word vector'''
    def setOfWords2Vec(vocabList, inputSet):
        returnVec = [0] * len(vocabList)
        for word in inputSet:
            if word in vocabList:
                # returnVec[vocabList.index(word)] = 1  # set-of-words model
                returnVec[vocabList.index(word)] += 1   # bag-of-words model
            else:
                print("the word: %s is not in vocabList" % word)
        return returnVec


    '''Training'''
    def trainNB(trainMatrix, trainCategory):
        numTrainDocs = len(trainMatrix)
        numWords = len(trainMatrix[0])
        p = sum(trainCategory) / float(numTrainDocs)  # prior probability of class 1
        '''Initialize counts to 1 and denominators to 2 (Laplace smoothing)'''
        p0Num = np.ones(numWords)
        p1Num = np.ones(numWords)
        p0Denom = 2.0
        p1Denom = 2.0
        for i in range(numTrainDocs):
            if trainCategory[i] == 1:
                p1Num += trainMatrix[i]
                p1Denom += sum(trainMatrix[i])
            else:
                p0Num += trainMatrix[i]
                p0Denom += sum(trainMatrix[i])
        p1_vec = np.log(p1Num / p1Denom)  # log probabilities avoid underflow
        p0_vec = np.log(p0Num / p0Denom)
        return p0_vec, p1_vec, p


    '''The classifier'''
    def classifyNB(Input, p0, p1, p):
        p1 = sum(Input * p1) + np.log(p)
        p0 = sum(Input * p0) + np.log(1.0 - p)
        if p1 > p0:
            return 1
        else:
            return 0


    '''Tokenize the raw text'''
    def textParse(bigString):
        listOfTokens = re.split(r'\W+', bigString)
        return [tok.lower() for tok in listOfTokens if len(tok) > 2]


    """Spam classification test"""
    def spamTest():
        docList = []
        classList = []
        fullText = []

        for i in range(1, 26):
            wordList = textParse(open('email/spam/%d.txt' % i).read())
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(1)
            wordList = textParse(open("email/ham/%d.txt" % i).read())
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(0)

        vocabList = createVocabList(docList)
        DataSet = list(zip(docList, classList))
        shuffle(DataSet)
        Data, Y = zip(*DataSet)
        trainMat = []
        trainClass = []
        testData = Data[40:]
        test_label = Y[40:]
        for index in range(len(Data[:40])):
            trainMat.append(setOfWords2Vec(vocabList, Data[index]))
            trainClass.append(Y[index])

        p0, p1, p = trainNB(np.array(trainMat), np.array(trainClass))
        errorCount = 0
        for index in range(len(testData)):
            wordVector = setOfWords2Vec(vocabList, testData[index])
            if classifyNB(np.array(wordVector), p0, p1, p) != test_label[index]:
                errorCount += 1
        print("the error rate is:", float(errorCount) / len(testData))


    if __name__ == "__main__":
        spamTest()


    5. Summary

      Using probabilities can sometimes be more effective than using hard rules for classification.
    Bayesian probability and Bayes' rule give us a way to estimate unknown probabilities
    from known values.
      You can reduce the need for a lot of data by assuming conditional independence
    among the features in your data. The assumption we make is that the probability of
    one word doesn’t depend on any other words in the document. We know this assumption
    is a little simple. That’s why it’s known as naïve Bayes. Despite its incorrect
    assumptions, naïve Bayes is effective at classification.
      There are a number of practical considerations when implementing naïve Bayes in
    a modern programming language. Underflow is one problem that can be addressed
    by using the logarithm of probabilities in your calculations. The bag-of-words model is
    an improvement on the set-of-words model when approaching document classification.
    There are a number of other improvements, such as removing stop words, and
    you can spend a long time optimizing a tokenizer.
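
As the summary notes, removing stop words is one simple improvement to the tokenizer. A hedged sketch (the stop-word set below is a small illustrative sample, not a standard list):

```python
import re

# Tokenizer with stop-word removal (sketch; STOP_WORDS is a tiny
# illustrative sample, not a complete stop-word list).
STOP_WORDS = {"the", "and", "for", "you", "this", "that", "with"}

def text_parse(big_string):
    tokens = re.split(r"\W+", big_string)
    return [t.lower() for t in tokens
            if len(t) > 2 and t.lower() not in STOP_WORDS]

print(text_parse("Buy THIS cheap watch and win the big prize for you!"))
```

Dropping high-frequency function words shrinks the vocabulary and removes features that carry little class information.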

  • Original post: https://www.cnblogs.com/lpworkstudyspace1992/p/6636709.html