  • Filtering spam email with a naive Bayes classifier

    1. Building word vectors from text

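    Split each text into words with Python and build a word vector from them. This first
    requires a vocabulary; to keep things simple, we build it directly from every word
    that appears in the given texts.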
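
    The step above can be sketched in isolation as follows. This is a minimal
    illustration, not the listing used later in this post: the function names and the two
    sample sentences are made up for the example, and the vocabulary is sorted only so
    that the output is reproducible.

```python
import re


def tokenize(text):
    """Split on runs of non-word characters; keep lowercase tokens longer than 2 chars."""
    return [tok.lower() for tok in re.split(r"\W+", text) if len(tok) > 2]


def build_vocab(documents):
    """Collect every distinct token across all documents into a sorted list."""
    vocab = set()
    for doc in documents:
        vocab.update(doc)
    return sorted(vocab)


def to_set_vector(vocab, tokens):
    """Return a 0/1 vector marking which vocabulary words occur in tokens."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]


# Two made-up documents, used only for illustration.
docs = [tokenize("Buy cheap pills now"), tokenize("Meeting notes for the project")]
vocab = build_vocab(docs)
vec = to_set_vector(vocab, docs[0])

print(vocab)  # ['buy', 'cheap', 'for', 'meeting', 'notes', 'now', 'pills', 'project', 'the']
print(vec)    # [1, 1, 0, 0, 0, 1, 1, 0, 0]
```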

    2. Using word vectors to compute the probability p(x|y)

    When we attempt to classify a document, we multiply a lot of probabilities together to
    get the probability that a document belongs to a given class. This will look something
    like p(w0|1)p(w1|1)p(w2|1). If any of these numbers are 0, then when we multiply
    them together we get 0. To lessen the impact of this, we’ll initialize all of our occurrence
    counts to 1, and we’ll initialize the denominators to 2.
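
    To make the effect of that initialization concrete, here is a small sketch on a toy
    count matrix. The numbers are invented for the example; only the "+1 in the counts,
    +2 in the denominator" rule comes from the text above.

```python
import numpy as np

# Toy count matrix: rows are documents of one class, columns are vocabulary words.
counts = np.array([[1, 0, 2],
                   [0, 0, 1]])

word_totals = counts.sum(axis=0)   # occurrences of each word in the class: [1, 0, 3]
all_words = counts.sum()           # total word occurrences in the class: 4

# Without smoothing, the second word gets probability exactly 0,
# which would zero out any product it takes part in.
unsmoothed = word_totals / all_words

# With counts starting at 1 and the denominator starting at 2, no estimate is 0.
smoothed = (word_totals + 1) / (all_words + 2)

print(unsmoothed)  # [0.25 0.   0.75]
print(smoothed)
```

    The smoothed estimates no longer sum to 1 across the vocabulary, but every estimate
    is strictly positive, which is all the classifier needs here.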

    Another problem is underflow: doing too many multiplications of small numbers.
    When we go to calculate the product p(w0|ci)p(w1|ci)p(w2|ci)...p(wN|ci) and many
    of these numbers are very small, we’ll get underflow, or an incorrect answer. (Try to
    multiply many small numbers in Python. Eventually it rounds off to 0.) One solution
    to this is to take the natural logarithm of this product. If you recall from algebra,
    ln(a*b) = ln(a)+ln(b). Doing this allows us to avoid the underflow or round-off
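    error problem. Do we lose anything by using the natural log of a number rather than
    the number itself? The answer is no: ln is monotonically increasing, so the class with
    the larger product also has the larger sum of logs, and the comparison is unchanged.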
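
    The underflow is easy to reproduce. The sketch below uses two made-up lists of 100
    small probabilities, one per hypothetical class: both products collapse to 0.0 and
    can no longer be compared, while the sums of logs stay finite and keep their order.

```python
import math

# Made-up conditional probabilities for two hypothetical classes.
probs_a = [1e-5] * 100
probs_b = [2e-5] * 100


def product(values):
    """Multiply the values together the naive way."""
    result = 1.0
    for v in values:
        result *= v
    return result


def log_sum(values):
    """Sum the natural logs instead of multiplying the values."""
    return sum(math.log(v) for v in values)


prod_a, prod_b = product(probs_a), product(probs_b)
log_a, log_b = log_sum(probs_a), log_sum(probs_b)

print(prod_a, prod_b)  # 0.0 0.0  (both underflow, so the classes are indistinguishable)
print(log_a, log_b)    # about -1151.3 and -1082.0 (class b still wins)
```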

    3. Using the bag-of-words model

    Up until this point we’ve treated the presence or absence of a word as a feature. This
    could be described as a set-of-words model. If a word appears more than once in a
    document, that might convey information about the document beyond whether the word
    occurs at all. Recording how many times each word occurs, rather than only whether it
    occurs, is known as a bag-of-words model.
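
    The difference between the two models shows up as soon as a word repeats. The
    following sketch uses hypothetical helper names and a three-word vocabulary chosen
    only for illustration.

```python
from collections import Counter


def set_of_words(vocab, tokens):
    """1 if the vocabulary word occurs at least once, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]


def bag_of_words(vocab, tokens):
    """Number of times each vocabulary word occurs."""
    counts = Counter(tokens)  # missing words count as 0
    return [counts[word] for word in vocab]


vocab = ["cheap", "meeting", "pills"]
tokens = ["cheap", "pills", "cheap", "cheap"]

print(set_of_words(vocab, tokens))  # [1, 0, 1]
print(bag_of_words(vocab, tokens))  # [3, 0, 1]
```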

    4. Code

      # -*- coding: utf-8 -*-
      """
      Created on Tue Mar 28 17:22:48 2017

      @author: MyHome

      Split text into individual words with Python, build word vectors,
      and use naive Bayes to classify the text from a probabilistic view.
      """
      import re
      from random import shuffle

      import numpy as np


      def createVocabList(Dataset):
          """Build the vocabulary: every distinct word in the data set."""
          vocabSet = set()
          for document in Dataset:
              vocabSet = vocabSet | set(document)
          return list(vocabSet)


      def setOfWords2Vec(vocabList, inputSet):
          """Convert a list of tokens into a word vector over vocabList."""
          returnVec = [0] * len(vocabList)
          for word in inputSet:
              if word in vocabList:
                  # returnVec[vocabList.index(word)] = 1  # set-of-words model
                  returnVec[vocabList.index(word)] += 1   # bag-of-words model
              else:
                  print("the word: %s is not in VocabList" % word)
          return returnVec


      def trainNB(trainMatrix, trainCategory):
          """Train: return log p(w|0), log p(w|1) and the prior p(class 1)."""
          numTrainDocs = len(trainMatrix)
          numWords = len(trainMatrix[0])
          p = sum(trainCategory) / float(numTrainDocs)  # probability of class 1
          # Word counts start at 1 and denominators at 2, as described in
          # section 2, so that no conditional probability is exactly 0.
          p0Num = np.ones(numWords)
          p1Num = np.ones(numWords)
          p0Denom = 2.0
          p1Denom = 2.0
          for i in range(numTrainDocs):
              if trainCategory[i] == 1:
                  p1Num += trainMatrix[i]
                  p1Denom += sum(trainMatrix[i])
              else:
                  p0Num += trainMatrix[i]
                  p0Denom += sum(trainMatrix[i])
          # Take logs so that the classifier can add instead of multiply.
          p1_vec = np.log(p1Num / p1Denom)
          p0_vec = np.log(p0Num / p0Denom)
          return p0_vec, p1_vec, p


      def classifyNB(Input, p0, p1, p):
          """Classify a word vector by comparing the two log posteriors."""
          logP1 = np.sum(Input * p1) + np.log(p)
          logP0 = np.sum(Input * p0) + np.log(1.0 - p)
          return 1 if logP1 > logP0 else 0


      def textParse(bigString):
          """Preprocess text: split on non-word characters, lowercase,
          and drop tokens of 2 characters or fewer."""
          listOfTokens = re.split(r"\W+", bigString)
          return [tok.lower() for tok in listOfTokens if len(tok) > 2]


      def readEmail(path):
          """Read one e-mail file, ignoring any bytes that are not valid UTF-8."""
          with open(path, encoding="utf-8", errors="ignore") as f:
              return f.read()


      def spamTest():
          """Spam classification: train on 40 e-mails, test on the other 10."""
          docList = []
          classList = []
          for i in range(1, 26):
              docList.append(textParse(readEmail("email/spam/%d.txt" % i)))
              classList.append(1)
              docList.append(textParse(readEmail("email/ham/%d.txt" % i)))
              classList.append(0)

          vocabList = createVocabList(docList)
          DataSet = list(zip(docList, classList))
          shuffle(DataSet)  # shuffles in place and returns None
          Data, Y = zip(*DataSet)
          testData = Data[40:]
          test_label = Y[40:]
          trainMat = []
          trainClass = []
          for index in range(40):
              trainMat.append(setOfWords2Vec(vocabList, Data[index]))
              trainClass.append(Y[index])

          p0, p1, p = trainNB(np.array(trainMat), np.array(trainClass))
          errorCount = 0
          for index in range(len(testData)):
              wordVector = setOfWords2Vec(vocabList, testData[index])
              if classifyNB(np.array(wordVector), p0, p1, p) != test_label[index]:
                  errorCount += 1
          print("the error rate is :", float(errorCount) / len(testData))


      if __name__ == "__main__":
          spamTest()

    5. Summary

      Using probabilities can sometimes be more effective than using hard rules for classification.
    Bayesian probability and Bayes’ rule give us a way to estimate unknown probabilities
    from known values.
      You can reduce the need for a lot of data by assuming conditional independence
    among the features in your data. The assumption we make is that the probability of
    one word doesn’t depend on any other words in the document. We know this assumption
    is a little simple. That’s why it’s known as naïve Bayes. Despite its incorrect
    assumptions, naïve Bayes is effective at classification.
      There are a number of practical considerations when implementing naïve Bayes in
    a modern programming language. Underflow is one problem that can be addressed
    by using the logarithm of probabilities in your calculations. The bag-of-words model is
    an improvement on the set-of-words model when approaching document classification.
    There are a number of other improvements, such as removing stop words, and
    you can spend a long time optimizing a tokenizer.
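
    As one example of such an improvement, stop-word removal can be sketched as a filter
    applied to the token list before the word vector is built. The stop-word set below is
    a hypothetical, deliberately tiny list chosen for illustration; it is not taken from
    the original post, and real lists are much longer.

```python
# Hypothetical, deliberately tiny stop-word list (an assumption for this example).
STOP_WORDS = {"the", "and", "for", "you", "are", "this", "that", "with"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that carry little class information on their own."""
    return [tok for tok in tokens if tok not in stop_words]


tokens = ["you", "are", "the", "winner", "claim", "this", "prize"]
print(remove_stop_words(tokens))  # ['winner', 'claim', 'prize']
```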

  • Original article: https://www.cnblogs.com/lpworkstudyspace1992/p/6636709.html