  • Filtering spam email with a naive Bayes classifier




    When we attempt to classify a document, we multiply a lot of probabilities together to
    get the probability that a document belongs to a given class. This will look something
    like p(w0|1)p(w1|1)p(w2|1). If any of these numbers are 0, then when we multiply
    them together we get 0. To lessen the impact of this, we’ll initialize all of our occurrence
    counts to 1, and we’ll initialize the denominators to 2.
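
The effect of this initialization can be sketched with made-up numbers. The counts below are hypothetical (a three-word vocabulary for a single class), and the variable names are assumptions for illustration, not part of the email data set.

```python
import numpy as np

# Hypothetical counts of three vocabulary words in the documents of one class.
# The third word never occurs in this class.
word_counts = np.array([3.0, 1.0, 0.0])

# Unsmoothed estimate: the unseen word gets probability 0,
# which zeroes out the whole product p(w0|1)p(w1|1)p(w2|1).
p_raw = word_counts / word_counts.sum()
product_raw = p_raw.prod()

# Smoothed estimate: start every count at 1 and the denominator at 2.
p_smooth = (word_counts + 1.0) / (word_counts.sum() + 2.0)
product_smooth = p_smooth.prod()

print(p_raw, product_raw)        # the product is exactly 0.0
print(p_smooth, product_smooth)  # every factor is now nonzero
```

With this particular initialization the smoothed values need not sum to 1 over the vocabulary. A common alternative is to add the vocabulary size to the denominator instead of 2.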

    Another problem is underflow: doing too many multiplications of small numbers.
    When we go to calculate the product p(w0|ci)p(w1|ci)p(w2|ci)...p(wN|ci) and many
    of these numbers are very small, we’ll get underflow, or an incorrect answer. (Try to
    multiply many small numbers in Python. Eventually it rounds off to 0.) One solution
    to this is to take the natural logarithm of this product. If you recall from algebra,
    ln(a*b) = ln(a)+ln(b). Doing this allows us to avoid the underflow or round-off
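    error problem. Do we lose anything by using the natural log of a number rather than
    the number itself? The answer is no: ln is monotonically increasing, so whichever
    class has the larger product also has the larger sum of logs, and the comparison
    between classes comes out the same.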
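
As a rough illustration of the underflow problem, the sketch below multiplies 400 small probabilities for two hypothetical classes. The probability values and the names `probs_a` and `probs_b` are invented for the example.

```python
import math

# Hypothetical per-word probabilities for two classes, 400 words each.
probs_a = [1e-3] * 400
probs_b = [2e-3] * 400


def product(values):
    """Multiply the values directly, as the naive formula suggests."""
    result = 1.0
    for v in values:
        result *= v
    return result


# Both true products are far below the smallest positive float,
# so both underflow to 0.0 and the two classes can no longer be compared.
product_a = product(probs_a)
product_b = product(probs_b)

# Summing natural logs stays in a representable range and keeps the ordering.
log_a = sum(math.log(v) for v in probs_a)
log_b = sum(math.log(v) for v in probs_b)

print(product_a, product_b)
print(log_a, log_b, log_b > log_a)
```

Here the direct products are indistinguishable, while the log sums still show that the second class is the more probable one.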


    Up until this point we’ve treated the presence or absence of a word as a feature. This
    could be described as a set-of-words model. If a word appears more than once in a
    document, that might convey some sort of information about the document over just
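    the word occurring in the document or not. This approach is known as a bag-of-words
    model. In a bag of words each word is counted every time it occurs, whereas in a
    set of words it is recorded at most once.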
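
The difference between the two models can be seen on a toy document. The vocabulary and document below are hypothetical, and the two helper functions are a minimal sketch written for this comparison only.

```python
# Hypothetical vocabulary and tokenized document.
vocab = ["cheap", "meeting", "pills", "today"]
doc = ["cheap", "pills", "cheap", "cheap"]


def set_of_words(vocab, doc):
    """Record only whether each vocabulary word is present (0 or 1)."""
    vec = [0] * len(vocab)
    for word in doc:
        if word in vocab:
            vec[vocab.index(word)] = 1
    return vec


def bag_of_words(vocab, doc):
    """Count how many times each vocabulary word occurs."""
    vec = [0] * len(vocab)
    for word in doc:
        if word in vocab:
            vec[vocab.index(word)] += 1
    return vec


print(set_of_words(vocab, doc))  # [1, 0, 1, 0]
print(bag_of_words(vocab, doc))  # [3, 0, 1, 0]
```

The only change between the two functions is `= 1` versus `+= 1`.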


    # -*- coding: utf-8 -*-
    """
    Created on Tue Mar 28 17:22:48 2017
    @author: MyHome
    """
    # Split text into individual words with Python, build word vectors, and
    # classify the text probabilistically with a naive Bayes classifier.
    import re
    from random import shuffle

    import numpy as np


    def createVocabList(Dataset):
        """Create the vocabulary: a list of the unique words in all documents."""
        vocabSet = set()
        for document in Dataset:
            vocabSet = vocabSet | set(document)
        return list(vocabSet)


    def setOfWords2Vec(vocabList, inputSet):
        """Convert a tokenized document into a word vector over vocabList."""
        returnVec = [0] * len(vocabList)
        for word in inputSet:
            if word in vocabList:
                # returnVec[vocabList.index(word)] = 1  # set-of-words model
                returnVec[vocabList.index(word)] += 1  # bag-of-words model
            else:
                print("the word: %s is not in VocabList" % word)
        return returnVec


    def trainNB(trainMatrix, trainCategory):
        """Train: return log p(w|0), log p(w|1) and the prior of class 1."""
        numTrainDocs = len(trainMatrix)
        numWords = len(trainMatrix[0])
        p = sum(trainCategory) / float(numTrainDocs)  # probability of class 1
        # Initialize the word counts to 1 and the denominators to 2 so that a
        # word never seen in a class does not get probability 0.
        p0Num = np.ones(numWords)
        p1Num = np.ones(numWords)
        p0Denom = 2.0
        p1Denom = 2.0
        for i in range(numTrainDocs):
            if trainCategory[i] == 1:
                p1Num += trainMatrix[i]
                p1Denom += sum(trainMatrix[i])
            else:
                p0Num += trainMatrix[i]
                p0Denom += sum(trainMatrix[i])
        # Store logs of the probabilities to avoid underflow when classifying.
        p1_vec = np.log(p1Num / p1Denom)
        p0_vec = np.log(p0Num / p0Denom)
        return p0_vec, p1_vec, p


    def classifyNB(Input, p0, p1, p):
        """Classify a word vector by comparing the log posteriors of both classes."""
        logP1 = np.sum(Input * p1) + np.log(p)
        logP0 = np.sum(Input * p0) + np.log(1.0 - p)
        if logP1 > logP0:
            return 1
        else:
            return 0


    def textParse(bigString):
        """Preprocess text: split on non-word characters, lowercase, drop short tokens."""
        listOfTokens = re.split(r"\W+", bigString)
        return [tok.lower() for tok in listOfTokens if len(tok) > 2]


    def spamTest():
        """Spam classification: train on 40 random emails, test on the other 10."""
        docList = []
        classList = []
        for i in range(1, 26):
            # errors="ignore" skips bytes that cannot be decoded in the default encoding.
            with open("email/spam/%d.txt" % i, errors="ignore") as f:
                docList.append(textParse(f.read()))
            classList.append(1)
            with open("email/ham/%d.txt" % i, errors="ignore") as f:
                docList.append(textParse(f.read()))
            classList.append(0)
        vocabList = createVocabList(docList)
        DataSet = list(zip(docList, classList))
        shuffle(DataSet)  # shuffles in place and returns None
        Data, Y = zip(*DataSet)
        trainMat = []
        trainClass = []
        testData = Data[40:]
        test_label = Y[40:]
        for index in range(len(Data[:40])):
            trainMat.append(setOfWords2Vec(vocabList, Data[index]))
            trainClass.append(Y[index])
        p0, p1, p = trainNB(np.array(trainMat), np.array(trainClass))
        errorCount = 0
        for index in range(len(testData)):
            wordVector = setOfWords2Vec(vocabList, testData[index])
            if classifyNB(np.array(wordVector), p0, p1, p) != test_label[index]:
                errorCount += 1
        print("the error rate is : ", float(errorCount) / len(testData))


    if __name__ == "__main__":
        spamTest()


      Using probabilities can sometimes be more effective than using hard rules for classification.
    Bayesian probability and Bayes’ rule give us a way to estimate unknown probabilities
    from known values.
      You can reduce the need for a lot of data by assuming conditional independence
    among the features in your data. The assumption we make is that the probability of
    one word doesn’t depend on any other words in the document. We know this assumption
    is a little simple. That’s why it’s known as naïve Bayes. Despite its incorrect
    assumptions, naïve Bayes is effective at classification.
      There are a number of practical considerations when implementing naïve Bayes in
    a modern programming language. Underflow is one problem that can be addressed
    by using the logarithm of probabilities in your calculations. The bag-of-words model is
    an improvement on the set-of-words model when approaching document classification.
    There are a number of other improvements, such as removing stop words, and
    you can spend a long time optimizing a tokenizer.
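
One of the improvements mentioned above, removing stop words, might look something like the following. This is only a sketch: the stop-word list is a tiny hypothetical sample, and `tokenize` with its `stop_words` and `min_len` parameters is an illustrative helper, not a function from the listing above.

```python
import re

# A tiny, hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "and", "for", "you", "that", "this", "with"}


def tokenize(text, stop_words=STOP_WORDS, min_len=3):
    """Lowercase, split on non-word characters, drop short tokens and stop words."""
    tokens = re.split(r"\W+", text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in stop_words]


print(tokenize("This is the offer that you have been waiting for!"))
# ['offer', 'have', 'been', 'waiting']
```

Dropping very common words shrinks the vocabulary and leaves the counts dominated by words that are more likely to differ between spam and ham.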

  • Original article: https://www.cnblogs.com/lpworkstudyspace1992/p/6636709.html