Classification with Probability Theory: Using Naive Bayes for Document Classification

Preface

The k-nearest neighbors and decision tree algorithms discussed earlier both yield definite classification results. The algorithm discussed today cannot always assign a data instance to one class with complete certainty; it may only give the probability that the instance belongs to a given class.

Yingying's note: the kind of problem Naive Bayes solves is the probability of rain today, where you have to decide whether to carry an umbrella based on that probability.

Note: from this chapter on, complete programs will no longer be provided, only the code blocks for each algorithm.

Requirements

Take the major social media platforms as an example: they routinely block certain key words. We want to build a quick filter that marks a message as inappropriate whenever it uses negative or insulting language.

Steps

1. Prepare the data

def loadDataSet():
    postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not
    return postingList,classVec

def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else: print "the word: %s is not in my Vocabulary!" % word
    return returnVec
The function loadDataSet() creates some sample data: postingList is a collection of tokenized posts, and classVec is the list of their class labels.

The function createVocabList(dataSet) builds the vocabulary: a list of all the unique words that appear in the documents.

The function setOfWords2Vec(vocabList, inputSet) first creates a vector the same length as the vocabulary, with every element set to 0. It then iterates over all the words in the document and, whenever a word from the vocabulary appears, sets the corresponding value in the output vector to 1.
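One performance note before we try these out: vocabList.index(word) rescans the whole list for every word, which gets slow on a large vocabulary. Here is a minimal sketch of a faster variant (setOfWords2VecFast is my name, not the book's) that precomputes a word-to-index dictionary:

def setOfWords2VecFast(vocabList, inputSet):
    # Hypothetical variant, not from the book: same output as setOfWords2Vec,
    # but O(1) dictionary lookups replace the O(V) list.index() scans.
    wordIndex = {word: i for i, word in enumerate(vocabList)}
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in wordIndex:
            returnVec[wordIndex[word]] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec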

Open a Python shell and let's get more familiar with the three functions we just wrote:
    >>> import bayes
    >>> listOPosts,listClasses = bayes.loadDataSet()
    >>> listOPosts
    [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'], ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'], ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'], ['stop', 'posting', 'stupid', 'worthless', 'garbage'], ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'], ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    >>> listClasses
    [0, 1, 0, 1, 0, 1]
    >>> myVocabList = bayes.createVocabList(listOPosts)
    >>> myVocabList
    ['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park', 'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying', 'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog', 'how', 'stupid', 'so', 'take', 'mr', 'steak', 'my']

Notice that the vocabulary contains no repeated words.

    >>> bayes.setOfWords2Vec(myVocabList,listOPosts[0])
    [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]
For reference, here are myVocabList and listOPosts[0] again, so you can match each 1 in the vector to a word:

myVocabList
['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park',
'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying',
'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog',
'how', 'stupid', 'so', 'take', 'mr', 'steak', 'my']

listOPosts[0]
['my', 'dog', 'has', 'flea', 'problems', 'help', 'please']
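A quick sanity check (my addition, not from the book): since listOPosts[0] has 7 distinct words, the vector should contain exactly seven 1s:

>>> sum(bayes.setOfWords2Vec(myVocabList, listOPosts[0]))
7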

2. Train the algorithm
from numpy import zeros, ones, log, array   # imports needed by trainNB0 and the later functions

def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)                     # 6 documents
    numWords = len(trainMatrix[0])                      # 32 vocabulary words
    pAbusive = sum(trainCategory)/float(numTrainDocs)   # 3/6.0
    p0Num = zeros(numWords); p1Num = zeros(numWords)    # change to ones()
    p0Denom = 0.0; p1Denom = 0.0                        # change to 2.0
    for i in range(numTrainDocs):                       # i = 0 1 2 3 4 5
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]                     # accumulate word counts for class 1
            p1Denom += sum(trainMatrix[i])              # total words seen in class 1
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = p1Num/p1Denom                              # change to log()
    p0Vect = p0Num/p0Denom                              # change to log()
    return p0Vect,p1Vect,pAbusive

The inputs to trainNB0() look like this:

trainCategory
[0, 1, 0, 1, 0, 1]

trainMat
[[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0],
 [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
 [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1],
 [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]]

In the shell, build trainMat and run the training function:

>>> trainMat = []
>>> for postinDoc in listOPosts:
...     trainMat.append(bayes.setOfWords2Vec(myVocabList,postinDoc))
...
>>> p0v,p1v,pab = bayes.trainNB0(trainMat,listClasses)

>>> p0v
array([ 0.04166667,  0.04166667,  0.04166667,  0.        ,  0.        ,
        0.04166667,  0.04166667,  0.04166667,  0.        ,  0.04166667,
        0.04166667,  0.04166667,  0.04166667,  0.        ,  0.        ,
        0.08333333,  0.        ,  0.        ,  0.04166667,  0.        ,
        0.04166667,  0.04166667,  0.        ,  0.04166667,  0.04166667,
        0.04166667,  0.        ,  0.04166667,  0.        ,  0.04166667,
        0.04166667,  0.125     ])
>>> p1v
array([ 0.        ,  0.        ,  0.        ,  0.05263158,  0.05263158,
        0.        ,  0.        ,  0.        ,  0.05263158,  0.05263158,
        0.        ,  0.        ,  0.        ,  0.05263158,  0.05263158,
        0.05263158,  0.05263158,  0.05263158,  0.        ,  0.10526316,
        0.        ,  0.05263158,  0.05263158,  0.        ,  0.10526316,
        0.        ,  0.15789474,  0.        ,  0.05263158,  0.        ,
        0.        ,  0.        ])

pab = 0.5 is the probability that a document is abusive: we fed in 6 posts, 3 of which are abusive, so the prior for the abusive class is 3/6 = 0.5. Note also that the largest entry of p1v, 0.15789474 = 3/19, sits at the index of 'stupid' in the vocabulary: the word most indicative of abuse.
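For reference (the original post never writes the rule out), these three return values plug straight into Bayes' theorem. A sketch, with $\mathbf{w}$ the word vector and $c_i$ the class:

$$p(c_i \mid \mathbf{w}) = \frac{p(\mathbf{w} \mid c_i)\,p(c_i)}{p(\mathbf{w})}$$

The "naive" assumption is that words are conditionally independent given the class, so the likelihood factorizes:

$$p(\mathbf{w} \mid c_i) = \prod_{j} p(w_j \mid c_i)$$

p0v and p1v hold the per-word factors p(w_j | c_i), pab is the prior p(c_1), and since p(w) is the same for both classes, classification only needs to compare the two numerators.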

Yingying's note: the preprocessing above can be seen as taking

[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]

and splitting it into two groups according to the given labels [0, 1, 0, 1, 0, 1].

The class-0 (non-abusive) group is

[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him']]

For this group, tally how often each vocabulary word occurs, then divide by the group's total word count, 24. (To understand "occurrences in the vocabulary": every time you see a word, look it up in the vocabulary and add a tally mark next to it; the tally grows as more occurrences are found.) The tallies are

[ 1.,  1.,  1.,  0.,  0.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,
  0.,  0.,  2.,  0.,  0.,  1.,  0.,  1.,  1.,  0.,  1.,  1.,  1.,
  0.,  1.,  0.,  1.,  1.,  3.]

Similarly, the class-1 (abusive) group is

[['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]

Tallying against the vocabulary gives

[ 0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,
  1.,  1.,  1.,  1.,  1.,  0.,  2.,  0.,  1.,  1.,  0.,  2.,  0.,
  3.,  0.,  1.,  0.,  0.,  0.]

and each count is divided by this group's total word count, 19.

Viewed this way, the idea becomes much clearer.
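If you want to verify those tallies yourself, here is a minimal sketch (my check, not code from the book, assuming trainMat and listClasses from the session above):

from numpy import array

mat = array(trainMat)          # 6 x 32 document-word matrix
labels = array(listClasses)    # [0, 1, 0, 1, 0, 1]

p0_counts = mat[labels == 0].sum(axis=0)   # per-word tallies over the class-0 posts
p1_counts = mat[labels == 1].sum(axis=0)   # per-word tallies over the class-1 posts
print(p0_counts.sum())   # 24 -- total words in the non-abusive posts
print(p1_counts.sum())   # 19 -- total words in the abusive posts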

To make the model behave sensibly, we initialize every word's occurrence count to 1 and both denominators to 2 (otherwise a word never seen in a class gets probability 0 and wipes out the whole product), and for ease of computation we work with log(p):

p0Num = ones(numWords); p1Num = ones(numWords)      # was zeros()
p0Denom = 2.0; p1Denom = 2.0                        # was 0.0
p1Vect = log(p1Num/p1Denom)                         # was p1Num/p1Denom
p0Vect = log(p0Num/p0Denom)                         # was p0Num/p0Denom
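Why the log? A minimal sketch (my illustration, not from the book): multiplying hundreds of small per-word probabilities underflows 64-bit floats to 0, while summing their logs stays comfortably in range, and because log is monotonic the comparison between classes comes out the same:

import math

probs = [0.05] * 300            # 300 word probabilities of 0.05 each

product = 1.0
for p in probs:
    product *= p                # 0.05**300 is about 1e-390, below float range
print(product)                  # 0.0 -- underflow: every document would tie

log_sum = sum(math.log(p) for p in probs)
print(log_sum)                  # about -898.7, still easy to compare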

The Naive Bayes classification function:

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts,listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)
    >>> reload(bayes)
<module 'bayes' from 'D:\Python27\bayes.pyc'>
    >>> bayes.testingNB()
    ['love', 'my', 'dalmation'] classified as:  0
    ['stupid', 'garbage'] classified as:  1

The bag-of-words document model

    def bagOfWords2VecMN(vocabList, inputSet):
        returnVec = [0]*len(vocabList)
        for word in inputSet:
            if word in vocabList:
                returnVec[vocabList.index(word)] += 1
        return returnVec

This is almost identical to setOfWords2Vec(); the only difference is that every time a word is encountered, it increments the corresponding value in the vector instead of merely setting it to 1.
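A quick illustration of the difference on a toy vocabulary (my example, not from the book):

>>> vocab = ['dog', 'stupid', 'my']
>>> bayes.setOfWords2Vec(vocab, ['my', 'dog', 'my'])
[1, 0, 1]
>>> bayes.bagOfWords2VecMN(vocab, ['my', 'dog', 'my'])
[1, 0, 2]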

Original post: https://www.cnblogs.com/xiaoyingying/p/7515889.html