Classification Based on Probability Theory: Naive Bayes (Using Naive Bayes for Document Classification)

Preface

The k-nearest-neighbors and decision-tree algorithms discussed earlier are classifiers that give a definite answer. Today's algorithm cannot always determine with certainty which class a data instance belongs to; it may only give the probability that the instance belongs to a given class.

Yingying's aside: Naive Bayes answers questions like the probability of rain today; you then decide whether to carry an umbrella based on that probability.
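For reference, the rule behind this whole chapter is Bayes' theorem. To classify a document with word vector \(\mathbf{w}\), Naive Bayes compares the posterior probabilities of the classes:

    p(c_i \mid \mathbf{w}) = \frac{p(\mathbf{w} \mid c_i)\, p(c_i)}{p(\mathbf{w})},
    \qquad
    p(\mathbf{w} \mid c_i) = \prod_k p(w_k \mid c_i)

The second equation is the "naive" conditional-independence assumption that gives the method its name: each word is assumed to appear independently of the others, given the class.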

Note: from this chapter on, complete programs will no longer be provided, only the code blocks for each algorithm.

The Task

Take the major social-media platforms as an example: certain key words are routinely blocked. We want to build a fast filter that flags a message as inappropriate if it uses negative or abusive language.

Steps

1. Prepare the data

    def loadDataSet():
        postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                     ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                     ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                     ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                     ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                     ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
        classVec = [0,1,0,1,0,1]    #1 is abusive, 0 not
        return postingList,classVec

    def createVocabList(dataSet):
        vocabSet = set([])  #create empty set
        for document in dataSet:
            vocabSet = vocabSet | set(document) #union of the two sets
        return list(vocabSet)

    def setOfWords2Vec(vocabList, inputSet):
        returnVec = [0]*len(vocabList)
        for word in inputSet:
            if word in vocabList:
                returnVec[vocabList.index(word)] = 1
            else: print "the word: %s is not in my Vocabulary!" % word
        return returnVec
The function loadDataSet() creates some experimental samples: postingList is a list of tokenized posts, and classVec is the list of class labels for those posts.

The function createVocabList(dataSet) builds the vocabulary: a list of all the unique words that appear in the documents.

The function setOfWords2Vec(vocabList, inputSet) first creates a vector the same length as the vocabulary, with every element set to 0. It then iterates over all the words in the document, and whenever a word from the vocabulary appears, it sets the corresponding value in the output document vector to 1.

Open an interactive Python session and get familiar with the three functions we just wrote:
    >>> import bayes
    >>> listOPosts,listClasses = bayes.loadDataSet()
    >>> listOPosts
    [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'], ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'], ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'], ['stop', 'posting', 'stupid', 'worthless', 'garbage'], ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'], ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    >>> listClasses
    [0, 1, 0, 1, 0, 1]
    >>> myVocabList = bayes.createVocabList(listOPosts)
    >>> myVocabList
    ['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park', 'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying', 'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog', 'how', 'stupid', 'so', 'take', 'mr', 'steak', 'my']

Notice that no word is repeated in the vocabulary. (The order is arbitrary: Python sets are unordered, so your list may differ.)

    >>> bayes.setOfWords2Vec(myVocabList,listOPosts[0])
    [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]
    >>> myVocabList
    ['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park',
    'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying',
    'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog',
    'how', 'stupid', 'so', 'take', 'mr', 'steak', 'my']
    >>> listOPosts[0]
    ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please']
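As a quick sanity check (a sketch using the session above), the 1s in the document vector can be mapped back to words by zipping it with the vocabulary:

    >>> vec = bayes.setOfWords2Vec(myVocabList, listOPosts[0])
    >>> [w for w, flag in zip(myVocabList, vec) if flag]
    ['help', 'problems', 'flea', 'has', 'please', 'dog', 'my']

These are exactly the words of listOPosts[0], reordered to vocabulary order.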

2. Train the algorithm
    from numpy import zeros

    def trainNB0(trainMatrix,trainCategory):
        numTrainDocs = len(trainMatrix) #6
        numWords = len(trainMatrix[0])  #32
        pAbusive = sum(trainCategory)/float(numTrainDocs)   #3/6.0
        p0Num = zeros(numWords); p1Num = zeros(numWords)    #change to ones() later
        p0Denom = 0.0; p1Denom = 0.0                        #change to 2.0 later
        for i in range(numTrainDocs):   # 0 1 2 3 4 5
            if trainCategory[i] == 1:
                p1Num += trainMatrix[i]
                p1Denom += sum(trainMatrix[i])
            else:
                p0Num += trainMatrix[i]
                p0Denom += sum(trainMatrix[i])
        p1Vect = p1Num/p1Denom          #change to log() later
        p0Vect = p0Num/p0Denom          #change to log() later
        return p0Vect,p1Vect,pAbusive

    trainCategory
    [0, 1, 0, 1, 0, 1]

    trainMat
    [[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1],
     [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0],
     [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
     [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1],
     [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]]

    >>> trainMat = []
    >>> for postinDoc in listOPosts:
    ...     trainMat.append(bayes.setOfWords2Vec(myVocabList,postinDoc))
    ...

    >>> p0v,p1v,pab=bayes.trainNB0(trainMat,listClasses)

    >>> p0v
    array([ 0.04166667,  0.04166667,  0.04166667,  0.        ,  0.        ,
            0.04166667,  0.04166667,  0.04166667,  0.        ,  0.04166667,
            0.04166667,  0.04166667,  0.04166667,  0.        ,  0.        ,
            0.08333333,  0.        ,  0.        ,  0.04166667,  0.        ,
            0.04166667,  0.04166667,  0.        ,  0.04166667,  0.04166667,
            0.04166667,  0.        ,  0.04166667,  0.        ,  0.04166667,
            0.04166667,  0.125     ])
    >>> p1v
    array([ 0.        ,  0.        ,  0.        ,  0.05263158,  0.05263158,
            0.        ,  0.        ,  0.        ,  0.05263158,  0.05263158,
            0.        ,  0.        ,  0.        ,  0.05263158,  0.05263158,
            0.05263158,  0.05263158,  0.05263158,  0.        ,  0.10526316,
            0.        ,  0.05263158,  0.05263158,  0.        ,  0.10526316,
            0.        ,  0.15789474,  0.        ,  0.05263158,  0.        ,
            0.        ,  0.        ])

pab = 0.5 says the probability that a document is abusive is 0.5: six posts were fed in, three of which are abusive, so the abusive-post probability is 3/6 = 0.5.
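As another quick check (assuming the session above), the largest entry of p1v is 0.15789474 = 3/19: 'stupid' accounts for 3 of the 19 words in the abusive posts, making it the most indicative abusive word.

    >>> p1v.argmax(), myVocabList[p1v.argmax()]
    (26, 'stupid')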

Yingying's aside: the data processing above can be seen as splitting

[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
['stop', 'posting', 'stupid', 'worthless', 'garbage'],
['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]

into two classes according to the given labels [0, 1, 0, 1, 0, 1].

The first class, labeled 0:

[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him']]

For each vocabulary word, count how many times it occurs in these posts, then divide by the total number of words in this class, 24.

(To understand the occurrence counts: each time a word is seen, look it up in the vocabulary and add a tally mark at its position; the tally grows with each occurrence found.)

[ 1., 1., 1., 0., 0., 1., 1., 1., 0., 1., 1., 1., 1.,
  0., 0., 2., 0., 0., 1., 0., 1., 1., 0., 1., 1., 1.,
  0., 1., 0., 1., 1., 3.]
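This tally is nothing more than the sum of the class-0 rows of trainMat; a one-line sketch, assuming the session above:

    >>> from numpy import array
    >>> sum(array(trainMat)[i] for i in range(6) if listClasses[i] == 0)

which reproduces the counts above (as integers).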

Likewise, for the abusive posts labeled 1:

[['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
['stop', 'posting', 'stupid', 'worthless', 'garbage'],
['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]

counting against the vocabulary gives

[ 0., 0., 0., 1., 1., 0., 0., 0., 1., 1., 0., 0., 0.,
  1., 1., 1., 1., 1., 0., 2., 0., 1., 1., 0., 2., 0.,
  3., 0., 1., 0., 0., 0.]

and each count is divided by the total number of words in this class, 19.

Viewed this way, the idea becomes much clearer.

Two refinements are needed. If any single word probability is 0, the product of probabilities becomes 0, so we initialize every word count to 1 and each denominator to 2 (Laplace smoothing). And since multiplying many small probabilities risks numerical underflow, we work with log(p) instead:

    p0Num = ones(numWords); p1Num = ones(numWords)      #change to ones()
    p0Denom = 2.0; p1Denom = 2.0                        #change to 2.0
    p1Vect = log(p1Num/p1Denom)          #change to log()
    p0Vect = log(p0Num/p0Denom)          #change to log()
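Putting the changes together, the smoothed, log-space trainNB0 reads as follows (assembled from the snippets above):

    from numpy import ones, log

    def trainNB0(trainMatrix,trainCategory):
        numTrainDocs = len(trainMatrix)
        numWords = len(trainMatrix[0])
        pAbusive = sum(trainCategory)/float(numTrainDocs)
        p0Num = ones(numWords); p1Num = ones(numWords)  #counts start at 1 (Laplace smoothing)
        p0Denom = 2.0; p1Denom = 2.0                    #denominators start at 2
        for i in range(numTrainDocs):
            if trainCategory[i] == 1:
                p1Num += trainMatrix[i]
                p1Denom += sum(trainMatrix[i])
            else:
                p0Num += trainMatrix[i]
                p0Denom += sum(trainMatrix[i])
        p1Vect = log(p1Num/p1Denom)     #log probabilities guard against underflow
        p0Vect = log(p0Num/p0Denom)
        return p0Vect,p1Vect,pAbusive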

The Naive Bayes classification function

    from numpy import array, log

    def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
        p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult
        p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
        if p1 > p0:
            return 1
        else:
            return 0

    def testingNB():
        listOPosts,listClasses = loadDataSet()
        myVocabList = createVocabList(listOPosts)
        trainMat=[]
        for postinDoc in listOPosts:
            trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
        p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))
        testEntry = ['love', 'my', 'dalmation']
        thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
        print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)
        testEntry = ['stupid', 'garbage']
        thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
        print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)
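Because trainNB0 now returns log probabilities, classifyNB adds rather than multiplies: for a document vector \(\mathbf{w}\) it compares the log scores

    \log p(c_1) + \sum_k w_k \log p(w_k \mid c_1)
    \quad\text{vs.}\quad
    \log p(c_0) + \sum_k w_k \log p(w_k \mid c_0)

and returns the class with the larger score. The shared denominator \(p(\mathbf{w})\) is the same for both classes, so it drops out of the comparison.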
    >>> reload(bayes)
<module 'bayes' from 'D:\Python27\bayes.pyc'>
    >>> bayes.testingNB()
    ['love', 'my', 'dalmation'] classified as:  0
    ['stupid', 'garbage'] classified as:  1

The bag-of-words document model

    def bagOfWords2VecMN(vocabList, inputSet):
        returnVec = [0]*len(vocabList)
        for word in inputSet:
            if word in vocabList:
                returnVec[vocabList.index(word)] += 1
        return returnVec

This is almost identical to setOfWords2Vec(); the only difference is that every time a word is encountered, the corresponding value in the vector is incremented, rather than just being set to 1.
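A minimal comparison of the two models, using a made-up three-word vocabulary (hypothetical, not the one built above):

    >>> vocab = ['dog', 'stupid', 'my']   # hypothetical tiny vocabulary
    >>> bayes.bagOfWords2VecMN(vocab, ['dog', 'dog', 'stupid'])
    [2, 1, 0]
    >>> bayes.setOfWords2Vec(vocab, ['dog', 'dog', 'stupid'])
    [1, 1, 0]

The bag-of-words vector records that 'dog' occurred twice; the set-of-words vector only records that it occurred.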
