
    Machine Learning in Action study notes: 04. Naive Bayes classification (bayes)

    Keywords: Naive Bayes, Python, source code walkthrough
    Author: 米仓山下
    Date: 2018-10-25
    Machine Learning in Action (@author: Peter Harrington)
    Source code download: https://www.manning.com/books/machine-learning-in-action
    git@github.com:pbharrin/machinelearninginaction.git

    *************************************************************
    I. Naive Bayes classification (bayes)

    # How Naive Bayes text classification works:
    Bayes' rule: P(ci|w) = P(w|ci)P(ci)/P(w)
    First compute P(ci), the probability of each document class, i.e. P(1) for abusive posts and P(0) for non-abusive ones. Then compute P(w|ci), assuming that words occur independently of one another (this independence assumption is the "naive" in Naive Bayes): P(w0,w1,…,wN|ci) = P(w0|ci)P(w1|ci)…P(wN|ci). Finally, use the formula to compute the probability of each class given the word vector, and return the class with the highest probability.

    Training: estimate P(w|ci), the conditional probability of word vector w given class ci, where P(w|ci) = P(w0,w1,…,wN|ci) = P(w0|ci)P(w1|ci)…P(wN|ci), and P(ci), the prior probability of class ci. P(w), the probability of the word vector itself, is the same for every class, so it can be ignored when comparing classes. The quantity to solve for is P(ci|w), the conditional probability that the class is ci given word vector w.

    Testing: use the formula to compute the probability of each class given the word vector, compare them, and return the class with the highest probability.
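
    As a minimal numeric sketch of this decision rule (toy probabilities, not from the book): since P(w) is common to both classes, classification reduces to comparing the products P(w|ci)P(ci).

    # Toy illustration of the Naive Bayes decision rule (made-up numbers).
    # p_w_given_c[i][j] = P(word_j | class_i); priors[i] = P(class_i)
    p_w_given_c = [[0.20, 0.05, 0.10],   # class 0 (non-abusive)
                   [0.02, 0.30, 0.25]]   # class 1 (abusive)
    priors = [0.5, 0.5]
    doc = [1, 0, 1]                      # word vector: word_0 and word_2 appear

    scores = []
    for ci in range(2):
        p = priors[ci]
        for j, present in enumerate(doc):
            if present:
                p *= p_w_given_c[ci][j]  # independence: multiply per-word terms
        scores.append(p)

    print(scores.index(max(scores)))     # prints 0: the class with larger P(w|ci)P(ci)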

    --------------------------------------------------------------
    # Naive Bayes text classification: training function
    # input: trainMatrix: training samples (set-of-words model: 1 if a word appears, 0 if not); trainCategory: class label vector
    # output: p0Vect (see [2] below), p1Vect (see [1] below), pAbusive: probability that a sample is abusive, P(1)

    from numpy import ones, log   # needed by trainNB0; the book's bayes.py does "from numpy import *"

    def trainNB0(trainMatrix,trainCategory):
        numTrainDocs = len(trainMatrix)                     #number of training samples
        numWords = len(trainMatrix[0])                      #number of features (vocabulary size)
        pAbusive = sum(trainCategory)/float(numTrainDocs)   #probability that a sample is abusive, P(1), i.e. P(ci)
        p0Num = ones(numWords); p1Num = ones(numWords)      #initialize counts, see [note 1]
        p0Denom = 2.0; p1Denom = 2.0
        for i in range(numTrainDocs):                       #iterate over the samples
            if trainCategory[i] == 1:                       #abusive sample
                p1Num += trainMatrix[i]                     #sum of abusive word vectors
                p1Denom += sum(trainMatrix[i])              #total number of words in abusive samples
            else:                                           #non-abusive sample
                p0Num += trainMatrix[i]                     #sum of word vectors
                p0Denom += sum(trainMatrix[i])              #total number of words in non-abusive samples
        p1Vect = log(p1Num/p1Denom)           #[1] probability of each vocabulary word in abusive samples, i.e. P(w|ci), see [note 1]
        p0Vect = log(p0Num/p0Denom)           #[2] probability of each vocabulary word in non-abusive samples, likewise P(w|ci)
        return p0Vect,p1Vect,pAbusive

    # Note 1: when classifying a document with a Bayes classifier, many probabilities are multiplied together to get the probability that the document belongs to a class (p(w0|c=1)p(w1|c=1)p(w2|c=1)…). If any single factor is zero, the whole product is zero. To avoid this, p0Num and p1Num are initialized to 1 and the denominators p0Denom and p1Denom to 2 (Laplace smoothing). In addition, multiplying many small numbers risks numerical underflow, so we take logarithms: f(x) and ln(f(x)) have the same monotonicity, and ln(a*b) = ln(a) + ln(b), which turns the subsequent products into sums.
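
    A quick sketch (not from the book) of why the log transform matters: multiplying a few hundred small per-word probabilities underflows to 0.0 in double precision, while summing their logarithms stays well-behaved.

    from math import log

    p = 1.0
    logp = 0.0
    for _ in range(500):          # 500 words, each with probability 0.01
        p *= 0.01                 # direct product: underflows to 0.0
        logp += log(0.01)         # log-space sum: stays finite

    print(p)                      # 0.0 (underflowed)
    print(logp)                   # about -2302.6, still usable for comparisons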

    # Naive Bayes classification function, used for testing

    def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):    #compute the probability of each class and compare
        p1 = sum(vec2Classify * p1Vec) + log(pClass1)       #computes p(w0|c1)p(w1|c1)…p(wN|c1)p(c1);
        p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1) #in log space the products become sums
        if p1 > p0:
            return 1
        else:
            return 0
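
    The test transcript below also calls loadDataSet, createVocabList, and setOfWords2Vec from bayes.py, which these notes do not reproduce. For reference, the two vectorization helpers in the book's bayes.py look roughly like this (a sketch; check the downloaded source for the exact code):

    def createVocabList(dataSet):
        vocabSet = set([])                      # union of all words across documents
        for document in dataSet:
            vocabSet = vocabSet | set(document)
        return list(vocabSet)

    def setOfWords2Vec(vocabList, inputSet):
        returnVec = [0]*len(vocabList)          # set-of-words model: 1 if the word appears
        for word in inputSet:
            if word in vocabList:
                returnVec[vocabList.index(word)] = 1
        return returnVec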

    -------------------------------------------------------------
    Test:

    >>> import bayes
    >>> from numpy import *
    >>> listOPosts,listClasses=bayes.loadDataSet()     #load the training posts and their class labels
    >>> listOPosts
    [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'], ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'], ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'], ['stop', 'posting', 'stupid', 'worthless', 'garbage'], ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'], ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    >>> listClasses
    [0, 1, 0, 1, 0, 1]
    >>> myVocalList=bayes.createVocabList(listOPosts)  #build the vocabulary list
    >>> myVocalList
    ['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park', 'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying', 'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog', 'how', 'stupid', 'so', 'take', 'mr', 'steak', 'my']
    >>> trainMat=[]
    >>> for postindoc in listOPosts:                     #convert the training set to word vectors
    ...   trainMat.append(bayes.setOfWords2Vec(myVocalList,postindoc))
    ...
    >>> trainMat
    [[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1], [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]]
    >>> p0v,p1v,pab=bayes.trainNB0(trainMat,listClasses)#compute the conditional probabilities and the abusive prior P(1)
    >>> p0v,p1v,pab
    (array([-2.56494936, -2.56494936, -2.56494936, -3.25809654, -3.25809654,
           -2.56494936, -2.56494936, -2.56494936, -3.25809654, -2.56494936,
           -2.56494936, -2.56494936, -2.56494936, -3.25809654, -3.25809654,
           -2.15948425, -3.25809654, -3.25809654, -2.56494936, -3.25809654,
           -2.56494936, -2.56494936, -3.25809654, -2.56494936, -2.56494936,
           -2.56494936, -3.25809654, -2.56494936, -3.25809654, -2.56494936,
           -2.56494936, -1.87180218]), array([-3.04452244, -3.04452244, -3.04452244, -2.35137526, -2.35137526,
           -3.04452244, -3.04452244, -3.04452244, -2.35137526, -2.35137526,
           -3.04452244, -3.04452244, -3.04452244, -2.35137526, -2.35137526,
           -2.35137526, -2.35137526, -2.35137526, -3.04452244, -1.94591015,
           -3.04452244, -2.35137526, -2.35137526, -3.04452244, -1.94591015,
           -3.04452244, -1.65822808, -3.04452244, -2.35137526, -3.04452244,
           -3.04452244, -3.04452244]), 0.5)
    >>>
    >>> testEntry = ['love', 'my', 'dalmation']
    >>> thisDoc = array(bayes.setOfWords2Vec(myVocalList, testEntry))  #convert the test entry to a word vector
    >>> thisDoc
    array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
    >>> print testEntry,'classified as: ',bayes.classifyNB(thisDoc,p0v,p1v,pab)#classify the test entry
    ['love', 'my', 'dalmation'] classified as:  0

    *************************************************************
    II. Example: filtering spam with Naive Bayes
    The email folder has two subfolders, ham and spam, holding non-spam and spam messages respectively. Of these 50 messages, 40 are used as training samples and 10 as test samples. Training and testing use the bag-of-words model (sketched below); the word lists of misclassified documents are printed and the error rate is computed.
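
    The bag-of-words model differs from setOfWords2Vec above only in counting occurrences instead of recording presence. The book's bagOfWords2VecMN is essentially the following (a sketch; verify against the downloaded source):

    def bagOfWords2VecMN(vocabList, inputSet):
        returnVec = [0]*len(vocabList)          # bag-of-words model: count occurrences
        for word in inputSet:
            if word in vocabList:
                returnVec[vocabList.index(word)] += 1
        return returnVec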

    >>> import bayes
    >>> bayes.spamTest()
    classification error ['benoit', 'mandelbrot', '1924', '2010', 'benoit', 'mandelbrot', '1924', '2010', 'wilmott', 'team', 'benoit', 'mandelbrot', 'the', 'mathematician', 'the', 'father', 'fractal', 'mathematics', 'and', 'advocate', 'more', 'sophisticated', 'modelling', 'quantitative', 'finance', 'died', '14th', 'october', '2010', 'aged', 'wilmott', 'magazine', 'has', 'often', 'featured', 'mandelbrot', 'his', 'ideas', 'and', 'the', 'work', 'others', 'inspired', 'his', 'fundamental', 'insights', 'you', 'must', 'logged', 'view', 'these', 'articles', 'from', 'past', 'issues', 'wilmott', 'magazine']
    classification error ['yeah', 'ready', 'may', 'not', 'here', 'because', 'jar', 'jar', 'has', 'plane', 'tickets', 'germany', 'for']
    classification error ['home', 'based', 'business', 'opportunity', 'knocking', 'your', 'door', 'don', 'rude', 'and', 'let', 'this', 'chance', 'you', 'can', 'earn', 'great', 'income', 'and', 'find', 'your', 'financial', 'life', 'transformed', 'learn', 'more', 'here', 'your', 'success', 'work', 'from', 'home', 'finder', 'experts']
    the error rate is:  0.3
    >>>

    Another example: using Naive Bayes to discover region-related word usage
    Idea: train on the samples to obtain p0Vect (see [2]), p1Vect (see [1]), and pAbusive, i.e. P(w|ci) and P(ci); tune how many high-frequency words are removed from the vocabulary until the accuracy is reasonably high, then report the words with the highest conditional probabilities. A sketch of the frequency-counting helper follows below.
    (code omitted)
    See p. 70 of the book for details.
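
    The high-frequency-word removal mentioned above can be sketched as follows (an illustrative helper modeled on the book's calcMostFreq; check the downloaded source for the exact code):

    import operator

    def calcMostFreq(vocabList, fullText, topN=30):
        freqDict = {}
        for token in vocabList:
            freqDict[token] = fullText.count(token)   # how often each vocabulary word occurs
        sortedFreq = sorted(freqDict.items(), key=operator.itemgetter(1), reverse=True)
        return sortedFreq[:topN]                      # the topN most frequent words, removed before training

    Removing these frequent words before training strips stopword-like terms that appear in both classes and would otherwise dominate P(w|ci).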

