  • Machine Learning in Action reading notes (4): Classifying with probability theory: naive Bayes

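    4.1 Classifying with Bayesian decision theory

      Naive Bayes

      Pros: still works with a small amount of data; handles multiple classes

      Cons: sensitive to how the input data is prepared

      Works with: nominal values

      The core idea of Bayesian decision theory: choose the decision with the highest probability.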
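
      A minimal sketch of this decision rule, assuming the class probabilities for a data point are already known. The labels and values below are made up for illustration:

```python
# Hypothetical class probabilities for one data point; the values are invented.
posteriors = {"class 1": 0.7, "class 2": 0.3}

def decide(probabilities):
    """Return the label whose probability is highest."""
    return max(probabilities, key=probabilities.get)

print(decide(posteriors))
```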

    4.2 Conditional probability

    4.3 Classifying with conditional probabilities

    4.4 Document classification with naive Bayes

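      General approach to naive Bayes:

      1. Collect the data

      2. Prepare the data

      3. Analyze the data

      4. Train the algorithm

      5. Test the algorithm

      6. Use the algorithm

      Naive Bayes assumes the features are conditionally independent given the class. Another assumption the classifier makes is that every feature is equally important.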

    4.5 Classifying text with Python

    4.5.1 Prepare: making word vectors from text

      Create the file bayes.py:

    def loadDataSet():
        """Return six tokenized example posts and their class labels."""
        postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                     ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                     ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                     ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                     ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                     ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
        classVec = [0, 1, 0, 1, 0, 1]    # 1 is abusive, 0 is not
        return postingList, classVec

    def createVocabList(dataSet):
        """Return a list of the unique words across all documents."""
        vocabSet = set()    # create an empty set
        for document in dataSet:
            vocabSet = vocabSet | set(document)    # union of the two sets
        return list(vocabSet)

    def setOfWords2Vec(vocabList, inputSet):
        """Mark each vocabulary word as present (1) or absent (0) in inputSet."""
        returnVec = [0] * len(vocabList)
        for word in inputSet:
            if word in vocabList:
                returnVec[vocabList.index(word)] = 1
            else:
                print("the word: %s is not in my Vocabulary!" % word)
        return returnVec

      Try it in the Python shell:

    import bayes
    listOPosts, listClasses = bayes.loadDataSet()
    myVocabList = bayes.createVocabList(listOPosts)
    bayes.setOfWords2Vec(myVocabList, listOPosts[0])
    bayes.setOfWords2Vec(myVocabList, listOPosts[3])
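
      Because a Python set has no guaranteed order, the position of each word in myVocabList can differ between runs, which makes the returned vectors hard to compare. The sketch below shows the same idea with a sorted vocabulary so the result is reproducible. The snake_case helper names, the sorting step, and the two shortened posts are illustrative assumptions, not part of bayes.py:

```python
def create_vocab_list(data_set):
    """Sorted list of unique words (sorting is an addition, for a stable order)."""
    vocab = set()
    for document in data_set:
        vocab |= set(document)
    return sorted(vocab)

def set_of_words_to_vec(vocab_list, input_set):
    """1 where the vocabulary word occurs in input_set, 0 elsewhere."""
    present = set(input_set)
    return [1 if word in present else 0 for word in vocab_list]

# Two shortened example posts, for illustration only.
posts = [["my", "dog", "has", "flea", "problems"],
         ["stop", "posting", "stupid", "garbage"]]
vocab = create_vocab_list(posts)
vec = set_of_words_to_vec(vocab, posts[0])
print(vocab)
print(vec)
```

      Each position in the vector corresponds to one vocabulary word, so the vector always has the same length as the vocabulary.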

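    4.5.2 Train: calculating probabilities from word vectors

      Rewrite Bayes' rule as the following formula:

      p(ci|w) = p(w|ci) p(ci) / p(w)

      Here w is a vector of word features. p(w|ci) expands to p(w0,w1...wN|ci). If we assume that all words are independent of one another given the class (the conditional independence assumption), this probability can be computed as p(w0|ci)p(w1|ci)...p(wN|ci).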
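
      As a small worked example of that product, assuming some per-word conditional probabilities are already known (the words and values below are invented for illustration):

```python
# Hypothetical per-word conditional probabilities p(w_i | c) for one class.
p_word_given_class = {"stupid": 0.2, "dog": 0.1, "park": 0.05}

def joint_probability(words, p_word):
    """p(w0, w1, ... | c) under the conditional independence assumption."""
    result = 1.0
    for word in words:
        result *= p_word[word]
    return result

# p(stupid, dog | c) = p(stupid | c) * p(dog | c) = 0.2 * 0.1
p = joint_probability(["stupid", "dog"], p_word_given_class)
print(p)
```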

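      Pseudocode for the training function:

      Count the number of documents in each class

      For every training document:

        For each class:

          If a token appears in the document -> increment the count for that token

          Increment the count for all tokens

      For each class:

        For each token:

          Divide the token count by the total token count to get the conditional probability

      Return the conditional probabilities for each class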

    from numpy import zeros

    def trainNB0(trainMatrix, trainCategory):
        numTrainDocs = len(trainMatrix)
        numWords = len(trainMatrix[0])
        pAbusive = sum(trainCategory) / float(numTrainDocs)    # p(class = 1)
        p0Num = zeros(numWords); p1Num = zeros(numWords)    # per-word counts for each class
        p0Denom = 0.0; p1Denom = 0.0                        # total word count for each class
        for i in range(numTrainDocs):
            if trainCategory[i] == 1:
                p1Num += trainMatrix[i]
                p1Denom += sum(trainMatrix[i])
            else:
                p0Num += trainMatrix[i]
                p0Denom += sum(trainMatrix[i])
        p1Vect = p1Num / p1Denom    # p(w_i | class 1) for every word
        p0Vect = p0Num / p0Denom    # p(w_i | class 0) for every word
        return p0Vect, p1Vect, pAbusive

      Build the training matrix and train the classifier in the Python shell:

    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(bayes.setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = bayes.trainNB0(trainMat, listClasses)

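    4.5.3 Test: modifying the classifier for real-world conditions

      When classifying a document, the Bayes classifier multiplies many probabilities together to get the probability that the document belongs to a class, i.e. it computes p(w0|1)p(w1|1)... If any one of these factors is 0, the whole product is 0. To lessen this effect, initialize every word count to 1 and each denominator to 2.

      Modify trainNB0():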

    from numpy import ones

    def trainNB0(trainMatrix, trainCategory):
        numTrainDocs = len(trainMatrix)
        numWords = len(trainMatrix[0])
        pAbusive = sum(trainCategory) / float(numTrainDocs)
        p0Num = ones(numWords); p1Num = ones(numWords)      # start every word count at 1
        p0Denom = 2.0; p1Denom = 2.0                        # start each denominator at 2
        for i in range(numTrainDocs):
            if trainCategory[i] == 1:
                p1Num += trainMatrix[i]
                p1Denom += sum(trainMatrix[i])
            else:
                p0Num += trainMatrix[i]
                p0Denom += sum(trainMatrix[i])
        p1Vect = p1Num / p1Denom    # still plain probabilities at this stage
        p0Vect = p0Num / p0Denom
        return p0Vect, p1Vect, pAbusive
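
      To see what this initialization changes, here is a minimal pure-Python sketch with invented counts for a three-word vocabulary. It follows the scheme described above (counts start at 1, denominator at 2), which is not the same as textbook Laplace smoothing, where the vocabulary size is added to the denominator:

```python
# Hypothetical word counts for one class over a 3-word vocabulary;
# the third word never occurs in that class's training documents.
counts = [3, 1, 0]
total = sum(counts)

unsmoothed = [c / total for c in counts]
smoothed = [(c + 1) / (total + 2) for c in counts]   # counts start at 1, denominator at 2

doc = [1, 0, 1]   # the document contains the first and third words

def product_of_present(probabilities, doc_vec):
    """Multiply the probabilities of the words that occur in the document."""
    result = 1.0
    for p, present in zip(probabilities, doc_vec):
        if present:
            result *= p
    return result

p_unsmoothed = product_of_present(unsmoothed, doc)   # wiped out by the zero factor
p_smoothed = product_of_present(smoothed, doc)       # stays positive
print(p_unsmoothed, p_smoothed)
```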

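      Another problem is underflow, caused by multiplying too many very small numbers. When computing p(w0|1)p(w1|1)..., most of the factors are very small, so the program underflows or returns an incorrect answer. One solution is to take the natural logarithm of the product. Since ln(a*b) = ln(a) + ln(b), working with logarithms avoids underflow and floating-point round-off errors. Nothing is lost by doing so: the logarithm is monotonically increasing, so comparing log-probabilities picks the same class as comparing the probabilities themselves. So modify trainNB0() once more: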

    from numpy import log, ones

    def trainNB0(trainMatrix, trainCategory):
        numTrainDocs = len(trainMatrix)
        numWords = len(trainMatrix[0])
        pAbusive = sum(trainCategory) / float(numTrainDocs)
        p0Num = ones(numWords); p1Num = ones(numWords)      # start every word count at 1
        p0Denom = 2.0; p1Denom = 2.0                        # start each denominator at 2
        for i in range(numTrainDocs):
            if trainCategory[i] == 1:
                p1Num += trainMatrix[i]
                p1Denom += sum(trainMatrix[i])
            else:
                p0Num += trainMatrix[i]
                p0Denom += sum(trainMatrix[i])
        p1Vect = log(p1Num / p1Denom)    # log p(w_i | class 1)
        p0Vect = log(p0Num / p0Denom)    # log p(w_i | class 0)
        return p0Vect, p1Vect, pAbusive
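
      The underflow problem is easy to reproduce. In the sketch below, the number of factors and their value are arbitrary choices for illustration: the direct product collapses to 0.0, while the sum of logarithms remains an ordinary finite number.

```python
import math

probabilities = [1e-5] * 100   # hypothetical: 100 factors of 1e-5 each

product = 1.0
for p in probabilities:
    product *= p               # the true value, 1e-500, is too small for a double

# ln(a*b) = ln(a) + ln(b), so the log of the product is the sum of the logs.
log_sum = sum(math.log(p) for p in probabilities)
print(product, log_sum)
```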

      Write the classification function:

    from numpy import array, log

    def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
        # score = log p(c) + sum of log p(w_i|c) over the words in the document
        p1 = sum(vec2Classify * p1Vec) + log(pClass1)    # element-wise multiplication
        p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
        if p1 > p0:
            return 1
        else:
            return 0

    def bagOfWords2VecMN(vocabList, inputSet):
        returnVec = [0] * len(vocabList)
        for word in inputSet:
            if word in vocabList:
                returnVec[vocabList.index(word)] += 1
        return returnVec

    def testingNB():
        listOPosts, listClasses = loadDataSet()
        myVocabList = createVocabList(listOPosts)
        trainMat = []
        for postinDoc in listOPosts:
            trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
        p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
        testEntry = ['love', 'my', 'dalmation']
        thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
        print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
        testEntry = ['stupid', 'garbage']
        thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
        print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))

      Run the test in the Python shell:

    bayes.testingNB()
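
      The score that classifyNB() compares is the log prior plus the log-probabilities of the words present in the document; words that are absent contribute nothing. A minimal pure-Python sketch of the same comparison follows. The three-word vocabulary and its probabilities are invented, and classify_nb is an illustrative stand-in rather than the function from bayes.py:

```python
import math

def classify_nb(doc_vec, log_p0, log_p1, p_class1):
    """Compare log p(c) + sum_i x_i * log p(w_i | c) for the two classes."""
    score1 = sum(x * lp for x, lp in zip(doc_vec, log_p1)) + math.log(p_class1)
    score0 = sum(x * lp for x, lp in zip(doc_vec, log_p0)) + math.log(1.0 - p_class1)
    return 1 if score1 > score0 else 0

# Hypothetical probabilities for the vocabulary ["love", "my", "stupid"].
log_p0 = [math.log(p) for p in (0.3, 0.5, 0.2)]   # class 0: not abusive
log_p1 = [math.log(p) for p in (0.1, 0.2, 0.7)]   # class 1: abusive

friendly = classify_nb([1, 1, 0], log_p0, log_p1, 0.5)   # contains "love", "my"
abusive = classify_nb([0, 0, 1], log_p0, log_p1, 0.5)    # contains "stupid"
print(friendly, abusive)
```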

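    4.5.4 Prepare: the bag-of-words document model

      Set-of-words model: each word is recorded only as present or absent, so it counts at most once per document.

      Bag-of-words model: a word can occur more than once in a document, and every occurrence is counted.

      Change setOfWords2Vec() into bagOfWords2VecMN():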

    def bagOfWords2VecMN(vocabList, inputSet):
        returnVec = [0]*len(vocabList)
        for word in inputSet:
            if word in vocabList:
                returnVec[vocabList.index(word)] += 1
        return returnVec
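
      The difference between the two models shows up as soon as a word repeats. A small sketch, using a made-up vocabulary and document and illustrative helper names:

```python
vocab = ["dog", "stupid", "park"]          # hypothetical vocabulary
document = ["stupid", "dog", "stupid"]     # "stupid" occurs twice

def set_of_words(vocab_list, words):
    """Presence/absence vector: 1 if the word occurs at all."""
    return [1 if w in words else 0 for w in vocab_list]

def bag_of_words(vocab_list, words):
    """Count vector: how many times each word occurs."""
    return [words.count(w) for w in vocab_list]

set_vec = set_of_words(vocab, document)
bag_vec = bag_of_words(vocab, document)
print(set_vec)
print(bag_vec)
```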

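    4.6 Classifying spam email with naive Bayes

      1. Collect: text files are provided

      2. Prepare: parse the text files into token vectors

      3. Analyze: inspect the tokens to make sure the parsing is correct

      4. Train: use the trainNB0() function built earlier

      5. Test: use classifyNB() and write a new test function that computes the error rate over a set of documents

      6. Use: build a complete program that classifies a group of documents and prints the misclassified ones to the screen

    4.6.1 Prepare: tokenizing text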
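
      A minimal sketch of the tokenizing step, assuming tokens are split on runs of non-word characters, lowercased, and dropped when they have two characters or fewer. The snake_case name and the sample sentence are arbitrary choices for illustration:

```python
import re

def text_parse(big_string):
    """Split on runs of non-word characters; keep lowercased tokens longer than 2 characters."""
    tokens = re.split(r"\W+", big_string)
    return [tok.lower() for tok in tokens if len(tok) > 2]

sample = "Hi Peter, this book is the best book on Python I have ever laid eyes upon."
words = text_parse(sample)
print(words)
```

      The length filter also discards the empty string that re.split() leaves behind when the text ends with punctuation.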

    4.6.2 Test: cross-validation with naive Bayes

      

    import random
    import re
    from numpy import array

    def textParse(bigString):    # input is a big string, output is a word list
        listOfTokens = re.split(r'\W+', bigString)    # split on runs of non-word characters
        return [tok.lower() for tok in listOfTokens if len(tok) > 2]

    def spamTest():
        docList = []; classList = []; fullText = []
        for i in range(1, 26):
            wordList = textParse(open('email/spam/%d.txt' % i).read())
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(1)
            wordList = textParse(open('email/ham/%d.txt' % i).read())
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(0)
        vocabList = createVocabList(docList)    # create vocabulary
        trainingSet = list(range(50)); testSet = []    # hold out 10 random documents as the test set
        for i in range(10):
            randIndex = random.randrange(len(trainingSet))
            testSet.append(trainingSet[randIndex])
            del trainingSet[randIndex]
        trainMat = []; trainClasses = []
        for docIndex in trainingSet:    # train the classifier (get probs) with trainNB0
            trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
            trainClasses.append(classList[docIndex])
        p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
        errorCount = 0
        for docIndex in testSet:        # classify the held-out items
            wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
            if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
                errorCount += 1
                print("classification error", docList[docIndex])
        print('the error rate is: ', float(errorCount) / len(testSet))
        #return vocabList,fullText

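      The program above picks 10 documents at random as the test set. If all of them are classified correctly, it prints an error rate of 0.0; otherwise it also prints the word list of each misclassified document. Because the split is random, the error rate can differ from run to run.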
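
      The random hold-out step can be sketched on its own. The function name holdout_split and its rng parameter are assumptions made for this example; passing a seeded random.Random makes the split repeatable:

```python
import random

def holdout_split(n_docs, n_test, rng=random):
    """Randomly move n_test of the indices 0..n_docs-1 into a test set."""
    training = list(range(n_docs))
    test = []
    for _ in range(n_test):
        idx = rng.randrange(len(training))   # pick a position still in the training set
        test.append(training.pop(idx))       # move that document index to the test set
    return training, test

training, test = holdout_split(50, 10, random.Random(0))
print(len(training), len(test))
```

      Every document index ends up in exactly one of the two lists, so no test document is ever used for training.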

    4.7 (not finished)

      

  • Original post: https://www.cnblogs.com/MarsMercury/p/5173336.html