  • Naive Bayes Summary

    For a better reading experience, welcome to view the original on my GitHub site!

    This notebook is inspired by, but not limited to, Machine Learning in Action;
    for example, the algorithm is implemented at a higher level.

    All rights reserved by Diane (Qingyun Hu).

    1. About Naive Bayes

    1.1 Mechanism of Naive Bayes

    Naive Bayes is built on Bayes' rule, so let's recap Bayes' rule a bit.

    \[ P(c_i \mid w_1, w_2, w_3, ..., w_m) = \frac{P(w_1, w_2, w_3, ..., w_m \mid c_i) \, P(c_i)}{P(w_1, w_2, w_3, ..., w_m)} \]

    where \((w_1, w_2, w_3, ..., w_m)\) is a vector of the words that are present in the document and also included in the existing vocabulary list, and \(c_i\) stands for class i.

    Naive Bayes asks us to assume that the presences of \(w_1, w_2, w_3, ..., w_m\) are independent of each other given the class. This is not realistic, since there are always connections between one word and another. However, the assumption simplifies the calculation considerably and has worked quite well in practice. By assuming the presence of words is independent, we have:

    \[ P(c_i \mid w_1, w_2, w_3, ..., w_m) = \frac{P(w_1 \mid c_i) \, P(w_2 \mid c_i) \, P(w_3 \mid c_i) \cdots P(w_m \mid c_i) \, P(c_i)}{P(w_1) \, P(w_2) \, P(w_3) \cdots P(w_m)} \]
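    To make the factorized form concrete, here is a minimal numeric sketch; every probability in it is made up purely for illustration:

    # Toy evaluation of the factorized Bayes rule above.
    # The per-word probabilities below are invented numbers,
    # not estimates from any real corpus.
    import numpy as np

    p_ci = 0.5                                   # P(c_i): class prior
    p_w_given_ci = np.array([0.10, 0.05, 0.20])  # P(w_k | c_i) for three words
    p_w = np.array([0.08, 0.06, 0.15])           # P(w_k) over all classes

    posterior = p_w_given_ci.prod() * p_ci / p_w.prod()
    print(posterior)  # ~0.694: P(c_i | w_1, w_2, w_3) under the naive assumption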

    1.2 Pros and Cons

    1.21 Pros

    1. Handles multiple classes.
    2. Works well on small datasets.

    1.22 Cons

    1. Sensitive to how the input data is prepared.
    2. The dense bag-of-words vectors can consume a lot of memory if not handled properly, since each vector's length equals the length of the vocabulary list (see the sparse-representation sketch below).
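
    One common workaround for the memory issue is to store each document as a sparse word-to-count mapping instead of a dense vector. A minimal sketch using Python's collections.Counter (this helper is not part of the notebook below):

    # Sparse bag-of-words: store only the words that actually occur,
    # rather than a dense vector as long as the whole vocabulary list.
    from collections import Counter

    def sparseBagOfWords(vocabSet, document):
        # vocabSet is a set of known words; unknown tokens are skipped
        return Counter(token for token in document if token in vocabSet)

    print(sparseBagOfWords({'my', 'dog', 'has'}, ['my', 'dog', 'my']))
    # Counter({'my': 2, 'dog': 1})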

    1.23 Works with

    Nominal Values

    2. Implementing Naive Bayes

    # Create demo dataset
    from IPython.core.interactiveshell import InteractiveShell
    InteractiveShell.ast_node_interactivity = "all"  # echo every expression, not just the last
    import pandas as pd

    def createDataSet():
        # six short documents and their class labels (1 = abusive, 0 = not)
        postingList=[['my', 'dog', 'has', 'flea', 
                      'problems', 'help', 'please'],
                     ['maybe', 'not', 'take', 'him', 
                      'to', 'dog', 'park', 'stupid'],
                     ['my', 'dalmation', 'is', 'so', 'cute', 
                       'I', 'love', 'him'],
                     ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                     ['mr', 'licks', 'ate', 'my', 'steak', 'how',
                       'to', 'stop', 'him'],
                     ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
        classVec = [0,1,0,1,0,1]
        return postingList,classVec

    dataSet, labels = createDataSet()
    dataSet
    labels
    
    [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
     ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
     ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
     ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
     ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
     ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    
    [0, 1, 0, 1, 0, 1]
    
    # Tool Function 1: Create a vocabulary list from dataSet
    
    def createVocabList(dataSet):
        vocabList = set([])
        for docum in dataSet:
            vocabList = vocabList | set(docum)  # union with the words of each document
        return list(vocabList)
    vocabList = createVocabList(dataSet)
    
    # Tool Function 2: Get a bag-of-words vector for each document
    import numpy as np
    def bagOfWordsVec(vocabList, document):
        # start from ones instead of zeros (see Trick 1 in section 3)
        returnVec = np.ones(len(vocabList))
        for token in document:
            if token in vocabList:
                returnVec[vocabList.index(token)] += 1
        return returnVec
    bagOfWordsVec(vocabList, dataSet[3])
    
    array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  2.,
            1.,  1.,  2.,  1.,  1.,  1.,  2.,  1.,  1.,  1.,  1.,  1.,  1.,
            1.,  2.,  1.,  1.,  2.,  1.])
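    
    As a quick sanity check (not in the original notebook), the positions that rose above 1 can be mapped back to words; for dataSet[3] they should be exactly its five tokens:

    # Map the incremented positions back to vocabulary words
    vec = bagOfWordsVec(vocabList, dataSet[3])
    print([w for w, n in zip(vocabList, vec) if n > 1])
    # expected: 'stop', 'posting', 'stupid', 'worthless', 'garbage' (in vocabulary order)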
    
    # Tool Function 3: Get the bag-of-words table for the training dataset
    
    def getBagOfWordsTable(dataSet, vocabList, label):
        bagOfWordsTable = []
        for document in dataSet:
            bagOfWordsTable.append(bagOfWordsVec(vocabList, document))
        # one row per document, one column per vocabulary word, plus the label
        bagOfWordsTable = pd.DataFrame(bagOfWordsTable, columns=vocabList)
        bagOfWordsTable['label'] = label
        return bagOfWordsTable
    getBagOfWordsTable(dataSet, vocabList, labels)
    
       park  food  licks  him  problems  love  take  not  maybe   mr  ...  has   so    I  help  stupid  dalmation  ate  worthless   my  label
    0   1.0   1.0    1.0  1.0       2.0   1.0   1.0  1.0    1.0  1.0  ...  2.0  1.0  1.0   2.0     1.0        1.0  1.0        1.0  2.0      0
    1   2.0   1.0    1.0  2.0       1.0   1.0   2.0  2.0    2.0  1.0  ...  1.0  1.0  1.0   1.0     2.0        1.0  1.0        1.0  1.0      1
    2   1.0   1.0    1.0  2.0       1.0   2.0   1.0  1.0    1.0  1.0  ...  1.0  2.0  2.0   1.0     1.0        2.0  1.0        1.0  2.0      0
    3   1.0   1.0    1.0  1.0       1.0   1.0   1.0  1.0    1.0  1.0  ...  1.0  1.0  1.0   1.0     2.0        1.0  1.0        2.0  1.0      1
    4   1.0   1.0    2.0  2.0       1.0   1.0   1.0  1.0    1.0  2.0  ...  1.0  1.0  1.0   1.0     1.0        1.0  2.0        1.0  2.0      0
    5   1.0   2.0    1.0  1.0       1.0   1.0   1.0  1.0    1.0  1.0  ...  1.0  1.0  1.0   1.0     2.0        1.0  1.0        2.0  1.0      1

    6 rows × 33 columns

    # Calculate Probabilities
    
    bagOfWordsTable = getBagOfWordsTable(dataSet, vocabList, labels)
    def getProb(c_i, bagOfWordsTable, testDataset):
        # P(c_i): fraction of training documents labelled c_i
        P_ci = (bagOfWordsTable.label == c_i).sum() / bagOfWordsTable.shape[0]
        # word counts within class c_i; drop the label column so it does not
        # pollute the word-probability normalization
        wordCounts_ci = bagOfWordsTable[bagOfWordsTable.label == c_i].drop(columns='label')
        P_Xi_ci = wordCounts_ci.sum() / wordCounts_ci.sum().sum()   # P(w | c_i)
        wordCounts = bagOfWordsTable.drop(columns='label')
        P_Xi = wordCounts.sum() / wordCounts.sum().sum()            # P(w)
        
        predVec = []
        for document in testDataset:
            # sum logs rather than multiply raw probabilities (see Trick 2 in section 3)
            predVec.append(np.exp(np.log(P_Xi_ci[document]).sum()
                                  + np.log(P_ci)
                                  - np.log(P_Xi[document]).sum()))
        return predVec
    
    print("Predictions on Training DataSet (the probability of each document being Class 1):")
    getProb(1, bagOfWordsTable, dataSet)
    
    print("Real Classes of Training DataSet")
    labels
    
    print("Not Bad!")
    
    Predictions on Training DataSet (the probability of each document being Class 1):
    
    [0.199197,
     1.334141,
     0.139562,
     1.212000,
     0.166361,
     1.572506]
    
    Real Classes of Training DataSet
    
    [0, 1, 0, 1, 0, 1]
    
    Not Bad!
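    
    Note that some "probabilities" above exceed 1: since the denominator is itself approximated by a product of independent \(P(w_k)\), the outputs are unnormalized scores rather than true probabilities, and what matters is that Class-1 documents score higher. To turn the scores into hard class predictions, one simple wrapper (not in the original notebook) is to compare each document's Class-1 score against its Class-0 score:

    # Hypothetical wrapper around getProb(): pick the class with the higher score.
    def classify(bagOfWordsTable, testDataset):
        scores0 = getProb(0, bagOfWordsTable, testDataset)
        scores1 = getProb(1, bagOfWordsTable, testDataset)
        return [int(s1 > s0) for s0, s1 in zip(scores0, scores1)]

    print(classify(bagOfWordsTable, dataSet))  # expected to reproduce the labels above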
    

    3. Misc.

    Trick 1: Initialize the bag-of-words vector with 1s instead of 0s to prevent any single \(P(w_i \mid c_i) = 0\) from happening, which would drive the whole product, and hence the prediction, to 0.
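    
    A tiny illustration of the problem (the counts here are made up): with raw counts, one unseen word zeroes out the whole product, while starting the counts at 1 keeps it small but non-zero.

    # Made-up counts: the second word was never seen in this class
    import numpy as np
    counts = np.array([3, 0, 2])
    print(np.prod(counts / counts.sum()))      # 0.0 -- a single zero kills the product
    smoothed = counts + 1                      # start from 1s instead of 0s
    print(np.prod(smoothed / smoothed.sum()))  # ~0.0234 -- small but non-zero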

    Trick 2: Probabilities range from 0 to 1, so when multiplying a long chain of them, as in \(P(w_1) P(w_2) P(w_3) \cdots P(w_m)\), underflow tends to happen. To prevent this, apply log() to the right-hand side of \( P(c_i \mid w_1, w_2, w_3, ..., w_m) = \frac{P(w_1 \mid c_i) \, P(w_2 \mid c_i) \, P(w_3 \mid c_i) \cdots P(w_m \mid c_i) \, P(c_i)}{P(w_1) \, P(w_2) \, P(w_3) \cdots P(w_m)} \), turning the products into sums of logs, and only apply exp() at the very end.
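    
    A quick demonstration of the underflow (with made-up numbers): the raw product of a thousand small probabilities collapses to 0.0 in float64, while the sum of logs remains a perfectly usable number for comparing classes.

    # Underflow demo: product vs. sum of logs
    import numpy as np
    probs = np.full(1000, 0.01)    # a long chain of small probabilities
    print(np.prod(probs))          # 0.0 -- underflows in float64
    print(np.log(probs).sum())     # about -4605.2 -- still representable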

  • Original post: https://www.cnblogs.com/DianeSoHungry/p/11357240.html