  • Introduction to the Naive Bayes Algorithm, with a Python Implementation

    Concepts:

      Bayes' theorem: Bayes' theorem is named after Thomas Bayes, an 18th-century theologian. In general, the probability of event A given that event B has occurred differs from the probability of event B given that event A has occurred; however, the two quantities are related in a definite way, and Bayes' theorem is the statement of that relationship.

      Naive Bayes: The naive Bayes method is a classification method based on Bayes' theorem and the assumption that features are conditionally independent. Given a training dataset, it first learns the joint probability distribution of inputs and outputs under that independence assumption; then, for a given input x, it applies Bayes' theorem to output the class y with the largest posterior probability (maximum a posteriori, MAP).

    Put simply: given a dataset and a new, unlabeled sample, find the samples in the dataset whose features match the new sample, use them to compute a probability for each class, and assign the new sample to the class with the highest probability.

    Formula:

    P(h | d) = P(d | h) * P(h) / P(d)

    where:
    P(h | d): the probability of hypothesis h given the data d, called the posterior probability
    P(d | h): the probability of the data d given that hypothesis h is true
    P(h): the probability that hypothesis h is true, regardless of the data, called the prior probability of h
    P(d): the probability of the data d, regardless of the hypothesis (the evidence)
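
    To make the formula concrete, here is a minimal worked sketch in Python; the prior, likelihood, and evidence values below are made-up numbers chosen purely for illustration:

      # Worked Bayes' theorem example with invented numbers.
      # h = "the patient has the disease", d = "the test is positive".
      p_h = 0.01         # P(h): prior probability of the hypothesis
      p_d_given_h = 0.9  # P(d|h): probability of the data if h is true
      p_d = 0.05         # P(d): overall probability of the data

      # P(h|d) = P(d|h) * P(h) / P(d)
      p_h_given_d = p_d_given_h * p_h / p_d
      print(p_h_given_d)  # ~0.18, the posterior probability of h given d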

    Algorithm breakdown:

    1. Handle data: load the data and split it into training and test sets.
    2. Summarize data: compute per-class statistics of the training data so that probabilities can be calculated later for prediction.
    3. Make predictions: use the training-data summaries to predict a class for each test sample.
    4. Evaluate accuracy: use the test data to measure the accuracy of the predictions.

    Code implementation:

      # Example of Naive Bayes implemented from scratch in Python 3
      import csv
      import random
      import math

      def loadCsv(filename):
          # Load the CSV file and convert every field to float
          with open(filename, "r") as csvfile:
              dataset = list(csv.reader(csvfile))
          for i in range(len(dataset)):
              dataset[i] = [float(x) for x in dataset[i]]
          return dataset

      def splitDataset(dataset, splitRatio):
          # Randomly split the dataset into a training set and a test set
          trainSize = int(len(dataset) * splitRatio)
          trainSet = []
          copy = list(dataset)
          while len(trainSet) < trainSize:
              index = random.randrange(len(copy))
              trainSet.append(copy.pop(index))
          return [trainSet, copy]

      def separateByClass(dataset):
          # Group rows by class value (the last attribute of each row)
          separated = {}
          for i in range(len(dataset)):
              vector = dataset[i]
              if vector[-1] not in separated:
                  separated[vector[-1]] = []
              separated[vector[-1]].append(vector)
          return separated

      def mean(numbers):
          return sum(numbers) / float(len(numbers))

      def stdev(numbers):
          # Sample standard deviation (n - 1 in the denominator)
          avg = mean(numbers)
          variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
          return math.sqrt(variance)

      def summarize(dataset):
          # Compute (mean, stdev) per attribute; drop the class column's summary
          summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
          del summaries[-1]
          return summaries

      def summarizeByClass(dataset):
          separated = separateByClass(dataset)
          summaries = {}
          for classValue, instances in separated.items():
              summaries[classValue] = summarize(instances)
          return summaries

      def calculateProbability(x, mean, stdev):
          # Gaussian probability density of attribute value x
          exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
          return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

      def calculateClassProbabilities(summaries, inputVector):
          # Score each class by multiplying its per-attribute densities
          probabilities = {}
          for classValue, classSummaries in summaries.items():
              probabilities[classValue] = 1
              for i in range(len(classSummaries)):
                  mean, stdev = classSummaries[i]
                  x = inputVector[i]
                  probabilities[classValue] *= calculateProbability(x, mean, stdev)
          return probabilities

      def predict(summaries, inputVector):
          # Return the class with the largest probability score
          probabilities = calculateClassProbabilities(summaries, inputVector)
          bestLabel, bestProb = None, -1
          for classValue, probability in probabilities.items():
              if bestLabel is None or probability > bestProb:
                  bestProb = probability
                  bestLabel = classValue
          return bestLabel

      def getPredictions(summaries, testSet):
          predictions = []
          for i in range(len(testSet)):
              result = predict(summaries, testSet[i])
              predictions.append(result)
          return predictions

      def getAccuracy(testSet, predictions):
          # Percentage of test rows whose predicted class matches the actual class
          correct = 0
          for i in range(len(testSet)):
              if testSet[i][-1] == predictions[i]:
                  correct += 1
          return (correct / float(len(testSet))) * 100.0

      def main():
          filename = 'pima-indians-diabetes.data.csv'
          splitRatio = 0.67
          dataset = loadCsv(filename)
          trainingSet, testSet = splitDataset(dataset, splitRatio)
          print('Split {0} rows into train={1} and test={2} rows'.format(len(dataset), len(trainingSet), len(testSet)))
          # prepare model
          summaries = summarizeByClass(trainingSet)
          # test model
          predictions = getPredictions(summaries, testSet)
          accuracy = getAccuracy(testSet, predictions)
          print('Accuracy: {0}%'.format(accuracy))

      main()
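
    Once the per-class summaries have been built, predict() can also classify a single new sample directly. A minimal sketch, assuming the dataset file has been downloaded to the working directory; the feature vector below is an invented example in the Pima dataset's 8-attribute format, not a real measurement:

      # Hypothetical usage: classify one unlabeled sample
      summaries = summarizeByClass(loadCsv('pima-indians-diabetes.data.csv'))
      newSample = [6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0]  # 8 feature values
      print(predict(summaries, newSample))  # prints the predicted class, e.g. 0.0 or 1.0

    Note that each class score is a product of many small densities, so long feature vectors can underflow to zero; a common refinement, not used in the code above, is to sum log-probabilities instead of multiplying probabilities.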

    Download link for pima-indians-diabetes.data.csv:

    https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv 

    References:

    1. https://en.wikipedia.org/wiki/Naive_Bayes_classifier

    2. https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/

    3. https://machinelearningmastery.com/naive-bayes-for-machine-learning/
