  • Naive Bayes: A Brief Introduction and an Analysis of a Python Implementation

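    Concepts:

      Bayes' theorem: The theorem is named after Thomas Bayes, an 18th-century theologian. In general, the probability of event A given that event B has occurred is not the same as the probability of event B given that event A has occurred. The two are nevertheless linked by a definite relationship, and Bayes' theorem is the statement of that relationship.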

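      Naive Bayes: Naive Bayes is a classification method based on Bayes' theorem and on the assumption that the features are conditionally independent given the class. Given a training data set, it first learns the joint probability distribution of inputs and outputs under that independence assumption. Then, for a given input x, it uses Bayes' theorem to find the output y with the largest posterior probability (the maximum a posteriori, or MAP, estimate).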
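      Written out, the decision rule described above takes roughly the following form. The notation is an assumption made here for illustration (the original gives no formula at this point): x_1, ..., x_n are the n features of the input x, and \hat{y} is the predicted class.

```latex
\hat{y}
  = \arg\max_{y} P(y \mid x_1, \dots, x_n)
  = \arg\max_{y} \frac{P(y)\,\prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}
  = \arg\max_{y} P(y)\,\prod_{i=1}^{n} P(x_i \mid y)
```

      The denominator is the same for every class, so it can be dropped when only the most probable class is needed. The product over features is where the conditional independence assumption enters.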

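    Put simply: given a data set and a new, unclassified sample, the method looks at how often each of the new sample's feature values occurs among the training samples of each class, uses those frequencies to compute a probability for each class, and assigns the new sample to the class with the highest probability.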
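    For categorical features, this counting view can be sketched as follows. This is a minimal illustration, not the implementation analysed in this article: the function names `train_counts` and `classify` and the toy weather rows are made up here, and no smoothing is applied, so a feature value never seen in a class drives that class's score to zero.

```python
from collections import Counter, defaultdict

def train_counts(rows):
    """rows is a list of (features_tuple, label) pairs (hypothetical format)."""
    class_counts = Counter(label for _, label in rows)
    # (label, feature index) -> Counter of observed feature values
    feature_counts = defaultdict(Counter)
    for features, label in rows:
        for i, value in enumerate(features):
            feature_counts[(label, i)][value] += 1
    return class_counts, feature_counts

def classify(features, class_counts, feature_counts):
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # class prior estimated from class frequency
        for i, value in enumerate(features):
            # relative frequency of this feature value within the class
            score *= feature_counts[(label, i)][value] / n
        scores[label] = score
    return max(scores, key=scores.get), scores

# Made-up toy data: (outlook, temperature) -> play?
rows = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "hot"), "yes"),
    (("sunny", "mild"), "yes"),
]

label, scores = classify(("rainy", "mild"), *train_counts(rows))
print(label, scores)
```

    The scores are unnormalised: they are proportional to the posterior probabilities, which is enough to pick the most probable class.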

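    Formula:

    P(h | d) = P(d | h) * P(h) / P(d)

    Where:
    P(h | d): the probability of hypothesis h given the data d, called the posterior probability
    P(d | h): the probability of the data d given that hypothesis h is true
    P(h): the probability that hypothesis h is true, regardless of the data; it is called the prior probability of h
    P(d): the probability of the data d, regardless of the hypothesis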
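    A small numeric example may make the formula concrete. The sketch below assumes only two possibilities, h and not-h, so that P(d) can be expanded as P(d | h) * P(h) + P(d | not h) * P(not h). The function name `posterior` and the three input numbers are invented for illustration.

```python
def posterior(prior_h, p_d_given_h, p_d_given_not_h):
    """P(h | d) for a two-hypothesis case (h or not-h)."""
    # P(d) by the law of total probability
    p_d = p_d_given_h * prior_h + p_d_given_not_h * (1 - prior_h)
    return p_d_given_h * prior_h / p_d

# Illustrative numbers: P(h) = 0.01, P(d | h) = 0.9, P(d | not h) = 0.05
p = posterior(0.01, 0.9, 0.05)
print(round(p, 4))  # 0.1538
```

    Even though the data d is far more likely under h than under not-h, the small prior keeps the posterior at about 15%.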

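    Implementation steps:

    1 Handle data: load the data and split it into a training set and a test set
    2 Summarize data: summarize the statistics of the training data so that probabilities can be computed and predictions made
    3 Make predictions: use the summaries of the training data to predict the class of each test sample
    4 Evaluate accuracy: compare the predictions against the test set to measure accuracy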

    Code:

      # Example of Naive Bayes implemented from scratch in Python 3
      import csv
      import math
      import random

      def loadCsv(filename):
          # Read every non-empty row and convert all fields to float
          with open(filename, newline="") as f:
              return [[float(x) for x in row] for row in csv.reader(f) if row]

      def splitDataset(dataset, splitRatio):
          # Randomly move splitRatio of the rows into the training set
          trainSize = int(len(dataset) * splitRatio)
          trainSet = []
          remaining = list(dataset)
          while len(trainSet) < trainSize:
              index = random.randrange(len(remaining))
              trainSet.append(remaining.pop(index))
          return [trainSet, remaining]

      def separateByClass(dataset):
          # Group rows by class label (the last column)
          separated = {}
          for vector in dataset:
              separated.setdefault(vector[-1], []).append(vector)
          return separated

      def mean(numbers):
          return sum(numbers) / float(len(numbers))

      def stdev(numbers):
          # Sample standard deviation (n - 1 denominator); needs at least two values
          avg = mean(numbers)
          variance = sum((x - avg) ** 2 for x in numbers) / float(len(numbers) - 1)
          return math.sqrt(variance)

      def summarize(dataset):
          # (mean, stdev) per column; the last entry is the class label, so drop it
          summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
          del summaries[-1]
          return summaries

      def summarizeByClass(dataset):
          # Map each class label to the per-attribute summaries of its rows
          separated = separateByClass(dataset)
          summaries = {}
          for classValue, instances in separated.items():
              summaries[classValue] = summarize(instances)
          return summaries

      def calculateProbability(x, mean, stdev):
          # Gaussian probability density of x; assumes stdev > 0
          exponent = math.exp(-(math.pow(x - mean, 2) / (2 * math.pow(stdev, 2))))
          return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

      def calculateClassProbabilities(summaries, inputVector):
          # Product of the per-attribute densities for each class.
          # The class prior P(h) is not included, so all classes are
          # treated as equally likely a priori.
          probabilities = {}
          for classValue, classSummaries in summaries.items():
              probabilities[classValue] = 1
              for i in range(len(classSummaries)):
                  mean, stdev = classSummaries[i]
                  x = inputVector[i]
                  probabilities[classValue] *= calculateProbability(x, mean, stdev)
          return probabilities

      def predict(summaries, inputVector):
          # Return the class with the largest score
          probabilities = calculateClassProbabilities(summaries, inputVector)
          bestLabel, bestProb = None, -1
          for classValue, probability in probabilities.items():
              if bestLabel is None or probability > bestProb:
                  bestProb = probability
                  bestLabel = classValue
          return bestLabel

      def getPredictions(summaries, testSet):
          return [predict(summaries, row) for row in testSet]

      def getAccuracy(testSet, predictions):
          # Percentage of test rows whose label matches the prediction
          correct = 0
          for i in range(len(testSet)):
              if testSet[i][-1] == predictions[i]:
                  correct += 1
          return (correct / float(len(testSet))) * 100.0

      def main():
          filename = 'pima-indians-diabetes.data.csv'
          splitRatio = 0.67
          dataset = loadCsv(filename)
          trainingSet, testSet = splitDataset(dataset, splitRatio)
          print('Split {0} rows into train={1} and test={2} rows'.format(len(dataset), len(trainingSet), len(testSet)))
          # prepare model
          summaries = summarizeByClass(trainingSet)
          # test model
          predictions = getPredictions(summaries, testSet)
          accuracy = getAccuracy(testSet, predictions)
          print('Accuracy: {0}%'.format(accuracy))

      if __name__ == '__main__':
          main()
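    Two properties of the listing above are worth noting: it multiplies many small densities together, which can underflow to 0.0 when there are many attributes, and it leaves out the class prior P(h). A common remedy for the first is to sum log-densities instead of multiplying densities. The sketch below shows that variant. It is not part of the original code: the names `log_gaussian_pdf` and `predict_log` and the optional `log_priors` argument are introduced here, and `summaries` is assumed to have the same shape that `summarizeByClass` returns, i.e. {class label: [(mean, stdev), ...]}.

```python
import math

def log_gaussian_pdf(x, mean, stdev):
    """Natural log of the Gaussian density; assumes stdev > 0."""
    return (-0.5 * math.log(2 * math.pi)
            - math.log(stdev)
            - (x - mean) ** 2 / (2 * stdev ** 2))

def predict_log(summaries, input_vector, log_priors=None):
    """Pick the class with the largest summed log-density.

    log_priors is an optional, hypothetical {label: log P(label)} dict;
    when omitted, every class gets the same prior, as in the listing above.
    """
    best_label, best_score = None, -math.inf
    for label, class_summaries in summaries.items():
        score = log_priors[label] if log_priors else 0.0
        for (mean, stdev), x in zip(class_summaries, input_vector):
            score += log_gaussian_pdf(x, mean, stdev)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hand-made summaries for one attribute and two classes
summaries = {0.0: [(1.0, 0.5)], 1.0: [(5.0, 1.0)]}
print(predict_log(summaries, [1.2]))  # 0.0
print(predict_log(summaries, [4.5]))  # 1.0
```

    Because the logarithm is monotonic, the class with the largest log-score is also the class with the largest product, so the predictions agree with the original whenever the product does not underflow.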

    pima-indians-diabetes.data.csv can be downloaded from:

    https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv 

    References:

    1 https://en.wikipedia.org/wiki/Naive_Bayes_classifier

    2 https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/

    3 https://machinelearningmastery.com/naive-bayes-for-machine-learning/

  • Original article: https://www.cnblogs.com/dylancao/p/9761788.html