  • NBC (Naive Bayes Classifier): Machine Learning in Action, Python code

    Here p(y=1|x) is computed with the naive Bayes model, using the estimate p(xi|y=1)=|Dc,xi|/|Dc| given in Zhou Zhihua's book Machine Learning.
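
Written out, the estimate and the decision rule look roughly as follows. The smoothed form is how I read trainNbc in the listing below (counts start at 1 and totals at 2.0, so N_i = 2 for a 0/1 feature); treat it as a sketch of the idea rather than a quotation from the book.

```latex
% Unsmoothed estimate for class c and value x_i of feature i
P(x_i \mid c) = \frac{|D_{c,x_i}|}{|D_c|}

% Laplace-smoothed form; N_i is the number of values feature i can take
\hat{P}(x_i \mid c) = \frac{|D_{c,x_i}| + 1}{|D_c| + N_i}

% Decision rule in log space, which avoids underflow from multiplying many small probabilities
\hat{y} = \arg\max_{c \in \{0,1\}} \Big[ \log P(c) + \sum_{i} \log \hat{P}(x_i \mid c) \Big]
```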

    It can also be based on the event model for text classification.
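
Assuming "event model" here means the usual distinction between the multivariate Bernoulli model (word present or absent) and the multinomial model (word counts), the difference shows up in how a document is turned into a feature vector. The helper names set_of_words_vec and bag_of_words_vec are made up for this illustration and are not part of the code in this post.

```python
from collections import Counter


def set_of_words_vec(vocab, doc):
    """Bernoulli event model: 1 if the word occurs in doc, otherwise 0."""
    present = set(doc)
    return [1 if word in present else 0 for word in vocab]


def bag_of_words_vec(vocab, doc):
    """Multinomial event model: how many times each word occurs in doc."""
    counts = Counter(doc)
    return [counts[word] for word in vocab]


vocab = ['dog', 'stupid', 'park']
doc = ['stupid', 'dog', 'stupid']
print(set_of_words_vec(vocab, doc))  # [1, 1, 0]
print(bag_of_words_vec(vocab, doc))  # [1, 2, 0]
```

The listing in this post uses the 0/1 representation, so a repeated word counts only once.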

    See http://blog.csdn.net/app_12062011/article/details/50540429 for a detailed introduction.

    The code follows the style used in Machine Learning in Action.

    # -*- coding: utf-8 -*-
    """
    Created on Mon Aug 07 23:40:13 2017
    
    @author: mdz
    """
    import numpy as np
    def loadData():
        # Tokenized posts, one list of words per document
        vocabList=[['fuck', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
        classList=[1,1,0,1,0,1]  # 1: abusive text, 0: normal speech
        return vocabList,classList
    
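    # Remove duplicate words from the already-tokenized sentences in vocabList
    # and return the result as a list.
    # The length of that list is the number of features.
    def filterVocabList(vocabList):
        vocabSet=set([])
        for document in vocabList:
            vocabSet=vocabSet|set(document)
        return list(vocabSet)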
    
    # Convert a sample (a list of words) into a 0-1 vector over the vocabulary
    def zero_one(vocabList,input):
        returnVec=[0]*len(vocabList)
        for word in input:
            if word in vocabList:
                returnVec[vocabList.index(word)]=1
            else:
                print("the word: %s is not in my Vocabulary!"%word)
        return returnVec
    
    def trainNbc(trainSamples,trainCategory):
        numTrainSamp=len(trainSamples)
        numWords=len(trainSamples[0])
        pAbusive=sum(trainCategory)/float(numTrainSamp)
        # Per class (y=1 or 0): number of samples in which each feature equals 1.
        # Starting from 1 instead of 0 is Laplace smoothing.
        p0Num=np.ones(numWords)
        p1Num=np.ones(numWords)
        # Per class (y=1 or 0): sample count.
        # Starting from 2.0 because each feature has 2 possible values.
        p0NumTotal=2.0
        p1NumTotal=2.0
        for i in range(numTrainSamp):
            if trainCategory[i]==1:
                p1Num+=trainSamples[i]
                p1NumTotal+=1
            else:
                p0Num+=trainSamples[i]
                p0NumTotal+=1
        p1Vec=p1Num/float(p1NumTotal)
        p0Vec=p0Num/float(p0NumTotal)
        return p1Vec,p0Vec,pAbusive
    
    def classifyOfNbc(testSamples,p1Vec,p0Vec,pAbusive):
        # Log-posterior of each class, up to a constant shared by both classes
        p1=sum(testSamples*np.log(p1Vec))+sum((1-testSamples)*np.log(1-p1Vec))+np.log(pAbusive)
        # Class 0 must use its own prior, 1-pAbusive
        p0=sum(testSamples*np.log(p0Vec))+sum((1-testSamples)*np.log(1-p0Vec))+np.log(1-pAbusive)
        if p1>p0:
            return 1
        else:
            return 0
    
    def testingNbc():
        vocabList,classList=loadData()
        vocabSet=filterVocabList(vocabList)
        # Turn every training document into a 0-1 vector
        trainList=[]
        for term in vocabList:
            trainList.append(zero_one(vocabSet,term))
        p1Vec,p0Vec,pAbusive=trainNbc(np.array(trainList),np.array(classList))
        testEntry=['fuck','my','daughter']
        testSamples=np.array(zero_one(vocabSet,testEntry))
        print("%s classified as : %s"%(testEntry,classifyOfNbc(testSamples,p1Vec,p0Vec,pAbusive)))
        testEntry=['stupid','garbage']
        testSamples=np.array(zero_one(vocabSet,testEntry))
        print("%s classified as : %s"%(testEntry,classifyOfNbc(testSamples,p1Vec,p0Vec,pAbusive)))
    
    '''
    Save the code above as bayesClassify.py.
    
    Console input:
    >>> import bayesClassify
    >>> bayesClassify.testingNbc()
    
    Output:
    the word: daughter is not in my Vocabulary!
    ['fuck', 'my', 'daughter'] classified as : 1
    ['stupid', 'garbage'] classified as : 1
    '''
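
The decision can also be read as a single log-odds value: the log prior ratio plus one term per vocabulary word, positive meaning class 1. Below is a minimal pure-Python sketch of that view on the same six posts. The names train_bernoulli_nb and log_odds, and the extra test entry ['love', 'my', 'dalmation'], are assumptions added for illustration and are not part of bayesClassify.py.

```python
import math

docs = [['fuck', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
        ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
        ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
        ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
        ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
        ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
labels = [1, 1, 0, 1, 0, 1]  # 1: abusive, 0: normal


def train_bernoulli_nb(docs, labels):
    """Estimate Laplace-smoothed P(word present | class) and P(y=1)."""
    vocab = sorted(set(word for doc in docs for word in doc))
    n1 = sum(labels)
    n0 = len(labels) - n1
    p1, p0 = {}, {}
    for word in vocab:
        c1 = sum(1 for doc, y in zip(docs, labels) if y == 1 and word in doc)
        c0 = sum(1 for doc, y in zip(docs, labels) if y == 0 and word in doc)
        p1[word] = (c1 + 1.0) / (n1 + 2.0)  # +1 / +2: two possible values per feature
        p0[word] = (c0 + 1.0) / (n0 + 2.0)
    return vocab, p1, p0, n1 / float(len(labels))


def log_odds(entry, vocab, p1, p0, prior1):
    """Return log P(y=1|x) - log P(y=0|x); words outside vocab are ignored."""
    present = set(entry)
    score = math.log(prior1) - math.log(1.0 - prior1)  # each class has its own prior
    for word in vocab:
        if word in present:
            score += math.log(p1[word]) - math.log(p0[word])
        else:
            score += math.log(1.0 - p1[word]) - math.log(1.0 - p0[word])
    return score


model = train_bernoulli_nb(docs, labels)
for entry in (['stupid', 'garbage'], ['love', 'my', 'dalmation']):
    print("%s classified as : %d" % (entry, 1 if log_odds(entry, *model) > 0 else 0))
```

Absent words contribute too, through the log(1 - p) terms, which is what distinguishes this Bernoulli scoring from a multinomial one that only sums over the words that occur.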
      
    

      

  • Original article: https://www.cnblogs.com/mdz-great-world/p/7308210.html