The Road to Mathematics (3) - Machine Learning (3) - Machine Learning Algorithms - Bayes' Theorem (6)

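We can also read each document's title and prepend it to the body, so that the title's words are included in the text that is segmented when computing the prior probabilities. The title usually states the topic of the whole article.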
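```
        # Python 2 fragment (print statements)
        if len(page_content.strip())>0:
            ybtxt[ci].append(page_content)
            try:
                print my_soup.title.string.encode('gb2312')
                # Prepend the title so its words are counted along with the body
                page_content=my_soup.title.string+page_content
            except (AttributeError, TypeError, UnicodeEncodeError):
                # No usable <title>, or the title cannot be encoded as gb2312
                print "...."
            finally:
                print "-done."
```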
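The idea of prepending the title can also be sketched without a bare `try`/`except`. The following is a minimal Python 3 sketch, not the author's code: it uses the standard library's `html.parser` instead of BeautifulSoup, and the names `TitleParser` and `with_title` are hypothetical helpers introduced only for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def with_title(html, body_text):
    """Return body_text with the page title prepended (hypothetical helper)."""
    parser = TitleParser()
    parser.feed(html)
    title = parser.title.strip()
    # A page without a <title> simply contributes no extra words.
    # The space keeps the last title word from fusing with the first body word.
    return (title + " " + body_text) if title else body_text
```

A page with no `<title>` leaves the body unchanged, so no exception handling is needed around the call.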
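Running the Python program gives good classification results. In the output below, the status lines 读取待分类文本 and 计算待分类文本后验概率 mean "reading the texts to classify" and "computing the posterior probability of the text to classify"; car.txt is assigned to 汽车 (automobiles) and war3.txt to 军事 (military).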

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
读取待分类文本
计算待分类文本后验概率
car.txt:汽车
计算待分类文本后验概率
war3.txt:军事
>>>

All content on this blog is original. If you repost it, please credit the source.

 http://blog.csdn.net/myhaspl/

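Next, we modify the program so that it fetches and classifies unseen web pages directly. Two page links are read below, and each page is assigned to one of the following classes. The Chinese class name is what the program prints; each class is trained from the listing page shown next to it.

汽车 (automobiles) http://finance.chinanews.com/auto/gd.shtml
财经 (finance) http://finance.chinanews.com/cj/gd.shtml
健康 (health) http://www.chinanews.com/jiankang.shtml
教育 (education) http://www.chinanews.com/jiaoyu.shtml
军事 (military) http://www.chinanews.com/mil/news.shtml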

Fetching the web pages
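```
## Read the texts to classify (Python 2)
import urllib2
from bs4 import BeautifulSoup

print u"\n读取待分类文本"   # "reading the texts to classify"
ftestlinks=[]
ftestlinks.append(r'http://www.chinanews.com/edu/2013/09-17/5296319.shtml')
ftestlinks.append(r'http://finance.chinanews.com/auto/2013/09-16/5290491.shtml')
for mypage in ftestlinks:
    my_page=urllib2.urlopen(mypage)
    # The pages are served as gb2312, so tell BeautifulSoup explicitly
    my_soup = BeautifulSoup(my_page,from_encoding="gb2312")
  .............................
    # Prepend the title, as was done for the training samples
    page_content=my_soup.title.string+page_content
    print u"%s读取成功."%mypage   # "<url> read successfully."
```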
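For readers on Python 3, where `urllib2` no longer exists, the download step might look like the sketch below. It is an illustration under assumptions rather than a port of the author's program: `decode_page` and `fetch_page` are hypothetical names, the 10-second timeout is an arbitrary choice, and undecodable bytes are replaced instead of raising so that one bad character does not abort a crawl.

```python
import urllib.request


def decode_page(raw_bytes, encoding="gb2312"):
    """Decode raw page bytes; undecodable bytes become U+FFFD instead of raising."""
    return raw_bytes.decode(encoding, errors="replace")


def fetch_page(url, encoding="gb2312", timeout=10):
    """Download url and return its decoded text (requires network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return decode_page(response.read(), encoding)
```

`fetch_page` would be called once per entry in `ftestlinks`; `decode_page` is kept separate so the decoding can be tried on local bytes without a network connection.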
Computing the posterior probability
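```
    # Compute the posterior probability of the text to classify
    print u"计算待分类文本后验概率"   # "computing the posterior probability of the text to classify"
    wordgl=None
    # One running probability per class, each starting at 1.0
    testgl=np.repeat(1.,len(yb_txt))
    if len(page_content.strip())>0:
        # Segment the page text into words with jieba
        ftest_seg_list = jieba.cut(page_content)
        for myword in ftest_seg_list:
            myword=myword.encode('gbk')
            # Skip stop words. A Chinese character is 2 bytes in GBK, so len > 2
            # drops single Chinese characters (and ASCII tokens of 1 or 2 letters)
            if myword.strip() not in f_stop_seg_list and len(myword.strip())>2:
                for i in xrange(0,len(yb_txt)):
............................
```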
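The inner loop is elided in the post, so the exact update applied to `testgl` is not known. As an illustration only, here is one common way to score a token list with naive Bayes in Python 3: per-class word counts, add-one (Laplace) smoothing, and log probabilities. The function names `train` and `log_posteriors`, the smoothing choice, and the toy data in the usage note are all assumptions, not taken from the original program.

```python
import math
from collections import Counter


def train(samples_by_class):
    """samples_by_class maps a class name to a list of tokenized documents."""
    word_counts = {c: Counter(w for doc in docs for w in doc)
                   for c, docs in samples_by_class.items()}
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total_docs = sum(len(docs) for docs in samples_by_class.values())
    # Prior P(class) = share of training documents in that class.
    priors = {c: len(docs) / total_docs for c, docs in samples_by_class.items()}
    return word_counts, vocabulary, priors


def log_posteriors(tokens, word_counts, vocabulary, priors):
    """Unnormalized log P(class | tokens) with add-one smoothing."""
    scores = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())
        score = math.log(priors[c])
        for w in tokens:
            # Add-one smoothing keeps an unseen word from zeroing out a class.
            score += math.log((counts[w] + 1) / (total + len(vocabulary)))
        scores[c] = score
    return scores
```

With a toy training set such as `{"auto": [["engine", "car", "car"], ["car", "wheel"]], "military": [["army", "tank"], ["navy", "army"]]}`, the tokens `["car", "engine"]` score higher for `"auto"` than for `"military"`.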
Finding the class with the highest probability
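```
        # Pick the class with the highest probability
        maxgl=0.
        mychoice=0
        for ti in xrange(0,len(yb_txt)):
            if testgl[ti]>maxgl:
                maxgl=testgl[ti]
                mychoice=ti
        # Print the URL and, on the next line, the chosen class name
        print "\n\n%s\n:%s"%(mypage,txt_class[mychoice][0])
```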
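Since `testgl` starts at 1.0 for every class, the elided loop presumably multiplies per-word probabilities into it. For a long page such a product can underflow to 0.0 for every class, in which case the selection loop above leaves `mychoice` at 0 regardless of the text. The Python 3 sketch below illustrates the effect with made-up probabilities and shows that comparing sums of logarithms avoids it; `product` and `best_class` are hypothetical helpers.

```python
import math


def product(values):
    """Multiply all values together, starting from 1.0."""
    result = 1.0
    for v in values:
        result *= v
    return result


def best_class(scores):
    """Return the index of the largest score (the first index wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


# 400 words; class 1 fits the text better than class 0 (illustrative numbers).
probs_class0 = [1e-4] * 400
probs_class1 = [1e-3] * 400

# Both raw products underflow to 0.0, so the comparison degenerates to a tie.
raw_scores = [product(probs_class0), product(probs_class1)]

# Sums of logs stay finite and keep the classes distinguishable.
log_scores = [sum(math.log(p) for p in probs_class0),
              sum(math.log(p) for p in probs_class1)]

choice_from_raw = best_class(raw_scores)   # tie at 0.0, falls back to class 0
choice_from_logs = best_class(log_scores)  # class 1
```

Working in log space changes only the scale of the scores, not their order, so the selected class is the same whenever the raw products are still representable.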
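After running it, the results look good: the education article is assigned to 教育 (education) and the automobile article to 汽车 (automobiles). In the output, 爬取汽车类网页 means "crawling automobile-class pages" and 读取成功 means "read successfully".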

>>> runfile(r'K:\book_prog\text_bayes2.py', wdir=r'K:\book_prog')
. . . . .
爬取汽车类网页:http://finance.chinanews.com/auto/gd.shtml
http://www.chinanews.com/auto/2013/09-18/5301023.shtml
http://www.chinanews.com/auto/2013/09-18/5301017.shtml
http://www.chinanews.com/auto/2013/09-18/5300854.shtml
.....................

读取待分类文本
http://www.chinanews.com/edu/2013/09-17/5296319.shtml读取成功.
计算待分类文本后验概率

http://www.chinanews.com/edu/2013/09-17/5296319.shtml
:教育
http://finance.chinanews.com/auto/2013/09-16/5290491.shtml读取成功.
计算待分类文本后验概率

http://finance.chinanews.com/auto/2013/09-16/5290491.shtml
:汽车
>>>
Original article: https://www.cnblogs.com/riskyer/p/3329010.html