  • Scraping Douban reviews of 《白夜追凶》 (Day and Night) with Python and segmenting them into words

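    Recently, on the recommendation of many friends, I started binge-watching the web series 《白夜追凶》 (Day and Night). I had not watched a domestic drama since 琅琊榜 (Nirvana in Fire), and this one really is a quality production. I kept watching, and with nothing to do on the last two days of the National Day holiday, I scraped its Douban reviews to see what people were saying.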

    The code has been pushed to GitHub.

    My personal GitHub repository of Python projects: https://github.com/bytename/learnPy

    #-*-coding:utf-8-*-
    import requests
    from lxml import etree
    import jieba
    # Browser-like request headers sent with every request to Douban.
    header ={
        "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        # "br" is left out: requests can only decode Brotli when an extra package is installed.
        "Accept-Encoding":"gzip, deflate",
        "Accept-Language":"zh-CN,zh;q=0.8,en;q=0.6",
        "Connection":"keep-alive",
        "Host":"movie.douban.com",
        "Referer":"https://movie.douban.com/subject/26883064/reviews?start=20",
        "Upgrade-Insecure-Requests":"1",
        "User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36"
    }
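    def getPageNum(url):
        # Return the number of the last page listed in the paginator, as an int.
        if not url:
            return 0
        req = requests.get(url, headers=header)
        html = etree.HTML(req.text)
        pages = html.xpath(u"//div[@class='paginator']/a[last()]/text()")
        # Assumption: a page without paginator links has a single page of reviews.
        return int(pages[0]) if pages else 1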
    def getContent(url):
        # Return the text of every review excerpt on the page, or an empty list when no URL is given.
        if not url:
            return []
        req = requests.get(url, headers=header)
        html = etree.HTML(req.text)
        return html.xpath(u"//div[@class='short-content']/text()")
    
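    def getUrl(pageNum):
        # Build one URL per page; each page holds 20 reviews, so page k starts at offset (k - 1) * 20.
        dataUrl = []
        # The upper bound is pageNum + 1 so that the last page is included.
        for i in range(1, int(pageNum) + 1):
            url = "https://movie.douban.com/subject/26883064/reviews?start=%d" % ((i - 1) * 20)
            dataUrl.append(url)
        return dataUrl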
    if __name__ == '__main__':
        url = "https://movie.douban.com/subject/26883064/reviews?start=0"
        pageNum = getPageNum(url)
        data = getUrl(pageNum)
        dic = dict()
        for u in data:
            for d in getContent(u):
                # Segment each review excerpt and tally every token longer than one character.
                for i in jieba.cut(d):
                    word = i.strip()
                    if len(word) > 1:
                        dic[word] = dic.get(word, 0) + 1
        for key, values in dic.items():
            # Report only the words that occur more than once.
            if values > 1:
                print("%s===%d" % (key, values))
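
    The page URLs in the script differ only in the `start` offset. A minimal sketch of that arithmetic, assuming 20 reviews per page as the script's URLs imply; `build_page_urls` and `PAGE_SIZE` are hypothetical names used only for illustration:

```python
BASE_URL = "https://movie.douban.com/subject/26883064/reviews?start=%d"
PAGE_SIZE = 20  # assumption: 20 reviews per page, as implied by the script's URLs


def build_page_urls(page_count):
    """Return one URL per page, for pages 1..page_count inclusive."""
    # Page k (1-based) starts at offset (k - 1) * PAGE_SIZE.
    return [BASE_URL % ((k - 1) * PAGE_SIZE) for k in range(1, int(page_count) + 1)]


urls = build_page_urls("3")  # the page count may arrive as text scraped from the paginator
for u in urls:
    print(u)
```

    Accepting the page count as either a string or an int keeps the helper usable directly on the text scraped from the paginator.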
    

    Output after scraping the reviews, segmenting them, and counting the words:

    C:\Anaconda2\python.exe D:/PycharmProjects/LearnPy/lesson01/SpriderDouBan.py
    Building prefix dict from the default dictionary ...
    Loading model from cache c:\users\nc\appdata\local\temp\jieba.cache
    Loading model cost 0.379 seconds.
    Prefix dict has been built succesfully.
    结合体===2
    星期一===2
    出来===21
    第二===2
    还要===3
    应该===28
    刘副队===3
    案件===33
    发生===7
    成分===3
    诚然===2
    惊喜===7
    两天===5
    正常===10
    全剧===4
    看似===2
    关系===5
    坐等===2
    仿佛===2
    有理有据===2
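
    The counts above come out in no particular order. One possible way to rank them is `collections.Counter`; the sketch below is not part of the original script, and its token list is a small made-up sample standing in for the real jieba output:

```python
# -*- coding: utf-8 -*-
from collections import Counter

# Illustrative sample only: a few tokens standing in for the real segmented reviews.
tokens = [u"案件", u"应该", u"案件", u"惊喜", u"应该", u"案件", u"坐等"]

counts = Counter(tokens)

# Keep words seen more than once, most frequent first.
ranked = [(word, n) for word, n in counts.most_common() if n > 1]

for word, n in ranked:
    print(u"%s===%d" % (word, n))
```

    `most_common()` returns the pairs sorted by descending count, so the most frequent words are printed first.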
    
  • Original article: https://www.cnblogs.com/byteworld/p/7635615.html