  • Web Crawler Comprehensive Assignment

    Assignment source: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/3159

    Scraping reviews of highly rated Douban movies

    1. Analyze the web page first

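    On Douban, reading movie reviews requires a logged-in user. To scrape the review pages you therefore need to register an account, log in, capture the cookie, and use it to simulate a logged-in user.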
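
    A minimal sketch of the cookie step, assuming the Cookie header was copied by hand from a logged-in browser session. The helper name parse_cookie_header and the cookie values are placeholders, not real Douban credentials.

```python
def parse_cookie_header(raw_cookie):
    """Split a 'k1=v1; k2=v2' Cookie header string into a dict."""
    cookies = {}
    for pair in raw_cookie.split(';'):
        if '=' not in pair:
            continue  # skip empty or malformed fragments
        key, value = pair.strip().split('=', 1)
        cookies[key] = value
    return cookies


# Placeholder header; replace it with the one copied from the browser's developer tools.
RAW_COOKIE = 'bid=PLACEHOLDER; ck=PLACEHOLDER2'
cookies = parse_cookie_header(RAW_COOKIE)
print(cookies)
```

    With requests, the resulting dict can be attached to a session through session.cookies.update(cookies), so that every later session.get(...) call sends it.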

    import requests
    from lxml import etree

    session = requests.Session()
    for start in range(0, 250, 25):        # 10 pages of 25 movies: start=0, 25, ..., 225
        URL = 'http://movie.douban.com/top250/?start=' + str(start)
        req = session.get(URL)
        req.encoding = 'utf8'              # set the page encoding
        root = etree.HTML(req.content)     # parse the response body into an Element tree
        items = root.xpath('//ol/li/div[@class="item"]')
        for item in items:
            rank, name, alias, rating_num, quote, url = "", "", "", "", "", ""
            try:
                url = item.xpath('./div[@class="pic"]/a/@href')[0]
                rank = item.xpath('./div[@class="pic"]/em/text()')[0]
                title = item.xpath('./div[@class="info"]//a/span[@class="title"]/text()')
                # the gb2312 round trip drops characters the codec cannot represent
                name = title[0].encode('gb2312', 'ignore').decode('gb2312')
                alias = title[1].encode('gb2312', 'ignore').decode('gb2312') if len(title) == 2 else ""
                rating_num = item.xpath('.//div[@class="bd"]//span[@class="rating_num"]/text()')[0]
                quote_tag = item.xpath('.//div[@class="bd"]//span[@class="inq"]')
                if len(quote_tag) != 0:    # not every movie has a one-line quote
                    quote = quote_tag[0].text.encode('gb2312', 'ignore').decode('gb2312').replace('\xa0', '')
                print(rank, rating_num, quote)
                print(name, alias.replace('/', ','))
            except Exception as e:
                print('failed!', e)
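
    The repeated .encode('gb2312', 'ignore').decode('gb2312') calls in the loop drop every character that GB2312 cannot represent, presumably so that print does not fail on a GB2312/GBK console (that motivation is an assumption; the post does not state it). A small sketch of the effect, using a hypothetical helper name:

```python
def strip_non_gb2312(text):
    """Drop characters that the GB2312 codec cannot encode."""
    return text.encode('gb2312', 'ignore').decode('gb2312')


# '\xa0' (non-breaking space) is not part of GB2312, so it is removed.
print(repr(strip_non_gb2312('A\xa0/\xa0B')))
```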

    2. Get the information for each movie

    import requests
    from bs4 import BeautifulSoup

    def get_html(web_url):  # fetch one listing page and return its movie entries
        header = {
            "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16"}
        html = requests.get(url=web_url, headers=header).text  # without .text you get a Response object; with it, a string
        soup = BeautifulSoup(html, "lxml")
        data = soup.find("ol").find_all("li")  # return only the part you need, so the result is filtered down to the li entries
        return data

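    requests.get() requests the URL passed in its arguments and returns a Response object.

    .text gives the body of that Response object decoded as a str.

    find_all() picks out the content of every li tag under the ol tag in the HTML text.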

    3. Write the data to the database through a MySQL pipeline:

    First create the database and the table in MySQL; the table's columns should match the data to be inserted.

    Connect to the database (after import pymysql, with PWD holding the password):
    db = pymysql.connect(host='127.0.0.1', port=3306, user='root', passwd=PWD, db='douban', charset='utf8')
    Create a cursor:
    cur = db.cursor()

    Insert the collected movie information into the database:

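    # rank is a reserved word in MySQL 8.0+, so it is quoted with backticks
    sql = ("INSERT INTO test(`rank`, NAME, score, country, year, "
           "category, votes, douban_url) values(%s,%s,%s,%s,%s,%s,%s,%s)")
    try:
        cur.executemany(sql, movies_info)  # movies_info: a list of 8-item tuples, one per movie
        db.commit()
    except Exception as e:
        print("Error:", e)
        db.rollback()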
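
    The commit-or-rollback pattern above can be tried without a MySQL server. The sketch below swaps in Python's built-in sqlite3 module as a stand-in, so it is not the author's pymysql setup: sqlite3 uses ? placeholders where pymysql uses %s, the column types are guesses, and the two rows are made-up placeholders rather than scraped data.

```python
import sqlite3

db = sqlite3.connect(':memory:')  # throwaway in-memory database
cur = db.cursor()

# Column names follow the INSERT statement; the types are assumptions.
cur.execute(
    'CREATE TABLE test ("rank" INTEGER, name TEXT, score REAL, country TEXT, '
    'year INTEGER, category TEXT, votes INTEGER, douban_url TEXT)'
)

# Made-up placeholder rows, one tuple per movie, in column order.
movies_info = [
    (1, 'Example Movie A', 9.0, 'Country A', 2000, 'Drama', 1000, 'https://example.com/a'),
    (2, 'Example Movie B', 8.5, 'Country B', 2001, 'Comedy', 500, 'https://example.com/b'),
]

sql = ('INSERT INTO test("rank", name, score, country, year, '
       'category, votes, douban_url) VALUES (?,?,?,?,?,?,?,?)')
try:
    cur.executemany(sql, movies_info)
    db.commit()       # make the whole batch permanent
except Exception as e:
    print("Error:", e)
    db.rollback()     # undo the partial batch on failure

row_count = cur.execute('SELECT COUNT(*) FROM test').fetchone()[0]
print(row_count)
```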

    4. Analyze the top 10 movies by number of reviewers and generate a chart
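
    The post gives no code for this step. A minimal sketch of the ranking part, assuming each scraped row is a dict whose people field looks like '1234人评价' (the format the spider's .star span selector is assumed to yield). The helper names and sample rows are hypothetical, and the text bars only stand in for a real chart.

```python
import re


def parse_votes(people):
    """Extract the first integer from a string such as '1234人评价'."""
    match = re.search(r'\d+', people)
    return int(match.group()) if match else 0


def top_by_votes(rows, n=10):
    """Return the n rows with the most reviewers, largest first."""
    return sorted(rows, key=lambda row: parse_votes(row['people']), reverse=True)[:n]


# Made-up sample rows, not scraped data.
rows = [
    {'title': 'Example A', 'people': '1200人评价'},
    {'title': 'Example B', 'people': '3400人评价'},
    {'title': 'Example C', 'people': '560人评价'},
]

top = top_by_votes(rows, n=2)
for row in top:
    votes = parse_votes(row['people'])
    # one '#' per 200 reviewers as a crude text bar
    print(f"{row['title']:<10} {'#' * (votes // 200)} {votes}")
```

    For an actual chart, the same sorted list can be handed to whichever plotting library is preferred.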

    5. Main code

    import scrapy

    from doubanTop250.items import Doubantop250Item


    class DoubanSpider(scrapy.Spider):
        name = 'douban'
        allowed_domains = ['douban.com']
        start_urls = ['https://movie.douban.com/top250/']

        def parse(self, response):
            lis = response.css('.info')
            for li in lis:
                item = Doubantop250Item()
                # extract the fields with CSS selectors
                name = li.css('.hd span::text').extract()
                title = ''.join(name)
                info = li.css('p::text').extract()[1].replace('\n', '').strip()
                score = li.css('.rating_num::text').extract_first()
                people = li.css('.star span::text').extract()[1]
                words = li.css('.inq::text').extract_first()
                # fill in the item
                item['title'] = title
                item['info'] = info
                item['score'] = score
                item['people'] = people
                item['words'] = words
                yield item

            # get the next-page link and follow it
            next_page = response.css('.next a::attr(href)').extract_first()
            if next_page:
                url = response.urljoin(next_page)
                yield scrapy.Request(url=url, callback=self.parse)
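
    The pagination step relies on response.urljoin turning the relative href of the next-page link into an absolute URL. The standard library's urllib.parse.urljoin resolves relative references in the same way and can be used to preview what the spider will request. The ?start=25&filter= value is an assumption about what the link contains.

```python
from urllib.parse import urljoin

page_url = 'https://movie.douban.com/top250/'
next_href = '?start=25&filter='  # assumed value of the next-page link's href

# A query-only reference keeps the base path and replaces the query string.
next_url = urljoin(page_url, next_href)
print(next_url)
```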
    
    The generated items.py file is the container that holds the scraped data. Modify its code as follows.
    
    import scrapy


    class Doubantop250Item(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()
        info = scrapy.Field()
        score = scrapy.Field()
        people = scrapy.Field()
        words = scrapy.Field()
  • Original article: https://www.cnblogs.com/binguo666/p/10828802.html