  • Scrapy series: scraping Douban Movies

      One exercise a day, one blog post a day.

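      Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python. It is used to crawl websites and extract structured data from their pages, and it has a wide range of uses, including data mining, monitoring, and automated testing.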

    1. Choose the target site: Douban Movies, http://movie.douban.com/top250

    2. Create the Scrapy project: scrapy startproject doubanmovie

    3. Configure settings.py:

      

    BOT_NAME = 'doubanmovie'
    
    SPIDER_MODULES = ['doubanmovie.spiders']
    NEWSPIDER_MODULE = 'doubanmovie.spiders'
    
    
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
    
    FEED_URI = u'file:///G:/program/doubanmovie/douban.csv'  # store the scraped data in douban.csv
    FEED_FORMAT = 'csv'
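
    Newer Scrapy releases (2.1 and later) group the two feed settings above into a single FEEDS dictionary. The fragment below is a minimal sketch of the equivalent configuration, assuming such a version is installed; the output URI is the one used above, and the 'encoding' entry is an added assumption rather than part of the original settings.

```python
# settings.py fragment, assuming Scrapy 2.1 or later.
# FEEDS maps each output URI to the options for that feed.
FEEDS = {
    'file:///G:/program/doubanmovie/douban.csv': {
        'format': 'csv',     # same exporter as FEED_FORMAT = 'csv'
        'encoding': 'utf8',  # assumption: write the CSV as UTF-8
    },
}
```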

    3. Define the data fields in items.py:

      

    from scrapy import Item,Field
    
    
    class DoubanmovieItem(Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = Field()      # title (movie name)
        movieInfo = Field()  # movie information
        star = Field()       # rating
        quote = Field()      # memorable quote

    4. Create the spider doubanspider.py:

      

    import scrapy
    from doubanmovie.items import DoubanmovieItem
    
    # A plain Spider is used here: CrawlSpider reserves parse() for its own
    # rule handling, so overriding parse() on a CrawlSpider is unreliable.
    class Douban(scrapy.Spider):
        name = "douban"
        start_urls = ['http://movie.douban.com/top250']
    
        def parse(self, response):
            movies = response.xpath('//div[@class="info"]')
            for movie in movies:
                # The title is split across several <span> elements; join them.
                title = movie.xpath('div[@class="hd"]/a/span/text()').extract()
                full_title = ''.join(title)
                movie_info = movie.xpath('div[@class="bd"]/p/text()').extract()
                star = movie.xpath('div[@class="bd"]/div[@class="star"]/span/em/text()').extract()[0]
                quote = movie.xpath('div[@class="bd"]/p[@class="quote"]/span/text()').extract()
                # Some movies have no quote, so fall back to an empty string.
                quote = quote[0] if quote else ''
                # Create a fresh item for each movie instead of reusing one instance.
                item = DoubanmovieItem()
                item['title'] = full_title
                item['movieInfo'] = ';'.join(movie_info)
                item['star'] = star
                item['quote'] = quote
                yield item
            next_link = response.xpath('//span[@class="next"]/link/@href').extract()
            # Page 10 is the last page and has no next-page link.
            if next_link:
                next_url = response.urljoin(next_link[0])
                self.logger.info('Next page: %s', next_url)
                yield scrapy.Request(next_url, callback=self.parse)
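
    The parsing and pagination logic of the spider can be tried without Scrapy or network access. The sketch below uses only the standard library and runs against a hypothetical, simplified, well-formed fragment modelled on the XPath expressions in the spider; the real Douban page is larger and may differ. The names SAMPLE and parse_page are illustrative and not part of the project.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; not a copy of the real page.
SAMPLE = """
<div>
  <div class="info">
    <div class="hd"><a><span>Movie A</span><span> / Alias A</span></a></div>
    <div class="bd">
      <p>Director X</p>
      <div class="star"><span><em>9.6</em></span></div>
      <p class="quote"><span>A memorable line.</span></p>
    </div>
  </div>
  <div class="info">
    <div class="hd"><a><span>Movie B</span></a></div>
    <div class="bd">
      <p>Director Y</p>
      <div class="star"><span><em>9.0</em></span></div>
    </div>
  </div>
  <span class="next"><link href="?start=25&amp;filter=" /></span>
</div>
""".strip()


def parse_page(markup, base_url):
    """Return (list of movie dicts, absolute next-page URL or None)."""
    root = ET.fromstring(markup)
    movies = []
    for info in root.findall(".//div[@class='info']"):
        # The title is spread over several <span> elements; join them.
        spans = info.findall("div[@class='hd']/a/span")
        title = "".join(span.text or "" for span in spans)
        star = info.find("div[@class='bd']/div[@class='star']/span/em").text
        # The quote is optional, so check before reading its text.
        quote_el = info.find("div[@class='bd']/p[@class='quote']/span")
        quote = quote_el.text if quote_el is not None else ""
        movies.append({"title": title, "star": star, "quote": quote})
    # The next-page href is a bare query string; resolve it against the base URL.
    link = root.find(".//span[@class='next']/link")
    next_url = urljoin(base_url, link.get("href")) if link is not None else None
    return movies, next_url


movies, next_url = parse_page(SAMPLE, "http://movie.douban.com/top250")
```

    Resolving the link with urljoin instead of string concatenation keeps the request URL valid even if the site later switches to a different form of relative link.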

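    5. Results: if the output shows garbled characters because of an encoding problem, choose "utf-8" as the encoding when saving the file in Excel.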
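
    An alternative to fixing the encoding by hand is to write the CSV with a UTF-8 byte-order mark, which Excel uses to detect the encoding. The sketch below shows the idea with the standard csv module only; it is not how Scrapy's exporter is configured in this project, and the function name write_csv_for_excel and the sample row are hypothetical.

```python
import csv
import os
import tempfile

# Hypothetical sample row with the same kind of fields as the item class.
rows = [
    {"title": "示例电影", "star": "9.0", "quote": ""},
]


def write_csv_for_excel(path, rows):
    # "utf-8-sig" writes a byte-order mark before the data, which lets
    # Excel recognise the file as UTF-8 when it is opened directly.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "star", "quote"])
        writer.writeheader()
        writer.writerows(rows)


path = os.path.join(tempfile.mkdtemp(), "douban.csv")
write_csv_for_excel(path, rows)
```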

      

      

      

  • Original article: https://www.cnblogs.com/alarm1673/p/4815036.html