  • Scrapy Study Notes: Scraping the Douban Top 250

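    Scrapy is a fairly high-level library. It is built on top of lower-level libraries rather than directly on the network layer, so it depends on many other packages. Its networking is event-driven and asynchronous (it runs on Twisted).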

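    These notes walk through crawling web pages with Scrapy, using the Douban Movie Top 250 as the example.
    First, open a command prompt and create the project: scrapy startproject douban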

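    To start the spider from a Python script, use the cmdline module that Scrapy provides:

    from scrapy import cmdline

    # The argument after "crawl" must match the spider's `name` attribute ("douban").
    cmdline.execute("scrapy crawl douban".split())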
    

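    Configure settings.py:

    # Wait about 2 seconds between consecutive requests to the same site
    DOWNLOAD_DELAY = 2
    # Vary the wait randomly around DOWNLOAD_DELAY instead of using a fixed value
    RANDOMIZE_DOWNLOAD_DELAY = True
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
    COOKIES_ENABLED = True

    # Export the scraped items to douban.csv
    # (newer Scrapy releases replace these two settings with FEEDS)
    FEED_URI = u'file:douban.csv'
    FEED_FORMAT = 'csv'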
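
    The effect of combining DOWNLOAD_DELAY with RANDOMIZE_DOWNLOAD_DELAY can be pictured with a small sketch. Scrapy's documentation describes the wait as a random value between 0.5 and 1.5 times DOWNLOAD_DELAY. The function name randomized_delay below is made up for illustration; this is a rough model of that documented behaviour, not Scrapy's own code.

```python
import random

DOWNLOAD_DELAY = 2  # seconds, the same value as in settings.py


def randomized_delay(base_delay, rng=random):
    """Return a wait time drawn uniformly from [0.5 * base, 1.5 * base].

    Hypothetical helper: it only models the documented range and is not
    part of Scrapy's API.
    """
    return rng.uniform(0.5 * base_delay, 1.5 * base_delay)


if __name__ == "__main__":
    # Print a few sample waits; with a base of 2 each one lies
    # between 1.0 and 3.0 seconds.
    for _ in range(5):
        print(round(randomized_delay(DOWNLOAD_DELAY), 2))
```

    With a delay of 2, requests are therefore spaced roughly 1 to 3 seconds apart, which looks less mechanical to the server than a fixed interval.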
    
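    Define the item fields in items.py:

    from scrapy import Item, Field

    class DoubanMovieItem(Item):
        title = Field()      # full movie title
        movieInfo = Field()  # director, cast, year and similar details
        star = Field()       # rating
        quote = Field()      # one-line quote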
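
    To get a feel for what the CSV feed holds, the sketch below writes dict-like items with the same four field names using Python's csv module. It is only an approximation under assumptions: the helper items_to_csv and the sample row are invented for illustration, and the column order in Scrapy's real export may differ.

```python
import csv
import io

# Field names copied from DoubanMovieItem.
FIELDS = ["title", "movieInfo", "star", "quote"]


def items_to_csv(items):
    """Serialise dict-like items to CSV text, one row per item.

    Hypothetical helper for illustration; Scrapy's feed exporter does
    this work itself when FEED_FORMAT is 'csv'.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for item in items:
        writer.writerow(item)
    return buffer.getvalue()


# Placeholder row, invented for illustration (not scraped data).
sample = [
    {
        "title": "Example Movie",
        "movieInfo": "Director: Someone; 1999",
        "star": "9.0",
        "quote": "An example quote.",
    }
]

if __name__ == "__main__":
    print(items_to_csv(sample))
```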
    

    The main spider:

    from scrapy import Request
    from scrapy.spiders import Spider

    from douban.items import DoubanMovieItem


    class Douban(Spider):
        name = "douban"
        start_urls = ["https://movie.douban.com/top250"]

        def parse(self, response):
            print("--- crawled page ---")
            print(response.url)

            movies = response.xpath("//div[@class='info']")
            for eachMovie in movies:
                # Create a fresh item per movie so yielded items do not share state
                item = DoubanMovieItem()

                # The title is split across several <span> elements; join them
                title = eachMovie.xpath("div[@class='hd']/a/span/text()").extract()
                fullTitle = "".join(title)

                movieInfo = eachMovie.xpath("div[@class='bd']/p/text()").extract()
                # Rating; note that XPath positions start at 1, not 0
                star = eachMovie.xpath("div[@class='bd']/div[@class='star']/span[2]/text()").extract_first(default='')
                # A well-known one-line quote about the movie (empty if missing)
                quote = eachMovie.xpath("div[@class='bd']/div[@class='star']/span[4]/text()").extract_first(default='')

                item['title'] = fullTitle
                item['movieInfo'] = ";".join(movieInfo)
                item['star'] = star
                item['quote'] = quote
                yield item

            # Follow the "next page" link until there is none left
            nextLink = response.xpath("//div[@class='paginator']/span[@class='next']/a/@href").extract_first()
            if nextLink:
                print("next page", nextLink)
                # The href is relative, so resolve it against the current page URL
                yield Request(response.urljoin(nextLink), callback=self.parse)
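
    The two ideas the spider relies on, 1-based XPath positions and resolving the relative "next page" link, can be tried without Scrapy or network access. The sketch below uses only the standard library. The markup in PAGE is hand-written to imitate the structure the spider's XPath expressions expect, not copied from the live page, and parse_page is a name made up for illustration. ElementTree supports only a subset of XPath, so elements are selected and their .text is read instead of using text().

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hand-written fragment for illustration only; it mirrors the element
# layout the spider's XPath expressions assume, not the real page.
PAGE = """
<html><body>
  <div class="info">
    <div class="hd"><a href="#"><span>Example Title</span><span> / Alt Title</span></a></div>
    <div class="bd">
      <p>Director: Someone</p>
      <div class="star">
        <span>stars</span><span>9.0</span><span>best</span><span>An example quote.</span>
      </div>
    </div>
  </div>
  <div class="paginator"><span class="next"><a href="?start=25&amp;filter=">next</a></span></div>
</body></html>
"""

BASE_URL = "https://movie.douban.com/top250"


def parse_page(xml_text, base_url):
    """Return (list of movie dicts, absolute next-page URL or None)."""
    root = ET.fromstring(xml_text.strip())
    movies = []
    for info in root.findall(".//div[@class='info']"):
        # Join the title pieces that are spread over several <span> elements.
        spans = info.findall("div[@class='hd']/a/span")
        title = "".join(span.text or "" for span in spans)
        # span[2] is the SECOND span: XPath positions start at 1.
        star = info.find("div[@class='bd']/div[@class='star']/span[2]")
        quote = info.find("div[@class='bd']/div[@class='star']/span[4]")
        movies.append({
            "title": title,
            "star": star.text if star is not None else "",
            "quote": quote.text if quote is not None else "",
        })
    link = root.find(".//div[@class='paginator']/span[@class='next']/a")
    # The href is only a query string, so it must be joined to the base URL.
    next_url = urljoin(base_url, link.get("href")) if link is not None else None
    return movies, next_url


if __name__ == "__main__":
    found, next_page = parse_page(PAGE, BASE_URL)
    print(found)
    print(next_page)
```

    On the last page the paginator has no link inside the "next" span, so the lookup returns None and the crawl stops, which is the same condition the spider checks before yielding another Request.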
    
    
  • Original article: https://www.cnblogs.com/liweiwei1419/p/7152882.html