  • Scrapy


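    Here I use the first example from the official documentation, crawling http://quotes.toscrape.com, as my first Scrapy spider. I could not find Chinese documentation for Scrapy 1.5, so parts of what follows are my own translation of the official docs. (Ad: if you need translation work, feel free to contact me; I have translated three published English books, two of them on my own, LOL.) The steps are: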

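    1. In CMD, go to the directory where you want to keep the code and run: scrapy startproject myspiders, where myspiders can be whatever name you want for the project directory.
    2. Scrapy automatically creates a directory named myspiders and initializes some content inside it.
    3. Go to the myspiders/myspiders/spiders directory (the spiders directory inside the generated package), create a file named quotestoscrape.py, and add the following code: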
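    import scrapy


    class BlogSpider(scrapy.Spider):
        # Name used on the command line: scrapy crawl quotestoscrape
        name = 'quotestoscrape'

        def start_requests(self):
            # Pages the crawl starts from
            urls = ['http://quotes.toscrape.com/']
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            # Each quote on the page sits in its own <div class="quote">
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.xpath('span/small/text()').extract_first(),
                }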

    After saving, switch back to CMD and run scrapy crawl quotestoscrape. Before showing the result, I want to briefly explain this code:

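    • First, my testing shows that the start_requests(self) method is not required; a class attribute named start_urls, holding a list of URLs, works as well. Still, I think it is better to follow a standard way of writing it. If the method is defined, the documentation says it must return an iterable of Requests (either a list of requests or a generator function). These tell the spider which address or addresses to start crawling from, and subsequent requests are generated successively from them.
    • Each request downloads some content from the server, and the parse() method handles that content. Its response parameter holds the content of the whole page, which you can then process further with other methods.
    • The yield keyword represents another Python feature: generators. It just occurred to me that I have never written about them, though I know what they are. I will write about them when I get the chance.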
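Since generators come up in the last bullet, here is a minimal plain-Python sketch of what yield does in a callback such as parse(). It does not use Scrapy at all; the function name parse_rows and the rows list are hypothetical, and the two quotes are copied from the crawl output further down.

```python
import types


def parse_rows(rows):
    """Yield one dict per (text, author) pair, one at a time."""
    for text, author in rows:
        # Execution pauses here until the caller asks for the next item
        yield {'text': text, 'author': author}


rows = [
    ('“A day without sunshine is like, you know, night.”', 'Steve Martin'),
    ('“Try not to become a man of success. Rather become a man of value.”',
     'Albert Einstein'),
]

items = parse_rows(rows)   # nothing has run yet, this is only a generator
is_generator = isinstance(items, types.GeneratorType)
first = next(items)        # runs the body up to the first yield
rest = list(items)         # drains whatever is left
print(is_generator, first['author'], len(rest))
```

Calling the function only creates the generator; the loop body runs as items are requested, which is why a spider can hand results to Scrapy one by one instead of building a full list first.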


    2018-04-19 15:56:07 [scrapy.core.engine] INFO: Spider opened
    2018-04-19 15:56:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2018-04-19 15:56:07 [scrapy.extensions.telnet] DEBUG: Telnet console listening on
    2018-04-19 15:56:07 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
    2018-04-19 15:56:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
    2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
    {'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein'}
    2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
    {'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling'}
    2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
    {'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein'}
    2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
    {'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen'}
    2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
    {'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe'}
    2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
    {'text': '“Try not to become a man of success. Rather become a man of value.”', 'author': 'Albert Einstein'}
    2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
    {'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'author': 'André Gide'}
    2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
    {'text': "“I have not failed. I've just found 10,000 ways that won't work.”", 'author': 'Thomas A. Edison'}
    2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
    {'text': "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", 'author': 'Eleanor Roosevelt'}
    2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
    {'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin'}
    2018-04-19 15:56:07 [scrapy.core.engine] INFO: Closing spider (finished)
    2018-04-19 15:56:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 446,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 2701,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 1,
     'downloader/response_status_count/404': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2018, 4, 19, 19, 56, 7, 908603),
     'item_scraped_count': 10,
     'log_count/DEBUG': 13,
     'log_count/INFO': 7,
     'response_received_count': 2,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2018, 4, 19, 19, 56, 7, 400951)}
    2018-04-19 15:56:07 [scrapy.core.engine] INFO: Spider closed (finished)


    error: No module named win32api

    When running the final command, you may get an error saying win32api cannot be found. Installing the following module fixes it: pip install pypiwin32


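    If you are new to crawling, response.css(), quote.css(), quote.xpath(), and extract_first() in the code above may look unfamiliar. These are the methods used to further process the response. There are two ways to work out which selectors to pass to them:


    • The first is to use the browser's inspect mode (developer tools).
    • The second is to use the command-line shell that Scrapy provides.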




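    As I see it, the process is roughly as follows. Using the tags shown along the bottom of the inspector, first locate an approximate position; for example, quote = response.css('div.quote') selects the area marked by the blue box in the screenshot. Then we filter further. I did not find documentation explaining how to do this filtering, so here is what I learned from experience: separate nested HTML tags with a space; if a tag carries a class, attach the class name with a dot (.); finally, append ::text to select the text inside the element rather than the element together with its surrounding <> tags.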

    Throughout this process we can test with Scrapy's shell. In CMD, type: scrapy shell "http://quotes.toscrape.com/". This prints a pile of log lines followed by the available objects and shortcuts:

    D:OneDriveDocumentsPython和数据挖掘codelogspider>scrapy shell "http://quotes.toscrape.com/"
    2018-04-19 18:28:19 [scrapy.core.engine] INFO: Spider opened
    2018-04-19 18:28:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
    2018-04-19 18:28:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x0000029D0C61AC50>
    [s]   item       {}
    [s]   request    <GET http://quotes.toscrape.com/>
    [s]   response   <200 http://quotes.toscrape.com/>
    [s]   settings   <scrapy.settings.Settings object at 0x0000029D0ED439B0>
    [s]   spider     <DefaultSpider 'default' at 0x29d0efecc18>
    [s] Useful shortcuts:
    [s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
    [s]   fetch(req)                  Fetch a scrapy.Request and update local objects
    [s]   shelp()           Shell help (print this help)
    [s]   view(response)    View response in a browser


    # Locate the page title; extract() returns the data it contains
    >>> response.css('title::text')
    [<Selector xpath='descendant-or-self::title/text()' data='Quotes to Scrape'>]
    >>> response.css('title::text').extract()
    ['Quotes to Scrape']
    # Locate the author information; this is the most complete form
    >>> response.css("div.quote span small.author::text").extract()
    ['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']
    # It can also be shorter
    >>> response.css("div span small::text").extract()
    ['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']
    # It can also be split into chained calls
    >>> response.css("div.quote").css("span").css("small.author::text").extract()
    ['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']
    # Only need the first item?
    >>> response.css("div.quote").css("span").css("small.author::text")[0].extract()
    'Albert Einstein'
    >>> response.css("div.quote").css("span").css("small.author::text").extract_first()
    'Albert Einstein'



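    Another approach is to use XPath selectors, as in the code above: quote.xpath('span/small/text()'). According to the documentation, XPath is the foundation of Scrapy selectors; in fact, even CSS selectors are converted to XPath under the hood. XPath is more powerful than CSS selectors in that it can also look at the content of the page itself, for example selecting the link whose text is "Next Page". The official docs point to three resources on XPath: "using XPath with Scrapy Selectors", "learn XPath through examples", and "how to think in XPath".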
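To make the path expression span/small and the "select a link by its text" idea concrete without installing anything, here is a rough sketch using the limited XPath subset in Python's standard xml.etree.ElementTree. This is not Scrapy's selector API: ElementTree has no text() step, so .text is read from the element instead, and the HTML snippet is a hand-written, well-formed assumption loosely modeled on the quotes page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is larger
PAGE = """
<body>
  <div class="quote">
    <span class="text">“A day without sunshine is like, you know, night.”</span>
    <span>by <small class="author">Steve Martin</small></span>
  </div>
  <ul class="pager">
    <li class="next"><a href="/page/2/">Next</a></li>
  </ul>
</body>
"""

root = ET.fromstring(PAGE.strip())

# Relative path from each quote block: a <small> directly inside a <span>
authors = [
    quote.find("span/small").text
    for quote in root.findall(".//div[@class='quote']")
]

# Predicate on the element's own text: the link whose text is "Next"
next_link = root.find(".//a[.='Next']")
next_href = next_link.get("href")

print(authors, next_href)
```

In Scrapy itself the equivalent lookups would go through response.xpath(...), which supports full XPath 1.0, including text() and string functions.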



    scrapy crawl quotestoscrape -o data.json

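    -o presumably stands for output; it is much like a Linux command, so it is not hard to understand. Other file formats work too. The official docs recommend a format called JSON Lines, although I do not yet know what that format is.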

    All supported export formats are: 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'.
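For readers wondering what JSON Lines looks like: each line of the file is one complete JSON object, so records can be appended or read one at a time without parsing the whole file. The sketch below writes and reads that shape with the standard json module only; it illustrates the format, not Scrapy's exporter, and the two items are copied from the crawl output above.

```python
import io
import json

items = [
    {'text': '“A day without sunshine is like, you know, night.”',
     'author': 'Steve Martin'},
    {'text': '“Try not to become a man of success. Rather become a man of value.”',
     'author': 'Albert Einstein'},
]

# Write: one JSON object per line, no enclosing list, no trailing commas
buffer = io.StringIO()          # stands in for a .jl file on disk
for item in items:
    buffer.write(json.dumps(item, ensure_ascii=False) + '\n')

# Read: every line can be parsed on its own
lines = buffer.getvalue().splitlines()
restored = [json.loads(line) for line in lines]

print(len(lines), restored == items)
```

With Scrapy, choosing this format should only require a .jl or .jsonlines file name after -o, since both appear in the list of supported formats.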



    import scrapy


    class BlogSpider(scrapy.Spider):
        name = 'quotestoscrape'

        def start_requests(self):
            urls = ['http://quotes.toscrape.com/']
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.xpath('span/small/text()').extract_first(),
                }

            # Follow the "next" link, if any, and parse that page the same way
            next_page = response.css('li.next a::attr("href")').extract_first()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)
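The pagination step in the spider above uses response.urljoin to turn the relative href of the next link into an absolute URL. The Scrapy docs describe it as a wrapper over urllib.parse.urljoin with the response URL as the base, so the joining rule can be tried with the standard library alone. In this sketch the base URL comes from the crawl log, while the '/page/2/' and '/page/3/' hrefs are assumptions about the site's markup.

```python
from urllib.parse import urljoin

base = 'http://quotes.toscrape.com/'

# A root-relative href replaces the whole path of the base URL
page2 = urljoin(base, '/page/2/')

# Joining from a deeper page still resolves against the site root
page3 = urljoin(page2, '/page/3/')

# An already absolute URL is returned unchanged
same = urljoin(base, 'http://quotes.toscrape.com/page/2/')

print(page2, page3, same == page2)
```

Because each parse() call yields a request for the next page with itself as the callback, the crawl keeps going until a page has no li.next link and next_page is None.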



    import scrapy


    class BlogSpider(scrapy.Spider):
        name = 'ethanshub'
        start_urls = [
            # start URL missing from the source
        ]

        def parse(self, response):
            # One <ul class="listing"> per year
            for lists in response.css('ul.listing'):
                for j in range(len(lists.css("li.listing_item")) // 2):
                    yield {
                        # Date text nodes sit at every second index
                        'date': lists.css("li.listing_item::text")[j*2].extract(),
                        'title': lists.css("li.listing_item a::text")[j].extract(),
                    }

    After running scrapy crawl ethanshub -o data.json, the resulting data.json contains:

    {"date": "[2017-12-16]\n", "title": "Python3 \u722c\u866b\u5165\u95e8\uff08\u4e8c\uff09"},
    {"date": "[2017-12-15]\n", "title": "Python3 \u722c\u866b\u5165\u95e8\uff08\u4e00\uff09"},
    {"date": "[2017-12-13]\n", "title": "\u7528Python\u5411Kindle\u63a8\u9001\u7535\u5b50\u4e66"},
    {"date": "[2017-12-12]\n", "title": "GUI\u7f16\u7a0b\uff0cTkinter\u5e93\u548c\u5e03\u5c40"},
    {"date": "[2017-12-12]\n", "title": "Python3\u7684\u6b63\u5219\u8868\u8fbe\u5f0f"},
    {"date": "[2017-12-10]\n", "title": "Python\u901f\u89c8[7]"},
    {"date": "[2017-12-09]\n", "title": "Python\u901f\u89c8[6]"},
    {"date": "[2013-09-16]\n", "title": "How to split a string in C"},
    {"date": "[2012-11-28]\n", "title": "Common Filters for Wireshark"}
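The \uXXXX sequences in the title fields are ordinary JSON unicode escapes, and the date values carry a trailing newline and square brackets picked up from the page. A small sketch of reading one record back with the standard json module and tidying it; the record is the first one from the output above, assuming the escapes were written with backslashes as JSON requires.

```python
import json

# Raw string so Python leaves the \n and \uXXXX escapes for json to decode
raw = r'{"date": "[2017-12-16]\n", "title": "Python3 \u722c\u866b\u5165\u95e8\uff08\u4e8c\uff09"}'

item = json.loads(raw)

title = item['title']                       # escapes become real characters
date = item['date'].strip().strip('[]')     # drop the newline, then the brackets

# ensure_ascii=False keeps non-ASCII characters readable when dumping again
readable = json.dumps({'date': date, 'title': title}, ensure_ascii=False)

print(readable)
```

Scrapy also has a FEED_EXPORT_ENCODING setting; setting it to 'utf-8' in settings.py should make the JSON exporter write such characters directly instead of escaping them.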


