  • Scrapy: Getting Started

    Original tutorial reference: https://blog.csdn.net/u011054333/article/details/70165401

    Troubleshooting reference: https://blog.csdn.net/dugushangliang/article/details/94585829

    1. Scrapy is a high-level Python crawling framework. Besides the core crawling features, it makes it easy to export the scraped data to files such as CSV and JSON.

    Installing Scrapy

    pip install scrapy
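
    After installation, a quick way to confirm that the scrapy command-line tool is available is to ask it for its version (the exact version string will depend on your environment):

    scrapy version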

     2. Quick start -- a first spider example

    import scrapy
    
    
    class QuotesSpider(scrapy.Spider):
        name = "quotes"
    
        def start_requests(self):
            urls = [
                'http://quotes.toscrape.com/page/1/',
                'http://quotes.toscrape.com/page/2/',
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)
    
        def parse(self, response):
            page = response.url.split("/")[-2]
            filename = 'quotes-%s.html' % page
            with open(filename, 'wb') as f:
                f.write(response.body)
            self.log('Saved file %s' % filename)

    Explanation of the example:

    • The spider class's name attribute identifies the spider; the name must be unique within a project.
    • The start_requests() method must return an iterable of requests (either a list or a generator); Scrapy starts crawling from these requests.
    • The parse() method extracts the desired content from the page text; we override it to suit our needs (a way to run this example is sketched right after this list).
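
    To try the example above, save it to a file and run it with scrapy runspider. The file name quotes_spider.py below is only an illustrative choice; on success, quotes-1.html and quotes-2.html are written to the working directory by the parse() method shown earlier.

    scrapy runspider quotes_spider.py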

    Setting the start URLs

    The example above uses the start_requests() method to set the start URLs. If you only need to specify the URLs, there is a simpler option: set the class attribute start_urls, and Scrapy will read it to build the initial requests (a fuller sketch with a parse() callback follows the snippet below).

    import scrapy
    
    
    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
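
    For reference, here is a minimal sketch of what the complete spider could look like when using start_urls: Scrapy builds a request for each URL and, by default, passes each response to parse(), so the file-saving callback from the earlier example can be reused unchanged (the body below simply repeats that logic and is not something prescribed by Scrapy).

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]

        def parse(self, response):
            # With start_urls, Scrapy calls parse() for each downloaded page by default.
            page = response.url.split("/")[-2]
            filename = 'quotes-%s.html' % page
            with open(filename, 'wb') as f:
                f.write(response.body)
            self.log('Saved file %s' % filename)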

    3. Extracting data

    For this we can use Scrapy's shell feature. The following command starts the Scrapy shell and fetches the budejie.com jokes listing page; once it succeeds, an interactive shell opens in which we can experiment interactively.

    scrapy shell 'http://www.budejie.com/text/'
    (tensorflow) C:\Users\xxx>scrapy shell 'http://www.budejie.com/text/'
    2020-04-20 21:41:40 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: scrapybot)
    2020-04-20 21:41:40 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.7 (default, Mar 23 2020, 23:19:08) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1f  31 Mar 2020), cryptography 2.8, Platform Windows-10-10.0.18362-SP0
    2020-04-20 21:41:40 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
    2020-04-20 21:41:40 [scrapy.crawler] INFO: Overridden settings:
    {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
     'LOGSTATS_INTERVAL': 0}
    2020-04-20 21:41:40 [scrapy.extensions.telnet] INFO: Telnet Password: 7f69dbe4b767b160
    2020-04-20 21:41:40 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole']
    2020-04-20 21:41:41 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2020-04-20 21:41:41 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2020-04-20 21:41:41 [scrapy.middleware] INFO: Enabled item pipelines:
    []
    2020-04-20 21:41:41 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
    2020-04-20 21:41:41 [scrapy.core.engine] INFO: Spider opened
    Traceback (most recent call last):
      File "f:anaconda3envs	ensorflowlib
    unpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "f:anaconda3envs	ensorflowlib
    unpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "F:Anaconda3envs	ensorflowScriptsscrapy.exe\__main__.py", line 7, in <module>
      File "f:anaconda3envs	ensorflowlibsite-packagesscrapycmdline.py", line 145, in execute
        _run_print_help(parser, _run_command, cmd, args, opts)
      File "f:anaconda3envs	ensorflowlibsite-packagesscrapycmdline.py", line 99, in _run_print_help
        func(*a, **kw)
      File "f:anaconda3envs	ensorflowlibsite-packagesscrapycmdline.py", line 153, in _run_command
        cmd.run(args, opts)
      File "f:anaconda3envs	ensorflowlibsite-packagesscrapycommandsshell.py", line 74, in run
        shell.start(url=url, redirect=not opts.no_redirect)
      File "f:anaconda3envs	ensorflowlibsite-packagesscrapyshell.py", line 45, in start
        self.fetch(url, spider, redirect=redirect)
      File "f:anaconda3envs	ensorflowlibsite-packagesscrapyshell.py", line 113, in fetch
        reactor, self._schedule, request, spider)
      File "f:anaconda3envs	ensorflowlibsite-packages	wistedinternet	hreads.py", line 122, in blockingCallFromThread
        result.raiseException()
      File "f:anaconda3envs	ensorflowlibsite-packages	wistedpythonfailure.py", line 488, in raiseException
        raise self.value.with_traceback(self.tb)
    ValueError: invalid hostname: 'http

    Cause of the error: on Windows, the URL passed to scrapy shell must be wrapped in double quotes.

    (tensorflow) C:\Users\xxxx>scrapy shell "http://www.budejie.com//text//"
    2020-04-20 21:46:20 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: scrapybot)
    2020-04-20 21:46:20 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.7 (default, Mar 23 2020, 23:19:08) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1f  31 Mar 2020), cryptography 2.8, Platform Windows-10-10.0.18362-SP0
    2020-04-20 21:46:20 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
    2020-04-20 21:46:20 [scrapy.crawler] INFO: Overridden settings:
    {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
     'LOGSTATS_INTERVAL': 0}
    2020-04-20 21:46:20 [scrapy.extensions.telnet] INFO: Telnet Password: 9399f1c2f556e3d9
    2020-04-20 21:46:20 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole']
    2020-04-20 21:46:21 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2020-04-20 21:46:21 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2020-04-20 21:46:21 [scrapy.middleware] INFO: Enabled item pipelines:
    []
    2020-04-20 21:46:21 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
    2020-04-20 21:46:21 [scrapy.core.engine] INFO: Spider opened
    2020-04-20 21:46:22 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://www.budejie.com//text//> (referer: None)
    2020-04-20 21:46:24 [asyncio] DEBUG: Using selector: SelectSelector
    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x00000211AB54EC88>
    [s]   item       {}
    [s]   request    <GET http://www.budejie.com//text//>
    [s]   response   <403 http://www.budejie.com//text//>
    [s]   settings   <scrapy.settings.Settings object at 0x00000211AD5374C8>
    [s]   spider     <DefaultSpider 'default' at 0x211ad9b9f08>
    [s] Useful shortcuts:
    [s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
    [s]   fetch(req)                  Fetch a scrapy.Request and update local objects
    [s]   shelp()           Shell help (print this help)
    [s]   view(response)    View response in a browser
    2020-04-20 21:46:25 [asyncio] DEBUG: Using selector: SelectSelector
    In [1]:    

    4. Examples of interactive commands

    2020-04-20 21:53:40 [asyncio] DEBUG: Using selector: SelectSelector
    In [1]: response.css('title')
    Out[1]: [<Selector xpath='descendant-or-self::title' data='<title>内涵段子_内涵笑话-百思不得姐官网,第2页</title>'>]
    
    In [2]: response.css("title::text").extract()
    Out[2]: ['内涵段子_内涵笑话-百思不得姐官网,第2页']
    
    In [3]: li=response.css('div.j-r-list-c-desc')
    
    In [4]: li
    Out[4]:
    [<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' j-r-list-c-desc ')]" data='<div class="j-r-list-c-desc">
           ...'>,
     <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' j-r-list-c-desc ')]" data='<div class="j-r-list-c-desc">
           ...'>,
     <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' j-r-list-c-desc ')]" data='<div class="j-r-list-c-desc">
           ...'>,
     <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' j-r-list-c-desc ')]" data='<div class="j-r-list-c-desc">
           ...'>,
     <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' j-r-list-c-desc ')]" data='<div class="j-r-list-c-desc">
           ...'>,
     <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' j-r-list-c-desc ')]" data='<div class="j-r-list-c-desc">
           ...'>]
    
    In [5]:  
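
    The same selector API can also be experimented with outside the shell. The sketch below is self-contained: the HTML fragment is made up purely to mimic the structure of the budejie listing page, so only the selector calls themselves carry over to the real site.

    from scrapy.selector import Selector

    # A made-up fragment that mimics the structure of the budejie listing page.
    html = '''
    <div class="j-r-list">
      <ul>
        <li>
          <a class="u-user-name">someone</a>
          <div class="j-r-list-c-desc"><a>a short joke</a></div>
        </li>
      </ul>
    </div>
    '''

    sel = Selector(text=html)
    # extract() returns a list of every match; extract_first() returns the first match or None.
    print(sel.css('a.u-user-name::text').extract())                # ['someone']
    print(sel.css('div.j-r-list-c-desc a::text').extract_first())  # 'a short joke'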

    5. Writing the spider

    Once we know how to extract the data, we can write the spider. The spider below scrapes the usernames and jokes from the budejie front page.

    import scrapy
    
    
    class Baisibudejie(scrapy.Spider):
        name = "jokes"
        start_urls = ['http://www.budejie.com/text/']
    
        def parse(self, response):
            lies = response.css('div.j-r-list>ul>li')
            for li in lies:
                username = li.css('a.u-user-name::text').extract()
                content = li.css('div.j-r-list-c-desc a::text').extract()
                yield {'username': username, 'content': content}

    Once the spider is written, it can be run. The command below runs it; on success a user.json file is generated containing the scraped content. Scrapy supports several export formats: besides JSON, data can also be exported as XML, CSV, and so on (a CSV variant of the command is sketched below).

    scrapy runspider Baisibudejie.py -o user.json
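
    Since the feed format is inferred from the output file's extension, exporting to CSV instead is just a matter of changing the file name (user.csv here is only an illustrative name):

    scrapy runspider Baisibudejie.py -o user.csv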

    The generated file can be found under the current user's Documents folder.
