  • [Scrapy] Some things about Scrapy

    1. Pause and resume a crawl

    Scrapy supports this functionality out of the box by providing the following facilities:

    • a scheduler that persists scheduled requests on disk

    • a duplicates filter that persists visited requests on disk

    • an extension that keeps some spider state (key/value pairs) persistent between batches (a sketch of using it follows below)

    Run a crawl like this:

    scrapy crawl somespider -s JOBDIR=crawls/somespider_dir
    

    Use Ctrl+C to stop the crawl gracefully, then resume it later by running the same command again.
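
    The spider-state extension mentioned above exposes a state dict on the spider and persists it in JOBDIR between batches. A minimal sketch of keeping a counter across runs (items_count is just an illustrative key name):

    def parse_item(self, response):
        # self.state is saved to and reloaded from JOBDIR between batches
        # by the built-in spider-state extension
        self.state['items_count'] = self.state.get('items_count', 0) + 1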

    2. Issuing a GET request from a callback

    e.g.
    Page A is a list of news items and contains a link to each article.

    We want to issue another request to fetch each article's content.
    By setting request.meta, data can be carried into the callback function and read back via response.meta.

    def parse(self, response):
        newslist = response.xpath('//ul[@class="linkNews"]/li')
    
        for item in newslist:
            news = News()
            news['title'] = item.xpath('a/text()').extract_first(default = '')
    
            contentUri = item.xpath('a/@href').extract_first(default = '')
            # 'headers' is assumed to be defined elsewhere (e.g. a dict
            # carrying a User-Agent); it is not part of this snippet
            request = scrapy.Request(contentUri,
                        callback = self.getContent_callback,
                        headers = headers)
            request.meta['item'] = news
            yield request
    
    def getContent_callback(self, response):
        news = response.meta['item']
        news['content'] = response.xpath('//article[@class="art_box"]').xpath('string(.)').extract_first(default = '').strip()
        yield news
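
    In newer Scrapy versions (1.7+), Request also accepts cb_kwargs, which passes data to the callback as keyword arguments. A minimal sketch of the same idea, not taken from the original code:

    request = scrapy.Request(contentUri,
                callback = self.getContent_callback,
                cb_kwargs = {'news': news})

    # the callback then receives it directly:
    # def getContent_callback(self, response, news): ...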
    

    3. The interactive shell

    You can inspect all kinds of information here interactively, such as response.status.

    I mainly use it to debug XPath expressions (note: results obtained in the shell are not always reliable).

    PS C:\Users\patrick\Documents\Visual Studio 2017\Projects\ScrapyProjects> scrapy shell --nolog 'http://mil.news.sina.com.cn/2011-03-31/1342640379.html'
    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x0000026EA72752B0>
    [s]   item       {}
    [s]   request    <GET http://mil.news.sina.com.cn/2011-03-31/1342640379.html>
    [s]   response   <200 http://mil.news.sina.com.cn/2011-03-31/1342640379.html>
    [s]   settings   <scrapy.settings.Settings object at 0x0000026EA8586940>
    [s]   spider     <DefaultSpider 'default' at 0x26ea884bb38>
    [s] Useful shortcuts:
    [s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
    [s]   fetch(req)                  Fetch a scrapy.Request and update local objects
    [s]   shelp()           Shell help (print this help)
    [s]   view(response)    View response in a browser
    In [1]: response.status
    Out[1]: 200
    

    Setting custom headers in the interactive shell:

    $ scrapy shell --nolog
    ...
    ...
    >>> from scrapy import Request
    >>> req = Request('https://www.douban.com', headers = {'User-Agent' : '...'})
    >>> fetch(req)
    

    If you only want to set the user agent:

    scrapy shell -s USER_AGENT='useragent' 'https://movie.douban.com'
    

    4. Passing arguments to the spider from the command line

    scrapy crawl myspider -a category=electronics
    

    Inside the spider the argument is available directly under its name, like category in the code below.

    import scrapy
    
    class MySpider(scrapy.Spider):
        name = 'myspider'
    
        def __init__(self, category=None, *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs)
            self.start_urls = ['http://www.example.com/categories/%s' % category]
            # ...
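
    Note that arguments passed with -a always arrive as strings, so convert them explicitly when another type is needed. A small sketch (the limit argument and LimitedSpider are hypothetical names, not part of the original example):

    class LimitedSpider(scrapy.Spider):
        name = 'limitedspider'

        def __init__(self, limit = '10', *args, **kwargs):
            super().__init__(*args, **kwargs)
            # '-a limit=20' arrives as the string '20'
            self.limit = int(limit)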
    

    5. Removing extra whitespace from page text

    Use normalize-space in XPath.

    Also, extract_first is very handy, and it can take a default value.

    item['content'] = response.xpath('normalize-space(//div[@class="blkContainerSblkCon" and @id="artibody"])').extract_first(default = '')
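
    A quick way to see what normalize-space does (a throwaway Selector in the shell; XPath only normalizes ordinary whitespace such as spaces, tabs and newlines):

    >>> from scrapy import Selector
    >>> sel = Selector(text = '<p>  hello \n\n   world  </p>')
    >>> sel.xpath('normalize-space(//p)').extract_first(default = '')
    'hello world'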
    

    6. Stopping a spider programmatically

    The way to do this is to raise the built-in CloseSpider exception.

    exception scrapy.exceptions.CloseSpider(reason='cancelled')
    This exception can be raised from a spider callback to request the spider to be closed/stopped. Supported arguments:

    Parameters: reason (str) – the reason for closing

    from scrapy.exceptions import CloseSpider

    def parse_page(self, response):
        # response.body is bytes, so compare against a bytes literal
        if b'Bandwidth exceeded' in response.body:
            raise CloseSpider('bandwidth_exceeded')
    

    7. [mysql] Incorrect string value: '\xF0\x9F\x8C\xB9' for column 'title' at row 1

    Set the charset parameter to utf8mb4 when connecting to the database.
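
    For example, with pymysql (a sketch; the connection details are placeholders, and the target table/column must also use a utf8mb4 charset on the MySQL side):

    import pymysql

    conn = pymysql.connect(
        host = 'localhost',      # placeholder connection details
        user = 'root',
        password = '******',
        database = 'news',
        charset = 'utf8mb4'      # utf8mb4 can store 4-byte characters such as emoji
    )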

    8. Exported files contain escaped characters instead of Chinese

    Add FEED_EXPORT_ENCODING = 'utf-8' at the end of settings.py.
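
    The setting can also be supplied per run on the command line (news.json is just a placeholder output file name):

    scrapy crawl somespider -o news.json -s FEED_EXPORT_ENCODING=utf-8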

    9. Some things about Item

    >>> import scrapy
    >>> class A(scrapy.Item):
    ...     post_id = scrapy.Field()
    ...     user_id = scrapy.Field()
    ...     content = scrapy.Field()
    ...
    >>> type(A)
    <class 'scrapy.item.ItemMeta'>
    

    The post_id and user_id fields here can store data of any type.

    Values can be read just like from a dict.

    >>> a = A(post_id = '12312312', user_id = '2342_user_id')
    >>> a['post_id']
    '12312312'
    >>> a['user_id']
    '2342_user_id'
    

    If a field has not been assigned a value, reading it with dic['key'] raises a KeyError; the fix is to use the get method instead.

    >>> a.get('content', default = 'empty')
    'empty'
    >>> a.get('content', 'empty')
    'empty'
    

    Checking whether an Item has a given field, and whether it has been assigned a value:

    >>> 'name' in a   # has 'name' been assigned a value?
    False
    >>> 'name' in a.fields  # is 'name' one of a's declared fields?
    False
    >>> 'content' in a  # has 'content' been assigned a value?
    False
    >>> 'content' in a.fields
    True
    

    It is a good idea to replace every dic['key'] with dic.get('key', '').

    10. Writing logs to a file

    Insert into settings.py:

    LOG_STDOUT = True
    LOG_FILE = 'scrapy_log.txt'
    

    Or set the log file for a single run on the command line:

    scrapy crawl MyCrawler -s LOG_FILE=/var/log/crawler_mycrawler.log
    

