  • 4: Crawling the home-page questions after logging in to Zhihu

    [Reposted] Source: http://www.jianshu.com/p/b7f41df6202d#pay-modal, by Andrew-Liu.

    The article linked above gives a very thorough account of simulated login and of using cookies and headers.

    A few points are worth adding, though:

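    1: In its second step, when analyzing the Form Data, clear your cookies before inspecting the request; otherwise the form will contain some extra, different fields. (PS: rememberme:y ---> remember_me:true).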
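
    As a rough illustration of point 1, the login form can be thought of as a plain dict of strings. Only `remember_me: true` comes from the note above; the other field names (`_xsrf`, `email`, `password`) and the helper `build_login_form` are hypothetical placeholders, so compare them with the Form Data your browser shows after clearing cookies.

```python
from urllib.parse import urlencode


def build_login_form(xsrf_token, email, password):
    """Build the login form fields as a dict of strings.

    Every field name except remember_me is an assumption for illustration.
    """
    return {
        "_xsrf": xsrf_token,     # hypothetical anti-CSRF field
        "email": email,          # hypothetical account field
        "password": password,    # hypothetical password field
        "remember_me": "true",   # per the note: remember_me:true, not rememberme:y
    }


form = build_login_form("token123", "user@example.com", "secret")
body = urlencode(form)  # URL-encoded string, as it would appear in a POST body
print(body)
```

    A dict of this shape is what a form-posting request would carry; the exact fields must come from the real request, not from this sketch.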

    2: About this part:

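    rules = (
        # Question URLs that carry a fragment (for example an anchor to one answer).
        Rule(SgmlLinkExtractor(allow=(r'/question/\d+#.*?', )), callback='parse_page', follow=True),
        # Plain question URLs; \d+ matches the numeric question id.
        Rule(SgmlLinkExtractor(allow=(r'/question/\d+', )), callback='parse_page', follow=True),
    )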

    This is the link extractor part; see the Link Extractors documentation (SgmlLinkExtractor is no longer recommended).
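
    The `allow` entries are regular expressions searched against each URL, so the digit class has to be written `\d+`; a bare `d+` would only match the letter "d". The check below uses only the standard `re` module, and the URL is a made-up example.

```python
import re

# Without the backslash, "d+" means "one or more letters d".
broken = re.compile(r"/question/d+")
# With the backslash, "\d+" means "one or more digits".
fixed = re.compile(r"/question/\d+")

url = "https://www.zhihu.com/question/12345678#answer-1"  # hypothetical URL

print(bool(broken.search(url)))    # False
print(bool(fixed.search(url)))     # True
print(fixed.search(url).group(0))  # /question/12345678
```

    Because the pattern is searched rather than anchored, `/question/\d+` also matches URLs that carry a `#...` fragment.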

    3: Note that the spider inherits from the CrawlSpider class, which the documentation describes as follows:

    CrawlSpider

    class scrapy.spiders.CrawlSpider

    This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It may not be the best suited for your particular web sites or project, but it’s generic enough for several cases, so you can start from it and override it as needed for more custom functionality, or just implement your own spider.

    Apart from the attributes inherited from Spider (that you must specify), this class supports a new attribute:

    rules

    Which is a list of one (or more) Rule objects. Each Rule defines a certain behaviour for crawling the site. Rules objects are described below. If multiple rules match the same link, the first one will be used, according to the order they’re defined in this attribute.
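
    The "first matching rule wins" behaviour can be pictured with a small model. This is a simplified sketch in plain Python, not Scrapy's own code; `assign_links_to_rules` and the sample patterns and URLs are made up for illustration.

```python
import re


def assign_links_to_rules(urls, patterns):
    """Map each URL to the index of the first pattern that matches it."""
    seen = set()
    assignment = {}
    for index, pattern in enumerate(patterns):
        for url in urls:
            # A URL already claimed by an earlier rule is skipped.
            if url not in seen and re.search(pattern, url):
                seen.add(url)
                assignment[url] = index
    return assignment


patterns = [r"category\.php", r"\.php"]  # rule 0 is more specific than rule 1
urls = ["/category.php?id=1", "/item.php?id=2", "/about.html"]
print(assign_links_to_rules(urls, patterns))
# {'/category.php?id=1': 0, '/item.php?id=2': 1}
```

    Both patterns match the category URL, but it is handled by rule 0 because that rule is defined first; `/about.html` matches no rule and is not followed.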

    This spider also exposes an overrideable method:

    parse_start_url(response)

    This method is called for the start_urls responses. It allows to parse the initial responses and must return either an Item object, a Request object, or an iterable containing any of them.

    Crawling rules

    class scrapy.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)

    link_extractor is a Link Extractor object which defines how links will be extracted from each crawled page.

    callback is a callable or a string (in which case a method from the spider object with that name will be used) to be called for each link extracted with the specified link_extractor. This callback receives a response as its first argument and must return a list containing Item and/or Request objects (or any subclass of them).

    Warning

    When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

    cb_kwargs is a dict containing the keyword arguments to be passed to the callback function.

    follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None follow defaults to True, otherwise it defaults to False.

    process_links is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called for each list of links extracted from each response using the specified link_extractor. This is mainly used for filtering purposes.

    process_request is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called with every request extracted by this rule, and must return a request or None (to filter out the request).
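
    Two of the parameters above can be sketched in plain Python. `effective_follow` mirrors the documented default for `follow`, which is why the Zhihu rules pass `follow=True` explicitly alongside a callback. `one_link_per_question` is a hypothetical `process_links` filter. The `Link` namedtuple is only a stand-in for Scrapy's link objects, which also expose a `.url` attribute.

```python
import re
from collections import namedtuple

# Stand-in for Scrapy's link objects; only the .url attribute is used here.
Link = namedtuple("Link", ["url"])


def effective_follow(callback=None, follow=None):
    """Documented default: follow is True only when no callback is given."""
    if follow is None:
        return callback is None
    return follow


def one_link_per_question(links):
    """Keep the first link for each question id; drop links without an id."""
    seen_ids = set()
    kept = []
    for link in links:
        match = re.search(r"/question/(\d+)", link.url)
        if match and match.group(1) not in seen_ids:
            seen_ids.add(match.group(1))
            kept.append(link)
    return kept


links = [
    Link("https://www.zhihu.com/question/111"),
    Link("https://www.zhihu.com/question/111#answer-5"),
    Link("https://www.zhihu.com/question/222"),
    Link("https://www.zhihu.com/people/someone"),
]
print(effective_follow(callback="parse_page"))               # False
print(effective_follow(callback="parse_page", follow=True))  # True
print([link.url for link in one_link_per_question(links)])
```

    In a spider, a filter like this would be defined as a method and referenced by name in the rule, for example `process_links='one_link_per_question'`.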

    CrawlSpider example

    Let’s now take a look at an example CrawlSpider with rules:

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    
    class MySpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com']
    
        rules = (
            # Extract links matching 'category.php' (but not matching 'subsection.php')
            # and follow links from them (since no callback means follow=True by default).
            Rule(LinkExtractor(allow=('category.php', ), deny=('subsection.php', ))),
    
            # Extract links matching 'item.php' and parse them with the spider's method parse_item
            Rule(LinkExtractor(allow=('item.php', )), callback='parse_item'),
        )
    
        def parse_item(self, response):
            self.logger.info('Hi, this is an item page! %s', response.url)
            item = scrapy.Item()
            item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
            item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
            item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
            return item
    

    This spider would start crawling example.com’s home page, collecting category links, and item links, parsing the latter with the parse_item method. For each item response, some data will be extracted from the HTML using XPath, and an Item will be filled with it.
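
    The `.re(r'ID: (\d+)')` step in `parse_item` returns the captured groups as a list of strings. A rough standard-library approximation, using made-up cell text in place of the XPath result, looks like this:

```python
import re

# Made-up text nodes, standing in for what the XPath query might select.
cell_texts = ["ID: 12345", "no id here"]

# Collect every captured group from every text node, in order.
ids = [m for text in cell_texts for m in re.findall(r"ID: (\d+)", text)]
print(ids)  # ['12345']
```

    The values stay strings; convert them with `int()` if a numeric id is needed.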

  • Original article: https://www.cnblogs.com/pengsixiong/p/4922351.html