8_2 Scrapy in Practice: CrawlSpider (crawling the WeChat Mini Program Community tutorials as an example)

    CrawlSpider is designed for sites whose URLs follow regular patterns; given a set of rules, it can crawl the entire site.

    1. Create the project

    scrapy startproject wxapp

    cd wxapp

    scrapy genspider -t crawl wxapp_spider wxapp-union.com
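
    These commands produce the standard Scrapy project layout (generated from Scrapy's own templates; details may vary slightly by version):

    wxapp/
        scrapy.cfg
        wxapp/
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                wxapp_spider.py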

    2. Edit settings.py

    ROBOTSTXT_OBEY = False    # do not obey robots.txt
    DOWNLOAD_DELAY = 3        # pause 3 seconds between requests
    DEFAULT_REQUEST_HEADERS = {...}
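
    The header dict is left elided above; a minimal sketch of what it might contain (the values here are illustrative, not the original post's):

    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
                      ' (KHTML, like Gecko) Chrome/90.0 Safari/537.36',
    }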

    3. Writing wxapp_spider.py (the key part)

      The code:

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class WxappSpiderSpider(CrawlSpider):
        name = 'wxapp_spider'
        allowed_domains = ['wxapp-union.com']
        #start_urls = ['http://wxapp-union.com/']
        start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

        rules = (
            # Note: escape regex special characters with a backslash
            Rule(LinkExtractor(allow=r'.+list&catid=2&page=\d'), follow=True),
            Rule(LinkExtractor(allow=r'.+article.+\.html'), callback='parse_detail', follow=False),
        )

        def parse_detail(self, response):
            print(type(response))
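
    The skeleton can be checked by running it from the project root; type(response) is printed once for every article page matched by the second rule:

    scrapy crawl wxapp_spider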

      Points to note:

        1. parse_detail(self, response) is the callback used in a Rule. To debug into it, the domain in start_urls must be consistent with allowed_domains; if they differ, Scrapy's offsite filtering drops the requests automatically and execution never reaches parse_detail.

        2. LinkExtractor and Rule are both required; together they determine how the spider crawls:

          2.1  Writing the allow rule: restrict the pattern to exactly the URLs the program needs to crawl, and remember to escape regex special characters (a shell sketch for testing patterns follows this list).

          2.2  When to use follow: if the URLs matched on the current page should themselves be followed for further links, set follow=True; otherwise set follow=False.

          2.3  When to use callback: if you want to extract data from the matched page, name the parse function with callback; if a page is fetched only to discover more URLs and its data is not needed, no callback is necessary.
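
    Before a full run, an allow pattern can be checked interactively with scrapy shell; extract_links() is LinkExtractor's standard API and returns Link objects:

    scrapy shell "http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1"
    >>> from scrapy.linkextractors import LinkExtractor
    >>> le = LinkExtractor(allow=r'.+article.+\.html')
    >>> [link.url for link in le.extract_links(response)][:3]   # first few matching article links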

    4. Extracting data from the pages

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class WxappSpiderSpider(CrawlSpider):
        name = 'wxapp_spider'
        allowed_domains = ['wxapp-union.com']
        #start_urls = ['http://wxapp-union.com/']
        start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

        rules = (
            # Note: escape regex special characters with a backslash
            Rule(LinkExtractor(allow=r'.+list&catid=2&page=\d'), follow=True),
            Rule(LinkExtractor(allow=r'.+article.+\.html'), callback='parse_detail', follow=False),
        )

        def parse_detail(self, response):
            title = response.xpath("//h1[@class='ph']/text()").get()
            author_p = response.xpath("//p[@class='authors']")
            author = author_p.xpath(".//a/text()").get()
            pub_time = author_p.xpath(".//span/text()").get()
            content = response.xpath("//td[@id='article_content']//text()").getall()
            content = "".join(content).strip()
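
    Here get() returns only the first matching text node, while getall() returns all of them, which is why the article body is collected with getall() and then joined. A self-contained sketch with toy markup (not the real site's HTML) shows the difference:

    from scrapy.selector import Selector

    sel = Selector(text="<div id='c'><p>Hello</p><p>World</p></div>")
    print(sel.xpath("//div[@id='c']//text()").get())     # 'Hello'  (first text node only)
    print(sel.xpath("//div[@id='c']//text()").getall())  # ['Hello', 'World']  (every text node)
    print("".join(sel.xpath("//div[@id='c']//text()").getall()).strip())  # 'HelloWorld'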

    5. Storing the data

      1. items.py

    import scrapy


    class WxappItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()
        author = scrapy.Field()
        pub_time = scrapy.Field()
        content = scrapy.Field()
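
    A scrapy.Item behaves like a dict, so the fields declared above are set and read with dict syntax (the values below are made up for illustration):

    item = WxappItem(title='demo title', author='someone')
    print(item['title'])   # 'demo title'
    print(dict(item))      # {'title': 'demo title', 'author': 'someone'}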

      2. pipelines.py

    from scrapy.exporters import JsonLinesItemExporter

    class WxappPipeline:
        def __init__(self):
            # open in binary mode: the exporter writes encoded bytes
            self.fp = open('wxjc.json', 'wb')
            self.exporter = JsonLinesItemExporter(self.fp,
                                                  ensure_ascii=False,
                                                  encoding='utf-8')

        def process_item(self, item, spider):
            self.exporter.export_item(item)
            return item

        def close_spider(self, spider):
            self.fp.close()
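
    JsonLinesItemExporter writes each item as one JSON object per line, so wxjc.json ends up in JSON Lines format; a line looks roughly like this (values elided):

    {"title": "...", "author": "...", "pub_time": "...", "content": "..."}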

      3. wxapp_spider.py

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from wxapp.items import WxappItem

    class WxappSpiderSpider(CrawlSpider):
        name = 'wxapp_spider'
        allowed_domains = ['wxapp-union.com']
        #start_urls = ['http://wxapp-union.com/']
        start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

        rules = (
            # Note: escape regex special characters with a backslash
            Rule(LinkExtractor(allow=r'.+list&catid=2&page=\d'), follow=True),
            Rule(LinkExtractor(allow=r'.+article.+\.html'), callback='parse_detail', follow=False),
        )

        def parse_detail(self, response):
            title = response.xpath("//h1[@class='ph']/text()").get()
            author_p = response.xpath("//p[@class='authors']")
            author = author_p.xpath(".//a/text()").get()
            pub_time = author_p.xpath(".//span/text()").get()
            content = response.xpath("//td[@id='article_content']//text()").getall()
            content = "".join(content).strip()

            # hand the data off for storage
            item = WxappItem(title=title, author=author, pub_time=pub_time, content=content)
            yield item

      4. Edit settings.py

    ITEM_PIPELINES = {
       'wxapp.pipelines.WxappPipeline': 300,
    }
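
    With the pipeline enabled, start the crawl and the exported items accumulate in wxjc.json in the project root:

    scrapy crawl wxapp_spider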
    
    