8_2 Getting started with Scrapy: CrawlSpider (example: crawling the WeChat Mini Program community tutorials)

    CrawlSpider is suited to sites whose URLs follow regular patterns; it crawls an entire site by following the links that match a set of rules.

    1. Create the project

    scrapy startproject wxapp

    cd wxapp

    scrapy genspider -t crawl wxapp_spider wxapp-union.com

    2. Edit settings.py

    ROBOTSTXT_OBEY = False
    DOWNLOAD_DELAY = 3
    DEFAULT_REQUEST_HEADERS = {...}
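
    The header dict above is only a placeholder; in practice a browser-like User-Agent is the main thing to add. A minimal example (the Accept values are Scrapy's defaults, the User-Agent string is just an illustration):

     DEFAULT_REQUEST_HEADERS = {
         'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
         'Accept-Language': 'en',
         # any common browser UA string works here; this value is only an example
         'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
     }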

    3. Write wxapp_spider.py (the core part)

      The spider code:

     # -*- coding: utf-8 -*-
     import scrapy
     from scrapy.linkextractors import LinkExtractor
     from scrapy.spiders import CrawlSpider, Rule


     class WxappSpiderSpider(CrawlSpider):
         name = 'wxapp_spider'
         allowed_domains = ['wxapp-union.com']
         # start_urls = ['http://wxapp-union.com/']
         start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

         rules = (
             # note: regex special characters must be escaped with a backslash
             Rule(LinkExtractor(allow=r'.+list&catid=2&page=\d'), follow=True),
             Rule(LinkExtractor(allow=r'.+article.+\.html'), callback='parse_detail', follow=False),
         )

         def parse_detail(self, response):
             print(type(response))

      Notes / things to watch:

        1. parse_detail(self, response) is the callback referenced by the Rule. To reach it while debugging, the domain of start_urls must be covered by allowed_domains; if they do not match, the requests are filtered out automatically and parse_detail is never called.

        2. LinkExtractor and Rule are required; together they decide how the spider crawls.

          2.1  Setting allow: restrict the pattern to exactly the URLs the spider needs to crawl, and remember to escape regex special characters (e.g. \d, \.).

          2.2  When to use follow: if the URLs matched on a page should themselves be followed for more matching URLs, set follow=True; otherwise set follow=False.

           2.3  When to use callback: specify a callback when you want to extract data from the matched pages; if a page is fetched only to discover more URLs and its data is not needed, no callback is required. (A quick way to test an allow pattern before running the full crawl is sketched below.)
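
      To sanity-check an allow pattern, you can run the extractor against a live page in scrapy shell; a minimal sketch using the list page from start_urls (the 5-link limit is arbitrary):

     # scrapy shell "http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1"
     from scrapy.linkextractors import LinkExtractor

     le = LinkExtractor(allow=r'.+list&catid=2&page=\d')
     links = le.extract_links(response)   # 'response' is provided by scrapy shell
     for link in links[:5]:
         print(link.url)                  # pagination URLs matched by the rule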

    4. Extract data from the article pages

     # -*- coding: utf-8 -*-
     import scrapy
     from scrapy.linkextractors import LinkExtractor
     from scrapy.spiders import CrawlSpider, Rule


     class WxappSpiderSpider(CrawlSpider):
         name = 'wxapp_spider'
         allowed_domains = ['wxapp-union.com']
         # start_urls = ['http://wxapp-union.com/']
         start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

         rules = (
             # note: regex special characters must be escaped with a backslash
             Rule(LinkExtractor(allow=r'.+list&catid=2&page=\d'), follow=True),
             Rule(LinkExtractor(allow=r'.+article.+\.html'), callback='parse_detail', follow=False),
         )

         def parse_detail(self, response):
             title = response.xpath("//h1[@class='ph']/text()").get()
             author_p = response.xpath("//p[@class='authors']")
             author = author_p.xpath(".//a/text()").get()
             pub_time = author_p.xpath(".//span/text()").get()
             content = response.xpath("//td[@id='article_content']//text()").getall()
             content = "".join(content).strip()
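
      Before wiring these selectors into the spider, they can be checked interactively in scrapy shell. A rough sketch; the URL is a placeholder for any article page matched by the second rule:

     # scrapy shell "<any wxapp-union.com article URL matched by the article rule>"
     title = response.xpath("//h1[@class='ph']/text()").get()
     author = response.xpath("//p[@class='authors']//a/text()").get()
     print(title, author)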

    5. Store the data

      1) items.py

     import scrapy


     class WxappItem(scrapy.Item):
         # define the fields for your item here like:
         # name = scrapy.Field()
         title = scrapy.Field()
         author = scrapy.Field()
         pub_time = scrapy.Field()
         content = scrapy.Field()
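
      A WxappItem behaves like a dict, so fields are set and read by key; a small illustration (the values are dummies):

     item = WxappItem(title='demo title', author='demo author')
     item['pub_time'] = 'demo time'             # fields can also be assigned later
     print(item['title'], item.get('content'))  # dict-style access; a missing field returns None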

      2) pipelines.py

     from scrapy.exporters import JsonLinesItemExporter


     class WxappPipeline:
         def __init__(self):
             # the exporter needs a file opened in binary mode
             self.fp = open('wxjc.json', 'wb')
             self.exporter = JsonLinesItemExporter(self.fp,
                                                   ensure_ascii=False,
                                                   encoding='utf-8')

         def process_item(self, item, spider):
             # write each item as one JSON object per line
             self.exporter.export_item(item)
             return item

         def close_spider(self, spider):
             self.fp.close()
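
      Because JsonLinesItemExporter writes one JSON object per line, the output file can be read back line by line; a minimal sketch:

     import json

     with open('wxjc.json', encoding='utf-8') as f:
         for line in f:
             article = json.loads(line)
             print(article['title'], article['pub_time'])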

      3) wxapp_spider.py

     # -*- coding: utf-8 -*-
     import scrapy
     from scrapy.linkextractors import LinkExtractor
     from scrapy.spiders import CrawlSpider, Rule
     from wxapp.items import WxappItem


     class WxappSpiderSpider(CrawlSpider):
         name = 'wxapp_spider'
         allowed_domains = ['wxapp-union.com']
         # start_urls = ['http://wxapp-union.com/']
         start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

         rules = (
             # note: regex special characters must be escaped with a backslash
             Rule(LinkExtractor(allow=r'.+list&catid=2&page=\d'), follow=True),
             Rule(LinkExtractor(allow=r'.+article.+\.html'), callback='parse_detail', follow=False),
         )

         def parse_detail(self, response):
             title = response.xpath("//h1[@class='ph']/text()").get()
             author_p = response.xpath("//p[@class='authors']")
             author = author_p.xpath(".//a/text()").get()
             pub_time = author_p.xpath(".//span/text()").get()
             content = response.xpath("//td[@id='article_content']//text()").getall()
             content = "".join(content).strip()

             # pack the extracted fields into an item and hand it to the pipeline
             item = WxappItem(title=title, author=author, pub_time=pub_time, content=content)
             yield item

      4) Enable the pipeline in settings.py

    ITEM_PIPELINES = {
       'wxapp.pipelines.WxappPipeline': 300,
    }
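
      With the pipeline enabled, run the spider from the project directory; the scraped articles are written to wxjc.json:

     scrapy crawl wxapp_spider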
    
    