1. Introduction to CrawlSpider
Scrapy provides two kinds of spiders: the Spider class and the CrawlSpider class.
This example uses the CrawlSpider class.
CrawlSpider is a subclass of Spider. Spider is designed to crawl only the pages listed in start_urls, whereas CrawlSpider adds a set of rules (Rule) that provide a convenient mechanism for following links, so it is better suited to extracting links from crawled pages and continuing the crawl from them.
Create the project:
scrapy startproject baidu
Generate the spider from the crawl template:
scrapy genspider -t crawl baidu 'tieba.baidu.com'
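The -t crawl option uses Scrapy's crawl spider template instead of the basic one. The generated file looks roughly like the sketch below (the exact boilerplate varies between Scrapy versions, and the placeholder rule is just what the template emits, not something this project uses):

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BaiduSpider(CrawlSpider):
    name = 'baidu'
    allowed_domains = ['tieba.baidu.com']
    start_urls = ['http://tieba.baidu.com/']

    rules = (
        # Template placeholder rule: extract links matching r'Items/' and parse them.
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item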
CrawlSpider inherits from Spider. Besides the inherited attributes (name, allowed_domains), it provides new attributes and methods:
LinkExtractors
class scrapy.linkextractors.LinkExtractor
The purpose of Link Extractors is simple: to extract links.
Each LinkExtractor has a single public method, extract_links(), which receives a Response object and returns a list of scrapy.link.Link objects.
A Link Extractor is instantiated once, and its extract_links method is then called repeatedly on different responses to extract links.
Main parameters (a usage sketch follows this list):
allow: only URLs matching the given regular expression(s) are extracted; if empty, all links match.
deny: URLs matching this regular expression (or list of regular expressions) are excluded and never extracted.
allow_domains: only links in these domains are extracted.
deny_domains: links in these domains are never extracted.
restrict_xpaths: XPath expression(s) that, together with allow, restrict which regions of the page links are extracted from.
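As a minimal sketch (the allow/deny patterns below are made-up placeholders, not part of this project), a LinkExtractor is instantiated once and then asked to pull links out of any response passed to a spider callback:

from scrapy.linkextractors import LinkExtractor

# Hypothetical extractor: keep paginated list pages, drop member pages.
link_extractor = LinkExtractor(
    allow=r'.+mod=list&catid=2&page=\d+',   # URL must match this regex
    deny=r'.+member\.php.+',                # URLs matching this are excluded
    allow_domains=['wxapp-union.com'],      # only links on this domain
)

# Inside a spider callback, `response` is the page being parsed:
# for link in link_extractor.extract_links(response):
#     print(link.url, link.text)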
rules
rules contains one or more Rule objects, and each Rule defines a specific behaviour for crawling the site. If multiple rules match the same link, the first rule in the order they are defined in this attribute is used.
Rule parameters (an illustrative spider follows this list):
link_extractor: a Link Extractor object that defines which links to extract.
callback: the callback to invoke for each link extracted by link_extractor; it receives a response as its first argument.
Note: when writing crawl rules, avoid using parse as the callback. CrawlSpider uses the parse method to implement its own logic, so overriding parse will break the spider.
follow: a boolean specifying whether links should be followed from responses extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.
process_links: the name of a spider method that will be called with the list of links extracted by link_extractor; mainly used for filtering.
process_request: the name of a spider method that will be called for every request extracted by this rule; used to filter requests.
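Putting these pieces together, a rules tuple pairs a LinkExtractor with an optional callback and follow flag. The sketch below is purely illustrative (the domain, URL patterns, and callback name are placeholders); the real spider for this tutorial appears in section 2:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/list?page=1']

    rules = (
        # List pages: no callback, just keep following pagination links.
        Rule(LinkExtractor(allow=r'list\?page=\d+'), follow=True),
        # Detail pages: hand the response to parse_item, do not follow further links.
        Rule(LinkExtractor(allow=r'item-\d+\.html'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # Named parse_item (not parse!) so CrawlSpider's own parse logic stays intact.
        yield {'title': response.xpath('//h1/text()').get()}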
2. Building the example project
a. Start a project
scrapy startproject wxapp
b. Generate the spider from the crawl template
scrapy genspider -t crawl wxapp_spider "wxapp-union.com"
c. settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for wxapp project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'wxapp'

SPIDER_MODULES = ['wxapp.spiders']
NEWSPIDER_MODULE = 'wxapp.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'wxapp (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'wxapp.middlewares.WxappSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'wxapp.middlewares.WxappDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'wxapp.pipelines.WxappPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
#AUTOTHROTTLE_START_DELAY = 5
#AUTOTHROTTLE_MAX_DELAY = 60
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
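Compared with the generated defaults, the changes that matter here are: ROBOTSTXT_OBEY = False (so the crawl is not restricted by the site's robots.txt), DOWNLOAD_DELAY = 2 to throttle requests, a browser-like User-Agent in DEFAULT_REQUEST_HEADERS, and WxappPipeline registered in ITEM_PIPELINES so extracted items reach the pipeline below.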
d. wxapp_spider.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from wxapp.items import WxappItem


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    rules = (
        # Pagination links of the list pages: no callback, just keep following them.
        Rule(LinkExtractor(allow=r'.+mod=list&catid=2&page=\d'), follow=True),
        # Article detail pages: parse them, do not follow links found inside.
        Rule(LinkExtractor(allow=r'.+article-.+\.html'), callback="parse_detail", follow=False),
    )

    def parse_detail(self, response):
        title = response.xpath('//h1[@class="ph"]/text()').get()
        author_p = response.xpath('//p[@class="authors"]')
        author = author_p.xpath('.//a/text()').get()
        pub_time = author_p.xpath('.//span/text()').get()
        content = response.xpath('//td[@id="article_content"]//text()').getall()
        item = WxappItem(title=title, author=author, pub_time=pub_time, content=content)
        yield item
e. items.py
import scrapy


class WxappItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    pub_time = scrapy.Field()
    content = scrapy.Field()
f. pipelines.py
from scrapy.exporters import JsonLinesItemExporter


class WxappPipeline(object):
    def __init__(self):
        # Open the output file in binary mode; the exporter writes encoded bytes.
        self.fp = open("wxapp.json", "wb")
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding="utf-8")

    def process_item(self, item, spider):
        # Write each item as one JSON object per line.
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
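JsonLinesItemExporter writes each item as a standalone JSON object on its own line, so items are streamed to disk as they arrive instead of being buffered until the crawl ends; that is also why the file is opened in binary mode ('wb').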
g. Create start.py in the wxapp project directory
from scrapy import cmdline

cmdline.execute("scrapy crawl wxapp_spider".split())
Run start.py to start the crawl.