
Scraping a novel with Scrapy (still working on the out-of-order chapter problem)

Spider file
import scrapy
from firstblood.items import FirstbloodItem

class FirstSpider(scrapy.Spider):
	name = 'second'
	# allowed_domains = ['www.xxx.com']
	start_urls = ['https://www.zhenhunxiaoshuo.com/shapolang/']

	def parse_detail(self, response):
		# The callback receives the item passed along via meta
		item = response.meta['item']
		# Collect every text node inside the chapter's <article> and join them
		page_detail = response.xpath('/html/body/section/div[1]/div/article//text()').extract()
		page_detail = ''.join(page_detail)
		item['page'] = page_detail
		yield item
		# print(page_detail)

	def parse(self, response):
		# Starting the XPath with // is basically the default
		li_list = response.xpath('//div[@class="excerpts-wrapper"]/div/article')
		for li in li_list:
			item = FirstbloodItem()
			title = li.xpath('./a/text()')[0].extract()
			detail_url = li.xpath('./a/@href').extract_first()
			item['title'] = title
			# print(title)
			# print(detail_url)
			# Manually send a request for the detail page,
			# passing the item to the callback via meta
			yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})
Enable the item pipeline in settings.py so the items get stored
from itemadapter import ItemAdapter
import pymysql


class FirstbloodPipeline(object):
	def process_item(self,item,spider):
		print(item['title'])  # print the title to check the order the chapters arrive in
		return item
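
For the pipeline above to run at all, it has to be switched on in settings.py. In a project generated by `scrapy startproject`, the `ITEM_PIPELINES` setting is present but commented out. A minimal sketch of the uncommented setting follows; the module path `firstblood.pipelines` is an assumption based on the `firstblood.items` import in the spider and the default project layout.

```python
# settings.py (fragment)
# Assumption: the pipeline class lives in firstblood/pipelines.py,
# which is where the default Scrapy project template puts it.
ITEM_PIPELINES = {
    # The number is the pipeline's priority: lower values run first.
    'firstblood.pipelines.FirstbloodPipeline': 300,
}
```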

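Problem: the chapters come out in the wrong order

The cause appears to be Scrapy's asynchronous handling: the detail-page requests are sent concurrently, so the items reach the pipeline in the order the responses finish, not in the order the chapters are listed.
Many novels have no numbers or anything similar at the start of their chapter titles, so you need to assign an auto-incrementing id yourself.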
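
One possible way to do this, sketched here under assumptions rather than taken from the original post: record each chapter's position on the list page as its id, then have the pipeline buffer the items and sort them by that id when the spider closes. The `index` field and the `OrderedChapterPipeline` class are hypothetical names; `FirstbloodItem` would need to declare `index` as an extra field.

```python
# In the spider's parse(), the list position would be attached to each item
# (hypothetical change; assumes FirstbloodItem declares an `index` field):
#
#     for index, li in enumerate(li_list):
#         item = FirstbloodItem()
#         item['index'] = index
#         ...


class OrderedChapterPipeline:
    """Hypothetical pipeline: buffer chapters, then restore list-page order."""

    def open_spider(self, spider):
        # Start with an empty buffer each time the spider runs.
        self.chapters = []

    def process_item(self, item, spider):
        # Items arrive in whatever order the detail pages finish downloading,
        # so they are only collected here, not written out yet.
        self.chapters.append(dict(item))
        return item

    def close_spider(self, spider):
        # All chapters are in; sort them by the id assigned in parse().
        self.chapters.sort(key=lambda chapter: chapter['index'])
        for chapter in self.chapters:
            # Printing mirrors the order check above; writing the chapters
            # to a file or database would go here instead.
            print(chapter['index'], chapter['title'])
```

This keeps every chapter in memory until the crawl ends, which is acceptable for a single novel. Like any pipeline, it would also have to be registered in `ITEM_PIPELINES` before Scrapy calls it.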


Original article: https://www.cnblogs.com/serendipity-my/p/13736030.html
