Date: 2019-07-07
Author: Sun
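1. Workflow for debugging Scrapy code in PyCharm
PyCharm has no built-in way to run Scrapy: a spider is normally launched with the scrapy command-line tool rather than by running a Python file, so there is no script for the debugger to start from. How, then, can we step through Scrapy code while learning it?
This section shows one way to do it.
The example used throughout is a spider for the site http://books.toscrape.com/
(1) Create the Scrapy project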
scrapy startproject books_toscrape
(2) Create the spider
cd books_toscrape
scrapy genspider toscrape books.toscrape.com
This creates the spider file toscrape.py in the spiders directory.
(3) Create the debug file main.py in the project directory
books_toscrape/main.py
Its contents are as follows:
# -*- coding: utf-8 -*-
__author__ = 'sun'
__date__ = '2019/07/07 9:04 PM'
import os
import sys

from scrapy.cmdline import execute

# Add the directory that contains main.py to the module search path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
# The spider name, i.e. the spider_name passed to `scrapy genspider spider_name`
SPIDER_NAME = "toscrape"
# Equivalent to running `scrapy crawl toscrape` on the command line
execute(["scrapy", "crawl", SPIDER_NAME])
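If the debug run needs extra options, the argument list passed to execute can be assembled by a small helper instead of being hard-coded. The sketch below is only an illustration: build_crawl_argv is a hypothetical name, not part of Scrapy, and the output file name is made up. The -a, -s and -o options it emits are the usual options of the scrapy crawl command.

```python
def build_crawl_argv(spider_name, spider_args=None, settings=None, output=None):
    """Build the argv list for scrapy.cmdline.execute (hypothetical helper)."""
    argv = ["scrapy", "crawl", spider_name]
    # -a NAME=VALUE passes an argument to the spider
    for key, value in (spider_args or {}).items():
        argv += ["-a", f"{key}={value}"]
    # -s NAME=VALUE overrides a project setting for this run only
    for key, value in (settings or {}).items():
        argv += ["-s", f"{key}={value}"]
    # -o FILE writes the scraped items to a file
    if output:
        argv += ["-o", output]
    return argv


if __name__ == "__main__":
    argv = build_crawl_argv(
        "toscrape",
        settings={"ROBOTSTXT_OBEY": False},
        output="books.json",  # made-up file name
    )
    print(argv)
    # In main.py this list would then be handed over: execute(argv)
```

Keeping the list in one place makes it easy to switch options between debug runs without editing settings.py.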
(4) Change the settings file settings.py
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
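(5) Start debugging
Set a breakpoint in the parse function in spiders/toscrape.py, where you can try using XPath to extract some of the book data from the page.
Then open main.py, right-click, and choose Debug to start a debug session.
Once the session is running, execution stops at the breakpoint and you can step into Scrapy itself.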
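To get a feel for the XPath expressions before stepping through a live crawl, they can be tried on a small sample first. The sketch below is not Scrapy code: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, as a stand-in so that it runs without a crawl. The markup is a hand-written, simplified fragment, not copied from the site, and parse_books is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified stand-in for one entry of the book listing.
SAMPLE = """
<div>
  <article class="product_pod">
    <div class="image_container">
      <a href="catalogue/sample-book_1/index.html"><img src="media/cache/sample.jpg"/></a>
    </div>
    <h3><a href="catalogue/sample-book_1/index.html">Sample Book</a></h3>
    <div class="product_price"><p class="price_color">£10.00</p></div>
  </article>
</div>
"""


def parse_books(markup):
    """Extract title, detail link, image and price from each product_pod article."""
    root = ET.fromstring(markup)
    books = []
    for article in root.findall(".//article[@class='product_pod']"):
        link = article.find("./h3/a")
        img = article.find("./div[@class='image_container']/a/img")
        price = article.find("./div[@class='product_price']/p")
        books.append({
            "title": link.text,
            "detail_url": link.get("href"),
            "image": img.get("src"),
            "price": price.text,
        })
    return books


if __name__ == "__main__":
    for book in parse_books(SAMPLE):
        print(book)
```

In Scrapy the same paths are written with a trailing /text() or /@href step on response.xpath(...). ElementTree does not support those steps, so the sketch reads .text and .get() instead. The scrapy shell command offers another way to try selectors interactively against the real page.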
2. Case study
Use Scrapy to analyse and crawl the book information on http://books.toscrape.com/
(1) Create the project
scrapy startproject BookToscrape
(2) Create the spider
Create a spider based on the basic template
scrapy genspider toscrape books.toscrape.com
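This creates the spider file toscrape.py in the spiders directory.
(3) Modify the settings file settings.py
Change two options, USER_AGENT and ROBOTSTXT_OBEY; see day02 for a description of the settings options.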
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
(4) Write the spider logic
spiders/toscrape.py
Its contents are as follows:
import re

import scrapy

# NOTE: these three patterns are not defined in the original notes;
# they are assumptions inferred from how the code below uses them.
p_book_detail = re.compile(r'catalogue/')  # href already starts with "catalogue/"
p_img_pre = re.compile(r'\.\./')           # image src starts with "../"
p_price = re.compile(r'\d+\.\d+')          # numeric part of a price such as "£12.34"


class ToscrapeSpider(scrapy.Spider):
    name = 'toscrape'  # spider name
    allowed_domains = ['books.toscrape.com']  # domains the spider may crawl
    start_urls = ['http://books.toscrape.com/']  # initial URLs to crawl

    def parse(self, response):
        '''
        The base class scrapy.Spider iterates over start_urls and wraps each
        one in a Request(url, callback=parse), which is sent to the
        scheduler ---> downloader ---> parse
        :param response:
        :return:
        '''
        article_list = response.xpath('//article[@class="product_pod"]')
        for article in article_list:
            book_title = article.xpath("./h3/a/text()").extract_first()
            book_detail_url = article.xpath("./h3/a/@href").extract_first()
            # Prepend "catalogue/" when the relative link does not already have it
            if p_book_detail.match(book_detail_url) is None:
                book_detail = 'http://books.toscrape.com/' + 'catalogue/' + book_detail_url
            else:
                book_detail = 'http://books.toscrape.com/' + book_detail_url
            book_image = article.xpath("./div[@class='image_container']/a/img/@src").extract_first()
            # Drop any leading "../" before joining with the site root
            if p_img_pre.match(book_image) is None:
                book_image = self.start_urls[0] + book_image
            else:
                book_image = book_image.split("../")[-1]
                book_image = self.start_urls[0] + book_image
            book_price = article.xpath("./div[@class='product_price']/p/text()").extract_first()
            book_price = p_price.findall(book_price)[0]
            print(f"book_title:{book_title}, book_detail:{book_detail}, book_image:{book_image},"
                  f" book_price:{book_price}")
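The prefix checks in the spider exist because the links on the page are relative, and their form depends on which page they appear on. One alternative, sketched here with the standard library's urllib.parse.urljoin, is to resolve each link against the URL of the page it was found on. The absolutize name and the sample paths are hypothetical. Inside a spider, response.urljoin(href) does the same resolution against the response's own URL.

```python
from urllib.parse import urljoin


def absolutize(page_url, href):
    """Resolve a relative link against the URL of the page it came from."""
    return urljoin(page_url, href)


if __name__ == "__main__":
    home = "http://books.toscrape.com/"
    # A listing page one level down, used here only for illustration
    page2 = "http://books.toscrape.com/catalogue/page-2.html"

    # Link as it would look on the home page
    print(absolutize(home, "catalogue/sample-book_1/index.html"))
    # The same book linked from a page inside catalogue/
    print(absolutize(page2, "sample-book_1/index.html"))
    # An image path that climbs one directory with "../"
    print(absolutize(page2, "../media/cache/sample.jpg"))
```

With this approach no regular expression is needed for the URLs, since both the missing "catalogue/" prefix and the leading "../" are handled by ordinary relative-URL resolution.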
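(5) Reuse the debug file from above, books_toscrape/main.py
Place it in this project's directory, set breakpoints, and run the spider under the debugger.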