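Scrapy Study Notes
These notes use the 1919 website (www.1919.cn) as an example and walk through creating a Scrapy project that crawls data from an entire site.
Creating a Scrapy project
Run the following command in any directory: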
scrapy startproject OneNine    # OneNine is the project name
This produces the following directory layout:
OneNine/
    scrapy.cfg            # deployment configuration file
    OneNine/              # the project's Python module; all your code goes here
        __init__.py
        items.py          # Item definitions
        pipelines.py      # pipeline definitions
        settings.py       # project settings
        spiders/          # all spiders go in this folder
            __init__.py
            ...
Next, create a spider file that will hold the crawl rules:
cd OneNine
scrapy genspider onenine onenine.com
This generates onenine.py in the spiders folder; the crawl rules are written in that file.
Defining the Item
In items.py, declare the fields we want to scrape.
import scrapy


class OnenineItem(scrapy.Item):
    url = scrapy.Field()
    good_name = scrapy.Field()
    actual_price = scrapy.Field()
    details = scrapy.Field()
    year = scrapy.Field()
    month = scrapy.Field()
    plateform = scrapy.Field()
    cat_lv_one = scrapy.Field()
    cat_lv_two = scrapy.Field()
    shop_id = scrapy.Field()
    shop_name = scrapy.Field()
    shop_area = scrapy.Field()
    shop_province = scrapy.Field()
    shop_city = scrapy.Field()
    good_id = scrapy.Field()
    brand = scrapy.Field()
    size = scrapy.Field()
    percent = scrapy.Field()
    country = scrapy.Field()
    area = scrapy.Field()
    type = scrapy.Field()
    grape_type = scrapy.Field()
    num = scrapy.Field()
    name_price = scrapy.Field()
    bottle_price = scrapy.Field()
    comments = scrapy.Field()
    accumulate_sales = scrapy.Field()
    month_sales = scrapy.Field()
    month_bottle_sales = scrapy.Field()
    month_sale_amounts = scrapy.Field()
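Fields declared with scrapy.Field() can later be exported directly to whichever file format you need.
The spider file
The crawl rules for the site live in onenine.py under OneNine/spiders.
Before writing the rules, we subclass scrapy.Spider and define a few attributes:
- name: the spider's name; it must be unique
- allowed_domains: the domains the spider may crawl; requests to other domains are filtered out
- start_urls: the URLs the crawl starts from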
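As a standalone illustration of the start_urls attribute, the sketch below builds one search URL per results page using only the standard library. The query parameters (sort, page, size, kw) are the ones that appear in the spider in these notes; the helper name build_search_urls is made up for this example, and the page count is simply whatever you pass in.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.1919.cn/search.html'


def build_search_urls(keyword, pages, size=16):
    """Return one search URL per page index (hypothetical helper)."""
    urls = []
    for page in range(pages):
        query = urlencode({
            'sort': 'DEFAULT_SORT',
            'page': page,
            'size': size,
            'kw': keyword,  # urlencode percent-encodes the UTF-8 bytes
        })
        urls.append(BASE_URL + '?' + query)
    return urls


start_urls = build_search_urls('白酒', 27)
print(start_urls[0])
```

Building the query with urlencode makes it clear that the %E7%99%BD%E9%85%92 in the URL is just the percent-encoded search keyword 白酒.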
# -*- coding: utf-8 -*-
import re

import requests
import scrapy

from ..items import OnenineItem


class OnenineSpider(scrapy.Spider):
    name = 'onenine'
    allowed_domains = ['www.1919.cn']
    # Pagination: a list comprehension generates one search URL per page.
    start_urls = [
        'https://www.1919.cn/search.html?sort=DEFAULT_SORT&page=' + str(x)
        + '&size=16&kw=%E7%99%BD%E9%85%92'
        for x in range(0, 27)
    ]

    def parse(self, response):
        result = response.xpath('//div[@class="ml-info ml-rpb12"]')
        for i in result:
            item = OnenineItem()
            # Product name
            item['good_name'] = i.xpath('p[@class="ml-pdtname"]/a/text()').extract()[0]
            # Listed price, with the decimal point removed
            item['name_price'] = i.xpath(
                'p[@class="ml-pdtpri"]/span[@class="ml-pri"]/text()').extract()[0].replace('.', '')
            # Product URL
            item['url'] = i.xpath('p[@class="ml-pdtname"]/a/@href').extract()[0]
            url = response.urljoin(item['url'])
            # Hand the partly filled item to the detail-page callback.
            yield scrapy.Request(url, meta={'item': item}, callback=self.good_detail)

    def good_detail(self, response):
        item = response.meta['item']
        result = response.xpath('//div[@class="intro-cont com-size"]')
        li_list = []
        for i in result:
            result2 = i.xpath('span/text()').extract()
            li_list.append(''.join(result2))

        item['year'] = 2018
        item['month'] = 2
        item['plateform'] = '1919'
        item['cat_lv_one'] = '酒水'
        item['cat_lv_two'] = '白酒'

        shop_url = response.xpath('//a[@class="dt-mainRedColor"]/@href').extract()[0]
        # Capture the text between 'v/' and the next literal dot.
        pattern = re.compile(r'v/(.*?)\.', re.S)
        item['shop_id'] = re.findall(pattern, shop_url)[0]

        item['shop_name'] = response.xpath('//input[@name="vendorName"]/@value').extract()[0]
        item['brand'] = response.xpath('//input[@name="brandName"]/@value').extract()[0]
        item['good_id'] = response.xpath('//input[@name="productCode"]/@value').extract()[0]
        item['actual_price'] = response.xpath(
            '//em[@class="details-pri"]/text()').extract()[0].replace('.', '')

        details = ','.join(li_list) + ','

        # Defaults for attributes that may be missing from the page.
        item['grape_type'] = ''
        item['country'] = ''
        item['area'] = ''
        item['type'] = ''

        # (label, field) pairs; a later match overwrites an earlier one.
        labels = [
            ('葡萄品种:', 'grape_type'),
            ('产国:', 'country'),
            ('产地:', 'area'),
            ('产区:', 'area'),
            ('型:', 'type'),    # wine and baijiu
            ('品类:', 'type'),  # imported spirits
        ]
        for label, field in labels:
            results = re.findall(re.escape(label) + '(.*?),', details, re.S)
            if results:
                item[field] = results[0]

        item['details'] = details

        # The comment count is rendered by JavaScript, so the endpoint below
        # was found by inspecting network traffic.
        # Caution: calling requests inside a spider blocks the crawl.
        item['comments'] = None  # default so the field always exists
        pro_url = ('https://www.1919.cn/product/commentData?productCode=' + item['good_id']
                   + '&productId=346840940029284375&page=1&vendorId=346833407843635201')
        contents = requests.get(pro_url).text
        pattern = re.compile('<span class="ass-num">(.*?)</span>', re.S)
        results = re.findall(pattern, contents)
        if results:
            item['comments'] = results[0]

        yield item
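This file implements both deep crawling (following each listing to its detail page) and pagination, and it works around the JavaScript-rendered comment count with a separate requests call.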
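The attribute extraction in the detail callback boils down to pulling "label:value," pairs out of one joined string. The sketch below shows that idea in isolation; the helper name parse_details and the sample string are invented for illustration, and real pages may use different labels or separators.

```python
import re

# (label, item field) pairs in the order they are checked; a later match
# for the same field overwrites an earlier one.
LABELS = [
    ('葡萄品种', 'grape_type'),
    ('产国', 'country'),
    ('产地', 'area'),
    ('产区', 'area'),
    ('品类', 'type'),
]


def parse_details(details):
    """Extract known attributes from a 'label:value,' string (hypothetical helper)."""
    parsed = {field: '' for _, field in LABELS}
    for label, field in LABELS:
        # Lazy match: capture everything up to the first comma after the label.
        match = re.search(re.escape(label) + r':(.*?),', details, re.S)
        if match:
            parsed[field] = match.group(1)
    return parsed


# Made-up sample in the same shape as the spider's `details` value.
sample = '产国:中国,产地:四川,品类:白酒,'
parsed = parse_details(sample)
print(parsed)
```

Fields whose label does not occur keep their empty-string default, which mirrors how the spider pre-fills grape_type, country, area and type before matching.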
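Connecting to a database
Once we have the data, we want to persist it in a database. A Scrapy item pipeline can write to any database that has a Python driver; MySQL is used here as an example.
First, create a models.py file that sets up the database connection.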
# Author: Xiaoliuer Li
from sqlalchemy import BIGINT, DECIMAL, TEXT, Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()
engine = create_engine('mysql+pymysql://root:@localhost/drinking?charset=utf8')
DBSession = sessionmaker(bind=engine)


class OneNinedata(Base):
    __tablename__ = 'ecommerce_data'

    id = Column(Integer, primary_key=True)
    year = Column(Integer)
    month = Column(Integer)
    plateform = Column(String(20))
    cat_lv_one = Column(String(20))
    cat_lv_two = Column(String(20))
    shop_id = Column(String(20))
    shop_name = Column(String(100))
    shop_area = Column(String(50))
    shop_province = Column(String(20))
    shop_city = Column(String(20))
    good_id = Column(String(20))
    good_name = Column(String(100))
    brand = Column(String(50))
    size = Column(Integer)
    percent = Column(DECIMAL)
    country = Column(String(50))
    area = Column(String(50))
    type = Column(String(20))
    grape_type = Column(String(50))
    num = Column(Integer)
    name_price = Column(Integer)
    actual_price = Column(Integer)
    bottle_price = Column(Integer)
    comments = Column(Integer)
    accumulate_sales = Column(Integer)
    month_sales = Column(Integer)
    month_bottle_sales = Column(Integer)
    month_sale_amounts = Column(BIGINT)
    url = Column(String(256))
    details = Column(TEXT)
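This file uses SQLAlchemy to talk to the database; if you are not familiar with it, it is worth reading up on.
In pipelines.py we write a pipeline that tells Scrapy which item fields to store.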
from .models import OneNinedata, DBSession


class OneninePipeline(object):
    def open_spider(self, spider):
        # One database session for the whole crawl.
        self.session = DBSession()

    def process_item(self, item, spider):
        a = OneNinedata(
            year=item['year'], month=item['month'], plateform=item['plateform'],
            cat_lv_one=item['cat_lv_one'], cat_lv_two=item['cat_lv_two'],
            brand=item['brand'], type=item['type'], name_price=item['name_price'],
            url=item['url'], shop_id=item['shop_id'], shop_name=item['shop_name'],
            area=item['area'], good_name=item['good_name'],
            grape_type=item['grape_type'], country=item['country'],
            good_id=item['good_id'], actual_price=item['actual_price'],
            details=item['details'],
            # The spider may not find a comment count, so avoid a KeyError.
            comments=item.get('comments'),
        )
        self.session.add(a)
        self.session.commit()
        # Return the item so any later pipeline or feed export still receives it.
        return item

    def close_spider(self, spider):
        self.session.close()
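To see the three pipeline hooks without a MySQL server, here is a minimal sketch that swaps in the standard library's sqlite3 module for SQLAlchemy. The class name SqlitePipeline, the goods table and the sample item are all assumptions made for this illustration, and only three of the fields are stored.

```python
import sqlite3


class SqlitePipeline:
    """Same three-hook shape as a Scrapy item pipeline, backed by SQLite."""

    def open_spider(self, spider):
        # Called once when the crawl starts: open the connection.
        self.conn = sqlite3.connect(':memory:')
        self.conn.execute(
            'CREATE TABLE goods (good_id TEXT, good_name TEXT, actual_price INTEGER)'
        )

    def process_item(self, item, spider):
        # Called for every scraped item: insert one row.
        self.conn.execute(
            'INSERT INTO goods VALUES (?, ?, ?)',
            (item['good_id'], item['good_name'], item.get('actual_price')),
        )
        self.conn.commit()
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the crawl ends: release the connection.
        self.conn.close()


# Drive the hooks by hand, in the order Scrapy calls them during a crawl.
pipeline = SqlitePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item(
    {'good_id': 'A1', 'good_name': 'sample', 'actual_price': 12800}, spider=None)
rows = pipeline.conn.execute('SELECT good_id, actual_price FROM goods').fetchall()
print(rows)
pipeline.close_spider(spider=None)
```

Because a pipeline is an ordinary class, it can be exercised like this in a test before it is ever wired into a crawl.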
Edit settings.py to enable the pipeline, so that Scrapy sends every item through it and the data is saved to the database.
ITEM_PIPELINES = {
    'OneNine.pipelines.OneninePipeline': 300,
}
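The integer assigned to each pipeline sets the order in which items pass through them: lower values run first, and values are conventionally kept in the 0 to 1000 range. The sketch below illustrates that ordering with a second, purely hypothetical DedupPipeline entry that does not exist in this project.

```python
ITEM_PIPELINES = {
    'OneNine.pipelines.OneninePipeline': 300,
    'OneNine.pipelines.DedupPipeline': 100,  # hypothetical, for illustration only
}

# Items visit pipelines in ascending order of their number.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

With these numbers the hypothetical deduplication step would see each item before the database pipeline does.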
Running Scrapy
Enter the following on the command line:
scrapy crawl onenine
Open the database and you will see that the data has been saved there.
We can also save the data locally in other formats:
scrapy crawl onenine -o items.json
The example above saves the data locally in JSON format.
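A JSON export is a single array with one object per item, so it can be loaded back with the standard json module. The sketch below parses a made-up two-item sample rather than a real crawl result; the field values are invented, and the prices are written as digit strings only because the spider stores them that way.

```python
import json

# Made-up sample in the shape of a JSON export: one array, one object per item.
sample_feed = '''[
{"good_name": "sample A", "actual_price": "12800", "brand": "X"},
{"good_name": "sample B", "actual_price": "9900", "brand": "Y"}
]'''

# For a real file, use: json.load(open('items.json', encoding='utf-8'))
items = json.loads(sample_feed)
total = sum(int(item['actual_price']) for item in items)
print(len(items), total)
```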