
Scrapy Study Notes, Part 1

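Using the 1919 website (www.1919.cn) as the example, these notes walk through creating a Scrapy project that crawls data from an entire site.

Creating a Scrapy project

Run the following command in any directory:

scrapy startproject OneNine    # OneNine is the project name

This produces the following directory layout: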

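OneNine/
    scrapy.cfg            # deployment configuration file
    OneNine/              # the project's Python module; all of your code goes here
        __init__.py
        items.py          # Item definitions
        pipelines.py      # item pipeline definitions
        settings.py       # project settings
        spiders/          # all spiders live in this folder
            __init__.py
            ...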

Next, create a spider file in which to write the crawl rules:

cd OneNine
scrapy genspider onenine onenine.com

A file named onenine.py now appears in the spiders folder. The crawl rules go in this file.

Defining the Item

In items.py, declare the fields we want to scrape.

import scrapy

class OnenineItem(scrapy.Item):
    url = scrapy.Field()
    good_name = scrapy.Field()
    actual_price = scrapy.Field()
    details = scrapy.Field()
    year = scrapy.Field()
    month = scrapy.Field()
    plateform = scrapy.Field()
    cat_lv_one = scrapy.Field()
    cat_lv_two = scrapy.Field()
    shop_id = scrapy.Field()
    shop_name = scrapy.Field()
    shop_area = scrapy.Field()
    shop_province = scrapy.Field()
    shop_city = scrapy.Field()
    good_id = scrapy.Field()
    brand = scrapy.Field()
    size = scrapy.Field()
    percent = scrapy.Field()
    country = scrapy.Field()
    area = scrapy.Field()
    type = scrapy.Field()
    grape_type = scrapy.Field()
    num = scrapy.Field()
    name_price = scrapy.Field()
    bottle_price = scrapy.Field()
    comments = scrapy.Field()
    accumulate_sales = scrapy.Field()
    month_sales = scrapy.Field()
    month_bottle_sales = scrapy.Field()
    month_sale_amounts = scrapy.Field()

Fields declared with scrapy.Field() can later be exported directly to whichever file format you need.

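The spider file

The crawl rules for the site are written in onenine.py, under OneNine/spiders.

Before writing the crawl rules, we subclass scrapy.Spider and define a few attributes:

name: the spider's name; it must be unique within the project
allowed_domains: the domains the spider may crawl; requests to other domains are filtered out
start_urls: the URLs where crawling begins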
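
start_urls does not have to be a hand-written list. The spider below builds one search URL per results page with a list comprehension. The following is a minimal sketch of the same idea using only the standard library. The helper name build_search_urls is hypothetical, and the query parameters simply mirror the ones in the spider's URL.

```python
from urllib.parse import urlencode

BASE = 'https://www.1919.cn/search.html'


def build_search_urls(keyword, pages, size=16):
    """Return one search-result URL for each page index in range(pages)."""
    urls = []
    for page in range(pages):
        # urlencode percent-encodes the keyword as UTF-8
        query = urlencode({'sort': 'DEFAULT_SORT', 'page': page, 'size': size, 'kw': keyword})
        urls.append(BASE + '?' + query)
    return urls


start_urls = build_search_urls('白酒', 27)
print(start_urls[0])
```

Letting urlencode handle the keyword avoids pasting a pre-encoded string such as %E7%99%BD%E9%85%92 into the source.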

# -*- coding: utf-8 -*-
import re

import requests
import scrapy

from ..items import OnenineItem


class OnenineSpider(scrapy.Spider):
    name = 'onenine'
    allowed_domains = ['www.1919.cn']
    # Pagination: a list comprehension builds one search URL per results page
    start_urls = ['https://www.1919.cn/search.html?sort=DEFAULT_SORT&page=' + str(x) + '&size=16&kw=%E7%99%BD%E9%85%92'
                  for x in range(0, 27)]

    def parse(self, response):
        # Each matching div is one product card on the search-results page
        result = response.xpath('//div[@class="ml-info ml-rpb12"]')
        for i in result:
            item = OnenineItem()
            item['good_name'] = i.xpath('p[@class="ml-pdtname"]/a/text()').extract()[0]  # product name
            # List price with the decimal point stripped, so a value like '128.00' becomes '12800'
            item['name_price'] = i.xpath('p[@class="ml-pdtpri"]/span[@class="ml-pri"]/text()').extract()[0].replace('.', '')
            item['url'] = i.xpath('p[@class="ml-pdtname"]/a/@href').extract()[0]  # product URL
            url = response.urljoin(item['url'])
            # Follow the link to the detail page, passing the partly filled item along in meta
            yield scrapy.Request(url, meta={'item': item}, callback=self.good_detail)

    def good_detail(self, response):
        item = response.meta['item']
        result = response.xpath('//div[@class="intro-cont com-size"]')
        li_list = []
        for i in result:
            result2 = i.xpath('span/text()').extract()
            li_list.append(''.join(result2))

        # Constants for this crawl
        item['year'] = 2018
        item['month'] = 2
        item['plateform'] = '1919'
        item['cat_lv_one'] = '酒水'
        item['cat_lv_two'] = '白酒'

        shop_url = response.xpath('//a[@class="dt-mainRedColor"]/@href').extract()[0]
        # The dot must be escaped; an unescaped '.' matches any character and the group captures nothing
        pattern = re.compile(r'v/(.*?)\.', re.S)
        item['shop_id'] = re.findall(pattern, shop_url)[0]

        item['shop_name'] = response.xpath('//input[@name="vendorName"]/@value').extract()[0]
        item['brand'] = response.xpath('//input[@name="brandName"]/@value').extract()[0]
        item['good_id'] = response.xpath('//input[@name="productCode"]/@value').extract()[0]
        item['actual_price'] = response.xpath('//em[@class="details-pri"]/text()').extract()[0].replace('.', '')

        # Join the attribute blocks as 'label：value,' pairs; the trailing comma lets the regexes match the last one
        details = ','.join(li_list) + ','

        item['grape_type'] = ''
        item['country'] = ''
        item['area'] = ''
        item['type'] = ''

        if '葡萄品种' in details:
            matches = re.findall(re.compile('葡萄品种：(.*?),', re.S), details)
            if matches:
                item['grape_type'] = matches[0]

        if '产国' in details:
            matches = re.findall(re.compile('产国：(.*?),', re.S), details)
            if matches:
                item['country'] = matches[0]

        if '产地' in details:
            matches = re.findall(re.compile('产地：(.*?),', re.S), details)
            if matches:
                item['area'] = matches[0]

        # '产区' overrides '产地' when both are present
        if '产区' in details:
            matches = re.findall(re.compile('产区：(.*?),', re.S), details)
            if matches:
                item['area'] = matches[0]

        # For wine and baijiu
        if '型：' in details:
            matches = re.findall(re.compile('型：(.*?),', re.S), details)
            if matches:
                item['type'] = matches[0]

        # For imported spirits
        if '品类' in details:
            matches = re.findall(re.compile('品类：(.*?),', re.S), details)
            if matches:
                item['type'] = matches[0]

        item['details'] = details

        # The comment count is rendered by JavaScript; its endpoint was found by capturing network traffic.
        # Calling requests inside a spider blocks the crawl while the request is in flight.
        # productId and vendorId are hard-coded here, exactly as in the captured request.
        pro_url = 'https://www.1919.cn/product/commentData?productCode=' + item['good_id'] + '&productId=346840940029284375&page=1&vendorId=346833407843635201'
        contents = requests.get(pro_url).text
        pattern = re.compile('<span class="ass-num">(.*?)</span>', re.S)
        results = re.findall(pattern, contents)
        # Always set the field so the pipeline never hits a missing key
        item['comments'] = results[0] if results else None

        yield item


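This file implements a two-level crawl (from each search-results page to the product detail pages) together with pagination, and it works around the JavaScript-rendered comment data by issuing a separate request with the requests library.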
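
The detail-page callback repeats the same regex block for every labelled attribute, so a small helper can express it once. The sketch below is one possible refactoring, not the author's code. The function names extract_field and extract_shop_id, the sample details string, and the shop URL shape /v/12345.html are all assumptions made for illustration.

```python
import re


def extract_field(details, labels):
    """Return the value after the first 'label：' found in details, or '' if none match.

    Labels are tried in order, so put the preferred label first.
    """
    for label in labels:
        match = re.search(re.escape(label) + '：(.*?),', details)
        if match:
            return match.group(1)
    return ''


def extract_shop_id(shop_url):
    """Return the text between 'v/' and the next literal dot, or ''."""
    match = re.search(r'v/(.*?)\.', shop_url)
    return match.group(1) if match else ''


# Made-up sample in the 'label：value,' shape that good_detail builds
details = '产地：四川,香型：浓香型,净含量：500ml,'

area = extract_field(details, ['产区', '产地'])   # 产区 wins when both are present
kind = extract_field(details, ['品类', '型'])     # 品类 wins when both are present
grape = extract_field(details, ['葡萄品种'])      # absent here, so ''
shop_id = extract_shop_id('/v/12345.html')        # hypothetical URL shape
print(area, kind, grape, shop_id)
```

Note the escaped dot in extract_shop_id. With an unescaped dot the lazy group matches the empty string and the dot consumes the first character of the ID, so nothing useful is captured.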

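Connecting to the database

Once we have the data, we want to persist it in a database. Scrapy is not tied to any particular database; an item pipeline can write to whichever one you like. MySQL is used as the example here.

First, create a models.py file (next to pipelines.py) that sets up the database connection.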

# Author: Xiaoliuer Li

from sqlalchemy import Column, String, Integer, BIGINT, TEXT, DECIMAL, create_engine
# In SQLAlchemy 1.4+ declarative_base is also available from sqlalchemy.orm
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

# MySQL through the PyMySQL driver: user root, empty password, database "drinking"
engine = create_engine('mysql+pymysql://root:@localhost/drinking?charset=utf8')
DBSession = sessionmaker(bind=engine)


class OneNinedata(Base):
    __tablename__ = 'ecommerce_data'

    id = Column(Integer, primary_key=True)
    year = Column(Integer)
    month = Column(Integer)
    plateform = Column(String(20))
    cat_lv_one = Column(String(20))
    cat_lv_two = Column(String(20))
    shop_id = Column(String(20))
    shop_name = Column(String(100))
    shop_area = Column(String(50))
    shop_province = Column(String(20))
    shop_city = Column(String(20))
    good_id = Column(String(20))
    good_name = Column(String(100))
    brand = Column(String(50))
    size = Column(Integer)
    percent = Column(DECIMAL)
    country = Column(String(50))
    area = Column(String(50))
    type = Column(String(20))
    grape_type = Column(String(50))
    num = Column(Integer)
    name_price = Column(Integer)
    actual_price = Column(Integer)
    bottle_price = Column(Integer)
    comments = Column(Integer)
    accumulate_sales = Column(Integer)
    month_sales = Column(Integer)
    month_bottle_sales = Column(Integer)
    month_sale_amounts = Column(BIGINT)
    url = Column(String(256))
    details = Column(TEXT)

# The ecommerce_data table must exist before the crawl runs; if it does not,
# Base.metadata.create_all(engine) creates it from the model above.


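The file uses SQLAlchemy to talk to the database; readers unfamiliar with it should read up on it first.

In pipelines.py we write an item pipeline, which tells Scrapy what to do with each scraped item. Here it maps the item's fields onto a database row.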

from .models import OneNinedata, DBSession


class OneninePipeline(object):

    def open_spider(self, spider):
        # One database session for the whole crawl
        self.session = DBSession()

    def process_item(self, item, spider):
        row = OneNinedata(
            year=item['year'], month=item['month'], plateform=item['plateform'], cat_lv_one=item['cat_lv_one'],
            cat_lv_two=item['cat_lv_two'], brand=item['brand'], type=item['type'], name_price=item['name_price'],
            url=item['url'], shop_id=item['shop_id'], shop_name=item['shop_name'], area=item['area'],
            good_name=item['good_name'], grape_type=item['grape_type'], country=item['country'],
            good_id=item['good_id'], actual_price=item['actual_price'], details=item['details'],
            comments=item.get('comments'))  # None if the comment request found nothing
        self.session.add(row)
        try:
            self.session.commit()
        except Exception:
            # Leave the session usable for the next item
            self.session.rollback()
            raise
        # Return the item so that any later pipeline (or feed export) still receives it
        return item

    def close_spider(self, spider):
        self.session.close()
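
Scrapy calls open_spider once when the crawl starts, process_item once per scraped item, and close_spider once at the end. A pipeline is any class that provides these methods. To make that lifecycle easy to try without a MySQL server, the sketch below swaps SQLAlchemy and MySQL for the standard library's sqlite3 module. The class name SqlitePipeline, the goods table, and the sample item are all hypothetical.

```python
import sqlite3


class SqlitePipeline:
    """Stand-in pipeline: the same three hooks as above, backed by sqlite3."""

    def __init__(self, path=':memory:'):
        self.path = path

    def open_spider(self, spider):
        # Open the connection once per crawl and make sure the table exists
        self.conn = sqlite3.connect(self.path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS goods (good_id TEXT, good_name TEXT, comments INTEGER)')

    def process_item(self, item, spider):
        try:
            self.conn.execute(
                'INSERT INTO goods VALUES (?, ?, ?)',
                (item['good_id'], item['good_name'], item.get('comments')))
            self.conn.commit()
        except sqlite3.Error:
            self.conn.rollback()
            raise
        # Returning the item passes it on to the next pipeline
        return item

    def close_spider(self, spider):
        self.conn.close()


# Drive the hooks by hand, in the order Scrapy would call them
pipeline = SqlitePipeline()
pipeline.open_spider(spider=None)
returned = pipeline.process_item({'good_id': 'A1', 'good_name': 'sample'}, spider=None)
rows = pipeline.conn.execute('SELECT good_id, good_name, comments FROM goods').fetchall()
pipeline.close_spider(spider=None)
print(rows)
```

Committing per item keeps the example simple. Batching several items per commit is a common alternative when insert volume is high.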


Edit settings.py to enable the pipeline, so that Scrapy sends each item to it and the data is saved to the database.

ITEM_PIPELINES = {
   'OneNine.pipelines.OneninePipeline': 300,
}
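
The number attached to each pipeline in ITEM_PIPELINES sets its order: Scrapy runs pipelines from the lowest value to the highest, and values are conventionally kept in the 0-1000 range. The fragment below sketches a two-pipeline configuration. CleanPricePipeline is a hypothetical class that does not exist in this project; it appears only to illustrate the ordering.

```python
# settings.py (fragment). CleanPricePipeline is hypothetical, shown only to illustrate ordering.
ITEM_PIPELINES = {
    'OneNine.pipelines.CleanPricePipeline': 200,  # hypothetical: lower number, so it would run first
    'OneNine.pipelines.OneninePipeline': 300,     # the database pipeline from this post runs after it
}

# Illustration only (not needed in settings.py): the order in which the pipelines run
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```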

Running Scrapy

On the command line, run:

scrapy crawl onenine

Open the database and you will see that the data has been saved there.

We can also save the data locally in other formats:

scrapy crawl onenine -o items.json

The example above saves the data locally in JSON format.
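
The exported file is a JSON array with one object per item, so it can be read back with the standard json module. The sketch below first writes a small stand-in file, so it runs without a crawl. The two records and their values are made up; in practice the path would be the items.json produced by the command above.

```python
import json
import os
import tempfile

# Stand-in for the file written by `scrapy crawl onenine -o items.json`;
# the two records are made up for illustration.
sample = [
    {'good_name': 'sample A', 'brand': 'X', 'name_price': '12800'},
    {'good_name': 'sample B', 'brand': 'X', 'name_price': '9900'},
]
path = os.path.join(tempfile.mkdtemp(), 'items.json')
with open(path, 'w', encoding='utf-8') as f:
    json.dump(sample, f, ensure_ascii=False)

# Load the exported items back into a list of dicts
with open(path, encoding='utf-8') as f:
    items = json.load(f)

# name_price was scraped as a string with the decimal point removed
total = sum(int(item['name_price']) for item in items)
print(len(items), total)
```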


Original post: https://www.cnblogs.com/lixiaoliuer/p/8658989.html