  • Scrapy Learning Notes (Part 1)

    Below, crawling the 1919 website is used as an example to walk through creating a Scrapy project that scrapes an entire site.

    Creating a Scrapy project

    In any directory, run the command:

    scrapy startproject OneNine    # OneNine is the project name

    This generates the following directory structure:

    OneNine/
        scrapy.cfg            # deployment configuration file

        OneNine/              # the project's Python module; all your code goes here
            __init__.py

            items.py          # Item definitions

            pipelines.py      # pipeline definitions

            settings.py       # project settings

            spiders/          # all spiders go in this folder
                __init__.py
                ...

    Next, create a spider file that will hold the crawling rules:

    cd OneNine
    scrapy genspider onenine 1919.cn

    A file named onenine.py now appears under the spiders folder; this is where we will write the crawling rules.
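
    The generated onenine.py is only an empty skeleton, roughly like the following (the exact template varies with the Scrapy version; the domain and start URL come from the genspider arguments):

    # -*- coding: utf-8 -*-
    import scrapy

    class OnenineSpider(scrapy.Spider):
        name = 'onenine'
        allowed_domains = ['1919.cn']
        start_urls = ['http://1919.cn/']

        def parse(self, response):
            pass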

    Defining the Item

    In items.py we declare the fields we want to scrape.

    import scrapy
    
    class OnenineItem(scrapy.Item):
        url = scrapy.Field()
        good_name = scrapy.Field()
        actual_price = scrapy.Field()
        details = scrapy.Field()
        year = scrapy.Field()
        month = scrapy.Field()
        plateform = scrapy.Field()
        cat_lv_one = scrapy.Field()
        cat_lv_two = scrapy.Field()
        shop_id = scrapy.Field()
        shop_name = scrapy.Field()
        shop_area = scrapy.Field()
        shop_province = scrapy.Field()
        shop_city = scrapy.Field()
        good_id = scrapy.Field()
        brand = scrapy.Field()
        size = scrapy.Field()
        percent = scrapy.Field()
        country = scrapy.Field()
        area = scrapy.Field()
        type = scrapy.Field()
        grape_type = scrapy.Field()
        num = scrapy.Field()
        name_price = scrapy.Field()
        bottle_price = scrapy.Field()
        comments = scrapy.Field()
        accumulate_sales = scrapy.Field()
        month_sales = scrapy.Field()
        month_bottle_sales = scrapy.Field()
        month_sale_amounts = scrapy.Field()

    Fields declared with scrapy.Field() can later be exported directly into whatever output format you need.

    The spider file

    In OneNine/spiders/onenine.py we write the rules for crawling the site.

    Before writing the rules, we subclass scrapy.Spider and define a few attributes:

    • name: the spider's name; it must be unique
    • allowed_domains: the domains the spider is allowed to crawl; off-site requests are filtered out
    • start_urls: the URLs the crawl starts from

    # -*- coding: utf-8 -*-
    import re

    import requests
    import scrapy

    from ..items import OnenineItem


    class OnenineSpider(scrapy.Spider):
        name = 'onenine'
        allowed_domains = ['www.1919.cn']
        # a list comprehension builds the start URLs for all result pages (pagination)
        start_urls = ['https://www.1919.cn/search.html?sort=DEFAULT_SORT&page=' + str(x) + '&size=16&kw=%E7%99%BD%E9%85%92'
                      for x in range(0, 27)]

        def parse(self, response):
            result = response.xpath('//div[@class="ml-info ml-rpb12"]')
            for i in result:
                item = OnenineItem()
                item['good_name'] = i.xpath('p[@class="ml-pdtname"]/a/text()').extract()[0]   # product name
                item['name_price'] = i.xpath('p[@class="ml-pdtpri"]/span[@class="ml-pri"]/text()').extract()[0].replace('.', '')  # list price (decimal point stripped)
                item['url'] = i.xpath('p[@class="ml-pdtname"]/a/@href').extract()[0]          # product url
                url = response.urljoin(item['url'])
                yield scrapy.Request(url, meta={'item': item}, callback=self.good_detail)

        def good_detail(self, response):
            item = response.meta['item']
            result = response.xpath('//div[@class="intro-cont com-size"]')
            li_list = []
            for i in result:
                result2 = i.xpath('span/text()').extract()
                li_list.append(''.join(result2))

            item['year'] = 2018
            item['month'] = 2
            item['plateform'] = '1919'
            item['cat_lv_one'] = '酒水'    # top-level category: drinks
            item['cat_lv_two'] = '白酒'    # second-level category: baijiu

            shop_url = response.xpath('//a[@class="dt-mainRedColor"]/@href').extract()[0]
            panter = re.compile(r'v/(.*?)\.', re.S)    # shop id sits between "v/" and the next dot
            item['shop_id'] = re.findall(panter, shop_url)[0]

            item['shop_name'] = response.xpath('//input[@name="vendorName"]/@value').extract()[0]
            item['brand'] = response.xpath('//input[@name="brandName"]/@value').extract()[0]
            item['good_id'] = response.xpath('//input[@name="productCode"]/@value').extract()[0]
            item['actual_price'] = response.xpath('//em[@class="details-pri"]/text()').extract()[0].replace('.', '')

            details = ','.join(li_list) + ','

            item['grape_type'] = ''
            item['country'] = ''
            item['area'] = ''
            item['type'] = ''

            if '葡萄品种' in details:    # grape variety
                panter = re.compile('葡萄品种:(.*?),', re.S)
                results8 = re.findall(panter, details)
                if results8 != []:
                    item['grape_type'] = results8[0]

            if '产国' in details:    # country of origin
                panter = re.compile('产国:(.*?),', re.S)
                results8 = re.findall(panter, details)
                if results8 != []:
                    item['country'] = results8[0]

            if '产地' in details:    # place of origin
                panter = re.compile('产地:(.*?),', re.S)
                results8 = re.findall(panter, details)
                if results8 != []:
                    item['area'] = results8[0]

            if '产区' in details:    # producing region
                panter = re.compile('产区:(.*?),', re.S)
                results8 = re.findall(panter, details)
                if results8 != []:
                    item['area'] = results8[0]

            # for grape wine and baijiu
            if '型:' in details:
                panter = re.compile('型:(.*?),', re.S)
                results8 = re.findall(panter, details)
                if results8 != []:
                    item['type'] = results8[0]

            # for imported spirits
            if '品类' in details:
                panter = re.compile('品类:(.*?),', re.S)
                results8 = re.findall(panter, details)
                if results8 != []:
                    item['type'] = results8[0]

            item['details'] = details

            # the comment count is rendered by JS; the endpoint below was found by
            # inspecting the network traffic. Note: calling requests inside a spider
            # blocks the Twisted event loop.
            pro_url = 'https://www.1919.cn/product/commentData?productCode=' + item['good_id'] + '&productId=346840940029284375&page=1&vendorId=346833407843635201'
            contents = requests.get(pro_url).text
            panter = re.compile('<span class="ass-num">(.*?)</span>', re.S)
            results = re.findall(panter, contents)
            if results != []:
                item['comments'] = results[0]

            yield item

    This file implements both pagination (the start_urls list comprehension) and deep crawling into each product page, and uses a requests call to work around the JS-rendered comment data.
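
    Since calling requests.get inside the spider blocks Scrapy's event loop, a common alternative is to chain a second scrapy.Request to the comment endpoint and yield the item from another callback. A minimal sketch, reusing the URL and the ass-num selector from the code above (the method name parse_comments is only illustrative, and it assumes the endpoint returns an HTML fragment, as the original regex suggests):

    def good_detail(self, response):
        item = response.meta['item']
        # ... fill in the other fields exactly as above ...
        pro_url = ('https://www.1919.cn/product/commentData?productCode=' + item['good_id'] +
                   '&productId=346840940029284375&page=1&vendorId=346833407843635201')
        yield scrapy.Request(pro_url, meta={'item': item}, callback=self.parse_comments)

    def parse_comments(self, response):
        item = response.meta['item']
        num = response.xpath('//span[@class="ass-num"]/text()').extract_first()
        if num:
            item['comments'] = num
        yield item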

    Saving to the database

    Once we have the data, we want to persist it to a database. Scrapy can work with many databases; here we use MySQL as the example.

    First create a models.py file to handle the database connection.

    # Author: Xiaoliuer Li

    from sqlalchemy import Column, String, Integer, BIGINT, TEXT, DECIMAL
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker
    
    
    Base = declarative_base()
    
    engine = create_engine('mysql+pymysql://root:@localhost/drinking?charset=utf8')
    DBSession = sessionmaker(bind=engine)
    
    class OneNinedata(Base):
        __tablename__ = 'ecommerce_data'
    
        id = Column(Integer, primary_key=True)
        year = Column(Integer)
        month = Column(Integer)
        plateform = Column(String(20))
        cat_lv_one = Column(String(20))
        cat_lv_two = Column(String(20))
        shop_id = Column(String(20))
        shop_name = Column(String(100))
        shop_area = Column(String(50))
        shop_province = Column(String(20))
        shop_city = Column(String(20))
        good_id = Column(String(20))
        good_name = Column(String(100))
        brand = Column(String(50))
        size = Column(Integer)
        percent = Column(DECIMAL)
        country = Column(String(50))
        area = Column(String(50))
        type = Column(String(20))
        grape_type = Column(String(50))
        num = Column(Integer)
        name_price = Column(Integer)
        actual_price = Column(Integer)
        bottle_price = Column(Integer)
        comments = Column(Integer)
        accumulate_sales = Column(Integer)
        month_sales = Column(Integer)
        month_bottle_sales = Column(Integer)
        month_sale_amounts = Column(BIGINT)
        url = Column(String(256))
        details = Column(TEXT)

    The file uses SQLAlchemy as the ORM for saving the data; look it up if you are not familiar with it.
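
    Note that the ecommerce_data table has to exist in the drinking database before the crawl runs. One way to create it from the model above is SQLAlchemy's create_all, a small one-off script using the Base and engine defined in models.py:

    # create_tables.py - run once to create the ecommerce_data table from the model
    from models import Base, engine

    Base.metadata.create_all(engine)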

    In pipelines.py we write the pipeline, so that Scrapy knows exactly which data to receive and how to store it.

    from scrapy.exceptions import DropItem
    from .models import OneNinedata, DBSession


    class OneninePipeline(object):

        def open_spider(self, spider):
            # open a database session when the spider starts
            self.session = DBSession()

        def process_item(self, item, spider):
            a = OneNinedata(
                year=item['year'], month=item['month'], plateform=item['plateform'], cat_lv_one=item['cat_lv_one'],
                cat_lv_two=item['cat_lv_two'], brand=item['brand'], type=item['type'], name_price=item['name_price'],
                url=item['url'], shop_id=item['shop_id'], shop_name=item['shop_name'], area=item['area'],
                good_name=item['good_name'], grape_type=item['grape_type'], country=item['country'],
                good_id=item['good_id'], actual_price=item['actual_price'], details=item['details'],
                comments=item.get('comments'))    # comments may be missing if the endpoint returned nothing
            self.session.add(a)
            self.session.commit()
            return item

        def close_spider(self, spider):
            # close the session when the spider finishes
            self.session.close()
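
    The DropItem exception imported at the top of pipelines.py is the standard way for a pipeline to discard bad items; it is not actually used above. A hypothetical validation pipeline that uses it might look like this (the missing-field check is an assumption, and it would need its own entry in ITEM_PIPELINES):

    class ValidateItemPipeline(object):
        """Drop items that are missing a product id before they reach the database."""

        def process_item(self, item, spider):
            if not item.get('good_id'):
                raise DropItem('missing good_id for %s' % item.get('url'))
            return item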

    Finally, modify settings.py to tell Scrapy to route items through this pipeline:

    ITEM_PIPELINES = {
       'OneNine.pipelines.OneninePipeline': 300,
    }
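
    Beyond the pipeline entry, a full-site crawl like this often also adjusts a couple of other settings; the values below are only example assumptions, not part of the original project:

    DOWNLOAD_DELAY = 1            # wait 1 second between requests to be polite
    USER_AGENT = 'Mozilla/5.0'    # some sites block Scrapy's default user agent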

    Running Scrapy

    On the command line, run:

    scrapy crawl onenine

    Open the database and you can see that the data has been saved.

    We can also export the data to a local file in other formats:

    scrapy crawl onenine -o items.json

    The example above saves the data locally as JSON.
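
    Scrapy's feed exports handle other formats the same way; for example, CSV:

    scrapy crawl onenine -o items.csv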

  • Original article: https://www.cnblogs.com/lixiaoliuer/p/8658989.html