  • Scraping part of an online store's data with the Scrapy framework and storing it in MongoDB

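    Scrape product information from an e-commerce site:
    
        URL: https://www.zhe800.com/ju_type/baoyou
        Scrape the product data under each category
        For each product, collect the name, the numeric price, and the product image
        Store the image's binary data together with the name and price in MongoDB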
    
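    The stored document has the following structure (the code below stores the category under the key 'cate' rather than 'category'):
    
    {
        'name': 'Lazy-person gadget, a must-have for watching movies',
        'price': '5.5',
        'img': ….,
        'category': 'Home textiles'
    }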

    Packet capture is not covered here because it is straightforward. The pages are parsed with XPath.
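The XPath step can be tried offline before running a crawl. The following is a minimal sketch that uses the standard library's ElementTree in place of Scrapy selectors, so it runs without Scrapy installed. The sample markup, the function name extract_products, and the example.com image host are assumptions modeled on the XPath expressions in the spider; the real page structure may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup shaped like the elements the spider's XPath targets.
# Note the trailing space in class="con ", which the spider also matches.
SAMPLE = """
<html><body>
  <div class="con ">
    <h3><a title="Lazy phone stand">Lazy phone stand</a></h3>
    <h4><em>5.5</em></h4>
    <a href="#"><img data-original="//img.example.com/a.jpg"/></a>
  </div>
  <div class="con ">
    <h3><a title="Cotton pillowcase">Cotton pillowcase</a></h3>
    <h4><em>12.9</em></h4>
    <a href="#"><img/></a>
  </div>
</body></html>
"""


def extract_products(markup, cate):
    """Return one dict per product block that has a name, price, and image."""
    root = ET.fromstring(markup)
    products = []
    for con in root.findall('.//div[@class="con "]'):
        title = con.find('./h3/a')
        price = con.find('./h4/em')
        img = con.find('.//a/img')
        src = img.get('data-original') if img is not None else None
        # Skip incomplete blocks instead of failing on a missing attribute.
        if title is None or price is None or src is None:
            continue
        products.append({
            'name': title.get('title'),
            'price': price.text,
            'img_link': 'https:' + src,  # the source is protocol-relative
            'cate': cate,
        })
    return products


products = extract_products(SAMPLE, 'Home textiles')
```

The second product block has no data-original attribute, so it is skipped rather than raising an error when the URL is built.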

    • by.py
      • # -*- coding: utf-8 -*-
        import scrapy
        from ..items import BywItem


        class BySpider(scrapy.Spider):
            name = 'by'
            # allowed_domains = ['baidu.com']
            start_urls = ['https://www.zhe800.com/ju_type/baoyou']

            # Image response: the body is the raw image bytes
            def img_parse(self, response):
                item = BywItem()
                item['name'] = response.meta['name']
                item['cate'] = response.meta['cate']
                item['price'] = response.meta['price']
                item['img'] = response.body
                yield item

            # Product listing for one category
            def xq_parse(self, response):
                cate = response.meta['cate']
                print(cate)
                xq_list = response.xpath('//div[@class="con "]')
                print(xq_list)
                for xq in xq_list:
                    name = xq.xpath('./h3/a/@title').extract_first()
                    print(name)
                    price = xq.xpath('./h4/em/text()').extract_first()
                    print(price)
                    img_link = 'https:' + xq.xpath('.//a/img/@data-original').extract_first()
                    print(img_link)
                    meta = {
                        'name': name,
                        'price': price,
                        'cate': cate
                    }
                    yield scrapy.Request(url=img_link, callback=self.img_parse, meta=meta)

            # Landing page: follow every category link except the first
            def parse(self, response):
                a_list = response.xpath('//div[@class="area"]/a[position()>1]')
                for a in a_list:
                    cate = a.xpath('./em/text()').extract_first()
                    cate_link = 'https:' + a.xpath('./@href').extract_first()
                    yield scrapy.Request(url=cate_link, callback=self.xq_parse, meta={'cate': cate})
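The spider builds absolute URLs by prepending 'https:' to each extracted link, which raises a TypeError if the attribute is missing and extract_first() returns None. A possible alternative, sketched here with the standard library's urljoin, resolves the link against the page URL and passes None through. The helper name absolute_url and the example.com host are hypothetical, not part of the original project.

```python
from urllib.parse import urljoin

# Page the links were extracted from (the spider's start URL).
BASE_URL = 'https://www.zhe800.com/ju_type/baoyou'


def absolute_url(base, link):
    """Resolve a possibly protocol-relative link; return None if it is missing."""
    if link is None:
        return None
    return urljoin(base, link)


# Protocol-relative link, as found in the data-original attribute.
img_url = absolute_url(BASE_URL, '//img.example.com/a.jpg')
# Missing attribute: the caller can skip the product instead of crashing.
missing = absolute_url(BASE_URL, None)
```

Inside a Scrapy callback, response.urljoin(link) performs the same resolution against the response's own URL.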
    • items.py
      • import scrapy
        
        
        class BywItem(scrapy.Item):
            # define the fields for your item here like:
            name = scrapy.Field()
            cate = scrapy.Field()
            price = scrapy.Field()
            img = scrapy.Field()
    • pipelines.py
      • import pymongo
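        conn = pymongo.MongoClient()  # connect to the default local MongoDB instance
        db = conn.byw  # select the database (created on first write)
        table = db.by  # select the collection (created on first write)
        
        class BywPipeline:
            def process_item(self, item, spider):
                table.insert_one(dict(item))  # insert the item as a document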
                return item
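The pipeline above connects to MongoDB at import time. A common variation, sketched below, opens the connection in open_spider and closes it in close_spider, the hooks Scrapy calls on item pipelines when a spider starts and finishes. The class name MongoLifecyclePipeline and the client_factory parameter are assumptions added so the sketch can be exercised without a running database; the database and collection names are the ones used in the original code.

```python
class MongoLifecyclePipeline:
    """Sketch: tie the MongoDB connection to the spider's lifetime."""

    def __init__(self, client_factory=None, db_name='byw', coll_name='by'):
        # client_factory is a hypothetical hook for testing; when it is None,
        # pymongo.MongoClient is used.
        self.client_factory = client_factory
        self.db_name = db_name
        self.coll_name = coll_name
        self.client = None
        self.table = None

    def open_spider(self, spider):
        if self.client_factory is None:
            import pymongo  # imported lazily so the module loads without it
            self.client_factory = pymongo.MongoClient
        self.client = self.client_factory()
        self.table = self.client[self.db_name][self.coll_name]

    def process_item(self, item, spider):
        # dict(item) copies the fields; the image bytes are stored as binary data.
        self.table.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
```

To use it, the entry in ITEM_PIPELINES would point at this class instead of BywPipeline.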
    • settings.py
      • # User agent
        
        USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36'
        
        # robots.txt: do not obey it
        ROBOTSTXT_OBEY = False
        
        # Item pipelines
        ITEM_PIPELINES = {
           'byw.pipelines.BywPipeline': 300,
        }
  • Original article: https://www.cnblogs.com/u-damowang1/p/12896523.html