
Scraping part of an online store's data with Scrapy and storing it in MongoDB

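Scraping product information from an e-commerce site:

    URL: https://www.zhe800.com/ju_type/baoyou
    Scrape the product data under each category
    For each product, capture the name, the numeric price, and the product image
    Store the image's binary stream together with the name and price in MongoDB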

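The stored document has this structure:

{
    'name': '懒人神奇, 看电影必备',
    'price': '5.5',
    'img': ….,
    'category': '家纺'
}

Note that the code below stores the category under the key cate rather than category.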

Packet capture is not covered here because it is straightforward; the pages are parsed with XPath.
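
The XPath expressions used by the spider can be tried offline before running a crawl. The following is a minimal sketch using only the standard library's ElementTree, which supports a subset of XPath (attribute values are read with `.get()` instead of `/@title`). The `SAMPLE` markup and the `extract_products` helper are hypothetical, shaped to match the selectors in this post; real pages are rarely well-formed XML, which is why the spider itself relies on Scrapy's selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup shaped like the nodes the spider selects
SAMPLE = """
<html><body>
  <div class="con ">
    <h3><a title="Sample product" href="//example.com/p/1">Sample product</a></h3>
    <h4><em>5.5</em></h4>
    <a href="//example.com/p/1"><img data-original="//example.com/img/1.jpg"/></a>
  </div>
</body></html>
"""


def extract_products(markup):
    """Return one dict per product node, with None for any missing field."""
    root = ET.fromstring(markup)
    products = []
    # The class value "con " includes a trailing space, as in the spider
    for con in root.findall(".//div[@class='con ']"):
        title_link = con.find("./h3/a")
        price = con.find("./h4/em")
        img = con.find(".//a/img")
        products.append({
            "name": title_link.get("title") if title_link is not None else None,
            "price": price.text if price is not None else None,
            "img_src": img.get("data-original") if img is not None else None,
        })
    return products


print(extract_products(SAMPLE))
```

The image URL here is still protocol-relative (`//host/path`), so it has to be made absolute before it can be requested.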

by.py

# -*- coding: utf-8 -*-
import scrapy
from ..items import BywItem
class BySpider(scrapy.Spider):
    name = 'by'
    # allowed_domains = ['baidu.com']
    start_urls = ['https://www.zhe800.com/ju_type/baoyou']

    # Image response: attach the raw bytes to the fields carried in meta
    def img_parse(self, response):
        item = BywItem()
        item['name'] = response.meta['name']
        item['cate'] = response.meta['cate']
        item['price'] = response.meta['price']
        item['img'] = response.body  # image as a binary stream (bytes)
        yield item

    # Category page: issue one image request per product
    def xq_parse(self, response):
        cate = response.meta['cate']
        xq_list = response.xpath('//div[@class="con "]')
        for xq in xq_list:
            name = xq.xpath('./h3/a/@title').extract_first()
            price = xq.xpath('./h4/em/text()').extract_first()
            img_src = xq.xpath('.//a/img/@data-original').extract_first()
            if img_src is None:
                # No lazy-load image URL on this node; skip it rather than fail
                continue
            # urljoin also resolves protocol-relative links (//host/path)
            img_link = response.urljoin(img_src)
            meta = {
                'name': name,
                'price': price,
                'cate': cate
            }
            yield scrapy.Request(url=img_link, callback=self.img_parse, meta=meta)


    # Entry page: follow every category link except the first
    def parse(self, response):
        a_list = response.xpath('//div[@class="area"]/a[position()>1]')
        for a in a_list:
            cate = a.xpath('./em/text()').extract_first()
            href = a.xpath('./@href').extract_first()
            if href is None:
                continue
            # urljoin also resolves protocol-relative links (//host/path)
            cate_link = response.urljoin(href)
            yield scrapy.Request(url=cate_link, callback=self.xq_parse, meta={'cate': cate})
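
Links extracted from a page can be protocol-relative (`//host/path`), site-relative, already absolute, or missing altogether. The sketch below shows one way to normalize them with the standard library; `absolute_url` is a hypothetical helper, not part of the original project. Inside a spider, Scrapy's `response.urljoin(href)` performs the same resolution against the URL of the current page.

```python
from urllib.parse import urljoin


def absolute_url(page_url, href):
    """Resolve href against page_url; return None when href is missing or empty."""
    if not href:
        return None
    # urljoin handles //host/path, /path, relative paths, and absolute URLs
    return urljoin(page_url, href.strip())


base = "https://www.zhe800.com/ju_type/baoyou"
print(absolute_url(base, "//example.com/img/1.jpg"))
print(absolute_url(base, None))
```

Returning None for a missing link lets the caller skip that node instead of failing on string concatenation.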

items.py

import scrapy


class BywItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    cate = scrapy.Field()
    price = scrapy.Field()
    img = scrapy.Field()

pipelines.py

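import pymongo

conn = pymongo.MongoClient()  # connect to the default local server (localhost:27017)
db = conn.byw  # select the database; MongoDB creates it on first write
table = db.by  # select the collection (MongoDB's counterpart to a table)


class BywPipeline:
    def process_item(self, item, spider):
        table.insert_one(dict(item))  # insert the item as one document
        return item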
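
The pipeline above opens its connection when the module is imported. An alternative, sketched here under assumptions, is to open and close the connection in Scrapy's `open_spider` and `close_spider` hooks so that it lives only as long as the crawl. The class name `MongoPipeline` and its constructor defaults are hypothetical; the database and collection names are the ones used in this post.

```python
class MongoPipeline:
    """Item pipeline that owns its MongoDB connection for the length of one crawl."""

    def __init__(self, uri="mongodb://localhost:27017", db_name="byw", coll_name="by"):
        self.uri = uri
        self.db_name = db_name
        self.coll_name = coll_name
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # Imported lazily so this module can be loaded without pymongo installed
        import pymongo
        self.client = pymongo.MongoClient(self.uri)
        self.collection = self.client[self.db_name][self.coll_name]

    def close_spider(self, spider):
        # Release the connection once the crawl is finished
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # dict(item) converts the Scrapy item into a plain document
        self.collection.insert_one(dict(item))
        return item
```

To try it, the class would be registered in ITEM_PIPELINES in place of BywPipeline. Because the image bytes are stored inside the document, each document must stay under MongoDB's 16 MB size limit.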

settings.py

# User agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36'

# robots.txt rules are not obeyed
ROBOTSTXT_OBEY = False

# Item pipelines
ITEM_PIPELINES = {
   'byw.pipelines.BywPipeline': 300,
}
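
Since robots.txt is not obeyed here, it may be worth slowing the crawl down. A minimal sketch of optional additions to settings.py follows; the setting names are standard Scrapy settings, but the values are illustrative assumptions and not part of the original project.

```python
# Illustrative values (assumptions); tune them for the target site
DOWNLOAD_DELAY = 1.0                # minimum seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # cap on parallel requests per domain
AUTOTHROTTLE_ENABLED = True         # adjust the delay based on observed latency
```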

Result
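
One way to check the stored data is to read documents back and write each `img` field to disk. The sketch below is an assumption-laden example: `dump_images` is a hypothetical helper, it takes any iterable of documents (for instance `pymongo.MongoClient().byw.by.find()`), and the `.jpg` extension is a guess because the original does not state the image format.

```python
import re
from pathlib import Path


def dump_images(docs, out_dir):
    """Write each document's 'img' bytes to out_dir and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, doc in enumerate(docs):
        # Replace characters that are unsafe in file names; fall back to the index
        stem = re.sub(r'[\\/:*?"<>|\s]+', "_", doc.get("name") or "").strip("_")
        stem = stem or f"item_{index}"
        path = out / f"{index:04d}_{stem}.jpg"  # .jpg is an assumption
        path.write_bytes(bytes(doc["img"]))
        written.append(path)
    return written


# Example with an in-memory document instead of a live MongoDB cursor
if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as tmp:
        print(dump_images([{"name": "demo", "img": b"\xff\xd8"}], tmp))
```

If the files open as valid images, the binary stream was stored intact.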


Original article: https://www.cnblogs.com/u-damowang1/p/12896523.html