  • Exporting nested JSON from Scrapy


    How can Scrapy export JSON structured like this?

    [
    {
    	"pingPai": ["ALPINA"],
    	"carTypes": [{
    		"carType": ["ALPINA"],
    		"carNames": {
    			"carName": ["ALPINA B4",
    			"ALPINA B3",
    			"ALPINA D5",
    			"ALPINA B7",
    			"ALPINA XD3"]
    		}
    	}],
    	"picUrl": "https://car3.autoimg.cn/cardfs/series/g27/M05/AB/2E/100x100_f40_autohomecar__wKgHHls8hiKADrqGAABK67H4HUI503.png"
    },
    {
    	"pingPai": ["ABT"],
    	"carTypes": [{
    		"carType": ["ABT"],
    		"carNames": {
    			"carName": ["ABT A3",
    			"ABT A5",
    			"ABT TT"]
    		}
    	}],
    	"picUrl": "https://car3.autoimg.cn/cardfs/series/g30/M07/B0/47/100x100_f40_autohomecar__wKgHPls9vLOAHILAAAAWGGhA_W0282.png"
    }    
    ]
    

    The key idea

    Exporting nested JSON from Scrapy really comes down to manipulating lists.

    Background

    • To concatenate several lists in Python, just use "+"
    list1 = [1,2,3]
    list2 = [5,6,7]
    print(list1+list2)
    # Output:
    # [1, 2, 3, 5, 6, 7]
    
    • To merge several dicts in Python, use "update()" (other methods exist; only update is covered here, and it is not actually used in this article)
    dic1 = {'a':'1','b':'2'}
    dic2 = {'c':'3','d':'4'}
    dic1.update(dic2)
    print(dic1)
    # Output:
    # {'a': '1', 'b': '2', 'c': '3', 'd': '4'}
    
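    • The result of an XPath match (after calling extract()) is actually a list
      For example, if the XPath matches a single row the result is "[X]", where X is the matched value
      If it matches several rows the result is "[X, Y, Z, ...]"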
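
    The "always a list" behaviour can be tried without Scrapy. The sketch below is only an illustration: it uses the standard library's ElementTree (which supports a small XPath subset) as a stand-in for Scrapy's selectors, and the markup in SNIPPET is a made-up fragment, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, loosely shaped like one brand block (an assumption).
SNIPPET = """
<dl id="b1">
  <dt><div><a>ALPINA</a></div></dt>
  <dd>
    <ul>
      <li><h4><a>ALPINA B4</a></h4></li>
      <li><h4><a>ALPINA B3</a></h4></li>
    </ul>
  </dd>
</dl>
""".strip()

root = ET.fromstring(SNIPPET)

# A single match still comes back wrapped in a list: [X]
brand = [a.text for a in root.findall("./dt/div/a")]
# Several matches come back as [X, Y, ...]
names = [a.text for a in root.findall("./dd/ul/li/h4/a")]
# No match gives an empty list rather than None
missing = [a.text for a in root.findall("./dd/span")]

print(brand)
print(names)
print(missing)
```

    Because every query yields a list, the results can be combined with "+" or "+=" exactly like the lists in the first bullet.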

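    Solution

    Looking at the nested JSON above, the first-level nodes are the three fields "pingPai", "carTypes" and "picUrl". Because of how Scrapy items are declared in items.py, we only need to define these three top-level fields.
    Open items.py and add the following code: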

    import scrapy


    class CarModelItem(scrapy.Item):
        pingPai = scrapy.Field()  # brand
        carTypes = scrapy.Field()  # car types
        picUrl = scrapy.Field()  # brand image
    

    Note

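    You can also skip items.py and serialize directly with json.dumps in pipelines.py. (Serialization converts an object into a byte sequence, or for JSON a string; deserialization restores the object from it. json.dumps converts a Python data structure into a JSON string, and json.loads converts a JSON-encoded string back into a Python data structure.) The principle is the same, only with explicit serialization, so it is not discussed in this article.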
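
    As a quick illustration of the json.dumps / json.loads round trip mentioned in the note, here is a minimal sketch on one record shaped like the target JSON. The picUrl value is a placeholder, not a real image address.

```python
import json

# One brand record in the same shape as the target JSON.
brand = {
    "pingPai": ["ABT"],
    "carTypes": [
        {"carType": ["ABT"], "carNames": {"carName": ["ABT A3", "ABT A5", "ABT TT"]}}
    ],
    "picUrl": "https://example.com/abt.png",  # placeholder URL (assumption)
}

# Serialize: Python structure -> JSON string.
# ensure_ascii=False keeps non-ASCII text (e.g. Chinese brand names) readable.
text = json.dumps([brand], ensure_ascii=False, indent=2)

# Deserialize: JSON string -> Python structure.
restored = json.loads(text)

print(text)
print(restored == [brand])
```

    Nested lists and dicts survive the round trip unchanged, which is why building the right Python structure is the only real work.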

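    To produce nested JSON, we only need to put another list inside carTypes, namely a list whose elements are dicts ("{}").
    For example, suppose we want to crawl the page at "https://www.autohome.com.cn/grade/carhtml/A.html".

    In the page source, the brand image and the first-level node sit inside "dl/dt".

    The second- and third-level nodes sit inside "dl/dd".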

    This is the harder part. To meet the requirement, the list we define generally has to be a list that contains further lists, with dicts as values.
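
    The nesting step can be sketched in isolation with plain Python lists standing in for the XPath results. The helper name build_car_types and the sample rows are hypothetical; pairing by position with zip is one possible way to line up each car type with its names and assumes both lists come back in the same order.

```python
# Plain lists standing in for what the XPath queries would return (assumption).
car_type_rows = [["Audi Sport"], ["ABT"]]
car_name_rows = [["奥迪RS 3", "奥迪RS 4"], ["ABT A3"]]


def build_car_types(type_rows, name_rows):
    """Return a list of dicts, each holding a car type and its nested names."""
    # Start every entry with an empty "carNames" dict.
    car_types = [{"carType": t, "carNames": {}} for t in type_rows]
    # Pair the i-th type with the i-th group of names; unmatched types keep {}.
    for entry, names in zip(car_types, name_rows):
        entry["carNames"] = {"carName": names}
    return car_types


nested = build_car_types(car_type_rows, car_name_rows)
print(nested)
```

    If there are more car types than name groups, the extra entries simply keep their empty "carNames" dict.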
    Here is the spider code:

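    import scrapy

    from ..items import CarModelItem  # assumes the default layout: <project>/spiders/<this file>


    class GetcarmodelSpider(scrapy.Spider):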
        name = "GetCarModel"
        allowed_domains = ["www.autohome.com.cn"]
        # Only "A" is crawled here. Add more index letters to crawl other pages,
        # e.g. "B", "C", "D", "F", "G", "H", "J", "K", "L", "M", "N", "O",
        # "P", "Q", "R", "S", "T", "W", "X", "Y", "Z".
        chars = [
            "A",
        ]
        start_urls = [
            "https://www.autohome.com.cn/grade/carhtml/%s.html" % i2 for i2 in chars
        ]
    
        def parse(self, response):
            dtArray = response.xpath("//dl[@id]")
            for dt in dtArray:
                pingPai = dt.xpath("./dt/div/a/text()").extract()
                pingPaiPicArr = dt.xpath("./dt/a/img/@src").extract()
                pingPaiPic = ""
                # There is actually only one image per brand
                for cti in pingPaiPicArr:
                    # Resolve a relative or protocol-relative src against the page URL
                    pingPaiPic = response.urljoin(cti)
    
                carTypesTemp = dt.xpath("./dd/div[@class='h3-tit']")
                carTypes = []
                for pp in carTypesTemp:
                    print(">>>>>>>>>>>>>>>>>>>>>>", pp.xpath("./a/text()").extract())
                    carTypes += [
                        {"carType": pp.xpath("./a/text()").extract(), "carNames": {}}
                    ]
    
                # Get the model names
                carNameArray = dt.xpath("./dd/ul[@class='rank-list-ul']")
                carNames = []
                for cn in carNameArray:
                    # Build a list of dicts directly, so the entry for the X-th group is carNames[X]
                    carNames += [{"carName": cn.xpath("./li/h4/a/text()").extract()}]
                    print(".......", [{"carName": cn.xpath("./li/h4/a/text()").extract()}])
    
                for i in range(len(carTypes)):
                    try:
                        carTypes[i]["carNames"] = carNames[i]
                    except Exception as e:
                        print(e)
    
    
                print("pingPai:", pingPai)
                print("pingPaiPic:", pingPaiPic)
                print("carTypes:", carTypes)
                print("carNames:", carNames)
    
                carModel = CarModelItem()
                carModel["pingPai"] = pingPai
                carModel["carTypes"] = carTypes
                carModel["picUrl"] = pingPaiPic
                yield carModel
    

    Note: how to export the JSON is not covered here; there are plenty of guides online.
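
    For readers who want a starting point, one way to write the export by hand is an item pipeline. The sketch below is an assumption rather than the author's code: the class name JsonArrayPipeline and the file name cars.json are hypothetical, and the method names follow the item pipeline interface documented by Scrapy.

```python
import json


class JsonArrayPipeline:
    """Collect every item and write a single JSON array when the spider closes."""

    def __init__(self, path="cars.json"):  # output file name is a placeholder
        self.path = path
        self.items = []

    def open_spider(self, spider):
        # Start each crawl with an empty buffer.
        self.items = []

    def process_item(self, item, spider):
        # dict() works for both scrapy.Item instances and plain dicts.
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # ensure_ascii=False keeps Chinese text readable in the output file.
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.items, f, ensure_ascii=False, indent=2)
```

    To use it, the class would be registered under ITEM_PIPELINES in settings.py with your own project's module path. When no custom logic is needed, Scrapy's built-in feed export is the simpler route, for example "scrapy crawl GetCarModel -o cars.json", optionally with FEED_EXPORT_ENCODING = "utf-8" in settings.py so that Chinese text is not escaped.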
    Running the code produces an exported JSON file like this:

    [{
    	"pingPai": ["奥迪"],
    	"carTypes": [{
    		"carType": ["一汽-大众奥迪"],
    		"carNames": {
    			"carName": ["奥迪Q2L新能源",
    			"奥迪A3",
    			"奥迪A4L",
    			"奥迪A6L",
    			"奥迪Q2L",
    			"奥迪Q3",
    			"奥迪Q5L",
    			"奥迪A6L新能源",
    			"奥迪Q4",
    			"奥迪A4",
    			"奥迪A6",
    			"奥迪Q5"]
    		}
    	},
    	{
    		"carType": ["Audi Sport"],
    		"carNames": {
    			"carName": ["奥迪RS 3",
    			"奥迪RS 4",
    			"奥迪RS 5",
    			"奥迪RS 6",
    			"奥迪RS 7",
    			"奥迪R8",
    			"奥迪TT RS",
    			"奥迪RS Q3",
    			"奥迪RSQ e-tron"]
    		}
    	},
    	{
    		"carType": ["奥迪(进口)"],
    		"carNames": {
    			"carName": ["奥迪e-tron",
    			"奥迪A3(进口)",
    			"奥迪S3",
    			"奥迪A4(进口)",
    			"奥迪A5",
    			"奥迪S4",
    			"奥迪S5",
    			"奥迪A6(进口)",
    			"奥迪S6",
    			"奥迪A7",
    			"奥迪S7",
    			"奥迪A8",
    			"奥迪Q7",
    			"奥迪Q7新能源",
    			"奥迪TT",
    			"奥迪TTS",
    			"奥迪A0",
    			"奥迪A1",
    			"奥迪S1",
    			"e-tron Concept",
    			"奥迪AI:ME",
    			"奥迪A6新能源(进口)",
    			"奥迪A7新能源",
    			"奥迪Aicon",
    			"奥迪e-tron GT",
    			"Prologue",
    			"奥迪A8新能源",
    			"奥迪A9",
    			"奥迪S8",
    			"allroad",
    			"奥迪Q2",
    			"奥迪SQ2",
    			"奥迪Q3(进口)",
    			"奥迪Q4(进口)",
    			"奥迪Q4新能源(进口)",
    			"奥迪TT offroad",
    			"h-tron quattro",
    			"奥迪Elaine",
    			"奥迪Q5(进口)",
    			"奥迪Q5新能源(进口)",
    			"奥迪SQ5",
    			"奥迪Q8",
    			"奥迪SQ7",
    			"奥迪Q9",
    			"e-tron Vision Gran Turismo",
    			"quattro",
    			"奥迪PB18",
    			"奥迪R18",
    			"奥迪Urban",
    			"奥迪A2",
    			"奥迪80",
    			"奥迪A3新能源(进口)",
    			"奥迪Coupe",
    			"奥迪100",
    			"Crosslane Coupe",
    			"奥迪Cross",
    			"Nanuk"]
    		}
    	}],
    	"picUrl": "https://car2.autoimg.cn/cardfs/series/g26/M0B/AE/B3/100x100_f40_autohomecar__wKgHEVs9u5WAV441AAAKdxZGE4U148.png"
    },
    {
    	"pingPai": ["阿斯顿·马丁"],
    	"carTypes": [{
    		"carType": ["阿斯顿·马丁"],
    		"carNames": {
    			"carName": ["Rapide",
    			"V8 Vantage",
    			"Vanquish",
    			"阿斯顿·马丁DB11",
    			"阿斯顿·马丁DBS",
    			"Cygnet",
    			"Rapide E",
    			"阿斯顿·马丁DBX",
    			"V12 Vantage",
    			"阿斯顿·马丁DB9",
    			"AM-RB 003",
    			"Heritage EV",
    			"Virage",
    			"Vulcan",
    			"阿斯顿·马丁CC100",
    			"阿斯顿·马丁DB10",
    			"阿斯顿·马丁DB5",
    			"阿斯顿·马丁DP-100",
    			"战神",
    			"拉共达Taraf",
    			"Ulster",
    			"V12 Zagato",
    			"阿斯顿·马丁DB6",
    			"阿斯顿·马丁One-77"]
    		}
    	}],
    	"picUrl": "https://car2.autoimg.cn/cardfs/series/g26/M06/AE/B5/100x100_f40_autohomecar__wKgHEVs9u6GAPWN8AAAYsmBsCWs847.png"
    },
    {
    	"pingPai": ["AC Schnitzer"],
    	"carTypes": [{
    		"carType": ["AC Schnitzer"],
    		"carNames": {
    			"carName": ["AC Schnitzer 3系",
    			"AC Schnitzer M4",
    			"AC Schnitzer 7系",
    			"AC Schnitzer X6",
    			"AC Schnitzer X5"]
    		}
    	}],
    	"picUrl": "https://car3.autoimg.cn/cardfs/series/g27/M01/B0/62/100x100_f40_autohomecar__ChcCQFs9vBKAO3YSAAAW0WOWvRc555.png"
    },
    {
    	"pingPai": ["安凯客车"],
    	"carTypes": [{
    		"carType": ["安凯客车"],
    		"carNames": {
    			"carName": ["宝斯通"]
    		}
    	}],
    	"picUrl": "https://car2.autoimg.cn/cardfs/series/g29/M00/AB/C8/100x100_f40_autohomecar__ChcCSFs8riCAYVA2AAApQLgf8a0969.png"
    },
    {
    	"pingPai": ["阿尔法·罗密欧"],
    	"carTypes": [{
    		"carType": ["阿尔法·罗密欧"],
    		"carNames": {
    			"carName": ["Giulia",
    			"Stelvio",
    			"MiTo",
    			"Giulietta",
    			"Tonale",
    			"ALFA 4C",
    			"Disco Volante",
    			"Gloria",
    			"ALFA 147",
    			"ALFA 156",
    			"ALFA 159",
    			"ALFA 166",
    			"ALFA 2uettottanta",
    			"ALFA 8C",
    			"ALFA GT",
    			"ALFA S.Z.",
    			"ALFA TZ3"]
    		}
    	}],
    	"picUrl": "https://car2.autoimg.cn/cardfs/series/g26/M05/B0/29/100x100_f40_autohomecar__ChcCP1s9u5qAemANAABON_GMdvI451.png"
    },
    {
    	"pingPai": ["ALPINA"],
    	"carTypes": [{
    		"carType": ["ALPINA"],
    		"carNames": {
    			"carName": ["ALPINA B4",
    			"ALPINA B3",
    			"ALPINA D5",
    			"ALPINA B7",
    			"ALPINA XD3"]
    		}
    	}],
    	"picUrl": "https://car3.autoimg.cn/cardfs/series/g27/M05/AB/2E/100x100_f40_autohomecar__wKgHHls8hiKADrqGAABK67H4HUI503.png"
    },
    {
    	"pingPai": ["ABT"],
    	"carTypes": [{
    		"carType": ["ABT"],
    		"carNames": {
    			"carName": ["ABT A3",
    			"ABT A5",
    			"ABT TT"]
    		}
    	}],
    	"picUrl": "https://car3.autoimg.cn/cardfs/series/g30/M07/B0/47/100x100_f40_autohomecar__wKgHPls9vLOAHILAAAAWGGhA_W0282.png"
    },
    {
    	"pingPai": ["AEV ROBOTICS"],
    	"carTypes": [{
    		"carType": ["AEV ROBOTICS"],
    		"carNames": {
    			"carName": ["Modular Vehicle System"]
    		}
    	}],
    	"picUrl": "https://car2.autoimg.cn/cardfs/series/g3/M02/58/D3/autohomecar__ChcCRVw0TJaAM8BmAAAS-7AD7DQ372.png"
    },
    {
    	"pingPai": ["Agile Automotive"],
    	"carTypes": [{
    		"carType": ["Agile Automotive"],
    		"carNames": {
    			"carName": ["Agile Automotive SC122",
    			"Agile Automotive SCX"]
    		}
    	}],
    	"picUrl": "https://car2.autoimg.cn/cardfs/series/g26/M09/AF/8C/100x100_f40_autohomecar__wKgHHVs9r62AIbiYAAAvAsqdpoA594.png"
    },
    {
    	"pingPai": ["Apollo"],
    	"carTypes": [{
    		"carType": ["Apollo"],
    		"carNames": {
    			"carName": ["Apollo N",
    			"Arrow",
    			"Intensa Emozione"]
    		}
    	}],
    	"picUrl": "https://car3.autoimg.cn/cardfs/series/g28/M06/B0/C6/100x100_f40_autohomecar__ChcCR1s90RGASBRgAACz67wh_68723.png"
    },
    {
    	"pingPai": ["Arash"],
    	"carTypes": [{
    		"carType": ["Arash"],
    		"carNames": {
    			"carName": ["AF8 Cassini",
    			"Arash AF10"]
    		}
    	}],
    	"picUrl": "https://car3.autoimg.cn/cardfs/series/g30/M05/AA/D4/100x100_f40_autohomecar__wKgHHFs8n1CAVhcNAAAV3xEAiDM531.png"
    },
    {
    	"pingPai": ["ARCFOX"],
    	"carTypes": [{
    		"carType": ["北汽新能源"],
    		"carNames": {
    			"carName": ["ARCFOX-1",
    			"ARCFOX ECF Concept",
    			"ARCFOX-7",
    			"ARCFOX-GT"]
    		}
    	}],
    	"picUrl": "https://car3.autoimg.cn/cardfs/series/g27/M02/AB/F7/100x100_f40_autohomecar__ChcCQFs8nA6AP-h5AABsvxhHw3E709.png"
    },
    {
    	"pingPai": ["Aria"],
    	"carTypes": [{
    		"carType": ["Aria"],
    		"carNames": {
    			"carName": ["Aria FXE"]
    		}
    	}],
    	"picUrl": "https://car3.autoimg.cn/cardfs/series/g28/M0B/B0/0D/100x100_f40_autohomecar__wKgHI1s9r2iAJwIXAAAIBShzq60456.png"
    },
    {
    	"pingPai": ["ATS"],
    	"carTypes": [{
    		"carType": ["ATS"],
    		"carNames": {
    			"carName": ["ATS GT"]
    		}
    	}],
    	"picUrl": "https://car2.autoimg.cn/cardfs/series/g26/M08/D7/D3/autohomecar__ChsEe1wYwKmAY2p9AAA1NP0jCHk594.png"
    },
    {
    	"pingPai": ["Aurus"],
    	"carTypes": [{
    		"carType": ["Aurus"],
    		"carNames": {
    			"carName": ["Senat"]
    		}
    	}],
    	"picUrl": "https://car2.autoimg.cn/cardfs/series/g27/M07/F3/E1/autohomecar__ChcCQFuN6WiAcztKAAAsLfBmU9g074.png"
    },
    {
    	"pingPai": ["艾康尼克"],
    	"carTypes": [{
    		"carType": ["艾康尼克ICONIQ Motors"],
    		"carNames": {
    			"carName": ["MUSE",
    			"艾康尼克七系"]
    		}
    	}],
    	"picUrl": "https://car2.autoimg.cn/cardfs/series/g29/M0A/A9/EC/100x100_f40_autohomecar__wKgHG1s8iP6ASbjTAAAOIwskkzo314.png"
    },
    {
    	"pingPai": ["爱驰"],
    	"carTypes": [{
    		"carType": ["爱驰汽车"],
    		"carNames": {
    			"carName": ["爱驰U5",
    			"爱驰U7",
    			"RG Nathalie"]
    		}
    	}],
    	"picUrl": "https://car3.autoimg.cn/cardfs/series/g29/M09/A9/9B/100x100_f40_autohomecar__wKgHG1s8fwqAOp3IAAALEeTkn6c536.png"
    }]
    
  • Original article: https://www.cnblogs.com/zh672903/p/11018891.html