zoukankan      html  css  js  c++  java
  • Hive解析多重嵌套JSON数组

    最近在工作中遇到一个很难解析的JSON,他是一个嵌套的JSON数组的JSON,要使用Hive来进行解析,用Presto写了一次,逻辑就很清晰,因为Presto自带了JSON数据类型,转换数组就很方便,而Hive解析完JSON数组后是一个字符串,只能使用split方法来对string类型的数据进行切分,所以如果遇到多层嵌套的数组,要注意切分方法,不然就会乱套。

    需要解析的JSON

    {
    	"base": {
    		"code": "xm",
    		"name": "project"
    	},
    	"list": [{
    		"ACode": "cp1",
    		"AName": "Product1",
    		"BList": [{
    			"BCode": "gn1",
    			"BName": "Feature1"
    		}, {
    			"BCode": "gn2",
    			"BName": "Feature2"
    		}]
    	}, {
    		"ACode": "cp2",
    		"AName": "Product2",
    		"BList": [{
    			"BCode": "gn1",
    			"BName": "Feature1"
    		}]
    	}]
    }
    

    解析出来的结果应该如下表所示

    code name ACode Aname Bcode Bname
    xm project cp1 Product1 gn1 Feature1
    xm project cp1 Product1 gn2 Feature2
    xm project cp2 Product2 gn1 Feature1

    解决方案

    首先使用get_json_object方法,把需要解析的数组解析出来,然后使用regexp_replace}]},{替换成}]}||{,然后再使用split方法对||进行分割,分割成数组后,使用lateral view explode方法对其进行展开成多列即刻。

    SELECT
        code
      , name
      , ai.ACode
      , ai.AName
      , bi.BCode
      , bi.BName
    FROM
        (
            SELECT
                get_json_object(t.value, '$.base.code') AS code
              , get_json_object(t.value, '$.base.name') AS name
              , get_json_object(t.value, '$.list')      AS list
            FROM
                (
                    SELECT
                        '{"base":{"code":"xm","name":"project"},"list":[{"ACode":"cp1","AName":"Product1","BList":[{"BCode":"gn1","BName":"Feature1"},{"BCode":"gn2","BName":"Feature2"}]},{"ACode":"cp2","AName":"Product2","BList":[{"BCode":"gn1","BName":"Feature1"}]}]}' as value
                )
                t
        )
        t lateral view explode(split(regexp_replace(regexp_extract(list,'^\[(.+)\]$',1),'\}\]\}\,\{', '\}\]\}\|\|\{'),'\|\|')) list as a 
    	lateral view json_tuple(a,'ACode','AName','BList') ai as ACode
        , AName
        , BList lateral view explode(split(regexp_replace(regexp_extract(BList,'^\[(.+)\]$',1),'\}\,\{', '\}\|\|\{'),'\|\|')) BList as b 
        lateral view json_tuple(b,'BCode','BName') bi as BCode
        , BName
    ;
    

    执行完

    xm	project	cp1	Product1	gn1	Feature1
    xm	project	cp1	Product1	gn2	Feature2
    xm	project	cp2	Product2	gn1	Feature1
    Time taken: 0.787 seconds, Fetched: 3 row(s)
    

    hive

    总结

    1. 尝试切分为数组后,使用lateral view posexplode方案,逐层解析,但这样会导致笛卡尔。所以必须一次性全部解析好,而不是套用多个子查询逐层解析;
    2. 使用OUTER字段,能使LATERAL VIEW不忽略NULL

    include OUTER in the query to get rows with NULL values

    something like,

    select *  FROM table LATERAL VIEW OUTER explode (  split (  email  ,','  ) ) email AS email_id;
    
  • 相关阅读:
    1027 Colors in Mars (20 分)
    1025 PAT Ranking (25 分)
    1024 Palindromic Number (25 分)
    JVM调优之jstack找出最耗cpu的线程并定位代码
    kill -3 获取threaddump信息
    java单例支持高并发
    yum出现的“UnicodeDecodeError: 'ascii' codec”问题解决
    命令行参数
    vyatta常用操作
    mysql5.8安装指南
  • 原文地址:https://www.cnblogs.com/harrylyx/p/12986284.html
Copyright © 2011-2022 走看看