zoukankan      html  css  js  c++  java
  • 拉链表知识

    Hive拉链表原理

    拉链表记录每条信息的生命周期,一旦记录的生命周期结束,就重新开始一条新记录,并且把当前日期作为新记录生效开始日期。
    如果当前信息至今有效,在生命周期结束中填入一个极大值(‘9999-99-99’)。

    应用场景:
    1需要查看某些业务信息的某一个时间点当日的信息
    2数据会发生变化,但是大部分是不变的,无法做每日增量。
    3数据量有一定规模。无法按照每日全量的方式保存,比如1亿用户*365天,每天一份用户信息,显然不太合适。

    拉链表的形成过程:

     如何使用拉链表:
    获取某个时间点的数据。
    只要指定开始生效期比它小,结束生效期比它大就可以。
    如获取2019/12/01这一天的状态的数据。

    select * from order_info where start_date<='2019-12-01' and end_date>='2019-12-01';

    拉链表的适用场景

    维护历史状态,以及最新状态数据

    适用场景:

    1.数据量比较大

    2.表中的部分字段会被更新

    3.需要查看某一个时间点或者时间段的历史快照信息

        查看某一个订单在历史某一个时间点的状态

        某一个用户在过去某一段时间,下单次数

    4.更新的比例和频率不是很大

        如果表中信息变化不是很大,每天都保留一份全量,那么每次全量中会保存很多不变的信息,对存储是极大的浪费

    优点

    1、满足反应数据的历史状态

    2、最大程度节省存储

    拉链表的实现

    拉链表适用于以下几种情况吧

    数据量有点大,表中某些字段有变化,但是呢变化的频率也不是很高,业务需求呢又需要统计这种变化状态,每天全量一份呢,有点不太现实,

    不仅浪费了存储空间,有时可能业务统计也有点麻烦,这时,拉链表的作用就提现出来了,既节省空间,又满足了需求。

    假设以天为维度,以每天的最后一个状态为当天的最终状态。

    以一张订单表为例,如下是原始数据,每天的订单状态明细

    1       2016-08-20      2016-08-20      创建
    2       2016-08-20      2016-08-20      创建
    3       2016-08-20      2016-08-20      创建
    1       2016-08-20      2016-08-21      支付
    2       2016-08-20      2016-08-21      完成
    4       2016-08-21      2016-08-21      创建
    1       2016-08-20      2016-08-22      完成
    3       2016-08-20      2016-08-22      支付
    4       2016-08-21      2016-08-22      支付
    5       2016-08-22      2016-08-22      创建

    根据拉链表我们希望得到的是:

    1       2016-08-20      2016-08-20      创建    2016-08-20      2016-08-21
    1       2016-08-20      2016-08-21      支付    2016-08-21      2016-08-22
    1       2016-08-20      2016-08-22      完成    2016-08-22      9999-12-31
    2       2016-08-20      2016-08-20      创建    2016-08-20      2016-08-21
    2       2016-08-20      2016-08-21      完成    2016-08-21      9999-12-31
    3       2016-08-20      2016-08-20      创建    2016-08-20      2016-08-22
    3       2016-08-20      2016-08-22      支付    2016-08-22      9999-12-31
    4       2016-08-21      2016-08-21      创建    2016-08-21      2016-08-22
    4       2016-08-21      2016-08-22      支付    2016-08-22      9999-12-31
    5       2016-08-22      2016-08-22      创建    2016-08-22      9999-12-31

    最后两个字段:begin_date表示该条记录的生命周期开始时间,end_date表示该条记录的生命周期结束时间;

    end_date = ‘9999-12-31’表示该条记录目前处于有效状态;

    如果查询当前所有有效的记录,则select * from order_his where dw_end_date = ‘9999-12-31′;

    如果查询2016-08-21的历史快照,则select * from order_his where begin_date <= ‘2016-08-21′ and end_date >= ‘2016-08-21’;

    本例以hive为例,只考虑到实现,与性能和业务无关。

    CREATE TABLE orders (
    orderid INT,
    createtime STRING,
    modifiedtime STRING,
    status STRING
    ) row format delimited fields terminated by '	';
    
    CREATE TABLE ods_orders (
    orderid INT,
    createtime STRING,
    modifiedtime STRING,
    status STRING
    ) PARTITIONED BY (day STRING)
    row format delimited fields terminated by '	';
     
    CREATE TABLE dw_orders_history (
    orderid INT,
    createtime STRING,
    modifiedtime STRING,
    status STRING,
    dw_start_date STRING,
    dw_end_date STRING
    ) row format delimited fields terminated by '	' ;

    首先全量更新,我们先到2016-08-20为止的数据。

    初始化,先把2016-08-20的数据初始化进去。

    INSERT overwrite TABLE ods_orders PARTITION (day = '2016-08-20')
    SELECT orderid,createtime,modifiedtime,status
    FROM orders
    WHERE createtime < '2016-08-21' and modifiedtime <'2016-08-21';

    刷到dw_orders_history中。

    INSERT overwrite TABLE dw_orders_history
    SELECT orderid,createtime,modifiedtime,status,
    createtime AS dw_start_date,
    '9999-12-31' AS dw_end_date
    FROM ods_orders
    WHERE day = '2016-08-20';

     2016-08-21的数据增量刷到ods_orders。注意,如果 hive 表是分区表的话,insert overwrite 操作只会重写当前分区的数据,不会重写其他分区数据。

    INSERT overwrite TABLE ods_orders PARTITION (day = '2016-08-21')
    SELECT orderid,createtime,modifiedtime,status
    FROM orders
    WHERE (createtime = '2016-08-21'  and modifiedtime = '2016-08-21') OR modifiedtime = '2016-08-21';

    先放到增量表中,然后进行关联到一张临时表中,在插入到新表中

    DROP TABLE IF EXISTS dw_orders_his_tmp;
    CREATE TABLE dw_orders_his_tmp AS
    SELECT orderid,
    createtime,
    modifiedtime,
    status,
    dw_start_date,
    dw_end_date
    FROM (
        SELECT a.orderid,
        a.createtime,
        a.modifiedtime,
        a.status,
        a.dw_start_date,
        CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date > '2016-08-21' THEN '2016-08-21' ELSE a.dw_end_date END AS dw_end_date
        FROM dw_orders_history a
        left outer join (SELECT * FROM ods_orders WHERE day = '2016-08-21') b
        ON (a.orderid = b.orderid)
        UNION ALL
        SELECT orderid,
        createtime,
        modifiedtime,
        status,
        modifiedtime AS dw_start_date,
        '9999-12-31' AS dw_end_date
        FROM ods_orders
        WHERE day = '2016-08-21'
    ) x
    ORDER BY orderid,dw_start_date;
    
    INSERT overwrite TABLE dw_orders_history
    SELECT * FROM dw_orders_his_tmp;

    重复上面2步把2016-08-22号的数据更新进去,最后结果如下

    ods_orders表数据:

     拉链表dw_orders_history的数据:

     

    全部sql

    CREATE TABLE orders (
    orderid INT,
    createtime STRING,
    modifiedtime STRING,
    status STRING
    ) row format delimited fields terminated by '	';
    
    CREATE TABLE ods_orders (
    orderid INT,
    createtime STRING,
    modifiedtime STRING,
    status STRING
    ) PARTITIONED BY (day STRING)
    row format delimited fields terminated by '	';
     
    CREATE TABLE dw_orders_history (
    orderid INT,
    createtime STRING,
    modifiedtime STRING,
    status STRING,
    dw_start_date STRING,
    dw_end_date STRING
    ) row format delimited fields terminated by '	' ;
    
    INSERT overwrite TABLE ods_orders PARTITION (day = '2016-08-20')
    SELECT orderid,createtime,modifiedtime,status
    FROM orders
    WHERE createtime < '2016-08-21' and modifiedtime <'2016-08-21';
    
    ##刷到dw中
    INSERT overwrite TABLE dw_orders_history
    SELECT orderid,createtime,modifiedtime,status,
    createtime AS dw_start_date,
    '9999-12-31' AS dw_end_date
    FROM ods_orders
    WHERE day = '2016-08-20';
    
    INSERT overwrite TABLE ods_orders PARTITION (day = '2016-08-21')
    SELECT orderid,createtime,modifiedtime,status
    FROM orders
    WHERE (createtime = '2016-08-21'  and modifiedtime = '2016-08-21') OR modifiedtime = '2016-08-21';
    
    ##先放到增量表中,然后进行关联到一张临时表中,在插入到新表中
    DROP TABLE IF EXISTS dw_orders_his_tmp;
    CREATE TABLE dw_orders_his_tmp AS
    SELECT orderid,
    createtime,
    modifiedtime,
    status,
    dw_start_date,
    dw_end_date
    FROM (
        SELECT a.orderid,
        a.createtime,
        a.modifiedtime,
        a.status,
        a.dw_start_date,
        CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date > '2016-08-21' THEN '2016-08-21' ELSE a.dw_end_date END AS dw_end_date
        FROM dw_orders_history a
        left outer join (SELECT * FROM ods_orders WHERE day = '2016-08-21') b
        ON (a.orderid = b.orderid)
        UNION ALL
        SELECT orderid,
        createtime,
        modifiedtime,
        status,
        modifiedtime AS dw_start_date,
        '9999-12-31' AS dw_end_date
        FROM ods_orders
        WHERE day = '2016-08-21'
    ) x
    ORDER BY orderid,dw_start_date;
    
    INSERT overwrite TABLE dw_orders_history
    SELECT * FROM dw_orders_his_tmp;
    
    
    #8月22 增量更新
    INSERT overwrite TABLE ods_orders PARTITION (day = '2016-08-22')
    SELECT orderid,createtime,modifiedtime,status
    FROM orders
    WHERE (createtime = '2016-08-22'  and modifiedtime = '2016-08-22') OR modifiedtime = '2016-08-22';
    
    DROP TABLE IF EXISTS dw_orders_his_tmp;
    CREATE TABLE dw_orders_his_tmp AS
    SELECT orderid,
    createtime,
    modifiedtime,
    status,
    dw_start_date,
    dw_end_date
    FROM (
        SELECT a.orderid,
        a.createtime,
        a.modifiedtime,
        a.status,
        a.dw_start_date,
        CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date > '2016-08-22' THEN '2016-08-22' ELSE a.dw_end_date END AS dw_end_date
        FROM dw_orders_history a
        left outer join (SELECT * FROM ods_orders WHERE day = '2016-08-22') b
        ON (a.orderid = b.orderid)
        UNION ALL
        SELECT orderid,
        createtime,
        modifiedtime,
        status,
        modifiedtime AS dw_start_date,
        '9999-12-31' AS dw_end_date
        FROM ods_orders
        WHERE day = '2016-08-22'
    ) x
    ORDER BY orderid,dw_start_date;
    
    INSERT overwrite TABLE dw_orders_history
    SELECT * FROM dw_orders_his_tmp;
    View Code

     参考博客:Hive拉链表原理

                      hive中拉链表

                       你真的了解全量表,增量表及拉链表吗?

             拉链表及其Hive实现

                      缓慢变化维(SCD)常见解决方案

  • 相关阅读:
    HDU
    hdu-1260 tickets
    hdu-1024 Max Sum Plus Plus
    spfa+链式前向星模板
    kafka-伪集群搭建
    elasticsearch-安装-centos7- es7.5 搭建
    elk-安装 通过docker
    kibana-安装-通过docker
    logstash -grok插件语法介绍
    docker 启动redis/nginx
  • 原文地址:https://www.cnblogs.com/yangms/p/14425310.html
Copyright © 2011-2022 走看看