  • ClickHouse Webinar Study Notes (ClickHouse Tips and Tricks)

    I. Showing Execution Logs

    clickhouse-client --send_logs_level=trace

    Or, inside a client session, run:

    set send_logs_level = 'trace'
    select 1
    set send_logs_level = 'none'
    

    This traces the execution logs.

    II. Encoding Columns

    1. Declare column codecs when creating the table

    create table test_codecs ( a String,
    a_lc LowCardinality(String) default a,
    b UInt32,
    b_delta UInt32 default b Codec(Delta),
    b_delta_lz4 UInt32 default b Codec(Delta,LZ4),
    b_dd UInt32 default b Codec(DoubleDelta),
    b_dd_lz4 UInt32 default b Codec(DoubleDelta,LZ4)
    )
    engine = MergeTree
    partition by tuple() order by tuple();
    

    Column a is the original string column, and a_lc is a LowCardinality (dictionary-encoded) version of a. Column b is the original integer column, and each b_xxx column applies a different codec to b's values.

    Codec accepts up to two arguments: the first specifies the encoding, the second the compression algorithm.

    partition by tuple() means no partitioning.

    2. Load data

    insert into test_codecs(a,b)
    select
    concat('prefix',toString(rand() * 1000)),now()+(number * 10)
    from system.numbers 
    limit 100000000
    settings max_block_size=1000000
    

    3. Inspect the data

    select name , sum(data_compressed_bytes) comp,
    sum(data_uncompressed_bytes) uncomp,
    round(comp/uncomp * 100.0,2) as percent
    from system.columns where table = 'test_codecs'
    group by name order by name
    

    This lets you compare the compression ratios of the different codecs.

    4. Encoding can also speed up queries

    select a as a,count(*) as c from test_codecs
    group by a order by c asc limit 10
    
    select a_lc as a,count(*) as c from test_codecs
    group by a order by c asc limit 10
    

    The second query, over the encoded column, is much faster.

    5. Comparing the encodings

    Name            Best for
    LowCardinality  Strings with fewer than ~10K distinct values
    Delta           Time series
    DoubleDelta     Monotonically increasing counters
    Gorilla         Gauge data (values fluctuating around a mean)
    T64             Non-random integers
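    The intuition behind Delta and DoubleDelta can be sketched outside ClickHouse. The following Python sketch is an illustration only, not ClickHouse's internal implementation: delta encoding stores differences between neighbors, so a steadily increasing series collapses into a stream of small, repetitive values that a general-purpose compressor like LZ4 then shrinks very well; DoubleDelta applies the same idea to the deltas themselves.

    ```python
    # Sketch of delta / double-delta encoding (illustration only).

    def delta_encode(values):
        # Keep the first value, then store differences between neighbors.
        return [values[0]] + [b - a for a, b in zip(values, values[1:])]

    def delta_decode(deltas):
        out = [deltas[0]]
        for d in deltas[1:]:
            out.append(out[-1] + d)
        return out

    # A timestamp-like series sampled every 10 seconds.
    series = [1600000000 + 10 * i for i in range(8)]

    deltas = delta_encode(series)          # first value, then all 10s
    double_deltas = delta_encode(deltas)   # mostly zeros after the head

    assert delta_decode(deltas) == series  # lossless round trip
    ```

    A counter sampled at a fixed interval turns into a run of identical deltas, and an increasing counter's deltas turn into a run of zeros under double delta, which is why those codecs top the table above for their respective data shapes.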

    III. Make Good Use of Materialized Views

    A common scenario: showing, for each node, the current value of its CPU utilization.

    Use argMaxState to aggregate the columns:

    create materialized view cpu_last_point_idle_mv 
    engine = AggregatingMergeTree()
    partition by tuple()
    order by tags_id
    populate
    as select
    argMaxState(created_date,created_at) as created_date,
    maxState(created_at) as max_created_at,
    argMaxState(time,created_at) as time,
    tags_id,
    argMaxState(usage_idle,created_at) as usage_idle
    from cpu 
    group by tags_id
    

    argMax(a,b) returns the value of a from the row where b is maximal.
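    As a plain-Python analogy (a sketch of the semantics, not ClickHouse code): argMax picks the row with the largest b and returns that row's a.

    ```python
    # argMax(a, b): return the value of a from the row where b is maximal.
    # Plain-Python analogy of the ClickHouse aggregate function.
    rows = [
        # (usage_idle, created_at)
        (90.0, "2016-01-13 00:00:00"),
        (75.0, "2016-01-13 00:00:10"),
        (82.0, "2016-01-13 00:00:05"),
    ]

    def arg_max(pairs):
        # max by the second element of each pair, return the first
        return max(pairs, key=lambda p: p[1])[0]

    latest_idle = arg_max(rows)
    assert latest_idle == 75.0  # the usage_idle at the latest created_at
    ```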

    State is a suffix for aggregate functions: with it, a function returns not the final result but an intermediate aggregation state, which can be stored and merged by the AggregatingMergeTree engine.

    Use the Merge suffix to obtain the final aggregated result:

    create view cpu_last_point_idle_v as
    select 
    argMaxMerge(created_date) as created_date,
    maxMerge(max_created_at) as created_at,
    argMaxMerge(time) as time,
    tags_id,
    argMaxMerge(usage_idle) as usage_idle
    from cpu_last_point_idle_mv
    group by tags_id
    

    Query the result view:

    select 
    tags_id,
    100 - usage_idle usage
    from cpu_last_point_idle_v
    order by usage desc,tags_id asc
    limit 10
    

    As you can see, the query is very fast.

    IV. Storing K-V Pairs with Array Types

    A conventional table structure:

    create table cpu (
        created_date Date default today(),
        created_at DateTime default now(),
        time String,
        tags_id UInt32,
        usage_user Float64,
        usage_system Float64,
        ...
        additional_tags String default '')
    engine = MergeTree()
    partition by created_date
    order by (tags_id,created_at)
    

    Array columns give you more flexibility:

    create table cpu_dynamic(
        created_date Date,
        created_at DateTime,
        time String,
        tags_id UInt32,
        metrics_name Array(String),
        metrics_value Array(Float64)
    )
    engine = MergeTree()
    partition by created_date
    order by (tags_id,created_at)
    

    metrics_name is the array of keys; metrics_value is the array of values.

    Insert data in JSON format:

    clickhouse-client -d default 
    --query="insert into cpu_dynamic format JSONEachRow"
    <<DATA
    {"created_date":"2016-01-13",
    "created_at":"2016-01-13 00:00:00",
    "time":"2016-01-13 00:00:00 +0000",
    "tags_id":6220,
    "metrics_name":["user","system","idle","nice"],
    "metrics_value":[35,47,77,21]}
    DATA
    

    metrics_name and metrics_value can be read as corresponding K-V pairs: user:35, idle:77, and so on.

    Then ARRAY JOIN can flatten the arrays:

    select tags_id,name,value 
    from cpu_dynamic
    array join metrics_name as name,metrics_value as value 
    
    tags_id name value
    6220 user 35
    6220 system 47
    6220 idle 77
    6220 nice 21
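    ARRAY JOIN's flattening behaves like zipping parallel arrays into rows. A hypothetical Python equivalent, reusing the names of the cpu_dynamic columns above:

    ```python
    # Flattening parallel key/value arrays into rows, as ARRAY JOIN does.
    tags_id = 6220
    metrics_name = ["user", "system", "idle", "nice"]
    metrics_value = [35, 47, 77, 21]

    # One output row per array position, repeating the scalar tags_id.
    rows = [(tags_id, name, value)
            for name, value in zip(metrics_name, metrics_value)]

    assert rows[0] == (6220, "user", 35)
    assert rows[2] == (6220, "idle", 77)
    ```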

    V. Precomputing with Materialized Columns

    Create a materialized column:

    alter table cpu_dynamic add column usage_user materialized
        metrics_value[indexOf(metrics_name,'user')]
            after tags_id
    
    
    
    select time,tags_id,usage_user 
    from cpu_dynamic
    

    This adds a materialized column that computes the value in metrics_value whose key in metrics_name is 'user'.

    time tags_id usage_user
    00:00:00 6220 35
    00:00:10 6220 0
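    The materialized column amounts to this per-row lookup (a Python sketch; in ClickHouse, indexOf returns 0 on a miss and index 0 yields the type's default value, which is why the second row above shows 0):

    ```python
    # Per-row computation behind the usage_user materialized column.
    # A missing 'user' key yields the column type's default (0.0).
    def usage_user(metrics_name, metrics_value):
        try:
            return metrics_value[metrics_name.index("user")]
        except ValueError:  # 'user' absent in this row's keys
            return 0.0

    assert usage_user(["user", "system"], [35, 47]) == 35
    assert usage_user(["system", "idle"], [47, 77]) == 0.0
    ```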

    VI. Using Dictionaries Instead of Dimension-Table Joins

    Dimension joins are very common in data-warehouse models:

    select t.rack rack,avg(100-cpu.usage_idle) usage 
    from cpu 
    inner join tags as t on cpu.tags_id = t.id
    group by rack
    order by usage desc 
    limit 10
    

    This approach lacks flexibility: fetching each dimension requires a join, and dimensions change frequently.

    Configure an external dictionary.

    Create the file tag_dict.xml under /etc/clickhouse-server:

    <yandex><dictionary>
      <name>tags</name>
      <source><mysql>
        <host>localhost</host><port>3306</port>
        <user>root</user><password>****</password>
        <db>tsbs</db><table>tags</table>
      </mysql></source>
      <layout><hashed/></layout>
      <structure>
        <id> <name>id</name> </id>
        <attribute>
          <name>hostname</name><type>String</type>
          <null_value></null_value>
        </attribute> 
        ...
      </structure>
    </dictionary></yandex>
    

    For configuration details, see Configuring an External Dictionary | ClickHouse Documentation.

    Use the dictionary directly to fetch the dimension:

    select 
        dictGetString('tags','rack',toUInt64(cpu.tags_id)) rack,
        avg(100-cpu.usage_idle) usage
    from cpu
    group by rack
    order by usage desc
    limit 10
    

    The dictGetString function looks up the corresponding dimension value in the dictionary: tags is the external dictionary configured above, rack is a dictionary attribute (the dimension), and toUInt64(cpu.tags_id) is the dictionary key. The key type must be UInt64.
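    Conceptually the dictionary turns the join into a hash lookup. A rough Python analogy (the table contents below are made up for illustration; they stand in for the MySQL-backed 'tags' dictionary):

    ```python
    # A dictionary lookup replaces the join: key -> attribute, O(1) per row.
    # Hypothetical contents of the 'tags' dictionary.
    tags = {
        6220: {"hostname": "host_0", "rack": "13"},
        6221: {"hostname": "host_1", "rack": "7"},
    }

    def dict_get_string(dictionary, attribute, key):
        # Analogue of dictGetString('tags', attribute, key)
        return dictionary[key][attribute]

    assert dict_get_string(tags, "rack", 6220) == "13"
    ```

    Because the dictionary is kept in memory (the hashed layout above), each row of the fact table pays only a lookup rather than participating in a join.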

    VII. Using the MySQL Engine Directly Is Even More Convenient

    Create a database with the MySQL engine:

    create database mysql_repl
    engine = MySQL (
        '127.0.0.1',
        'repl',
        'root',
        'secret'
    )
    

    Then join against tables inside MySQL:

    select 
      t.datetime,t.date,t.request_id,
      t.name customer,s.name sku
    from (
      select t.* from traffic t
      join customer c on t.customer_id=c.id ) as t
    join mysql_repl.sku s on t.sku_id = s.id
    where customer_id = 5 
    order by t.request_id limit 10
    

    Here the WHERE clause triggers ClickHouse's predicate-pushdown optimization.

    VIII. Deleting Expired Data with TTL (Time to Live)

    Specify the expiry column and interval:

    create table traffic (
        datetime DateTime,
        date Date,
        request_id UInt64,
        cust_id UInt32,
        sku UInt32
    ) engine = MergeTree
    partition by toYYYYMM(date)
    order by(cust_id,date)
    TTL datetime + INTERVAL 90 DAY
    

    The last line means that, judged by the datetime column, data older than 90 days expires automatically.
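    The 90-day rule amounts to this per-row check (a Python sketch of the semantics, not of ClickHouse internals; actual removal happens in the background during merges):

    ```python
    from datetime import datetime, timedelta

    # A row expires once its datetime is more than 90 days in the past,
    # mirroring `TTL datetime + INTERVAL 90 DAY` (illustration only).
    def is_expired(row_datetime, now, days=90):
        return row_datetime + timedelta(days=days) <= now

    now = datetime(2020, 6, 1)
    assert is_expired(datetime(2020, 1, 1), now)      # well past 90 days
    assert not is_expired(datetime(2020, 5, 1), now)  # only 31 days old
    ```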

    A more flexible expiry setting:

    create table traffic_ttl_variable(
        date Date,
        retention_days UInt16,
        ...
    ) ...
    TTL date + INTERVAL (retention_days * 2) DAY
    

    The expiry time can also depend on column values. Here the retention period depends on the retention_days column, which means every row carries its own expiry time.
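    Per-row expiry can be pictured as follows (a Python sketch, assuming the retention_days * 2 expression from the DDL above):

    ```python
    from datetime import date, timedelta

    # Each row expires at date + retention_days * 2 days, mirroring
    # `TTL date + INTERVAL (retention_days * 2) DAY` (illustration only).
    def expires_on(row_date, retention_days):
        return row_date + timedelta(days=retention_days * 2)

    # 45 retention_days -> 90 calendar days after the row's date.
    assert expires_on(date(2020, 1, 1), 45) == date(2020, 3, 31)
    ```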

    Specify what happens to expired data:

    create table traffic_ttl_disk(
        date Date,
        retention_days UInt16,
        ...
    ) ...
    TTL date + INTERVAL 7 DAY to DISK 'ssd',
        date + INTERVAL 30 DAY to DISK 'hdd',
        date + INTERVAL 180 DAY DELETE
    

    DISK refers to a storage volume configured in ClickHouse. The above moves data to the 'ssd' disk after 7 days, to 'hdd' after 30 days, and deletes it after 180 days.

    IX. Using Replicated Tables Instead of Backups

    Replicated and sharded tables deserve a separate write-up.


    All code examples come from the Altinity webinar.

  • Original post: https://www.cnblogs.com/hdpdriver/p/14024460.html