zoukankan      html  css  js  c++  java
  • ClickHouse开源数据库

    ClickHouse是一个开源的面向列式数据的数据库管理系统,能够使用SQL查询并且生成实时数据报告。

    优点:

    1.并行处理单个查询(利用多核)

    2.在多个服务器上分布式处理

    3.非常快的扫描,可用于实时查询

    4.列存储非常适用于“宽”/“非规格化”表(多列)

    5.良好的压缩特性

    6.SQL支持(有限的支持)

    7.一系列函数的支持,包括对近似计算的支持

    8.不同的存储引擎的支持(磁盘存储格式)

    9.非常适合结构性日志/事件数据以及时间序列数据(引擎的合并树需要日期字段)

    10. 索引支持(仅主键支持,不是所有的存储引擎都支持)

    11.漂亮的命令行界面,用户友好的进度条和格式

    缺点:

    1.不支持实时的删除/更新操作,不支持事物(与Spark和大部分大数据系统一样)

    2.不支持二级索引(与Spark和大部分大数据系统一样)

    3.只支持自己的协议(不支持MySQL协议)

    4.有限的SQL支持,join的实现方式不同。如果您正在从MySQL或Spark迁移数据,那么您必须重新编写包含join的所有查询。

    5.没有窗口功能

    Group by: in-memory vs. on-disk

    内存不足是在ClickHouse中处理大型数据集时可能遇到的潜在问题之一:

    SELECT 
        min(toMonth(date)), 
        max(toMonth(date)), 
        path, 
        count(*), 
        sum(hits), 
        sum(hits) / count(*) AS hit_ratio
    FROM wikistat 
    WHERE (project = 'en') 
    GROUP BY path
    ORDER BY hit_ratio DESC
    LIMIT 10
     
    ↖ Progress: 1.83 billion rows, 85.31 GB (68.80 million rows/s., 3.21 GB/s.) ██████████▋                                                                                                                                                    6%Received exception from server:
    Code: 241. DB::Exception: Received from localhost:9000, 127.0.0.1. 
    DB::Exception: Memory limit (for query) exceeded: would use 9.31 GiB (attempt to allocate chunk of 1048576 bytes), maximum: 9.31 GiB: 
    (while reading column hits): 

    默认情况下,ClickHouse会限制group by使用的内存量(它使用 hash table来处理group by)。这很容易解决 - 如果你有空闲的内存,增加这个参数:

    SET max_memory_usage = 128000000000; #128G,

    如果你没有那么多的内存可用,ClickHouse可以通过设置这个“溢出”数据到磁盘:

    set max_bytes_before_external_group_by=20000000000; #20G
    set max_memory_usage=40000000000; #40G

    根据文档,如果需要使用max_bytes_before_external_group_by,建议将max_memory_usage设置为max_bytes_before_external_group_by大小的两倍。

    (原因是聚合需要分两个阶段进行:1.查询并且建立中间数据 2.合并中间数据。 数据“溢出”到磁盘一般发生在第一个阶段,如果没有发生数据“溢出”,ClickHouse在阶段1和阶段2可能需要相同数量的内存)

     性能对比:

    ClickHouse vs. Spark

    ClickHouse 和Spark 都是分布式的; 但是为了方便测试我们只进行单节点测试。 测试结果令人震惊

    Size / compression Spark v. 2.0.2 ClickHouse
    数据存储格式 Parquet, compressed: snappy Internal storage, compressed
    Size (uncompressed: 1.2TB) 395G 212G

     

     Test  Spark v. 2.0.2 ClickHouse Diff
    Query 1: count (warm) 7.37 sec (no disk IO) 6.61 sec  ~same
    Query 2: simple group (warm)  792.55 sec (no disk IO) 37.45 sec 21x better
    Query 3: complex group by   2522.9 sec  398.55 sec  6.3x better

     

    附录:

    硬件

    • CPU: 24xIntel(R) Xeon(R) CPU L5639 @ 2.13GHz (physical = 2, cores = 12, virtual = 24, hyperthreading = yes)
    • Disk: 2 consumer grade SSD in software RAID 0 (mdraid)

    Query 1

    select count(*) from wikistat

    ClickHouse:

    :)  select count(*) from wikistat;
     
    SELECT count(*)
    FROM wikistat 
     
    ┌─────count()─┐
    │ 26935251789 │
    └─────────────┘
     
    1 rows in set. Elapsed: 6.610 sec. Processed 26.88 billion rows, 53.77 GB (4.07 billion rows/s., 8.13 GB/s.) 

    Spark:

    spark-sql> select count(*) from wikistat;
    26935251789
    Time taken: 7.369 seconds, Fetched 1 row(s)

    Query 2

    select count(*), month(dt) as mon
    from wikistat where year(dt)=2008
    and month(dt) between 1 and 10
    group by month(dt)
    order by month(dt);

    ClickHouse:

    :) select count(*), toMonth(date) as mon from wikistat 
    where toYear(date)=2008 and toMonth(date) between 1 and 10 group by mon;
     
    SELECT 
        count(*), 
        toMonth(date) AS mon
    FROM wikistat 
    WHERE (toYear(date) = 2008) AND ((toMonth(date) >= 1) AND (toMonth(date) <= 10))
    GROUP BY mon
     
    ┌────count()─┬─mon─┐
    │ 21001626041 │
    │ 19697570692 │
    │ 20813715303 │
    │ 21568785124 │
    │ 24768906215 │
    │ 25266628966 │
    │ 24897232447 │
    │ 24803563588 │
    │ 25227465449 │
    │ 261437235210 │
    └────────────┴─────┘
     
    10 rows in set. Elapsed: 37.450 sec. Processed 23.37 billion rows, 46.74 GB (623.97 million rows/s., 1.25 GB/s.) 

    Spark:

    spark-sql> select count(*), month(dt) as mon from wikistat where year(dt)=2008 and month(dt) between 1 and 10 group by month(dt) order by month(dt);
    2100162604      1                                                               
    1969757069      2
    2081371530      3
    2156878512      4
    2476890621      5
    2526662896      6
    2489723244      7
    2480356358      8
    2522746544      9
    2614372352      10
    Time taken: 792.552 seconds, Fetched 10 row(s)

    Query 3

    SELECT
    path,
    count(*),
    sum(hits) AS sum_hits,
    round(sum(hits) / count(*), 2) AS hit_ratio
    FROM wikistat
    WHERE project = 'en'
    GROUP BY path
    ORDER BY sum_hits DESC
    LIMIT 100;

    ClickHouse:

    :) SELECT
    :-]     path,
    :-]     count(*),
    :-]     sum(hits) AS sum_hits,
    :-]     round(sum(hits) / count(*), 2) AS hit_ratio
    :-] FROM wikistat
    :-] WHERE (project = 'en')
    :-] GROUP BY path
    :-] ORDER BY sum_hits DESC
    :-] LIMIT 100;
     
    SELECT 
        path,
        count(*),
        sum(hits) AS sum_hits,
        round(sum(hits) / count(*), 2) AS hit_ratio
    FROM wikistat
    WHERE project = 'en'
    GROUP BY path
    ORDER BY sum_hits DESC
    LIMIT 100
     
    ┌─path────────────────────────────────────────────────┬─count()─┬───sum_hits─┬─hit_ratio─┐
    │ Special:Search                                      │   447954544605711101453.41 │
    │ Main_Page                                           │   31930211589697766266.74 │
    │ Special:Random                                      │   3015953383053417700.54 │
    │ Wiki                                                │   10237404884163955.11 │
    │ Special:Watchlist                                   │   3820637200069973.67 │
    │ YouTube                                             │    9960343498043448.78 │
    │ Special:Randompage                                  │    8085289596243581.9 │
    │ Special:AutoLogin                                   │   3441324436845710.11 │
    │ Facebook                                            │    7153182633532553.24 │
    │ Wikipedia                                           │   2373217848385752.08 │
    │ Barack_Obama                                        │   13832169657751226.56 │
    │ index.html                                          │    6658169215832541.54 │
    …
     
    100 rows in set. Elapsed: 398.550 sec. Processed 26.88 billion rows, 1.24 TB (67.45 million rows/s., 3.10 GB/s.)

    Spark:

    spark-sql> SELECT
             >     path,
             >     count(*),
             >     sum(hits) AS sum_hits,
             >     round(sum(hits) / count(*), 2) AS hit_ratio
             > FROM wikistat
             > WHERE (project = 'en')
             > GROUP BY path
             > ORDER BY sum_hits DESC
             > LIMIT 100;
    ...
    Time taken: 2522.903 seconds, Fetched 100 row(s)

    ClickHouse适用场景

    1. Web和App数据分析

    2. 广告网络和RTB

    3. 电信

    4. 电子商务和金融

    5. 信息安全

    6. 监测和遥测

    7. 时序数据

    8. 商业智能

    9. 在线游戏

    10. 物联网

    ClickHouse适用场景

    1. 事物性工作(OLTP)

    2. 高并发的键值访问

    3. Blob或者文档存储

    4. 超标准化的数据 

    参考:

    https://www.percona.com/blog/2017/02/13/clickhouse-new-opensource-columnar-database/

    https://clickhouse.yandex/docs/en/single/#ClickHouse%20features%20that%20can%20be%20considered%20disadvantages

     

  • 相关阅读:
    【解题报告】 [YNOI2019]排队
    【解题报告】[AHOI2001]彩票摇奖
    【解题报告】 [NOIP2006]能量项链
    【解题报告】 启示录
    【解题报告】 【NOIP2018】 摆渡车
    【解题报告】 【NOIP2017】 奶酪
    C# winform 窗体从右下角向上弹出窗口效果
    C#开发COM+组件和ActiveX控件
    JQuery
    How to delete a team project from Team Foundation Service (tfs.visualstudio.com)
  • 原文地址:https://www.cnblogs.com/davygeek/p/8018292.html
Copyright © 2011-2022 走看看