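  • Introduction to ClickHouse (Part 1): First Use

    Using ClickHouse

    ClickHouse is a column-oriented OLAP database known for its fast analytical queries. For an overview of ClickHouse, see the official documentation [1]. This article covers its basic usage.

    1. Installation

    The environment is two AWS EC2 instances running Amazon Linux 2. The ClickHouse version is v21.2.5.5-stable [2], the latest stable release at the time of writing.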

    export LATEST_VERSION=21.2.5.5
    
    curl -O https://repo.clickhouse.tech/tgz/stable/clickhouse-common-static-$LATEST_VERSION.tgz
    curl -O https://repo.clickhouse.tech/tgz/stable/clickhouse-common-static-dbg-$LATEST_VERSION.tgz
    curl -O https://repo.clickhouse.tech/tgz/stable/clickhouse-server-$LATEST_VERSION.tgz
    curl -O https://repo.clickhouse.tech/tgz/stable/clickhouse-client-$LATEST_VERSION.tgz
    
    tar -xzvf clickhouse-common-static-$LATEST_VERSION.tgz
    sudo clickhouse-common-static-$LATEST_VERSION/install/doinst.sh
    
    tar -xzvf clickhouse-common-static-dbg-$LATEST_VERSION.tgz
    sudo clickhouse-common-static-dbg-$LATEST_VERSION/install/doinst.sh
    
    tar -xzvf clickhouse-server-$LATEST_VERSION.tgz
    sudo clickhouse-server-$LATEST_VERSION/install/doinst.sh
    sudo /etc/init.d/clickhouse-server start
    
    tar -xzvf clickhouse-client-$LATEST_VERSION.tgz
    sudo clickhouse-client-$LATEST_VERSION/install/doinst.sh

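    2. First Use

    2.1. Data

    We use the dataset provided on the official site: anonymized data from Yandex.Metrica, the first service to run ClickHouse in production, before ClickHouse was open-sourced: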

    curl https://datasets.clickhouse.tech/hits/tsv/hits_v1.tsv.xz | unxz --threads=`nproc` > hits_v1.tsv
    curl https://datasets.clickhouse.tech/visits/tsv/visits_v1.tsv.xz | unxz --threads=`nproc` > visits_v1.tsv
    
    # Upload to S3
    aws s3 sync ./ s3://xxx-clickhouse/data/
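
    The curl | unxz pipeline above decompresses the archive as it streams in, without first saving the .xz file. A minimal sketch of the same idea using Python's standard lzma module follows; the helper name decompress_stream and the in-memory sample data are assumptions for illustration, not part of the original workflow.

```python
import lzma


def decompress_stream(chunks):
    """Yield decompressed bytes for an iterable of .xz-compressed chunks."""
    decompressor = lzma.LZMADecompressor()
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data


# Self-contained demo: compress some made-up TSV text, split it into small
# pieces (as if they arrived over the network), and decompress incrementally.
original = b"1\tfoo\n2\tbar\n" * 1000
compressed = lzma.compress(original)
pieces = [compressed[i:i + 64] for i in range(0, len(compressed), 64)]
restored = b"".join(decompress_stream(pieces))
print(restored == original)
```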

    2.2. Creating Tables

    Like other databases, ClickHouse ships with a database named default. Here we first create a separate tutorial database:

    ip-10-0-4-69.cn-north-1.compute.internal :) create database if not exists tutorial
    
    > CREATE DATABASE IF NOT EXISTS tutorial

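    A CREATE TABLE statement must specify three key things:

    1. The table name
    2. The table schema: column names and their data types
    3. The table engine and its settings, which determine all the details of how queries against the table are physically executed

    Yandex.Metrica is a web analytics service. The sample dataset does not cover its full functionality, so there are only two tables to create:

    1. hits: every action performed by every user on all websites covered by the service
    2. visits: pre-built sessions rather than individual actions

    The CREATE TABLE statements: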

    CREATE TABLE tutorial.hits_v1
    (
        `WatchID` UInt64,
        `JavaEnable` UInt8,
        `Title` String,
        `GoodEvent` Int16,
        `EventTime` DateTime,
        `EventDate` Date,
        `CounterID` UInt32,
        `ClientIP` UInt32,
        `ClientIP6` FixedString(16),
        `RegionID` UInt32,
        `UserID` UInt64,
        `CounterClass` Int8,
        `OS` UInt8,
        `UserAgent` UInt8,
        `URL` String,
        `Referer` String,
        `URLDomain` String,
        `RefererDomain` String,
        `Refresh` UInt8,
        `IsRobot` UInt8,
        `RefererCategories` Array(UInt16),
        `URLCategories` Array(UInt16),
        `URLRegions` Array(UInt32),
        `RefererRegions` Array(UInt32),
        `ResolutionWidth` UInt16,
        `ResolutionHeight` UInt16,
        `ResolutionDepth` UInt8,
        `FlashMajor` UInt8,
        `FlashMinor` UInt8,
        `FlashMinor2` String,
        `NetMajor` UInt8,
        `NetMinor` UInt8,
        `UserAgentMajor` UInt16,
        `UserAgentMinor` FixedString(2),
        `CookieEnable` UInt8,
        `JavascriptEnable` UInt8,
        `IsMobile` UInt8,
        `MobilePhone` UInt8,
        `MobilePhoneModel` String,
        `Params` String,
        `IPNetworkID` UInt32,
        `TraficSourceID` Int8,
        `SearchEngineID` UInt16,
        `SearchPhrase` String,
        `AdvEngineID` UInt8,
        `IsArtifical` UInt8,
        `WindowClientWidth` UInt16,
        `WindowClientHeight` UInt16,
        `ClientTimeZone` Int16,
        `ClientEventTime` DateTime,
        `SilverlightVersion1` UInt8,
        `SilverlightVersion2` UInt8,
        `SilverlightVersion3` UInt32,
        `SilverlightVersion4` UInt16,
        `PageCharset` String,
        `CodeVersion` UInt32,
        `IsLink` UInt8,
        `IsDownload` UInt8,
        `IsNotBounce` UInt8,
        `FUniqID` UInt64,
        `HID` UInt32,
        `IsOldCounter` UInt8,
        `IsEvent` UInt8,
        `IsParameter` UInt8,
        `DontCountHits` UInt8,
        `WithHash` UInt8,
        `HitColor` FixedString(1),
        `UTCEventTime` DateTime,
        `Age` UInt8,
        `Sex` UInt8,
        `Income` UInt8,
        `Interests` UInt16,
        `Robotness` UInt8,
        `GeneralInterests` Array(UInt16),
        `RemoteIP` UInt32,
        `RemoteIP6` FixedString(16),
        `WindowName` Int32,
        `OpenerName` Int32,
        `HistoryLength` Int16,
        `BrowserLanguage` FixedString(2),
        `BrowserCountry` FixedString(2),
        `SocialNetwork` String,
        `SocialAction` String,
        `HTTPError` UInt16,
        `SendTiming` Int32,
        `DNSTiming` Int32,
        `ConnectTiming` Int32,
        `ResponseStartTiming` Int32,
        `ResponseEndTiming` Int32,
        `FetchTiming` Int32,
        `RedirectTiming` Int32,
        `DOMInteractiveTiming` Int32,
        `DOMContentLoadedTiming` Int32,
        `DOMCompleteTiming` Int32,
        `LoadEventStartTiming` Int32,
        `LoadEventEndTiming` Int32,
        `NSToDOMContentLoadedTiming` Int32,
        `FirstPaintTiming` Int32,
        `RedirectCount` Int8,
        `SocialSourceNetworkID` UInt8,
        `SocialSourcePage` String,
        `ParamPrice` Int64,
        `ParamOrderID` String,
        `ParamCurrency` FixedString(3),
        `ParamCurrencyID` UInt16,
        `GoalsReached` Array(UInt32),
        `OpenstatServiceName` String,
        `OpenstatCampaignID` String,
        `OpenstatAdID` String,
        `OpenstatSourceID` String,
        `UTMSource` String,
        `UTMMedium` String,
        `UTMCampaign` String,
        `UTMContent` String,
        `UTMTerm` String,
        `FromTag` String,
        `HasGCLID` UInt8,
        `RefererHash` UInt64,
        `URLHash` UInt64,
        `CLID` UInt32,
        `YCLID` UInt64,
        `ShareService` String,
        `ShareURL` String,
        `ShareTitle` String,
        `ParsedParams` Nested(
            Key1 String,
            Key2 String,
            Key3 String,
            Key4 String,
            Key5 String,
            ValueDouble Float64),
        `IslandID` FixedString(16),
        `RequestNum` UInt32,
        `RequestTry` UInt8
    )
    ENGINE = MergeTree()
    PARTITION BY toYYYYMM(EventDate)
    ORDER BY (CounterID, EventDate, intHash32(UserID))
    SAMPLE BY intHash32(UserID)
    
    
    CREATE TABLE tutorial.visits_v1
    (
        `CounterID` UInt32,
        `StartDate` Date,
        `Sign` Int8,
        `IsNew` UInt8,
        `VisitID` UInt64,
        `UserID` UInt64,
        `StartTime` DateTime,
        `Duration` UInt32,
        `UTCStartTime` DateTime,
        `PageViews` Int32,
        `Hits` Int32,
        `IsBounce` UInt8,
        `Referer` String,
        `StartURL` String,
        `RefererDomain` String,
        `StartURLDomain` String,
        `EndURL` String,
        `LinkURL` String,
        `IsDownload` UInt8,
        `TraficSourceID` Int8,
        `SearchEngineID` UInt16,
        `SearchPhrase` String,
        `AdvEngineID` UInt8,
        `PlaceID` Int32,
        `RefererCategories` Array(UInt16),
        `URLCategories` Array(UInt16),
        `URLRegions` Array(UInt32),
        `RefererRegions` Array(UInt32),
        `IsYandex` UInt8,
        `GoalReachesDepth` Int32,
        `GoalReachesURL` Int32,
        `GoalReachesAny` Int32,
        `SocialSourceNetworkID` UInt8,
        `SocialSourcePage` String,
        `MobilePhoneModel` String,
        `ClientEventTime` DateTime,
        `RegionID` UInt32,
        `ClientIP` UInt32,
        `ClientIP6` FixedString(16),
        `RemoteIP` UInt32,
        `RemoteIP6` FixedString(16),
        `IPNetworkID` UInt32,
        `SilverlightVersion3` UInt32,
        `CodeVersion` UInt32,
        `ResolutionWidth` UInt16,
        `ResolutionHeight` UInt16,
        `UserAgentMajor` UInt16,
        `UserAgentMinor` UInt16,
        `WindowClientWidth` UInt16,
        `WindowClientHeight` UInt16,
        `SilverlightVersion2` UInt8,
        `SilverlightVersion4` UInt16,
        `FlashVersion3` UInt16,
        `FlashVersion4` UInt16,
        `ClientTimeZone` Int16,
        `OS` UInt8,
        `UserAgent` UInt8,
        `ResolutionDepth` UInt8,
        `FlashMajor` UInt8,
        `FlashMinor` UInt8,
        `NetMajor` UInt8,
        `NetMinor` UInt8,
        `MobilePhone` UInt8,
        `SilverlightVersion1` UInt8,
        `Age` UInt8,
        `Sex` UInt8,
        `Income` UInt8,
        `JavaEnable` UInt8,
        `CookieEnable` UInt8,
        `JavascriptEnable` UInt8,
        `IsMobile` UInt8,
        `BrowserLanguage` UInt16,
        `BrowserCountry` UInt16,
        `Interests` UInt16,
        `Robotness` UInt8,
        `GeneralInterests` Array(UInt16),
        `Params` Array(String),
        `Goals` Nested(
            ID UInt32,
            Serial UInt32,
            EventTime DateTime,
            Price Int64,
            OrderID String,
            CurrencyID UInt32),
        `WatchIDs` Array(UInt64),
        `ParamSumPrice` Int64,
        `ParamCurrency` FixedString(3),
        `ParamCurrencyID` UInt16,
        `ClickLogID` UInt64,
        `ClickEventID` Int32,
        `ClickGoodEvent` Int32,
        `ClickEventTime` DateTime,
        `ClickPriorityID` Int32,
        `ClickPhraseID` Int32,
        `ClickPageID` Int32,
        `ClickPlaceID` Int32,
        `ClickTypeID` Int32,
        `ClickResourceID` Int32,
        `ClickCost` UInt32,
        `ClickClientIP` UInt32,
        `ClickDomainID` UInt32,
        `ClickURL` String,
        `ClickAttempt` UInt8,
        `ClickOrderID` UInt32,
        `ClickBannerID` UInt32,
        `ClickMarketCategoryID` UInt32,
        `ClickMarketPP` UInt32,
        `ClickMarketCategoryName` String,
        `ClickMarketPPName` String,
        `ClickAWAPSCampaignName` String,
        `ClickPageName` String,
        `ClickTargetType` UInt16,
        `ClickTargetPhraseID` UInt64,
        `ClickContextType` UInt8,
        `ClickSelectType` Int8,
        `ClickOptions` String,
        `ClickGroupBannerID` Int32,
        `OpenstatServiceName` String,
        `OpenstatCampaignID` String,
        `OpenstatAdID` String,
        `OpenstatSourceID` String,
        `UTMSource` String,
        `UTMMedium` String,
        `UTMCampaign` String,
        `UTMContent` String,
        `UTMTerm` String,
        `FromTag` String,
        `HasGCLID` UInt8,
        `FirstVisit` DateTime,
        `PredLastVisit` Date,
        `LastVisit` Date,
        `TotalVisits` UInt32,
        `TraficSource` Nested(
            ID Int8,
            SearchEngineID UInt16,
            AdvEngineID UInt8,
            PlaceID UInt16,
            SocialSourceNetworkID UInt8,
            Domain String,
            SearchPhrase String,
            SocialSourcePage String),
        `Attendance` FixedString(16),
        `CLID` UInt32,
        `YCLID` UInt64,
        `NormalizedRefererHash` UInt64,
        `SearchPhraseHash` UInt64,
        `RefererDomainHash` UInt64,
        `NormalizedStartURLHash` UInt64,
        `StartURLDomainHash` UInt64,
        `NormalizedEndURLHash` UInt64,
        `TopLevelDomain` UInt64,
        `URLScheme` UInt64,
        `OpenstatServiceNameHash` UInt64,
        `OpenstatCampaignIDHash` UInt64,
        `OpenstatAdIDHash` UInt64,
        `OpenstatSourceIDHash` UInt64,
        `UTMSourceHash` UInt64,
        `UTMMediumHash` UInt64,
        `UTMCampaignHash` UInt64,
        `UTMContentHash` UInt64,
        `UTMTermHash` UInt64,
        `FromHash` UInt64,
        `WebVisorEnabled` UInt8,
        `WebVisorActivity` UInt32,
        `ParsedParams` Nested(
            Key1 String,
            Key2 String,
            Key3 String,
            Key4 String,
            Key5 String,
            ValueDouble Float64),
        `Market` Nested(
            Type UInt8,
            GoalID UInt32,
            OrderID String,
            OrderPrice Int64,
            PP UInt32,
            DirectPlaceID UInt32,
            DirectOrderID UInt32,
            DirectBannerID UInt32,
            GoodID String,
            GoodName String,
            GoodQuantity Int32,
            GoodPrice Int64),
        `IslandID` FixedString(16)
    )
    ENGINE = CollapsingMergeTree(Sign)
    PARTITION BY toYYYYMM(StartDate)
    ORDER BY (CounterID, StartDate, intHash32(UserID), VisitID)
    SAMPLE BY intHash32(UserID)

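    As shown above, hits_v1 uses the MergeTree engine, while visits_v1 uses the CollapsingMergeTree engine. Both tables are partitioned by month: toYYYYMM(EventDate) for hits_v1 and toYYYYMM(StartDate) for visits_v1.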
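
    As a rough illustration of what a monthly partition key does, the sketch below mimics toYYYYMM in plain Python and groups a few dates by it. The function names to_yyyymm and group_by_partition and the sample dates are assumptions for illustration; ClickHouse evaluates the real expression internally.

```python
from collections import defaultdict
from datetime import date


def to_yyyymm(d: date) -> int:
    """Mimic ClickHouse's toYYYYMM: 2014-03-23 -> 201403."""
    return d.year * 100 + d.month


def group_by_partition(dates):
    """Group dates by their monthly partition key."""
    partitions = defaultdict(list)
    for d in dates:
        partitions[to_yyyymm(d)].append(d)
    return dict(partitions)


# Made-up sample dates, not taken from the dataset.
sample = [date(2014, 3, 17), date(2014, 3, 23), date(2014, 4, 1)]
partitions = group_by_partition(sample)
for key in sorted(partitions):
    print(key, len(partitions[key]))
```

    Rows from the same calendar month land in the same partition, so a query that filters on a date range only needs to consider the partitions for the months it covers.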

    2.3. Importing and Querying Data

    Import the local data:

    clickhouse-client --query "INSERT INTO tutorial.hits_v1 FORMAT TSV" --max_insert_block_size=100000 < hits_v1.tsv
    clickhouse-client --query "INSERT INTO tutorial.visits_v1 FORMAT TSV" --max_insert_block_size=100000 < visits_v1.tsv

    Optimize the tables:

    OPTIMIZE TABLE tutorial.hits_v1 FINAL
    OPTIMIZE TABLE tutorial.visits_v1 FINAL

    Example queries:

    SELECT
        StartURL AS URL,
        AVG(Duration) AS AvgDuration
    FROM tutorial.visits_v1
    WHERE (StartDate >= '2014-03-23') AND (StartDate <= '2014-03-30')
    GROUP BY URL
    ORDER BY AvgDuration DESC
    LIMIT 10
    
    10 rows in set. Elapsed: 0.088 sec. Processed 1.45 million rows, 114.85 MB (16.56 million rows/s., 1.31 GB/s.)
    
    SELECT
        sum(Sign) AS visits,
        sumIf(Sign, has(Goals.ID, 1105530)) AS goal_visits,
        (100. * goal_visits) / visits AS goal_percent
    FROM tutorial.visits_v1
    WHERE (CounterID = 912887) AND (toYYYYMM(StartDate) = 201403) AND (domain(StartURL) = 'yandex.ru')
    
    1 rows in set. Elapsed: 0.012 sec. Processed 13.05 thousand rows, 2.88 MB (1.10 million rows/s., 242.38 MB/s.)

    Both queries return almost instantly, taking only 0.088 and 0.012 seconds respectively.
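
    The second query counts visits with sum(Sign) rather than count(). A minimal sketch of why, assuming the usual CollapsingMergeTree convention that a row with Sign = 1 records a state and a row with Sign = -1 cancels an earlier one. The rows below are invented for illustration and do not come from the dataset.

```python
# Each tuple is (VisitID, Duration, Sign); the values are made up.
rows = [
    (1, 10, 1),    # visit 1, first recorded state
    (1, 10, -1),   # cancels the previous state of visit 1
    (1, 25, 1),    # visit 1, updated state
    (2, 5, 1),     # visit 2, only state
]

# Counting rows overstates the number of visits before rows are collapsed.
row_count = len(rows)

# Summing Sign lets cancelled states net out to zero.
visits = sum(sign for _, _, sign in rows)

# The same idea works for other sums: weight each value by Sign.
total_duration = sum(duration * sign for _, duration, sign in rows)

print(row_count, visits, total_duration)
```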

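    3. Why ClickHouse Is So Fast

    According to various public documents, there are two main reasons ClickHouse is so fast:

    1. It is a column-oriented database
    2. It uses a vectorized execution engine

    3.1. Columnar Storage

    The differences between columnar and row-based storage are already covered in detail by many public documents, so they are not repeated here. In short, the advantages of columnar storage are:

    1. Only the columns a query needs are read, so the other columns are never scanned
    2. It is compression-friendly: values in the same column share a data type and real-world meaning, so repeated values are more likely

    These two advantages give:

    1. A smaller scan range: the amount of data to be scanned is reduced significantly
    2. Less data to transfer: the higher the compression ratio, the smaller the data, so less is read from disk and sent over the network, which lowers the pressure on disk I/O and network bandwidth and speeds up queries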
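
    A minimal sketch of both points, using a tiny invented table and run-length encoding as a stand-in for a real compression codec (ClickHouse itself uses codecs such as LZ4). The table, the column names, and the rle helper are assumptions for illustration.

```python
from itertools import groupby

# A tiny invented table stored row by row: (user_id, country, status).
rows = [
    (1, "CN", "ok"),
    (2, "CN", "ok"),
    (3, "CN", "ok"),
    (4, "US", "ok"),
    (5, "US", "ok"),
    (6, "US", "error"),
]

# The same table stored column by column.
columns = {
    "user_id": [r[0] for r in rows],
    "country": [r[1] for r in rows],
    "status": [r[2] for r in rows],
}


def rle(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# Point 1: counting rows per country touches only one column's values,
# whereas a row store has to read every field of every row.
values_read_columnar = len(columns["country"])
values_read_row_store = sum(len(r) for r in rows)

# Point 2: values in one column repeat, so they encode into few runs.
# Interleaving fields row by row breaks those runs up.
country_runs = rle(columns["country"])
row_major_runs = rle([field for r in rows for field in r])

print(values_read_columnar, values_read_row_store)
print(country_runs)
print(len(row_major_runs))
```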

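    3.2. Vectorized Execution

    Vectorized execution is a way of making effective use of the CPU's instruction set. It requires a CPU that supports SIMD (Single Instruction, Multiple Data) instructions, where a single instruction operates on multiple data items at once.

    An answer on Stack Overflow [3] explains this more concretely:

    Many CPUs have "vector" or "SIMD" instruction sets that apply the same operation to two, four, or more pieces of data simultaneously. Vectorization is the process of rewriting a loop to use them: an ordinary loop (a while loop, for example) over an array of length N needs N iterations, whereas a vectorized loop that handles four elements at a time needs only about N/4 iterations.

    As a more concrete example, take the following loop: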

    for (int i=0; i<16; ++i)
        C[i] = A[i] + B[i];

    Processed the traditional way, each iteration adds A[i] and B[i] and assigns the result to C[i].

    The loop can be unrolled as:

    for (int i=0; i<16; i+=4) {
        C[i]   = A[i]   + B[i];
        C[i+1] = A[i+1] + B[i+1];
        C[i+2] = A[i+2] + B[i+2];
        C[i+3] = A[i+3] + B[i+3];
    }

    With a vectorized operation, this can be written as:

    for (int i=0; i<16; i+=4)
        addFourThingsAtOnceAndStoreResult(&C[i], &A[i], &B[i]);

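    Here addFourThingsAtOnceAndStoreResult() stands for a vectorized operation that processes four elements in a single iteration. Readers familiar with NumPy's vectorized operations in Python will recognize the idea.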
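
    The restructured loop can be mimicked in Python to show the change in control flow. This is only a sketch: the helper add_four_at_once is a hypothetical stand-in for addFourThingsAtOnceAndStoreResult(), and plain Python slices do not map to SIMD instructions, so the example illustrates the iteration count rather than any real speedup.

```python
A = list(range(16))
B = list(range(100, 116))


def add_scalar(a, b):
    """One addition per loop iteration."""
    c = [0] * len(a)
    iterations = 0
    for i in range(len(a)):
        c[i] = a[i] + b[i]
        iterations += 1
    return c, iterations


def add_four_at_once(c, a, b, i):
    """Hypothetical stand-in for a SIMD add: fills c[i:i+4] in one call."""
    c[i:i + 4] = [x + y for x, y in zip(a[i:i + 4], b[i:i + 4])]


def add_chunked(a, b):
    """Four additions per loop iteration; assumes len(a) is a multiple of 4."""
    c = [0] * len(a)
    iterations = 0
    for i in range(0, len(a), 4):
        add_four_at_once(c, a, b, i)
        iterations += 1
    return c, iterations


scalar_result, scalar_iterations = add_scalar(A, B)
chunked_result, chunked_iterations = add_chunked(A, B)
print(scalar_iterations, chunked_iterations)
```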

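    3.3. Continuous Optimization

    According to Zhu Kai [4], further reasons for ClickHouse's speed include:

    1. The developers pay attention to every detail that affects performance and optimize it; these small gains accumulate, so performance keeps improving
    2. The algorithm best suited to each scenario is used, to get the most performance out of it
    3. When a more suitable or faster algorithm appears, the developers test it right away, keeping it if the results are good and discarding it otherwise
    4. ClickHouse is released very frequently, and the developers keep refining it in pursuit of better performance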

    References

    [1] https://clickhouse.tech/docs/zh/

    [2] https://github.com/ClickHouse/ClickHouse/releases/tag/v21.1.8.30-stable

    [3] https://stackoverflow.com/questions/1422149/what-is-vectorization

    [4] Zhu Kai, ClickHouse原理解析与应用实践, China Machine Press, 2020

  • Original article: https://www.cnblogs.com/zackstang/p/14660471.html