背景介绍:
有三台CentOS7服务器安装了ClickHouse
HostName | IP | 安装程序 | 程序端口 |
centf8118.sharding1.db | 192.168.81.18 | clickhouse-server,clickhouse-client | 9000 |
centf8119.sharding2.db | 192.168.81.19 | clickhouse-server,clickhouse-client | 9000 |
centf8120.sharding3.db | 192.168.81.20 | clickhouse-server,clickhouse-client | 9000 |
1:创建数据库
CREATE DATABASE [IF NOT EXISTS] db_name [ON CLUSTER cluster] [ENGINE = engine(...)]
CREATE DATABASE testdb; //创建数据库
DROP DATABASE testdb; //删除数据库
2:建表
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster] ( name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [compression_codec] [TTL expr1], name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [compression_codec] [TTL expr2], ... ) ENGINE = engine
CREATE TABLE test_table( province String, province_name String, create_date date ) ENGINE = MergeTree(create_date, (province), 8192); 如果直接这样执行会报错,有两种方法解决: 1: 在每一行后面加右斜杠,比如:
CREATE TABLE test_table( province String, province_name String, create_date date ) ENGINE = MergeTree(create_date, (province), 8192); 2: 在登录的时候加 -m参数支持多行模式,比如: clickhouse-client -m
ENGINE:是表的引擎类型,最常用的MergeTree。还有一个Log引擎也是比较常用。MergeTree要求有一个日期字段,还有主键。Log没有这个限制。
create_date:是表的日期字段,一个表必须要有一个日期字段。
province:是表的主键,主键可以有多个字段,每个字段用逗号分隔
8192:是索引粒度,用默认值8192即可。
3:导入数据
3.1:普通的CSV文件导入。
cat > test_table.csv << EOF WA,WA_NAME,2020-08-25 CA,CA_NAME,2020-09-25 OR,OR_NAME,2020-10-25 EOF
–导数
clickhouse-client --query "INSERT INTO testdb.test_table FORMAT CSV" < test_table.csv;
–或者用管道的方式
cat test_table.csv | clickhouse-client --query “INSERT INTO testdb.test_table FORMAT CSV”
查看结果数据:
centf8118.sharding1.db :) select * from test_table limit 2 ; SELECT * FROM test_table LIMIT 2 ┌─province─┬─province_name─┬─create_date─┐ │ WA │ WA_NAME │ 2020-08-25 │ └──────────┴───────────────┴─────────────┘ ┌─province─┬─province_name─┬─create_date─┐ │ CA │ CA_NAME │ 2020-09-25 │ └──────────┴───────────────┴─────────────┘ 2 rows in set. Elapsed: 0.004 sec.
3.2:特殊的CSV文件导入(包含回车换行,转义符等)。
–建表(sql语句)
CREATE TABLE testdb.test_table3 ( id1 UInt32, id2 Float32, name1 String, name2 String, date1 Date, date2 DateTime) ENGINE = Log;
–测试数据(Linux命令)
[root@centf8118 tmp]# cat test_table3.csv
1,123.456,”abc 123”,” abc" "'123”,2020-08-26,2020-08-26 17:08:09
–执行导入(Linux命令)
clickhouse-client --query "INSERT INTO testdb.test_table3 FORMAT CSV" < test_table3.csv
–查看结果(sql语句)
centf8118.sharding1.db :) select * from test_table3; SELECT * FROM test_table3 ┌─id1─┬─────id2─┬─name1─────┬─name2────────────┬──────date1─┬───────────────date2─┐ │ 1 │ 123.456 │ ”abc 123” │ ” abc" "'123” │ 2020-08-26 │ 2020-08-26 17:08:09 │ └─────┴─────────┴───────────┴──────────────────┴────────────┴─────────────────────┘
3.3:尝试导入官方提供的大数据表
官方文档: https://clickhouse.tech/docs/en/getting-started/tutorial/
1>>下载并提取表数据,提取的文件大小约为10GB。
我这里下载到/data/clickhouse/tmp/目录下。
curl https://clickhouse-datasets.s3.yandex.net/hits/tsv/hits_v1.tsv.xz | unxz --threads=`nproc` > hits_v1.tsv curl https://clickhouse-datasets.s3.yandex.net/visits/tsv/visits_v1.tsv.xz | unxz --threads=`nproc` > visits_v1.tsv
[root@centf8118 tmp]# curl https://clickhouse-datasets.s3.yandex.net/hits/tsv/hits_v1.tsv.xz | unxz --threads=`nproc` > hits_v1.tsv % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 802M 100 802M 0 0 8074k 0 0:01:41 0:01:41 --:--:-- 9047k [root@centf8118 tmp]# curl https://clickhouse-datasets.s3.yandex.net/visits/tsv/visits_v1.tsv.xz | unxz --threads=`nproc` > visits_v1.tsv % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 405M 100 405M 0 0 8790k 0 0:00:47 0:00:47 --:--:-- 9911k
查看目录下文件大小为9.8G。
[root@centf8118 ~]# du -sh /data/clickhouse/tmp 9.8G /data/clickhouse/tmp
2>>创建新的数据库:tutorial
clickhouse-client --query "CREATE DATABASE IF NOT EXISTS tutorial"
3>>创建要导入数据的表:tutorial.hits_v1和tutorial.visits_v1
创建tutorial.hits_v1表:
CREATE TABLE tutorial.hits_v1 ( `WatchID` UInt64, `JavaEnable` UInt8, `Title` String, `GoodEvent` Int16, `EventTime` DateTime, `EventDate` Date, `CounterID` UInt32, `ClientIP` UInt32, `ClientIP6` FixedString(16), `RegionID` UInt32, `UserID` UInt64, `CounterClass` Int8, `OS` UInt8, `UserAgent` UInt8, `URL` String, `Referer` String, `URLDomain` String, `RefererDomain` String, `Refresh` UInt8, `IsRobot` UInt8, `RefererCategories` Array(UInt16), `URLCategories` Array(UInt16), `URLRegions` Array(UInt32), `RefererRegions` Array(UInt32), `ResolutionWidth` UInt16, `ResolutionHeight` UInt16, `ResolutionDepth` UInt8, `FlashMajor` UInt8, `FlashMinor` UInt8, `FlashMinor2` String, `NetMajor` UInt8, `NetMinor` UInt8, `UserAgentMajor` UInt16, `UserAgentMinor` FixedString(2), `CookieEnable` UInt8, `JavascriptEnable` UInt8, `IsMobile` UInt8, `MobilePhone` UInt8, `MobilePhoneModel` String, `Params` String, `IPNetworkID` UInt32, `TraficSourceID` Int8, `SearchEngineID` UInt16, `SearchPhrase` String, `AdvEngineID` UInt8, `IsArtifical` UInt8, `WindowClientWidth` UInt16, `WindowClientHeight` UInt16, `ClientTimeZone` Int16, `ClientEventTime` DateTime, `SilverlightVersion1` UInt8, `SilverlightVersion2` UInt8, `SilverlightVersion3` UInt32, `SilverlightVersion4` UInt16, `PageCharset` String, `CodeVersion` UInt32, `IsLink` UInt8, `IsDownload` UInt8, `IsNotBounce` UInt8, `FUniqID` UInt64, `HID` UInt32, `IsOldCounter` UInt8, `IsEvent` UInt8, `IsParameter` UInt8, `DontCountHits` UInt8, `WithHash` UInt8, `HitColor` FixedString(1), `UTCEventTime` DateTime, `Age` UInt8, `Sex` UInt8, `Income` UInt8, `Interests` UInt16, `Robotness` UInt8, `GeneralInterests` Array(UInt16), `RemoteIP` UInt32, `RemoteIP6` FixedString(16), `WindowName` Int32, `OpenerName` Int32, `HistoryLength` Int16, `BrowserLanguage` FixedString(2), `BrowserCountry` FixedString(2), `SocialNetwork` String, `SocialAction` String, `HTTPError` UInt16, `SendTiming` Int32, `DNSTiming` Int32, `ConnectTiming` Int32, `ResponseStartTiming` Int32, `ResponseEndTiming` Int32, `FetchTiming` Int32, `RedirectTiming` Int32, `DOMInteractiveTiming` Int32, `DOMContentLoadedTiming` Int32, `DOMCompleteTiming` Int32, `LoadEventStartTiming` Int32, `LoadEventEndTiming` Int32, `NSToDOMContentLoadedTiming` Int32, `FirstPaintTiming` Int32, `RedirectCount` Int8, `SocialSourceNetworkID` UInt8, `SocialSourcePage` String, `ParamPrice` Int64, `ParamOrderID` String, `ParamCurrency` FixedString(3), `ParamCurrencyID` UInt16, `GoalsReached` Array(UInt32), `OpenstatServiceName` String, `OpenstatCampaignID` String, `OpenstatAdID` String, `OpenstatSourceID` String, `UTMSource` String, `UTMMedium` String, `UTMCampaign` String, `UTMContent` String, `UTMTerm` String, `FromTag` String, `HasGCLID` UInt8, `RefererHash` UInt64, `URLHash` UInt64, `CLID` UInt32, `YCLID` UInt64, `ShareService` String, `ShareURL` String, `ShareTitle` String, `ParsedParams` Nested( Key1 String, Key2 String, Key3 String, Key4 String, Key5 String, ValueDouble Float64), `IslandID` FixedString(16), `RequestNum` UInt32, `RequestTry` UInt8 ) ENGINE = MergeTree() PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDate, intHash32(UserID)) SAMPLE BY intHash32(UserID) SETTINGS index_granularity = 8192;
创建tutorial.visits_v1表:
CREATE TABLE tutorial.visits_v1 ( `CounterID` UInt32, `StartDate` Date, `Sign` Int8, `IsNew` UInt8, `VisitID` UInt64, `UserID` UInt64, `StartTime` DateTime, `Duration` UInt32, `UTCStartTime` DateTime, `PageViews` Int32, `Hits` Int32, `IsBounce` UInt8, `Referer` String, `StartURL` String, `RefererDomain` String, `StartURLDomain` String, `EndURL` String, `LinkURL` String, `IsDownload` UInt8, `TraficSourceID` Int8, `SearchEngineID` UInt16, `SearchPhrase` String, `AdvEngineID` UInt8, `PlaceID` Int32, `RefererCategories` Array(UInt16), `URLCategories` Array(UInt16), `URLRegions` Array(UInt32), `RefererRegions` Array(UInt32), `IsYandex` UInt8, `GoalReachesDepth` Int32, `GoalReachesURL` Int32, `GoalReachesAny` Int32, `SocialSourceNetworkID` UInt8, `SocialSourcePage` String, `MobilePhoneModel` String, `ClientEventTime` DateTime, `RegionID` UInt32, `ClientIP` UInt32, `ClientIP6` FixedString(16), `RemoteIP` UInt32, `RemoteIP6` FixedString(16), `IPNetworkID` UInt32, `SilverlightVersion3` UInt32, `CodeVersion` UInt32, `ResolutionWidth` UInt16, `ResolutionHeight` UInt16, `UserAgentMajor` UInt16, `UserAgentMinor` UInt16, `WindowClientWidth` UInt16, `WindowClientHeight` UInt16, `SilverlightVersion2` UInt8, `SilverlightVersion4` UInt16, `FlashVersion3` UInt16, `FlashVersion4` UInt16, `ClientTimeZone` Int16, `OS` UInt8, `UserAgent` UInt8, `ResolutionDepth` UInt8, `FlashMajor` UInt8, `FlashMinor` UInt8, `NetMajor` UInt8, `NetMinor` UInt8, `MobilePhone` UInt8, `SilverlightVersion1` UInt8, `Age` UInt8, `Sex` UInt8, `Income` UInt8, `JavaEnable` UInt8, `CookieEnable` UInt8, `JavascriptEnable` UInt8, `IsMobile` UInt8, `BrowserLanguage` UInt16, `BrowserCountry` UInt16, `Interests` UInt16, `Robotness` UInt8, `GeneralInterests` Array(UInt16), `Params` Array(String), `Goals` Nested( ID UInt32, Serial UInt32, EventTime DateTime, Price Int64, OrderID String, CurrencyID UInt32), `WatchIDs` Array(UInt64), `ParamSumPrice` Int64, `ParamCurrency` FixedString(3), `ParamCurrencyID` UInt16, `ClickLogID` UInt64, `ClickEventID` Int32, `ClickGoodEvent` Int32, `ClickEventTime` DateTime, `ClickPriorityID` Int32, `ClickPhraseID` Int32, `ClickPageID` Int32, `ClickPlaceID` Int32, `ClickTypeID` Int32, `ClickResourceID` Int32, `ClickCost` UInt32, `ClickClientIP` UInt32, `ClickDomainID` UInt32, `ClickURL` String, `ClickAttempt` UInt8, `ClickOrderID` UInt32, `ClickBannerID` UInt32, `ClickMarketCategoryID` UInt32, `ClickMarketPP` UInt32, `ClickMarketCategoryName` String, `ClickMarketPPName` String, `ClickAWAPSCampaignName` String, `ClickPageName` String, `ClickTargetType` UInt16, `ClickTargetPhraseID` UInt64, `ClickContextType` UInt8, `ClickSelectType` Int8, `ClickOptions` String, `ClickGroupBannerID` Int32, `OpenstatServiceName` String, `OpenstatCampaignID` String, `OpenstatAdID` String, `OpenstatSourceID` String, `UTMSource` String, `UTMMedium` String, `UTMCampaign` String, `UTMContent` String, `UTMTerm` String, `FromTag` String, `HasGCLID` UInt8, `FirstVisit` DateTime, `PredLastVisit` Date, `LastVisit` Date, `TotalVisits` UInt32, `TraficSource` Nested( ID Int8, SearchEngineID UInt16, AdvEngineID UInt8, PlaceID UInt16, SocialSourceNetworkID UInt8, Domain String, SearchPhrase String, SocialSourcePage String), `Attendance` FixedString(16), `CLID` UInt32, `YCLID` UInt64, `NormalizedRefererHash` UInt64, `SearchPhraseHash` UInt64, `RefererDomainHash` UInt64, `NormalizedStartURLHash` UInt64, `StartURLDomainHash` UInt64, `NormalizedEndURLHash` UInt64, `TopLevelDomain` UInt64, `URLScheme` UInt64, `OpenstatServiceNameHash` UInt64, `OpenstatCampaignIDHash` UInt64, `OpenstatAdIDHash` UInt64, `OpenstatSourceIDHash` UInt64, `UTMSourceHash` UInt64, `UTMMediumHash` UInt64, `UTMCampaignHash` UInt64, `UTMContentHash` UInt64, `UTMTermHash` UInt64, `FromHash` UInt64, `WebVisorEnabled` UInt8, `WebVisorActivity` UInt32, `ParsedParams` Nested( Key1 String, Key2 String, Key3 String, Key4 String, Key5 String, ValueDouble Float64), `Market` Nested( Type UInt8, GoalID UInt32, OrderID String, OrderPrice Int64, PP UInt32, DirectPlaceID UInt32, DirectOrderID UInt32, DirectBannerID UInt32, GoodID String, GoodName String, GoodQuantity Int32, GoodPrice Int64), `IslandID` FixedString(16) ) ENGINE = CollapsingMergeTree(Sign) PARTITION BY toYYYYMM(StartDate) ORDER BY (CounterID, StartDate, intHash32(UserID), VisitID) SAMPLE BY intHash32(UserID) SETTINGS index_granularity = 8192;
4>>导入数据
clickhouse-client --query "INSERT INTO tutorial.hits_v1 FORMAT TSV" --max_insert_block_size=100000 < hits_v1.tsv clickhouse-client --query "INSERT INTO tutorial.visits_v1 FORMAT TSV" --max_insert_block_size=100000 < visits_v1.tsv
数据导入完后,优化一下表
clickhouse-client --query "OPTIMIZE TABLE tutorial.hits_v1 FINAL" clickhouse-client --query "OPTIMIZE TABLE tutorial.visits_v1 FINAL"
优化完之后,查看下表数据量
clickhouse-client --query "SELECT COUNT(*) FROM tutorial.hits_v1" clickhouse-client --query "SELECT COUNT(*) FROM tutorial.visits_v1"
[root@centf8118 tmp]# clickhouse-client --query "SELECT COUNT(*) FROM tutorial.hits_v1" 8873898 [root@centf8118 tmp]# clickhouse-client --query "SELECT COUNT(*) FROM tutorial.visits_v1" 1676861
查询数据
SELECT StartURL AS URL, AVG(Duration) AS AvgDuration FROM tutorial.visits_v1 WHERE StartDate BETWEEN '2014-03-23' AND '2014-03-30' GROUP BY URL ORDER BY AvgDuration DESC LIMIT 10 SELECT sum(Sign) AS visits, sumIf(Sign, has(Goals.ID, 1105530)) AS goal_visits, (100. * goal_visits) / visits AS goal_percent FROM tutorial.visits_v1 WHERE (CounterID = 912887) AND (toYYYYMM(StartDate) = 201403) AND (domain(StartURL) = 'yandex.ru')