zoukankan      html  css  js  c++  java
  • StreamSets Data Collector 实时同步 PostgreSQL/MySQL 数据到 Clickhouse

    Streamsets 是一款大数据实时采集和 ETL 工具,可以实现不写一行代码完成数据的采集和流转。通过拖拽式的可视化界面,实现数据管道(Pipelines)的设计和定时任务调度。最大的特点有:
    - 可视化界面操作,不写代码完成数据的采集和流转,在几分钟内设计用于流式传输、批处理和更改数据捕获 (CDC) 的管道
    - 内置监控,可是实时查看数据流传输的基本信息和数据的质量
    - 强大的整合力,对现有常用组件全力支持,包括50种数据源、44 种数据操作、46 种目的地。

     对于Streamsets来说,最重要的概念就是数据源(Origins)、操作(Processors)、目的地(Destinations)。创建一个Pipelines管道配置也基本是这三个方面。

    常见的 Origins 有 Kafka、HTTP、UDP、JDBC、HDFS 等;Processors 可以实现对每个字段的过滤、更改、编码、聚合等操作;Destinations 跟 Origins 差不多,可以写入 Kafka、Flume、JDBC、HDFS、Redis 等。

    Origins(读取的目标数据源)(文档: https://docs.streamsets.com/portal/#datacollector/3.22.x/help/datacollector/UserGuide/Origins/Origins_title.html)

    Destinations(写入的数据源)

    Streamsets 安装

    # 关闭防火墙:
    systemctl stop firewalld.service
    # 关闭SELINUX:
    setenforce 0 (临时生效)  
    修改 /etc/selinux/config 下的 SELINUX=disabled (重启后永久生效)
    # 配置文件上限
    ulimit -n
    vi /etc/security/limits.conf
    sdc soft nofile    32768
    sdc hard nofile    32768
    vi /etc/profile
    ulimit -HSn 32768
    source /etc/profile
    # mac OSX 下修改 ulimit 参数
    sudo sysctl -w kern.maxfilesperproc=65535
    ulimit -n 65535
    # 安装 axel
    brew install axel
    wget http://download-ib01.fedoraproject.org/pub/epel/7/x86_64/Packages/a/axel-2.4-9.el7.x86_64.rpm
    rpm -ivh axel-2.4-9.el7.x86_64.rpm
    # 下载 streamsets-datacollector
    axel -n 20 -a -v https://archives.streamsets.com/datacollector/3.22.3/tarball/activation/streamsets-datacollector-common-3.22.3.tgz
    axel -n 20 -a -v https://archives.streamsets.com/datacollector/3.22.3/tarball/activation/streamsets-datacollector-all-3.22.3.tgz(不推荐,全组件)
    # 安装 jdk11
    sudo yum install java-11-openjdk-devel
    scp /Users/irving/Downloads/streamsets-datacollector-common-3.22.3.tgz root@10.34.12.255:/opt
    
    # 启动服务
    tar xvzf streamsets-datacollector-common-3.22.3.tgz
    nohup bin/streamsets dc &
    
    # pm2 管理
    pm2 start "bin/streamsets dc" --name streamsets-datacollector
    pm2 startup
    pm2 save
    
    # 修改配置文件(启用 form 认证,非默认的 OAUTH2 认证)
    vi sdc.properties
    # The authentication for the HTTP endpoint of Data Collector
    # Valid values are: 'none', 'basic', 'digest', 'form' or 'aster'
    http.authentication=form
    pm2 logs streamsets-datacollector
    0|streamse | Java 11 detected; adding $SDC_JAVA11_OPTS of "-Djdk.nio.maxCachedBufferSize=262144" to $SDC_JAVA_OPTS
    0|streamse | Activation enabled, activation is valid and it does not expire
    0|streamse | Logging initialized @3119ms to org.eclipse.jetty.util.log.Slf4jLog
    0|streamsets-datacollector  | Running on URI : 'http://:18630'
    
    # 上传 JDBC 驱动 (复制到 /opt/streamsets-datacollector-3.22.3/streamsets-libs-extras/streamsets-datacollector-jdbc-lib/lib 文件夹)
    mvn dependency:copy-dependencies -DoutputDirectory=lib -DincludeScope=compile
    scp -r /Users/irving/Desktop/git/microsrv/clickhouse/lib root@10.34.12.255:/opt/streamsets-datacollector-3.22.3/streamsets-libs-extras/streamsets-datacollector-jdbc-lib/lib
    pm2 restart streamsets-datacollector

     Maven 打包上传 JDBC 驱动包(MSSQL 与 PostgreSQL 默认支持)

     <dependencies>
            <dependency>
                <groupId>ru.yandex.clickhouse</groupId>
                <artifactId>clickhouse-jdbc</artifactId>
                <version>0.3.1</version>
            </dependency>
            <dependency>
                <groupId>mysql</groupId>
                <artifactId>mysql-connector-java</artifactId>
                <version>8.0.25</version>
            </dependency>
            <!--        <dependency>-->
            <!--            <groupId>org.postgresql</groupId>-->
            <!--            <artifactId>postgresql</artifactId>-->
            <!--            <version>42.2.23</version>-->
            <!--        </dependency>-->
        </dependencies>
    # tree -h
    .
    ├── [ 352K]  clickhouse-jdbc-0.3.1.jar
    ├── [ 327K]  commons-codec-1.11.jar
    ├── [  60K]  commons-logging-1.2.jar
    ├── [ 762K]  httpclient-4.5.13.jar
    ├── [ 321K]  httpcore-4.4.13.jar
    ├── [  41K]  httpmime-4.5.13.jar
    ├── [  65K]  jackson-annotations-2.9.10.jar
    ├── [ 318K]  jackson-core-2.9.10.jar
    ├── [ 1.3M]  jackson-databind-2.9.10.8.jar
    ├── [ 635K]  lz4-java-1.7.1.jar
    ├── [ 2.3M]  mysql-connector-java-8.0.25.jar
    ├── [ 1.6M]  protobuf-java-3.11.4.jar
    └── [  40K]  slf4j-api-1.7.30.jar

    Clickhouse 安装

    # 检查环境
    grep -q sse4_2 /proc/cpuinfo && echo “SSE 4.2 supported” || echo “SSE 4.2 not supported.
    “SSE 4.2 supported” 代表可以安装,ClickHouse 需要使用 SSE 硬件指令集加速,大大加快了 CPU 寄存器计算效率。
    # 安装 clickhouse
    sudo yum install yum-utils
    sudo rpm --import https://repo.clickhouse.tech/CLICKHOUSE-KEY.GPG
    sudo yum-config-manager --add-repo https://repo.clickhouse.tech/rpm/clickhouse.repo
    sudo yum install clickhouse-server clickhouse-client
    
    [root@ ~]# sudo yum install clickhouse-server clickhouse-client
    Repository epel is listed more than once in the configuration
    上次元数据过期检查:0:01:02 前,执行于 2021年07月15日 星期四 15时49分34秒。
    依赖关系解决。
    ============================================================================================================================================================================================================
    软件包                                                    架构                                    版本                                            仓库                                                大小
    ============================================================================================================================================================================================================
    安装:
    clickhouse-client                                         noarch                                  21.7.3.14-2                                     clickhouse-stable                                   76 k
    clickhouse-server                                         noarch                                  21.7.3.14-2                                     clickhouse-stable                                  100 k
    安装依赖关系:
    clickhouse-common-static                                  x86_64                                  21.7.3.14-2                                     clickhouse-stable                                  166 M
    
    事务概要
    ============================================================================================================================================================================================================
    安装  3 软件包
    
    总下载:166 M
    安装大小:539 M
    确定吗?[y/N]: y
    下载软件包:
    [MIRROR] clickhouse-client-21.7.3.14-2.noarch.rpm: Curl error (56): Failure when receiving data from the peer for http://repo.clickhouse.tech/rpm/stable/x86_64/clickhouse-client-21.7.3.14-2.noarch.rpm [Recv failure: 连接被对方重设]
    (1/3): clickhouse-server-21.7.3.14-2.noarch.rpm                                                                                                                             142 kB/s | 100 kB     00:00
    (2/3): clickhouse-client-21.7.3.14-2.noarch.rpm                                                                                                                              99 kB/s |  76 kB     00:00
    (3/3): clickhouse-common-static-21.7.3.14-2.x86_64.rpm                                                                                                                      5.3 MB/s | 166 MB     00:31
    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    总计                                                                                                                                                                        5.3 MB/s | 166 MB     00:31
    运行事务检查
    事务检查成功。
    运行事务测试
    事务测试成功。
    运行事务
    准备中  :                                                                                                                                                                                             1/1
    安装    : clickhouse-common-static-21.7.3.14-2.x86_64                                                                                                                                                 1/3
    安装    : clickhouse-client-21.7.3.14-2.noarch                                                                                                                                                        2/3
    安装    : clickhouse-server-21.7.3.14-2.noarch                                                                                                                                                        3/3
    运行脚本: clickhouse-server-21.7.3.14-2.noarch                                                                                                                                                        3/3
    ClickHouse binary is already located at /usr/bin/clickhouse
    Symlink /usr/bin/clickhouse-server already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
    Creating symlink /usr/bin/clickhouse-server to /usr/bin/clickhouse.
    Symlink /usr/bin/clickhouse-client already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
    Creating symlink /usr/bin/clickhouse-client to /usr/bin/clickhouse.
    Symlink /usr/bin/clickhouse-local already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
    Creating symlink /usr/bin/clickhouse-local to /usr/bin/clickhouse.
    Symlink /usr/bin/clickhouse-benchmark already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
    Creating symlink /usr/bin/clickhouse-benchmark to /usr/bin/clickhouse.
    Symlink /usr/bin/clickhouse-copier already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
    Creating symlink /usr/bin/clickhouse-copier to /usr/bin/clickhouse.
    Symlink /usr/bin/clickhouse-obfuscator already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
    Creating symlink /usr/bin/clickhouse-obfuscator to /usr/bin/clickhouse.
    Creating symlink /usr/bin/clickhouse-git-import to /usr/bin/clickhouse.
    Symlink /usr/bin/clickhouse-compressor already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
    Creating symlink /usr/bin/clickhouse-compressor to /usr/bin/clickhouse.
    Symlink /usr/bin/clickhouse-format already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
    Creating symlink /usr/bin/clickhouse-format to /usr/bin/clickhouse.
    Symlink /usr/bin/clickhouse-extract-from-config already exists but it points to /clickhouse. Will replace the old symlink to /usr/bin/clickhouse.
    Creating symlink /usr/bin/clickhouse-extract-from-config to /usr/bin/clickhouse.
    Creating clickhouse group if it does not exist.
    groupadd -r clickhouse
    Creating clickhouse user if it does not exist.
    useradd -r --shell /bin/false --home-dir /nonexistent -g clickhouse clickhouse
    Will set ulimits for clickhouse user in /etc/security/limits.d/clickhouse.conf.
    Creating config directory /etc/clickhouse-server/config.d that is used for tweaks of main server configuration.
    Creating config directory /etc/clickhouse-server/users.d that is used for tweaks of users configuration.
    Config file /etc/clickhouse-server/config.xml already exists, will keep it and extract path info from it.
    /etc/clickhouse-server/config.xml has /var/lib/clickhouse/ as data path.
    /etc/clickhouse-server/config.xml has /var/log/clickhouse-server/ as log path.
    Users config file /etc/clickhouse-server/users.xml already exists, will keep it and extract users info from it.
    chown --recursive clickhouse:clickhouse '/etc/clickhouse-server'
    Creating log directory /var/log/clickhouse-server/.
    Creating data directory /var/lib/clickhouse/.
    Creating pid directory /var/run/clickhouse-server.
    chown --recursive clickhouse:clickhouse '/var/log/clickhouse-server/'
    chown --recursive clickhouse:clickhouse '/var/run/clickhouse-server'
    chown clickhouse:clickhouse '/var/lib/clickhouse/'
    groupadd -r clickhouse-bridge
    useradd -r --shell /bin/false --home-dir /nonexistent -g clickhouse-bridge clickhouse-bridge
    chown --recursive clickhouse-bridge:clickhouse-bridge '/usr/bin/clickhouse-odbc-bridge'
    chown --recursive clickhouse-bridge:clickhouse-bridge '/usr/bin/clickhouse-library-bridge'
    Password for default user is empty string. See /etc/clickhouse-server/users.xml and /etc/clickhouse-server/users.d to change it.
    Setting capabilities for clickhouse binary. This is optional.
    
    ClickHouse has been successfully installed.
    
    Start clickhouse-server with:
    sudo clickhouse start
    
    Start clickhouse-client with:
    clickhouse-client
    
    Synchronizing state of clickhouse-server.service with SysV service script with /usr/lib/systemd/systemd-sysv-install.
    Executing: /usr/lib/systemd/systemd-sysv-install enable clickhouse-server
    Created symlink /etc/systemd/system/multi-user.target.wants/clickhouse-server.service → /etc/systemd/system/clickhouse-server.service.
    
    验证    : clickhouse-client-21.7.3.14-2.noarch                                                                                                                                                        1/3
    验证    : clickhouse-common-static-21.7.3.14-2.x86_64                                                                                                                                                 2/3
    验证    : clickhouse-server-21.7.3.14-2.noarch                                                                                                                                                        3/3
    
    已安装:
    clickhouse-client-21.7.3.14-2.noarch                             clickhouse-common-static-21.7.3.14-2.x86_64                             clickhouse-server-21.7.3.14-2.noarch
    
    完毕!
    
    # 查看版本
    clickhouse-server --version
    ClickHouse server version 21.7.3.14 (official build).
    # 启动服务
    systemctl start clickhouse-server.service
    # 停止服务
    systemctl restart clickhouse-server.service
    # 停止服务
    systemctl stop clickhouse-server.service
    # 查看状态
    systemctl status clickhouse-server.service
    
    ```
    [root@ clickhouse-server]# systemctl status clickhouse-server.service
    ● clickhouse-server.service - ClickHouse Server (analytic DBMS for big data)
    Loaded: loaded (/etc/systemd/system/clickhouse-server.service; enabled; vendor preset: disabled)
    Active: active (running) since Thu 2021-07-15 17:28:31 CST; 1min 11s ago
    Main PID: 508117 (clckhouse-watch)
    Tasks: 156 (limit: 49524)
    Memory: 127.5M
    CGroup: /system.slice/clickhouse-server.service
    ├─508117 clickhouse-watchdog --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid
    └─508118 /usr/bin/clickhouse-server --config=/etc/clickhouse-server/config.xml --pid-file=/run/clickhouse-server/clickhouse-server.pid
    
    7月 15 17:28:31  systemd[1]: Started ClickHouse Server (analytic DBMS for big data).
    7月 15 17:28:31  clickhouse-server[508117]: Processing configuration file '/etc/clickhouse-server/config.xml'.
    7月 15 17:28:31  clickhouse-server[508117]: Logging trace to /var/log/clickhouse-server/clickhouse-server.log
    7月 15 17:28:31  clickhouse-server[508117]: Logging errors to /var/log/clickhouse-server/clickhouse-server.err.log
    7月 15 17:28:32  clickhouse-server[508117]: Processing configuration file '/etc/clickhouse-server/config.xml'.
    7月 15 17:28:32  clickhouse-server[508117]: Saved preprocessed configuration to '/var/lib/clickhouse/preprocessed_configs/config.xml'.
    7月 15 17:28:32  clickhouse-server[508117]: Processing configuration file '/etc/clickhouse-server/users.xml'.
    7月 15 17:28:32  clickhouse-server[508117]: Saved preprocessed configuration to '/var/lib/clickhouse/preprocessed_configs/users.xml'.
    ```
    # 目录
    ```
    (1)/etc/clickhouse-server:服务端的配置文件目录,包括全局配置 config.xml 和用户配置 users.xml 等。
    (2)/var/lib/clickhouse:默认的数据存储目录(通常会修改默认路径配置,将数据保存到大容量磁盘挂载的路径)。
    (3)/var/log/clickhouse-server:默认保存日志的目录(通常会修改路径配置,将日志保存到大容量磁盘挂载的路径)。
    
    Before going further, please notice the <path> element in config.xml. Path determines the location for data storage, so it should be located on volume with large disk capacity; the default value is /var/lib/clickhouse/.
    ```
    # 配置远程访问权限
    chmod 777 config.xml users.xml
    <!-- 如果禁用了ipv6,使用下面配置-->
    <listen_host>0.0.0.0</listen_host>
    <!-- 如果没有禁用ipv6,使用下面配置
    <listen_host>::</listen_host>
    -->
    ```
    1.纯文本:
    <password>password</password>
    2.sha256:
    <password_sha256_hex>password</password_sha256_hex>
    从shell生成密码的示例:
    PASSWORD=$(base64 < /dev/urandom | head -c8); echo "$PASSWORD"; echo -n "$PASSWORD" | sha256sum | tr -d '-'
    第一行明文,第二行sha256
    3.sha1:
    <password_double_sha1_hex>password</password_double_sha1_hex>
    从shell生成密码的示例:
    PASSWORD=$(base64 < /dev/urandom | head -c8); echo "$PASSWORD"; echo -n "$PASSWORD" | sha1sum | tr -d '-' | xxd -r -p | sha1sum | tr -d '-'
    第一行明文,第二行sha1
    ```
    # 生成密码
    [root@ clickhouse-server]# PASSWORD=$(base64 < /dev/urandom | head -c8); echo "$PASSWORD"; echo -n "$PASSWORD" | sha1sum | tr -d '-' | xxd -r -p | sha1sum | tr -d '-'
    JUxG8EwNS
    b1854b52ed2577d50a23591d0d8fda3420b4d2af
    # 配置
    <password_double_sha1_hex>b1854b52ed2577d50a23591d0d8fda3420b4d2af</password_double_sha1_hex>
    # 客户端登陆(http端口是8123,tcp端口是9000)
    clickhouse-client -h 127.0.0.1 --port 9000 -m -u default --password JUxG8EwNS
    # 导入数据(航班数据为样例测试)
    curl -O https://datasets.clickhouse.tech/ontime/partitions/ontime.tar
    # 启动20个线程下载并显示进度条
    axel -n 10 -a https://datasets.clickhouse.tech/ontime/partitions/ontime.tar
    tar xvf ontime.tar -C /var/lib/clickhouse
    # 查看目录文件大小
    du -h --max-depth=1
    # 重启服务
    sudo systemctl restart clickhouse-server.service
    # 查看总条数(1839537321.8亿+)
    clickhouse-client -h 127.0.0.1 --port 9000 -m -u default --password JUxG8EwNS --query "select count(*) from datasets.ontime"
    # 建测试表
    CREATE TABLE emp_replacingmergetree (
      emp_id UInt16 COMMENT '员工id',
      name String COMMENT '员工姓名',
      work_place String COMMENT '工作地点',
      age UInt8 COMMENT '员工年龄',
      depart String COMMENT '部门',
      salary Decimal32(2) COMMENT '薪资'
      )ENGINE=ReplacingMergeTree()
      ORDER BY emp_id
      PRIMARY KEY emp_id
      PARTITION BY work_place
      ;
     -- 插入数据 
    INSERT INTO emp_replacingmergetree
    VALUES (1,'tom','上海',25,'技术部',20000),(2,'jack','上海',26,'人事部',10000);
    INSERT INTO emp_replacingmergetree
    VALUES (3,'bob','北京',33,'财务部',50000),(4,'tony','杭州',28,'销售事部',50000);
    # 分区表
    按时间分区:
    toYYYYMM(EventDate):按月分区
    toMonday(EventDate):按周分区
    toDate(EventDate):按天分区
    按指定列分区:
    PARTITION BY cloumn_name

    示意图

    MaterializeMySQL 与 MaterializedPostgreSQL 引擎

    https://clickhouse.tech/docs/en/engines/database-engines/materialize-mysql/(这个特性是实验性的)
    https://clickhouse.tech/docs/en/engines/database-engines/materialized-postgresql/(官方文档未说是实验性的,还未测试)

    备注:如果对实时性要求比较高使用 CDC(Change Data Capture) 或许也是一个比较好的方案, MySQL 需要开启 binlog 支持,PostgreSQL 需要开启 Write-Ahead Logging (WAL) 支持;debezium 或 flink cdc connector (暂未支持 Clickhouse)也是不错的备选方案。

    REFER:
    https://docs.streamsets.com/portal/#datacollector/latest/help/index.html
    http://streamsets.vip/
    https://clickhouse.tech/docs/en/
    https://streamsets.com/products/dataops-platform/data-collector/
    https://accounts.streamsets.com/archives(全组件 6GB+ )
    https://github.com/ververica/flink-cdc-connectors
    https://github.com/Altinity/clickhouse-mysql-data-reader
    https://github.com/debezium/debezium
    https://juejin.cn/post/6860480798294966286
    https://www.cnblogs.com/uestc2007/p/13704912.html

  • 相关阅读:
    [svc]linux启动过程及级别
    [svc]linux紧急情况处理
    [100]shell中exec解析
    [100]第一波命令及总结
    [100]find&xargs命令
    [svc]nginx优化
    hbase总结:如何监控region的性能
    hbase集群 常用维护命令
    navicat 导入sql文件乱码问题解决
    ue标签不见了,如何解决?
  • 原文地址:https://www.cnblogs.com/Irving/p/15023319.html
Copyright © 2011-2022 走看看