  • Apache Griffin Installation

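    Introduction

    1. How it works:

    Griffin loads its data sources from the Hive metadata.
    It converts the data quality rules specified by the user into Spark programs and uses Spark's compute capacity to check and analyze data quality.

    2. Program modules

    measure:
    The compute layer. It uses Spark to evaluate the user-defined data quality rules and is written in Scala.
    service:
    The service layer. It is the backend for the ui, handles scheduling, and submits Spark programs to Livy.
    ui:
    The presentation layer, built with Angular 2.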

    Installation

    I. Base cluster environment

    1.JDK (1.8 or later versions)

    2.PostgreSQL(version 10.4) or MySQL(version 8.0.11)

    3.Hadoop (2.6.0 or later)

    4.Hive (version 2.x). Installation guide: https://www.cnblogs.com/caoxb/p/11333741.html

    5.Spark (version 2.2.1). Installation guide: https://blog.csdn.net/k393393/article/details/92440892

    6.Livy. Installation guide: https://www.cnblogs.com/students/p/11400940.html

    7.ElasticSearch (5.0 or later versions). See https://blog.csdn.net/fiery_heart/article/details/85265585

    8.Scala

    II. Installing Griffin

    1. MySQL:

    1) Create a database named quartz in MySQL.

    2) Then run the Init_quartz_mysql_innodb.sql script to initialize the tables:

    # -p with no value prompts for the password; a space-separated value would be read as the database name
    mysql -u <username> -p quartz < Init_quartz_mysql_innodb.sql

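    2. Hadoop and Hive:

    Copy the configuration files from the Hadoop server to the Livy server. Here they are assumed to be placed in /usr/data/conf.

    On the Hadoop server, create the /home/spark_conf directory in HDFS and upload Hive's configuration file hive-site.xml to it:

    # Create the /home/spark_conf directory
    hadoop fs -mkdir -p /home/spark_conf
    # Upload hive-site.xml
    hadoop fs -put hive-site.xml /home/spark_conf/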
    

    3. Set environment variables:

    #!/bin/bash
    export JAVA_HOME=/data/jdk1.8.0_192
    
    # Spark directory
    export SPARK_HOME=/usr/data/spark-2.1.1-bin-2.6.3
    # Livy command directory
    export LIVY_HOME=/usr/data/livy/bin
    # Hadoop configuration directory
    export HADOOP_CONF_DIR=/usr/data/conf
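
    Before moving on, it can help to confirm that these variables are set and point at real directories. The following is a minimal sketch, not part of Griffin; the function name check_griffin_env is made up for illustration, and the variable list is just the four names exported above.

```shell
# Hypothetical helper: report which of the variables exported above are
# unset or do not point at an existing directory.
# Prints one line per problem and returns non-zero if any was found.
check_griffin_env() {
    _status=0
    for _name in JAVA_HOME SPARK_HOME LIVY_HOME HADOOP_CONF_DIR; do
        # Look up the value of the variable whose name is stored in _name.
        eval "_value=\${$_name:-}"
        if [ -z "$_value" ]; then
            echo "$_name is not set"
            _status=1
        elif [ ! -d "$_value" ]; then
            echo "$_name points to a missing directory: $_value"
            _status=1
        fi
    done
    return "$_status"
}
```

    A possible use is to source the environment script and then run `check_griffin_env || exit 1` at the top of any start-up script.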
    

    4. Livy configuration:

    Update the livy.conf configuration file under livy/conf:

    livy.server.host = 127.0.0.1
    livy.spark.master = yarn
    livy.spark.deployMode = cluster
    livy.repl.enable-hive-context = true
    

    Start Livy:

    livy-server start
    

    5. Elasticsearch configuration:

    Create the griffin index in ES:

    curl -H "Content-Type: application/json" -XPUT 'http://es:9200/griffin?include_type_name=true' -d '
    {
        "aliases": {},
        "mappings": {
            "accuracy": {
                "properties": {
                    "name": {
                        "fields": {
                            "keyword": {
                                "ignore_above": 256,
                                "type": "keyword"
                            }
                        },
                        "type": "text"
                    },
                    "tmst": {
                        "type": "date"
                    }
                }
            }
        },
        "settings": {
            "index": {
                "number_of_replicas": "2",
                "number_of_shards": "5"
            }
        }
    }'
    

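    Building and deploying from source

    Here I deploy Griffin by building and packaging it from source. The Griffin source is at https://github.com/apache/griffin.git, and I use the tag griffin-0.4.0.

    The Griffin source tree is clearly organized into four main modules: griffin-doc, measure, service, and ui. griffin-doc holds the Griffin documentation. measure interacts with Spark and runs the measurement jobs. service is implemented with Spring Boot; it provides the RESTful API that the ui module depends on, stores measurement jobs, and serves the measurement results.

    After importing and building the source, modify the following configuration files:

    1. service/src/main/resources/application.properties: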

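    # Apache Griffin application name
    spring.application.name=griffin_service
    # MySQL database configuration (the database named in the URL must be the one initialized with the Quartz script)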
    spring.datasource.url=jdbc:mysql://10.xxx.xx.xxx:3306/griffin_quartz?useSSL=false
    spring.datasource.username=xxxxx
    spring.datasource.password=xxxxx
    spring.jpa.generate-ddl=true
    spring.datasource.driver-class-name=com.mysql.jdbc.Driver
    spring.jpa.show-sql=true
    # Hive metastore configuration
    hive.metastore.uris=thrift://namenode.test01.xxx:9083
    hive.metastore.dbname=default
    hive.hmshandler.retry.attempts=15
    hive.hmshandler.retry.interval=2000ms
    # Hive cache time
    cache.evict.hive.fixedRate.in.milliseconds=900000
    # Kafka schema registry, configure as needed
    kafka.schema.registry.url=http://namenode.test01.xxx:8081
    # Update job instance state at regular intervals
    jobInstance.fixedDelay.in.milliseconds=60000
    # Expired time of job instance which is 7 days that is 604800000 milliseconds.Time unit only supports milliseconds
    jobInstance.expired.milliseconds=604800000
    # schedule predicate job every 5 minutes and repeat 12 times at most
    #interval time unit s:second m:minute h:hour d:day,only support these four units
    predicate.job.interval=5m
    predicate.job.repeat.count=12
    # external properties directory location
    external.config.location=
    # external BATCH or STREAMING env
    external.env.location=
    # login strategy ("default" or "ldap")
    login.strategy=default
    # ldap, configure when the login strategy is ldap
    ldap.url=ldap://hostname:port
    ldap.email=@example.com
    ldap.searchBase=DC=org,DC=example
    ldap.searchPattern=(sAMAccountName={0})
    # hdfs default name
    fs.defaultFS=
    # elasticsearch configuration
    elasticsearch.host=griffindq02-test1-rgtj1-tj1
    elasticsearch.port=9200
    elasticsearch.scheme=http
    # elasticsearch.user = user
    # elasticsearch.password = password
    # livy configuration
    livy.uri=http://10.104.xxx.xxx:8998/batches
    # yarn url configuration
    yarn.uri=http://10.104.xxx.xxx:8088
    # griffin event listener
    internal.event.listeners=GriffinJobEventHook
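
    The predicate.job.interval and predicate.job.repeat.count settings above combine into an overall retry window. The sketch below only illustrates the unit convention stated in the comments (s, m, h, d); it is not how Griffin parses the value internally, and the function name interval_to_seconds is made up for this example.

```shell
# Hypothetical helper: convert an interval such as 30s, 5m, 2h or 1d to seconds.
interval_to_seconds() {
    _num=${1%[smhd]}      # strip one trailing unit letter, if present
    _unit=${1#"$_num"}    # whatever follows the number is the unit
    case "$_num" in
        ''|*[!0-9]*) echo "invalid interval: $1" >&2; return 1 ;;
    esac
    case "$_unit" in
        s) echo "$_num" ;;
        m) echo $((_num * 60)) ;;
        h) echo $((_num * 3600)) ;;
        d) echo $((_num * 86400)) ;;
        *) echo "invalid interval: $1" >&2; return 1 ;;
    esac
}

# With predicate.job.interval=5m and predicate.job.repeat.count=12,
# twelve repeats spaced 300 seconds apart span roughly one hour.
interval=$(interval_to_seconds 5m)
echo "max predicate window: $((interval * 12)) seconds"
```

    Under these assumptions, a predicate job that never succeeds is given up after about an hour, so the interval and repeat count should be chosen to cover the expected delay of the source data.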
    

    2. service/src/main/resources/quartz.properties:

    #
    # Licensed to the Apache Software Foundation (ASF) under one
    # or more contributor license agreements.  See the NOTICE file
    # distributed with this work for additional information
    # regarding copyright ownership.  The ASF licenses this file
    # to you under the Apache License, Version 2.0 (the
    # "License"); you may not use this file except in compliance
    # with the License.  You may obtain a copy of the License at
    # 
    #   http://www.apache.org/licenses/LICENSE-2.0
    # 
    # Unless required by applicable law or agreed to in writing,
    # software distributed under the License is distributed on an
    # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    # KIND, either express or implied.  See the License for the
    # specific language governing permissions and limitations
    # under the License.
    #
    org.quartz.scheduler.instanceName=spring-boot-quartz
    org.quartz.scheduler.instanceId=AUTO
    org.quartz.threadPool.threadCount=5
    org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
    # If you use postgresql as your database,set this property value to org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
    # If you use mysql as your database,set this property value to org.quartz.impl.jdbcjobstore.StdJDBCDelegate
    # If you use h2 as your database, it's ok to set this property value to StdJDBCDelegate, PostgreSQLDelegate or others
    org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.StdJDBCDelegate
    org.quartz.jobStore.useProperties=true
    org.quartz.jobStore.misfireThreshold=60000
    org.quartz.jobStore.tablePrefix=QRTZ_
    org.quartz.jobStore.isClustered=true
    org.quartz.jobStore.clusterCheckinInterval=20000
    

    3. service/src/main/resources/sparkProperties.json:

    {
      "file": "hdfs:///griffin/griffin-measure.jar",
      "className": "org.apache.griffin.measure.Application",
      "name": "griffin",
      "queue": "default",
      "numExecutors": 2,
      "executorCores": 1,
      "driverMemory": "1g",
      "executorMemory": "1g",
      "conf": {
        "spark.yarn.dist.files": "hdfs:///home/spark_conf/hive-site.xml"
      },
      "files": [
      ]
    }
    

    4. service/src/main/resources/env/env_batch.json:

    {
      "spark": {
        "log.level": "INFO"
      },
      "sinks": [
        {
          "type": "CONSOLE",
          "config": {
            "max.log.lines": 10
          }
        },
        {
          "type": "HDFS",
          "config": {
            "path": "hdfs://namenodetest01.xx.xxxx.com:9001/griffin/persist",
            "max.persist.lines": 10000,
            "max.lines.per.file": 10000
          }
        },
        {
          "type": "ELASTICSEARCH",
          "config": {
            "method": "post",
            "api": "http://10.xxx.xxx.xxx:9200/griffin/accuracy",
            "connection.timeout": "1m",
            "retry": 10
          }
        }
      ],
      "griffin.checkpoint": []
    }
    

    Once the configuration files are updated, run the following Maven command in the IDEA terminal to compile and package:

    mvn -Dmaven.test.skip=true clean install
    

    When the command finishes, service-0.4.0.jar and measure-0.4.0.jar appear in the target directories of the service and measure modules respectively. Copy both jars to the server. They are used as follows:

    1. Rename the jars, then upload the measure jar to the /griffin directory in HDFS:

    # Rename the jars
    mv measure-0.4.0.jar griffin-measure.jar
    mv service-0.4.0.jar griffin-service.jar
    # Upload griffin-measure.jar to the HDFS directory
    hadoop fs -put griffin-measure.jar /griffin/

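    This is needed because Spark, when it runs the job on the YARN cluster, loads griffin-measure.jar from the /griffin directory in HDFS. Without it, the job fails because the class org.apache.griffin.measure.Application cannot be found.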
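
    The upload can be wrapped in a small function that also creates the target directory and lists it afterwards. This is a sketch under assumptions: the function name upload_measure_jar and the HADOOP_CMD override are made up for illustration, and HADOOP_CMD=echo is only a way to preview the commands without touching the cluster.

```shell
# Hypothetical helper: upload the renamed measure jar to /griffin in HDFS.
# HADOOP_CMD is an illustrative override; set it to "echo" for a dry run.
upload_measure_jar() {
    _jar=${1:-griffin-measure.jar}
    _hadoop=${HADOOP_CMD:-hadoop}
    if [ ! -f "$_jar" ]; then
        echo "missing $_jar" >&2
        return 1
    fi
    # Create the target directory if it does not exist yet.
    "$_hadoop" fs -mkdir -p /griffin || return 1
    # -f overwrites an older copy of the jar.
    "$_hadoop" fs -put -f "$_jar" /griffin/ || return 1
    # List the directory so the upload can be confirmed by eye.
    "$_hadoop" fs -ls /griffin/
}
```

    Running `HADOOP_CMD=echo; upload_measure_jar` in the directory that holds the jar prints the three hadoop commands instead of executing them.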

    2. Run the renamed service jar, griffin-service.jar, to start the Griffin admin backend:

    nohup java -jar griffin-service.jar > service.out 2>&1 &
    

    After a few seconds, the default Apache Griffin UI is reachable (Spring Boot listens on port 8080 by default).

    http://IP:8080
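
    Because the service needs a few seconds to come up, a deployment script may want to poll the UI address instead of sleeping for a fixed time. A minimal sketch, assuming curl is installed; the function name wait_for_http and its default retry values are made up for illustration.

```shell
# Hypothetical helper: poll a URL until it answers or the attempts run out.
# Usage: wait_for_http URL [tries] [delay_seconds]
wait_for_http() {
    _url=$1
    _tries=${2:-30}
    _delay=${3:-2}
    _i=0
    while [ "$_i" -lt "$_tries" ]; do
        # -s: silent, -o /dev/null: discard the body, --max-time: cap each attempt.
        # Any HTTP response counts as "up"; a refused connection does not.
        if curl -s -o /dev/null --max-time 5 "$_url"; then
            return 0
        fi
        _i=$((_i + 1))
        sleep "$_delay"
    done
    return 1
}
```

    For example, `wait_for_http http://IP:8080 && echo "Griffin UI is up"`, with IP replaced by the address of the server running the service jar.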

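    Measuring a Kafka data source with Apache Griffin

    http://griffin.apache.org/docs/usecases.html

    Streaming data checks cannot currently be configured in the UI; streaming data monitoring can be submitted through the API instead.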

  • Original article: https://www.cnblogs.com/mergy/p/12177037.html