  • [Hive Notes, Part 1] An Introduction to Hive

    Environment
      VM: VMware 10
      Linux: CentOS-6.5-x86_64
      SSH client: Xshell4
      FTP: Xftp4
      jdk8
      hadoop-3.1.1
      apache-hive-3.1.1

    I. Introduction

    Hive is a data-warehouse tool built on top of Hadoop. It maps structured data files to tables and offers SQL-like queries over them;
    Hive uses HQL as its query interface, HDFS for storage, and MapReduce for computation. In essence, Hive translates HQL into MapReduce, letting non-Java programmers run MapReduce jobs over data in HDFS.
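
    As a quick illustration of "mapping a structured file to a table" (a hedged sketch: the table name, columns, delimiter, and HDFS path below are hypothetical, not from the original post):

    hive> CREATE EXTERNAL TABLE page_views (
              user_id INT,
              url     STRING)
          ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
          LOCATION '/data/page_views';
    hive> SELECT url, COUNT(*) FROM page_views GROUP BY url;

    The aggregating SELECT is compiled into a MapReduce job, while a plain select * would simply read the files in /data/page_views.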

    Hive is a data warehouse. How it differs from a relational database:
    ① A database serves online applications; Hive is mainly for offline big-data analysis.

    ② A database is queried with SQL; Hive is queried with HQL.

    ③ A database stores its data on the local file system; Hive stores its data in HDFS.

    ④ Hive executes queries as MapReduce jobs; MySQL runs them in its own executor.

    ⑤ Hive has no indexes.

    ⑥ Hive has high query latency.

    ⑦ Hive is highly scalable.

    ⑧ Hive handles very large data volumes.

    II. Architecture
    (Figure: the Hive architecture)


    (1) There are three main user interfaces: the Hive command line (CLI), the most common mode; the Hive Client (e.g. via the Java API), served over RPC; and the Hive Web Interface (HWI), accessed from a browser.
    (2) At runtime Hive keeps its metadata in a relational database such as MySQL or Derby. The metadata covers table names, columns, partitions and their properties, table attributes (whether a table is external, etc.), and the HDFS directory holding each table's data.
    (3) The parser, compiler, and optimizer carry an HQL statement through lexical analysis, parsing, compilation, optimization, and query-plan generation. The generated plan is stored in HDFS and subsequently executed by MapReduce.
    (4) Hive's data lives in HDFS, and most queries and computations are performed by MapReduce (queries that merely select everything, such as select * from tbl, do not spawn a MapReduce job).
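
    A hedged sketch of the CLI from point (1); the statement and script name are illustrative only:

    # interactive shell, the most common mode
    [root@PCS102 bin]# hive
    # run a single statement and exit (handy in scripts)
    [root@PCS102 bin]# hive -e "select count(*) from test01;"
    # run a file of HQL statements (script.hql is a hypothetical file)
    [root@PCS102 bin]# hive -f script.hql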

    Operators

    The compiler translates a Hive SQL statement into operators;
    an operator is Hive's smallest unit of processing;
    each operator represents one HDFS operation or one MapReduce job.

    Hive uses the ANTLR lexer/parser toolkit to parse HQL.
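
    The operator pipeline the compiler produces can be inspected with EXPLAIN (a sketch; test01 is the table created in the setup section below, and the lengthy plan output is omitted here):

    hive> EXPLAIN SELECT id, COUNT(*) FROM test01 GROUP BY id;

    The printed plan is a tree of stages whose nodes are operators such as TableScan, Select Operator, Group By Operator, Reduce Output Operator, and File Output Operator.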

    III. Setup

    The modes below are distinguished by how the relational metastore database is accessed and managed.
    1. Local (embedded) mode: connects to an in-memory/embedded Derby database and is generally used only for unit tests. Because metadata kept in memory is hard to maintain, this mode is not recommended.
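
    For reference, a minimal hive-site.xml sketch for this embedded mode; these are the stock Derby driver and JDO property values, not settings from the original post:

    <!--embedded Derby metastore: only one active session at a time-->
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    </property>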


    2. Single-user mode: connects over the network to a single database; the most commonly used mode.

    Step 1: install MySQL
    See: "Setting up a Linux Java web environment, part 2: installing MySQL"

    Step 2: unpack apache-hive-3.1.1-bin.tar.gz and set the environment variables
    [root@PCS102 src]# tar -xf apache-hive-3.1.1-bin.tar.gz -C /usr/local
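
    The post does not show the variables themselves; a typical sketch (appending to /etc/profile is a common convention, not something the post specifies):

    export HIVE_HOME=/usr/local/apache-hive-3.1.1-bin
    export PATH=$PATH:$HIVE_HOME/bin

    Run source /etc/profile afterwards so the hive command is on the PATH.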

    Step 3: edit the configuration

    [root@PCS102 conf]# cd /usr/local/apache-hive-3.1.1-bin/conf && ll
    total 332
    -rw-r--r--. 1 root root 1596 Apr 4 2018 beeline-log4j2.properties.template
    -rw-r--r--. 1 root root 299970 Oct 24 08:19 hive-default.xml.template
    -rw-r--r--. 1 root root 2365 Apr 4 2018 hive-env.sh.template
    -rw-r--r--. 1 root root 2274 Apr 4 2018 hive-exec-log4j2.properties.template
    -rw-r--r--. 1 root root 3086 Oct 24 07:49 hive-log4j2.properties.template
    -rw-r--r--. 1 root root 2060 Apr 4 2018 ivysettings.xml
    -rw-r--r--. 1 root root 3558 Oct 24 07:49 llap-cli-log4j2.properties.template
    -rw-r--r--. 1 root root 7163 Oct 24 07:49 llap-daemon-log4j2.properties.template
    -rw-r--r--. 1 root root 2662 Apr 4 2018 parquet-logging.properties

    #copy the configuration template

    [root@PCS102 conf]# cp hive-default.xml.template hive-site.xml
    #edit hive-site.xml
    <!--directory on HDFS where Hive table data is stored-->
    <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>/root/hive_remote/warehouse</value>
    </property>
    <!--whether the metastore runs in local (embedded) mode-->
    <property>
      <name>hive.metastore.local</name>
      <value>false</value>
    </property>
    <!--JDBC URL of the MySQL database backing the metastore-->
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://PCS101/hive_remote?createDatabaseIfNotExist=true</value>
    </property>
    <!--JDBC driver class for the MySQL connection-->
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <!--MySQL user name-->
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>root</value>
    </property>
    <!--MySQL password-->
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>123456</value>
    </property>
    
    
    Tip: the vi command  :.,$-1d  deletes from the current line through the second-to-last line (useful for clearing the thousands of default properties in the copied template, leaving only the settings above).

    Step 4: copy the MySQL driver package mysql-connector-java-5.1.32-bin into
    /usr/local/apache-hive-3.1.1-bin/lib
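
    A sketch of the copy, assuming the jar was downloaded to /usr/local/src and carries the usual .jar suffix:

    [root@PCS102 src]# cp mysql-connector-java-5.1.32-bin.jar /usr/local/apache-hive-3.1.1-bin/lib/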

    Step 5: initialize the metastore database. The schema must be initialized before Hive is started for the first time:

    [root@PCS102 bin]# /usr/local/apache-hive-3.1.1-bin/bin/schematool -dbType mysql -initSchema

    Without initialization, statements fail with an error such as:

    hive> create table test01(id int,age int);
    FAILED: HiveException java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
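
    After a successful schematool run, the schema can be sanity-checked (a sketch; assumes a MySQL client on PCS102 and that root may connect remotely):

    [root@PCS102 bin]# mysql -h PCS101 -uroot -p123456 -e "show tables from hive_remote;"
    # expect metastore tables such as DBS, TBLS, COLUMNS_V2, PARTITIONS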

    Step 6: start Hive; by default this drops you into the command line:

    [root@PCS102 bin]# /usr/local/apache-hive-3.1.1-bin/bin/hive 
    which: no hbase in (/usr/local/jdk1.8.0_65/bin:/home/cluster/subversion-1.10.3/bin:/home/cluster/apache-storm-0.9.2/bin:/usr/local/hadoop-3.1.1/bin:/usr/local/hadoop-3.1.1/sbin:/usr/local/apache-hive-3.1.1-bin/bin:/usr/local/jdk1.7.0_80/bin:/home/cluster/subversion-1.10.3/bin:/home/cluster/apache-storm-0.9.2/bin:/usr/local/hadoop-3.1.1/bin:/usr/local/hadoop-3.1.1/sbin:/usr/local/jdk1.7.0_80/bin:/home/cluster/subversion-1.10.3/bin:/home/cluster/apache-storm-0.9.2/bin:/usr/local/hadoop-3.1.1/bin:/usr/local/hadoop-3.1.1/sbin:/usr/local/jdk1.7.0_80/bin:/home/cluster/subversion-1.10.3/bin:/home/cluster/apache-storm-0.9.2/bin:/usr/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/openssh/bin:/root/bin)
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/local/apache-hive-3.1.1-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/local/hadoop-3.1.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    Hive Session ID = e08b6258-e2eb-40af-ba98-87abcb2d1728
    
    Logging initialized using configuration in jar:file:/usr/local/apache-hive-3.1.1-bin/lib/hive-common-3.1.1.jar!/hive-log4j2.properties Async: true
    Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
    hive> create table test01(id int,age int);
    OK
    Time taken: 1.168 seconds
    hive> desc test01;
    OK
    id int 
    age int 
    Time taken: 0.181 seconds, Fetched: 2 row(s)
    hive> insert into test01 values(1,23);
    Query ID = root_20190125164516_aa852f47-a9b0-4c59-9043-efb557965a5b
    Total jobs = 3
    Launching Job 1 out of 3
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
    set mapreduce.job.reduces=<number>
    Starting Job = job_1548397153910_0001, Tracking URL = http://PCS102:8088/proxy/application_1548397153910_0001/
    Kill Command = /usr/local/hadoop-3.1.1/bin/mapred job -kill job_1548397153910_0001
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2019-01-25 16:45:26,923 Stage-1 map = 0%, reduce = 0%
    2019-01-25 16:45:32,107 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.96 sec
    2019-01-25 16:45:38,271 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8.01 sec
    MapReduce Total cumulative CPU time: 8 seconds 10 msec
    Ended Job = job_1548397153910_0001
    Stage-4 is selected by condition resolver.
    Stage-3 is filtered out by condition resolver.
    Stage-5 is filtered out by condition resolver.
    Moving data to directory hdfs://PCS102:9820/root/hive_remote/warehouse/test01/.hive-staging_hive_2019-01-25_16-45-16_011_1396999443961154869-1/-ext-10000
    Loading data to table default.test01
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 8.01 sec HDFS Read: 14187 HDFS Write: 236 SUCCESS
    Total MapReduce CPU Time Spent: 8 seconds 10 msec
    OK
    Time taken: 23.714 seconds
    hive>

    Check HDFS. The row (1, 23) prints as "123" because Hive's default field delimiter, the control character \001, is non-printing:

    [root@PCS102 bin]# hdfs dfs -cat /root/hive_remote/warehouse/test01/*
    123
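
    To make the delimiter visible (a sketch; cat -A renders the \001 control character as ^A and line ends as $):

    [root@PCS102 bin]# hdfs dfs -cat /root/hive_remote/warehouse/test01/* | cat -A
    1^A23$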

    (Screenshots omitted from the original post: the MapReduce job for the insert, and the MySQL metastore with its tables and their columns.)

    Step 7: exit
    hive>exit;
    or
    hive>quit;


    3. Remote-server (multi-user) mode: lets non-Java clients access the metastore database. A MetaStoreServer is started on the server side, and clients reach the metastore through it over the Thrift protocol.

    The server side must start the metastore service:

    [root@PCS102 conf]# nohup hive --service metastore &
    [root@PCS102 conf]# jps
    24657 RunJar
    29075 Jps
    18534 NameNode
    20743 NodeManager
    18712 DataNode
    23609 JobHistoryServer
    28842 RunJar
    20523 ResourceManager
    19020 SecondaryNameNode

    1. Server and client on the same node:
    PCS101: MySQL server
    PCS102: Hive server and client

    PCS102 configuration file: hive-site.xml

    <configuration>
      <!--directory on HDFS where Hive table data is stored-->
      <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/root/hive_remote/warehouse</value>
      </property>
      <!--whether the metastore runs in local (embedded) mode-->
      <property>
        <name>hive.metastore.local</name>
        <value>false</value>
      </property>
      <!--JDBC URL of the MySQL database backing the metastore-->
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://PCS101/hive_remote?createDatabaseIfNotExist=true</value>
      </property>
      <!--JDBC driver class for the MySQL connection-->
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
      </property>
      <!--MySQL user name-->
      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
      </property>
      <!--MySQL password-->
      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
      </property>
      <!--Thrift URI through which metastore clients connect-->
      <property>
        <name>hive.metastore.uris</name>
        <value>thrift://PCS102:9083</value>
      </property>
    </configuration>

    2. Server and client on different nodes (client and server both depend on Hadoop):
    PCS101: MySQL server
    PCS102: Hive server
    PCS103: Hive client

    PCS102 (Hive server) configuration file: hive-site.xml

    <configuration>
      <!--directory on HDFS where Hive table data is stored-->
      <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/root/hive_remote/warehouse</value>
      </property>
      <!--JDBC URL of the MySQL database backing the metastore-->
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://PCS101/hive_remote?createDatabaseIfNotExist=true</value>
      </property>
      <!--JDBC driver class for the MySQL connection-->
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
      </property>
      <!--MySQL user name-->
      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
      </property>
      <!--MySQL password-->
      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
      </property>
    </configuration>

    PCS103 (Hive client) configuration file: hive-site.xml

    <configuration>
      <!--directory on HDFS where Hive table data is stored-->
      <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/root/hive_remote/warehouse</value>
      </property>
      <!--whether the metastore runs in local (embedded) mode-->
      <property>
        <name>hive.metastore.local</name>
        <value>false</value>
      </property>
      <!--Thrift URI through which metastore clients connect-->
      <property>
        <name>hive.metastore.uris</name>
        <value>thrift://PCS102:9083</value>
      </property>
    </configuration>
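
    With this split the startup order matters; a sketch of the sequence, using the host names above:

    # on PCS102 (Hive server): start the metastore service first
    [root@PCS102 ~]# nohup hive --service metastore &
    # on PCS103 (Hive client): the CLI then reaches the metastore via thrift://PCS102:9083
    [root@PCS103 ~]# hive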

    Problem: the jline-0.9.94.jar under /usr/local/hadoop-2.6.5/share/hadoop/yarn/lib is too old and triggers the error below; copying the newer jline jar from Hive's lib directory into that Hadoop directory fixes it (see the sketch after the stack trace).

    [root@node101 bin]# schematool -dbType mysql -initSchema
    19/07/02 16:21:32 WARN conf.HiveConf: HiveConf of name hive.metastore.local does not exist
    Metastore connection URL:     jdbc:mysql://node102/hive_remote?createDatabaseIfNotExist=true
    Metastore Connection Driver :     com.mysql.jdbc.Driver
    Metastore connection User:     root
    Starting metastore schema initialization to 1.2.0
    Initialization script hive-schema-1.2.0.mysql.sql
    [ERROR] Terminal initialization failed; falling back to unsupported
    java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
        at jline.TerminalFactory.create(TerminalFactory.java:101)
        at jline.TerminalFactory.get(TerminalFactory.java:158)
        at org.apache.hive.beeline.BeeLineOpts.<init>(BeeLineOpts.java:74)
        at org.apache.hive.beeline.BeeLine.<init>(BeeLine.java:117)
        at org.apache.hive.beeline.HiveSchemaTool.runBeeLine(HiveSchemaTool.java:346)
        at org.apache.hive.beeline.HiveSchemaTool.runBeeLine(HiveSchemaTool.java:326)
        at org.apache.hive.beeline.HiveSchemaTool.doInit(HiveSchemaTool.java:266)
        at org.apache.hive.beeline.HiveSchemaTool.doInit(HiveSchemaTool.java:243)
        at org.apache.hive.beeline.HiveSchemaTool.main(HiveSchemaTool.java:473)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
    
    Exception in thread "main" java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
        at org.apache.hive.beeline.BeeLineOpts.<init>(BeeLineOpts.java:102)
        at org.apache.hive.beeline.BeeLine.<init>(BeeLine.java:117)
        at org.apache.hive.beeline.HiveSchemaTool.runBeeLine(HiveSchemaTool.java:346)
        at org.apache.hive.beeline.HiveSchemaTool.runBeeLine(HiveSchemaTool.java:326)
        at org.apache.hive.beeline.HiveSchemaTool.doInit(HiveSchemaTool.java:266)
        at org.apache.hive.beeline.HiveSchemaTool.doInit(HiveSchemaTool.java:243)
        at org.apache.hive.beeline.HiveSchemaTool.main(HiveSchemaTool.java:473)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
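
    A sketch of the fix (the exact jline version shipped with Hive varies, hence the glob; HIVE_HOME is assumed to point at the Hive install):

    # back up Hadoop's stale jline, then copy Hive's newer one into place
    [root@node101 ~]# mv /usr/local/hadoop-2.6.5/share/hadoop/yarn/lib/jline-0.9.94.jar{,.bak}
    [root@node101 ~]# cp $HIVE_HOME/lib/jline-*.jar /usr/local/hadoop-2.6.5/share/hadoop/yarn/lib/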