  • Hive on Spark Environment Setup

    Building Spark from Source and Environment Setup

    Note that you must have a version of Spark which does not include the Hive jars.

    Building Spark:

    git clone https://github.com/apache/spark.git spark_src
    cd spark_src
    export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
    ./make-distribution.sh --name "spark-without-hive" --tgz -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.1 -Pyarn -DskipTests package
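    If the build succeeds, the --tgz flag leaves a tarball named after the --name flag in the source root. A minimal sketch of unpacking it into the installation path used later in this walkthrough (the exact file name may differ, e.g. with a -SNAPSHOT suffix, depending on the branch you built):

    mkdir -p /home/spark/app
    tar -zxvf spark-1.3.0-bin-spark-without-hive.tgz -C /home/spark/app/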

    Spark deployment: see the Spark environment setup chapter.

    Building Hive from Source and Environment Setup

    Building Hive:

    git clone https://github.com/apache/hive.git hive_on_spark
    cd hive_on_spark
    git checkout spark
    mvn clean install -Phadoop-2,dist -DskipTests

    After the build completes, the Hive distribution tarball is at packaging/target/apache-hive-1.2.0-SNAPSHOT-bin.tar.gz under the source root.
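    A minimal sketch of unpacking it into the installation path used below:

    tar -zxvf packaging/target/apache-hive-1.2.0-SNAPSHOT-bin.tar.gz -C /home/spark/app/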

    Note that spark.version in pom.xml must match the version of Spark you built:

    <spark.version>1.3.0</spark.version>
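    If you would rather not edit pom.xml, Maven lets you override a POM property from the command line; a sketch reusing the build command above (1.3.0 is the Spark version used in this walkthrough):

    mvn clean install -Phadoop-2,dist -DskipTests -Dspark.version=1.3.0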

    Hive installation: see the Hive environment setup chapter.

    In this walkthrough, Spark and Hive are installed at the following paths:

    Spark installation directory: /home/spark/app/spark-1.3.0-bin-spark-without-hive

    Hive installation directory: /home/spark/app/apache-hive-1.2.0-SNAPSHOT-bin

    Ways to add the Spark dependency to Hive

    Option 1: Set the property 'spark.home' to point to the Spark installation:

    hive> set spark.home=/home/spark/app/spark-1.3.0-bin-spark-without-hive;

    Option 2: Define the SPARK_HOME environment variable before starting the Hive CLI/HiveServer2:

    export SPARK_HOME=/home/spark/app/spark-1.3.0-bin-spark-without-hive

    Option 3: Set the spark-assembly jar on the Hive auxpath:

    hive --auxpath /home/spark/app/spark-1.3.0-bin-spark-without-hive/lib/spark-assembly-*.jar

    Option 4: Add the spark-assembly jar for the current user session:

    hive> add jar /home/spark/app/spark-1.3.0-bin-spark-without-hive/lib/spark-assembly-*.jar;

    Option 5: Link the spark-assembly jar into $HIVE_HOME/lib, as sketched below.
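    A minimal sketch of Option 5, assuming the installation directories listed above and exactly one assembly jar matching the glob:

    ln -s /home/spark/app/spark-1.3.0-bin-spark-without-hive/lib/spark-assembly-*.jar /home/spark/app/apache-hive-1.2.0-SNAPSHOT-bin/lib/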

    An error that may appear while starting Hive:

    [ERROR] Terminal initialization failed; falling back to unsupported
    java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
            at jline.TerminalFactory.create(TerminalFactory.java:101)
            at jline.TerminalFactory.get(TerminalFactory.java:158)
            at jline.console.ConsoleReader.<init>(ConsoleReader.java:229)
            at jline.console.ConsoleReader.<init>(ConsoleReader.java:221)
            at jline.console.ConsoleReader.<init>(ConsoleReader.java:209)
            at org.apache.hadoop.hive.cli.CliDriver.getConsoleReader(CliDriver.java:773)
            at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:715)
            at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
            at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:606)
            at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
    
    Exception in thread "main" java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected

    Fix: export HADOOP_USER_CLASSPATH_FIRST=true (the error comes from an older jline jar in the Hadoop installation shadowing the jline 2.x that Hive ships; this setting makes Hive's classpath take precedence).
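    To confirm the conflict before applying the fix, you can look for the competing jars; a hedged sketch (it assumes HADOOP_HOME and HIVE_HOME are set, and the exact jar locations depend on your Hadoop layout):

    find $HADOOP_HOME -name 'jline-*.jar'    # typically an old jline 0.9.x somewhere under the Hadoop libs
    ls $HIVE_HOME/lib/jline-*.jar            # Hive 1.x ships jline 2.x
    export HADOOP_USER_CLASSPATH_FIRST=true  # let Hive's newer jline take precedence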

    For fixes to errors in other scenarios, see: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

    Another pitfall: the spark.eventLog.dir parameter must be set, for example:

    set spark.eventLog.dir=hdfs://hadoop000:8020/directory;

    Otherwise queries keep failing with an error that a directory like /tmp/spark-events does not exist.
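    The target directory must already exist on HDFS; a minimal sketch of creating it, reusing the hadoop000:8020 NameNode address from the example above:

    hadoop fs -mkdir -p hdfs://hadoop000:8020/directory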

    After starting Hive, set the execution engine to Spark:

    hive> set hive.execution.engine=spark;

    Set Spark's run mode (master URL):

    hive> set spark.master=spark://hadoop000:7077;

    or, for YARN mode: spark.master=yarn
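    Putting the session-level settings together, a typical interactive session might look like this (a sketch; the hostnames and the page_views table are the ones used elsewhere in this walkthrough):

    hive> set hive.execution.engine=spark;
    hive> set spark.master=spark://hadoop000:7077;
    hive> set spark.eventLog.enabled=true;
    hive> set spark.eventLog.dir=hdfs://hadoop000:8020/directory;
    hive> select count(*) from page_views;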

    Configure Spark application configs for Hive

    These settings can be placed in spark-defaults.conf or in hive-site.xml:

    spark.master=<Spark Master URL>
    spark.eventLog.enabled=true
    spark.serializer=org.apache.spark.serializer.KryoSerializer
    spark.executor.memory=...  # Amount of memory to use per executor process (e.g. 512m).
    spark.executor.cores=...  # Number of cores per executor.
    spark.yarn.executor.memoryOverhead=...
    spark.executor.instances=...  # The number of executors assigned to each application.
    spark.driver.memory=...  # The amount of memory assigned to the Remote Spark Context (RSC). We recommend 4GB.
    spark.yarn.driver.memoryOverhead=...  # We recommend 400 (MB).
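    When placed in hive-site.xml instead, each entry takes the standard Hadoop property form; a sketch for two of the settings above, using the values from this walkthrough:

    <property>
      <name>spark.master</name>
      <value>spark://hadoop000:7077</value>
    </property>
    <property>
      <name>spark.eventLog.enabled</name>
      <value>true</value>
    </property>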

    For details on these parameters, see: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

    After executing a SQL statement, you can watch jobs, stages, and other details on the Spark monitoring UI (for a standalone master, typically the web UI on port 8080):

    hive (default)> select city_id, count(*) c from page_views group by city_id order by c desc limit 5;
    Query ID = spark_20150309173838_444cb5b1-b72e-4fc3-87db-4162e364cb1e
    Total jobs = 1
    Launching Job 1 out of 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    state = SENT
    state = STARTED
    state = STARTED
    state = STARTED
    state = STARTED
    Query Hive on Spark job[0] stages:
    0
    1
    2
    Status: Running (Hive on Spark job[0])
    Job Progress Format
    CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
    2015-03-09 17:38:11,822 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1  Stage-2_0: 0/1
    state = STARTED
    state = STARTED
    state = STARTED
    2015-03-09 17:38:14,845 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1  Stage-2_0: 0/1
    state = STARTED
    state = STARTED
    2015-03-09 17:38:16,861 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1      Stage-2_0: 0/1
    state = SUCCEEDED
    2015-03-09 17:38:17,867 Stage-0_0: 1/1 Finished Stage-1_0: 1/1 Finished Stage-2_0: 1/1 Finished
    Status: Finished successfully in 10.07 seconds
    OK
    city_id c
    -1000   22826
    -10     17294
    -20     10608
    -1      6186
    237     4158
    Time taken: 18.417 seconds, Fetched: 5 row(s)

     
