  • [Original] Big Data Basics: Hive (5) Hive on Spark

    hive 2.3.4 on spark 2.4.0

    Hive on Spark provides Hive with the ability to utilize Apache Spark as its execution engine.

    set hive.execution.engine=spark;

    1 version

    Hive on Spark is only tested with a specific version of Spark, so a given version of Hive is only guaranteed to work with that specific version of Spark. Other versions of Spark may work with a given version of Hive, but that is not guaranteed. The Hive wiki (linked in the references below) maintains the list of Hive versions and their corresponding compatible Spark versions.

    These version pairings are the ones that have been tested; other combinations may work, but need to be verified.

    2 yarn

    Instead of the Capacity Scheduler, the Fair Scheduler is required; it distributes an equal share of resources fairly among jobs in the YARN cluster.

    yarn-site.xml

        <property>
            <name>yarn.resourcemanager.scheduler.class</name>
            <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
        </property>
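    With the Fair Scheduler enabled, YARN reads its queue setup from an allocation file, by default fair-scheduler.xml (the path can be changed via yarn.scheduler.fair.allocation.file). A minimal allocation file might look like the sketch below; the queue name and weight are illustrative, not required by Hive on Spark:

```xml
<?xml version="1.0"?>
<allocations>
  <!-- A single queue; jobs submitted to it share cluster resources equally. -->
  <queue name="default">
    <weight>1.0</weight>
    <schedulingPolicy>fair</schedulingPolicy>
  </queue>
</allocations>
```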

    3 spark

    $ export SPARK_HOME=...

    Note that you must have a version of Spark which does not include the Hive jars, i.e. one which was not built with the Hive profile. If you will use Parquet tables, it is recommended to also enable the "parquet-provided" profile; otherwise there could be conflicts in the Parquet dependency.

    Do not point SPARK_HOME at an existing standard Spark installation: its bundled Hive and Parquet dependencies both easily cause problems.

    4 library

    $ ln -s $SPARK_HOME/jars/scala-library-2.11.8.jar $HIVE_HOME/lib/scala-library-2.11.8.jar
    $ ln -s $SPARK_HOME/jars/spark-core_2.11-2.0.2.jar $HIVE_HOME/lib/spark-core_2.11-2.0.2.jar
    $ ln -s $SPARK_HOME/jars/spark-network-common_2.11-2.0.2.jar $HIVE_HOME/lib/spark-network-common_2.11-2.0.2.jar

    Prior to Hive 2.2.0, link the spark-assembly jar to $HIVE_HOME/lib instead.

    Spark versions before 2.0 shipped a single spark-assembly.jar; link that jar directly into $HIVE_HOME/lib.
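    The three links above can also be scripted so the jar versions are not hard-coded. A minimal sketch; it builds a throwaway temp directory with empty placeholder jars so it runs anywhere, and in practice the two variables would point at the real $SPARK_HOME and $HIVE_HOME:

```shell
#!/bin/sh
# Demo: link the Spark jars Hive needs into Hive's lib directory
# without hard-coding jar versions. Placeholder dirs and empty jar
# files stand in for a real $SPARK_HOME and $HIVE_HOME.
tmp=$(mktemp -d)
SPARK_HOME="$tmp/spark"
HIVE_HOME="$tmp/hive"
mkdir -p "$SPARK_HOME/jars" "$HIVE_HOME/lib"
touch "$SPARK_HOME/jars/scala-library-2.11.8.jar" \
      "$SPARK_HOME/jars/spark-core_2.11-2.0.2.jar" \
      "$SPARK_HOME/jars/spark-network-common_2.11-2.0.2.jar"

# Link by name prefix so a version bump does not break the script.
for prefix in scala-library spark-core spark-network-common; do
  for jar in "$SPARK_HOME"/jars/"$prefix"*.jar; do
    ln -sfn "$jar" "$HIVE_HOME/lib/$(basename "$jar")"
  done
done
ls -l "$HIVE_HOME/lib"
```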

    5 hive

    $ hive
    hive> set hive.execution.engine=spark;

    The default is spark.master=yarn. More configuration options:

    set spark.master=<Spark Master URL>;
    set spark.eventLog.enabled=true;
    set spark.eventLog.dir=<Spark event log folder (must exist)>;
    set spark.executor.memory=512m;
    set spark.executor.instances=10;
    set spark.executor.cores=1;
    set spark.serializer=org.apache.spark.serializer.KryoSerializer;

    These properties can be set interactively like any other Hive configuration, placed in hive-site.xml, or put in a spark-defaults.conf file under HIVE_CONF_DIR.

    This can be done either by adding a file "spark-defaults.conf" with these properties to the Hive classpath, or by setting them on Hive configuration (hive-site.xml).
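    Expressed as a spark-defaults.conf fragment, the same settings might look like this; the event-log path is an example (that directory must already exist), and the master URL is left as the yarn default:

```
spark.master                yarn
spark.eventLog.enabled      true
spark.eventLog.dir          hdfs:///tmp/spark-events
spark.executor.memory       512m
spark.executor.instances    10
spark.executor.cores        1
spark.serializer            org.apache.spark.serializer.KryoSerializer
```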

    6 troubleshooting

    Running a SQL statement in Hive fails with:

    FAILED: SemanticException Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client

    The Hive execution log is at /tmp/$user/hive.log

    Detailed error log:

    2019-03-05 11:06:43 ERROR ApplicationMaster:91 - User class threw exception: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
    java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
    at org.apache.hive.spark.client.rpc.RpcConfiguration.<clinit>(RpcConfiguration.java:47)
    at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
    at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:678)

    This happens because the Spark build bundles Hive dependencies. Try a build without Hive:

    https://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.4-without-hive.tgz
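    Before installing a tarball, you can check whether it actually excludes the Hive jars by listing its contents. A self-contained sketch; it builds a tiny throwaway tarball so it runs anywhere, and in practice `tarball` would point at the real spark-*-without-hive.tgz:

```shell
#!/bin/sh
# Demo: list a Spark tarball's contents and grep for bundled Hive jars.
# A throwaway tarball with one placeholder jar stands in for the real
# spark-*-without-hive.tgz download.
tmp=$(mktemp -d)
mkdir -p "$tmp/spark/jars"
touch "$tmp/spark/jars/spark-core_2.11-2.0.2.jar"
tar -czf "$tmp/spark.tgz" -C "$tmp" spark
tarball="$tmp/spark.tgz"

if tar -tzf "$tarball" | grep -qi 'hive'; then
  echo "contains hive jars"
else
  echo "no hive jars"
fi
```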

    Running again now fails with a Parquet version conflict:

    Caused by: java.lang.NoSuchMethodError: org.apache.parquet.schema.Types$MessageTypeBuilder.addFields([Lorg/apache/parquet/schema/Type;)Lorg/apache/parquet/schema/Types$BaseGroupBuilder;

    The only remaining option is to build Spark from source.

    1) Spark 2.0 - 2.2

    ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"

    This produces spark-2.0.2-bin-hadoop2-without-hive.tgz.

    2) Spark 2.3 and later

    ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided"

    This produces spark-2.4.0-bin-hadoop2-without-hive.tgz.

    Running with spark-2.0.2-bin-hadoop2-without-hive.tgz still fails:

    2019-03-05T17:10:55,537 ERROR [901dc3cf-a990-4e8b-95ec-fcf6a9c9002c main] ql.Driver: FAILED: SemanticException Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.
    org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create spark client.

    Detailed error log:

    2019-03-05T17:08:37,364 INFO [stderr-redir-1] client.SparkClientImpl: Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream

    The hadoop-provided build is missing runtime jars; copy them in from a full Spark distribution (here spark-2.4.0-bin-hadoop2.6):

    $ cd spark-2.0.2-bin-hadoop2-without-hive
    $ cp ../spark-2.4.0-bin-hadoop2.6/jars/hadoop-* jars/
    $ cp ../spark-2.4.0-bin-hadoop2.6/jars/slf4j-* jars/
    $ cp ../spark-2.4.0-bin-hadoop2.6/jars/log4j-* jars/
    $ cp ../spark-2.4.0-bin-hadoop2.6/jars/guava-* jars/
    $ cp ../spark-2.4.0-bin-hadoop2.6/jars/commons-* jars/
    $ cp ../spark-2.4.0-bin-hadoop2.6/jars/protobuf-* jars/
    $ cp ../spark-2.4.0-bin-hadoop2.6/jars/htrace-* jars/

    This time it works. Sample SQL job output:

    Query ID = hadoop_20190305180847_e8b638c8-394c-496d-a43e-26a0a17f9e18
    Total jobs = 1
    Launching Job 1 out of 1
    In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
    set mapreduce.job.reduces=<number>
    Starting Spark Job = d5fea72c-c67c-49ec-9f4c-650a795c74c3
    Running with YARN Application = application_1551754784891_0008
    Kill Command = $HADOOP_HOME/bin/yarn application -kill application_1551754784891_0008

    Query Hive on Spark job[1] stages: [2, 3]

    Status: Running (Hive on Spark job[1])
    --------------------------------------------------------------------------------------
    STAGES ATTEMPT STATUS TOTAL COMPLETED RUNNING PENDING FAILED
    --------------------------------------------------------------------------------------
    Stage-2 ........ 0 FINISHED 275 275 0 0 0
    Stage-3 ........ 0 FINISHED 1009 1009 0 0 0
    --------------------------------------------------------------------------------------
    STAGES: 02/02 [==========================>>] 100% ELAPSED TIME: 149.58 s
    --------------------------------------------------------------------------------------
    Status: Finished successfully in 149.58 seconds
    OK

    spark-2.4.0-bin-hadoop2-without-hive.tgz works without problems as well.

    References:

    https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark

    https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

  • Original post: https://www.cnblogs.com/barneywill/p/10475122.html