Hive on Spark is a real pain to set up!!!
1. Software preparation
Maven 3.3.9
Spark 2.0.0
Hive 2.3.3
Hadoop 2.7.6
2. Download the Spark 2.0.0 source and build it
Download from: http://archive.apache.org/dist/spark/spark-2.0.0/
Build: ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"
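If the build fails with out-of-memory errors, the Spark build documentation suggests giving Maven more memory before retrying (an optional precaution, not part of the original steps):
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"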
3. Extract the built spark-2.0.0-bin-hadoop2-without-hive.tgz (tar -zxvf) into an install directory.
Set the $SPARK_HOME environment variable in /etc/profile and run . /etc/profile so it takes effect.
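For example (the /usr/share install path below just mirrors the other paths in this post; use your own):
tar -zxvf spark-2.0.0-bin-hadoop2-without-hive.tgz -C /usr/share/
# add to /etc/profile:
export SPARK_HOME=/usr/share/spark-2.0.0-bin-hadoop2-without-hive
export PATH=$PATH:$SPARK_HOME/bin
# then apply it:
. /etc/profile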
Next, configure Hive, Spark, and YARN.
1) Configure Hive
1. Copy the following jars from $SPARK_HOME/jars into Hive's lib directory (run the cp commands from inside $SPARK_HOME/jars; a quick check follows the list):
- cp scala-library-2.11.8.jar /usr/share/hive-2.3.3/lib/
- cp spark-core_2.11-2.0.0.jar /usr/share/hive-2.3.3/lib/
- cp spark-network-common_2.11-2.0.0.jar /usr/share/hive-2.3.3/lib/
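As noted above, a quick check that the three jars actually landed in Hive's lib:
ls /usr/share/hive-2.3.3/lib/ | grep -E 'spark|scala-library'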
2. Create a spark-defaults.conf file under Hive's conf directory. Note that spark-defaults.conf is a plain properties file (key=value, without the Hive-style `set ...;` syntax), and that hive.execution.engine is a Hive setting which belongs in hive-site.xml or a per-session `set`, not in this file:
spark.master=yarn
spark.submit.deployMode=client
spark.eventLog.enabled=true
spark.executor.memory=2g
spark.serializer=org.apache.spark.serializer.KryoSerializer
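Alternatively, as the Hive on Spark documentation also allows, the same Spark properties can be set per session from the Hive CLI or Beeline, which is handy for experimenting before committing them to the config file:
set spark.master=yarn;
set spark.executor.memory=2g;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;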
3. Modify hive-site.xml, adding the following.
Purpose: let YARN cache the Spark dependency jars on the NodeManager nodes, so they do not have to be redistributed every time an application runs.
Upload all jars in $SPARK_HOME/jars to an HDFS directory (for example: hdfs://bi/spark-jars/):
1) hdfs dfs -put ../jars /spark-jars    // upload the Spark dependency jars (here ../jars, i.e. $SPARK_HOME/jars) into the /spark-jars directory on HDFS
2) Modify hive-site.xml and add:
<property>
  <name>spark.yarn.jars</name>
  <value>hdfs://bi/spark-jars/*</value>
</property>
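To confirm the jars are where spark.yarn.jars points (hdfs://bi is presumably this cluster's HA nameservice; substitute your own URI):
hdfs dfs -ls hdfs://bi/spark-jars/ | head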
2) Configure Spark
cp spark-env.sh.template spark-env.sh
Configure spark-env.sh:
export SPARK_DIST_CLASSPATH=$(/usr/share/hadoop-HA/hadoop-2.7.6/bin/hadoop classpath)
export HADOOP_HOME=/usr/share/hadoop-HA/hadoop-2.7.6
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop/
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop/
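Before wiring Hive in, it may be worth confirming that this Spark build can run on YARN at all. A minimal smoke test, assuming the examples jar was kept in the distribution under examples/jars:
$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn --deploy-mode client \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.0.0.jar 10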
4. Test
Start the metastore: nohup hive --service metastore &
Start HiveServer2: nohup hive --service hiveserver2 &
Then, in a Hive or Beeline session, switch the execution engine to Spark:
set hive.execution.engine=spark;
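A minimal end-to-end check, assuming HiveServer2 is listening on its default port 10000 and that some table already exists (test_tbl below is only a placeholder); the count forces an actual Spark job onto YARN:
beeline -u jdbc:hive2://localhost:10000
set hive.execution.engine=spark;
select count(*) from test_tbl;   -- replace test_tbl with any existing table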