Spark uses Hadoop client libraries for HDFS and YARN. Starting in version 1.4, the project publishes "Hadoop free" builds that let you more easily connect a single Spark binary to any Hadoop version. To use these builds, you need to modify SPARK_DIST_CLASSPATH to include Hadoop's package jars. The most convenient place to do this is by adding an entry in conf/spark-env.sh.
This page describes how to connect Spark to Hadoop for different types of distributions.
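Whatever the distribution, the entry in conf/spark-env.sh has the same general shape. As a minimal sketch for a machine where no hadoop launcher is on the PATH, you can list Hadoop's jar directories explicitly; the /opt/hadoop install location and the directory list below are assumptions based on the standard Apache Hadoop tarball layout, so adjust them to your install:

### in conf/spark-env.sh ###

# Hypothetical install location of a standard Apache Hadoop tarball
HADOOP_HOME=/opt/hadoop

# Build the classpath from Hadoop's config and jar directories
# (directory wildcards like dir/* are expanded by the JVM classpath)
export SPARK_DIST_CLASSPATH="$HADOOP_HOME/etc/hadoop:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/common/lib/*"
export SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*"
export SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/yarn/lib/*"
export SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:$HADOOP_HOME/share/hadoop/mapreduce/*"

The per-distribution commands in the sections below are more convenient where they apply, since they delegate this bookkeeping to the distribution itself.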
Apache Hadoop
For Apache distributions, you can use Hadoop’s ‘classpath’ command. For instance:
### in conf/spark-env.sh ###

# If 'hadoop' binary is on your PATH
export SPARK_DIST_CLASSPATH=$(hadoop classpath)

# With explicit path to 'hadoop' binary
export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)

# Passing a Hadoop configuration directory
export SPARK_DIST_CLASSPATH=$(hadoop --config /path/to/configs classpath)
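To sanity-check the result before launching anything, you can source the file and inspect the variable; the session below is illustrative rather than part of the official setup. With a Hadoop-free build, starting Spark without SPARK_DIST_CLASSPATH set typically fails immediately with a NoClassDefFoundError for Hadoop or logging classes, so a quick check saves a confusing error:

# Illustrative sanity check, run from the Spark installation directory
source conf/spark-env.sh
echo "$SPARK_DIST_CLASSPATH"   # should print Hadoop's config and jar directories

# Spark's launch scripts source conf/spark-env.sh automatically,
# so spark-shell now starts with the Hadoop jars on its classpath
./bin/spark-shell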