  • Compiling Spark 2

    0. Environment

    CentOS: 6.4
    Hadoop: 2.5.0-cdh5.3.6

    1. Why build Spark from source?

    Building the source should be the first step in learning Spark: it lays the groundwork for later modifying and debugging the code and for extending or integrating functional modules.

    2. Three ways to build the Spark source

    a. Maven build
    # export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
    # ${SPARK_HOME_SRC}/build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package

    b. SBT build
    # ${SPARK_HOME_SRC}/build/sbt -Pyarn -Phadoop-2.3 package

    c. Distribution build (produces a deployable tarball)
    # ${SPARK_HOME_SRC}/dev/make-distribution.sh --tgz -Psparkr -Dhadoop.version=2.5.0-cdh5.3.6 -Phadoop-2.4 -Phive -Phive-thriftserver -Pyarn

    3. Version requirements:

    Maven 3.3.9
    JDK 1.8+ (1.8.0_12)
    Scala 2.11.8
    Note: Starting with version 2.0, Spark is built with Scala 2.11 by default.
    R 3.2.0, downloaded with:
    wget http://mirrors.tuna.tsinghua.edu.cn/CRAN/src/base/R-3/R-3.2.0.tar.gz
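
    A quick sanity check that the toolchain is in place before proceeding (this assumes java, mvn, and scala are already on the PATH):

    # java -version
    # mvn -version     # also prints the Java home Maven resolved, relevant to the NOTE in step 5
    # scala -version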

    4. Build steps overview:

    0. Build as the root user, with working network access
    1. Set up the JDK
    2. Set up Maven
    3. Set up the R (3.2.0) language environment
    4. Run the build

    5. The JDK and Maven environments are both installed from tarballs

    Procedure: upload the tarball, extract it, configure the environment variables, and re-source the profile file.
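
    A minimal sketch of the profile entries, assuming the tarballs were extracted under /opt/modules (both paths are placeholders; adjust to the actual extract locations):

    # vi /etc/profile
    export JAVA_HOME=/opt/modules/jdk1.8.0_12
    export MAVEN_HOME=/opt/modules/apache-maven-3.3.9
    export PATH=$JAVA_HOME/bin:$MAVEN_HOME/bin:$PATH
    # source /etc/profile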
    NOTE:
    Check that Maven is picking up the intended Java installation (mvn -version prints the Java home it resolved).
    Configure the Aliyun mirror for Maven:
    edit ${MAVEN_HOME}/conf/settings.xml
    and add the mirror:
    <mirror>
      <id>alimaven</id>
      <name>aliyun maven</name>
      <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
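
    To confirm the mirror took effect, Maven's help plugin can print the merged settings (an optional check, not required for the build):

    # mvn help:effective-settings | grep -A 2 alimaven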

    Setting up R
    Download the source, extract it, and install the build dependencies:
    # cd ${R_HOME}
    # yum install gcc-gfortran readline-devel libXt-devel

    Possible configure errors and the packages that fix them:

    # yum install gcc-gfortran   # otherwise: "configure: error: No F77 compiler found"

    # yum install gcc gcc-c++    # otherwise: "configure: error: C++ preprocessor '/lib/cpp' fails sanity check"

    # yum install readline-devel # otherwise: "--with-readline=yes (default) and headers/libs are not available"

    # yum install libXt-devel    # otherwise: "configure: error: --with-x=yes (default) and X11 headers/libs are not available"

    # ./configure --enable-R-shlib   # builds R as a shared library (libR.so), which SparkR requires
    # make && make install
    # vi ~/.bashrc   (configure the environment variables)
    export R_HOME=/opt/modules/R-3.2.0
    export PATH=$R_HOME/bin:$PATH
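
    A quick check that R is usable and the shared library was actually produced (the libR.so path assumes R runs from the built tree at ${R_HOME}, as configured above):

    # source ~/.bashrc
    # R --version
    # ls ${R_HOME}/lib/libR.so   # present only if --enable-R-shlib took effect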

    6. Running the build

    Upload the source tarball and extract it.
    # cd ${SPARK_HOME_SRC}
    # ${SPARK_HOME_SRC}/dev/make-distribution.sh --tgz -Psparkr -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.6 -Phive -Phive-thriftserver -Pyarn
    a. Add SparkR support with -Psparkr

    b. Set the Hadoop version with -Dhadoop.version=2.5.0-cdh5.3.6

    c. Extract the Scala tarball into ${SPARK_HOME_SRC}/build/, as sketched below
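    (A hedged sketch; the tarball name assumes Scala 2.11.8 was downloaded to the current directory. Pre-extracting it spares the build from downloading Scala itself.)
    # tar -zxf scala-2.11.8.tgz -C ${SPARK_HOME_SRC}/build/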

    d. Hard-code the corresponding version values in dev/make-distribution.sh
    Original:
    VERSION=$("$MVN" help:evaluate -Dexpression=project.version $@ 2>/dev/null | grep -v "INFO" | tail -n 1)
    SCALA_VERSION=$("$MVN" help:evaluate -Dexpression=scala.binary.version $@ 2>/dev/null\
        | grep -v "INFO"\
        | tail -n 1)
    SPARK_HADOOP_VERSION=$("$MVN" help:evaluate -Dexpression=hadoop.version $@ 2>/dev/null\
        | grep -v "INFO"\
        | tail -n 1)
    SPARK_HIVE=$("$MVN" help:evaluate -Dexpression=project.activeProfiles -pl sql/hive $@ 2>/dev/null\
        | grep -v "INFO"\
        | fgrep --count "<id>hive</id>";\
        # Reset exit status to 0, otherwise the script stops here if the last grep finds nothing
        # because we use "set -o pipefail"
        echo -n)
    Replace them with the corresponding literal values (this also skips the slow Maven evaluation on every run; the Spark version here is 2.1.0):
    VERSION=2.1.0
    SCALA_VERSION=2.11
    SPARK_HADOOP_VERSION=2.5.0-cdh5.3.6
    SPARK_HIVE=1

    e. Add the CDH repository to Spark's pom.xml
    <repository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
      <releases>
        <enabled>true</enabled>
      </releases>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>
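
    Before kicking off the long build, the CDH artifact can be resolved directly to verify the repository is reachable (an optional check; dependency:get is a standard maven-dependency-plugin goal):

    # mvn dependency:get -DremoteRepositories=https://repository.cloudera.com/artifactory/cloudera-repos/ -Dartifact=org.apache.hadoop:hadoop-client:2.5.0-cdh5.3.6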

    Without this repository, the build fails with an error like:
    Failed to execute goal on project spark-launcher_2.11: Could not resolve dependencies for project org.apache.spark:spark-launcher_2.11:jar:2.1.0: Could not find artifact org.apache.hadoop:hadoop-client:jar:2.5.0-cdh5.3.6

    [ERROR] After correcting the problems, you can resume the build with the command
    [ERROR] mvn <goals> -rf :spark-launcher_2.11

    After fixing the problem, resume from the failed module by appending -rf :spark-launcher_2.11:
    # ${SPARK_HOME_SRC}/dev/make-distribution.sh --tgz -Psparkr -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.6 -Phive -Phive-thriftserver -Pyarn -rf :spark-launcher_2.11

    To build without the R module, drop -Psparkr:
    # ${SPARK_HOME_SRC}/dev/make-distribution.sh --tgz -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.6 -Phive -Phive-thriftserver -Pyarn
    ===============================================================================

    The build produces the distribution tarball at ${SPARK_HOME_SRC}/spark-2.1.0-bin-2.5.0-cdh5.3.6.tgz, following the naming pattern SPARK_VERSION-bin-HADOOP_VERSION.tgz.
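
    A minimal deployment sketch, assuming /opt/modules as the install location (the path is a placeholder):

    # tar -zxf ${SPARK_HOME_SRC}/spark-2.1.0-bin-2.5.0-cdh5.3.6.tgz -C /opt/modules/
    # cd /opt/modules/spark-2.1.0-bin-2.5.0-cdh5.3.6
    # bin/spark-shell --master local[2]   # smoke test with a local master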

    NOTE:
    Keep a copy of the compiled Spark package; its jars will be needed later when studying Spark SQL and Spark Streaming.

    =====================================================================================

    To actually run R on Spark, initialize R after the build completes:
    # cd ${SPARK_HOME_SRC}/R/
    # ./install-dev.sh
    Reference: https://github.com/apache/spark/tree/master/R
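
    Once initialized, a quick SparkR sanity check from the deployed distribution (assuming the placeholder install path used above):

    # cd /opt/modules/spark-2.1.0-bin-2.5.0-cdh5.3.6
    # bin/sparkR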
