  Spark Installation and Configuration


    title: Spark Installation and Configuration
    summary: Keywords: Hadoop cluster environment, Spark, Scala, Python, Ubuntu, installation and configuration
    date: 2019-5-19 13:56
    author: foochane
    urlname: 2019051904
    categories: Big Data
    tags:

    • spark
    • big data

    Author: foochane
    Original post: https://foochane.cn/article/2019051904.html

    1 Installation Notes

    Spark requires an existing Hadoop cluster environment. If you do not have one yet, see: Setting up a Hadoop Distributed Cluster.

    1.1 Software Used

    | Software | Version | Download URL |
    | --- | --- | --- |
    | Linux | Ubuntu Server 18.04.2 LTS | https://www.ubuntu.com/download/server |
    | Hadoop | hadoop-2.7.1 | http://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz |
    | Java | jdk-8u211-linux-x64 | https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html |
    | Spark | spark-2.4.3-bin-hadoop2.7 | https://www.apache.org/dyn/closer.lua/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz |
    | Scala | scala-2.12.5 | http://www.scala-lang.org/download/ |
    | Anaconda | Anaconda3-2019.03-Linux-x86_64.sh | https://www.anaconda.com/distribution/ |
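
    For reference, the Spark and Scala archives can also be fetched directly from the command line. The direct-download URLs below are assumptions (the table's Spark link goes through Apache's mirror chooser rather than to the file itself):

    $ wget https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
    $ wget https://downloads.lightbend.com/scala/2.12.5/scala-2.12.5.tgz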

    1.2 Node Layout

    | Role | IP | Hostname |
    | --- | --- | --- |
    | Master node | 192.168.233.200 | Master |
    | Slave node 1 | 192.168.233.201 | Slave01 |
    | Slave node 2 | 192.168.233.202 | Slave02 |
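
    These hostnames are assumed to resolve on every node, e.g. via entries like the following in /etc/hosts (already in place if the cluster was set up following the Hadoop guide above):

    192.168.233.200 Master
    192.168.233.201 Slave01
    192.168.233.202 Slave02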

    2 Installing Spark

    2.1 Extract to the installation directory

    $ tar zxvf spark-2.4.3-bin-hadoop2.7.tgz -C /usr/local/bigdata/
    $ cd /usr/local/bigdata/
    $ mv spark-2.4.3-bin-hadoop2.7 spark-2.4.3
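
    If the cluster runs under a dedicated hadoop user, as the shell prompts later in this post suggest, make sure that user owns the Spark tree (an assumption; adjust to your setup):

    $ sudo chown -R hadoop:hadoop /usr/local/bigdata/spark-2.4.3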
    

    2.2 Edit the configuration files

    The configuration files live in the /usr/local/bigdata/spark-2.4.3/conf directory.

    (1) spark-env.sh

    Rename spark-env.sh.template to spark-env.sh and add the following:

    export SCALA_HOME=/usr/local/bigdata/scala-2.12.5
    export JAVA_HOME=/usr/local/bigdata/java/jdk1.8.0_211
    export HADOOP_HOME=/usr/local/bigdata/hadoop-2.7.1
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    SPARK_MASTER_IP=Master                            # hostname of the master node
    SPARK_LOCAL_DIRS=/usr/local/bigdata/spark-2.4.3   # scratch directory for shuffle/spill data
    SPARK_DRIVER_MEMORY=512M                          # memory for the driver process
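
    Before moving on, it is worth checking that every path referenced above actually exists on this machine; a minimal sketch:

    for d in /usr/local/bigdata/scala-2.12.5 \
             /usr/local/bigdata/java/jdk1.8.0_211 \
             /usr/local/bigdata/hadoop-2.7.1; do
        [ -d "$d" ] || echo "missing: $d"
    done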
    

    (2) slaves

    Rename slaves.template to slaves and set its contents to:

    Slave01
    Slave02
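
    The standalone start scripts assume Spark lives at the same path on every node. If the slave machines do not have it yet, copy the configured tree over first (a sketch; assumes SSH access between nodes, which start-all.sh needs anyway):

    $ scp -r /usr/local/bigdata/spark-2.4.3 Slave01:/usr/local/bigdata/
    $ scp -r /usr/local/bigdata/spark-2.4.3 Slave02:/usr/local/bigdata/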
    

    2.3 Configure environment variables

    Add the following to the ~/.bashrc file, then run $ source ~/.bashrc to apply the changes:

    export SPARK_HOME=/usr/local/bigdata/spark-2.4.3
    export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
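
    After sourcing, the Spark launchers should be on the PATH; a quick check:

    $ spark-submit --version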
    

    3 Running Spark

    Start Hadoop first:
    $ cd $HADOOP_HOME/sbin/
    $ ./start-dfs.sh
    $ ./start-yarn.sh
    $ ./start-history-server.sh
    
    Then start Spark:
    $ cd $SPARK_HOME/sbin/
    $ ./start-all.sh
    $ ./start-history-server.sh
    

    Note: since the environment variables are already configured, start-dfs.sh and start-yarn.sh can be run from anywhere without switching directories. However, start-all.sh, stop-all.sh, and start-history-server.sh exist in both the Hadoop and the Spark sbin directories, so to avoid launching the wrong one it is safest to cd into the right directory (or use the absolute path) first.
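
    To confirm the daemons are up, jps on each node should show roughly the following (the exact set depends on the Hadoop configuration):

    $ jps
    # Master (roughly): NameNode, SecondaryNameNode, ResourceManager, Master, HistoryServer
    # Slaves (roughly): DataNode, NodeManager, Worker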

    Once Spark has started, the cluster's resource status can be viewed in a browser at http://192.168.233.200:8080/, where 192.168.233.200 is the Master node's IP.
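
    The UIs can also be probed from the command line with curl (8080 is the standalone master's default UI port and 18080 the Spark history server's default, assuming the defaults were not changed):

    $ curl -s -o /dev/null -w "%{http_code}\n" http://192.168.233.200:8080/
    $ curl -s -o /dev/null -w "%{http_code}\n" http://192.168.233.200:18080/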

    4 Configuring the Scala Environment

    Spark applications can be developed in either Scala or Python.

    4.1 Install Scala

    Spark already ships with an embedded Scala runtime (the 2.4.3 build runs on Scala 2.11, as the shell banner below shows), so a standalone Scala is optional. If you want one, or a different version, download a release package and install it as follows.
    First download the archive, then extract it:

    $ tar zxvf scala-2.12.5.tgz -C /usr/local/bigdata/
    

    Then add the following to the ~/.bashrc file and run $ source ~/.bashrc to apply the changes:

    export SCALA_HOME=/usr/local/bigdata/scala-2.12.5
    export PATH=/usr/local/bigdata/scala-2.12.5/bin:$PATH
    

    To check that the installation succeeded, run:

    scala -version
    
    Scala code runner version 2.12.5 -- Copyright 2002-2018, LAMP/EPFL and Lightbend, Inc.
    

    4.2 Start the Spark shell

    Run the spark-shell --master spark://master:7077 command to start the Spark shell:

    hadoop@Master:~$ spark-shell --master spark://master:7077
    19/06/08 08:01:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Spark context Web UI available at http://Master:4040
    Spark context available as 'sc' (master = spark://master:7077, app id = app-20190608080221-0002).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.4.3
          /_/
    
    Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_211)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala>
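
    Beyond the interactive shell, a quick end-to-end smoke test is to submit the bundled SparkPi example from the command line (the jar path below is where the spark-2.4.3-bin-hadoop2.7 distribution ships it; check examples/jars/ if yours differs):

    $ spark-submit --master spark://master:7077 \
        --class org.apache.spark.examples.SparkPi \
        $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.3.jar 100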
    

    5 Configuring the Python Environment

    5.1 Install Python

    The system comes with Python preinstalled, but for development convenience it is recommended to install Anaconda. The package used here is Anaconda3-2019.03-Linux-x86_64.sh; installation is straightforward, just run $ bash Anaconda3-2019.03-Linux-x86_64.sh.
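
    If PySpark should run on the Anaconda interpreter rather than the system Python, point it there explicitly in ~/.bashrc (the ~/anaconda3 prefix is Anaconda's default install location and an assumption here; every worker node needs the same interpreter path):

    export PYSPARK_PYTHON=~/anaconda3/bin/python          # interpreter used by executors
    export PYSPARK_DRIVER_PYTHON=~/anaconda3/bin/python   # interpreter used by the driver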

    5.2 Start the PySpark client

    Run the command: $ pyspark --master spark://master:7077

    The session looks like this:

    hadoop@Master:~$ pyspark --master spark://master:7077
    Python 3.6.3 |Anaconda, Inc.| (default, Oct 13 2017, 12:02:49)
    [GCC 7.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    19/06/08 08:12:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 2.4.3
          /_/
    
    Using Python version 3.6.3 (default, Oct 13 2017 12:02:49)
    SparkSession available as 'spark'.
    >>>
    >>>
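
    As with the Scala shell, the bundled Python Pi example gives a quick end-to-end check of the PySpark setup (script path as shipped in the binary distribution; an assumption if your layout differs):

    $ spark-submit --master spark://master:7077 \
        $SPARK_HOME/examples/src/main/python/pi.py 100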
    