  • Installing and Using Spark

    Scala Installation

    • Spark is developed in Scala; the Scala version you install must match the one Spark was built with, otherwise Spark jobs written in Scala will hit compatibility problems
    • You can confirm the Scala version on the Spark website, or by running Spark's bin/spark-shell and reading the startup banner
    • Scala depends on the JDK; before installing Scala, first install at least the minimum JDK version it requires
    • Download: https://scala-lang.org/download/
    • Extract: [root@master src]# tar xvzf scala-2.11.8.tgz
    • Set environment variables (optional):
    export SCALA_HOME=/usr/local/src/scala-2.11.8
    export PATH=$PATH:$SCALA_HOME/bin
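Since Scala 2.x is binary-compatible only within the same major.minor line (2.11.x with 2.11.y, but never with 2.12.x), the match that matters can be checked with a small script. A sketch, using a hypothetical helper name:

```shell
# same_minor_line is a hypothetical helper: it strips the patch component
# of two Scala version strings and compares the remaining major.minor,
# which is the level at which Scala 2.x binary compatibility holds.
same_minor_line() {
  [ "${1%.*}" = "${2%.*}" ]
}

same_minor_line 2.11.8 2.11.12 && echo "compatible"      # prints: compatible
same_minor_line 2.11.8 2.12.6  || echo "incompatible"    # prints: incompatible
```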
    

    Spark Installation

    • To deploy a Spark standalone cluster, the master and slave nodes need passwordless ssh trust between them; if jobs run directly on a YARN cluster this is unnecessary, because the Spark master and slaves are never started
    • For local debugging only, no node needs to be started: submit with ./bin/spark-submit in local mode, which merely simulates execution on the local machine
    • If a hostname used in the configuration files has no name-to-IP mapping, add one in /etc/hosts
    • If jobs are executed on YARN, Hadoop must be started first
    • A Spark master HA setup additionally depends on ZooKeeper
    • Spark is written in Scala, so running it requires a Scala environment whose version matches the Scala version Spark was built with
    • The Scala version can be checked by running ./bin/spark-shell: Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
    • Download: http://archive.apache.org/dist/spark/
    • Extract: [root@master src]# tar xvzf spark-2.0.2-bin-hadoop2.6.tgz
    • Configure:
    [root@master conf]# cp spark-env.sh.template spark-env.sh
    [root@master conf]# cp slaves.template slaves
    [root@master conf]# vim slaves
    slave1
    slave2
    [root@master spark-2.0.2-bin-hadoop2.6]# mkdir tmp
    [root@master conf]# vim spark-env.sh
    export JAVA_HOME=/usr/local/src/jdk1.8.0_181
    export HADOOP_HOME=/usr/local/src/hadoop-2.6.5
    export HADOOP_CONF_DIR=/usr/local/src/hadoop-2.6.5/etc/hadoop
    SPARK_MASTER_IP=master
    SPARK_LOCAL_DIRS=/usr/local/src/spark-2.0.2-bin-hadoop2.6/tmp
    SPARK_DRIVER_MEMORY=512M
    SPARK_EXECUTOR_MEMORY=512M
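The slaves file and SPARK_MASTER_IP above refer to nodes by hostname, so every node must be able to resolve master, slave1 and slave2. If DNS does not cover them, add mappings like the following to /etc/hosts on each machine (the IP addresses here are placeholders for illustration; substitute your own):

```
192.168.1.10 master
192.168.1.11 slave1
192.168.1.12 slave2
```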
    
    • Sync Scala and Spark to slave1 and slave2
    # sync to slave1
    [root@master src]# scp -rp ./scala-2.11.8 slave1:/usr/local/src
    [root@master src]# scp -rp ./spark-2.0.2-bin-hadoop2.6 slave1:/usr/local/src
    
    # sync to slave2
    [root@master src]# scp -rp ./scala-2.11.8 slave2:/usr/local/src
    [root@master src]# scp -rp ./spark-2.0.2-bin-hadoop2.6 slave2:/usr/local/src
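The four scp commands above follow one pattern, so they can be generated with a loop. A sketch, printed as a dry run so the commands can be inspected first; drop the echo to actually copy:

```shell
# Print (dry run) the scp command for every directory/slave pair.
# Remove `echo` to perform the actual copy.
for host in slave1 slave2; do
  for dir in scala-2.11.8 spark-2.0.2-bin-hadoop2.6; do
    echo scp -rp "./$dir" "$host:/usr/local/src"
  done
done
```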
    
    • Start the Spark cluster
    # on the master
    [root@master spark-2.0.2-bin-hadoop2.6]# ./sbin/start-all.sh
    starting org.apache.spark.deploy.master.Master, logging to /usr/local/src/spark-2.0.2-bin-hadoop2.6/logs/spark-wadeyu-org.apache.spark.deploy.master.Master-1-master.out
    slave1: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/src/spark-2.0.2-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-slave1.out
    slave2: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/src/spark-2.0.2-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-slave2.out
    # check processes with jps; on success a Master process appears
    [root@master spark-2.0.2-bin-hadoop2.6]# jps
    1953 SecondaryNameNode
    25954 Master
    26020 Jps
    1803 NameNode
    2123 ResourceManager
    
    # on a slave machine
    # check processes with jps; on success a Worker process appears
    [root@slave1 wadeyu]# jps
    1314 DataNode
    7509 Worker
    7574 Jps
    1386 NodeManager
    
    • Verify the cluster
    [root@master spark-2.0.2-bin-hadoop2.6]# ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://master:7077 examples/jars/spark-examples_2.11-2.0.2.jar 100
    ......
    18/09/19 17:12:09 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
    18/09/19 17:12:09 INFO scheduler.DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 17.381 s
    18/09/19 17:12:09 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 20.376037 s
    Pi is roughly 3.1418439141843915
    18/09/19 17:12:12 INFO server.ServerConnector: Stopped ServerConnector@6865c751{HTTP/1.1}{0.0.0.0:4040}
    ......
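For reference, the SparkPi example estimates π by Monte Carlo sampling: each of the n samples is a random point in a square, and the fraction of points that fall inside the inscribed circle approaches the area ratio π/4, so

```latex
\pi \;\approx\; 4 \cdot \frac{\#\{\, i : x_i^2 + y_i^2 \le 1 \,\}}{n},
\qquad (x_i, y_i) \sim \mathrm{Uniform}\bigl([-1,1]^2\bigr)
```

which is why the result is only "roughly" 3.14; the trailing argument (100) is the number of partitions the samples are split across.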
    

    You can also check the cluster through the Web UI at http://master:8080/

    • Stop the Spark cluster
    [root@master spark-2.0.2-bin-hadoop2.6]# ./sbin/stop-all.sh
    slave1: stopping org.apache.spark.deploy.worker.Worker
    slave2: stopping org.apache.spark.deploy.worker.Worker
    stopping org.apache.spark.deploy.master.Master
    [root@master spark-2.0.2-bin-hadoop2.6]# jps
    1953 SecondaryNameNode
    1803 NameNode
    2123 ResourceManager
    26061 Jps
    

    Spark Job Submission Modes

    • Local mode, simulating a cluster on one machine; suited to local debugging
    # --master local[2]        local mode with 2 worker threads
    # --class                  the main class of your application
    # ./scala_word_count.jar   the jar you built
    # last argument            the input data (read from HDFS by default)
    [root@master src]# ./bin/spark-submit --master local[2] --class com.wadeyu.spark.ScalaWordCount ./scala_word_count.jar /data/The_Man_of_Property.txt
    ......
    (pains,1)
    (silence,,13)
    (pavement;,1)
    (washes,1)
    (lower,4)
    (comment,1)
    (scornfully.,1)
    (father-in-law,1)
    (shareholder.,2)
    18/09/20 09:47:43 INFO spark.SparkContext: Invoking stop() from shutdown hook
    18/09/20 09:47:43 INFO server.ServerConnector: Stopped ServerConnector@6a62689d{HTTP/1.1}{0.0.0.0:4041}
    ......
    
    • Submit to the Spark standalone cluster
      1 Uses Spark's own compute resources, not Hadoop's
      2 The Spark cluster must be running (the job does not run inside Hadoop YARN)
    [root@master spark-2.0.2-bin-hadoop2.6]# ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://master:7077 ./examples/jars/spark-examples_2.11-2.0.2.jar 100
    ......
    18/09/20 10:59:48 INFO scheduler.DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 10.501 s
    18/09/20 10:59:48 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
    18/09/20 10:59:48 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 12.318682 s
    Pi is roughly 3.1423823142382314
    18/09/20 10:59:48 INFO server.ServerConnector: Stopped ServerConnector@6d1d4d7{HTTP/1.1}{0.0.0.0:4040}
    18/09/20 10:59:48 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@4eed46ee{/stages/stage/kill,null,UNAVAILABLE}
    ......
    
    • Submit to the Hadoop cluster
    # yarn cluster mode
    [root@master spark-2.0.2-bin-hadoop2.6]# ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster ./examples/jars/spark-examples_2.11-2.0.2.jar 100
    ......
    18/09/20 14:11:16 INFO yarn.Client: Application report for application_1537422678993_0003 (state: RUNNING)
    18/09/20 14:11:17 INFO yarn.Client: Application report for application_1537422678993_0003 (state: FINISHED)
    18/09/20 14:11:17 INFO yarn.Client: 
             client token: N/A
             diagnostics: N/A
             ApplicationMaster host: 192.168.1.17
             ApplicationMaster RPC port: 0
             queue: default
             start time: 1537423819027
             final status: SUCCEEDED
             tracking URL: http://master:8088/proxy/application_1537422678993_0003/
             user: root
    ......
    
    # yarn client mode
    [root@master spark-2.0.2-bin-hadoop2.6]# ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client ./examples/jars/spark-examples_2.11-2.0.2.jar 100 
    ......
    18/09/20 14:29:12 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool 
    18/09/20 14:29:12 INFO scheduler.DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 21.683 s
    18/09/20 14:29:12 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 31.225426 s
    Pi is roughly 3.1412563141256316
    18/09/20 14:29:12 INFO server.ServerConnector: Stopped ServerConnector@2f162cc0{HTTP/1.1}{0.0.0.0:4040}
    18/09/20 14:29:13 INFO handler.ContextHandler: Stopped o.s.j.s.ServletContextHandler@1849db1a{/stages/stage/kill,null,UNAVAILABLE}
    ......
    
    # Difference between the two
    yarn-cluster: the driver runs inside the Spark ApplicationMaster, on a NodeManager picked by the cluster
    yarn-client: the driver runs on the machine that submitted the job (a lightweight ApplicationMaster still runs in the cluster)
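Note that the yarn-cluster and yarn-client master values are deprecated as of Spark 2.0 in favor of --master yarn plus an explicit --deploy-mode. An equivalent form of the cluster-mode submission above, shown as a dry run (drop the echo to actually submit):

```shell
# Deprecated:  --master yarn-cluster
# Preferred:   --master yarn --deploy-mode cluster
# `echo` makes this a dry run that just prints the command.
echo ./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  ./examples/jars/spark-examples_2.11-2.0.2.jar 100
```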
    

    Notes on the spark-submit Script

    • Usage
    [root@master spark-2.0.2-bin-hadoop2.6]# ./bin/spark-submit --help
    Usage: spark-submit [options] <app jar | python file> [app arguments]
    Usage: spark-submit --kill [submission ID] --master [spark://...]
    Usage: spark-submit --status [submission ID] --master [spark://...]
    Usage: spark-submit run-example [options] example-class [example args]
    
    • Common options
     --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
     --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client")
                                 or on one of the worker machines inside the cluster
                                 ("cluster") (Default: client).
     --class CLASS_NAME          Your application's main class (for Java / Scala apps).
     --name NAME                 A name of your application.
     --jars JARS                 Comma-separated list of local jars to include on the
                                 driver and executor classpaths.
     --packages                  Comma-separated list of maven coordinates of jars to
                                 include on the driver and executor classpaths. Will
                                 search the local maven repo, then maven central and any
                                 additional remote repositories given by --repositories.
                                 The format for the coordinates should be
                                 groupId:artifactId:version.
     --exclude-packages          Comma-separated list of groupId:artifactId, to exclude
                                 while resolving the dependencies provided in --packages
                                 to avoid dependency conflicts.
     --repositories              Comma-separated list of additional remote repositories
                                 to search for the maven coordinates given with
                                 --packages.
     --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to
                                 place on the PYTHONPATH for Python apps.
     --files FILES               Comma-separated list of files to be placed in the
                                 working directory of each executor.
     --conf PROP=VALUE           Arbitrary Spark configuration property.
     --properties-file FILE      Path to a file from which to load extra properties.
                                 If not specified, this will look for
                                 conf/spark-defaults.conf.
     --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
     --driver-java-options       Extra Java options to pass to the driver.
     --driver-library-path       Extra library path entries to pass to the driver.
     --driver-class-path         Extra class path entries to pass to the driver. Note
                                 that jars added with --jars are automatically included
                                 in the classpath.
     --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).
     --proxy-user NAME           User to impersonate when submitting the application.
                                 This argument does not work with --principal / --keytab.
     --help, -h                  Show this help message and exit.
     --verbose, -v               Print additional debug output.
     --version                   Print the version of current Spark.

    The following options apply only to specific deploy modes:

     Spark standalone with cluster deploy mode only:
      --driver-cores NUM          Cores for driver (Default: 1).
    
     Spark standalone or Mesos with cluster deploy mode only:
      --supervise                 If given, restarts the driver on failure.
      --kill SUBMISSION_ID        If given, kills the driver specified.
      --status SUBMISSION_ID      If given, requests the status of the driver specified.
    
     Spark standalone and Mesos only:
      --total-executor-cores NUM  Total cores for all executors.
    
     Spark standalone and YARN only:
      --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                                  or all available cores on the worker in standalone mode)
    
     YARN-only:
      --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                                  (Default: 1).
      --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
      --num-executors NUM         Number of executors to launch (Default: 2).
                                  If dynamic allocation is enabled, the initial number of
                                  executors will be at least NUM.
      --archives ARCHIVES         Comma separated list of archives to be extracted into the
                                  working directory of each executor.
      --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                                  secure HDFS.
      --keytab KEYTAB             The full path to the file that contains the keytab for the
                                  principal specified above. This keytab will be copied to
                                  the node running the Application Master via the Secure
                                  Distributed Cache, for renewing the login tickets and the
                                  delegation tokens periodically.
    

    Problems Encountered

    • Problem 1
    ......
    Exception in thread "main" java.lang.BootstrapMethodError: java.lang.NoClassDefFoundError: scala/runtime/java8/JFunction2$mcIII$sp
            at com.wadeyu.spark.ScalaWordCount$.main(ScalaWordCount.scala:18)
            at com.wadeyu.spark.ScalaWordCount.main(ScalaWordCount.scala)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:498)
            at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
            at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
            at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
            at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
            at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    Caused by: java.lang.NoClassDefFoundError: scala/runtime/java8/JFunction2$mcIII$sp
            ... 11 more
    Caused by: java.lang.ClassNotFoundException: scala.runtime.java8.JFunction2$mcIII$sp
            at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
            ... 11 more
    ......
    

    Cause: the Scala version the program was compiled with differs from the Scala version Spark was built with

    Solution: make the two versions match
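If the job is built with sbt (an assumption; the original project setup is not shown), pinning the versions in the build definition is the simplest way to keep them in step with Spark 2.0.2, which is built against Scala 2.11. A build.sbt sketch:

```scala
// build.sbt sketch -- hypothetical project name; adjust to your setup.
// scalaVersion must stay on the same 2.11.x line that Spark 2.0.2 uses.
name := "scala-word-count"

scalaVersion := "2.11.8"

// "provided": the cluster supplies spark-core at runtime, so it is not
// bundled into the application jar.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.2" % "provided"
```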

    • Problem 2
    Error:scalac: missing or invalid dependency detected while loading class file 'RDD.class'.
    Could not access term hadoop in package org.apache,
    because it (or its dependencies) are missing. Check your build definition for
    missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
    A full rebuild may help if 'RDD.class' was compiled against an incompatible version of org.apache.
    
    Error:scalac: missing or invalid dependency detected while loading class file 'RDD.class'.
    Could not access term io in value org.apache.hadoop,
    because it (or its dependencies) are missing. Check your build definition for
    missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.)
    A full rebuild may help if 'RDD.class' was compiled against an incompatible version of org.apache.hadoop.
    

    Cause: the IntelliJ IDEA Scala project did not import Spark's jar packages

    Solution: add all jars under Spark's jars directory to the project

    • Problem 3
    client token: N/A diagnostics: Application application_1537417716278_0001 failed 2 times due to AM Container for appattempt_1537417716278_0001_000002 exited with  exitCode: -103
    For more detailed output, check application tracking page:http://master:8088/proxy/application_1537417716278_0001/Then, click on links to logs of each attempt.
    Diagnostics: Container [pid=1606,containerID=container_1537417716278_0001_02_000001] is running beyond virtual memory limits. Current usage: 80.1 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container.
    Dump of the process-tree for container_1537417716278_0001_02_000001 :
    

    Cause: the container exceeded its virtual memory limit; YARN's default yarn.nodemanager.vmem-pmem-ratio is 2.1, which is exactly the 2.1 GB limit shown in the log

    Solution: add virtual memory (swap) on the machine and raise YARN's virtual-to-physical memory ratio, or check whether the program leaks memory

    # yarn-site.xml: add the virtual-to-physical memory ratio
    
    <property>
    	<name>yarn.nodemanager.vmem-pmem-ratio</name>
    	<value>4.0</value>
    </property>
    
    • Problem 4
    Failed redirect for container_1537419703454_0001_02_000001
    
    Failed while trying to construct the redirect url to the log server. Log Server url may not be configured
    java.lang.Exception: Unknown container. Container either has not started or has already completed or doesn't belong to this node at all.
    

    Cause: the YARN history server was not configured or started

    Solution: configure the history server and start it
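A minimal sketch of the relevant Hadoop 2.x configuration; the host and ports below are conventional defaults shown as placeholders, so adjust them to your cluster:

```xml
<!-- mapred-site.xml: where the job history server listens -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>master:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>master:19888</value>
</property>

<!-- yarn-site.xml: aggregate container logs and point YARN at the log server -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.log.server.url</name>
  <value>http://master:19888/jobhistory/logs</value>
</property>
```

Then start it with $HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver and verify that a JobHistoryServer process appears in jps.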


  • Original post: https://www.cnblogs.com/wadeyu/p/9681075.html