  • Spark Installation

    1 Cluster Planning

    Standalone mode is used: 18 machines, one master and 17 slaves.

    2 Versions

    scala-2.11.7.tgz

    spark-1.4.1-bin-hadoop2.6.tgz

    3 Installation

    This assumes Hadoop is already installed; if it is not, see the Hadoop installation post.

    3.1 Install Scala

    $ cd /opt/soft
    $ tar -zxf /home/hadoop/scala-2.11.7.tgz
    $ mv scala-2.11.7/ scala

    3.2 Install Spark

    $ tar -zxf /home/hadoop/spark-1.4.1-bin-hadoop2.6.tgz
    $ mv spark-1.4.1-bin-hadoop2.6/ spark

    3.3 Add environment variables

    Append the following to /etc/profile:

    export SCALA_HOME=/opt/soft/scala
    export SPARK_HOME=/opt/soft/spark
    export PATH=$SCALA_HOME/bin:$SPARK_HOME/bin:$PATH
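
    After editing /etc/profile, reload it and confirm both tools are on the PATH. A quick sanity check (the versions shown are what these releases should report):

    $ source /etc/profile
    $ scala -version          # Scala code runner version 2.11.7
    $ spark-submit --version  # Spark version 1.4.1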

    4 Configuring Spark

    4.1 Configure slaves

    $ cd /opt/soft/spark/conf
    $ cp slaves.template slaves
    $ cat slaves
    # A Spark Worker will be started on each of the machines listed below.
    a02
    a03
    a04
    a05
    a06
    a07
    a08
    a09
    a10
    a11
    a12
    a13
    a14
    a15
    a16
    a17
    a18

    4.2 Configure spark-env.sh

    $ cp spark-env.sh.template spark-env.sh
    $ vim spark-env.sh
    # Common settings
    export SCALA_HOME=/opt/soft/scala/
    export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/
    export SPARK_LOCAL_DIRS=/opt/soft/spark/
    export SPARK_CONF_DIR=/opt/soft/spark/conf/
    export SPARK_PID_DIR=/opt/spark/pid_file/
    
    #standalone
    export SPARK_MASTER_IP=a01
    export SPARK_MASTER_PORT=7077
    # Number of CPU cores used by each Worker process
    export SPARK_WORKER_CORES=4
    # Amount of memory used by each Worker process
    export SPARK_WORKER_MEMORY=9g
    # Number of Worker processes to run on each worker node
    export SPARK_WORKER_INSTANCES=6
    # Local directory where the workers run tasks and keep scratch data
    export SPARK_WORKER_DIR=/opt/spark/local
    # Master web UI port
    export SPARK_MASTER_WEBUI_PORT=8099
    # Spark History Server settings
    export SPARK_HISTORY_OPTS="-Dspark.history.retainedApplications=20 -Dspark.history.fs.logDirectory=hdfs://a01:9000/user/spark/applicationHistory"

    This is a standalone-mode configuration. Tune each value to the hardware of your machines, but always make sure that:

    SPARK_WORKER_CORES * SPARK_WORKER_INSTANCES <= total CPU cores of a single machine
    SPARK_WORKER_MEMORY * SPARK_WORKER_INSTANCES <= total memory of a single machine
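
    As a worked example, assume each slave has 24 CPU cores and 64 GB of RAM (an illustrative assumption; substitute your own hardware). The values above then satisfy both constraints:

    SPARK_WORKER_CORES  * SPARK_WORKER_INSTANCES = 4  * 6 = 24 cores <= 24 cores per machine
    SPARK_WORKER_MEMORY * SPARK_WORKER_INSTANCES = 9g * 6 = 54 GB    <= 64 GB per machine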

    More configuration options are listed in spark-env.sh.template.

    SPARK_HISTORY_OPTS configures the history server; for details see http://www.cnblogs.com/luogankun/p/3981645.html

    4.3 Configure spark-defaults.conf

    $ cp spark-defaults.conf.template spark-defaults.conf
    $ vim spark-defaults.conf
    # Use standalone mode by default
    spark.master    spark://a01:7077
    # Spark History Server settings
    spark.eventLog.enabled           true
    spark.eventLog.dir               hdfs://a01:9000/user/spark/applicationHistory
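
    The event log directory has to exist in HDFS before the first job writes to it, and the history server is started separately. A minimal sketch, assuming the paths configured above and that the hadoop command is on the PATH:

    $ hadoop fs -mkdir -p hdfs://a01:9000/user/spark/applicationHistory
    $ /opt/soft/spark/sbin/start-history-server.sh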

    Once everything is configured, re-archive the spark directory and copy it to the slave nodes.

    On each slave node, first complete step 3 (Scala and the environment variables), then simply unpack the spark archive that was copied over; a distribution sketch is shown below.
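
    A minimal distribution sketch, assuming the hostnames a02..a18 from the slaves file and passwordless SSH from the master (adjust paths and hosts to your environment):

    cd /opt/soft && tar -zcf spark.tgz spark/
    for i in $(seq -w 2 18); do
        scp spark.tgz a$i:/opt/soft/
        ssh a$i "cd /opt/soft && tar -zxf spark.tgz"
    done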

    5 Startup

    $ /opt/soft/spark/sbin/start-all.sh

    Check that the expected processes are now present on every machine.
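
    One way to check, assuming the JDK's jps tool is available on every node: the master should show a Master process, and each slave should show SPARK_WORKER_INSTANCES (here 6) Worker processes.

    $ jps | grep Master               # on a01
    $ ssh a02 'jps | grep -c Worker'  # should print 6 on each slave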

    5.1 Starting a worker manually

    When starting with start-all.sh, an individual worker occasionally fails to start, and in production a worker sometimes goes offline.

    In those cases you can restart just that worker rather than restarting the whole cluster.

    a. Find the failed worker

    In the web UI, find the machine whose worker count does not match the expected number, then check the worker logs on that machine to see which worker went wrong, and kill that process directly.
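
    A sketch of step a on the suspect machine; the log file name and host below are illustrative (spark-daemon.sh names its logs spark-<user>-org.apache.spark.deploy.worker.Worker-<instance>-<host>.out under $SPARK_HOME/logs by default):

    $ jps | grep -c Worker            # fewer than SPARK_WORKER_INSTANCES means one is down or hung
    $ ls /opt/soft/spark/logs/        # one log per worker instance
    $ tail -n 50 /opt/soft/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-3-a05.out
    $ kill <pid>                      # kill the broken worker if its process is still alive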

    b. Restart that worker

    Use the command:

    $SPARK_HOME/sbin/spark-daemon.sh [--config <conf-dir>] (start|stop|status) <spark-command> <spark-instance-number> <args...>
    First argument  : --config $SPARK_HOME/conf
    Second argument : start
    Third argument  : org.apache.spark.deploy.worker.Worker (the Worker class)
    Fourth argument : the instance number of this worker, based on how many workers the machine already runs
    Fifth argument  : startup arguments; the excerpt below is taken from the argument-parsing class WorkerArguments.scala in the source, which is self-explanatory, so pass whatever arguments you need
     case ("--ip" | "-i") :: value :: tail =>
          Utils.checkHost(value, "ip no longer supported, please use hostname " + value)
          host = value
          parse(tail)
    
        case ("--host" | "-h") :: value :: tail =>
          Utils.checkHost(value, "Please use hostname " + value)
          host = value
          parse(tail)
    
        case ("--port" | "-p") :: IntParam(value) :: tail =>
          port = value
          parse(tail)
    
        case ("--cores" | "-c") :: IntParam(value) :: tail =>
          cores = value
          parse(tail)
    
        case ("--memory" | "-m") :: MemoryParam(value) :: tail =>
          memory = value
          parse(tail)
    
        case ("--work-dir" | "-d") :: value :: tail =>
          workDir = value
          parse(tail)
    
        case "--webui-port" :: IntParam(value) :: tail =>
          webUiPort = value
          parse(tail)
    
        case ("--properties-file") :: value :: tail =>
          propertiesFile = value
          parse(tail)

    An example:

    sbin/spark-daemon.sh --config conf/ start org.apache.spark.deploy.worker.Worker 2 --webui-port 8082 -c 4 -m 9G spark://a01:7077

    Note: the master URL at the end is required.
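
    The same script stops a worker; the instance number must match the one used when it was started. A sketch mirroring the example above:

    $ sbin/spark-daemon.sh --config conf/ stop org.apache.spark.deploy.worker.Worker 2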

    6 Installing the jobserver

    jobServer depends on sbt, so sbt must be installed first:

    rpm -ivh https://dl.bintray.com/sbt/rpm/sbt-0.13.7.rpm

    Install git, pull the code from GitHub, and start it:

    yum install git
    # Clone the project
    SHELL$ git clone https://github.com/ooyala/spark-jobserver.git
    # From the project root, enter sbt
    SHELL$ sbt
    ......
    [info] Loading project definition from /home/pingjie/wordspace/spark-jobserver/project
    >
    # Start jobServer locally (developer mode)
    >re-start --- -Xmx4g
    ......
    # This downloads spark-core, jetty, liftweb and other modules.
    job-server-extras Starting spark.jobserver.JobServer.main()
    [success] Total time: 111 s, completed 2015-9-22 9:59:21

    Then open http://localhost:8090 to see the web UI.

    Installation complete.

    6.2 API

    JARS

    GET /jars             lists all uploaded jars and when they were last updated
    POST /jars/<appName>  uploads a new jar under the name appName

    Contexts

    GET /contexts           - lists all current contexts
    POST /contexts/<name>   - creates a new context
    DELETE /contexts/<name> - deletes a context and stops all jobs running in it

    Jobs

    GET /jobs                 lists all jobs
    POST /jobs                submits a new job
    GET /jobs/<jobId>         gets the result and status of the given job
    GET /jobs/<jobId>/config  gets the job's configuration
    DELETE /jobs/<jobId>      deletes the given job

    6.3 Getting familiar with the jobserver commands

    Test with job-server-tests: compile and package it first; the commands are very similar to maven's.

    pingjie@pingjie-youku:~/wordspace/spark-jobserver$ sbt job-server-tests/package
    [info] Loading project definition from /home/pingjie/wordspace/spark-jobserver/project
    Missing bintray credentials /home/pingjie/.bintray/.credentials. Some bintray features depend on this.
    Missing bintray credentials /home/pingjie/.bintray/.credentials. Some bintray features depend on this.
    Missing bintray credentials /home/pingjie/.bintray/.credentials. Some bintray features depend on this.
    Missing bintray credentials /home/pingjie/.bintray/.credentials. Some bintray features depend on this.
    [info] Set current project to root (in build file:/home/pingjie/wordspace/spark-jobserver/)
    [info] scalastyle using config /home/pingjie/wordspace/spark-jobserver/scalastyle-config.xml
    [info] Processed 5 file(s)
    [info] Found 0 errors
    [info] Found 0 warnings
    [info] Found 0 infos
    [info] Finished in 4 ms
    [success] created output: /home/pingjie/wordspace/spark-jobserver/job-server-tests/target
    [warn] Credentials file /home/pingjie/.bintray/.credentials does not exist
    [info] Updating {file:/home/pingjie/wordspace/spark-jobserver/}job-server-tests...
    [info] Resolving org.fusesource.jansi#jansi;1.4 ...
    [info] Done updating.
    [info] scalastyle using config /home/pingjie/wordspace/spark-jobserver/scalastyle-config.xml
    [info] Processed 3 file(s)
    [info] Found 0 errors
    [info] Found 0 warnings
    [info] Found 0 infos
    [info] Finished in 0 ms
    [success] created output: /home/pingjie/wordspace/spark-jobserver/job-server-api/target
    [info] Compiling 5 Scala sources to /home/pingjie/wordspace/spark-jobserver/job-server-tests/target/scala-2.10/classes...
    [warn] Multiple main classes detected.  Run 'show discoveredMainClasses' to see the list
    [info] Packaging /home/pingjie/wordspace/spark-jobserver/job-server-tests/target/scala-2.10/job-server-tests_2.10-0.5.3-SNAPSHOT.jar ...
    [info] Done packaging.
    [success] Total time: 41 s, completed 2015-9-22 10:06:19

    It reports success, and the jar has been generated under the target directory.

    # Upload a new jar
    pingjie@pingjie-youku:~/wordspace/spark-jobserver$ curl --data-binary @job-server-tests/target/scala-2.10/job-server-tests_2.10-0.5.3-SNAPSHOT.jar localhost:8090/jars/test
    OK
    # List all current jars
    pingjie@pingjie-youku:~/wordspace/spark-jobserver$ curl localhost:8090/jars
    { "test": "2015-09-22T10:10:29.815+08:00" }
    # Submit a new job without specifying a context; a context is created by default
    pingjie@pingjie-youku:~/wordspace/spark-jobserver$ curl -d "input.string= hello job server " 'localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample'
    {
      "status": "STARTED",
      "result": {
        "jobId": "64196fca-80da-4c74-9b6f-27c5954ee25c",
        "context": "bf196647-spark.jobserver.WordCountExample"
      }
    }
    # Submit a job without specifying a context; again, a default context is created
    pingjie@pingjie-youku:~/wordspace/spark-jobserver$ curl -X POST -d "input.string= hello job server " 'localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample'

     {
       "status": "STARTED",
       "result": {
       "jobId": "d09ec0c4-91db-456d-baef-633b5c0ff504",
       "context": "7500533c-spark.jobserver.WordCountExample"
      }
     }


    # List all jobs; the job created above is now there
    pingjie@pingjie-youku:~/wordspace/spark-jobserver$ curl 'localhost:8090/jobs'
    [{ "duration": "0.715 secs", "classPath": "spark.jobserver.WordCountExample", "startTime": "2015-09-22T10:19:34.591+08:00", "context": "bf196647-spark.jobserver.WordCountExample", "status": "FINISHED", "jobId": "64196fca-80da-4c74-9b6f-27c5954ee25c" }]
    # List all contexts; currently empty
    pingjie@pingjie-youku:~/wordspace/spark-jobserver$ curl 'localhost:8090/contexts'
    []
    # Create a new context, specifying the number of CPU cores and the memory per node
    pingjie@pingjie-youku:~/wordspace/spark-jobserver$ curl -d "" 'localhost:8090/contexts/test-contexts?num-cpu-cores=1&mem-per-node=512m'
    OK
    # List contexts again; the one just created is there
    pingjie@pingjie-youku:~/wordspace/spark-jobserver$ curl 'localhost:8090/contexts'
    ["test-contexts"]
    # Submit a job, specifying the context
    pingjie@pingjie-youku:~/wordspace/spark-jobserver$ curl -X POST -d "input.string= hello job server " 'localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample&context=test-contexts&sync=true'

      {
        "status": "OK",
        "result": {
        "job": 1,
        "hello": 1,
        "server": 1
       }
     }

    The order for submitting a job on the jobserver should be:

    1. Upload the jar

    2. Create a context

    3. Submit the job

    You can also skip creating a context and submit the job directly as shown above; in that case a default context is created, and it takes all of the jobserver's remaining resources.

    6.4 Configuration file

    Open the configuration file and you will see that master is set to local[4]; change it to our cluster address.
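
    For reference, a sketch of the relevant fragment, assuming the HOCON file shipped as config/local.conf.template (keys can differ between jobserver versions):

    spark {
      # changed from local[4] to the standalone master
      master = "spark://a01:7077"

      jobserver {
        port = 8090
      }
    }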

    There are also settings for how and where data objects are stored.

    These settings can be overridden by the parameters passed when the REST interface is started from sbt again.

    That covers basic usage; jobServer deployment and project usage are described below.

    6.5 Deployment

    Copy config/local.sh.template to local.sh and set the relevant parameters. The jobserver can be deployed to multiple hosts, and you can specify the install directory, Spark home, Spark conf and other settings.

    # Environment and deploy file
    # For use with bin/server_deploy, bin/server_package etc.
    DEPLOY_HOSTS="a01"
    
    APP_USER=hadoop
    APP_GROUP=hadoop
    # optional SSH Key to login to deploy server
    #SSH_KEY=/path/to/keyfile.pem
    INSTALL_DIR=/opt/soft/job-server
    LOG_DIR=/opt/soft/job-server/logs
    PIDFILE=spark-jobserver.pid
    SPARK_HOME=/opt/soft/spark
    SPARK_CONF_DIR=$SPARK_HOME/conf
    # Only needed for Mesos deploys
    #SPARK_EXECUTOR_URI=/usr/spark/spark-1.4.0-bin-hadoop2.4.tgz
    # Only needed for YARN running outside of the cluster
    # You will need to COPY these files from your cluster to the remote machine
    # Normally these are kept on the cluster in /etc/hadoop/conf
    # YARN_CONF_DIR=/pathToRemoteConf/conf
    SCALA_VERSION=2.11.7

    Deploying the jobserver involves a long wait. To make it easier, set up passwordless SSH between the hosts first.
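
    The deploy script takes an environment name and looks for the matching config/<name>.sh; a sketch assuming the file above was saved as config/local.sh:

    $ bin/server_deploy.sh local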

    6.6 Startup

    ./server_start.sh
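
    Once it is up, a quick check from any machine (assuming the deploy host a01 and the default port 8090):

    $ curl a01:8090/jars    # lists the uploaded jars; empty right after a fresh deploy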