Ways to run Spark on K8S
- Standalone: start a long-running Spark cluster on K8S and submit all jobs to it via spark-submit
- Kubernetes Native: submit with spark-submit directly to the K8S API Server; once resources are granted, Pods are started as the Driver and Executors to run the job (a minimal submit command is sketched after this list), see http://spark.apache.org/docs/2.4.6/running-on-kubernetes.html
- Spark Operator: install the Spark Operator, define a spark-app.yaml, then run kubectl apply -f spark-app.yaml; this declarative API and invocation style is the typical way applications are run on K8S, see https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
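For reference, a Kubernetes Native submission looks roughly like this (a sketch only: the API server address, container image, and jar path are placeholders to adapt, and the image must already contain Spark):
bin/spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=2 \
    --conf spark.kubernetes.container.image=<spark-image> \
    local:///path/to/spark-examples.jar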
Start Minikube
sudo minikube start --driver=none \
    --image-repository=registry.cn-hangzhou.aliyuncs.com/google_containers \
    --kubernetes-version="v1.15.3"
If startup fails, try deleting the cluster first with minikube delete
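To confirm the cluster is up (assuming minikube has written the kubectl config for root, since the commands here run with sudo):
sudo minikube status
sudo kubectl get nodes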
Download Spark
https://archive.apache.org/dist/spark/
Spark is closely tied to Hadoop, so it is best to download a build bundled with Hadoop; that way the Hadoop jars are available, otherwise you may hit missing-package or missing-class errors even if you never actually use Hadoop
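For example, to fetch and unpack the Spark 2.4.6 build bundled with Hadoop 2.7 (these version numbers are just the ones this article uses, adjust to the release you want):
wget https://archive.apache.org/dist/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz
tar -zxf spark-2.4.6-bin-hadoop2.7.tgz
cd spark-2.4.6-bin-hadoop2.7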
Build Spark Image
Since Spark 2.3, the bin/docker-image-tool.sh script is provided for building the image
sudo ./bin/docker-image-tool.sh -t my_spark_2.4_hadoop_2.7 build
You may hit errors like the following
WARNING: Ignoring http://dl-cdn.alpinelinux.org/alpine/v3.9/main/x86_64/APKINDEX.tar.gz: temporary error (try again later)
WARNING: Ignoring http://dl-cdn.alpinelinux.org/alpine/v3.9/community/x86_64/APKINDEX.tar.gz: temporary error (try again later)
ERROR: unsatisfiable constraints:
bash (missing):
required by: world[bash]
This is a network problem. Edit ./bin/docker-image-tool.sh and add --network=host to the docker build command inside it so that the build container uses the host network (make sure the host network itself is OK)
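The exact line varies a bit between Spark versions, but the change is just inserting --network=host into the docker build invocation inside the script, roughly like this (a sketch; the variable names follow the 2.4 script and may differ elsewhere), and afterwards you can check that the image was built:
docker build --network=host "${BUILD_ARGS[@]}" \
    -t $(image_ref spark) \
    -f "$BASEDOCKERFILE" .

sudo docker images | grep spark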
Start the Cluster
Define the manifest
---
apiVersion: v1
kind: Service
metadata:
  name: spark-manager
spec:
  type: ClusterIP
  ports:
  - name: rpc
    port: 7077
  - name: ui
    port: 8080
  selector:
    app: spark
    component: sparkmanager
---
apiVersion: v1
kind: Service
metadata:
  name: spark-manager-rest
spec:
  type: NodePort
  ports:
  - name: rest
    port: 8080
    targetPort: 8080
  selector:
    app: spark
    component: sparkmanager
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-manager
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark
      component: sparkmanager
  template:
    metadata:
      labels:
        app: spark
        component: sparkmanager
    spec:
      containers:
      - name: sparkmanager
        image: spark:my_spark_2.4_hadoop_2.7
        workingDir: /opt/spark
        command: ["/bin/bash", "-c", "/opt/spark/sbin/start-master.sh && while true;do echo hello;sleep 6000;done"]
        ports:
        - containerPort: 7077
          name: rpc
        - containerPort: 8080
          name: ui
        livenessProbe:
          tcpSocket:
            port: 7077
          initialDelaySeconds: 30
          periodSeconds: 60
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spark
      component: worker
  template:
    metadata:
      labels:
        app: spark
        component: worker
    spec:
      containers:
      - name: sparkworker
        image: spark:my_spark_2.4_hadoop_2.7
        workingDir: /opt/spark
        command: ["/bin/bash", "-c", "/opt/spark/sbin/start-slave.sh spark://spark-manager:7077 && while true;do echo hello;sleep 6000;done"]
Start it
sudo kubectl create -f standalone.yaml
Check the pod status
spark-manager-cfc7f9fb-679tc 1/1 Running 0 16s
spark-worker-6f55fddc87-sgnfh 1/1 Running 0 16s
spark-worker-6f55fddc87-w5zgm 1/1 Running 0 16s
Check the services
spark-manager ClusterIP 10.108.230.84 <none> 7077/TCP,8080/TCP 6m16s
spark-manager-rest NodePort 10.106.200.126 <none> 8080:30277/TCP 6m16s
Check the rest service details
lin@lin-VirtualBox:~$ sudo kubectl get svc spark-manager-rest
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
spark-manager-rest NodePort 10.106.200.126 <none> 8080:30277/TCP 7m59s
Open 10.106.200.126:8080 to reach the Spark Manager Web UI, which shows the worker information and job information
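If the ClusterIP is not directly reachable from your machine, the UI can also be reached through the NodePort (30277 above) on the node's IP, or by forwarding the service port locally:
sudo kubectl port-forward svc/spark-manager-rest 8080:8080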
To see more detailed job information you also need to start the Spark history server
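A minimal sketch of doing that inside the Spark container, assuming a local event-log directory (in practice it should be shared storage that both the jobs and the history server can read):
# enable event logging for submitted jobs
mkdir -p /tmp/spark-events
echo "spark.eventLog.enabled true" >> /opt/spark/conf/spark-defaults.conf
echo "spark.eventLog.dir file:///tmp/spark-events" >> /opt/spark/conf/spark-defaults.conf

# point the history server at the same directory and start it
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=file:///tmp/spark-events"
/opt/spark/sbin/start-history-server.sh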
Submit a Job
Log into one of the workers
sudo kubectl exec -t -i spark-worker-6f55fddc87-w5zgm /bin/bash
Submit the job
# the second wordcount.py is passed as the input file argument
bin/spark-submit \
    --master spark://spark-manager:7077 \
    --num-executors 2 \
    --name spark-test \
    /opt/spark/examples/src/main/python/wordcount.py \
    /opt/spark/examples/src/main/python/wordcount.py
Note that in standalone mode Python does not support cluster deploy mode, i.e. the driver always runs in the container where spark-submit is executed
Log
The driver log is printed by the spark-submit command itself
The executor logs are spread under the work directory of each worker, e.g.
/opt/spark/work/app-20200727062422-0002/0/stderr
app-20200727062422-0002 is the application ID of the job; it can be found on the Web UI and also in the driver log
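For example, to read an executor's stderr from outside the pod (using the pod name and application ID shown above; the file only exists on the worker that actually ran that executor):
sudo kubectl exec -t -i spark-worker-6f55fddc87-w5zgm -- \
    cat /opt/spark/work/app-20200727062422-0002/0/stderr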