  • Problems when running jupyter-notebook in YARN mode, and how to fix them

    https://blog.csdn.net/weixin_37353303/article/details/84313473
    (Original CSDN post: problems when running jupyter-notebook in YARN mode, and the fixes)

    So far I had only run standalone programs with PySpark on a single virtual machine; now I wanted to try distributed computation. I read books and blog posts before starting, but kept hitting one problem after another and could not get it to work, so here is a record of the whole process.
    There are two virtual machines in total: one acts as master, the other as slave1.

    Installing Spark on the slave1 VM
    slave1 already has Hadoop installed and can run Hadoop cluster jobs successfully, so I will not go over that again here.
    Copy the Spark installation directory from master to slave1, then:
    (1) Go into the spark/conf directory, copy slaves.template to slaves, and add slave1 to it.

    (2) Add the Spark paths to /etc/profile (the added lines were shown in a screenshot in the original post, which is not preserved here).

    Both master and slave1 need steps (1) and (2); see the sketch just below.
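
    A minimal sketch of steps (1) and (2), run as root. The /hadoop/spark location matches the paths that show up in the logs later in this post; everything else is adjustable:

    # On master: copy the Spark installation over to slave1
    scp -r /hadoop/spark root@slave1:/hadoop/

    # Step (1), in spark/conf on both master and slave1: register the worker node
    cd /hadoop/spark/conf
    cp slaves.template slaves
    echo "slave1" >> slaves

    # Step (2), on both machines: put Spark on the PATH via /etc/profile
    echo 'export SPARK_HOME=/hadoop/spark' >> /etc/profile
    echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> /etc/profile
    source /etc/profile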

    Installing Anaconda on slave1
    You can simply scp the Anaconda installation over from master and then edit /etc/profile accordingly (the original screenshot showed the added lines; it is not preserved, so a sketch follows below).
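
    Roughly, assuming Anaconda sits under /root/anaconda3 (the actual path and profile lines were only in the lost screenshot, so treat these as placeholders):

    # On master: copy the Anaconda installation to slave1
    scp -r /root/anaconda3 root@slave1:/root/

    # On both machines: point PySpark at Anaconda's Python via /etc/profile
    echo 'export ANACONDA_HOME=/root/anaconda3' >> /etc/profile
    echo 'export PATH=$ANACONDA_HOME/bin:$PATH' >> /etc/profile
    echo 'export PYSPARK_PYTHON=$ANACONDA_HOME/bin/python' >> /etc/profile
    source /etc/profile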

    Starting everything up — this is where most of the problems appeared
    Run start-all.sh in the master terminal and check with jps: both master and slave1 come up normally.
    Then, in the master terminal:
    HADOOP_CONF_DIR=/hadoop/hadoop/etc/hadoop PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" MASTER=yarn-client pyspark
    According to the references, if HADOOP_CONF_DIR is not set in spark-env.sh, it has to be passed on the command line as above. With that, jupyter-notebook started fine, but when I typed sc.master in a notebook to check which mode it was running in, it threw a pile of errors:

    [root@master home]# HADOOP_CONF_IR=/hadoop/hadoop/etc/hadoop PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark

    [I 18:58:24.475 NotebookApp] [nb_conda_kernels] enabled, 2 kernels found
    [I 18:58:25.101 NotebookApp] ✓ nbpresent HTML export ENABLED
    [W 18:58:25.101 NotebookApp] ✗ nbpresent PDF export DISABLED: No module named 'nbbrowserpdf'
    [I 18:58:25.163 NotebookApp] [nb_anacondacloud] enabled
    [I 18:58:25.167 NotebookApp] [nb_conda] enabled
    [I 18:58:25.167 NotebookApp] Serving notebooks from local directory: /home
    [I 18:58:25.167 NotebookApp] 0 active kernels
    [I 18:58:25.168 NotebookApp] The Jupyter Notebook is running at: http://localhost:8888/
    [I 18:58:25.168 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
    [I 18:58:33.844 NotebookApp] Kernel started: c15aabde-b441-45f2-b78d-9933e6534c27
    Exception in thread "main" java.lang.Exception: When running with master 'yarn-client' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
            at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:263)
            at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:240)
            at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:116)
            at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
            at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    [IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /hadoop/spark/python/pyspark/shell.py:
    [I 19:00:33.829 NotebookApp] Saving file at /Untitled2.ipynb
    [I 19:00:57.754 NotebookApp] Creating new notebook in
    [I 19:00:59.174 NotebookApp] Kernel started: ebfbdfd5-2343-4149-9fef-28877967d6c6
    Exception in thread "main" java.lang.Exception: When running with master 'yarn-client' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
            at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:263)
            at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:240)
            at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:116)
            at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
            at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    [IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file /hadoop/spark/python/pyspark/shell.py:
    [I 19:01:12.315 NotebookApp] Saving file at /Untitled3.ipynb
    ^C[I 19:01:15.971 NotebookApp] interrupted
    Serving notebooks from local directory: /home
    2 active kernels
    The Jupyter Notebook is running at: http://localhost:8888/
    Shutdown this notebook server (y/[n])? y
    [C 19:01:17.674 NotebookApp] Shutdown confirmed
    [I 19:01:17.675 NotebookApp] Shutting down kernels
    [I 19:01:18.189 NotebookApp] Kernel shutdown: ebfbdfd5-2343-4149-9fef-28877967d6c6
    [I 19:01:18.190 NotebookApp] Kernel shutdown: c15aabde-b441-45f2-b78d-9933e6534c27
    The key line in the log is:

    Exception in thread "main" java.lang.Exception: When running with master 'yarn-client' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.

    So I configured spark-env.sh:
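
    The addition to spark-env.sh looks roughly like this; the Hadoop config path is the one used in the long command above, and /hadoop/spark is the Spark home that appears in the log:

    # /hadoop/spark/conf/spark-env.sh (copy it from spark-env.sh.template if it does not exist yet),
    # on both master and slave1:
    export HADOOP_CONF_DIR=/hadoop/hadoop/etc/hadoop
    export YARN_CONF_DIR=/hadoop/hadoop/etc/hadoop

    With that in place, the launch no longer needs HADOOP_CONF_DIR on the command line, e.g. PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master yarn.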

    Running it again produced two new errors (the original screenshots of them are not preserved).

    I looked up both errors: some sources said the problem was insufficient memory, others that at least two cores are needed.
    For the memory issue, two properties are added to yarn-site.xml — they were the last two entries in the screenshot of the original post, which is also lost, so a typical pair is sketched below.
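
    Since the screenshot with those two properties did not survive, the pair below is only an assumption about the usual culprits: YARN's physical/virtual memory checks killing containers on a small VM. If used, they go inside the <configuration> element of /hadoop/hadoop/etc/hadoop/yarn-site.xml on every node, followed by a YARN restart:

    <!-- Assumed fix: stop the NodeManager from killing containers over memory limits -->
    <property>
      <name>yarn.nodemanager.pmem-check-enabled</name>
      <value>false</value>
    </property>
    <property>
      <name>yarn.nodemanager.vmem-check-enabled</name>
      <value>false</value>
    </property>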


    I also changed the VM settings to give slave1 two processors, so it now has two cores.
    The same errors still showed up, however.
    I kept tweaking things (I am honestly not sure what changed in between), ran it again, and this time got a different error:

    [root@master hadoop]# pyspark --master yarn
    Following the hints in the log, I kept digging. When I checked disk usage with

    hadoop dfsadmin -report

    everything came back as zero:
    Configured Capacity: 0 (0 B)
    Present Capacity: 0 (0 B)
    DFS Remaining: 0 (0 B)
    DFS Used: 0 (0 B)
    DFS Used%: NaN%
    Under replicated blocks: 0
    Blocks with corrupt replicas: 0
    Missing blocks: 0
    So I reformatted the namenode. Since HDFS had come into the picture, I also tweaked hdfs-site.xml, changing the replication value from 1 to 2 (see the sketch below), and then ran start-all.sh once more:
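
    In hdfs-site.xml the change is just the replication factor, inside the <configuration> element:

    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>

    And the reformat-and-restart itself (note that reformatting the namenode erases HDFS metadata, which is fine on a throwaway test cluster but not elsewhere):

    stop-all.sh
    hdfs namenode -format
    start-all.sh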

    [root@master bin]# hadoop dfsadmin -report
    DEPRECATED: Use of this script to execute hdfs command is deprecated.
    Instead use the hdfs command for it.

    Configured Capacity: 18238930944 (16.99 GB)
    Present Capacity: 6707884032 (6.25 GB)
    DFS Remaining: 6707879936 (6.25 GB)
    DFS Used: 4096 (4 KB)
    DFS Used%: 0.00%
    Under replicated blocks: 0
    Blocks with corrupt replicas: 0
    Missing blocks: 0
    -------------------------------------------------
    Live datanodes (1):

    Name: 192.168.127.131:50010 (slave1)
    Hostname: slave1
    Decommission Status : Normal
    Configured Capacity: 18238930944 (16.99 GB)
    DFS Used: 4096 (4 KB)
    Non DFS Used: 11531046912 (10.74 GB)
    DFS Remaining: 6707879936 (6.25 GB)
    DFS Used%: 0.00%
    DFS Remaining%: 36.78%
    Configured Cache Capacity: 0 (0 B)
    Cache Used: 0 (0 B)
    Cache Remaining: 0 (0 B)
    Cache Used%: 100.00%
    Cache Remaining%: 0.00%
    Xceivers: 1
    Last contact: Tue Nov 20 21:26:11 CST 2018
    Now, typing in the terminal

    pyspark --master yarn

    worked — to my pleasant surprise, the results finally came out.
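
    To double-check that it is really running on YARN — the same sc.master question from earlier — a quick test in the notebook (a sketch; the exact string depends on the Spark version):

    # In the notebook / pyspark shell:
    sc.master                             # expect 'yarn' or 'yarn-client'
    sc.parallelize(range(1000)).sum()     # a tiny job that should run on the YARN executors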

  • Original post: https://www.cnblogs.com/timssd/p/12720328.html