zoukankan      html  css  js  c++  java
  • Spark学习之Spark调优与调试(7)

    Spark学习之Spark调优与调试(7)

    1. 对Spark进行调优与调试通常需要修改Spark应用运行时配置的选项。

    当创建一个SparkContext时就会创建一个SparkConf实例。
    

    2. Spark特定的优先级顺序来选择实际配置:

    优先级最高的是在用户代码中显示调用set()方法设置选项;
    其次是通过spark-submit传递的参数;
    再次是写在配置文件里的值;
    最后是系统的默认值。
    

    3.查看应用进度信息和性能指标有两种方式:网页用户界面、驱动器和执行器进程生成的日志文件。

    4.Spark执行的组成部分:作业、任务和步骤

    需求:使用Spark shell完成简单的日志分析应用。
    
    scala> val input =sc.textFile("/home/spark01/Documents/input.text")
    input: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:27
    
    scala> val tokenized = input.map(line=>line.split(" ")).filter(words=>words.size>0)
    tokenized: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at filter at <console>:29
    
    scala> val counts = tokenized.map(words=>(words(0),1)).reduceByKey{(a,b)=>a+b}
    counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[7] at reduceByKey at <console>:31
    
    scala> // see RDD
    
    scala> input.toDebugString
    res0: String = 
    (1) MapPartitionsRDD[3] at textFile at <console>:27 []
     |  /home/spark01/Documents/input.text HadoopRDD[2] at textFile at <console>:27 []
    
    scala> counts.toDebugString
    res1: String = 
    (1) ShuffledRDD[7] at reduceByKey at <console>:31 []
     +-(1) MapPartitionsRDD[6] at map at <console>:31 []
        |  MapPartitionsRDD[5] at filter at <console>:29 []
        |  MapPartitionsRDD[4] at map at <console>:29 []
        |  MapPartitionsRDD[3] at textFile at <console>:27 []
        |  /home/spark01/Documents/input.text HadoopRDD[2] at textFile at <console>:27 []
    
    scala> counts.collect()
    res2: Array[(String, Int)] = Array((ERROR,1), (##input.text##,1), (INFO,4), ("",2), (WARN,2))
    
    scala> counts.cache()
    res3: counts.type = ShuffledRDD[7] at reduceByKey at <console>:31
    
    scala> counts.collect()
    res5: Array[(String, Int)] = Array((ERROR,1), (##input.text##,1), (INFO,4), ("",2), (WARN,2))
    
    scala>

    5. Spark网页用户界面

    默认情况地址是http://localhost:4040
    通过浏览器可以查看已经运行过的作业(job)的详细情况
    如图下图:
    

    所有任务
    图1所有任务用户界面
    这里写图片描述
    图二作业2详细信息用户界面

    6. 关键性能考量:

    代码层面:并行度、序列化格式、内存管理
    运行环境:硬件供给。
    
  • 相关阅读:
    实例
    LR接口测试---webservices
    LR常用函数整理
    Codeforces Round #639 (Div. 2) A. Puzzle Pieces
    Codeforces Round #640 (Div. 4)全部七题
    POJ3177 Redundant Paths(e-DCC+缩点)
    洛谷P3469 [POI2008]BLO-Blockade(割点)
    洛谷P3275 [SCOI2011]糖果(缩点+拓扑序DP)
    POJ1236 Network of Schools(强连通分量)
    P3387 【模板】缩点(Tarjan求强连通分量)
  • 原文地址:https://www.cnblogs.com/lanzhi/p/6467792.html
Copyright © 2011-2022 走看看