  • spark

     # Spark is a fast and general engine for large-scale data processing.

    # Spark libraries: Spark SQL, Spark Streaming, MLlib, GraphX

    YARN

    ./bin/run-example SparkPi 10


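    ./bin/spark-shell --master spark://IP:PORT    # connect to a standalone cluster
    ./bin/spark-shell                             # run locally
    http://192.168.1.112:8080/                    # standalone master web UI (default port)
    http://192.168.1.112:4040/                    # web UI of the running application (default port)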


    RDD (Resilient Distributed Dataset)

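    # Create an RDD from a file on HDFS
    val textFile = sc.textFile("hdfs://localhost:9000/user/root/BUILDING.txt")
    textFile.count()    // number of lines in the file
    textFile.first()    // first line of the file
    textFile.filter(line => line.contains("hadoop")).count()    // number of lines containing "hadoop"
    // Word count: split lines into words, pair each word with 1, sum the pairs by word
    val count = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    count.collect()     // bring the (word, count) pairs back to the driver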
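
The same word count can be sketched as a standalone program instead of spark-shell input. This is only an illustration under assumptions: the object name `WordCountSketch`, the helper `wordCounts`, the `local[*]` master, and the in-memory sample lines (standing in for the HDFS file) are not from the original notes.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  // Hypothetical helper: runs the flatMap / map / reduceByKey chain
  // over in-memory lines and returns the counts as a local Map.
  def wordCounts(sc: SparkContext, lines: Seq[String]): Map[String, Int] = {
    val rdd = sc.parallelize(lines)        // stand-in for sc.textFile("hdfs://...")
    rdd.flatMap(line => line.split(" "))   // transformation: lines -> words
      .filter(_.nonEmpty)                  // drop empty strings left by repeated spaces
      .map(word => (word, 1))              // transformation: word -> (word, 1)
      .reduceByKey(_ + _)                  // transformation: sum the 1s per word
      .collect()                           // action: this is what triggers the job
      .toMap
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run in local mode rather than against a cluster.
    val conf = new SparkConf().setAppName("WordCountSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val counts = wordCounts(sc, Seq("hadoop spark", "spark yarn hadoop", "spark"))
      counts.toSeq.sortBy(_._1).foreach { case (word, n) => println(s"$word\t$n") }
    } finally {
      sc.stop()                            // always release the context
    }
  }
}
```

Nothing is computed until `collect()` is called; the earlier calls only describe the computation.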

    Some concepts
    --------------------------------
    RDD (Resilient Distributed Dataset)
    Task: the unit of execution, run on one partition. There are two kinds, ShuffleMapTask and ResultTask, which are roughly analogous to Map and Reduce in Hadoop.
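    Job: the computation triggered by one action (e.g. count(), collect()); a job is split into stages.
    Stage: a set of tasks that run the same code on different partitions; stage boundaries fall at shuffles.
    Partition: a slice of an RDD's data; each task processes one partition.
    NarrowDependency: each partition of the parent RDD is used by at most one partition of the child RDD (e.g. map, filter).
    ShuffleDependency: a child partition depends on many parent partitions, so data must be shuffled (e.g. reduceByKey).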
    DAG (Directed Acyclic Graph)
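
One way to see NarrowDependency and ShuffleDependency in practice is to inspect an RDD's `dependencies` and its `toDebugString` lineage. The sketch below is an illustration only: the object name `LineageSketch`, the helper `dependencyKinds`, the sample data, and the `local[2]` master are assumptions, not part of the original notes.

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LineageSketch {
  // Hypothetical helper: label each direct dependency of an RDD.
  def dependencyKinds(rdd: RDD[_]): Seq[String] =
    rdd.dependencies.map {
      case _: ShuffleDependency[_, _, _] => "shuffle"
      case _: NarrowDependency[_]        => "narrow"
      case other                         => other.getClass.getSimpleName
    }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two worker threads.
    val conf = new SparkConf().setAppName("LineageSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val words  = sc.parallelize(Seq("a", "b", "a"), numSlices = 2)
      val pairs  = words.map(w => (w, 1))   // map keeps partitions aligned with the parent
      val counts = pairs.reduceByKey(_ + _) // reduceByKey regroups data by key

      println(s"pairs:  ${dependencyKinds(pairs)}")
      println(s"counts: ${dependencyKinds(counts)}")
      println(counts.toDebugString)         // prints the lineage (the DAG) of counts
    } finally {
      sc.stop()
    }
  }
}
```

With this input, `pairs` should report a narrow dependency and `counts` a shuffle dependency; the shuffle is where Spark cuts the job into two stages.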

    Core functions
    --------------------------------
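    SparkContext: the entry point to Spark. It connects to the cluster and creates RDDs; in spark-shell it is already available as sc.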
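
Outside spark-shell, a program has to create and stop its own SparkContext. A minimal sketch, assuming local mode; the object name `ContextSketch`, the helper `withLocalContext`, and the sample numbers are hypothetical and only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ContextSketch {
  // Hypothetical helper: create a local SparkContext, run the body, always stop the context.
  def withLocalContext[A](appName: String)(body: SparkContext => A): A = {
    val conf = new SparkConf().setAppName(appName).setMaster("local[2]")
    val sc = new SparkContext(conf)
    try body(sc) finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    val total = withLocalContext("ContextSketch") { sc =>
      println(s"master=${sc.master} app=${sc.appName}")
      val nums = sc.parallelize(1 to 100, numSlices = 4)  // RDD built from a local range
      println(s"partitions=${nums.getNumPartitions}")
      nums.reduce(_ + _)                                  // action: sum all elements
    }
    println(s"sum=$total")
  }
}
```

To run against a standalone cluster instead, the master would be a `spark://IP:PORT` URL, usually passed on the command line rather than hard-coded.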


    hadoop-2.7.2/etc/hadoop/core-site.xml

  • Original article: https://www.cnblogs.com/weiweifeng/p/7489436.html