zoukankan      html  css  js  c++  java
  • spark1.3.1使用基础教程



    spark可以通过交互式命令行及编程两种方式来进行调用:
    前者支持scala与python
    后者支持scala、python与java

    本文参考https://spark.apache.org/docs/latest/quick-start.html,可作快速入门

    再详细资料及用法请见https://spark.apache.org/docs/latest/programming-guide.html


    建议学习路径:

    1、安装单机环境:http://blog.csdn.net/jediael_lu/article/details/45310321

    2、快速入门,有简单的印象:本文http://blog.csdn.net/jediael_lu/article/details/45333195

    3、学习scala

    4、深入一点:https://spark.apache.org/docs/latest/programming-guide.html

    5、找其它专业资料或者在使用中学习


    一、基础介绍
    1、spark的所有操作均是基于RDD(Resilient Distributed Dataset)进行的,其中R(弹性)的意思为可以方便的在内存和存储间进行交换。
    2、RDD的操作可以分为2类:transformation 和 action,其中前者从一个RDD生成另一个RDD(如filter),后者对RDD生成一个结果(如count)。

    二、命令行方式

    1、快速入门
    $ ./bin/spark-shell

    (1)先将一个文件读入一个RDD中,然后统计这个文件的行数及显示第一行。
    scala> var textFile = sc.textFile("/mnt/jediael/spark-1.3.1-bin-hadoop2.6/README.md")
    textFile: org.apache.spark.rdd.RDD[String] = /mnt/jediael/spark-1.3.1-bin-hadoop2.6/README.md MapPartitionsRDD[1] at textFile at <console>:21

    scala> textFile.count()
    res0: Long = 98

    scala> textFile.first();
    res1: String = # Apache Spark

    (2)统计包含spark的行数
    scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
    linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:23

    scala> linesWithSpark.count()
    res0: Long = 19

    (3)以上的filter与count可以组合使用
    scala> textFile.filter(line => line.contains("Spark")).count()
    res1: Long = 19

    2、深入一点
    (1)使用map统计每一行的单词数量,reduce找出最大的那一行所包括的单词数量
    scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
    res2: Int = 14

    (2)在scala中直接调用java包
    scala> import java.lang.Math
    import java.lang.Math

    scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
    res2: Int = 14

    (3)wordcount的实现
    scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
    wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:24

    scala> wordCounts.collect()
    res4: Array[(String, Int)] = Array((package,1), (For,2), (processing.,1), (Programs,1), (Because,1), (The,1), (cluster.,1), (its,1), ([run,1), (APIs,1), (computation,1), (Try,1), (have,1), (through,1), (several,1), (This,2), ("yarn-cluster",1), (graph,1), (Hive,2), (storage,1), (["Specifying,1), (To,2), (page](http://spark.apache.org/documentation.html),1), (Once,1), (application,1), (prefer,1), (SparkPi,2), (engine,1), (version,1), (file,1), (documentation,,1), (processing,,2), (the,21), (are,1), (systems.,1), (params,1), (not,1), (different,1), (refer,2), (Interactive,2), (given.,1), (if,4), (build,3), (when,1), (be,2), (Tests,1), (Apache,1), (all,1), (./bin/run-example,2), (programs,,1), (including,3), (Spark.,1), (package.,1), (1000).count(),1), (HDFS,1), (Versions,1), (Data.,1), (>...

    3、缓存:将RDD写入缓存会大大提高处理效率
    scala> linesWithSpark.cache()
    res5: linesWithSpark.type = MapPartitionsRDD[2] at filter at <console>:23
    scala> linesWithSpark.count()
    res8: Long = 19

    三、编码

    scala代码,还不熟悉,以后再运行

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.SparkConf

    object SimpleApp {
      def main(args: Array[String]) {
        val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
        val conf = new SparkConf().setAppName("Simple Application")
        val sc = new SparkContext(conf)
        val logData = sc.textFile(logFile, 2).cache()
        val numAs = logData.filter(line => line.contains("a")).count()
        val numBs = logData.filter(line => line.contains("b")).count()
        println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
      }
    }


  • 相关阅读:
    03_ if 练习 _ little2big
    uva 11275 3D Triangles
    uva 12296 Pieces and Discs
    uvalive 3218 Find the Border
    uvalive 2797 Monster Trap
    uvalive 4992 Jungle Outpost
    uva 2218 Triathlon
    uvalive 3890 Most Distant Point from the Sea
    uvalive 4728 Squares
    uva 10256 The Great Divide
  • 原文地址:https://www.cnblogs.com/eaglegeek/p/4557794.html
Copyright © 2011-2022 走看看