zoukankan      html  css  js  c++  java
  • spark1.3.1使用基础教程 分类: B8_SPARK 2015-04-28 11:10 1651人阅读 评论(0) 收藏


    spark可以通过交互式命令行及编程两种方式来进行调用:
    前者支持scala与python
    后者支持scala、python与java

    本文参考https://spark.apache.org/docs/latest/quick-start.html,可作快速入门

    再详细资料及用法请见https://spark.apache.org/docs/latest/programming-guide.html

    建议学习路径:

    1、安装单机环境:http://blog.csdn.net/jediael_lu/article/details/45310321

    2、快速入门,有简单的印象:本文http://blog.csdn.net/jediael_lu/article/details/45333195

    3、学习scala

    4、深入一点:https://spark.apache.org/docs/latest/programming-guide.html

    5、找其它专业资料或者在使用中学习


    一、基础介绍
    1、spark的所有操作均是基于RDD(Resilient Distributed Dataset)进行的,其中R(弹性)的意思为可以方便的在内存和存储间进行交换。
    2、RDD的操作可以分为2类:transformation 和 action,其中前者从一个RDD生成另一个RDD(如filter),后者对RDD生成一个结果(如count)。

    二、命令行方式
    只要机器上有java与scala,spark无需任何安装、配置,就可直接运行,可以首先通过运行:

    ./bin/run-example SparkPi 10

    检查是否正常,然后再使用下面介绍的shell来执行job。

    1、快速入门
    $ ./bin/spark-shell

    (1)先将一个文件读入一个RDD中,然后统计这个文件的行数及显示第一行。
    scala> var textFile = sc.textFile("/mnt/jediael/spark-1.3.1-bin-hadoop2.6/README.md")
    textFile: org.apache.spark.rdd.RDD[String] = /mnt/jediael/spark-1.3.1-bin-hadoop2.6/README.md MapPartitionsRDD[1] at textFile at <console>:21

    scala> textFile.count()
    res0: Long = 98

    scala> textFile.first();
    res1: String = # Apache Spark

    (2)统计包含spark的行数
    scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
    linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:23

    scala> linesWithSpark.count()
    res0: Long = 19

    (3)以上的filter与count可以组合使用
    scala> textFile.filter(line => line.contains("Spark")).count()
    res1: Long = 19

    2、深入一点
    (1)使用map统计每一行的单词数量,reduce找出最大的那一行所包括的单词数量
    scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
    res2: Int = 14

    (2)在scala中直接调用java包
    scala> import java.lang.Math
    import java.lang.Math

    scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
    res2: Int = 14

    (3)wordcount的实现
    scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
    wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:24

    scala> wordCounts.collect()
    res4: Array[(String, Int)] = Array((package,1), (For,2), (processing.,1), (Programs,1), (Because,1), (The,1), (cluster.,1), (its,1), ([run,1), (APIs,1), (computation,1), (Try,1), (have,1), (through,1), (several,1), (This,2), ("yarn-cluster",1), (graph,1), (Hive,2), (storage,1), (["Specifying,1), (To,2), (page](http://spark.apache.org/documentation.html),1), (Once,1), (application,1), (prefer,1), (SparkPi,2), (engine,1), (version,1), (file,1), (documentation,,1), (processing,,2), (the,21), (are,1), (systems.,1), (params,1), (not,1), (different,1), (refer,2), (Interactive,2), (given.,1), (if,4), (build,3), (when,1), (be,2), (Tests,1), (Apache,1), (all,1), (./bin/run-example,2), (programs,,1), (including,3), (Spark.,1), (package.,1), (1000).count(),1), (HDFS,1), (Versions,1), (Data.,1), (>...

    3、缓存:将RDD写入缓存会大大提高处理效率
    scala> linesWithSpark.cache()
    res5: linesWithSpark.type = MapPartitionsRDD[2] at filter at <console>:23
    scala> linesWithSpark.count()
    res8: Long = 19

    三、编码

    scala代码,还不熟悉,以后再运行

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.SparkConf

    object SimpleApp {
      def main(args: Array[String]) {
        val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
        val conf = new SparkConf().setAppName("Simple Application")
        val sc = new SparkContext(conf)
        val logData = sc.textFile(logFile, 2).cache()
        val numAs = logData.filter(line => line.contains("a")).count()
        val numBs = logData.filter(line => line.contains("b")).count()
        println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
      }
    }

    版权声明:本文为博主原创文章,未经博主允许不得转载。

  • 相关阅读:
    Outlook同步问题
    Excel下三角图解的绘制
    数据库,SQL,万恶之源?
    新年第一篇
    ArcGIS的GeoProcessing的原理及实现(1)
    如何在多个文件中查找需要的信息
    关于GIS门户(GIS Portal)的概念
    2004总结
    MapViewControl更新
    GCDPlot 0.22 介绍
  • 原文地址:https://www.cnblogs.com/lujinhong2/p/4637193.html
Copyright © 2011-2022 走看看