  • Miscellaneous Spark notes

    Spark comes with several sample programs. Scala, Java, Python and R examples are in the examples/src/main directory. To run one of the Java or Scala sample programs, use bin/run-example <class> [params] in the top-level Spark directory. (Behind the scenes, this invokes the more general spark-submit script for launching applications). For example,

     ./bin/run-example <class> [params]   (run this from the top-level Spark directory)

    Example: ./bin/run-example SparkPi 10


    You can also run Spark interactively through a modified version of the Scala shell. This is a great way to learn the framework.

    Spark master URL

    ./bin/spark-shell --master local[2]

    The --master option specifies the master URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing. For a full list of options, run Spark shell with the --help option.

    In short: --master selects between a cluster master URL, local (one thread), and local[N] (N threads); start with local for testing, and run the Spark shell with --help for all options. The common master URL forms are listed below, followed by a short Scala sketch of setting the master programmatically.

    local              run locally with a single thread
    local[K]           run locally with K threads (cores)
    local[*]           run locally with as many threads as there are available cores
    spark://HOST:PORT  connect to the given Spark standalone cluster master; the port is required
    mesos://HOST:PORT  connect to the given Mesos cluster; the port is required
    yarn-client        connect to a YARN cluster in client mode; HADOOP_CONF_DIR must be configured
    yarn-cluster       connect to a YARN cluster in cluster mode; HADOOP_CONF_DIR must be configured
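
    Besides passing --master on the command line, a standalone application can set the master URL in code. A minimal Scala sketch of this (the object name is made up for illustration, and the Spark 1.x API is assumed):

    import org.apache.spark.{SparkConf, SparkContext}

    object MasterUrlDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("MasterUrlDemo")
          .setMaster("local[2]")   // could also be e.g. "spark://HOST:7077" or "yarn-client"
        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 100).count())   // quick sanity check: prints 100
        sc.stop()
      }
    }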

    spark-submit can be used to submit Java, Python, and R applications.

    For example: ./bin/spark-submit examples/src/main/python/pi.py 10

    Location of the Spark examples

    [root@master mllib]# locate SparkPi
    /root/traffic-platform/spark-1.6.1/examples/src/main/java/org/apache/spark/examples/JavaSparkPi.java
    /root/traffic-platform/spark-1.6.1/examples/src/main/scala/org/apache/spark/examples/SparkPi.scala

    Python

    >>>textFile.filter(lambda line: "Spark" in line).count() # How many lines contain "Spark"?

    Scala

    scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?

    Example:

    sc.textFile("/lwtest/test.txt").filter(lambda line: "Spark" in line).count()

    ./bin/spark-shell
    Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let’s make a new RDD from the text of the README file in the Spark source directory:

    scala> val textFile = sc.textFile("README.md") // with HDFS as the default filesystem, this path resolves to a file on HDFS
    textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
    RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Let’s start with a few actions:

    scala> textFile.count() // Number of items in this RDD; here each item is a line
    res0: Long = 126

    scala> textFile.first() // First item in this RDD

    e.g.:

    scala> val textFile=sc.textFile("/lwtest/test.txt")

    scala> textFile.filter(line => line.contains("season")).count()

    e.g.: Let’s say we want to find the line with the most words:

    scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)

    This first maps a line to an integer value, creating a new RDD. reduce is called on that RDD to find the largest line count. The arguments to map and reduce are Scala function literals (closures), and can use any language feature or Scala/Java library. For example, we can easily call functions declared elsewhere. We’ll use Math.max() function to make this code easier to understand:

    This first maps each line to an integer, creating a new RDD; reduce is then called on that RDD to find the largest per-line word count. The arguments to map and reduce are Scala function literals (closures), and they can use any Scala/Java library; for example, we can simply call Math.max():

    scala> import java.lang.Math
    import java.lang.Math

    scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
    res5: Int = 15

    One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:

    A common data flow pattern is MapReduce, as popularized by Hadoop; Spark can implement MapReduce flows easily:

    scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
    wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8

    Here, we combined the flatMap, map, and reduceByKey transformations to compute the per-word counts in the file as an RDD of (String, Int) pairs. To collect the word counts in our shell, we can use the collect action:

    Here we combine the flatMap, map, and reduceByKey transformations to compute the per-word counts in the file as an RDD of (String, Int) pairs; to collect the word counts in the shell, we use the collect action:

    scala> wordCounts.collect()
    res6: Array[(String, Int)] = Array((means,1), (under,2), (this,3), (Because,1), (Python,2), (agree,1), (cluster.,1), ...)

    Linux find

    http://www.cnblogs.com/peida/archive/2012/11/16/2773289.html

    find . -name "*sbt*"

    Packaging with sbt and Maven
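
    For sbt, a minimal build.sbt along the following lines is enough to package an application against Spark 1.6.1 (the project name and Scala version are assumptions; adjust them to your environment). After running sbt package, the resulting jar can be launched with ./bin/spark-submit.

    // build.sbt -- minimal sketch for a Spark 1.6.1 application
    name := "simple-spark-app"

    version := "1.0"

    scalaVersion := "2.10.6"

    // "provided": spark-submit supplies the Spark runtime, so it is not bundled into the jar
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1" % "provided"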

    Reading files in Scala:

    By default sc.textFile("path") reads from HDFS; prefixing the path with hdfs:// also reads explicitly from the HDFS filesystem.
    To read from the local filesystem, prefix the path with file://, e.g. file:///home/user/spark/README.md
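
    A short spark-shell sketch of the three forms (the paths and the namenode host/port below are placeholders, not files that necessarily exist):

    val fromDefault = sc.textFile("/lwtest/test.txt")                   // default filesystem (HDFS in this setup)
    val fromHdfs    = sc.textFile("hdfs://master:9000/lwtest/test.txt") // explicit HDFS URI
    val fromLocal   = sc.textFile("file:///home/user/spark/README.md")  // local filesystem
    fromLocal.count()   // reading is lazy; this action triggers it and fails if the path does not exist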
