Spark Examples

    Spark is built around distributed datasets that support two types of parallel operations: transformations, which are lazy and yield another distributed dataset (e.g., map, filter, and join), and actions, which force the computation of a dataset and return a result (e.g., count). The following examples show some of the available operations and features.
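    As a minimal sketch of this distinction (assuming, as in the examples below, that spark is an already-constructed SparkContext), transformations only describe a computation; nothing is read or computed until an action runs:

    val lines = spark.textFile("hdfs://...")               // transformation: nothing is read yet
    val longLines = lines.filter(line => line.length > 80) // transformation: still lazy
    longLines.count()                                       // action: triggers the actual computation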

    Text Search

    In this example, we search through the error messages in a log file:
    val file = spark.textFile("hdfs://...")
    val errors = file.filter(line => line.contains("ERROR"))
    // Count all the errors
    errors.count()
    // Count errors mentioning MySQL
    errors.filter(line => line.contains("MySQL")).count()
    // Fetch the MySQL errors as an array of strings
    errors.filter(line => line.contains("MySQL")).collect()

    The function literals (closures) such as line => line.contains("ERROR") are shipped automatically to the cluster; textFile, filter, count, and collect are Spark operations.

    In-Memory Text Search

    Spark can cache datasets in memory to speed up reuse. In the example above, we can load just the error messages in RAM using:

    errors.cache()

    After the first action that uses errors, later ones will be much faster.
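    For example, building on the text-search dataset above (a sketch of the usage pattern, not part of the original example):

    val file = spark.textFile("hdfs://...")
    val errors = file.filter(line => line.contains("ERROR"))
    errors.cache()   // mark the dataset to be kept in memory
    errors.count()   // first action: scans the file and populates the cache
    errors.count()   // later actions reuse the in-memory data and run much faster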

    Word Count

    In this example, we use a few more transformations to build a dataset of (String, Int) pairs called counts and then save it to a file.

    val file = spark.textFile("hdfs://...")
    val counts = file.flatMap(line => line.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")

    Estimating Pi

    Spark can also be used for compute-intensive tasks. This code estimates π by "throwing darts" at a circle. We pick random points in the unit square ((0, 0) to (1,1)) and see how many fall in the unit circle. The fraction should be π / 4, so we use this to get our estimate.

    val count = spark.parallelize(1 to NUM_SAMPLES).map { i =>
      val x = Math.random
      val y = Math.random
      if (x*x + y*y < 1) 1.0 else 0.0
    }.reduce(_ + _)
    println("Pi is roughly " + 4 * count / NUM_SAMPLES)

    Logistic Regression

    This is an iterative machine learning algorithm that seeks to find the best hyperplane that separates two sets of points in a multi-dimensional feature space. It can be used to classify messages into spam vs non-spam, for example. Because the algorithm applies the same MapReduce operation repeatedly to the same dataset, it benefits greatly from caching the input data in RAM across iterations.

    val points = spark.textFile(...).map(parsePoint).cache()
    var w = Vector.random(D) // current separating plane
    for (i <- 1 to ITERATIONS) {
      val gradient = points.map(p =>
        (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }
    println("Final separating plane: " + w)

    Note that w gets shipped automatically to the cluster with every map call.
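    The snippet is deliberately abbreviated: parsePoint, D, ITERATIONS, and the Vector class are assumed to exist. One hypothetical way to define the data points (the names Point, x, and y are illustrative assumptions, not part of the original code) is:

    // Hypothetical helpers assumed by the example above.
    case class Point(x: Vector, y: Double)   // x: feature vector, y: label (+1 or -1)

    def parsePoint(line: String): Point = {
      val fields = line.split(" ").map(_.toDouble)
      Point(new Vector(fields.init), fields.last)   // last field is the label
    }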

    The graph below compares the performance of this Spark program against a Hadoop implementation on 30 GB of data on an 80-core cluster, showing the benefit of in-memory caching:

    [Figure: Logistic regression performance in Spark vs Hadoop]
