zoukankan      html  css  js  c++  java
  • Spark Standalone模式应用程序开发

      在本博客的《Spark高速入门指南(Quick Start Spark)》文章中简单地介绍了怎样通过Spark shell来高速地运用API。本文将介绍怎样高速地利用Spark提供的API开发Standalone模式的应用程序。Spark支持三种程序语言的开发:Scala (利用SBT进行编译), Java (利用Maven进行编译)以及Python。以下我将分别用Scala、Java和Python开发相同功能的程序:

    一、Scala版本号:

    程序例如以下:

    01package scala
    02/**
    03 * User: 过往记忆
    04 * Date: 14-6-10
    05 * Time: 下午11:37
    08 * 过往记忆博客,专注于hadoop、hive、spark、shark、flume的技术博客,大量的干货
    09 * 过往记忆博客微信公共帐号:iteblog_hadoop
    10 */
    11import org.apache.spark.SparkContext
    12import org.apache.spark.SparkConf
    13object Test {
    14    def main(args: Array[String]) {
    15      val logFile = "file:///spark-bin-0.9.1/README.md"
    16      val conf = new SparkConf().setAppName("Spark Application in Scala")
    17      val sc = new SparkContext(conf)
    18      val logData = sc.textFile(logFile, 2).cache()
    19      val numAs = logData.filter(line => line.contains("a")).count()
    20      val numBs = logData.filter(line => line.contains("b")).count()
    21      println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
    22    }
    23  }
    24}

    为了编译这个文件,须要创建一个xxx.sbt文件,这个文件相似于pom.xml文件,这里我们创建一个scala.sbt文件,内容例如以下:

    1name := "Spark application in Scala"
    2version := "1.0"
    3scalaVersion := "2.10.4"
    4libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"
    5resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

    编译:

    1# sbt/sbt package
    2[info] Done packaging.
    3[success] Total time: 270 s, completed Jun 11, 2014 1:05:54 AM
    二、Java版本号
    01/**
    02 * User: 过往记忆
    03 * Date: 14-6-10
    04 * Time: 下午11:37
    07 * 过往记忆博客,专注于hadoop、hive、spark、shark、flume的技术博客,大量的干货
    08 * 过往记忆博客微信公共帐号:iteblog_hadoop
    09 */
    10/* SimpleApp.java */
    11import org.apache.spark.api.java.*;
    12import org.apache.spark.SparkConf;
    13import org.apache.spark.api.java.function.Function;
    14 
    15public class SimpleApp {
    16    public static void main(String[] args) {
    17        String logFile = "file:///spark-bin-0.9.1/README.md";
    18        SparkConf conf =new SparkConf().setAppName("Spark Application in Java");
    19        JavaSparkContext sc = new JavaSparkContext(conf);
    20        JavaRDD<String> logData = sc.textFile(logFile).cache();
    21 
    22        long numAs = logData.filter(new Function<String, Boolean>() {
    23            public Boolean call(String s) { return s.contains("a"); }
    24        }).count();
    25 
    26        long numBs = logData.filter(new Function<String, Boolean>() {
    27            public Boolean call(String s) { return s.contains("b"); }
    28        }).count();
    29 
    30        System.out.println("Lines with a: " + numAs +",lines with b: " + numBs);
    31    }
    32}

    本程序分别统计README.md文件里包括a和b的行数。本项目的pom.xml文件内容例如以下:

    01<?xml version="1.0" encoding="UTF-8"?>
    03         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    04         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
    05 
    06http://maven.apache.org/xsd/maven-4.0.0.xsd">
    07 
    08    <modelVersion>4.0.0</modelVersion>
    09 
    10    <groupId>spark</groupId>
    11    <artifactId>spark</artifactId>
    12    <version>1.0</version>
    13 
    14    <dependencies>
    15        <dependency>
    16            <groupId>org.apache.spark</groupId>
    17            <artifactId>spark-core_2.10</artifactId>
    18            <version>1.0.0</version>
    19        </dependency>
    20    </dependencies>
    21</project>

    利用Maven来编译这个工程:

    1# mvn install
    2[INFO] ------------------------------------------------------------------------
    3[INFO] BUILD SUCCESS
    4[INFO] ------------------------------------------------------------------------
    5[INFO] Total time: 5.815s
    6[INFO] Finished at: Wed Jun 11 00:01:57 CST 2014
    7[INFO] Final Memory: 13M/32M
    8[INFO] ------------------------------------------------------------------------
    三、Python版本号
    01#
    02# User: 过往记忆
    03# Date: 14-6-10
    04# Time: 下午11:37
    05# bolg: http://www.iteblog.com
    06# 本文地址:http://www.iteblog.com/archives/1041
    07# 过往记忆博客,专注于hadoop、hive、spark、shark、flume的技术博客,大量的干货
    08# 过往记忆博客微信公共帐号:iteblog_hadoop
    09#
    10from pyspark import SparkContext
    11 
    13sc = SparkContext("local", "Spark Application in Python")
    14logData = sc.textFile(logFile).cache()
    15 
    16numAs = logData.filter(lambda s: 'a' in s).count()
    17numBs = logData.filter(lambda s: 'b' in s).count()
    18 
    19print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
    四、測试执行

    本程序的程序环境是Spark 1.0.0,单机模式,測试例如以下:
    1、測试Scala版本号的程序

    1# bin/spark-submit --class "scala.Test" 
    2                   --master local[4]   
    3              target/scala-2.10/simple-project_2.10-1.0.jar
    4 
    514/06/11 01:07:53 INFO spark.SparkContext: Job finished:
    6count at Test.scala:18, took 0.019705 s
    7Lines with a: 62, Lines with b: 35

    2、測试Java版本号的程序

    1# bin/spark-submit --class "SimpleApp" 
    2                   --master local[4]   
    3              target/spark-1.0-SNAPSHOT.jar
    4 
    514/06/11 00:49:14 INFO spark.SparkContext: Job finished:
    6count at SimpleApp.java:22, took 0.019374 s
    7Lines with a: 62, lines with b: 35

    3、測试Python版本号的程序

    1# bin/spark-submit --master local[4]   
    2                simple.py
    3 
    4Lines with a: 62, lines with b: 35

    本文地址:《Spark Standalone模式应用程序开发》:http://www.iteblog.com/archives/1041,过往记忆,大量关于Hadoop、Spark等个人原创技术博客本博客文章除特别声明,所有都是原创!

    尊重原创,转载请注明: 转载自过往记忆(http://www.iteblog.com/)
    本文链接地址: 《Spark Standalone模式应用程序开发》(http://www.iteblog.com/archives/1041)
    E-mail:wyphao.2007@163.com    

  • 相关阅读:
    第03组 团队git现场编程实战
    第二次结对编程作业
    团队项目-选题报告
    第一次结对编程作业
    第一次编程作业
    软件工程第一次作业
    第09组 团队Git现场编程实战
    第二次结对编程作业
    团队项目-需求分析报告
    团队项目-选题报告
  • 原文地址:https://www.cnblogs.com/mfrbuaa/p/4330708.html
Copyright © 2011-2022 走看看