  • Spark Learning 11 (WordCount program: local testing)

    The WordCount program

    The input file wordcount.txt:

    hello wujiadong
    hello spark
    hello hadoop
    hello python

    Program example

    package wujiadong_sparkCore
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
      * Created by Administrator on 2017/2/25.
      */
    object LocalSpark {
      def main(args: Array[String]): Unit = {
        // Step 1: create a SparkConf object holding the application's configuration.
        // setMaster() sets the URL of the Spark cluster master the application connects to;
        // "local" runs the application locally. setMaster only needs to be set like this
        // when running inside the IDE (e.g. IDEA).
        val conf = new SparkConf().setAppName("localspark").setMaster("local")
        // Create the SparkContext. SparkContext is the entry point to all of Spark's
        // functionality: it initializes the core components the application needs,
        // including the schedulers (DAGScheduler, TaskScheduler), and registers the
        // application with the Spark master.
        val sc = new SparkContext(conf)
        // Local input file
        val file = "C:/Users/Administrator.USER-20160219OS/Desktop/wordcount.txt"
        // Create an initial RDD from the input source (an HDFS file, a local file, etc.).
        // An RDD consists of elements; here each element corresponds to one line of the file.
        val lines = sc.textFile(file)
        // Apply transformations to the initial RDD.
        // First split each line into individual words.
        val wordRDD = lines.flatMap(line => line.split(" "))
        // Map each word to a (word, 1) pair, so that the word can later act as the key
        // when accumulating each word's occurrence count.
        val wordpair = wordRDD.map(word => (word, 1))
        // Count the occurrences of each word (reduce over the word keys).
        val result = wordpair.reduceByKey(_ + _)
        // Finally trigger the job with an action, e.g. foreach.
        result.foreach(wordNumberPair => println(wordNumberPair._1 + " , " + wordNumberPair._2))
        // Release resources.
        sc.stop()
      }
    }
    
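    The same pipeline can also be written as one chained expression. The sketch below is an alternative formulation, not part of the original program; it assumes `sc` and `file` are defined as above, additionally sorts the results by descending count, and uses collect() to bring the small result set back to the driver before printing:
    
        // Compact variant of the same word count, sorted by occurrence count.
        val counts = sc.textFile(file)
          .flatMap(_.split(" "))           // split each line into words
          .map((_, 1))                     // pair each word with an initial count of 1
          .reduceByKey(_ + _)              // sum the counts per word
          .sortBy(_._2, ascending = false) // most frequent words first
        counts.collect().foreach { case (word, n) => println(word + " , " + n) }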

    Run output

    "C:Program FilesJavajdk1.8.0_101injava" -Dspark.master=local -Didea.launcher.port=7532 "-Didea.launcher.bin.path=C:Program Files (x86)JetBrainsIntelliJ IDEA Community Edition 2016.3.3in" -Dfile.encoding=UTF-8 -classpath "C:Program FilesJavajdk1.8.0_101jrelibcharsets.jar;C:Program FilesJavajdk1.8.0_101jrelibdeploy.jar;C:Program FilesJavajdk1.8.0_101jrelibextaccess-bridge-64.jar;C:Program FilesJavajdk1.8.0_101jrelibextcldrdata.jar;C:Program FilesJavajdk1.8.0_101jrelibextdnsns.jar;C:Program FilesJavajdk1.8.0_101jrelibextjaccess.jar;C:Program FilesJavajdk1.8.0_101jrelibextjfxrt.jar;C:Program FilesJavajdk1.8.0_101jrelibextlocaledata.jar;C:Program FilesJavajdk1.8.0_101jrelibext
    ashorn.jar;C:Program FilesJavajdk1.8.0_101jrelibextsunec.jar;C:Program FilesJavajdk1.8.0_101jrelibextsunjce_provider.jar;C:Program FilesJavajdk1.8.0_101jrelibextsunmscapi.jar;C:Program FilesJavajdk1.8.0_101jrelibextsunpkcs11.jar;C:Program FilesJavajdk1.8.0_101jrelibextzipfs.jar;C:Program FilesJavajdk1.8.0_101jrelibjavaws.jar;C:Program FilesJavajdk1.8.0_101jrelibjce.jar;C:Program FilesJavajdk1.8.0_101jrelibjfr.jar;C:Program FilesJavajdk1.8.0_101jrelibjfxswt.jar;C:Program FilesJavajdk1.8.0_101jrelibjsse.jar;C:Program FilesJavajdk1.8.0_101jrelibmanagement-agent.jar;C:Program FilesJavajdk1.8.0_101jrelibplugin.jar;C:Program FilesJavajdk1.8.0_101jrelib
    esources.jar;C:Program FilesJavajdk1.8.0_101jrelib
    t.jar;D:wujiadong.sparkoutproductionwujiadong.spark;C:Program Files (x86)scalalibscala-library.jar;C:Program Files (x86)scalalibscala-reflect.jar;F:spark-assembly-1.5.1-hadoop2.6.0.jar;C:Program Files (x86)JetBrainsIntelliJ IDEA Community Edition 2016.3.3libidea_rt.jar" com.intellij.rt.execution.application.AppMain wujiadong_sparkCore.LocalSpark
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    17/03/04 20:41:21 INFO SparkContext: Running Spark version 1.5.1
    17/03/04 20:41:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    17/03/04 20:41:24 INFO SecurityManager: Changing view acls to: Administrator
    17/03/04 20:41:24 INFO SecurityManager: Changing modify acls to: Administrator
    17/03/04 20:41:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Administrator); users with modify permissions: Set(Administrator)
    17/03/04 20:41:27 INFO Slf4jLogger: Slf4jLogger started
    17/03/04 20:41:27 INFO Remoting: Starting remoting
    17/03/04 20:41:28 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.11.25.3:64151]
    17/03/04 20:41:29 INFO Utils: Successfully started service 'sparkDriver' on port 64151.
    17/03/04 20:41:29 INFO SparkEnv: Registering MapOutputTracker
    17/03/04 20:41:30 INFO SparkEnv: Registering BlockManagerMaster
    17/03/04 20:41:31 INFO DiskBlockManager: Created local directory at C:\Users\Administrator.USER-20160219OS\AppData\Local\Temp\blockmgr-8339dad4-0230-405c-8ff3-f28fe073b327
    17/03/04 20:41:35 INFO MemoryStore: MemoryStore started with capacity 972.5 MB
    17/03/04 20:41:38 INFO HttpFileServer: HTTP File server directory is C:\Users\Administrator.USER-20160219OS\AppData\Local\Temp\spark-7aef918f-fd75-4153-833e-f29def7f1805\httpd-e95baaaa-f8c5-43e3-be14-8b45a90fce45
    17/03/04 20:41:38 INFO HttpServer: Starting HTTP Server
    17/03/04 20:41:40 INFO Utils: Successfully started service 'HTTP file server' on port 64166.
    17/03/04 20:41:40 INFO SparkEnv: Registering OutputCommitCoordinator
    17/03/04 20:41:42 INFO Utils: Successfully started service 'SparkUI' on port 4040.
    17/03/04 20:41:42 INFO SparkUI: Started SparkUI at http://10.11.25.3:4040
    17/03/04 20:41:43 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
    17/03/04 20:41:43 INFO Executor: Starting executor ID driver on host localhost
    17/03/04 20:41:47 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 64205.
    17/03/04 20:41:47 INFO NettyBlockTransferService: Server created on 64205
    17/03/04 20:41:47 INFO BlockManagerMaster: Trying to register BlockManager
    17/03/04 20:41:47 INFO BlockManagerMasterEndpoint: Registering block manager localhost:64205 with 972.5 MB RAM, BlockManagerId(driver, localhost, 64205)
    17/03/04 20:41:47 INFO BlockManagerMaster: Registered BlockManager
    17/03/04 20:41:52 INFO MemoryStore: ensureFreeSpace(130448) called with curMem=0, maxMem=1019782103
    17/03/04 20:41:52 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 127.4 KB, free 972.4 MB)
    17/03/04 20:41:53 INFO MemoryStore: ensureFreeSpace(14276) called with curMem=130448, maxMem=1019782103
    17/03/04 20:41:53 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 972.4 MB)
    17/03/04 20:41:53 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:64205 (size: 13.9 KB, free: 972.5 MB)
    17/03/04 20:41:53 INFO SparkContext: Created broadcast 0 from textFile at LocalSpark.scala:20
    17/03/04 20:41:56 INFO FileInputFormat: Total input paths to process : 1
    17/03/04 20:41:56 INFO SparkContext: Starting job: foreach at LocalSpark.scala:29
    17/03/04 20:41:58 INFO DAGScheduler: Registering RDD 3 (map at LocalSpark.scala:25)
    17/03/04 20:41:58 INFO DAGScheduler: Got job 0 (foreach at LocalSpark.scala:29) with 1 output partitions
    17/03/04 20:41:58 INFO DAGScheduler: Final stage: ResultStage 1(foreach at LocalSpark.scala:29)
    17/03/04 20:41:58 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
    17/03/04 20:41:58 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
    17/03/04 20:41:58 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at LocalSpark.scala:25), which has no missing parents
    17/03/04 20:41:59 INFO MemoryStore: ensureFreeSpace(4120) called with curMem=144724, maxMem=1019782103
    17/03/04 20:41:59 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.0 KB, free 972.4 MB)
    17/03/04 20:41:59 INFO MemoryStore: ensureFreeSpace(2337) called with curMem=148844, maxMem=1019782103
    17/03/04 20:41:59 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.3 KB, free 972.4 MB)
    17/03/04 20:41:59 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:64205 (size: 2.3 KB, free: 972.5 MB)
    17/03/04 20:41:59 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:861
    17/03/04 20:41:59 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at LocalSpark.scala:25)
    17/03/04 20:41:59 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
    17/03/04 20:42:00 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 2164 bytes)
    17/03/04 20:42:00 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
    17/03/04 20:42:00 INFO HadoopRDD: Input split: file:/C:/Users/Administrator.USER-20160219OS/Desktop/wordcount.txt:0+54
    17/03/04 20:42:00 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
    17/03/04 20:42:00 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
    17/03/04 20:42:00 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
    17/03/04 20:42:00 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
    17/03/04 20:42:00 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
    17/03/04 20:42:01 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2253 bytes result sent to driver
    17/03/04 20:42:01 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1448 ms on localhost (1/1)
    17/03/04 20:42:01 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
    17/03/04 20:42:01 INFO DAGScheduler: ShuffleMapStage 0 (map at LocalSpark.scala:25) finished in 1.633 s
    17/03/04 20:42:01 INFO DAGScheduler: looking for newly runnable stages
    17/03/04 20:42:01 INFO DAGScheduler: running: Set()
    17/03/04 20:42:01 INFO DAGScheduler: waiting: Set(ResultStage 1)
    17/03/04 20:42:01 INFO DAGScheduler: failed: Set()
    17/03/04 20:42:01 INFO DAGScheduler: Missing parents for ResultStage 1: List()
    17/03/04 20:42:01 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at LocalSpark.scala:27), which is now runnable
    17/03/04 20:42:01 INFO MemoryStore: ensureFreeSpace(2224) called with curMem=151181, maxMem=1019782103
    17/03/04 20:42:01 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.2 KB, free 972.4 MB)
    17/03/04 20:42:01 INFO MemoryStore: ensureFreeSpace(1380) called with curMem=153405, maxMem=1019782103
    17/03/04 20:42:01 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1380.0 B, free 972.4 MB)
    17/03/04 20:42:01 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:64205 (size: 1380.0 B, free: 972.5 MB)
    17/03/04 20:42:01 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:861
    17/03/04 20:42:01 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at LocalSpark.scala:27)
    17/03/04 20:42:01 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
    17/03/04 20:42:01 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1901 bytes)
    17/03/04 20:42:01 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
    17/03/04 20:42:02 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
    17/03/04 20:42:02 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 55 ms
    spark , 1
    wujiadong , 1
    hadoop , 1
    python , 1
    hello , 4
    17/03/04 20:42:02 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1165 bytes result sent to driver
    17/03/04 20:42:02 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 367 ms on localhost (1/1)
    17/03/04 20:42:02 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
    17/03/04 20:42:02 INFO DAGScheduler: ResultStage 1 (foreach at LocalSpark.scala:29) finished in 0.370 s
    17/03/04 20:42:02 INFO DAGScheduler: Job 0 finished: foreach at LocalSpark.scala:29, took 5.915115 s
    17/03/04 20:42:02 INFO SparkContext: Invoking stop() from shutdown hook
    17/03/04 20:42:02 INFO SparkUI: Stopped Spark web UI at http://10.11.25.3:4040
    17/03/04 20:42:02 INFO DAGScheduler: Stopping DAGScheduler
    17/03/04 20:42:02 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    17/03/04 20:42:03 INFO MemoryStore: MemoryStore cleared
    17/03/04 20:42:03 INFO BlockManager: BlockManager stopped
    17/03/04 20:42:03 INFO BlockManagerMaster: BlockManagerMaster stopped
    17/03/04 20:42:03 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    17/03/04 20:42:03 INFO SparkContext: Successfully stopped SparkContext
    17/03/04 20:42:03 INFO ShutdownHookManager: Shutdown hook called
    17/03/04 20:42:03 INFO ShutdownHookManager: Deleting directory C:\Users\Administrator.USER-20160219OS\AppData\Local\Temp\spark-7aef918f-fd75-4153-833e-f29def7f1805
    
    Process finished with exit code 0
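    Printing with foreach is fine for a local smoke test; to keep the results, the action can be swapped for saveAsTextFile, which writes one part file per partition into a new output directory. A minimal sketch, assuming the same `result` RDD as above (the output path is hypothetical, and the directory must not exist yet):
    
        // Persist the (word, count) pairs instead of printing them.
        // saveAsTextFile fails if the target directory already exists.
        result.saveAsTextFile("C:/Users/Administrator.USER-20160219OS/Desktop/wordcount_output")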