zoukankan      html  css  js  c++  java
  • Spark简单使用案例-WordCount

    一、基本步骤

    1.观察数据集

    2.编写代码测试数据集

    3.固化代码、提交集群运行上线

    二、编写代码方式

    1.spark-shell

      ·数据集的探索

      ·测试

    2.独立应用

      ·上线,放在集群运行

    三、WordCount案例

    步骤:1.读取文件

          2.差分单词

          3.给与每个单词词频为1

          4.按照单词进行词频聚合

    相关代码:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
    
        <groupId>groupId</groupId>
        <artifactId>spark</artifactId>
        <version>1.0-SNAPSHOT</version>
    
        <properties>
            <scala.version>2.11.8</scala.version>
            <spark.version>2.2.0</spark.version>
            <slf4j.version>1.7.16</slf4j.version>
            <log4j.version>1.2.17</log4j.version>
        </properties>
    
        <dependencies>
            <dependency>
                <groupId>org.scala-lang</groupId>
                <artifactId>scala-library</artifactId>
                <version>${scala.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-common</artifactId>
                <version>2.7.7</version>
            </dependency>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-client</artifactId>
                <version>2.7.7</version>
            </dependency>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_2.11</artifactId>
                <version>${spark.version}</version>
            </dependency>
            <dependency>
                <groupId>org.slf4j</groupId>
                <artifactId>jcl-over-slf4j</artifactId>
                <version>${slf4j.version}</version>
            </dependency>
            <dependency>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-log4j12</artifactId>
                <version>${slf4j.version}</version>
            </dependency>
            <dependency>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-api</artifactId>
                <version>${slf4j.version}</version>
            </dependency>
            <dependency>
                <groupId>log4j</groupId>
                <artifactId>log4j</artifactId>
                <version>${log4j.version}</version>
            </dependency>
        </dependencies>
    
        <build>
            <sourceDirectory>src/main/scala</sourceDirectory>
            <testSourceDirectory>src/test/scala</testSourceDirectory>
    
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.0</version>
                    <configuration>
                        <source>1.8</source>
                        <target>1.8</target>
                        <encoding>UTF-8</encoding>
                    </configuration>
                </plugin>
    
                <plugin>
                    <groupId>net.alchem31.maven</groupId>
                    <artifactId>scala-maven-plugin</artifactId>
                    <version>3.2.0</version>
                    <executions>
                        <execution>
                            <goals>
                                <goal>compile</goal>
                                <goal>testCompile</goal>
                            </goals>
                            <configuration>
                                <args>
                                    <arg>-dependencyfile</arg>
                                    <arg>${project.build.directory}/.scala_dependencies</arg>
                                </args>
                            </configuration>
                        </execution>
                    </executions>
                </plugin>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-shade-plugin</artifactId>
                    <version>3.1.1</version>
                    <executions>
                        <execution>
                            <phase>package</phase>
                            <goals>
                                <goal>shade</goal>
                            </goals>
                            <configuration>
                                <filters>
                                    <filter>
                                        <artifact>*:*</artifact>
                                        <excludes>
                                            <exclude>META-INF/*.$F</exclude>
                                            <exclude>META-INF/*.DSA</exclude>
                                            <exclude>META_INF/*.RSA</exclude>
                                        </excludes>
                                    </filter>
                                </filters>
                            </configuration>
                        </execution>
                    </executions>
                </plugin>
            </plugins>
        </build>
    </project>
    pom.xml
    package cn.itcasr.spark
    
    import java.util
    import org.apache.spark.{SparkConf,SparkContext}
    object WordCount {
      def main(args:util.Arrays[String]):Unit={
        //1.创建SparkContext
        val conf=new SparkConf().setMaster("local[6]").setAppName("word_count")
        val sc=new SparkContext(conf)
        //2.加载数据文件
            //2.1准备文件
            //2.2读取文件
        val rdd1=sc.textFile(path="dataset/wordcount.txt");
        //3.处理
          //3.1把整句话拆分成多个单词
        val rdd2=rdd1.flatMap(item=>item.split(""))
          //3.2把每个单词指定一个词频
        val rdd3=rdd2.map(item=>(item,1))
          //3.3整合
        val rdd4=rdd3.reduceByKey(curr,agg)=>curr+agg)
        //4.得到结果
        val result=rdd4.collect()
        println(result)
      }
    }
    WordCount
  • 相关阅读:
    Oracle 11g SQL Fundamentals Training Introduction02
    Chapter 05Reporting Aggregated data Using the Group Functions 01
    Chapter 01Restriicting Data Using The SQL SELECT Statemnt01
    Oracle 11g SQL Fundamentals Training Introduction01
    Chapter 04Using Conversion Functions and Conditional ExpressionsConditional Expressions
    Unix时代的开创者Ken Thompson (zz.is2120.bg57iv3)
    我心目中计算机软件科学最小必读书目 (zz.is2120)
    北京将评估分时分区单双号限行 推进错时上下班 (zz)
    佳能G系列领军相机G1X
    选购单反相机的新建议——心民谈宾得K5(转)
  • 原文地址:https://www.cnblogs.com/hhjing/p/14310026.html
Copyright © 2011-2022 走看看