zoukankan      html  css  js  c++  java
  • Spark Batch Demo

    原创转载请注明出处:https://www.cnblogs.com/agilestyle/p/14700811.html

    Project Directory

    Maven Dependency

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
    
        <groupId>org.fool</groupId>
        <artifactId>hellospark</artifactId>
        <version>1.0-SNAPSHOT</version>
    
        <properties>
            <maven.compiler.source>8</maven.compiler.source>
            <maven.compiler.target>8</maven.compiler.target>
        </properties>
    
        <dependencies>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_2.12</artifactId>
                <version>3.1.1</version>
            </dependency>
    
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-streaming_2.12</artifactId>
                <version>3.1.1</version>
            </dependency>
    
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-sql_2.12</artifactId>
                <version>3.1.1</version>
            </dependency>
        </dependencies>
    
        <build>
            <pluginManagement>
                <plugins>
                    <plugin>
                        <groupId>org.apache.maven.plugins</groupId>
                        <artifactId>maven-compiler-plugin</artifactId>
                        <version>3.8.1</version>
                    </plugin>
                </plugins>
            </pluginManagement>
        </build>
        
    </project>

    SpringBatchDemo.scala

    package org.fool.spark
    
    import org.apache.spark.sql.{SaveMode, SparkSession}
    
    object SparkBatchDemo {
      def main(args: Array[String]): Unit = {
        val hdfsSourcePath = "hdfs://127.0.0.1:9000/input/people.json"
    
        val sparkSession = SparkSession.builder().appName("SparkBatchDemo").master("local").getOrCreate();
    
        val df = sparkSession.read.json(hdfsSourcePath)
    
        df.show()
        df.printSchema()
    
        df.select("name").show()
        df.groupBy("age").count().show()
    
        val hdfsOutputPath = "hdfs://127.0.0.1:9000/output/"
    
        df.write.mode(SaveMode.Overwrite).json(hdfsOutputPath)
    
        sparkSession.close()
      }
    }

    Note: 先从HDFS读,然后show,最后再写回HDFS

    people.json

    {"name":"Caocao","create_time":"2021-06-22 01:45:52.478","update_time":"2021-06-22 02:45:52.478"}
    {"name":"Liubei", "age":30,"create_time":"2021-06-22 01:45:52.478","update_time":"2021-06-22 02:45:52.478"}
    {"name":"Guanyu", "age":19,"create_time":"2021-06-22 01:45:52.478","update_time":"2021-06-22 02:45:52.478"}
    {"name":"Zhangfei", "age":29,"create_time":"2021-06-22 01:45:52.478","update_time":"2021-06-22 02:45:52.478"}

    执行前需要把people.json文件上传到HDFS上

    cd $HADOOP_HOME/bin
    ./hdfs dfs -put ~/Desktop/people.json  /input

    Run

    HDFS Output

    查看HDFS中output目录下的结果

    强者自救 圣者渡人
  • 相关阅读:
    记录犯得最可笑的错误
    爬虫阶段内容总结
    docker_nginx_Elasticsearch
    git基础
    爬虫pearPro
    爬虫wangyiPro
    sunPro
    docker-compose终极搞定个人博客
    小程序下拉三个小点不显示问题
    vue鼠标拖动
  • 原文地址:https://www.cnblogs.com/agilestyle/p/14700811.html
Copyright © 2011-2022 走看看