  • Spark: read a txt file, structure it, and run SQL on it

    1. Set up IDEA with Scala configured; you need the Spark SQL dependency on the classpath. Note: if your Spark project already runs, don't copy my POM; the code below can be used as-is.

     

    --------------- Here is the POM. I'm using Spark 2.4.3, with spark-core and spark-sql built for Scala 2.11 (the _2.11 artifact suffix).

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
    
        <groupId>com.alpha3</groupId>
        <artifactId>Scala008</artifactId>
        <version>1.0-SNAPSHOT</version>
        <dependencies>
            <!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
            <!-- Change the dependencies below to match your own Scala, Spark, and Hadoop versions -->
            <dependency>
                <groupId>org.scala-lang</groupId>
                <artifactId>scala-library</artifactId>
                <version>2.11.12</version>
            </dependency>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_2.11</artifactId>
                <version>2.4.3</version>
            </dependency>
            <dependency>
                <groupId>org.apache.hbase</groupId>
                <artifactId>hbase-client</artifactId>
                <version>1.2.0</version>
                <exclusions>
                    <exclusion>
                        <artifactId>hadoop-common</artifactId>
                        <groupId>org.apache.hadoop</groupId>
                    </exclusion>
                    <exclusion>
                        <artifactId>netty-all</artifactId>
                        <groupId>io.netty</groupId>
                    </exclusion>
                </exclusions>
            </dependency>
    
        <!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-server -->
            <dependency>
                <groupId>org.apache.hbase</groupId>
                <artifactId>hbase-server</artifactId>
                <version>1.2.0</version>
                <exclusions>
                    <exclusion>
                        <artifactId>hadoop-client</artifactId>
                        <groupId>org.apache.hadoop</groupId>
                    </exclusion>
                    <exclusion>
                        <artifactId>netty-all</artifactId>
                        <groupId>io.netty</groupId>
                    </exclusion>
                    <exclusion>
                        <artifactId>hadoop-common</artifactId>
                        <groupId>org.apache.hadoop</groupId>
                    </exclusion>
                </exclusions>
            </dependency>
    
            <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-client</artifactId>
                <version>2.6.0</version>
            </dependency>
    
            <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib -->
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-mllib_2.11</artifactId>
                <version>2.4.3</version>
            </dependency>
    
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-sql_2.11</artifactId>
                <version>2.4.3</version>
            </dependency>
    
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-hive_2.11</artifactId>
                <version>2.4.3</version>
            </dependency>
    
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-streaming_2.11</artifactId>
                <version>2.4.3</version>
            </dependency>
    
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
                <version>2.4.3</version>
            </dependency>
    
            <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql-kafka-0-10 -->
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
                <version>2.4.3</version>
            </dependency>
    
    
    
            <dependency>
                <groupId>org.apache.hive</groupId>
                <artifactId>hive-jdbc</artifactId>
                <version>0.13.0</version>
            </dependency>
    
    
            <!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->
            <dependency>
                <groupId>mysql</groupId>
                <artifactId>mysql-connector-java</artifactId>
                <version>5.1.38</version>
            </dependency>
        </dependencies>
        <build>
            <!-- Main source root; adjust to your own layout. Add a testSourceDirectory too if you have test sources. -->
            <sourceDirectory>src/main/scala</sourceDirectory>
            <plugins>
                <plugin>
                    <groupId>org.scala-tools</groupId>
                    <artifactId>maven-scala-plugin</artifactId>
                    <version>2.15.2</version>
                    <executions>
                        <execution>
                            <goals>
                                <goal>compile</goal>
                                <goal>testCompile</goal>
                            </goals>
                        </execution>
                    </executions>
                </plugin>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-shade-plugin</artifactId>
                    <version>2.4.3</version>
                    <executions>
                        <execution>
                            <phase>package</phase>
                            <goals>
                                <goal>shade</goal>
                            </goals>
                            <configuration>
                                <filters>
                                    <filter>
                                        <artifact>*:*</artifact>
                                        <excludes>
                                            <exclude>META-INF/*.SF</exclude>
                                            <exclude>META-INF/*.DSA</exclude>
                                            <exclude>META-INF/*.RSA</exclude>
                                        </excludes>
                                    </filter>
                                </filters>
                                <!--<transformers>-->
                                <!--<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">-->
                                <!--<mainClass></mainClass>-->
                                <!--</transformer>-->
                                <!--</transformers>-->
                            </configuration>
                        </execution>
                    </executions>
                </plugin>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <configuration>
                        <source>1.8</source>
                        <target>1.8</target>
                    </configuration>
                </plugin>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-jar-plugin</artifactId>
                    <configuration>
                        <archive>
                            <manifest>
                                <addClasspath>true</addClasspath>
                                <useUniqueVersions>false</useUniqueVersions>
                                <classpathPrefix>lib/</classpathPrefix>
                            <!-- Change to your own package.ClassName (right-click the class -> Copy Reference) -->
                                <mainClass>com.me.Scala008</mainClass>
                            </manifest>
                        </archive>
                    </configuration>
                </plugin>
            </plugins>
        </build>
    </project>
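
    With the shade plugin configured above, packaging should produce a single runnable fat jar; just make sure mainClass points at your own class. A typical build command, assuming a standard Maven installation:

    mvn clean package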

    2. Second step: create the runnable object. The original calls this a "companion class"; in Scala it is simply an object whose main method can be executed directly.

    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    import org.apache.spark.sql.{Row, SparkSession}
    
    object Spark_File_to_SQL {
      def main(args: Array[String]): Unit = {
        // Silence Spark's verbose INFO logging
        import org.apache.log4j.{Level, Logger}
        Logger.getLogger("org").setLevel(Level.OFF)
    
        val ss = SparkSession
          .builder()
          .appName("Scala009")
          .master("local")
          .getOrCreate()
    
        // Get the SparkContext from the session
        val sc = ss.sparkContext
    
        val rdd = sc.textFile("D:\\aa.txt") // the backslash must be escaped in the Windows path
        val mapRDD = rdd.map { line =>
          val fields = line.split(" ") // each line is "ip user"; split once instead of twice
          Row(fields(0), fields(1))
        }
        val sf1 = StructField("ip", StringType, true)   // column 1: the IP
        val sf2 = StructField("user", StringType, true) // column 2: the user

        val table_sch = StructType(Array(sf1, sf2)) // table schema: two nullable string columns, ip and user

        val df = ss.createDataFrame(mapRDD, table_sch) // map the split rows onto the schema to get a DataFrame with named columns
    
        df.createTempView("cyber")           // register the DataFrame as a temp view named cyber

        ss.sql("select * from cyber").show() // query the view and print it

        println("done")
        ss.stop() // shut down the local Spark session
    
      }
    }
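
    As a side note, the same structure can be obtained without hand-building Row objects: Spark's DataFrame reader can apply the schema while parsing the file. Below is a minimal sketch of that variant, assuming the same space-separated file and column names as above (the object name Spark_File_to_SQL_Reader is made up for illustration):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    object Spark_File_to_SQL_Reader {
      def main(args: Array[String]): Unit = {
        val ss = SparkSession.builder().appName("Scala009").master("local").getOrCreate()

        // Same two nullable string columns as in the manual version above
        val schema = StructType(Array(
          StructField("ip", StringType, true),
          StructField("user", StringType, true)))

        // The reader splits on the separator itself, so no manual Row construction is needed
        val df = ss.read
          .schema(schema)
          .option("sep", " ")
          .csv("D:\\aa.txt")

        df.createOrReplaceTempView("cyber")
        ss.sql("select * from cyber").show()
        ss.stop()
      }
    }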

    Run result: in either version, show() prints the cyber view as a two-column table (ip, user), followed by the completion message.

    PS: I created a plain text file at D:\aa.txt, where each line is an IP, a space, and a user name.
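
    For illustration, a made-up aa.txt in that format could look like this (the addresses and user names are hypothetical):

    192.168.0.1 alice
    192.168.0.2 bob
    10.0.0.7 carol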


  • Original post: https://www.cnblogs.com/alpha-cat/p/11722661.html