  • Reading a CSV file with Spark and Scala

    Save the following content as small_zipcode.csv:

    id,zipcode,type,city,state,population
    1,704,STANDARD,,PR,30100
    2,704,,PASEO COSTA DEL SUR,PR,
    3,709,,BDA SAN LUIS,PR,3700
    4,76166,UNIQUE,CINGULAR WIRELESS,TX,84000
    5,76177,STANDARD,,TX,
    ,,,,,
    7,76179,STANDARD,,TX,

    Open the spark-shell interactive command line and read the file into a DataFrame:

    val filePath="small_zipcode.csv"
    val df=spark.read.options(
      Map("inferSchema"->"true","delimiter"->",","header"->"true")).csv(filePath)
    
    scala> df.show
    +----+-------+--------+-------------------+-----+----------+
    |  id|zipcode|    type|               city|state|population|
    +----+-------+--------+-------------------+-----+----------+
    |   1|    704|STANDARD|               null|   PR|     30100|
    |   2|    704|    null|PASEO COSTA DEL SUR|   PR|      null|
    |   3|    709|    null|       BDA SAN LUIS|   PR|      3700|
    |   4|  76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
    |   5|  76177|STANDARD|               null|   TX|      null|
    |null|   null|    null|               null| null|      null|
    |   7|  76179|STANDARD|               null|   TX|      null|
    +----+-------+--------+-------------------+-----+----------+
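    
    The inferSchema option makes Spark take an extra pass over the data to guess each column's type. When the types are known in advance, a schema can be passed instead. The following is a minimal sketch, not part of the original post: the names zipSchema and dfTyped are hypothetical, and reading zipcode as StringType is a design choice of this sketch (it would preserve zero-padded codes if the file contained them).
    
    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
    
    // Inside spark-shell, getOrCreate() returns the session that already exists.
    val spark = SparkSession.builder().master("local[*]").appName("csv-schema").getOrCreate()
    
    // Assumed column types for small_zipcode.csv; every column is nullable
    // because the file contains empty fields.
    val zipSchema = StructType(Seq(
      StructField("id", IntegerType, nullable = true),
      StructField("zipcode", StringType, nullable = true),
      StructField("type", StringType, nullable = true),
      StructField("city", StringType, nullable = true),
      StructField("state", StringType, nullable = true),
      StructField("population", IntegerType, nullable = true)
    ))
    
    // header=true skips the first line; the schema replaces type inference.
    val dfTyped = spark.read
      .option("header", "true")
      .schema(zipSchema)
      .csv("small_zipcode.csv")
    
    dfTyped.printSchema()
    ```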
    
    scala> df.na.drop("all").show()
    +---+-------+--------+-------------------+-----+----------+
    | id|zipcode|    type|               city|state|population|
    +---+-------+--------+-------------------+-----+----------+
    |  1|    704|STANDARD|               null|   PR|     30100|
    |  2|    704|    null|PASEO COSTA DEL SUR|   PR|      null|
    |  3|    709|    null|       BDA SAN LUIS|   PR|      3700|
    |  4|  76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
    |  5|  76177|STANDARD|               null|   TX|      null|
    |  7|  76179|STANDARD|               null|   TX|      null|
    +---+-------+--------+-------------------+-----+----------+
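    
    drop("all") removes a row only when every column in it is null, which is why only the empty line of the file disappears above. The same rule can be limited to a subset of columns. The sketch below is an illustration under the assumption that small_zipcode.csv is in the working directory; the names hasCityOrPopulation, anyNonNull and sameAsDropAll are hypothetical.
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    // Inside spark-shell, getOrCreate() returns the session that already exists.
    val spark = SparkSession.builder().master("local[*]").appName("na-drop-all").getOrCreate()
    val df = spark.read
      .options(Map("inferSchema" -> "true", "delimiter" -> ",", "header" -> "true"))
      .csv("small_zipcode.csv")
    
    // "all" restricted to two columns: a row is dropped only when
    // BOTH city and population are null.
    val hasCityOrPopulation = df.na.drop("all", Seq("city", "population"))
    hasCityOrPopulation.show()
    
    // drop("all") written as an explicit filter:
    // keep rows in which at least one column is not null.
    val anyNonNull = df.columns.map(c => df(c).isNotNull).reduce(_ || _)
    val sameAsDropAll = df.filter(anyNonNull)
    sameAsDropAll.show()
    ```
    
    With the sample data, only the rows with id 1 to 4 have a city or a population, so those are the rows the first call keeps.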
    
    
    scala> df.na.drop().show()
    +---+-------+------+-----------------+-----+----------+
    | id|zipcode|  type|             city|state|population|
    +---+-------+------+-----------------+-----+----------+
    |  4|  76166|UNIQUE|CINGULAR WIRELESS|   TX|     84000|
    +---+-------+------+-----------------+-----+----------+
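    
    Called without arguments, drop() behaves like drop("any"): a row is removed as soon as one of its columns is null, so only the row with id 4 survives. When that is too strict, the other overloads of df.na may fit better. The following is a hedged sketch, assuming small_zipcode.csv is in the working directory; the variable names and the fill values "unknown" and 0 are assumptions made for illustration.
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    // Inside spark-shell, getOrCreate() returns the session that already exists.
    val spark = SparkSession.builder().master("local[*]").appName("na-drop-any").getOrCreate()
    val df = spark.read
      .options(Map("inferSchema" -> "true", "delimiter" -> ",", "header" -> "true"))
      .csv("small_zipcode.csv")
    
    // Same result as df.na.drop(): drop rows containing any null.
    val complete = df.na.drop("any")
    
    // Only look at the population column when deciding what to drop.
    val withPopulation = df.na.drop(Seq("population"))
    
    // Keep rows that have at least 5 non-null values.
    val mostlyComplete = df.na.drop(5)
    
    // Alternative to dropping: remove the fully empty row, then fill the gaps.
    val filled = df.na.drop("all")
      .na.fill("unknown", Seq("type", "city"))
      .na.fill(0L, Seq("population"))
    filled.show()
    ```
    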
    Reference:
    Many more Spark usage examples: https://sparkbyexamples.com/spark/spark-dataframe-drop-rows-with-null-values/
  • Original article: https://www.cnblogs.com/v5captain/p/14248659.html