  • Spark Scala: dropping rows in which every column is null

    Drop the rows in which every column is null or NaN:

    df.na.drop("all")

    Drop the rows in which any column is null or NaN:

    df.na.drop("any")

    Example:

    scala> df.show
    +----+-------+--------+-------------------+-----+----------+
    |  id|zipcode|    type|               city|state|population|
    +----+-------+--------+-------------------+-----+----------+
    |   1|    704|STANDARD|               null|   PR|     30100|
    |   2|    704|    null|PASEO COSTA DEL SUR|   PR|      null|
    |   3|    709|    null|       BDA SAN LUIS|   PR|      3700|
    |   4|  76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
    |   5|  76177|STANDARD|               null|   TX|      null|
    |null|   null|    null|               null| null|      null|
    |   7|  76179|STANDARD|               null|   TX|      null|
    +----+-------+--------+-------------------+-----+----------+
    
    
    scala> df.na.drop("all").show()
    +---+-------+--------+-------------------+-----+----------+
    | id|zipcode|    type|               city|state|population|
    +---+-------+--------+-------------------+-----+----------+
    |  1|    704|STANDARD|               null|   PR|     30100|
    |  2|    704|    null|PASEO COSTA DEL SUR|   PR|      null|
    |  3|    709|    null|       BDA SAN LUIS|   PR|      3700|
    |  4|  76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
    |  5|  76177|STANDARD|               null|   TX|      null|
    |  7|  76179|STANDARD|               null|   TX|      null|
    +---+-------+--------+-------------------+-----+----------+
    
    
    scala> df.na.drop().show()
    +---+-------+------+-----------------+-----+----------+
    | id|zipcode|  type|             city|state|population|
    +---+-------+------+-----------------+-----+----------+
    |  4|  76166|UNIQUE|CINGULAR WIRELESS|   TX|     84000|
    +---+-------+------+-----------------+-----+----------+
    
    
    scala> df.na.drop("any").show()
    +---+-------+------+-----------------+-----+----------+
    | id|zipcode|  type|             city|state|population|
    +---+-------+------+-----------------+-----+----------+
    |  4|  76166|UNIQUE|CINGULAR WIRELESS|   TX|     84000|
    +---+-------+------+-----------------+-----+----------+

    Drop the rows in which the given columns are null:

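    // Read the column names to check, one per line.
    // sparkEnv is the author's own object; it is assumed to expose the SparkContext as sc.
    val nameArray = sparkEnv.sc.textFile("/master/abc.txt").collect()
    // Bind the result to a new name: `val df = df.na.drop(...)` would refer to itself.
    val cleaned = df.na.drop("all", nameArray)
    
    // "any" semantics restricted to two columns
    df.na.drop(Seq("population","type"))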
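
    The column-restricted calls above can be tried on a small in-memory table. The following is a minimal sketch, assuming a local SparkSession and a hand-built copy of the seven rows from the example output; the names `spark`, `ZipRow`, `zipcodes`, `keptAny` and `keptAll` are illustrative and do not come from the original post.

    ```scala
    import org.apache.spark.sql.SparkSession

    // Assumption: a local session for experimentation; in spark-shell, getOrCreate() reuses the existing one.
    val spark = SparkSession.builder().master("local[1]").appName("na-drop-by-column").getOrCreate()
    import spark.implicits._

    // Option[...] fields become nullable columns, so None is stored as null.
    type ZipRow = (Option[Int], Option[Int], Option[String], Option[String], Option[String], Option[Int])

    val zipcodes = Seq[ZipRow](
      (Some(1), Some(704),   Some("STANDARD"), None,                        Some("PR"), Some(30100)),
      (Some(2), Some(704),   None,             Some("PASEO COSTA DEL SUR"), Some("PR"), None),
      (Some(3), Some(709),   None,             Some("BDA SAN LUIS"),        Some("PR"), Some(3700)),
      (Some(4), Some(76166), Some("UNIQUE"),   Some("CINGULAR WIRELESS"),   Some("TX"), Some(84000)),
      (Some(5), Some(76177), Some("STANDARD"), None,                        Some("TX"), None),
      (None,    None,        None,             None,                        None,       None),
      (Some(7), Some(76179), Some("STANDARD"), None,                        Some("TX"), None)
    ).toDF("id", "zipcode", "type", "city", "state", "population")

    // "any" over two columns: a row survives only if both population and type are non-null.
    val keptAny = zipcodes.na.drop(Seq("population", "type"))

    // "all" over the same columns: a row is dropped only if both population and type are null.
    val keptAll = zipcodes.na.drop("all", Seq("population", "type"))

    keptAny.show()
    keptAll.show()
    ```

    With this data the first call should keep ids 1 and 4, and the second should keep ids 1, 3, 4, 5 and 7, since only rows 2 and 6 have both columns null.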

    Function signatures:

    def drop(): DataFrame
    Returns a new DataFrame that drops rows containing any null or NaN values.
    
    def drop(how: String): DataFrame
    Returns a new DataFrame that drops rows containing null or NaN values.
    If how is "any", then drop rows containing any null or NaN values. If how is "all", then drop rows only if every column is null or NaN for that row.
    
    def drop(how: String, cols: Seq[String]): DataFrame
    (Scala-specific) Returns a new DataFrame that drops rows containing null or NaN values in the specified columns.
    If how is "any", then drop rows containing any null or NaN values in the specified columns. If how is "all", then drop rows only if every specified column is null or NaN for that row.
    
    def drop(how: String, cols: Array[String]): DataFrame
    Returns a new DataFrame that drops rows containing null or NaN values in the specified columns.
    If how is "any", then drop rows containing any null or NaN values in the specified columns. If how is "all", then drop rows only if every specified column is null or NaN for that row.
    
    def drop(cols: Seq[String]): DataFrame
    (Scala-specific) Returns a new DataFrame that drops rows containing any null or NaN values in the specified columns.
    
    def drop(cols: Array[String]): DataFrame
    Returns a new DataFrame that drops rows containing any null or NaN values in the specified columns.
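
    The signatures above mention NaN alongside null, while the example table only contains nulls. The following is a minimal sketch, assuming a local SparkSession and a hypothetical two-column table, that shows a NaN in a Double column being treated as missing as well; `scores` and `cleaned` are illustrative names.

    ```scala
    import org.apache.spark.sql.SparkSession

    // Assumption: a local session for experimentation.
    val spark = SparkSession.builder().master("local[1]").appName("na-drop-nan").getOrCreate()
    import spark.implicits._

    // Hypothetical data: one valid score, one NaN, one null.
    val scores = Seq[(Int, Option[Double])](
      (1, Some(1.5)),
      (2, Some(Double.NaN)),
      (3, None)
    ).toDF("id", "score")

    // drop() with no arguments applies "any" semantics over all columns.
    val cleaned = scores.na.drop()
    cleaned.show()
    ```

    Only the row with id 1 should remain, because both the NaN row and the null row count as containing a missing value.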

    More function signatures:
    https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions
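
    One of the further overloads on that page is `drop(minNonNulls: Int)`, which keeps a row only if it has at least that many non-null, non-NaN values. The following is a minimal sketch, assuming a local SparkSession and a hypothetical three-column table; `people` and `atLeastTwo` are illustrative names.

    ```scala
    import org.apache.spark.sql.SparkSession

    // Assumption: a local session for experimentation.
    val spark = SparkSession.builder().master("local[1]").appName("na-drop-min-non-nulls").getOrCreate()
    import spark.implicits._

    // Hypothetical rows with 3, 2, 1 and 0 non-null values respectively.
    val people = Seq[(Option[Int], Option[String], Option[String])](
      (Some(1), Some("Ann"), Some("NY")),
      (Some(2), Some("Bob"), None),
      (Some(3), None,        None),
      (None,    None,        None)
    ).toDF("id", "name", "city")

    // Keep only rows that have at least 2 non-null values.
    val atLeastTwo = people.na.drop(2)
    atLeastTwo.show()
    ```

    This sits between "any" and "all": here the rows with ids 1 and 2 should be kept, while the rows with one or zero non-null values are dropped.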


    References:
    Many Spark usage examples: https://sparkbyexamples.com/spark/spark-dataframe-drop-rows-with-null-values/
    Example code and dataset: https://github.com/spark-examples/spark-scala-examples (CSV path: src/main/resources/small_zipcode.csv)
    https://www.jianshu.com/p/39852729736a

  • Original article: https://www.cnblogs.com/v5captain/p/14248636.html