  • Implementing NOT IN on a Spark DataFrame

    Source: https://sqlandhadoop.com/spark-dataframe-in-isin-not-in/

    Summary: To express a “NOT IN” condition, apply the negation operator (!) to an isin query.

    e.g. df_pres.filter(!$"pres_bs".isin("New York","Ohio","Texas")).select($"pres_name",$"pres_dob",$"pres_bs").show()


    Spark DataFrame IN, ISIN, NOT IN

    IN and NOT IN conditions are used in FILTER/WHERE clauses, and even in JOINs, when we have to specify multiple possible values for a column. A row qualifies for “IN” if the column value equals one of the values listed inside the clause. “NOT IN” is the opposite: the column value must not equal any of the listed values.
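As a minimal sketch of these semantics, both conditions can also be written as SQL expression strings passed to filter. The three rows below are a hypothetical stand-in for df_pres, and the local SparkSession is an assumption made so the snippet can run outside spark-shell.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("in-not-in-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical three-row stand-in for df_pres (same column names as the article).
val df = Seq(
  ("George Washington", "1732-02-22", "Virginia"),
  ("Martin Van Buren", "1782-12-05", "New York"),
  ("Ulysses S. Grant", "1822-04-27", "Ohio")
).toDF("pres_name", "pres_dob", "pres_bs")

// IN: keep rows whose pres_bs equals one of the listed values.
val inRows = df.filter("pres_bs IN ('New York', 'Ohio', 'Texas')")

// NOT IN: keep rows whose pres_bs equals none of the listed values.
val notInRows = df.filter("pres_bs NOT IN ('New York', 'Ohio', 'Texas')")

inRows.show()
notInRows.show()
```

With this sample data, inRows holds the New York and Ohio rows and notInRows holds only the Virginia row.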
    Let’s look at an example of the IN condition on the df_pres DataFrame:

     
    scala> df_pres.filter($"pres_bs" in ("New York","Ohio","Texas")).select($"pres_name",$"pres_dob",$"pres_bs").show()
    +--------------------+----------+--------+
    |           pres_name|  pres_dob| pres_bs|
    +--------------------+----------+--------+
    |    Martin Van Buren|1782-12-05|New York|
    |    Millard Fillmore|1800-01-07|New York|
    |    Ulysses S. Grant|1822-04-27|    Ohio|
    | Rutherford B. Hayes|1822-10-04|    Ohio|
    |   James A. Garfield|1831-11-19|    Ohio|
    |   Benjamin Harrison|1833-08-20|    Ohio|
    |    William McKinley|1843-01-29|    Ohio|
    |  Theodore Roosevelt|1858-10-27|New York|
    | William Howard Taft|1857-09-15|    Ohio|
    |   Warren G. Harding|1865-11-02|    Ohio|
    |Franklin D. Roose...|1882-01-30|New York|
    |Dwight D. Eisenhower|1890-10-14|   Texas|
    |   Lyndon B. Johnson|1908-08-27|   Texas|
    |        Donald Trump|1946-06-14|New York|
    +--------------------+----------+--------+
    

    Note: The “in” method is not available in Spark 2.0 and later, so the preferred method is “isin”.

    Another way of writing it, and the one I prefer, is to use the isin function.

    scala> df_pres.filter($"pres_bs".isin("New York","Ohio","Texas")).select($"pres_name",$"pres_dob",$"pres_bs").show()
    +--------------------+----------+--------+
    |           pres_name|  pres_dob| pres_bs|
    +--------------------+----------+--------+
    |    Martin Van Buren|1782-12-05|New York|
    |    Millard Fillmore|1800-01-07|New York|
    |    Ulysses S. Grant|1822-04-27|    Ohio|
    | Rutherford B. Hayes|1822-10-04|    Ohio|
    |   James A. Garfield|1831-11-19|    Ohio|
    |   Benjamin Harrison|1833-08-20|    Ohio|
    |    William McKinley|1843-01-29|    Ohio|
    |  Theodore Roosevelt|1858-10-27|New York|
    | William Howard Taft|1857-09-15|    Ohio|
    |   Warren G. Harding|1865-11-02|    Ohio|
    |Franklin D. Roose...|1882-01-30|New York|
    |Dwight D. Eisenhower|1890-10-14|   Texas|
    |   Lyndon B. Johnson|1908-08-27|   Texas|
    |        Donald Trump|1946-06-14|New York|
    +--------------------+----------+--------+
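When the list of values already lives in a Scala collection, it can be handed to isin as well. The sketch below assumes a small hypothetical DataFrame in place of df_pres and a local SparkSession. isin takes varargs, so a Seq has to be expanded with `: _*`; isInCollection accepts the collection directly but is only available from Spark 2.4 onward.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("isin-collection-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for df_pres.
val df = Seq(
  ("George Washington", "Virginia"),
  ("Martin Van Buren", "New York"),
  ("Ulysses S. Grant", "Ohio"),
  ("Lyndon B. Johnson", "Texas")
).toDF("pres_name", "pres_bs")

// Hypothetical list of states, e.g. loaded from configuration.
val states = Seq("New York", "Ohio", "Texas")

// isin(list: Any*) is a varargs method, so expand the Seq with `: _*`.
val viaVarargs = df.filter($"pres_bs".isin(states: _*))

// Spark 2.4+: pass the collection as-is.
val viaCollection = df.filter($"pres_bs".isInCollection(states))

viaVarargs.show()
viaCollection.show()
```

Both filters select the same rows; the choice is mostly a matter of which Spark version is in use.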
    

    To express a “NOT IN” condition, apply negation (!) to the previous isin query. Although the ! is written in front of the column name, it negates the result of the whole isin expression.

    scala> df_pres.filter(!$"pres_bs".isin("New York","Ohio","Texas")).select($"pres_name",$"pres_dob",$"pres_bs").show()
    +--------------------+----------+--------------------+
    |           pres_name|  pres_dob|             pres_bs|
    +--------------------+----------+--------------------+
    |   George Washington|1732-02-22|            Virginia|
    |          John Adams|1735-10-30|       Massachusetts|
    |    Thomas Jefferson|1743-04-13|            Virginia|
    |       James Madison|1751-03-16|            Virginia|
    |        James Monroe|1758-04-28|            Virginia|
    |   John Quincy Adams|1767-07-11|       Massachusetts|
    |      Andrew Jackson|1767-03-15|South/North Carolina|
    |William Henry Har...|1773-02-09|            Virginia|
    |          John Tyler|1790-03-29|            Virginia|
    |       James K. Polk|1795-11-02|      North Carolina|
    |      Zachary Taylor|1784-11-24|            Virginia|
    |     Franklin Pierce|1804-11-23|       New Hampshire|
    |      James Buchanan|1791-04-23|        Pennsylvania|
    |     Abraham Lincoln|1809-02-12|            Kentucky|
    |      Andrew Johnson|1808-12-29|      North Carolina|
    |   Chester A. Arthur|1829-10-05|             Vermont|
    |    Grover Cleveland|1837-03-18|          New Jersey|
    |    Grover Cleveland|1837-03-18|          New Jersey|
    |      Woodrow Wilson|1856-12-28|            Virginia|
    |     Calvin Coolidge|1872-07-04|             Vermont|
    +--------------------+----------+--------------------+
    only showing top 20 rows
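One detail worth checking on real data is how NOT IN treats NULLs. Under SQL three-valued logic, a NULL column value makes both isin and its negation evaluate to NULL, and filter drops such rows. The sketch below uses a hypothetical DataFrame with one NULL pres_bs value and a local SparkSession (both assumptions) to show this, along with the equivalent not(...) function and an explicit isNull check for keeping NULL rows.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.not

// Assumption: a local session for experimentation; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("not-in-null-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical rows; the last one has a NULL pres_bs.
val df = Seq(
  ("George Washington", Some("Virginia")),
  ("Martin Van Buren", Some("New York")),
  ("Unknown Birthplace", None)
).toDF("pres_name", "pres_bs")

val excluded = Seq("New York", "Ohio", "Texas")

// `!` negates the whole isin expression; the NULL row is filtered out.
val notIn = df.filter(!$"pres_bs".isin(excluded: _*))

// Same condition written with the not(...) function.
val notInFn = df.filter(not($"pres_bs".isin(excluded: _*)))

// Keep NULL rows explicitly if they should count as "not in the list".
val notInOrNull = df.filter($"pres_bs".isNull || !$"pres_bs".isin(excluded: _*))

notIn.show()
notInOrNull.show()
```

Here notIn and notInFn return only the Virginia row, while notInOrNull also returns the row with the NULL birthplace.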
  • Original article: https://www.cnblogs.com/144823836yj/p/13718376.html