zoukankan      html  css  js  c++  java
  • Pandas 与 SQL 对比

    Pandas 与 SQL 对比

    原文:http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html

        由于许多pandas用户使用SQL,因此本页面将提供一些使用pandas执行各种SQL操作的示例。如果你未接触过Pandas,你可能需要通过阅读10 Minutes to pandas,以熟悉一下这个库。

      按照惯例,我们导入pandas和numpy如下:

    In [1]: import pandas as pd
    
    In [2]: import numpy as np
    

      下面的大多数例子将使用在pandas测试集中的tips数据集。我们将数据读入一个名为tips的DataFrame,并假设我们也有一个具有相同名称和结构的数据库表tips。

    In [3]: url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'
    
    In [4]: tips = pd.read_csv(url)
    
    In [5]: tips.head()
    Out[5]: 
       total_bill   tip     sex smoker  day    time  size
    0       16.99  1.01  Female     No  Sun  Dinner     2
    1       10.34  1.66    Male     No  Sun  Dinner     3
    2       21.01  3.50    Male     No  Sun  Dinner     3
    3       23.68  3.31    Male     No  Sun  Dinner     2
    4       24.59  3.61  Female     No  Sun  Dinner     4
    

    SELECT

      在SQL中,使用逗号分隔的列选择列(或 选择所有列)进行选择:

    SELECT total_bill, tip, smoker, time
    FROM tips
    LIMIT 5;
    

      而pandas,通过将列名列表传递到DataFrame来完成列的选择:

    In [6]: tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
    Out[6]: 
       total_bill   tip smoker    time
    0       16.99  1.01     No  Dinner
    1       10.34  1.66     No  Dinner
    2       21.01  3.50     No  Dinner
    3       23.68  3.31     No  Dinner
    4       24.59  3.61     No  Dinner
    

      调用没有列名称列表的DataFrame将显示所有列(类似于SQL的 *)。

    WHERE

      在SQL中的筛选是通过WHERE子句完成的。

    SELECT *
    FROM tips
    WHERE time = 'Dinner'
    LIMIT 5;
    

      DataFrames可以以多种方式进行筛选;其中最直观的是使用关系表达式。

    In [7]: tips[tips['time'] == 'Dinner'].head(5)
    Out[7]: 
       total_bill   tip     sex smoker  day    time  size
    0       16.99  1.01  Female     No  Sun  Dinner     2
    1       10.34  1.66    Male     No  Sun  Dinner     3
    2       21.01  3.50    Male     No  Sun  Dinner     3
    3       23.68  3.31    Male     No  Sun  Dinner     2
    4       24.59  3.61  Female     No  Sun  Dinner     4
    

      上面的语句只是将一个Series的True 或 False对象传递给DataFrame,返回DataFrame中所有为True的行。

            也可以这样来写:

    In [8]: is_dinner = tips['time'] == 'Dinner'
    
    In [9]: is_dinner.value_counts()
    Out[9]: 
    True     176
    False     68
    Name: time, dtype: int64
    
    In [10]: tips[is_dinner].head(5)
    Out[10]: 
       total_bill   tip     sex smoker  day    time  size
    0       16.99  1.01  Female     No  Sun  Dinner     2
    1       10.34  1.66    Male     No  Sun  Dinner     3
    2       21.01  3.50    Male     No  Sun  Dinner     3
    3       23.68  3.31    Male     No  Sun  Dinner     2
    4       24.59  3.61  Female     No  Sun  Dinner     4
    

      就像SQL的OR和AND一样,可以使用 | (OR)和&(AND)将多个条件传递给DataFrame。

    注意条件要用圆括号。

    -- tips of more than $5.00 at Dinner meals
    SELECT *
    FROM tips
    WHERE time = 'Dinner' AND tip > 5.00;
    
    # tips of more than $5.00 at Dinner meals
    In [11]: tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]
    Out[11]: 
         total_bill    tip     sex smoker  day    time  size
    23        39.42   7.58    Male     No  Sat  Dinner     4
    44        30.40   5.60    Male     No  Sun  Dinner     4
    47        32.40   6.00    Male     No  Sun  Dinner     4
    52        34.81   5.20  Female     No  Sun  Dinner     4
    59        48.27   6.73    Male     No  Sat  Dinner     4
    116       29.93   5.07    Male     No  Sun  Dinner     4
    155       29.85   5.14  Female     No  Sun  Dinner     5
    170       50.81  10.00    Male    Yes  Sat  Dinner     3
    172        7.25   5.15    Male    Yes  Sun  Dinner     2
    181       23.33   5.65    Male    Yes  Sun  Dinner     2
    183       23.17   6.50    Male    Yes  Sun  Dinner     4
    211       25.89   5.16    Male    Yes  Sat  Dinner     4
    212       48.33   9.00    Male     No  Sat  Dinner     4
    214       28.17   6.50  Female    Yes  Sat  Dinner     3
    239       29.03   5.92    Male     No  Sat  Dinner     3
    
    -- tips by parties of at least 5 diners OR bill total was more than $45
    SELECT *
    FROM tips
    WHERE size >= 5 OR total_bill > 45;
    
    # tips by parties of at least 5 diners OR bill total was more than $45
    In [12]: tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]
    Out[12]: 
         total_bill    tip     sex smoker   day    time  size
    59        48.27   6.73    Male     No   Sat  Dinner     4
    125       29.80   4.20  Female     No  Thur   Lunch     6
    141       34.30   6.70    Male     No  Thur   Lunch     6
    142       41.19   5.00    Male     No  Thur   Lunch     5
    143       27.05   5.00  Female     No  Thur   Lunch     6
    155       29.85   5.14  Female     No   Sun  Dinner     5
    156       48.17   5.00    Male     No   Sun  Dinner     6
    170       50.81  10.00    Male    Yes   Sat  Dinner     3
    182       45.35   3.50    Male    Yes   Sun  Dinner     3
    185       20.69   5.00    Male     No   Sun  Dinner     5
    187       30.46   2.00    Male    Yes   Sun  Dinner     5
    212       48.33   9.00    Male     No   Sat  Dinner     4
    216       28.15   3.00    Male    Yes   Sat  Dinner     5
    

      使用 notna()isna()方法进行NULL检查。

    In [13]: frame = pd.DataFrame({'col1': ['A', 'B', np.NaN, 'C', 'D'],
       ....:                       'col2': ['F', np.NaN, 'G', 'H', 'I']})
       ....: 
    
    In [14]: frame
    Out[14]: 
      col1 col2
    0    A    F
    1    B  NaN
    2  NaN    G
    3    C    H
    4    D    I
    

      假设我们有一个与上面的DataFrame结构相同的表。通过以下查询,我们可以筛选出col2 为 NULL的记录:

    SELECT *
    FROM frame
    WHERE col2 IS NULL;
    
    In [15]: frame[frame['col2'].isna()]
    Out[15]: 
      col1 col2
    1    B  NaN
    

      使用notna()可以处理col1 为 NOT NULL的项目。

    SELECT *
    FROM frame
    WHERE col1 IS NOT NULL;
    
    In [16]: frame[frame['col1'].notna()]
    Out[16]: 
      col1 col2
    0    A    F
    1    B  NaN
    3    C    H
    4    D    I
    

    GROUP BY

      SQL的GROUP BY操作,在pandas中使用groupby()方法执行。groupby()将数据集分组,应用一些函数(通常是聚类),然后组合在一起。

      常见的SQL操作是获取数据集中每个组中的记录数。例如,查询tips表中不同性别的数量:

    SELECT sex, count(*)
    FROM tips
    GROUP BY sex;
    /*
    Female     87
    Male      157
    */
    

      相当于pandas的如下操作:

    In [17]: tips.groupby('sex').size()
    Out[17]: 
    sex
    Female     87
    Male      157
    dtype: int64
    

      注意,在pandas代码中,我们使用size()而不是count()这是因为count()将函数应用于每个列,返回每个列中不是Null的个数记录 。

    In [18]: tips.groupby('sex').count()
    Out[18]: 
            total_bill  tip  smoker  day  time  size
    sex                                             
    Female          87   87      87   87    87    87
    Male           157  157     157  157   157   157
    

      或者,我们可以将count()方法应用于单独的列:

    In [19]: tips.groupby('sex')['total_bill'].count()
    Out[19]: 
    sex
    Female     87
    Male      157
    Name: total_bill, dtype: int64
    

      也可以一次应用多个功能。例如,假设我们希望查看每天tip的总量 - agg()方法允许您将一个字典传递到已分组的DataFrame中,指明哪些函数应用于特定列。

    SELECT day, AVG(tip), COUNT(*)
    FROM tips
    GROUP BY day;
    /*
    Fri   2.734737   19
    Sat   2.993103   87
    Sun   3.255132   76
    Thur  2.771452   62
    */
    
    In [20]: tips.groupby('day').agg({'tip': np.mean, 'day': np.size})
    Out[20]: 
               tip  day
    day                
    Fri   2.734737   19
    Sat   2.993103   87
    Sun   3.255132   76
    Thur  2.771452   62
    

      通过将列列表传递到groupby()方法来对多个列进行分组。

    SELECT smoker, day, COUNT(*), AVG(tip)
    FROM tips
    GROUP BY smoker, day;
    /*
    smoker day
    No     Fri      4  2.812500
           Sat     45  3.102889
           Sun     57  3.167895
           Thur    45  2.673778
    Yes    Fri     15  2.714000
           Sat     42  2.875476
           Sun     19  3.516842
           Thur    17  3.030000
    */
    
    In [21]: tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]})
    Out[21]: 
                  tip          
                 size      mean
    smoker day                 
    No     Fri    4.0  2.812500
           Sat   45.0  3.102889
           Sun   57.0  3.167895
           Thur  45.0  2.673778
    Yes    Fri   15.0  2.714000
           Sat   42.0  2.875476
           Sun   19.0  3.516842
           Thur  17.0  3.030000
    

    JOIN

      可以使用join()merge()执行JOIN。默认情况下,DataFramess使用索引JOIN。每个方法都有参数,允许您指定要执行的连接类型(LEFT,RIGHT,INNER,FULL)或要连接的列(列名或索引)。

    In [22]: df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
       ....:                     'value': np.random.randn(4)})
       ....: 
    
    In [23]: df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
       ....:                     'value': np.random.randn(4)})
       ....: 
    

      假设我们有两个与我们的DataFrames具有相同名称和结构的数据库表。

      现在让我们来讨论各种类型的JOIN。

    INNER JOIN

    SELECT *
    FROM df1
    INNER JOIN df2
      ON df1.key = df2.key;
    
    # 一般merge 相当于一个 INNER JOIN
    In [24]: pd.merge(df1, df2, on='key')
    Out[24]: 
      key   value_x   value_y
    0   B -0.318214  0.543581
    1   D  2.169960 -0.426067
    2   D  2.169960  1.138079
    

      merge()还提供了将DataFrame的列与另一个DataFrame的索引相连接的情形。

    In [25]: indexed_df2 = df2.set_index('key')
    
    In [26]: pd.merge(df1, indexed_df2, left_on='key', right_index=True)
    Out[26]: 
      key   value_x   value_y
    1   B -0.318214  0.543581
    3   D  2.169960 -0.426067
    3   D  2.169960  1.138079
    

    LEFT OUTER JOIN

    -- show all records from df1
    SELECT *
    FROM df1
    LEFT OUTER JOIN df2
      ON df1.key = df2.key;
    
    # 从表 df1 选取所有记录
    In [27]: pd.merge(df1, df2, on='key', how='left')
    Out[27]: 
      key   value_x   value_y
    0   A  0.116174       NaN
    1   B -0.318214  0.543581
    2   C  0.285261       NaN
    3   D  2.169960 -0.426067
    4   D  2.169960  1.138079
    

    RIGHT JOIN

    -- show all records from df2
    SELECT *
    FROM df1
    RIGHT OUTER JOIN df2
      ON df1.key = df2.key;
    
    #  从表 df2 选取所有记录
    In [28]: pd.merge(df1, df2, on='key', how='right')
    Out[28]: 
      key   value_x   value_y
    0   B -0.318214  0.543581
    1   D  2.169960 -0.426067
    2   D  2.169960  1.138079
    3   E       NaN  0.086073
    

    FULL JOIN

      pandas还允许FULL JOIN,它选取两侧数据集,无论连接的列是否找到匹配。记住:所有RDBMS(MySQL)不支持FULL JOIN。

    -- show all records from both tables
    SELECT *
    FROM df1
    FULL OUTER JOIN df2
      ON df1.key = df2.key;
    
    # show all records from both frames
    In [29]: pd.merge(df1, df2, on='key', how='outer')
    Out[29]: 
      key   value_x   value_y
    0   A  0.116174       NaN
    1   B -0.318214  0.543581
    2   C  0.285261       NaN
    3   D  2.169960 -0.426067
    4   D  2.169960  1.138079
    5   E       NaN  0.086073
    

    UNION

      可以使用concat()执行UNION ALL。

    In [30]: df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
       ....:                     'rank': range(1, 4)})
       ....: 
    
    In [31]: df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
       ....:                     'rank': [1, 4, 5]})
       ....: 
    
    SELECT city, rank
    FROM df1
    UNION ALL
    SELECT city, rank
    FROM df2;
    /*
             city  rank
          Chicago     1
    San Francisco     2
    New York City     3
          Chicago     1
           Boston     4
      Los Angeles     5
    */
    
    In [32]: pd.concat([df1, df2])
    Out[32]: 
                city  rank
    0        Chicago     1
    1  San Francisco     2
    2  New York City     3
    0        Chicago     1
    1         Boston     4
    2    Los Angeles     5
    

      SQL的UNION类似于UNION ALL,但UNION将删除重复的行。

    SELECT city, rank
    FROM df1
    UNION
    SELECT city, rank
    FROM df2;
    -- notice that there is only one Chicago record this time
    /*
             city  rank
          Chicago     1
    San Francisco     2
    New York City     3
           Boston     4
      Los Angeles     5
    */
    

      在pandas中,您可以使用concat()drop_duplicates()结合使用。

    In [33]: pd.concat([df1, df2]).drop_duplicates()
    Out[33]: 
                city  rank
    0        Chicago     1
    1  San Francisco     2
    2  New York City     3
    1         Boston     4
    2    Los Angeles     5
    

    Pandas还有一些方法可对应于SQL分析和聚合函数

    Top N rows with offset

    -- MySQL
    SELECT * FROM tips
    ORDER BY tip DESC
    LIMIT 10 OFFSET 5;
    
    In [34]: tips.nlargest(10+5, columns='tip').tail(10)
    Out[34]: 
         total_bill   tip     sex smoker   day    time  size
    183       23.17  6.50    Male    Yes   Sun  Dinner     4
    214       28.17  6.50  Female    Yes   Sat  Dinner     3
    47        32.40  6.00    Male     No   Sun  Dinner     4
    239       29.03  5.92    Male     No   Sat  Dinner     3
    88        24.71  5.85    Male     No  Thur   Lunch     2
    181       23.33  5.65    Male    Yes   Sun  Dinner     2
    44        30.40  5.60    Male     No   Sun  Dinner     4
    52        34.81  5.20  Female     No   Sun  Dinner     4
    85        34.83  5.17  Female     No  Thur   Lunch     4
    211       25.89  5.16    Male    Yes   Sat  Dinner     4
    

    每组前N行

    -- Oracle's ROW_NUMBER() analytic function
    SELECT * FROM (
      SELECT
        t.*,
        ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn
      FROM tips t
    )
    WHERE rn < 3
    ORDER BY day, rn;
    
    In [35]: (tips.assign(rn=tips.sort_values(['total_bill'], ascending=False)
       ....:                     .groupby(['day'])
       ....:                     .cumcount() + 1)
       ....:      .query('rn < 3')
       ....:      .sort_values(['day','rn'])
       ....: )
       ....: 
    Out[35]: 
         total_bill    tip     sex smoker   day    time  size  rn
    95        40.17   4.73    Male    Yes   Fri  Dinner     4   1
    90        28.97   3.00    Male    Yes   Fri  Dinner     2   2
    170       50.81  10.00    Male    Yes   Sat  Dinner     3   1
    212       48.33   9.00    Male     No   Sat  Dinner     4   2
    156       48.17   5.00    Male     No   Sun  Dinner     6   1
    182       45.35   3.50    Male    Yes   Sun  Dinner     3   2
    197       43.11   5.00  Female    Yes  Thur   Lunch     4   1
    142       41.19   5.00    Male     No  Thur   Lunch     5   2
    

      同样使用rank(method ='first')函数

    In [36]: (tips.assign(rnk=tips.groupby(['day'])['total_bill']
       ....:                      .rank(method='first', ascending=False))
       ....:      .query('rnk < 3')
       ....:      .sort_values(['day','rnk'])
       ....: )
       ....: 
    Out[36]: 
         total_bill    tip     sex smoker   day    time  size  rnk
    95        40.17   4.73    Male    Yes   Fri  Dinner     4  1.0
    90        28.97   3.00    Male    Yes   Fri  Dinner     2  2.0
    170       50.81  10.00    Male    Yes   Sat  Dinner     3  1.0
    212       48.33   9.00    Male     No   Sat  Dinner     4  2.0
    156       48.17   5.00    Male     No   Sun  Dinner     6  1.0
    182       45.35   3.50    Male    Yes   Sun  Dinner     3  2.0
    197       43.11   5.00  Female    Yes  Thur   Lunch     4  1.0
    142       41.19   5.00    Male     No  Thur   Lunch     5  2.0
    
    -- Oracle's RANK() analytic function
    SELECT * FROM (
      SELECT
        t.*,
        RANK() OVER(PARTITION BY sex ORDER BY tip) AS rnk
      FROM tips t
      WHERE tip < 2
    )
    WHERE rnk < 3
    ORDER BY sex, rnk;
    

        我们来找tips中(rnk<3)(tips<2)(请注意,当使用rank(method='min')相当于Oracle的RANK()函数)

          In [37]: (tips[tips['tip'] < 2]

         ....: .assign(rnk_min=tips.groupby(['sex'])['tip']

       ....:                          .rank(method='min'))
       ....:      .query('rnk_min < 3')
       ....:      .sort_values(['sex','rnk_min'])
       ....: )
       ....: 
    Out[37]: 
         total_bill   tip     sex smoker  day    time  size  rnk_min
    67         3.07  1.00  Female    Yes  Sat  Dinner     1      1.0
    92         5.75  1.00  Female    Yes  Fri  Dinner     2      1.0
    111        7.25  1.00  Female     No  Sat  Dinner     1      1.0
    236       12.60  1.00    Male    Yes  Sat  Dinner     2      1.0
    237       32.83  1.17    Male    Yes  Sat  Dinner     2      2.0
    

    UPDATE

    UPDATE tips
    SET tip = tip*2
    WHERE tip < 2;
    
    In [38]: tips.loc[tips['tip'] < 2, 'tip'] *= 2
    现在.loc已不推荐使用了,有警告信息。

    DELETE

    DELETE FROM tips
    WHERE tip > 9;
    

    在pandas中,我们选择应该保留的行,而不是删除它们

    In [39]: tips = tips.loc[tips['tip'] <= 9]
  • 相关阅读:
    ACM/ICPC ZOJ1006-Do the Untwist 解题代码
    ACM/ICPC ZOJ1003-Crashing Balloon 解题代码
    数据结构(二)二叉搜索树-非递归实现遍历
    数据结构(一)二叉搜索树-递归实现
    Spark系列(三)SparkContext分析
    Docker系列(九)Kubernetes安装
    Docker系列(八)Kubernetes介绍
    Docker系列(七)Shipyard安装及介绍
    Docker系列(六)路由打通网络示例
    Esper系列(十四)Contained-Event Selection
  • 原文地址:https://www.cnblogs.com/cjtds/p/13182157.html
Copyright © 2011-2022 走看看