zoukankan      html  css  js  c++  java
  • python学习——pandas的拼接操作

    pandas的拼接操作

     

    pandas的拼接分为两种:

    • 级联:pd.concat, pd.append
    • 合并:pd.merge, pd.join
     

    0. 回顾numpy的级联

     

    ============================================

    练习12:

    1. 生成2个3*3的矩阵,对其分别进行两个维度上的级联

    ============================================

    In [19]:
    import numpy as np
    import pandas as pd
    from pandas import Series,DataFrame
    
    In [20]:
    nd = np.random.randint(0,10,size=(3,3))
    nd
    
    Out[20]:
    array([[6, 3, 4],
           [3, 9, 8],
           [8, 7, 8]])
    In [24]:
    np.concatenate((nd,nd),axis=0)#0代表行间操作
    
    Out[24]:
    array([[6, 3, 4],
           [3, 9, 8],
           [8, 7, 8],
           [6, 3, 4],
           [3, 9, 8],
           [8, 7, 8]])
    In [25]:
    np.concatenate([nd,nd],axis=1)#1代表列间操作,()huo[]效果一样
    
    Out[25]:
    array([[6, 3, 4, 6, 3, 4],
           [3, 9, 8, 3, 9, 8],
           [8, 7, 8, 8, 7, 8]])
     

    为方便讲解,我们首先定义一个生成DataFrame的函数:

    In [26]:
    def make_df(inds,cols):
        #字典的key作为列名进行展示
        data = {key:[key+str(i) for i in inds]for key in cols}
        
        return DataFrame(data,index=inds,columns=cols)
    
    In [28]:
    make_df([1,2],list('AB'))
    
    Out[28]:
     AB
    1 A1 B1
    2 A2 B2
     

    1. 使用pd.concat()级联

     

    pandas使用pd.concat函数,与np.concatenate函数类似,只是多了一些参数:

    pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
              keys=None, levels=None, names=None, verify_integrity=False,
              copy=True)
     

    1) 简单级联

     

    和np.concatenate一样,优先增加行数(默认axis=0)

    In [29]:
    df1 = make_df([0,1],list('AB'))
    df2 = make_df([2,3],list('AB'))
    
    In [30]:
    display(df1,df2)
    
     
     AB
    0 A0 B0
    1 A1 B1
     
     AB
    2 A2 B2
    3 A3 B3
     

    可以通过设置axis来改变级联方向

    In [31]:
    pd.concat([df1,df2])
    
    Out[31]:
     AB
    0 A0 B0
    1 A1 B1
    2 A2 B2
    3 A3 B3
    In [32]:
    pd.concat((df1,df2),axis = 1)
    
    Out[32]:
     ABAB
    0 A0 B0 NaN NaN
    1 A1 B1 NaN NaN
    2 NaN NaN A2 B2
    3 NaN NaN A3 B3
     

    注意index在级联时可以重复

     

    也可以选择忽略ignore_index,重新索引

    In [34]:
    pd.concat((df1,df2),axis=1,ignore_index=True)
    
    Out[34]:
     0123
    0 A0 B0 NaN NaN
    1 A1 B1 NaN NaN
    2 NaN NaN A2 B2
    3 NaN NaN A3 B3
     

    或者使用多层索引 keys

    concat([x,y],keys=['x','y'])

    In [13]:
    pd.concat([df1,df2],keys=['x','y'])
    
    Out[13]:
      AB
    x0 A1 B1
    1 A2 B2
    y0 A3 B3
    1 A4 B4
    In [ ]:
    #pd 模块 import pandas as pd
    
    #df1,df2 具体的实例
    #级联的方法,属于上一级,DataFrame来自pandas
    
     

    ============================================

    练习13:

    1. 想一想级联的应用场景?

    2. 使用昨天的知识,建立一个期中考试张三、李四的成绩表ddd

    3. 假设新增考试学科"计算机",如何实现?

    4. 新增王老五同学的成绩,如何实现?

    ============================================

    In [ ]:
     
     

    2) 不匹配级联

     

    不匹配指的是级联的维度的索引不一致。例如纵向级联时列索引不一致,横向级联时行索引不一致

    In [38]:
    df1 = make_df([1,2],list('AB'))
    df2 = make_df([2,4],list('BC'))
    display(df1,df2)
    
     
     AB
    1 A1 B1
    2 A2 B2
     
     BC
    2 B2 C2
    4 B4 C4
     

    有3种连接方式:

    • 外连接:补NaN(默认模式)
    In [39]:
    pd.concat([df1,df2])
    
     
    C:UsersBLXAppDataRoamingPythonPython37site-packagesipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
    of pandas will change to not sort by default.
    
    To accept the future behavior, pass 'sort=False'.
    
    To retain the current behavior and silence the warning, pass 'sort=True'.
    
      """Entry point for launching an IPython kernel.
    
    Out[39]:
     ABC
    1 A1 B1 NaN
    2 A2 B2 NaN
    2 NaN B2 C2
    4 NaN B4 C4
     
    • 内连接:只连接匹配的项
    In [41]:
    #合并显示共有数据
    pd.concat((df1,df2),join = 'inner',axis = 1)
    
    Out[41]:
     ABBC
    2 A2 B2 B2 C2
     
    • 连接指定轴 join_axes
    In [42]:
    df2.columns
    
    Out[42]:
    Index(['B', 'C'], dtype='object')
    In [43]:
    #join_axex以某个DataFrame的列索引为新的列索引值
    pd.concat([df1,df2],join_axes=[df2.columns])
    
    Out[43]:
     BC
    1 B1 NaN
    2 B2 NaN
    2 B2 C2
    4 B4 C4
     

    ============================================

    练习14:

    假设【期末】考试ddd2的成绩没有张三的,只有李四、王老五、赵小六的,使用多种方法级联
    
    

    ============================================

     

    3) 使用append()函数添加

     

    由于在后面级联的使用非常普遍,因此有一个函数append专门用于在后面添加

    In [44]:
    display(df1,df2)
    
     
     AB
    1 A1 B1
    2 A2 B2
     
     BC
    2 B2 C2
    4 B4 C4
    In [49]:
    #append函数属于DataFrame,concat这函数属于pandas模块
    #pd.concat((df1,df2))
    df1.append(df2)
    
    Out[49]:
     ABC
    1 A1 B1 NaN
    2 A2 B2 NaN
    2 NaN B2 C2
    4 NaN B4 C4
     

    ============================================

    练习15:

    新建一个只有张三李四王老五的期末考试成绩单ddd3,使用append()与期中考试成绩表ddd级联
    
    

    ============================================

     

    2. 使用pd.merge()合并

     

    merge与concat的区别在于,merge需要依据某一共同的行或列来进行合并

    使用pd.merge()合并时,会自动根据两者相同column名称的那一列,作为key来进行合并。

    注意每一列元素的顺序不要求一致

     

    1) 一对一合并

    In [54]:
    #merge根据相同的元素进行合并的
    df1 = DataFrame({'employee':['Po','Sara','Danis'],
                     'group':['sail','couting','marketing']})
    df2 = DataFrame({'employee':['Po','Sara','Bush'],
                     'work_time':[2,3,1]})
    display(df1,df2)
    
     
     employeegroup
    0 Po sail
    1 Sara couting
    2 Danis marketing
     
     employeework_time
    0 Po 2
    1 Sara 3
    2 Bush 1
    In [55]:
    pd.merge(df1,df2)
    
    Out[55]:
     employeegroupwork_time
    0 Po sail 2
    1 Sara couting 3
    In [56]:
    df1.merge(df2)
    
    Out[56]:
     employeegroupwork_time
    0 Po sail 2
    1 Sara couting 3
     

    2) 多对一合并

    In [57]:
    df1 = DataFrame({'employee':['Po','Sara','Danis'],
                     'group':['sail','couting','marketing']})
    df2 = DataFrame({'employee':['Po','Po','Bush'],
                     'work_time':[2,3,1]})
    display(df1,df2)
    
     
     employeegroup
    0 Po sail
    1 Sara couting
    2 Danis marketing
     
     employeework_time
    0 Po 2
    1 Po 3
    2 Bush 1
    In [58]:
    pd.merge(df1,df2)
    
    Out[58]:
     employeegroupwork_time
    0 Po sail 2
    1 Po sail 3
     

    3) 多对多合并

    In [61]:
    df1 = DataFrame({'employee':['Po','Po','Danis'],
                     'group':['sail','couting','marketing']})
    df2 = DataFrame({'employee':['Po','Po','Bush'],
                     'work_time':[2,3,1]})
    display(df1,df2)
    
     
     employeegroup
    0 Po sail
    1 Po couting
    2 Danis marketing
     
     employeework_time
    0 Po 2
    1 Po 3
    2 Bush 1
    In [62]:
    pd.merge(df1,df2)
    
    Out[62]:
     employeegroupwork_time
    0 Po sail 2
    1 Po sail 3
    2 Po couting 2
    3 Po couting 3
     

    4) key的规范化

     
    • 使用on=显式指定哪一列为key,当有多个key相同时使用
    In [66]:
    df3 = DataFrame({'employee':['Po','Summer','Flower'],
                     'group':['sail','marketing','serch'],
                     'salary':[12000,10000,8000]})
    df4 = DataFrame({'employee':['Po','Winter','Flower'],
                     'group':['marketing','marketing','serch'],
                     'work_time':[2,1,5]})
    display(df3,df4)
    
     
     employeegroupsalary
    0 Po sail 12000
    1 Summer marketing 10000
    2 Flower serch 8000
     
     employeegroupwork_time
    0 Po marketing 2
    1 Winter marketing 1
    2 Flower serch 5
    In [67]:
    pd.merge(df3,df4)
    
    Out[67]:
     employeegroupsalarywork_time
    0 Flower serch 8000 5
    In [70]:
    pd.merge(df3,df4,on='employee')
    
    Out[70]:
     employeegroup_xsalarygroup_ywork_time
    0 Po sail 12000 marketing 2
    1 Flower serch 8000 serch 5
    In [73]:
    pd.merge(df3,df4,on='group',suffixes=['_A','_B'])
    
    Out[73]:
     employee_Agroupsalaryemployee_Bwork_time
    0 Summer marketing 10000 Po 2
    1 Summer marketing 10000 Winter 1
    2 Flower serch 8000 Flower 5
     
    • 使用left_on和right_on指定左右两边的列作为key,当左右两边的key都不想等时使用
    • 参数1为左,参数2为右
    In [79]:
    df3 = DataFrame({'employer':['Po','Summer','Flower'],
                                        'Team':['sail','marketing','serch'],
                                        'salary':[12000,10000,8000]})
    df4 = DataFrame({'employee':['Po','Winter','Flower'],
                                        'group':['marketing','marketing','serch'],
                                        'work_time':[2,1,5]})
    display(df3,df4)
    
     
     employerTeamsalary
    0 Po sail 12000
    1 Summer marketing 10000
    2 Flower serch 8000
     
     employeegroupwork_time
    0 Po marketing 2
    1 Winter marketing 1
    2 Flower serch 5
    In [81]:
    pd.merge(df3,df4,left_on='employer',right_on='employee')
    
    Out[81]:
     employerTeamsalaryemployeegroupwork_time
    0 Po sail 12000 Po marketing 2
    1 Flower serch 8000 Flower serch 5
    In [82]:
    pd.merge(df3,df4,left_on='Team',right_on='group')
    
    Out[82]:
     employerTeamsalaryemployeegroupwork_time
    0 Summer marketing 10000 Po marketing 2
    1 Summer marketing 10000 Winter marketing 1
    2 Flower serch 8000 Flower serch 5
     

    ============================================

    练习16:

    1. 假设有两份成绩单,除了ddd是张三李四王老五之外,还有ddd4是张三和赵小六的成绩单,如何合并?

    2. 如果ddd4中张三的名字被打错了,成为了张十三,怎么办?

    3. 自行练习多对一,多对多的情况

    4. 自学left_index,right_index

    ============================================

     

    5) 内合并与外合并

     
    • 内合并:只保留两者都有的key(默认模式)
    In [85]:
    df1 = DataFrame({'age':[18,22,33],'height':[175,169,180]})
    
    df2 = DataFrame({'age':[18,23,31],'weight':[65,70,80]})
    
    In [86]:
    pd.merge(df1,df2)
    
    Out[86]:
     ageheightweight
    0 18 175 65
    In [87]:
    df1.merge(df2,how='inner')
    
    Out[87]:
     ageheightweight
    0 18 175 65
     
    • 外合并 how='outer':补NaN
    In [88]:
    df1.merge(df2,how = 'outer')
    
    Out[88]:
     ageheightweight
    0 18 175.0 65.0
    1 22 169.0 NaN
    2 33 180.0 NaN
    3 23 NaN 70.0
    4 31 NaN 80.0
     
    • 左合并、右合并:how='left',how='right',
    In [89]:
    df1.merge(df2,how = 'left')#保留左侧
    
    Out[89]:
     ageheightweight
    0 18 175 65.0
    1 22 169 NaN
    2 33 180 NaN
    In [90]:
    pd.merge(df1,df2,how='right')#保留右侧
    
    Out[90]:
     ageheightweight
    0 18 175.0 65
    1 23 NaN 70
    2 31 NaN 80
     

    ============================================

    练习17:

    1. 如果只有张三赵小六语数英三个科目的成绩,如何合并?

    2. 考虑应用情景,使用多种方式合并ddd与ddd4

    ============================================

     

    6) 列冲突的解决

     

    当列冲突时,即有多个列名称相同时,需要使用on=来指定哪一个列作为key,配合suffixes指定冲突列名

     

    可以使用suffixes=自己指定后缀

    In [91]:
    display(df3,df4)
    
     
     employerTeamsalary
    0 Po sail 12000
    1 Summer marketing 10000
    2 Flower serch 8000
     
     employeegroupwork_time
    0 Po marketing 2
    1 Winter marketing 1
    2 Flower serch 5
    In [93]:
    df3.columns = ['employee','group','salary']
    display(df3)
    
     
     employeegroupsalary
    0 Po sail 12000
    1 Summer marketing 10000
    2 Flower serch 8000
    In [94]:
    pd.merge(df3,df4,on='employee',suffixes=['_李','_王'])
    
    Out[94]:
     employeegroup_李salarygroup_王work_time
    0 Po sail 12000 marketing 2
    1 Flower serch 8000 serch 5
     

    ============================================

    练习18:

    假设有两个同学都叫李四,ddd5、ddd6都是张三和李四的成绩表,如何合并?
    
    

    ============================================

     

    作业

    3. 案例分析:美国各州人口数据分析

     

    首先导入文件,并查看数据样本

    In [62]:
    import numpy as np
    import pandas as pd
    from pandas import Series,DataFrame
    
    In [63]:
    #使用pandas读取数据
    pop = pd.read_csv('../../data/state-population.csv')
    
    areas = pd.read_csv('../../data/state-areas.csv')
    
    abb = pd.read_csv('../../data/state-abbrevs.csv')
    
    In [64]:
    pop.shape
    
    Out[64]:
    (2544, 4)
    In [65]:
    pop.head()
    
    Out[65]:
     state/regionagesyearpopulation
    0 AL under18 2012 1117489.0
    1 AL total 2012 4817528.0
    2 AL under18 2010 1130966.0
    3 AL total 2010 4785570.0
    4 AL under18 2011 1125763.0
    In [70]:
    areas.shape
    
    Out[70]:
    (52, 2)
    In [69]:
    abb.shape
    
    Out[69]:
    (51, 2)
     

    合并pop与abbrevs两个DataFrame,分别依据state/region列和abbreviation列来合并。

    为了保留所有信息,使用外合并。

    In [71]:
    pop.head()
    
    Out[71]:
     state/regionagesyearpopulation
    0 AL under18 2012 1117489.0
    1 AL total 2012 4817528.0
    2 AL under18 2010 1130966.0
    3 AL total 2010 4785570.0
    4 AL under18 2011 1125763.0
    In [72]:
    abb.head()
    
    Out[72]:
     stateabbreviation
    0 Alabama AL
    1 Alaska AK
    2 Arizona AZ
    3 Arkansas AR
    4 California CA
    In [73]:
    display(pop.shape,abb.shape)
    
     
    (2544, 4)
     
    (51, 2)
    In [78]:
    #此时的场景 left == outer left数据大于abb
    #left效果比outer差一些
    #abb 河北
    pop_m = pop.merge(abb,left_on='state/region',right_on='abbreviation',how = 'outer')
    pop_m.shape
    
    Out[78]:
    (2544, 6)
     

    去除abbreviation的那一列(axis=1)

    In [79]:
    pop_m.head()
    
    Out[79]:
     state/regionagesyearpopulationstateabbreviation
    0 AL under18 2012 1117489.0 Alabama AL
    1 AL total 2012 4817528.0 Alabama AL
    2 AL under18 2010 1130966.0 Alabama AL
    3 AL total 2010 4785570.0 Alabama AL
    4 AL under18 2011 1125763.0 Alabama AL
    In [83]:
    pop_m.drop('abbreviation',axis = 1,inplace=True)
    
     
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-83-15dcfc478d0b> in <module>()
    ----> 1 pop_m.drop('abbreviation',axis = 1,inplace=True)
    
    /usr/local/lib/python3.5/dist-packages/pandas/core/generic.py in drop(self, labels, axis, level, inplace, errors)
       2159                 new_axis = axis.drop(labels, level=level, errors=errors)
       2160             else:
    -> 2161                 new_axis = axis.drop(labels, errors=errors)
       2162             dropped = self.reindex(**{axis_name: new_axis})
       2163             try:
    
    /usr/local/lib/python3.5/dist-packages/pandas/core/indexes/base.py in drop(self, labels, errors)
       3622             if errors != 'ignore':
       3623                 raise ValueError('labels %s not contained in axis' %
    -> 3624                                  labels[mask])
       3625             indexer = indexer[~mask]
       3626         return self.delete(indexer)
    
    ValueError: labels ['abbreviation'] not contained in axis
    In [82]:
    pop_m.head()
    
    Out[82]:
     state/regionagesyearpopulationstate
    0 AL under18 2012 1117489.0 Alabama
    1 AL total 2012 4817528.0 Alabama
    2 AL under18 2010 1130966.0 Alabama
    3 AL total 2010 4785570.0 Alabama
    4 AL under18 2011 1125763.0 Alabama
     

    查看存在缺失数据的列。

    使用.isnull().any(),只有某一列存在一个缺失数据,就会显示True。

    In [88]:
    pop_m.isnull().any()
    
    Out[88]:
    state/region    False
    ages            False
    year            False
    population       True
    state            True
    dtype: bool
    In [ ]:
    #population 和 state这两列有数据缺失的情况
    
     

    查看缺失数据

    In [92]:
    #为空的行索引
    pop_m.loc[pop_m.isnull().any(axis = 1)]
    
    Out[92]:
     state/regionagesyearpopulationstate
    2448 PR under18 1990 NaN NaN
    2449 PR total 1990 NaN NaN
    2450 PR total 1991 NaN NaN
    2451 PR under18 1991 NaN NaN
    2452 PR total 1993 NaN NaN
    2453 PR under18 1993 NaN NaN
    2454 PR under18 1992 NaN NaN
    2455 PR total 1992 NaN NaN
    2456 PR under18 1994 NaN NaN
    2457 PR total 1994 NaN NaN
    2458 PR total 1995 NaN NaN
    2459 PR under18 1995 NaN NaN
    2460 PR under18 1996 NaN NaN
    2461 PR total 1996 NaN NaN
    2462 PR under18 1998 NaN NaN
    2463 PR total 1998 NaN NaN
    2464 PR total 1997 NaN NaN
    2465 PR under18 1997 NaN NaN
    2466 PR total 1999 NaN NaN
    2467 PR under18 1999 NaN NaN
    2468 PR total 2000 3810605.0 NaN
    2469 PR under18 2000 1089063.0 NaN
    2470 PR total 2001 3818774.0 NaN
    2471 PR under18 2001 1077566.0 NaN
    2472 PR total 2002 3823701.0 NaN
    2473 PR under18 2002 1065051.0 NaN
    2474 PR total 2004 3826878.0 NaN
    2475 PR under18 2004 1035919.0 NaN
    2476 PR total 2003 3826095.0 NaN
    2477 PR under18 2003 1050615.0 NaN
    ... ... ... ... ... ...
    2514 USA under18 1999 71946051.0 NaN
    2515 USA total 2000 282162411.0 NaN
    2516 USA under18 2000 72376189.0 NaN
    2517 USA total 1999 279040181.0 NaN
    2518 USA total 2001 284968955.0 NaN
    2519 USA under18 2001 72671175.0 NaN
    2520 USA total 2002 287625193.0 NaN
    2521 USA under18 2002 72936457.0 NaN
    2522 USA total 2003 290107933.0 NaN
    2523 USA under18 2003 73100758.0 NaN
    2524 USA total 2004 292805298.0 NaN
    2525 USA under18 2004 73297735.0 NaN
    2526 USA total 2005 295516599.0 NaN
    2527 USA under18 2005 73523669.0 NaN
    2528 USA total 2006 298379912.0 NaN
    2529 USA under18 2006 73757714.0 NaN
    2530 USA total 2007 301231207.0 NaN
    2531 USA under18 2007 74019405.0 NaN
    2532 USA total 2008 304093966.0 NaN
    2533 USA under18 2008 74104602.0 NaN
    2534 USA under18 2013 73585872.0 NaN
    2535 USA total 2013 316128839.0 NaN
    2536 USA total 2009 306771529.0 NaN
    2537 USA under18 2009 74134167.0 NaN
    2538 USA under18 2010 74119556.0 NaN
    2539 USA total 2010 309326295.0 NaN
    2540 USA under18 2011 73902222.0 NaN
    2541 USA total 2011 311582564.0 NaN
    2542 USA under18 2012 73708179.0 NaN
    2543 USA total 2012 313873685.0 NaN

    96 rows × 5 columns

     

    根据数据是否缺失情况显示数据,如果缺失为True,那么显示

     

    找到有哪些state/region使得state的值为NaN,使用unique()查看非重复值

    In [94]:
    condition = pop_m['state'].isnull()
    pop_m['state/region'][condition].unique()
    
    Out[94]:
    array(['PR', 'USA'], dtype=object)
    In [95]:
    areas
    
    Out[95]:
     statearea (sq. mi)
    0 Alabama 52423
    1 Alaska 656425
    2 Arizona 114006
    3 Arkansas 53182
    4 California 163707
    5 Colorado 104100
    6 Connecticut 5544
    7 Delaware 1954
    8 Florida 65758
    9 Georgia 59441
    10 Hawaii 10932
    11 Idaho 83574
    12 Illinois 57918
    13 Indiana 36420
    14 Iowa 56276
    15 Kansas 82282
    16 Kentucky 40411
    17 Louisiana 51843
    18 Maine 35387
    19 Maryland 12407
    20 Massachusetts 10555
    21 Michigan 96810
    22 Minnesota 86943
    23 Mississippi 48434
    24 Missouri 69709
    25 Montana 147046
    26 Nebraska 77358
    27 Nevada 110567
    28 New Hampshire 9351
    29 New Jersey 8722
    30 New Mexico 121593
    31 New York 54475
    32 North Carolina 53821
    33 North Dakota 70704
    34 Ohio 44828
    35 Oklahoma 69903
    36 Oregon 98386
    37 Pennsylvania 46058
    38 Rhode Island 1545
    39 South Carolina 32007
    40 South Dakota 77121
    41 Tennessee 42146
    42 Texas 268601
    43 Utah 84904
    44 Vermont 9615
    45 Virginia 42769
    46 Washington 71303
    47 West Virginia 24231
    48 Wisconsin 65503
    49 Wyoming 97818
    50 District of Columbia 68
    51 Puerto Rico 3515
    In [ ]:
    只有两个州对应的州名为空
    
     

    为找到的这些state/region的state项补上正确的值,从而去除掉state这一列的所有NaN!

    记住这样清除缺失数据NaN的方法!

    In [96]:
    #Puerto Rico 
    
    conditon = pop_m['state/region'] == 'PR'
    condition
    
    Out[96]:
    0       False
    1       False
    2       False
    3       False
    4       False
    5       False
    6       False
    7       False
    8       False
    9       False
    10      False
    11      False
    12      False
    13      False
    14      False
    15      False
    16      False
    17      False
    18      False
    19      False
    20      False
    21      False
    22      False
    23      False
    24      False
    25      False
    26      False
    27      False
    28      False
    29      False
            ...  
    2514     True
    2515     True
    2516     True
    2517     True
    2518     True
    2519     True
    2520     True
    2521     True
    2522     True
    2523     True
    2524     True
    2525     True
    2526     True
    2527     True
    2528     True
    2529     True
    2530     True
    2531     True
    2532     True
    2533     True
    2534     True
    2535     True
    2536     True
    2537     True
    2538     True
    2539     True
    2540     True
    2541     True
    2542     True
    2543     True
    Name: state, Length: 2544, dtype: bool
    In [97]:
    pop_m['state'][condition] = 'Puerto Rico'
    
     
    /usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
    A value is trying to be set on a copy of a slice from a DataFrame
    
    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
      """Entry point for launching an IPython kernel.
    
    In [99]:
    condition = pop_m['state/region'] == 'USA'
    pop_m['state'][condition] = 'United State'
    
     
    /usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
    A value is trying to be set on a copy of a slice from a DataFrame
    
    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
      
    
    In [100]:
    #刚才的填补操作,起作用了
    pop_m.isnull().any()
    
    Out[100]:
    state/region    False
    ages            False
    year            False
    population       True
    state           False
    dtype: bool
     

    合并各州面积数据areas,使用左合并。

    思考一下为什么使用外合并?

    In [102]:
    pop.head()
    #人口的DataFrame和abb合并,有了州名全程
    #可以和areas DataFrame进行合并
    
    Out[102]:
     state/regionagesyearpopulation
    0 AL under18 2012 1117489.0
    1 AL total 2012 4817528.0
    2 AL under18 2010 1130966.0
    3 AL total 2010 4785570.0
    4 AL under18 2011 1125763.0
    In [103]:
    pop_areas_m = pop_m.merge(areas,how = 'outer')
    
     

    继续寻找存在缺失数据的列

    In [105]:
    pop_areas_m.shape
    
    Out[105]:
    (2544, 6)
    In [109]:
    areas
    
    Out[109]:
     statearea (sq. mi)
    0 Alabama 52423
    1 Alaska 656425
    2 Arizona 114006
    3 Arkansas 53182
    4 California 163707
    5 Colorado 104100
    6 Connecticut 5544
    7 Delaware 1954
    8 Florida 65758
    9 Georgia 59441
    10 Hawaii 10932
    11 Idaho 83574
    12 Illinois 57918
    13 Indiana 36420
    14 Iowa 56276
    15 Kansas 82282
    16 Kentucky 40411
    17 Louisiana 51843
    18 Maine 35387
    19 Maryland 12407
    20 Massachusetts 10555
    21 Michigan 96810
    22 Minnesota 86943
    23 Mississippi 48434
    24 Missouri 69709
    25 Montana 147046
    26 Nebraska 77358
    27 Nevada 110567
    28 New Hampshire 9351
    29 New Jersey 8722
    30 New Mexico 121593
    31 New York 54475
    32 North Carolina 53821
    33 North Dakota 70704
    34 Ohio 44828
    35 Oklahoma 69903
    36 Oregon 98386
    37 Pennsylvania 46058
    38 Rhode Island 1545
    39 South Carolina 32007
    40 South Dakota 77121
    41 Tennessee 42146
    42 Texas 268601
    43 Utah 84904
    44 Vermont 9615
    45 Virginia 42769
    46 Washington 71303
    47 West Virginia 24231
    48 Wisconsin 65503
    49 Wyoming 97818
    50 District of Columbia 68
    51 Puerto Rico 3515
    In [106]:
    pop_areas_m.isnull().any()
    
    Out[106]:
    state/region     False
    ages             False
    year             False
    population        True
    state            False
    area (sq. mi)     True
    dtype: bool
     

    我们会发现area(sq.mi)这一列有缺失数据,为了找出是哪一行,我们需要找出是哪个state没有数据

    In [110]:
    cond = pop_areas_m['area (sq. mi)'].isnull()
    cond
    
    Out[110]:
    0       False
    1       False
    2       False
    3       False
    4       False
    5       False
    6       False
    7       False
    8       False
    9       False
    10      False
    11      False
    12      False
    13      False
    14      False
    15      False
    16      False
    17      False
    18      False
    19      False
    20      False
    21      False
    22      False
    23      False
    24      False
    25      False
    26      False
    27      False
    28      False
    29      False
            ...  
    2514     True
    2515     True
    2516     True
    2517     True
    2518     True
    2519     True
    2520     True
    2521     True
    2522     True
    2523     True
    2524     True
    2525     True
    2526     True
    2527     True
    2528     True
    2529     True
    2530     True
    2531     True
    2532     True
    2533     True
    2534     True
    2535     True
    2536     True
    2537     True
    2538     True
    2539     True
    2540     True
    2541     True
    2542     True
    2543     True
    Name: area (sq. mi), Length: 2544, dtype: bool
    In [111]:
    pop_areas_m['state/region'][cond]
    
    Out[111]:
    2496    USA
    2497    USA
    2498    USA
    2499    USA
    2500    USA
    2501    USA
    2502    USA
    2503    USA
    2504    USA
    2505    USA
    2506    USA
    2507    USA
    2508    USA
    2509    USA
    2510    USA
    2511    USA
    2512    USA
    2513    USA
    2514    USA
    2515    USA
    2516    USA
    2517    USA
    2518    USA
    2519    USA
    2520    USA
    2521    USA
    2522    USA
    2523    USA
    2524    USA
    2525    USA
    2526    USA
    2527    USA
    2528    USA
    2529    USA
    2530    USA
    2531    USA
    2532    USA
    2533    USA
    2534    USA
    2535    USA
    2536    USA
    2537    USA
    2538    USA
    2539    USA
    2540    USA
    2541    USA
    2542    USA
    2543    USA
    Name: state/region, dtype: object
     

    去除含有缺失数据的行

    In [112]:
    pop_areas_m.shape
    
    Out[112]:
    (2544, 6)
    In [114]:
    pop_areas_r = pop_areas_m.dropna()
    
    In [115]:
    pop_areas_r.shape
    
    Out[115]:
    (2476, 6)
     

    查看数据是否缺失

    In [116]:
    pop_areas_r.isnull().any()
    
    Out[116]:
    state/region     False
    ages             False
    year             False
    population       False
    state            False
    area (sq. mi)    False
    dtype: bool
     

    找出2010年的全民人口数据,df.query(查询语句)

    In [117]:
    pop_areas_r.head()
    
    Out[117]:
     state/regionagesyearpopulationstatearea (sq. mi)
    0 AL under18 2012 1117489.0 Alabama 52423.0
    1 AL total 2012 4817528.0 Alabama 52423.0
    2 AL under18 2010 1130966.0 Alabama 52423.0
    3 AL total 2010 4785570.0 Alabama 52423.0
    4 AL under18 2011 1125763.0 Alabama 52423.0
    In [120]:
    t_2010 = pop_areas_r.query("ages == 'total' and year == 2010")
    
    In [121]:
    t_2010.shape
    
    Out[121]:
    (52, 6)
    In [122]:
    t_2010
    
    Out[122]:
     state/regionagesyearpopulationstatearea (sq. mi)
    3 AL total 2010 4785570.0 Alabama 52423.0
    91 AK total 2010 713868.0 Alaska 656425.0
    101 AZ total 2010 6408790.0 Arizona 114006.0
    189 AR total 2010 2922280.0 Arkansas 53182.0
    197 CA total 2010 37333601.0 California 163707.0
    283 CO total 2010 5048196.0 Colorado 104100.0
    293 CT total 2010 3579210.0 Connecticut 5544.0
    379 DE total 2010 899711.0 Delaware 1954.0
    389 DC total 2010 605125.0 District of Columbia 68.0
    475 FL total 2010 18846054.0 Florida 65758.0
    485 GA total 2010 9713248.0 Georgia 59441.0
    570 HI total 2010 1363731.0 Hawaii 10932.0
    581 ID total 2010 1570718.0 Idaho 83574.0
    666 IL total 2010 12839695.0 Illinois 57918.0
    677 IN total 2010 6489965.0 Indiana 36420.0
    762 IA total 2010 3050314.0 Iowa 56276.0
    773 KS total 2010 2858910.0 Kansas 82282.0
    858 KY total 2010 4347698.0 Kentucky 40411.0
    869 LA total 2010 4545392.0 Louisiana 51843.0
    954 ME total 2010 1327366.0 Maine 35387.0
    965 MD total 2010 5787193.0 Maryland 12407.0
    1050 MA total 2010 6563263.0 Massachusetts 10555.0
    1061 MI total 2010 9876149.0 Michigan 96810.0
    1146 MN total 2010 5310337.0 Minnesota 86943.0
    1157 MS total 2010 2970047.0 Mississippi 48434.0
    1242 MO total 2010 5996063.0 Missouri 69709.0
    1253 MT total 2010 990527.0 Montana 147046.0
    1338 NE total 2010 1829838.0 Nebraska 77358.0
    1349 NV total 2010 2703230.0 Nevada 110567.0
    1434 NH total 2010 1316614.0 New Hampshire 9351.0
    1445 NJ total 2010 8802707.0 New Jersey 8722.0
    1530 NM total 2010 2064982.0 New Mexico 121593.0
    1541 NY total 2010 19398228.0 New York 54475.0
    1626 NC total 2010 9559533.0 North Carolina 53821.0
    1637 ND total 2010 674344.0 North Dakota 70704.0
    1722 OH total 2010 11545435.0 Ohio 44828.0
    1733 OK total 2010 3759263.0 Oklahoma 69903.0
    1818 OR total 2010 3837208.0 Oregon 98386.0
    1829 PA total 2010 12710472.0 Pennsylvania 46058.0
    1914 RI total 2010 1052669.0 Rhode Island 1545.0
    1925 SC total 2010 4636361.0 South Carolina 32007.0
    2010 SD total 2010 816211.0 South Dakota 77121.0
    2021 TN total 2010 6356683.0 Tennessee 42146.0
    2106 TX total 2010 25245178.0 Texas 268601.0
    2117 UT total 2010 2774424.0 Utah 84904.0
    2202 VT total 2010 625793.0 Vermont 9615.0
    2213 VA total 2010 8024417.0 Virginia 42769.0
    2298 WA total 2010 6742256.0 Washington 71303.0
    2309 WV total 2010 1854146.0 West Virginia 24231.0
    2394 WI total 2010 5689060.0 Wisconsin 65503.0
    2405 WY total 2010 564222.0 Wyoming 97818.0
    2490 PR total 2010 3721208.0 Puerto Rico 3515.0
     

    对查询结果进行处理,以state列作为新的行索引:set_index

    In [124]:
    t_2010.set_index('state',inplace=True)
    
    In [126]:
    t_2010
    
    Out[126]:
     state/regionagesyearpopulationarea (sq. mi)
    state     
    Alabama AL total 2010 4785570.0 52423.0
    Alaska AK total 2010 713868.0 656425.0
    Arizona AZ total 2010 6408790.0 114006.0
    Arkansas AR total 2010 2922280.0 53182.0
    California CA total 2010 37333601.0 163707.0
    Colorado CO total 2010 5048196.0 104100.0
    Connecticut CT total 2010 3579210.0 5544.0
    Delaware DE total 2010 899711.0 1954.0
    District of Columbia DC total 2010 605125.0 68.0
    Florida FL total 2010 18846054.0 65758.0
    Georgia GA total 2010 9713248.0 59441.0
    Hawaii HI total 2010 1363731.0 10932.0
    Idaho ID total 2010 1570718.0 83574.0
    Illinois IL total 2010 12839695.0 57918.0
    Indiana IN total 2010 6489965.0 36420.0
    Iowa IA total 2010 3050314.0 56276.0
    Kansas KS total 2010 2858910.0 82282.0
    Kentucky KY total 2010 4347698.0 40411.0
    Louisiana LA total 2010 4545392.0 51843.0
    Maine ME total 2010 1327366.0 35387.0
    Maryland MD total 2010 5787193.0 12407.0
    Massachusetts MA total 2010 6563263.0 10555.0
    Michigan MI total 2010 9876149.0 96810.0
    Minnesota MN total 2010 5310337.0 86943.0
    Mississippi MS total 2010 2970047.0 48434.0
    Missouri MO total 2010 5996063.0 69709.0
    Montana MT total 2010 990527.0 147046.0
    Nebraska NE total 2010 1829838.0 77358.0
    Nevada NV total 2010 2703230.0 110567.0
    New Hampshire NH total 2010 1316614.0 9351.0
    New Jersey NJ total 2010 8802707.0 8722.0
    New Mexico NM total 2010 2064982.0 121593.0
    New York NY total 2010 19398228.0 54475.0
    North Carolina NC total 2010 9559533.0 53821.0
    North Dakota ND total 2010 674344.0 70704.0
    Ohio OH total 2010 11545435.0 44828.0
    Oklahoma OK total 2010 3759263.0 69903.0
    Oregon OR total 2010 3837208.0 98386.0
    Pennsylvania PA total 2010 12710472.0 46058.0
    Rhode Island RI total 2010 1052669.0 1545.0
    South Carolina SC total 2010 4636361.0 32007.0
    South Dakota SD total 2010 816211.0 77121.0
    Tennessee TN total 2010 6356683.0 42146.0
    Texas TX total 2010 25245178.0 268601.0
    Utah UT total 2010 2774424.0 84904.0
    Vermont VT total 2010 625793.0 9615.0
    Virginia VA total 2010 8024417.0 42769.0
    Washington WA total 2010 6742256.0 71303.0
    West Virginia WV total 2010 1854146.0 24231.0
    Wisconsin WI total 2010 5689060.0 65503.0
    Wyoming WY total 2010 564222.0 97818.0
    Puerto Rico PR total 2010 3721208.0 3515.0
     

    计算人口密度。注意是Series/Series,其结果还是一个Series。

    In [127]:
    pop_density = t_2010['population']/t_2010["area (sq. mi)"]
    pop_density
    
    Out[127]:
    state
    Alabama                   91.287603
    Alaska                     1.087509
    Arizona                   56.214497
    Arkansas                  54.948667
    California               228.051342
    Colorado                  48.493718
    Connecticut              645.600649
    Delaware                 460.445752
    District of Columbia    8898.897059
    Florida                  286.597129
    Georgia                  163.409902
    Hawaii                   124.746707
    Idaho                     18.794338
    Illinois                 221.687472
    Indiana                  178.197831
    Iowa                      54.202751
    Kansas                    34.745266
    Kentucky                 107.586994
    Louisiana                 87.676099
    Maine                     37.509990
    Maryland                 466.445797
    Massachusetts            621.815538
    Michigan                 102.015794
    Minnesota                 61.078373
    Mississippi               61.321530
    Missouri                  86.015622
    Montana                    6.736171
    Nebraska                  23.654153
    Nevada                    24.448796
    New Hampshire            140.799273
    New Jersey              1009.253268
    New Mexico                16.982737
    New York                 356.094135
    North Carolina           177.617157
    North Dakota               9.537565
    Ohio                     257.549634
    Oklahoma                  53.778278
    Oregon                    39.001565
    Pennsylvania             275.966651
    Rhode Island             681.339159
    South Carolina           144.854594
    South Dakota              10.583512
    Tennessee                150.825298
    Texas                     93.987655
    Utah                      32.677188
    Vermont                   65.085075
    Virginia                 187.622273
    Washington                94.557817
    West Virginia             76.519582
    Wisconsin                 86.851900
    Wyoming                    5.768079
    Puerto Rico             1058.665149
    dtype: float64
     

    排序,并找出人口密度最高的五个州sort_values()

    In [128]:
    type(pop_density)
    
    Out[128]:
    pandas.core.series.Series
    In [130]:
    pop_density.sort_values(inplace=True)
    
     

    找出人口密度最低的五个州

    In [131]:
    pop_density[:5]
    
    Out[131]:
    state
    Alaska           1.087509
    Wyoming          5.768079
    Montana          6.736171
    North Dakota     9.537565
    South Dakota    10.583512
    dtype: float64
    In [132]:
    pop_density.tail()
    
    Out[132]:
    state
    Connecticut              645.600649
    Rhode Island             681.339159
    New Jersey              1009.253268
    Puerto Rico             1058.665149
    District of Columbia    8898.897059
    dtype: float64
     

    要点总结:

    • 统一用loc()索引
    • 善于使用.isnull().any()找到存在NaN的列
    • 善于使用.unique()确定该列中哪些key是我们需要的
    • 一般使用外合并、左合并,目的只有一个:宁愿该列是NaN也不要丢弃其他列的信息
     

    回顾:Series/DataFrame运算与ndarray运算的区别

     
    • Series与DataFrame没有广播,如果对应index没有值,则记为NaN;或者使用add的fill_value来补缺失值
    • ndarray有广播,通过重复已有值来计算
  • 相关阅读:
    hdu 1251 字典树模板题 ---多串 查找单词出现次数
    一个极其简洁的Python网页抓取程序,自己主动从雅虎財经抓取股票数据
    JSONObject与JSONArray的使用
    关于DPM(Deformable Part Model)算法中模型结构的解释
    fullcalendar日历控件知识点集合
    android--自己定义ProgressDialog显示位置(其他Dialog子类都能够设置)
    最简单的Java框架
    java final keyword
    IBinder对象在进程间传递的形式(一)
    windows的定时任务设置
  • 原文地址:https://www.cnblogs.com/bilx/p/11535559.html
Copyright © 2011-2022 走看看