  • pandas (5): Handling Missing Data and Hierarchical Indexing

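    pandas uses the floating-point value NaN (Not a Number) to represent missing data in both floating-point and non-floating-point arrays. It is simply a sentinel value that is easy to detect.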

    >>> import numpy as np
    >>> import pandas as pd
    >>> from pandas import Series, DataFrame
    >>> string_data = Series(['aardvark','artichoke',np.nan,'avocado'])
    >>> string_data
    0     aardvark
    1    artichoke
    2          NaN
    3      avocado
    dtype: object
    >>> string_data.isnull()
    0    False
    1    False
    2     True
    3    False
    dtype: bool
    >>> string_data.notnull()
    0     True
    1     True
    2    False
    3     True
    dtype: bool
    >>> string_data.fillna("miss")
    0     aardvark
    1    artichoke
    2         miss
    3      avocado
    dtype: object
    >>> string_data
    0     aardvark
    1    artichoke
    2          NaN
    3      avocado
    dtype: object
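
As a small illustration of why NaN is described as a sentinel, the sketch below (not part of the original post) shows that NaN never compares equal to itself, which is why isnull and notnull are used instead of ==. It assumes a pandas version where isna exists as an alias of isnull, and the sample values are made up for the example.

```python
import numpy as np
import pandas as pd

# NaN never compares equal to itself, so == cannot be used to find it.
nan_equals_itself = (np.nan == np.nan)

# None is treated as missing too, alongside NaN.
string_data = pd.Series(["aardvark", "artichoke", np.nan, None])
missing_mask = string_data.isna()          # isna is an alias of isnull
missing_count = int(missing_mask.sum())    # True counts as 1

print(nan_equals_itself, missing_count)
```

The boolean mask can be summed directly, which is a convenient way to count missing values per Series.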

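    NA handling methods

    Method   Description
    dropna   Filters axis labels based on whether their values contain missing data; a threshold adjusts how much missing data is tolerated
    fillna   Fills missing data with a specified value or with a fill method such as ffill or bfill
    isnull   Returns an object of boolean values indicating which values are missing; the result has the same type as the original object
    notnull  The negation of isnull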

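    A closer look at the dropna method:

      Common parameters:

        axis: the axis to filter along (0 drops rows, 1 drops columns)

        how: "any" or "all"; "any" drops a row or column that contains at least one missing value, "all" drops it only when every value in it is missing

        thresh: an integer, the minimum number of non-null values a row or column must have in order to be kept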

    Filtering out missing data

    For a Series there are two ways to do this:

      

    >>> from numpy import nan as NA
    >>> data = Series([1,NA,3.2,NA,5])
    >>> data
    0    1.0
    1    NaN
    2    3.2
    3    NaN
    4    5.0
    dtype: float64
    # Method 1: dropna
    >>> data.dropna()
    0    1.0
    2    3.2
    4    5.0
    dtype: float64
    # Method 2: boolean indexing with notnull
    >>> data[data.notnull()]
    0    1.0
    2    3.2
    4    5.0
    dtype: float64

    For a DataFrame, things are a little more complicated. By default, dropna drops every row that contains at least one missing value.

    >>> frame = DataFrame([[1,6.5,3],[1,NA,NA],[NA,NA,NA],[NA,6.5,3]])
    >>> frame
         0    1    2
    0  1.0  6.5  3.0
    1  1.0  NaN  NaN
    2  NaN  NaN  NaN
    3  NaN  6.5  3.0
    >>> clean_data = frame.dropna()  # by default, drops every row that contains a missing value
    >>> clean_data
         0    1    2
    0  1.0  6.5  3.0
    
    >>> frame.dropna(how='all')  # drops only the rows in which every value is missing
         0    1    2
    0  1.0  6.5  3.0
    1  1.0  NaN  NaN
    3  NaN  6.5  3.0
    >>> frame.dropna(axis=1, how='all')  # drops columns in which every value is missing (none here, so nothing changes)
         0    1    2
    0  1.0  6.5  3.0
    1  1.0  NaN  NaN
    2  NaN  NaN  NaN
    3  NaN  6.5  3.0
    >>> frame.dropna(thresh=2)  # keeps only the rows with at least 2 non-missing values
         0    1    2
    0  1.0  6.5  3.0
    3  NaN  6.5  3.0
    >>>
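
The meaning of thresh is easy to misread, so here is a minimal sketch (an illustration added to the post's example, using the same values as frame above) that writes the same filter by hand: count the non-missing values per row and keep the rows that reach the threshold.

```python
import numpy as np
import pandas as pd

NA = np.nan
frame = pd.DataFrame([[1, 6.5, 3], [1, NA, NA], [NA, NA, NA], [NA, 6.5, 3]])

# thresh=2 keeps rows that have at least 2 non-missing values.
by_thresh = frame.dropna(thresh=2)

# The same filter spelled out: count non-missing values in each row.
non_missing_per_row = frame.notna().sum(axis=1)
by_mask = frame[non_missing_per_row >= 2]

print(by_thresh.equals(by_mask))
```

Both forms keep rows 0 and 3. The explicit mask is handy when the rule is more involved than a single count.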

    Filling in missing data

    For a DataFrame:

    >>> df = DataFrame(np.random.randn(7,3))
    >>> df.loc[:4, 1] = NA  # .ix was removed in pandas 1.0; .loc slices by label, so rows 0 through 4 are included
    >>> df.loc[:2, 2] = NA
    >>> df
              0         1         2
    0 -1.362151       NaN       NaN
    1 -0.465262       NaN       NaN
    2  0.037518       NaN       NaN
    3 -2.895224       NaN -2.514141
    4 -0.635875       NaN  1.722823
    5 -0.479897  0.999354 -0.547433
    6 -0.744960  0.363400  0.706812
    >>> df.fillna(0)  # element-wise fill with a single value
              0         1         2
    0 -1.362151  0.000000  0.000000
    1 -0.465262  0.000000  0.000000
    2  0.037518  0.000000  0.000000
    3 -2.895224  0.000000 -2.514141
    4 -0.635875  0.000000  1.722823
    5 -0.479897  0.999354 -0.547433
    6 -0.744960  0.363400  0.706812
    # Fill each column with a different value by passing a dict keyed by column label
    >>> df.fillna({1: 0.5, 2: -1})
              0         1         2
    0 -1.362151  0.500000 -1.000000
    1 -0.465262  0.500000 -1.000000
    2  0.037518  0.500000 -1.000000
    3 -2.895224  0.500000 -2.514141
    4 -0.635875  0.500000  1.722823
    5 -0.479897  0.999354 -0.547433
    6 -0.744960  0.363400  0.706812
    >>> df.fillna(method='bfill')  # the method option selects forward ('ffill') or backward ('bfill') filling
              0         1         2
    0 -1.362151  0.999354 -2.514141
    1 -0.465262  0.999354 -2.514141
    2  0.037518  0.999354 -2.514141
    3 -2.895224  0.999354 -2.514141
    4 -0.635875  0.999354  1.722823
    5 -0.479897  0.999354 -0.547433
    6 -0.744960  0.363400  0.706812
    >>> df.fillna(method='bfill', limit=2)  # back-fill at most 2 consecutive missing values per gap
              0         1         2
    0 -1.362151       NaN       NaN
    1 -0.465262       NaN -2.514141
    2  0.037518       NaN -2.514141
    3 -2.895224  0.999354 -2.514141
    4 -0.635875  0.999354  1.722823
    5 -0.479897  0.999354 -0.547433
    6 -0.744960  0.363400  0.706812
    >>>

    By default fillna returns a new object. To modify the original data in place, pass inplace=True.

    >>> df.fillna(0,inplace = True)
    >>> df
              0         1         2
    0 -1.362151  0.000000  0.000000
    1 -0.465262  0.000000  0.000000
    2  0.037518  0.000000  0.000000
    3 -2.895224  0.000000 -2.514141
    4 -0.635875  0.000000  1.722823
    5 -0.479897  0.999354 -0.547433
    6 -0.744960  0.363400  0.706812
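
To make the difference between the two behaviours concrete, the following sketch (with made-up values, not taken from the post) checks that a plain fillna call leaves the original untouched, and shows rebinding the name as an alternative to inplace=True.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 2.0]})

filled = df.fillna(0)   # returns a new object; df itself is unchanged
original_still_has_nan = bool(df.isna().any().any())

df = df.fillna(0)       # rebinding the name is an alternative to inplace=True
print(original_still_has_nan, filled.equals(df))
```

Rebinding keeps the call chainable, which is one reason some code bases prefer it over inplace=True. That is a style choice rather than a requirement.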

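    Parameters of fillna

    Parameter  Description
    method     Fill method: forward fill (ffill) or backward fill (bfill)
    value      Scalar value or dict-like object used to fill missing values
    axis       Axis to fill along
    inplace    Modify the calling object without producing a copy
    limit      Maximum number of consecutive missing values to fill when forward or backward filling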
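
The method option has dedicated shortcuts, ffill and bfill. To my knowledge, recent pandas releases (2.1 and later) deprecate the method keyword of fillna in favour of them, so the sketch below uses the shortcuts. The column names x and y and their values are invented for the example. It also shows filling each column with its own mean, a common use of passing a per-column mapping as value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [np.nan, np.nan, np.nan, 4.0],
    "y": [1.0, np.nan, 3.0, np.nan],
})

back_filled = df.bfill()             # fill each gap from the next valid value below it
back_limited = df.bfill(limit=2)     # fill at most 2 consecutive NaNs per gap
forward_filled = df.ffill()          # fill each gap from the last valid value above it
mean_filled = df.fillna(df.mean())   # df.mean() is a Series keyed by column label

print(back_limited)
print(mean_filled)
```

Note that a trailing NaN survives bfill and a leading NaN survives ffill, because there is no value on that side to copy from.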

    Hierarchical indexing

    Hierarchical indexing lets you have multiple index levels on a single axis.

    Creating a hierarchical index

    >>> data = Series(np.random.randn(10),index=[['a','a','a','b','b','b','c','c','d','d'],[1,2,3,1,2,3,1,2,1,2]])
    >>> data
    a  1   -0.450814
       2   -0.776317
       3   -0.140582
    b  1   -0.717184
       2    0.943802
       3    0.972454
    c  1   -0.390725
       2   -1.340875
    d  1   -0.648987
       2   -0.960173
    dtype: float64
    >>> data.index
    MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
               labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 0, 1]])
    >>>
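
The index can also be built explicitly. The sketch below uses pd.MultiIndex.from_arrays with the same keys as above. The level names key and num are hypothetical, chosen only for this example, and the values are a simple range instead of random numbers so the output is reproducible. In pandas 0.24 and later the integer positions shown as labels in the repr above are exposed as codes.

```python
import numpy as np
import pandas as pd

outer = ["a", "a", "a", "b", "b", "b", "c", "c", "d", "d"]
inner = [1, 2, 3, 1, 2, 3, 1, 2, 1, 2]

# Explicit construction; passing index=[outer, inner] to Series builds the same index.
index = pd.MultiIndex.from_arrays([outer, inner], names=["key", "num"])
data = pd.Series(np.arange(10.0), index=index)

print(index.nlevels)
print(list(index.levels[0]), list(index.levels[1]))
print(list(index.codes[0]))   # position of each row's outer key within levels[0]
```

Each level stores its distinct values once, and codes records which value every row uses. That is why the repr shows two short lists of levels and two long lists of integers.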

    Selecting subsets with a hierarchical index

    >>> data['a']
    1   -0.450814
    2   -0.776317
    3   -0.140582
    dtype: float64
    >>> data['c':'d']
    c  1   -0.390725
       2   -1.340875
    d  1   -0.648987
       2   -0.960173
    dtype: float64
    >>> data.loc[['a','c']]  # .ix was removed in pandas 1.0; .loc selects by label
    a  1   -0.450814
       2   -0.776317
       3   -0.140582
    c  1   -0.390725
       2   -1.340875
    dtype: float64
    Selecting from the inner level:
    >>> data['a',2]
    -0.7763173836675796
    >>> data[:,2]
    a   -0.776317
    b    0.943802
    c   -1.340875
    d   -0.960173
    dtype: float64
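
The same selections can be written with explicit label-based accessors, which some readers find easier to follow than bare brackets. This is a sketch with reproducible values (a range instead of random numbers). xs is a pandas method for taking a cross-section at a given level.

```python
import numpy as np
import pandas as pd

data = pd.Series(
    np.arange(10.0),
    index=[["a", "a", "a", "b", "b", "b", "c", "c", "d", "d"],
           [1, 2, 3, 1, 2, 3, 1, 2, 1, 2]],
)

outer_a = data.loc["a"]             # all rows under outer key 'a'
outer_range = data.loc["c":"d"]     # label slice on the outer level, both ends included
outer_pick = data.loc[["a", "c"]]   # a list of outer keys
single = data.loc[("a", 2)]         # one value: outer 'a', inner 2
inner_two = data.xs(2, level=1)     # inner key 2 across every outer key

print(single)
print(inner_two)
```

data.xs(2, level=1) corresponds to data[:, 2] above: it picks inner key 2 under every outer key and drops the inner level from the result.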

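    stack and unstack convert between a hierarchically indexed Series and a DataFrame: stack moves the columns into the innermost index level, and unstack moves the innermost index level into the columns.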

    >>> frame
         0    1    2
    0  1.0  6.5  3.0
    1  1.0  NaN  NaN
    2  NaN  NaN  NaN
    3  NaN  6.5  3.0
    >>> frame.stack()
    0  0    1.0
       1    6.5
       2    3.0
    1  0    1.0
    3  1    6.5
       2    3.0
    dtype: float64
    >>> data.unstack()
              1         2         3
    a -0.450814 -0.776317 -0.140582
    b -0.717184  0.943802  0.972454
    c -0.390725 -1.340875       NaN
    d -0.648987 -0.960173       NaN
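
The two operations are inverses of each other up to missing cells, which the following sketch checks as a round trip. It uses reproducible values instead of random numbers. The explicit dropna() is an assumption made for portability: the outputs above show stack discarding NaN, but as far as I know newer pandas versions may keep those cells, so dropping them explicitly gives the same result either way.

```python
import numpy as np
import pandas as pd

data = pd.Series(
    np.arange(10.0),
    index=[["a", "a", "a", "b", "b", "b", "c", "c", "d", "d"],
           [1, 2, 3, 1, 2, 3, 1, 2, 1, 2]],
)

wide = data.unstack()   # inner level becomes the columns 1, 2, 3
# 'c' and 'd' have no inner key 3, so those two cells are NaN in the wide form.

roundtrip = wide.stack().dropna()   # back to a Series with a two-level index

print(wide)
print(roundtrip.index.tolist() == data.index.tolist())
```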

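    Reordering and sorting levels

    swaplevel swaps two levels of a hierarchical index, identified either by level number or by name.

    sortlevel sorts the data by the values of a given level (it is deprecated in favour of sort_index(level=...), as the warning below shows).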

    >>> frame = DataFrame(np.random.randn(5,4),index = [['a','a','a','b','b'],[1,2,3,1,2]],columns = pd.MultiIndex.from_arrays([['o','o','w','w'],[1,2,1,2]],names=['color','num']))
    >>> frame
    color         o                   w
    num           1         2         1         2
    a 1    1.558178  1.614265  0.674642 -0.269209
      2   -0.324755 -0.486829 -1.086918 -0.496748
      3    0.283367 -0.518154  0.551998  0.747767
    b 1    0.904257  1.315240  0.328065 -0.006465
      2    0.249438  0.946020  1.572290 -0.198329
    >>> frame.index.names = ['name','age']
    >>> frame
    color            o                   w
    num              1         2         1         2
    name age
    a    1    1.558178  1.614265  0.674642 -0.269209
         2   -0.324755 -0.486829 -1.086918 -0.496748
         3    0.283367 -0.518154  0.551998  0.747767
    b    1    0.904257  1.315240  0.328065 -0.006465
         2    0.249438  0.946020  1.572290 -0.198329
    >>> frame.swaplevel('name','age')
    color            o                   w
    num              1         2         1         2
    age name
    1   a     1.558178  1.614265  0.674642 -0.269209
    2   a    -0.324755 -0.486829 -1.086918 -0.496748
    3   a     0.283367 -0.518154  0.551998  0.747767
    1   b     0.904257  1.315240  0.328065 -0.006465
    2   b     0.249438  0.946020  1.572290 -0.198329
    >>> frame.sortlevel(1)
    __main__:1: FutureWarning: sortlevel is deprecated, use sort_index(level= ...)
    color            o                   w
    num              1         2         1         2
    name age
    a    1    1.558178  1.614265  0.674642 -0.269209
    b    1    0.904257  1.315240  0.328065 -0.006465
    a    2   -0.324755 -0.486829 -1.086918 -0.496748
    b    2    0.249438  0.946020  1.572290 -0.198329
    a    3    0.283367 -0.518154  0.551998  0.747767
    >>> frame.sort_index(level=1)  # sortlevel is being phased out; use the level option of sort_index instead
    color            o                   w
    num              1         2         1         2
    name age
    a    1    1.558178  1.614265  0.674642 -0.269209
    b    1    0.904257  1.315240  0.328065 -0.006465
    a    2   -0.324755 -0.486829 -1.086918 -0.496748
    b    2    0.249438  0.946020  1.572290 -0.198329
    a    3    0.283367 -0.518154  0.551998  0.747767
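
A compact, reproducible version of the two operations is sketched below. The single column score and its values are invented for the example. The index uses the same name and age levels as above.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [["a", "a", "a", "b", "b"], [1, 2, 3, 1, 2]], names=["name", "age"]
)
frame = pd.DataFrame({"score": [10, 20, 30, 40, 50]}, index=index)

swapped = frame.swaplevel("name", "age")   # levels trade places, row order is unchanged
by_age = frame.sort_index(level="age")     # rows sorted by the 'age' level, then by 'name'
swapped_sorted = swapped.sort_index()      # sorted by age first, since it is now the outer level

print(by_age)
print(swapped_sorted)
```

swaplevel on its own does not reorder the rows, so it is often followed by sort_index when a sorted outer level is needed for slicing.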

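    Summary statistics by level

    Many descriptive and summary statistics on DataFrame and Series accept a level option that specifies the level to aggregate over on a given axis. Note that this option was deprecated in pandas 1.3 and removed in 2.0, where groupby(level=...) followed by the aggregation is the replacement.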

    >>> frame.sum(level = 'age')
    color         o                   w
    num           1         2         1         2
    age
    1      2.462435  2.929505  1.002707 -0.275673
    2     -0.075318  0.459191  0.485372 -0.695077
    3      0.283367 -0.518154  0.551998  0.747767
    >>> frame.sum(level = 'color',axis =1)
    color            o         w
    name age
    a    1    3.172443  0.405433
         2   -0.811584 -1.583666
         3   -0.234786  1.299765
    b    1    2.219497  0.321600
         2    1.195458  1.373961
    >>>
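
On pandas versions where the level option of sum is no longer available, the same per-level totals can be obtained with groupby. The sketch below is one way to do it, using a reproducible frame with the same index and column structure as above. Transposing before grouping the columns is a choice made here to avoid the axis argument of groupby, which I believe is also being phased out.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [["a", "a", "a", "b", "b"], [1, 2, 3, 1, 2]], names=["name", "age"]
)
columns = pd.MultiIndex.from_arrays(
    [["o", "o", "w", "w"], [1, 2, 1, 2]], names=["color", "num"]
)
frame = pd.DataFrame(np.arange(20.0).reshape(5, 4), index=index, columns=columns)

# Row-wise: add up all rows that share the same 'age'.
sum_by_age = frame.groupby(level="age").sum()

# Column-wise: transpose, group the former columns by 'color', transpose back.
sum_by_color = frame.T.groupby(level="color").sum().T

print(sum_by_age)
print(sum_by_color)
```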

    Using DataFrame columns as a hierarchical row index

    >>> frame = DataFrame({'a':range(7),'b':range(7,0,-1),'c':['o','o','o','t','t','f','f'],'d':[1,2,3,4,1,2,3]})
    >>> frame
       a  b  c  d
    0  0  7  o  1
    1  1  6  o  2
    2  2  5  o  3
    3  3  4  t  4
    4  4  3  t  1
    5  5  2  f  2
    6  6  1  f  3
    >>> frame2 = frame.set_index(['c','d'])  # turn one or more columns into the row index
    >>> frame2
         a  b
    c d
    o 1  0  7
      2  1  6
      3  2  5
    t 4  3  4
      1  4  3
    f 2  5  2
      3  6  1
    >>> frame2.reset_index(['c','d'])  # move the hierarchical index levels back into columns
       c  d  a  b
    0  o  1  0  7
    1  o  2  1  6
    2  o  3  2  5
    3  t  4  3  4
    4  t  1  4  3
    5  f  2  5  2
    6  f  3  6  1

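    When columns are converted into a hierarchical row index, they are removed from the DataFrame by default. To keep them as regular columns as well, pass drop=False.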

    >>> frame3 = frame.set_index(['c','d'],drop=False)
    >>> frame3
         a  b  c  d
    c d
    o 1  0  7  o  1
      2  1  6  o  2
      3  2  5  o  3
    t 4  3  4  t  4
      1  4  3  t  1
    f 2  5  2  f  2
      3  6  1  f  3
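
As a quick check that set_index and reset_index undo each other, the sketch below round-trips the same frame and then looks up a single row by its (c, d) key. The variable names indexed, restored and row are introduced only for this example.

```python
import pandas as pd

frame = pd.DataFrame({
    "a": range(7),
    "b": range(7, 0, -1),
    "c": ["o", "o", "o", "t", "t", "f", "f"],
    "d": [1, 2, 3, 4, 1, 2, 3],
})

indexed = frame.set_index(["c", "d"])   # columns c and d become index levels
restored = indexed.reset_index()        # with no argument, every level is moved back

# The former index columns come first after the round trip,
# so restore the original column order before comparing.
same_data = restored[list(frame.columns)].equals(frame)

row = indexed.loc[("t", 1)]             # one row, addressed by its full (c, d) key
print(same_data, row["a"], row["b"])
```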
  • Original article: https://www.cnblogs.com/zuoshoushizi/p/8735497.html