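  • Pandas Beginner Study Notes 2

    2 Basic Functionality

    These notes cover only basic functionality; more advanced topics can be explored as the need arises.

    2.1 Reindexing

    reindex is an important pandas method. The examples in these notes assume numpy has been imported as np and that Series and DataFrame have been imported from pandas. For example: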

    In [101]: obj = Series([4,7,-5,3.4],index=['c','a','b','d'])
    
    In [102]: obj
    Out[102]:
    c    4.0
    a    7.0
    b   -5.0
    d    3.4
    dtype: float64
    
    In [103]: obj2 = obj.reindex(['a','b','c','d','e'])
    
    In [104]: obj2
    Out[104]:
    a    7.0
    b   -5.0
    c    4.0
    d    3.4
    e    NaN
    dtype: float64
    
    # The value used for missing entries can be customized
    
    In [105]: obj.reindex(['a','b','c','d','e'],fill_value=0)
    Out[105]:
    a    7.0
    b   -5.0
    c    4.0
    d    3.4
    e    0.0  # missing value filled in
    dtype: float64
    
    
    

    Options for the method (interpolation) argument of reindex:

    Argument           Description
    ffill or pad       Fill values forward
    bfill or backfill  Fill values backward

    In [106]: obj3 = Series(['blue','purple','yellow'],index=[0,2,4])
    
    # Forward fill
    In [107]: obj3.reindex(range(6),method='ffill')
    Out[107]:
    0      blue
    1      blue
    2    purple
    3    purple
    4    yellow
    5    yellow
    dtype: object
    
    # Backward fill
    In [109]: obj3.reindex(range(6),method='bfill')
    Out[109]:
    0      blue
    1    purple
    2    purple
    3    yellow
    4    yellow
    5       NaN
    dtype: object
    

    For a DataFrame, reindex can alter the rows, the columns, or both.

    In [111]: frame = DataFrame(np.arange(9).reshape(3,3), index=['a','b','c'],columns=['Ohio','Texas','California'])
    
    In [112]: frame
    Out[112]:
       Ohio  Texas  California
    a     0      1           2
    b     3      4           5
    c     6      7           8
    
    In [113]: frame2 = frame.reindex(['a','b','c','d'])  # reindexes the rows by default
    
    In [115]: frame2
    Out[115]:
       Ohio  Texas  California
    a   0.0    1.0         2.0
    b   3.0    4.0         5.0
    c   6.0    7.0         8.0
    d   NaN    NaN         NaN
    
    In [116]: states = ['Texas','Utah','California']
    
    In [117]: frame.reindex(columns=states)  # reindex the columns
    Out[117]:
       Texas  Utah  California
    a      1   NaN           2
    b      4   NaN           5
    c      7   NaN           8
    
    # Reindex both rows and columns, with filling.
    # In the pandas version used here, the fill is applied only along axis 0, i.e. row-wise.
    In [118]: frame.reindex(index=['a','b','c','d'],method='ffill',columns=states)
    Out[118]:
       Texas  Utah  California
    a      1   NaN           2
    b      4   NaN           5
    c      7   NaN           8
    d      7   NaN           8
    
    # ix is more concise. (ix was deprecated in pandas 0.20 and removed in 1.0.)
    In [119]: frame.ix[['a','b','c','d'],states]
    Out[119]:
       Texas  Utah  California
    a    1.0   NaN         2.0
    b    4.0   NaN         5.0
    c    7.0   NaN         8.0
    d    NaN   NaN         NaN
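
Since ix is no longer available in current pandas, the same row-and-column selection can be sketched with reindex, which tolerates labels that are missing from the frame. The pd and np aliases and the variable name result are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['a', 'b', 'c'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# reindex accepts labels that do not exist yet and fills them with NaN,
# so row 'd' and column 'Utah' appear as all-NaN entries.
result = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)
print(result)
```

Unlike loc, which raises a KeyError for missing labels in recent pandas releases, reindex is designed to introduce new labels.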
    

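    [Table: parameters of the reindex function]

    2.2 Dropping Entries from an Axis

    To drop entries, pass a single index label or a list of labels. The drop method returns a new object with the specified values removed.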

    In [120]: obj = Series(np.arange(5.),index=['a','b','c','d','e'])
    
    In [121]: new_obj = obj.drop('c')
    
    In [122]: new_obj
    Out[122]:
    a    0.0
    b    1.0
    d    3.0
    e    4.0
    dtype: float64
    
    In [124]: obj.drop(['d','c'])
    Out[124]:
    a    0.0
    b    1.0
    e    4.0
    dtype: float64
    
    

    For a DataFrame, index values can be dropped from either axis.

    In [125]: data = DataFrame(np.arange(16).reshape(4,4),index=['Ohio','Colorado',
         ...: 'Utah','New York'],columns=['one','two','three','four'])
    
    In [126]: data
    Out[126]:
              one  two  three  four
    Ohio        0    1      2     3
    Colorado    4    5      6     7
    Utah        8    9     10    11
    New York   12   13     14    15
    
    In [127]: data.drop(['Colorado','Ohio'])
    Out[127]:
              one  two  three  four
    Utah        8    9     10    11
    New York   12   13     14    15
    
    In [128]: data.drop(['two',],axis=1)
    Out[128]:
              one  three  four
    Ohio        0      2     3
    Colorado    4      6     7
    Utah        8     10    11
    New York   12     14    15
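
As a small sketch of the same drops, the index= and columns= keywords of drop name the axis explicitly instead of relying on axis=0 or axis=1. The pd and np aliases and the result names are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Drop rows by label; equivalent to data.drop([...], axis=0).
rows_dropped = data.drop(index=['Colorado', 'Ohio'])

# Drop a column by label; equivalent to data.drop([...], axis=1).
cols_dropped = data.drop(columns=['two'])

print(rows_dropped)
print(cols_dropped)
print(data.shape)  # the original frame is left unchanged
```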
    

    2.3 Indexing, Selection, and Filtering

    In [129]: obj = Series(np.arange(4.),index=['a','b','c','d'])
    
    In [130]: obj['a']  # index by label
    Out[130]: 0.0
    
    In [131]: obj[0]    # index by integer position
    Out[131]: 0.0
    
    In [132]: obj[1]
    Out[132]: 1.0
    
    In [133]: obj
    Out[133]:
    a    0.0
    b    1.0
    c    2.0
    d    3.0
    dtype: float64
    
    In [134]: obj[1:2]  # slice by position
    Out[134]:
    b    1.0
    dtype: float64
    
    In [135]: obj[1:3]
    Out[135]:
    b    1.0
    c    2.0
    dtype: float64
    
    In [136]: obj[obj<2]  # filter with a boolean condition on the values
    Out[136]:
    a    0.0
    b    1.0
    dtype: float64
    
    In [137]: obj['b':'c']  # slice by label; note that both endpoints are included
    Out[137]:
    b    1.0
    c    2.0
    dtype: float64
    
    In [138]: obj['b':'c'] = 100  # assignment
    
    In [139]: obj
    Out[139]:
    a      0.0
    b    100.0
    c    100.0
    d      3.0
    dtype: float64
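
The difference between label-based and position-based access can be made explicit with loc and iloc. This is a minimal sketch; the pd and np aliases and the variable names are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['a']         # label-based lookup
by_position = obj.iloc[0]       # position-based lookup

label_slice = obj.loc['b':'c']  # label slices include both endpoints
position_slice = obj.iloc[1:3]  # position slices exclude the end, as in Python

print(label_slice)
print(position_slice)
```

Spelling out loc or iloc avoids the ambiguity of obj[0] on a Series whose index holds labels rather than integers.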
    
    

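    For a DataFrame, indexing with [] retrieves one or more columns by default.
    Column names: select columns
    An integer slice or boolean values: select rows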

    In [140]: data
    Out[140]:
              one  two  three  four
    Ohio        0    1      2     3
    Colorado    4    5      6     7
    Utah        8    9     10    11
    New York   12   13     14    15
    
    In [141]: data['two']  # select the column named 'two'
    Out[141]:
    Ohio         1
    Colorado     5
    Utah         9
    New York    13
    Name: two, dtype: int32
    
    In [142]: data[['two','one']]  # select columns in the requested order
    Out[142]:
              two  one
    Ohio        1    0
    Colorado    5    4
    Utah        9    8
    New York   13   12
    
    In [143]: data[:2]  # first two rows; an integer slice selects rows
    Out[143]:
              one  two  three  four
    Ohio        0    1      2     3
    Colorado    4    5      6     7
    
    In [144]: data[data['three']>5]  # rows where column 'three' is greater than 5
    Out[144]:
              one  two  three  four
    Colorado    4    5      6     7
    Utah        8    9     10    11
    New York   12   13     14    15
    

    Syntactically, a DataFrame behaves much like an ndarray here.

    In [146]: data < 5
    Out[146]:
                one    two  three   four
    Ohio       True   True   True   True
    Colorado   True  False  False  False
    Utah      False  False  False  False
    New York  False  False  False  False
    
    In [147]: data[data<5] = 0
    
    In [148]: data
    Out[148]:
              one  two  three  four
    Ohio        0    0      0     0
    Colorado    0    5      6     7
    Utah        8    9     10    11
    New York   12   13     14    15
    
    

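    The indexing field ix:
    It selects a subset of rows and columns from a DataFrame using NumPy-like notation together with axis labels.
    The ix syntax is also very concise. Note that ix was deprecated in pandas 0.20 and removed in 1.0, where loc (label-based) and iloc (position-based) take its place.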

    In [150]: data.ix['Colorado',['two','three']]
    Out[150]:
    two      5
    three    6
    Name: Colorado, dtype: int32
    
    In [151]: data.ix[['Colorado','Utah'],[3,0,1]]
    Out[151]:
              four  one  two
    Colorado     7    0    5
    Utah        11    8    9
    
    In [152]: data.ix[2]
    Out[152]:
    one       8
    two       9
    three    10
    four     11
    Name: Utah, dtype: int32
    
    In [153]: data.ix[:'Utah','two']
    Out[153]:
    Ohio        0
    Colorado    5
    Utah        9
    Name: two, dtype: int32
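
The four ix selections above can be sketched with loc and iloc for pandas versions that no longer provide ix. The pd and np aliases and the result names (row, sub, third, upto) are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data[data < 5] = 0  # same state as the frame used in the transcript

# Row label plus column labels: loc is purely label-based.
row = data.loc['Colorado', ['two', 'three']]

# Row labels plus column positions: translate positions to labels first.
sub = data.loc[['Colorado', 'Utah'], data.columns[[3, 0, 1]]]

# A single row by position.
third = data.iloc[2]

# Label slice on rows (inclusive of 'Utah') and one column.
upto = data.loc[:'Utah', 'two']

print(row, sub, third, upto, sep='\n\n')
```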
    

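    [Table: DataFrame indexing options]

    2.4 Arithmetic and Data Alignment

    The result of an arithmetic operation is indexed by the union of the two indexes; wherever a label is missing from either operand, the result is NaN.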

    In [4]: s1 = Series([-2,-3,5,-1],index=list('abcd'))
    
    In [5]: s2 = Series([9,2,5,1,5],index=list('badef'))
    
    In [6]: s1 + s2
    Out[6]:
    a    0.0
    b    6.0
    c    NaN
    d    4.0
    e    NaN
    f    NaN
    dtype: float64
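
If NaN is not wanted for labels that appear in only one operand, the add method takes a fill_value that stands in for the missing side. A minimal sketch with the same two Series; the pd alias and the name filled are assumptions made for this illustration.

```python
import pandas as pd

s1 = pd.Series([-2, -3, 5, -1], index=list('abcd'))
s2 = pd.Series([9, 2, 5, 1, 5], index=list('badef'))

# Labels present in only one Series are treated as 0 on the other side,
# so 'c', 'e' and 'f' keep their own values instead of becoming NaN.
filled = s1.add(s2, fill_value=0)
print(filled)
```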
    
    

    The same holds for DataFrame, where alignment happens on both rows and columns.

    Filling values in arithmetic methods

    In [7]: df1 = DataFrame(np.arange(12.).reshape(3,4),columns=list('abcd'))
    
    In [8]: df2 = DataFrame(np.arange(20.).reshape(4,5),columns=list('abcde'))
    
    In [9]: df1
    Out[9]:
         a    b     c     d
    0  0.0  1.0   2.0   3.0
    1  4.0  5.0   6.0   7.0
    2  8.0  9.0  10.0  11.0
    
    In [10]: df2
    Out[10]:
          a     b     c     d     e
    0   0.0   1.0   2.0   3.0   4.0
    1   5.0   6.0   7.0   8.0   9.0
    2  10.0  11.0  12.0  13.0  14.0
    3  15.0  16.0  17.0  18.0  19.0
    
    In [11]: df1 + df2  # no fill value
    Out[11]:
          a     b     c     d   e
    0   0.0   2.0   4.0   6.0 NaN
    1   9.0  11.0  13.0  15.0 NaN
    2  18.0  20.0  22.0  24.0 NaN
    3   NaN   NaN   NaN   NaN NaN
    
    In [12]: df1.add(df2, fill_value=0)  # fill with 0
    Out[12]:
          a     b     c     d     e
    0   0.0   2.0   4.0   6.0   4.0
    1   9.0  11.0  13.0  15.0   9.0
    2  18.0  20.0  22.0  24.0  14.0
    3  15.0  16.0  17.0  18.0  19.0
    
    In [13]: df1.reindex(columns=df2.columns, method='ffill')
    Out[13]:
         a    b     c     d   e
    0  0.0  1.0   2.0   3.0 NaN
    1  4.0  5.0   6.0   7.0 NaN
    2  8.0  9.0  10.0  11.0 NaN
    
    In [14]: df1.reindex(columns=df2.columns, fill_value=0)  # a fill value can also be given when reindexing
    Out[14]:
         a    b     c     d  e
    0  0.0  1.0   2.0   3.0  0
    1  4.0  5.0   6.0   7.0  0
    2  8.0  9.0  10.0  11.0  0
    
    

    The available arithmetic methods are:

    • add: addition
    • sub: subtraction
    • div: division
    • mul: multiplication
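
The methods listed above accept fill_value in the same way as add. The sketch below reuses the shapes of df1 and df2; the pd and np aliases, the result names, and the choice of fill values (0 for subtraction, 1 for multiplication and division) are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape(3, 4), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape(4, 5), columns=list('abcde'))

# Cells missing from df1 (row 3 and column 'e') are replaced by the
# fill value before the operation is applied.
difference = df1.sub(df2, fill_value=0)
product = df1.mul(df2, fill_value=1)
quotient = df1.div(df2, fill_value=1)

print(difference)
print(product)
print(quotient)
```

A neutral element (0 for add and sub, 1 for mul and div) keeps the values from the other frame intact where one side is missing.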

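    Operations between DataFrame and Series

    These operations use broadcasting: by default the Series index is matched against the DataFrame's columns and the operation is applied to every row.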

    In [15]: frame = DataFrame(np.arange(12.).reshape(4,3),columns=list('bde'),index
        ...: =['Utah','Ohio','Texas','Oregon'])
    
    In [16]: series = frame.ix[0]  # get the first row
    
    In [17]: frame
    Out[17]:
              b     d     e
    Utah    0.0   1.0   2.0
    Ohio    3.0   4.0   5.0
    Texas   6.0   7.0   8.0
    Oregon  9.0  10.0  11.0
    
    In [18]: series
    Out[18]:
    b    0.0
    d    1.0
    e    2.0
    Name: Utah, dtype: float64
    
    In [19]: frame - series   # automatically broadcast to the other rows
    Out[19]:
              b    d    e
    Utah    0.0  0.0  0.0
    Ohio    3.0  3.0  3.0
    Texas   6.0  6.0  6.0
    Oregon  9.0  9.0  9.0
    
    In [20]: series2 = Series(np.arange(3),index=list('bef'))
    
    In [21]: series2
    Out[21]:
    b    0
    e    1
    f    2
    dtype: int64
    
    In [22]: frame + series2  # columns not present in both become NaN
    Out[22]:
              b   d     e   f
    Utah    0.0 NaN   3.0 NaN
    Ohio    3.0 NaN   6.0 NaN
    Texas   6.0 NaN   9.0 NaN
    Oregon  9.0 NaN  12.0 NaN
    
    In [23]: series3 = frame['d']   # select a column
    
    In [24]: frame.sub(series3, axis=0) # subtract a column from every column by specifying axis
    Out[24]:
              b    d    e
    Utah   -1.0  0.0  1.0
    Ohio   -1.0  0.0  1.0
    Texas  -1.0  0.0  1.0
    Oregon -1.0  0.0  1.0
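
The two broadcasting directions can be compared side by side. In this sketch the operator form is checked against the explicit method form, and a per-row mean is subtracted along the index; the pd and np aliases and the variable names are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape(4, 3),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

first_row = frame.iloc[0]

# The operator matches the Series index against the columns ...
by_operator = frame - first_row
# ... which is what axis='columns' spells out in the method form.
by_method = frame.sub(first_row, axis='columns')

# Matching on the row index instead: subtract each row's mean from that row.
row_means = frame.mean(axis=1)
centered = frame.sub(row_means, axis='index')

print(by_operator)
print(centered)
```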
    
    

    2.5 Function Application and Mapping

    NumPy universal functions (ufuncs) also work on pandas Series and DataFrame objects.

    In [31]: np.abs(frame)
    Out[31]:
              b     d     e
    Utah    0.0   1.0   2.0
    Ohio    3.0   4.0   5.0
    Texas   6.0   7.0   8.0
    Oregon  9.0  10.0  11.0
    
    In [32]: np.max(frame)
    Out[32]:
    b     9.0
    d    10.0
    e    11.0
    dtype: float64
    
    

    DataFrame has an apply method that accepts a user-defined function.

    In [33]: f = lambda x: np.max(x) - np.min(x)
    
    In [34]: frame.apply(f)
    Out[34]:
    b    9.0
    d    9.0
    e    9.0
    dtype: float64
    
    In [35]: frame
    Out[35]:
              b     d     e
    Utah    0.0   1.0   2.0
    Ohio    3.0   4.0   5.0
    Texas   6.0   7.0   8.0
    Oregon  9.0  10.0  11.0
    
    In [36]: f = lambda x : np.exp2(x)
    
    In [37]: frame.apply(f)
    Out[37]:
                b       d       e
    Utah      1.0     2.0     4.0
    Ohio      8.0    16.0    32.0
    Texas    64.0   128.0   256.0
    Oregon  512.0  1024.0  2048.0
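
For the two functions used with apply above, built-in reductions and direct ufunc calls give the same results. A minimal sketch; the pd and np aliases and the result names are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape(4, 3),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Column-wise range via apply, as in the transcript ...
via_apply = frame.apply(lambda col: col.max() - col.min())
# ... and the same result from the built-in reductions.
value_range = frame.max() - frame.min()

# A ufunc can be called on the frame directly; no apply is needed.
powers = np.exp2(frame)

print(value_range)
print(powers)
```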
    
    

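    Many common operations are already implemented as DataFrame methods, so there is no need to write them yourself with apply. The function passed to apply can also return a Series with several values: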

    In [38]: f = lambda x: Series([np.max(x),np.min(x)],index=['max','min'])
    
    In [39]: frame.apply(f)
    Out[39]:
           b     d     e
    max  9.0  10.0  11.0
    min  0.0   1.0   2.0
    
    # If f is an element-wise function, use applymap
    In [40]: f = lambda x : '%.2f' % x
    
    In [41]: frame.applymap(f)
    Out[41]:
               b      d      e
    Utah    0.00   1.00   2.00
    Ohio    3.00   4.00   5.00
    Texas   6.00   7.00   8.00
    Oregon  9.00  10.00  11.00
    
    # Likewise, for a Series use map, the counterpart of DataFrame's applymap.
    In [43]: series
    Out[43]:
    b    0.0
    d    1.0
    e    2.0
    Name: Utah, dtype: float64
    
    In [44]: series.map(f)
    Out[44]:
    b    0.00
    d    1.00
    e    2.00
    Name: Utah, dtype: object
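
Recent pandas releases deprecate applymap (pandas 2.1 renames it to DataFrame.map), so one version-independent way to sketch the same element-wise formatting is to apply Series.map to each column. The pd and np aliases and the names fmt and formatted are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape(4, 3),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

def fmt(x):
    """Format a single number with two decimal places."""
    return '%.2f' % x

# apply hands each column to the lambda as a Series,
# and Series.map then calls fmt on every element.
formatted = frame.apply(lambda col: col.map(fmt))
print(formatted)
```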
    
    
    

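    2.6 Sorting and Ranking

    Sorting

    Sorting can be done with:

    • the sort_index method: sorts by index
    • the sort_values method (formerly order): sorts by value; for a DataFrame, use the by parameter
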
    In [45]: obj = Series(range(4),index=list('dbca'))
    
    In [46]: obj
    Out[46]:
    d    0
    b    1
    c    2
    a    3
    dtype: int64
    
    In [47]: obj.sort_index()
    Out[47]:
    a    3
    b    1
    c    2
    d    0
    dtype: int64
    
    In [50]: frame
    Out[50]:
              b     d     e
    Utah    0.0   1.0   2.0
    Ohio    3.0   4.0   5.0
    Texas   6.0   7.0   8.0
    Oregon  9.0  10.0  11.0
    
    In [51]: frame.sort_index()
    Out[51]:
              b     d     e
    Ohio    3.0   4.0   5.0
    Oregon  9.0  10.0  11.0
    Texas   6.0   7.0   8.0
    Utah    0.0   1.0   2.0
    
    In [52]: frame.sort_index(axis=0)
    Out[52]:
              b     d     e
    Ohio    3.0   4.0   5.0
    Oregon  9.0  10.0  11.0
    Texas   6.0   7.0   8.0
    Utah    0.0   1.0   2.0
    
    In [53]: frame.sort_index(axis=1)
    Out[53]:
              b     d     e
    Utah    0.0   1.0   2.0
    Ohio    3.0   4.0   5.0
    Texas   6.0   7.0   8.0
    Oregon  9.0  10.0  11.0
    
    In [54]: frame.sort_index(axis=1, ascending=False) # descending order
    Out[54]:
               e     d    b
    Utah     2.0   1.0  0.0
    Ohio     5.0   4.0  3.0
    Texas    8.0   7.0  6.0
    Oregon  11.0  10.0  9.0
    
    

    Sorting by value:

    In [55]: s1 = Series([3,-2,-7,4])
    
    In [56]: s1.order()
    /Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: order is deprecated, use sort_values(...)
      #!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
    Out[56]:
    2   -7
    1   -2
    0    3
    3    4
    dtype: int64
    
    
    In [58]: frame.sort_index(by='b')
    /Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
      #!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
    Out[58]:
              b     d     e
    Utah    0.0   1.0   2.0
    Ohio    3.0   4.0   5.0
    Texas   6.0   7.0   8.0
    Oregon  9.0  10.0  11.0
    
    In [59]: frame.sort_values(by='b')
    Out[59]:
              b     d     e
    Utah    0.0   1.0   2.0
    Ohio    3.0   4.0   5.0
    Texas   6.0   7.0   8.0
    Oregon  9.0  10.0  11.0
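
Because the frame above is already ordered by column b, the effect of sort_values is easier to see on unsorted data. This sketch uses a small hypothetical frame with columns a and b; the pd alias and all variable names are assumptions made for this illustration.

```python
import pandas as pd

s1 = pd.Series([3, -2, -7, 4])
ascending = s1.sort_values()  # replaces the deprecated order()

# Hypothetical unsorted frame, used only for this illustration.
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

by_b_desc = frame.sort_values(by='b', ascending=False)

# Several keys: sort by 'a' first, then by 'b' within equal 'a'.
by_a_then_b = frame.sort_values(by=['a', 'b'])

print(ascending)
print(by_b_desc)
print(by_a_then_b)
```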
    
    
    

    Ranking

    By default, the rank method breaks ties by assigning each group of equal values the mean of their ranks:

    In [60]: s1 = Series([7,-5,7,4,2,0,4])
    
    In [61]: s1.rank()  # indexes 0 and 2 both hold 7, which would rank 6 and 7, so each gets the mean 6.5
    Out[61]:
    0    6.5
    1    1.0
    2    6.5
    3    4.5
    4    3.0
    5    2.0
    6    4.5
    dtype: float64
    
    

    There are, of course, several other ways to break such ties.

    In [62]: s1.rank(method='first')  # ties are ranked in the order they appear in the data
    Out[62]:
    0    6.0
    1    1.0
    2    7.0
    3    4.0
    4    3.0
    5    2.0
    6    5.0
    dtype: float64
    
    In [63]: s1.rank(ascending=False, method='max')  # descending; ties get the maximum rank of the group
    Out[63]:
    0    2.0
    1    7.0
    2    2.0
    3    4.0
    4    5.0
    5    6.0
    6    4.0
    dtype: float64
    
    

    For a DataFrame, the axis argument selects whether ranks are computed down the columns or across the rows.
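
A minimal sketch of ranking along each axis, using a small hypothetical frame (its values are not from the notes above); the pd alias and the result names are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical data, used only for this illustration.
frame = pd.DataFrame({'b': [4.3, 7, -3, 2],
                      'a': [0, 1, 0, 1],
                      'c': [-2, 5, 8, -2.5]})

# axis=0 (the default): rank the values within each column.
by_column = frame.rank(axis=0)

# axis=1: rank the values within each row.
by_row = frame.rank(axis=1)

print(by_column)
print(by_row)
```

Ties are still averaged by default, so column a, which holds two 0s and two 1s, gets ranks 1.5 and 3.5.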

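    2.7 Axis Indexes with Duplicate Values

    All of the examples so far have had unique index labels, and many pandas functions (such as reindex) require the labels to be unique.
    Uniqueness is not enforced, however.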

    In [64]: obj  = Series(range(5),index=list('aabbc'))
    
    In [65]: obj
    Out[65]:
    a    0
    a    1
    b    2
    b    3
    c    4
    dtype: int64
    
    In [67]: obj.index.is_unique
    Out[67]: False
    
    In [68]: obj['a']
    Out[68]:
    a    0
    a    1
    dtype: int64
    
    In [69]: obj['c']
    Out[69]: 4
    

    The same is true for DataFrame.

    In [70]: df =DataFrame(np.random.randn(4,3),index=list('aabb'))
    
    In [79]: df.ix['a']
    Out[79]:
              0         1         2
    a  1.099692 -0.491098  0.625690
    a -0.816857  1.025018  0.558494
    
    In [80]: df.reindex(['b','a'])  # a DataFrame with duplicate index labels cannot be reindexed
    ...
    ValueError: cannot reindex from a duplicate axis
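
One way to work around this error, sketched here under the assumption that keeping the first row for each label is acceptable, is to drop the duplicated labels before reindexing. The pd and np aliases, the deterministic data, and the variable names are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Deterministic values instead of random ones, so the result is reproducible.
df = pd.DataFrame(np.arange(12.).reshape(4, 3), index=list('aabb'))

# Reindexing with duplicate labels raises ValueError.
try:
    df.reindex(['b', 'a'])
    reindex_failed = False
except ValueError:
    reindex_failed = True

# Index.duplicated marks every repeat of a label after its first occurrence;
# inverting the mask keeps one row per label.
deduped = df[~df.index.duplicated(keep='first')]
result = deduped.reindex(['b', 'a'])

print(reindex_failed)
print(result)
```

Whether to keep the first row, the last row, or an aggregate of the duplicates depends on the data; keep='first' is only one possible choice.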
    

    To be continued...

  • Original post: https://www.cnblogs.com/felo/p/6359895.html