  • pandas (Part 1): Basic operations on Series and DataFrame

    reindex: reindexing

    pandas objects have an important method, reindex, which creates a new object whose data conforms to a new index.

    Using a Series as an example:

    >>> import numpy as np
    >>> from pandas import Series, DataFrame
    >>> series_obj = Series([4.5,1.3,5,-5.5],index=('a','b','c','d'))
    >>> series_obj
    a    4.5
    b    1.3
    c    5.0
    d   -5.5
    dtype: float64
    >>> obj2 = series_obj.reindex(['a','b','c','e','f'])
    >>> obj2
    a    4.5
    b    1.3
    c    5.0
    e    NaN
    f    NaN
    dtype: float64

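    The NaN values introduced by reindexing can be filled automatically with the fill_value argument. Pass a number rather than a string so that the result keeps its numeric dtype.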

    >>> obj3 = series_obj.reindex(['a','b','c','e','f'],fill_value=0)
    >>> obj3
    a    4.5
    b    1.3
    c    5.0
    e    0.0
    f    0.0
    dtype: float64
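
    As a minimal sketch of the two behaviors described above (NaN by default, a chosen value with fill_value), the following can be run on a current pandas installation. The variable names with_nan and with_zero are illustrative and not part of the original post.

```python
import pandas as pd

s = pd.Series([4.5, 1.3, 5.0, -5.5], index=["a", "b", "c", "d"])

# Labels missing from the original index ('e', 'f') get NaN by default.
with_nan = s.reindex(["a", "b", "c", "e", "f"])

# A numeric fill_value is used for the missing labels instead of NaN.
with_zero = s.reindex(["a", "b", "c", "e", "f"], fill_value=0)

# reindex returns a new object; the original Series is left unchanged.
print(with_nan.isna().tolist())
print(with_zero.tolist())
print(s.index.tolist())
```

    Because the fill value is a number, the result stays float64; a string such as '0' would turn the whole Series into object dtype.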

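    For ordered data such as time series, reindexing may require some interpolation or filling of values. The method argument of reindex provides this.

    The options for method are:

    ffill or pad: fill forward (carry the previous value forward)

    bfill or backfill: fill backward (carry the next value backward)

    Rows that have no preceding value (for ffill) or no following value (for bfill) are filled with NaN.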

    >>> obj4 = Series(['red','blue','green'],index=[0,2,4])
    >>> obj4
    0      red
    2     blue
    4    green
    dtype: object
    >>> obj4.reindex(range(6),method='ffill')
    0      red
    1      red
    2     blue
    3     blue
    4    green
    5    green
    dtype: object
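
    The example above only shows ffill. A small sketch comparing it with bfill on the same data makes the "no following value" case visible. The names colors, forward and backward are illustrative.

```python
import pandas as pd

colors = pd.Series(["red", "blue", "green"], index=[0, 2, 4])

# Forward fill: each new label takes the value of the closest earlier label.
forward = colors.reindex(range(6), method="ffill")

# Backward fill: each new label takes the value of the closest later label.
# Label 5 has no later label, so it stays NaN.
backward = colors.reindex(range(6), method="bfill")

print(forward.tolist())
print(backward.isna().tolist())
```

    Both methods require the original index to be sorted (monotonic), which is why they suit ordered data such as time series.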

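    Reindexing a DataFrame

    When only a single sequence is passed, the rows are reindexed by default. The keyword arguments index and columns specify the row labels and the column labels explicitly.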

    >>> frame = DataFrame(np.arange(9).reshape((3,3)),index = ['a','b','c'],columns = ['Ohio','Texas',"Cali"])
    >>> frame2 = frame.reindex(['a','b','c','d'])
    >>> frame2
       Ohio  Texas  Cali
    a   0.0    1.0   2.0
    b   3.0    4.0   5.0
    c   6.0    7.0   8.0
    d   NaN    NaN   NaN
    >>> frame3 = frame.reindex(columns = ['Ohio','Texas','Cali','Wile'],index=['a','b','c','d'],fill_value=4)
    >>> frame3
       Ohio  Texas  Cali  Wile
    a     0      1     2     4
    b     3      4     5     4
    c     6      7     8     4
    d     4      4     4     4

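    When a DataFrame is reindexed on both rows and columns, interpolation (the method argument) is applied along the rows only.

    With the label-indexing capability of ix, reindexing can be written more concisely. Note that ix was deprecated in pandas 0.20 and removed in 1.0, so this form only works on older versions.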

    >>> frame5 = frame.ix[['a','b','c','d'], ['Ohio','Texas','Cali','Wile']]
    >>> frame5
       Ohio  Texas  Cali  Wile
    a   0.0    1.0   2.0   NaN
    b   3.0    4.0   5.0   NaN
    c   6.0    7.0   8.0   NaN
    d   NaN    NaN   NaN   NaN
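
    On current pandas the same result has to be obtained differently. The sketch below assumes pandas 1.0 or later, where ix no longer exists and loc raises KeyError when a list contains unknown labels, so reindex is used instead. The names rows, cols and loc_raised are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "b", "c"],
    columns=["Ohio", "Texas", "Cali"],
)

rows = ["a", "b", "c", "d"]
cols = ["Ohio", "Texas", "Cali", "Wile"]

# Assumption: pandas >= 1.0, where .loc rejects lists with unknown labels.
try:
    frame.loc[rows, cols]
    loc_raised = False
except KeyError:
    loc_raised = True

# reindex accepts unknown labels and fills the new cells with NaN.
frame5 = frame.reindex(index=rows, columns=cols)

print(loc_raised)
print(frame5)
```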

    drop: dropping entries from a given axis

    >>> obj = Series(np.arange(5),index=['a','b','c','d','e'])
    >>> obj
    a    0
    b    1
    c    2
    d    3
    e    4
    dtype: int32
    >>> new_obj = obj.drop('b')
    >>> new_obj
    a    0
    c    2
    d    3
    e    4
    dtype: int32
    >>> new_obj2 = obj.drop(['b','c'])
    >>> new_obj2
    a    0
    d    3
    e    4
    dtype: int32
    # DataFrame
    >>> frame = DataFrame(np.arange(16).reshape((4,4)),index=['a','b','c','d'],columns=['one','two','three','four'])
    >>> frame
       one  two  three  four
    a    0    1      2     3
    b    4    5      6     7
    c    8    9     10    11
    d   12   13     14    15
    >>> new_frame = frame.drop('a')
    >>> new_frame
       one  two  three  four
    b    4    5      6     7
    c    8    9     10    11
    d   12   13     14    15
    >>> new_frame2 = frame.drop(['two','four'],axis = 1)
    >>> new_frame2
       one  three
    a    0      2
    b    4      6
    c    8     10
    d   12     14
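
    As a hedged sketch, drop can also be written with the index and columns keywords instead of axis. This assumes pandas 0.21 or later, where those keywords are available. The names by_axis, by_keyword and both are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
)

# Two equivalent ways to drop columns.
by_axis = frame.drop(["two", "four"], axis=1)
by_keyword = frame.drop(columns=["two", "four"])

# Rows and columns can be dropped in a single call.
both = frame.drop(index=["a"], columns=["one"])

# drop returns a new object; the original frame keeps all rows and columns.
print(by_axis.equals(by_keyword))
print(both.shape, frame.shape)
```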

    Indexing, selection, and filtering

    A Series can be indexed by position, as with a NumPy array, or by the labels of its own index.

    >>> obj
    a    0
    b    1
    c    2
    d    3
    e    4
    dtype: int32
    >>> obj['a']
    0
    >>> obj[1]
    1
    Note: slicing with labels is closed on the right, that is, the end label is included.

    >>> obj['a':'c']
    a    0
    b    1
    c    2
    dtype: int32
    >>> obj[3:4]
    d    3
    dtype: int32
    >>> obj[2:3]
    c    2
    dtype: int32
    >>> obj[[3,1]]
    d    3
    b    1
    dtype: int32
    >>> obj[['a','c']]
    a    0
    c    2
    dtype: int32
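
    The difference between label slices and position slices can be made explicit with loc and iloc. This is a minimal sketch; the names by_label and by_position are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=["a", "b", "c", "d", "e"])

# Label slice: both endpoints are included.
by_label = obj.loc["a":"c"]

# Position slice: the end position is excluded, as in plain Python.
by_position = obj.iloc[0:2]

print(by_label.index.tolist())
print(by_position.index.tolist())
```

    Using loc and iloc states the intent directly, which avoids ambiguity when the index itself contains integers.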

    Modifying values through the index

    >>> obj[['b','d']] *=2
    >>> obj
    a    0
    b    2
    c    2
    d    6
    e    4
    dtype: int32

    Indexing a DataFrame:

    Direct indexing with a label or a list of labels selects columns only.

    >>> frame
       one  two  three  four
    a    0    1      2     3
    b    4    5      6     7
    c    8    9     10    11
    d   12   13     14    15
    >>> frame['a']
    KeyError: 'a'
    >>> frame['one']
    a     0
    b     4
    c     8
    d    12
    Name: one, dtype: int32
    >>> frame[['one','four']]
       one  four
    a    0     3
    b    4     7
    c    8    11
    d   12    15
    

    Indexing with a slice or a boolean array selects rows.

    >>> frame[1:3] # half-open interval: the end position is excluded
       one  two  three  four
    b    4    5      6     7
    c    8    9     10    11
    >>> frame[frame['three'] > 8]
       one  two  three  four
    c    8    9     10    11
    d   12   13     14    15
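
    The three selection rules above (label selects a column, slice selects rows, boolean mask selects rows) can be put side by side in one short sketch. The combined condition at the end is an addition for illustration and does not appear in the original post.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
)

column = frame["one"]              # a label selects a column
row_slice = frame[1:3]             # a slice selects rows by position
row_mask = frame[frame["three"] > 8]  # a boolean Series selects rows

# Conditions are combined with & and |, each wrapped in parentheses.
combined = frame[(frame["three"] > 5) & (frame["one"] < 12)]

print(column.tolist())
print(row_slice.index.tolist())
print(row_mask.index.tolist())
print(combined.index.tolist())
```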

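    The DataFrame indexing attribute ix (deprecated in pandas 0.20 and removed in 1.0 in favor of loc and iloc)

    >>> frame.ix['a'] # select a row by label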
    one      0
    two      1
    three    2
    four     3
    Name: a, dtype: int32
    >>> frame.ix[['b','d']]
       one  two  three  four
    b    4    5      6     7
    d   12   13     14    15
    >>> frame.ix[1] # also selects a row, this time by position
    one      4
    two      5
    three    6
    four     7
    Name: b, dtype: int32
    >>> frame.ix[1:3]
       one  two  three  four
    b    4    5      6     7
    c    8    9     10    11
    >>> frame.ix[1:2,[2,3,1]]
       three  four  two
    b      6     7    5
    >>> frame.ix[1:3,[2,3,1]]
       three  four  two
    b      6     7    5
    c     10    11    9
    >>> frame.ix[['b','d'],['one','three']]
       one  three
    b    4      6
    d   12     14
    >>> frame.ix[['b','d'],[3,1,2]]
       four  two  three
    b     7    5      6
    d    15   13     14
    >>> frame.ix[:,[2,3,1]] # select all rows
       three  four  two
    a      2     3    1
    b      6     7    5
    c     10    11    9
    d     14    15   13

    >>> frame.ix[frame.three >5,:3]
       one  two  three
    b    4    5      6
    c    8    9     10
    d   12   13     14
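
    On pandas 1.0 and later, where ix is no longer available, the same selections can be sketched with loc (labels) and iloc (positions). This is one possible translation, not the original author's code; mixing labels and positions goes through frame.columns here.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
)

row_by_label = frame.loc["a"]          # was frame.ix['a']
row_by_position = frame.iloc[1]        # was frame.ix[1]
block = frame.iloc[1:3, [2, 3, 1]]     # was frame.ix[1:3,[2,3,1]]

# Row labels with column positions: look the column labels up first.
mixed = frame.loc[["b", "d"], frame.columns[[3, 1, 2]]]

# Boolean row filter with the first three columns.
filtered = frame.loc[frame["three"] > 5, frame.columns[:3]]

print(row_by_position.name)
print(block.columns.tolist())
print(mixed.columns.tolist())
print(filtered.index.tolist())
```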

    Arithmetic and data alignment

    >>> s1 = Series([1.3,4.5,6.6,3.4],index=['a','b','c','d'])
    >>> s2 = Series([1,2,3,4,5,6,7],index=['a','b','c','d','e','f','g'])
    >>> s1+s2
    a    2.3
    b    6.5
    c    9.6
    d    7.4
    e    NaN
    f    NaN
    g    NaN
    dtype: float64
    # Missing values are introduced at index labels that do not overlap
    # The same holds for DataFrame
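
    Alignment is by label, not by position. A small sketch with two Series whose labels are in different orders shows this; the data is made up for illustration.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["c", "b", "d"])

# Values are paired by label; the result index is the union of both indexes.
total = s1 + s2

print(total.index.tolist())
print(total.isna().tolist())
print(total.loc["b"], total.loc["c"])
```

    Labels present in only one operand ('a' and 'd' here) produce NaN, which is also why the integer inputs come back as floats.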

    Filling missing values in arithmetic methods

    >>> df1 = DataFrame(np.arange(12).reshape((3,4)),columns=list('abcd'))
    >>> df2 = DataFrame(np.arange(20).reshape((4,5)),columns=list('abcde'))
    >>> df1+df2 # the plain operator produces missing values
          a     b     c     d   e
    0   0.0   2.0   4.0   6.0 NaN
    1   9.0  11.0  13.0  15.0 NaN
    2  18.0  20.0  22.0  24.0 NaN
    3   NaN   NaN   NaN   NaN NaN
    # the arithmetic method can fill in the missing values
    >>> df1.add(df2,fill_value=0)
          a     b     c     d     e
    0   0.0   2.0   4.0   6.0   4.0
    1   9.0  11.0  13.0  15.0   9.0
    2  18.0  20.0  22.0  24.0  14.0
    3  15.0  16.0  17.0  18.0  19.0

    The arithmetic methods are:

    add: addition

    sub: subtraction

    div: division

    mul: multiplication
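
    The other methods in the list accept fill_value in the same way as add. The following sketch reuses the shapes of df1 and df2 from the example above. The choice of 0 for subtraction and 1 for multiplication is an illustrative assumption about what a neutral fill might be, not something the original post prescribes.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.0).reshape((3, 4)), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.0).reshape((4, 5)), columns=list("abcde"))

# Cells missing from df1 are treated as 0 before subtracting.
diff = df1.sub(df2, fill_value=0)

# Cells missing from df1 are treated as 1 before multiplying.
prod = df1.mul(df2, fill_value=1)

# Without fill_value, a method gives the same result as the operator.
same = df1.add(df2).equals(df1 + df2)

print(diff.loc[0, "e"], diff.loc[3, "a"])
print(prod.loc[1, "b"], prod.loc[3, "e"])
print(same)
```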

    Operations between a DataFrame and a Series

    >>> frame
       one  two  three  four
    a    0    1      2     3
    b    4    5      6     7
    c    8    9     10    11
    d   12   13     14    15
    >>> series = frame.ix[0]
    >>> series
    one      0
    two      1
    three    2
    four     3
    Name: a, dtype: int32
    >>> frame - series
       one  two  three  four
    a    0    0      0     0
    b    4    4      4     4
    c    8    8      8     8
    d   12   12     12    12

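    An operation between the two matches the Series index against the DataFrame columns and then broadcasts down the rows.

    If an index value is not found in either the DataFrame columns or the Series index, the two objects are reindexed to form the union.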

    >>> series2 = Series(range(3),index = ['two','four','five'])
    >>> frame +series2
       five  four  one  three   two
    a   NaN   4.0  NaN    NaN   1.0
    b   NaN   8.0  NaN    NaN   5.0
    c   NaN  12.0  NaN    NaN   9.0
    d   NaN  16.0  NaN    NaN  13.0

    To match on the rows and broadcast across the columns instead, you must use one of the arithmetic methods and pass axis=0.

    >>> series3 = frame['two']
    >>> frame.sub(series3,axis = 0)
       one  two  three  four
    a   -1    0      1     2
    b   -1    0      1     2
    c   -1    0      1     2
    d   -1    0      1     2
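
    Both broadcasting directions can be compared in one short sketch. It uses iloc[0] in place of the ix[0] call shown earlier, on the assumption of pandas 1.0 or later; the names by_columns and by_rows are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["a", "b", "c", "d"],
    columns=["one", "two", "three", "four"],
)

# Default: the Series index is matched to the columns and the
# subtraction is repeated for every row.
by_columns = frame - frame.iloc[0]

# axis=0: the Series index is matched to the row labels and the
# subtraction is repeated for every column.
by_rows = frame.sub(frame["two"], axis=0)

print(by_columns.loc["b"].tolist())
print(by_rows.loc["d"].tolist())
```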
  • Original article: https://www.cnblogs.com/zuoshoushizi/p/8733153.html