  • Pandas Beginner Study Notes 4

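    5 Hierarchical indexing

    Hierarchical indexing is an important pandas feature: it lets you work with higher-dimensional data in a lower-dimensional form by placing two or more index levels on a single axis. The IPython sessions below assume that numpy is imported as np, pandas as pd, and that Series and DataFrame have been imported from pandas.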

    In [185]: data = Series(np.random.randn(10),index=[list('aaabbbccdd'),[1,2,3,1,2,3,2,3,2,3]])
    
    In [186]: data
    Out[186]:
    a  1    0.458553
       2    0.077532
       3   -1.561180
    b  1    2.498391
       2    0.243617
       3   -0.818542
    c  2   -1.222213
       3   -0.797079
    d  2    1.131352
       3   -1.292136
    dtype: float64
    

    Inspect the index.

    In [187]: data.index
    Out[187]:
    MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
               labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 1, 2, 1, 2]])
    
    In [188]: data['b']
    Out[188]:
    1    2.498391
    2    0.243617
    3   -0.818542
    dtype: float64
    
    In [189]: data['b':'c']
    Out[189]:
    b  1    2.498391
       2    0.243617
       3   -0.818542
    c  2   -1.222213
       3   -0.797079
    dtype: float64
    
    In [190]: data[:,2]  # select by the inner index level
    Out[190]:
    a    0.077532
    b    0.243617
    c   -1.222213
    d    1.131352
    dtype: float64
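
The selections above can be reproduced with explicit indexers. This is a minimal sketch, not the author's code: it assumes a recent pandas release, replaces the random values with 0.0 to 9.0 so the results are predictable, and uses .loc and xs in place of plain square brackets.

```python
import numpy as np
import pandas as pd

# Same two-level index as the example above; the values 0.0..9.0 are an
# assumption chosen so that the results are reproducible.
index = pd.MultiIndex.from_arrays(
    [list("aaabbbccdd"), [1, 2, 3, 1, 2, 3, 2, 3, 2, 3]]
)
data = pd.Series(np.arange(10.0), index=index)

# A MultiIndex stores the unique labels per level plus integer codes
# pointing into them (older releases print the codes as "labels").
levels = [list(level) for level in index.levels]
codes = [code.tolist() for code in index.codes]

outer_b = data.loc["b"]           # every row under the outer label 'b'
outer_b_to_c = data.loc["b":"c"]  # label slice on the outer level, inclusive
inner_2 = data.xs(2, level=1)     # every row whose inner label is 2
inner_2_alt = data.loc[:, 2]      # the same selection written with .loc
```

Using .loc and xs makes it explicit that the lookup is by label, which matters once integer labels are involved.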
    
    
    In [191]: data.unstack()  # unstack rearranges the data into a DataFrame
    Out[191]:
              1         2         3
    a  0.458553  0.077532 -1.561180
    b  2.498391  0.243617 -0.818542
    c       NaN -1.222213 -0.797079
    d       NaN  1.131352 -1.292136
    
    In [192]: data.unstack().stack()  # the inverse operation: stack
    Out[192]:
    a  1    0.458553
       2    0.077532
       3   -1.561180
    b  1    2.498391
       2    0.243617
       3   -0.818542
    c  2   -1.222213
       3   -0.797079
    d  2    1.131352
       3   -1.292136
    dtype: float64
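
The unstack/stack round trip can be checked on fixed values. The sketch below is illustrative only: the values are an assumption, and the explicit dropna() is added because, depending on the pandas version, stack may or may not discard the NaN cells that unstack introduced for the missing (c, 1) and (d, 1) pairs.

```python
import numpy as np
import pandas as pd

# Same index as the example above, with fixed values (an assumption).
index = pd.MultiIndex.from_arrays(
    [list("aaabbbccdd"), [1, 2, 3, 1, 2, 3, 2, 3, 2, 3]]
)
data = pd.Series(np.arange(10.0), index=index)

# The inner level becomes the columns; missing combinations become NaN.
wide = data.unstack()

# Stack back into a Series and drop the NaN placeholders explicitly,
# so the result does not depend on the version's stack behaviour.
roundtrip = wide.stack().dropna()
```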
    
    

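    Either axis of a DataFrame can have a hierarchical index.

    5.1 Reordering and sorting levels

    You can rearrange the order of the index levels on an axis. swaplevel interchanges two levels and returns a new object; the data itself is unchanged. sortlevel sorts the rows by the values of a single level.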

    In [193]: frame = DataFrame(np.random.randn(4,3),index=[list('aabb'),[1,2,1,2]],columns=[['Ohio','Ohio','Colorado'],['Green','Red','Green']])
         ...:
    
    In [194]: frame
    Out[194]:
             Ohio            Colorado
            Green       Red     Green
    a 1  0.368997  0.670430  1.056365
      2 -0.352259 -0.656101  0.018544
    b 1 -0.574535 -0.531988  0.295466
      2 -0.973587  0.225511 -0.250887
    
    In [198]: frame.index.names = ['key1','key2']
    
    In [199]: frame.columns.names = ['state','color']
    
    In [200]: frame
    Out[200]:
    state          Ohio            Colorado
    color         Green       Red     Green
    key1 key2                              
    a    1     0.368997  0.670430  1.056365
         2    -0.352259 -0.656101  0.018544
    b    1    -0.574535 -0.531988  0.295466
         2    -0.973587  0.225511 -0.250887
    
    In [201]: frame.swaplevel('key1','key2')
    Out[201]:
    state          Ohio            Colorado
    color         Green       Red     Green
    key2 key1                              
    1    a     0.368997  0.670430  1.056365
    2    a    -0.352259 -0.656101  0.018544
    1    b    -0.574535 -0.531988  0.295466
    2    b    -0.973587  0.225511 -0.250887
    
    In [202]: frame.sortlevel(1)
    Out[202]:
    state          Ohio            Colorado
    color         Green       Red     Green
    key1 key2                              
    a    1     0.368997  0.670430  1.056365
    b    1    -0.574535 -0.531988  0.295466
    a    2    -0.352259 -0.656101  0.018544
    b    2    -0.973587  0.225511 -0.250887
    
    In [203]: frame.swaplevel(0,1)
    Out[203]:
    state          Ohio            Colorado
    color         Green       Red     Green
    key2 key1                              
    1    a     0.368997  0.670430  1.056365
    2    a    -0.352259 -0.656101  0.018544
    1    b    -0.574535 -0.531988  0.295466
    2    b    -0.973587  0.225511 -0.250887
    
    In [204]: frame.swaplevel(0,1).sortlevel(0)
    Out[204]:
    state          Ohio            Colorado
    color         Green       Red     Green
    key2 key1                              
    1    a     0.368997  0.670430  1.056365
         b    -0.574535 -0.531988  0.295466
    2    a    -0.352259 -0.656101  0.018544
         b    -0.973587  0.225511 -0.250887
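
In recent pandas releases sortlevel is no longer available, and sort_index(level=...) covers the same use. The following is a sketch under that assumption, with fixed integer values standing in for the random data.

```python
import numpy as np
import pandas as pd

# Same row and column structure as the frame above; values 0..11 are an
# assumption so that the row order is easy to follow.
frame = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=pd.MultiIndex.from_arrays(
        [list("aabb"), [1, 2, 1, 2]], names=["key1", "key2"]
    ),
    columns=pd.MultiIndex.from_arrays(
        [["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
        names=["state", "color"],
    ),
)

# Sort rows by the inner level; plays the role of frame.sortlevel(1).
by_key2 = frame.sort_index(level="key2")

# Swap the two row levels, then sort by the new outer level.
swapped_sorted = frame.swaplevel("key1", "key2").sort_index(level=0)
```

Both calls return new objects, so frame keeps its original row order.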
    

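    5.2 Summary statistics by level

    Many DataFrame and Series summary and descriptive statistics methods accept a level option, which names the index level to aggregate by on the chosen axis.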

    
    In [205]: frame
    Out[205]:
    state          Ohio            Colorado
    color         Green       Red     Green
    key1 key2                              
    a    1     0.368997  0.670430  1.056365
         2    -0.352259 -0.656101  0.018544
    b    1    -0.574535 -0.531988  0.295466
         2    -0.973587  0.225511 -0.250887
    
    
    In [207]: frame.sum(level='key2')
    Out[207]:
    state      Ohio            Colorado
    color     Green       Red     Green
    key2                               
    1     -0.205538  0.138443  1.351831
    2     -1.325846 -0.430590 -0.232343
    
    In [209]: frame.sum(level='color',axis=1)
    Out[209]:
    color         Green       Red
    key1 key2                    
    a    1     1.425362  0.670430
         2    -0.333715 -0.656101
    b    1    -0.279069 -0.531988
         2    -1.224474  0.225511
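
The level argument of sum was removed in pandas 2.0; grouping by the level gives the same aggregation. The sketch below assumes a recent release and uses fixed values. Transposing before grouping is one way, chosen here for illustration, to aggregate over a column level.

```python
import numpy as np
import pandas as pd

# Same structure as the frame above, with fixed values (an assumption).
frame = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=pd.MultiIndex.from_arrays(
        [list("aabb"), [1, 2, 1, 2]], names=["key1", "key2"]
    ),
    columns=pd.MultiIndex.from_arrays(
        [["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
        names=["state", "color"],
    ),
)

# Row-wise: sum rows that share the same key2 value.
by_key2 = frame.groupby(level="key2").sum()

# Column-wise: transpose, group by the 'color' level, transpose back.
by_color = frame.T.groupby(level="color").sum().T
```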
    

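    5.3 Using DataFrame columns as the index

    It is common to want to use one or more DataFrame columns as the index, or, conversely, to turn the index into DataFrame columns.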

    In [210]: df = DataFrame({'a':range(7),'b':range(7,0,-1),'c':['one']*7,'d':[0,1,2,0,1,2,3]})
    
    In [211]: df
    Out[211]:
       a  b    c  d
    0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
    3  3  4  one  0
    4  4  3  one  1
    5  5  2  one  2
    6  6  1  one  3
    
    In [212]: df2 = df.set_index(['c','d'])  # by default, the columns moved into the index are dropped
    
    In [213]: df2
    Out[213]:
           a  b
    c   d      
    one 0  0  7
        1  1  6
        2  2  5
        0  3  4
        1  4  3
        2  5  2
        3  6  1
    
    In [215]: df2 = df.set_index(['c','d'],drop=False)  # keep the two columns as well

    In [216]: df2
    Out[216]:
           a  b    c  d
    c   d              
    one 0  0  7  one  0
        1  1  6  one  1
        2  2  5  one  2
        0  3  4  one  0
        1  4  3  one  1
        2  5  2  one  2
        3  6  1  one  3
    
    

    reset_index does the reverse: it moves the index levels back into the DataFrame as columns.

    In [217]: df2 = df.set_index(['c','d'])
    
    In [218]: df2
    Out[218]:
           a  b
    c   d      
    one 0  0  7
        1  1  6
        2  2  5
        0  3  4
        1  4  3
        2  5  2
        3  6  1
    
    In [219]: df2.reset_index()
    Out[219]:
         c  d  a  b
    0  one  0  0  7
    1  one  1  1  6
    2  one  2  2  5
    3  one  0  3  4
    4  one  1  4  3
    5  one  2  5  2
    6  one  3  6  1
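
As a compact check of the round trip, the sketch below builds the same df, moves two columns into the index, and restores them. The variable names indexed, kept, and restored are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "a": range(7),
    "b": range(7, 0, -1),
    "c": ["one"] * 7,
    "d": [0, 1, 2, 0, 1, 2, 3],
})

indexed = df.set_index(["c", "d"])            # 'c' and 'd' leave the columns
kept = df.set_index(["c", "d"], drop=False)   # 'c' and 'd' stay as columns too
restored = indexed.reset_index()              # index levels become columns again
```

Note that reset_index places the former index levels first, so the column order becomes c, d, a, b rather than the original a, b, c, d.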
    
    

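    6 Other topics

    6.1 Integer indexing

    Start with an example. When a Series has an integer index, it is hard to tell whether a lookup such as ser[-1] is meant to select by position or by label.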

    In [220]: ser = Series(np.arange(3))
    
    In [221]: ser
    Out[221]:
    0    0
    1    1
    2    2
    dtype: int64
    
    In [222]: ser[-1]
    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    ...
    

    A Series with a non-integer index, such as string labels, does not have this problem.
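
One way to avoid the ambiguity is to state the intent explicitly: .loc for labels, .iloc for positions. The sketch below assumes a recent pandas release; ser_str is a hypothetical string-indexed Series added for contrast.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3))                              # integer labels 0, 1, 2
ser_str = pd.Series(np.arange(3), index=["a", "b", "c"])   # string labels

last_by_position = ser.iloc[-1]   # position -1 is the last element
first_by_label = ser.loc[0]       # label 0

# Looking up -1 as a label fails because no such label exists.
try:
    ser.loc[-1]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

last_str = ser_str.iloc[-1]       # positional access works the same way here
```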

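    If you need reliable position-based indexing regardless of the index type, you can use:

    • Series: iget_value
    • DataFrame: irow and icol

    Newer versions changed this: as the FutureWarning messages below show, these methods are deprecated, and iloc (or iat for a single value) is used for position-based access instead.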

    In [231]: ser3 = Series(np.arange(3),index=[-5,1,3])
    
    In [232]: ser3.iget_value(2)
    /Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: iget_value(i) is deprecated. Please use .iloc[i] or .iat[i]
      #!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
    Out[232]: 2
    
    In [236]: ser3.iloc[2]
    Out[236]: 2
    
    In [237]: ser3.iat[2]
    Out[237]: 2
    
    In [239]: frame = DataFrame(np.arange(6).reshape(3,2),index=[2,0,1])
    
    In [241]: frame
    Out[241]:
       0  1
    2  0  1
    0  2  3
    1  4  5
    
    In [242]: frame.irow(1)
    /Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: irow(i) is deprecated. Please use .iloc[i]
      #!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
    Out[242]:
    0    2
    1    3
    Name: 0, dtype: int64
    
    In [243]: frame.icol(1)
    /Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: icol(i) is deprecated. Please use .iloc[:,i]
      #!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
    Out[243]:
    2    1
    0    3
    1    5
    Name: 1, dtype: int64
    
    In [245]: frame.iloc[1]  # select a row by position
    Out[245]:
    0    2
    1    3
    Name: 0, dtype: int64
    
    In [246]: frame.iloc[:,1]  # select a column by position
    Out[246]:
    2    1
    0    3
    1    5
    Name: 1, dtype: int64
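
When the integer labels are not in positional order, .iloc and .loc return different rows for the same number. The sketch below reuses ser3 and frame from the session above to contrast the two; it assumes a recent pandas release.

```python
import numpy as np
import pandas as pd

ser3 = pd.Series(np.arange(3), index=[-5, 1, 3])
frame = pd.DataFrame(np.arange(6).reshape(3, 2), index=[2, 0, 1])

third_value = ser3.iloc[2]        # third element by position
third_value_fast = ser3.iat[2]    # scalar-only positional access
value_at_label_3 = ser3.loc[3]    # element whose label is 3

row_by_position = frame.iloc[1]   # second row, which carries label 0
row_by_label = frame.loc[1]       # row labelled 1, which is the third row
col_by_position = frame.iloc[:, 1]
```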
    
    

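    6.2 Panel data

    The Panel data structure can be thought of as a three-dimensional DataFrame.
    Each item in a Panel is a DataFrame.
    A stacked DataFrame (one with a hierarchical index) can likewise represent a panel.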

    In [247]: import pandas.io.data as web
    /Users/yangfeilong/anaconda/lib/python2.7/site-packages/pandas/io/data.py:35: FutureWarning:
    The pandas.io.data module is moved to a separate package (pandas-datareader) and will be removed from pandas in a future version.
    After installing the pandas-datareader package (https://github.com/pydata/pandas-datareader), you can change the import ``from pandas.io import data, wb`` to ``from pandas_datareader import data, wb``.
      FutureWarning)
    
    In [248]: web
    Out[248]: <module 'pandas.io.data' from '/Users/yangfeilong/anaconda/lib/python2.7/site-packages/pandas/io/data.py'>
    
    In [249]: pdata = pd.Panel(dict((stk ,web.get_data_yahoo(stk,'1/1/2009','6/1/2012')) for stk in ['AAPL','GOOG','MSFT','DELL']))
    
    In [250]: pdata
    Out[250]:
    <class 'pandas.core.panel.Panel'>
    Dimensions: 4 (items) x 868 (major_axis) x 6 (minor_axis)
    Items axis: AAPL to MSFT
    Major_axis axis: 2009-01-02 00:00:00 to 2012-06-01 00:00:00
    Minor_axis axis: Open to Adj Close
    
    In [252]: pdata = pdata.swapaxes('items','minor')
    
    In [253]: pdata['Adj Close']
    Out[253]:
                     AAPL      DELL        GOOG       MSFT
    Date                                                  
    2009-01-02  11.808505  10.39902  160.499779  16.501303
    ...
    2012-05-30  75.362333  12.14992  293.821674  25.878448
    2012-05-31  75.174961  11.92743  290.140354  25.746145
    2012-06-01  72.996726  11.67592  285.205295  25.093451
    
    [868 rows x 4 columns]
    
    In [256]: pdata.ix[:,'6/1/2012',:]  # ix extends to three dimensions
    Out[256]:
                Open        High         Low       Close       Volume   Adj Close
    AAPL  569.159996  572.650009  560.520012  560.989983  130246900.0   72.996726
    DELL   12.150000   12.300000   12.045000   12.070000   19397600.0   11.675920
    GOOG  571.790972  572.650996  568.350996  570.981000    6138700.0  285.205295
    MSFT   28.760000   28.959999   28.440001   28.450001   56634300.0   25.093451
    
    In [260]: pdata.ix[:,'5/30/2012':,:].to_frame()
    Out[260]:
                            Open        High         Low       Close       Volume  
    Date       minor                                                                
    2012-05-30 AAPL   569.199997  579.989990  566.559990  579.169998  132357400.0   
               DELL    12.590000   12.700000   12.460000   12.560000   19787800.0   
               GOOG   588.161028  591.901014  583.530999  588.230992    3827600.0   
               MSFT    29.350000   29.480000   29.120001   29.340000   41585500.0   
    2012-05-31 AAPL   580.740021  581.499985  571.460022  577.730019  122918600.0   
               DELL    12.530000   12.540000   12.330000   12.330000   19955600.0   
               GOOG   588.720982  590.001032  579.001013  580.860990    5958800.0   
               MSFT    29.299999   29.420000   28.940001   29.190001   39134000.0   
    2012-06-01 AAPL   569.159996  572.650009  560.520012  560.989983  130246900.0   
               DELL    12.150000   12.300000   12.045000   12.070000   19397600.0   
               GOOG   571.790972  572.650996  568.350996  570.981000    6138700.0   
               MSFT    28.760000   28.959999   28.440001   28.450001   56634300.0   
    
                       Adj Close  
    Date       minor              
    2012-05-30 AAPL    75.362333  
               DELL    12.149920  
               GOOG   293.821674  
               MSFT    25.878448  
    2012-05-31 AAPL    75.174961  
               DELL    11.927430  
               GOOG   290.140354  
               MSFT    25.746145  
    2012-06-01 AAPL    72.996726  
               DELL    11.675920  
               GOOG   285.205295  
               MSFT    25.093451  
    
    # A Panel can be converted to a DataFrame
    In [261]: stacked = pdata.ix[:,'5/30/2012':,:].to_frame()
    
    In [262]: stacked.to_panel()  # convert back to a Panel
    Out[262]:
    <class 'pandas.core.panel.Panel'>
    Dimensions: 6 (items) x 3 (major_axis) x 4 (minor_axis)
    Items axis: Open to Adj Close
    Major_axis axis: 2012-05-30 00:00:00 to 2012-06-01 00:00:00
    Minor_axis axis: AAPL to MSFT
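
Panel and .ix have since been removed from pandas, and the warning above notes that pandas.io.data moved to the separate pandas-datareader package, so the session above runs only on old releases. As a rough sketch of the stacked-DataFrame alternative, the code below keeps the same idea in a DataFrame with a hierarchical index. The prices are made up for illustration and are not market data; the names per_ticker, long_form, close, and one_day are hypothetical.

```python
import pandas as pd

dates = pd.date_range("2012-05-30", periods=3, freq="D")
fields = ["Open", "Close"]

# Hypothetical prices, one DataFrame per ticker (not real market data).
per_ticker = {
    "AAPL": pd.DataFrame(
        [[10.0, 11.0], [11.0, 12.0], [12.0, 13.0]], index=dates, columns=fields
    ),
    "MSFT": pd.DataFrame(
        [[20.0, 21.0], [21.0, 22.0], [22.0, 23.0]], index=dates, columns=fields
    ),
}

# Concatenate under a (ticker, Date) row index.
long_form = pd.concat(per_ticker, names=["ticker", "Date"])

# (Date, ticker) ordering, similar in shape to the to_frame() result above.
stacked = long_form.swaplevel("ticker", "Date").sort_index()

# One field across all tickers, similar to pdata['Adj Close'] above.
close = long_form["Close"].unstack("ticker")

# All tickers on one date, similar to pdata.ix[:, '6/1/2012', :] above.
one_day = stacked.xs(pd.Timestamp("2012-06-01"), level="Date")
```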
    
  • Original article: https://www.cnblogs.com/felo/p/6362359.html