  • Python Data Analysis: Pandas 01, Creating DataFrames and Selecting Data

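    Contents:

    • Creating data
    • Viewing data
    • Indexing and selecting subsets of data
      • Selection by label: .loc
      • Selection with a MultiIndex
      • Selection by position: .iloc
      • Boolean indexing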

    Object Creation

    • Build a Series from a list (the session assumes import numpy as np and import pandas as pd)
    In [73]: s = pd.Series([1,3,5,np.nan,6,8])
    
    In [74]: s
    Out[74]:
    0    1.0
    1    3.0
    2    5.0
    3    NaN
    4    6.0
    5    8.0
    dtype: float64
    
    
    • Build a DataFrame from a NumPy array
    In [75]: dates = pd.date_range('20130101', periods=6)
    
    In [76]: df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
    
    In [77]: df
    Out[77]:
                       A         B         C         D
    2013-01-01 -0.411674  0.273549  0.629843  1.881497
    2013-01-02  1.240512  0.970725  0.033099  1.553420
    2013-01-03 -0.544326  0.545738 -1.325810  0.130738
    2013-01-04  1.044803 -0.117151  0.874583  2.278227
    2013-01-05 -2.194728 -2.536257  0.478644  0.057728
    2013-01-06 -1.092031  1.249952  1.598761 -0.153423
    
    #--- pd.date_range with start, end and freq; 'A' is the year-end alias (pandas 2.2+ spells it 'YE') ---
    In [115]: pd.date_range(start='12/31/2011', end='12/31/2013', freq='A')
    Out[115]: DatetimeIndex(['2011-12-31', '2012-12-31', '2013-12-31'], dtype='datetime64[ns]', freq='A-DEC')
    
    
    • Build a DataFrame from a dictionary
    In [78]: df2 = pd.DataFrame({ 'A' : 1.,
        ...:                      'B' : pd.Timestamp('20130102'),
        ...:                      'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
        ...:                      'D' : np.array([3] * 4,dtype='int32'),
        ...:                      'E' : pd.Categorical(["test","train","test","train"]),
        ...:                      'F' : 'foo' })
        ...: df2
        ...:
    Out[78]:
         A          B    C  D      E    F
    0  1.0 2013-01-02  1.0  3   test  foo
    1  1.0 2013-01-02  1.0  3  train  foo
    2  1.0 2013-01-02  1.0  3   test  foo
    3  1.0 2013-01-02  1.0  3  train  foo
    
    In [80]: df2.dtypes
    Out[80]:
    A           float64
    B    datetime64[ns]
    C           float32
    D             int32
    E          category
    F            object
    dtype: object
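
The three construction routes above can be collected into one script. This is a minimal sketch, not the author's session: the seeded generator (seed 0) is an assumption added so that reruns give the same numbers, and the dictionary is trimmed to four columns.

```python
import numpy as np
import pandas as pd

# Series from a list; the NaN forces a float64 dtype.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# DataFrame from a NumPy array. The seed is an arbitrary choice for repeatability.
rng = np.random.default_rng(0)
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

# DataFrame from a dict: scalar values are broadcast to the length
# of the array-like values (4 rows here).
df2 = pd.DataFrame({
    "A": 1.0,
    "C": pd.Series(1, index=range(4), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "F": "foo",
})

print(s.dtype)
print(df.shape)
print(df2.dtypes)
```

Each column of df2 keeps its own dtype, which is why df2.dtypes reports a mix of types rather than a single one.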
    

    In IPython, press <tab> for auto-completion; it lists the attributes and methods available on the object.

    Viewing Data

    df.head()               # first 5 rows
    df.tail(3)              # last 3 rows
    df.index                # row labels (here a DatetimeIndex)
    df.columns              # column labels, returned as an Index object
    df.values               # the underlying data as a NumPy array
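
As a quick check of what each of these returns, here is a small sketch. The frame is filled with np.arange instead of random numbers (an assumption made so the values are predictable), and df.to_numpy() is shown as the method form of df.values.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

first_rows = df.head()      # first 5 rows by default
last_rows = df.tail(3)      # last 3 rows
row_labels = df.index       # DatetimeIndex holding the row labels
col_labels = df.columns     # Index holding the column labels
arr = df.to_numpy()         # same data as df.values, as a NumPy array

print(type(row_labels).__name__, type(col_labels).__name__, arr.shape)
```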
    

    Selecting Data

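    Recommended: use pandas' dedicated selection accessors .at, .iat, .loc and .iloc, which are more efficient. (.ix, which appears in older tutorials, was deprecated in pandas 0.20 and removed in 1.0.)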

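    df['A']       # select one column; returns a Series, same as df.A. A ':' slice inside [] selects rows, not columns.
    
    df[0:3]       # select rows by position (end excluded)
    df['20130102':'20130104']   # select rows by label (both endpoints included)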
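
The behaviour of plain [] depends on what goes inside the brackets, which is easy to mix up. The sketch below uses a deterministic frame (np.arange, an assumption for predictability) to show the four cases side by side.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

col = df["A"]                            # one label -> Series
sub = df[["A", "B"]]                     # list of labels -> DataFrame with those columns
by_pos = df[0:3]                         # integer slice -> rows 0, 1, 2 (end excluded)
by_label = df["20130102":"20130104"]     # label slice -> rows, both endpoints included

print(type(col).__name__, sub.shape, len(by_pos), len(by_label))
```

Both slices return three rows, but for different reasons: the integer slice stops before position 3, while the date slice includes its right endpoint.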
    
    • Selection by label: .loc

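      With .loc[], the brackets take [rows, columns], separated by a comma. If only one value is given, it is treated as a row label. Slices with ":" are allowed.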

      In [82]: df
      Out[82]:
                         A         B         C         D
      2013-01-01 -0.411674  0.273549  0.629843  1.881497
      2013-01-02  1.240512  0.970725  0.033099  1.553420
      2013-01-03 -0.544326  0.545738 -1.325810  0.130738
      2013-01-04  1.044803 -0.117151  0.874583  2.278227
      2013-01-05 -2.194728 -2.536257  0.478644  0.057728
      2013-01-06 -1.092031  1.249952  1.598761 -0.153423
      
      In [83]: df.loc[dates[0]]   # dates is the DatetimeIndex used as the row index
      Out[83]:
      A   -0.411674
      B    0.273549
      C    0.629843
      D    1.881497
      Name: 2013-01-01 00:00:00, dtype: float64
      
      #--- select on both axes at once ---
      In [84]: df.loc['20130102':'20130104',['A','B']]
      Out[84]:
                         A         B
      2013-01-02  1.240512  0.970725
      2013-01-03 -0.544326  0.545738
      2013-01-04  1.044803 -0.117151
      
      #--- select a single scalar value ---
      In [85]: df.loc[dates[0],'A']
      Out[85]: -0.41167416696608039
      
      In [86]: df.at[dates[0],'A']     # faster for a single scalar
      Out[86]: -0.41167416696608039
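
The .loc calls above can be rerun as a script. This is a sketch on a deterministic frame (np.arange is an assumption, used so the selected values are known in advance); it also adds a column label slice, which .loc treats as inclusive on both ends.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

row = df.loc[dates[0]]                               # one row label -> Series indexed by column
block = df.loc["20130102":"20130104", ["A", "B"]]    # row label slice + list of columns
cols = df.loc[:, "B":"C"]                            # all rows, columns B through C inclusive
value_loc = df.loc[dates[0], "A"]                    # scalar via .loc
value_at = df.at[dates[0], "A"]                      # same scalar via .at

print(block)
print(value_loc, value_at)
```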
      
      
    • Selection with a MultiIndex

      The index has more than one level.

      # an example MultiIndex
      MultiIndex(levels=[[1, 2, 3], ['count', 'mean', 'std', 'min', '5%', '10%', '15.0%', '20%', '25%', 
                                  '30.0%', '35%', '40%', '45%', '50%', '55.0%', '60.0%', '65%', '70%', 
                                  '75%', '80%', '85.0%', '90%', '95%', 'max']],
                 labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
                          2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], 
                          [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
                           21, 22, 23, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 
                           18, 19, 20, 21, 22, 23, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
                           15, 16, 17, 18, 19, 20, 21, 22, 23]],
                 names=['label_1', None])
                 
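      df[columnName]                      # select one column, or several with a list [...]
      df.loc[:,columnName]                # select one column, or several with a label slice or a list [...]
      df.loc[1,columnName]             # the outermost index level can be used directly
      df.loc[(1,'std'),columnName]     # deeper levels need a tuple; select a range of rows with tuple:tuple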
      df.loc[[(1,'std'),(2,"count")],'feature_001']
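
The selections above are shown without a frame to run them on, so here is a small self-contained sketch. The index is a cut-down version of the one printed above (three outer labels, three inner labels), and the column names feature_001 and feature_002 plus the integer data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Two-level row index: outer level named 'label_1', inner level unnamed.
index = pd.MultiIndex.from_product(
    [[1, 2, 3], ["count", "mean", "std"]], names=["label_1", None]
)
df = pd.DataFrame(
    np.arange(18).reshape(9, 2), index=index, columns=["feature_001", "feature_002"]
)

outer = df.loc[1, "feature_001"]                               # all rows under outer label 1
single = df.loc[(1, "std"), "feature_001"]                     # one row, addressed by a tuple
picked = df.loc[[(1, "std"), (2, "count")], "feature_001"]     # list of tuples -> those rows
span = df.loc[(1, "std"):(2, "count"), "feature_001"]          # tuple slice, endpoints included

print(outer)
print(single)
print(picked)
```

The tuple slice relies on the index being sorted, which from_product guarantees here because both input lists are already in order.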
      
    • Selection by position: .iloc

      .iloc[] indexes by integer position, so the brackets take integers. As with .loc, rows and columns are separated by a comma.

      In [93]: df.iloc[3]
      Out[93]:
      A    1.044803
      B   -0.117151
      C    0.874583
      D    2.278227
      Name: 2013-01-04 00:00:00, dtype: float64
      
      In [94]: df.iloc[3:5,0:2]
      Out[94]:
                         A         B
      2013-01-04  1.044803 -0.117151
      2013-01-05 -2.194728 -2.536257
      
      In [95]: df.iat[1,1]
      Out[95]: 0.97072539301549565
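
To make the positional rules concrete, here is a sketch on a deterministic frame (np.arange is an assumption so the results are known). Unlike .loc, slices in .iloc follow Python's convention and exclude the end position.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

row = df.iloc[3]                       # fourth row -> Series
block = df.iloc[3:5, 0:2]              # rows 3 and 4, columns 0 and 1 (ends excluded)
picked = df.iloc[[1, 2, 4], [0, 2]]    # explicit lists of positions
value = df.iat[1, 1]                   # fast scalar access by position

print(block)
print(value)
```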
      
    • Boolean Indexing

      Rows where a given column is greater than 0:

      In [96]: df[df.A > 0]
      Out[96]:
                         A         B         C         D
      2013-01-02  1.240512  0.970725  0.033099  1.553420
      2013-01-04  1.044803 -0.117151  0.874583  2.278227
      

      Apply the condition to the whole frame. Values that are not greater than 0 become NaN.

      In [97]: df[df > 0]     
      Out[97]:
                         A         B         C         D
      2013-01-01       NaN  0.273549  0.629843  1.881497
      2013-01-02  1.240512  0.970725  0.033099  1.553420
      2013-01-03       NaN  0.545738       NaN  0.130738
      2013-01-04  1.044803       NaN  0.874583  2.278227
      2013-01-05       NaN       NaN  0.478644  0.057728
      2013-01-06       NaN  1.249952  1.598761       NaN
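
Conditions can also be combined. The sketch below uses a tiny hand-written frame (the values are an assumption for illustration) to show & and | with the parentheses they require, and that df[df > 0] gives the same result as df.where(df > 0).

```python
import pandas as pd

df = pd.DataFrame({"A": [-1.0, 2.0, 3.0, -4.0], "B": [5.0, -6.0, 7.0, -8.0]})

# Each comparison needs its own parentheses because & and | bind tighter than >.
both = df[(df["A"] > 0) & (df["B"] > 0)]      # rows where both columns are positive
either = df[(df["A"] > 0) | (df["B"] > 0)]    # rows where at least one column is positive

masked = df[df > 0]            # frame-wide mask: failing cells become NaN
same = df.where(df > 0)        # explicit form of the same operation

print(both)
print(either)
print(masked)
```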
      

      Filtering on string data

      #---isin ---
      In [98]: df2 = df.copy()
          ...: df2['E'] = ['one', 'one','two','three','four','three']
          ...: df2
          ...:
      Out[98]:
                         A         B         C         D      E
      2013-01-01 -0.411674  0.273549  0.629843  1.881497    one
      2013-01-02  1.240512  0.970725  0.033099  1.553420    one
      2013-01-03 -0.544326  0.545738 -1.325810  0.130738    two
      2013-01-04  1.044803 -0.117151  0.874583  2.278227  three
      2013-01-05 -2.194728 -2.536257  0.478644  0.057728   four
      2013-01-06 -1.092031  1.249952  1.598761 -0.153423  three
      
      In [99]: df2[df2['E'].isin(['two','four'])]
      Out[99]:
                         A         B         C         D     E
      2013-01-03 -0.544326  0.545738 -1.325810  0.130738   two
      2013-01-05 -2.194728 -2.536257  0.478644  0.057728  four
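
Two common variations on the isin filter are negating it with ~ and matching on a string pattern instead of a fixed list. A minimal sketch, with a small stand-in frame (column A here is an assumption, only E matches the example above):

```python
import pandas as pd

df2 = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6],
    "E": ["one", "one", "two", "three", "four", "three"],
})

keep = df2[df2["E"].isin(["two", "four"])]       # rows whose E is in the list
drop = df2[~df2["E"].isin(["two", "four"])]      # ~ inverts the mask: everything else
starts_t = df2[df2["E"].str.startswith("t")]     # pattern match via the .str accessor

print(keep)
print(drop)
print(starts_t)
```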
      

      Using a boolean mask

      In [107]: mask = df2["A"] >0
      
      In [108]: df3 = df2[mask]
      
      In [109]: df3
      Out[109]:
                         A         B         C         D      E
      2013-01-02  1.240512  0.970725  0.033099  1.553420    one
      2013-01-04  1.044803 -0.117151  0.874583  2.278227  three
      
      # list the distinct values: .unique()
      In [101]: df2.loc[:,"E"].unique()
      Out[101]: array(['one', 'two', 'three', 'four'], dtype=object)
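
A mask stored in a variable can be reused with .loc to pick rows and columns in one step. The sketch below rebuilds a two-column version of df2 (values rounded from the output above) and takes a .copy() of the selection; the upper-casing step is a hypothetical edit, added only to show that the copy can be changed without affecting df2.

```python
import pandas as pd

df2 = pd.DataFrame({
    "A": [-0.41, 1.24, -0.54, 1.04, -2.19, -1.09],
    "E": ["one", "one", "two", "three", "four", "three"],
})

mask = df2["A"] > 0                        # boolean Series, one entry per row
df3 = df2.loc[mask, ["A", "E"]].copy()     # rows by mask, columns by label, as an independent copy
df3["E"] = df3["E"].str.upper()            # edits df3 only

distinct = df2["E"].unique()               # distinct values in order of first appearance
counts = df2["E"].value_counts()           # how often each value occurs

print(df3)
print(list(distinct))
print(counts)
```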
  • Original post: https://www.cnblogs.com/pejsidney/p/9226613.html