  • pandas: DataFrame (Part 2)

    pandas: DataFrame data alignment and missing data

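    When DataFrame objects take part in arithmetic, their data is aligned as well: the row index and the column index of the result are the unions of the row indexes and the column indexes of the two operands. A position whose label exists in only one operand is filled with NaN.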
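
    As a minimal sketch of the alignment rule above, the two small frames below (`left` and `right` are made-up names with made-up values, not the article's data) overlap in only one row label and one column label. The `add(..., fill_value=0)` call is an alternative to `+` that treats a value missing from one operand as 0.

```python
import pandas as pd

# Two small frames with partially overlapping row labels and column labels.
left = pd.DataFrame({"open": [1.0, 2.0], "close": [3.0, 4.0]}, index=[0, 1])
right = pd.DataFrame({"open": [10.0, 20.0], "low": [30.0, 40.0]}, index=[1, 2])

# Plain "+" aligns on both axes; the result covers the union of the labels,
# and any cell that is missing from either operand becomes NaN.
total = left + right

# DataFrame.add with fill_value substitutes 0 for a value missing from one
# operand; a cell missing from BOTH operands is still NaN.
filled = left.add(right, fill_value=0)

print(total)
print(filled)
```

    Only the cell at row 1, column `open` exists in both operands, so it is the only non-NaN cell in `total`.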

    Methods for handling missing data in a DataFrame

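      dropna(axis=0, how='any')  # drop missing data: axis=0 drops rows, axis=1 drops columns; how='any' drops a row (column) that contains at least one NaN, how='all' drops it only when every value in it is NaN

      fillna()   # replace missing values with a given value
      isnull()   # element-wise test: True where the value is missing
      notnull()  # element-wise test: True where the value is present
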
      In [62]: df2
      Out[62]:
           open   close    high
      0  22.074  20.657  22.503
      1  20.750  20.489  20.944
      2  20.300  19.593  20.384
      3  19.426  19.977  20.308
      4  19.995  20.520  20.706
      5  20.353  20.273  20.454
      6  20.264  20.101  20.353
      7  19.999  19.739  19.999
      8  19.783  19.818  19.982
      9  19.558  19.841  19.911

      In [63]: df3
      Out[63]:
               date    open   close     low
      0  2007-03-01  22.074  20.657  20.220
      1  2007-03-02  20.750  20.489  20.256
      2  2007-03-05  20.300  19.593  19.218
      3  2007-03-06  19.426  19.977  19.315
      4  2007-03-07  19.995  20.520  19.827
      5  2007-03-08  20.353  20.273  20.167
      6  2007-03-09  20.264  20.101  19.735
      7  2007-03-12  19.999  19.739  19.646
      8  2007-03-13  19.783  19.818  19.699
      9  2007-03-14  19.558  19.841  19.333

      In [64]: df4 = df2 + df3  # date, high and low each exist in only one operand, so they come out as NaN

      In [65]: df4
      Out[65]:
          close date  high  low    open
      0  41.314  NaN   NaN  NaN  44.148
      1  40.978  NaN   NaN  NaN  41.500
      2  39.186  NaN   NaN  NaN  40.600
      3  39.954  NaN   NaN  NaN  38.852
      4  41.040  NaN   NaN  NaN  39.990
      5  40.546  NaN   NaN  NaN  40.706
      6  40.202  NaN   NaN  NaN  40.528
      7  39.478  NaN   NaN  NaN  39.998
      8  39.636  NaN   NaN  NaN  39.566
      9  39.682  NaN   NaN  NaN  39.116

      In [66]: df4.dropna(axis=1)  # drop every column that contains a NaN (how='any' is the default)
      Out[66]:
          close    open
      0  41.314  44.148
      1  40.978  41.500
      2  39.186  40.600
      3  39.954  38.852
      4  41.040  39.990
      5  40.546  40.706
      6  40.202  40.528
      7  39.478  39.998
      8  39.636  39.566
      9  39.682  39.116

      In [67]: df4.fillna(0)
      Out[67]:
          close  date  high  low    open
      0  41.314     0   0.0  0.0  44.148
      1  40.978     0   0.0  0.0  41.500
      2  39.186     0   0.0  0.0  40.600
      3  39.954     0   0.0  0.0  38.852
      4  41.040     0   0.0  0.0  39.990
      5  40.546     0   0.0  0.0  40.706
      6  40.202     0   0.0  0.0  40.528
      7  39.478     0   0.0  0.0  39.998
      8  39.636     0   0.0  0.0  39.566
      9  39.682     0   0.0  0.0  39.116

      In [68]: df4.isnull()
      Out[68]:
         close  date  high   low   open
      0  False  True  True  True  False
      1  False  True  True  True  False
      2  False  True  True  True  False
      3  False  True  True  True  False
      4  False  True  True  True  False
      5  False  True  True  True  False
      6  False  True  True  True  False
      7  False  True  True  True  False
      8  False  True  True  True  False
      9  False  True  True  True  False

      In [69]: df4.notnull()
      Out[69]:
         close   date   high    low  open
      0   True  False  False  False  True
      1   True  False  False  False  True
      2   True  False  False  False  True
      3   True  False  False  False  True
      4   True  False  False  False  True
      5   True  False  False  False  True
      6   True  False  False  False  True
      7   True  False  False  False  True
      8   True  False  False  False  True
      9   True  False  False  False  True
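
    The transcript above only exercises the default `how='any'`. The sketch below, on a small made-up frame (the name `df` and its values are illustrative, not the article's data), contrasts `how='any'` with `how='all'` and shows two further uses of the same methods: `fillna` with a per-column dict and counting NaN values with `isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Row 1 is entirely NaN, and the "high" column is entirely NaN.
df = pd.DataFrame(
    {
        "open": [22.074, np.nan, 20.300],
        "close": [20.657, np.nan, np.nan],
        "high": [np.nan, np.nan, np.nan],
    }
)

rows_any = df.dropna(axis=0, how="any")  # drop rows containing at least one NaN
rows_all = df.dropna(axis=0, how="all")  # drop rows where every value is NaN
cols_all = df.dropna(axis=1, how="all")  # drop columns where every value is NaN

# fillna also accepts a dict mapping column names to per-column fill values;
# columns not named in the dict ("high" here) are left untouched.
filled = df.fillna({"open": 0.0, "close": df["close"].mean()})

# isnull() returns a boolean mask; summing it counts the NaN values per column.
nan_counts = df.isnull().sum()

print(rows_any, rows_all, cols_all, filled, nan_counts, sep="\n\n")
```

    Because every row contains a NaN in `high`, `how='any'` on rows leaves an empty frame, which is why `how='all'` or a column-wise drop is often the more useful first step.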

    Common pandas methods (applicable to both Series and DataFrame)

      In [89]: df5
      Out[89]:
         id        date    open   close    high     low      volume    code
      0   0  2007-03-01  22.074  20.657  22.503  20.220  1977633.51  601318
      1   1  2007-03-02  20.750  20.489  20.944  20.256   425048.32  601318
      2   2  2007-03-05  20.300  19.593  20.384  19.218   419196.74  601318
      3   3  2007-03-06  19.426  19.977  20.308  19.315   297727.88  601318
      4   4  2007-03-07  19.995  20.520  20.706  19.827   287463.78  601318
      5   5  2007-03-08  20.353  20.273  20.454  20.167   130983.83  601318
      6   6  2007-03-09  20.264  20.101  20.353  19.735   160887.79  601318
      7   7  2007-03-12  19.999  19.739  19.999  19.646   145353.06  601318
      8   8  2007-03-13  19.783  19.818  19.982  19.699   102319.68  601318
      9   9  2007-03-14  19.558  19.841  19.911  19.333   173306.56  601318

      mean(axis=0, skipna=True)  # mean; axis=0 (the default) averages each column, and skipna=True (the default) ignores NaN

      In [90]: df5.mean()  # newer pandas (2.0 and later) no longer silently drops the non-numeric date column; pass numeric_only=True there
      Out[90]:
      id             4.5000
      open          20.2502
      close         20.1008
      high          20.5544
      low           19.7416
      volume    411992.1150
      code      601318.0000
      dtype: float64

      In [91]: df5['open'].mean()
      Out[91]: 20.2502

      sum(axis=0)  # sum; axis=0 (the default) sums each column, axis=1 sums each row

      In [93]: df5.sum()
      Out[93]:
      id                                                       45
      date      2007-03-012007-03-022007-03-052007-03-062007-0...
      open                                                202.502
      close                                               201.008
      high                                                205.544
      low                                                 197.416
      volume                                          4.11992e+06
      code                                                6013180
      dtype: object

      sort_index(axis, ascending, ...)  # sort by row or column labels
      sort_values(by, axis, ascending)  # sort by values

      In [99]: df5.sort_index(axis=0)
      Out[99]:
         id        date    open   close    high     low      volume    code
      0   0  2007-03-01  22.074  20.657  22.503  20.220  1977633.51  601318
      1   1  2007-03-02  20.750  20.489  20.944  20.256   425048.32  601318
      2   2  2007-03-05  20.300  19.593  20.384  19.218   419196.74  601318
      3   3  2007-03-06  19.426  19.977  20.308  19.315   297727.88  601318
      4   4  2007-03-07  19.995  20.520  20.706  19.827   287463.78  601318
      5   5  2007-03-08  20.353  20.273  20.454  20.167   130983.83  601318
      6   6  2007-03-09  20.264  20.101  20.353  19.735   160887.79  601318
      7   7  2007-03-12  19.999  19.739  19.999  19.646   145353.06  601318
      8   8  2007-03-13  19.783  19.818  19.982  19.699   102319.68  601318
      9   9  2007-03-14  19.558  19.841  19.911  19.333   173306.56  601318

      In [102]: df5.sort_values(['close','open'])  # sort by close, breaking ties by open (ascending by default)
      Out[102]:
         id        date    open   close    high     low      volume    code
      2   2  2007-03-05  20.300  19.593  20.384  19.218   419196.74  601318
      7   7  2007-03-12  19.999  19.739  19.999  19.646   145353.06  601318
      8   8  2007-03-13  19.783  19.818  19.982  19.699   102319.68  601318
      9   9  2007-03-14  19.558  19.841  19.911  19.333   173306.56  601318
      3   3  2007-03-06  19.426  19.977  20.308  19.315   297727.88  601318
      6   6  2007-03-09  20.264  20.101  20.353  19.735   160887.79  601318
      5   5  2007-03-08  20.353  20.273  20.454  20.167   130983.83  601318
      1   1  2007-03-02  20.750  20.489  20.944  20.256   425048.32  601318
      4   4  2007-03-07  19.995  20.520  20.706  19.827   287463.78  601318
      0   0  2007-03-01  22.074  20.657  22.503  20.220  1977633.51  601318
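
    The aggregation and sorting calls in the listing above all use their default arguments. The following sketch, on a small made-up frame with one NaN and a shuffled index (illustrative values only), shows how `axis`, `skipna` and `ascending` change the result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "open": [22.074, 20.750, 20.300],
        "close": [20.657, np.nan, 19.593],
    },
    index=[2, 0, 1],
)

col_mean = df.mean()                     # axis=0: one value per column, NaN skipped
col_mean_strict = df.mean(skipna=False)  # NaN propagates into the "close" result
row_sum = df.sum(axis=1)                 # axis=1: one value per row, NaN skipped

by_label = df.sort_index()               # rows ordered by index label: 0, 1, 2
# Descending sort by value; rows whose sort key is NaN go last by default.
by_close = df.sort_values("close", ascending=False)

print(col_mean, col_mean_strict, row_sum, by_label, by_close, sep="\n\n")
```

    Note that `sort_index` and `sort_values` return new objects; `df` itself keeps its original row order.
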
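      # apply(func, axis=0)  # apply a custom function to each column (axis=0) or each row (axis=1); func may return a scalar or a Series
      # applymap(func)       # apply a function to every element of a DataFrame (deprecated since pandas 2.1 in favor of DataFrame.map)
      # map(func)            # apply a function to every element of a Series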
      In [108]: df2
      Out[108]:
           open   close    high     low      volume
      0  22.074  20.657  22.503  20.220  1977633.51
      1  20.750  20.489  20.944  20.256   425048.32
      2  20.300  19.593  20.384  19.218   419196.74
      3  19.426  19.977  20.308  19.315   297727.88
      4  19.995  20.520  20.706  19.827   287463.78
      5  20.353  20.273  20.454  20.167   130983.83
      6  20.264  20.101  20.353  19.735   160887.79
      7  19.999  19.739  19.999  19.646   145353.06
      8  19.783  19.818  19.982  19.699   102319.68
      9  19.558  19.841  19.911  19.333   173306.56

      In [110]: df2.apply(lambda x:x.sum())
      Out[110]:
      open          202.502
      close         201.008
      high          205.544
      low           197.416
      volume    4119921.150
      dtype: float64

      In [109]: df2.applymap(lambda x:x+1)
      Out[109]:
           open   close    high     low      volume
      0  23.074  21.657  23.503  21.220  1977634.51
      1  21.750  21.489  21.944  21.256   425049.32
      2  21.300  20.593  21.384  20.218   419197.74
      3  20.426  20.977  21.308  20.315   297728.88
      4  20.995  21.520  21.706  20.827   287464.78
      5  21.353  21.273  21.454  21.167   130984.83
      6  21.264  21.101  21.353  20.735   160888.79
      7  20.999  20.739  20.999  20.646   145354.06
      8  20.783  20.818  20.982  20.699   102320.68
      9  20.558  20.841  20.911  20.333   173307.56
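
    To make the difference between the three methods concrete, here is a small sketch on a made-up two-row frame. It shows what `func` receives in each case: a whole column, a whole row, or a single element. Since `applymap` is deprecated in pandas 2.1 and later in favor of `DataFrame.map`, the sketch picks whichever of the two the installed version provides; that fallback is my own addition, not something from the original post.

```python
import pandas as pd

df = pd.DataFrame({"open": [22.074, 20.750], "close": [20.657, 20.489]})

# apply with axis=0 (the default): func receives each column as a Series.
col_range = df.apply(lambda col: col.max() - col.min())

# apply with axis=1: func receives each row as a Series.
row_spread = df.apply(lambda row: row["open"] - row["close"], axis=1)

# Series.map: func receives one element at a time.
open_whole = df["open"].map(int)  # truncate each price to its integer part

# Element-wise over the whole DataFrame: use DataFrame.map where it exists
# (pandas 2.1+), otherwise fall back to the older applymap.
elementwise = df.map if hasattr(df, "map") else df.applymap
plus_one = elementwise(lambda x: x + 1)

print(col_range, row_spread, open_whole, plus_one, sep="\n\n")
```

    For simple arithmetic such as `x + 1`, the vectorized expression `df + 1` gives the same result and is usually faster; element-wise mapping is mainly useful for functions that have no vectorized equivalent.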

    pandas: hierarchical indexing

    Hierarchical indexing is an important pandas feature: it lets a single axis carry multiple index levels.

      In [114]: import numpy as np
      In [115]: data = pd.Series(np.random.rand(9),
           ...:                  index=[['a','a','a','b','b','b','c','c','c'],
           ...:                         [1,2,3,1,2,3,1,2,3]])

      In [116]: data
      Out[116]:
      a  1    0.445620
         2    0.584242
         3    0.454314
      b  1    0.439814
         2    0.714734
         3    0.415314
      c  1    0.491325
         2    0.411385
         3    0.617076
      dtype: float64

      In [118]: data['a']
      Out[118]:
      1    0.445620
      2    0.584242
      3    0.454314
      dtype: float64

      In [119]: data['c']
      Out[119]:
      1    0.491325
      2    0.411385
      3    0.617076
      dtype: float64
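
    Beyond selecting by the outer label as above, a two-level index can also be addressed by a full key, by the inner level alone, or reshaped into a table. The sketch below is one way to do that; it uses `np.arange` instead of random values so the results are reproducible, and the level names `outer` and `inner` are hypothetical labels chosen for illustration.

```python
import numpy as np
import pandas as pd

# Same shape of index as in the transcript: outer labels a/b/c, inner labels 1/2/3.
index = pd.MultiIndex.from_product(
    [["a", "b", "c"], [1, 2, 3]], names=["outer", "inner"]
)
data = pd.Series(np.arange(9.0), index=index)

outer_a = data["a"]                  # every inner entry under outer label "a"
single = data.loc[("b", 2)]          # one element, addressed by a full (outer, inner) key
inner_2 = data.xs(2, level="inner")  # cross-section: inner label 2 under every outer label
wide = data.unstack()                # move the inner level into columns: a 3x3 DataFrame

print(outer_a, single, inner_2, wide, sep="\n\n")
```

    `unstack` is the bridge between a hierarchically indexed Series and a DataFrame; calling `wide.stack()` goes back the other way.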
  • Original article: https://www.cnblogs.com/YingLai/p/9300775.html