  • 3.1, pandas [Basic Functionality]

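    I: Reindexing

      The reindex method returns a new object conformed to a new index. For a Series it relabels the index directly; for a DataFrame it can change the row index, the column index, or both at once.

      1) For a Series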

    # Assumes `import numpy as np` and `import pandas as pd` throughout.
    In [2]: seri = pd.Series([4.5,7.2,-5.3,3.6],index = ['d','b','a','c'])

    In [3]: seri
    Out[3]:
    d    4.5
    b    7.2
    a   -5.3
    c    3.6
    dtype: float64

    In [4]: seri1 = seri.reindex(['a','b','c','d','e'])

    In [5]: seri1
    Out[5]:
    a   -5.3
    b    7.2
    c    3.6
    d    4.5
    e    NaN    # labels that did not exist become NaN
    dtype: float64

    In [6]: seri.reindex(['a','b','c','d','e'], fill_value=0)
    Out[6]:
    a   -5.3
    b    7.2
    c    3.6
    d    4.5
    e    0.0     # labels that did not exist are filled with 0
    dtype: float64

    In [7]: seri
    Out[7]:
    d    4.5
    b    7.2
    a   -5.3
    c    3.6
    dtype: float64

    In [8]: seri_2 = pd.Series(['blue','purple','yellow'], index=[0,2,4])

    In [9]: seri_2
    Out[9]:
    0      blue
    2    purple
    4    yellow
    dtype: object

    # Fill methods accepted by reindex: ffill fills forward, bfill fills backward

    In [10]: seri_2.reindex(range(6),method='ffill')
    Out[10]:
    0      blue
    1      blue
    2    purple
    3    purple
    4    yellow
    5    yellow
    dtype: object

    In [11]: seri_2.reindex(range(6),method='bfill')
    Out[11]:
    0      blue
    1    purple
    2    purple
    3    yellow
    4    yellow
    5       NaN
    dtype: object
    Reindexing a Series

      2) For a DataFrame

        Parameters of reindex: method='ffill' or 'bfill'; fill_value=... [the value used instead of NaN for newly introduced labels]; ...

    In [4]: dframe_1 = pd.DataFrame(np.arange(9).reshape((3,3)),index=['a','b','c'],columns=['Ohio','Texas','Cal'])

    In [5]: dframe_1
    Out[5]:
       Ohio  Texas  Cal
    a     0      1    2
    b     3      4    5
    c     6      7    8

    In [6]: dframe_2 = dframe_1.reindex(['a','b','c','d'])

    In [7]: dframe_2
    Out[7]:
       Ohio  Texas  Cal
    a     0      1    2
    b     3      4    5
    c     6      7    8
    d   NaN    NaN  NaN

    # Output as recorded in the original post; newer pandas releases may reject method='ffill' here because the column labels are not sorted
    In [16]: dframe_1.reindex(index=['a','b','c','d'],method='ffill',columns=['Ohio','Beijin','Cal'])
    Out[16]:
       Ohio  Beijin  Cal
    a     0     NaN    2
    b     3     NaN    5
    c     6     NaN    8
    d     6     NaN    8

    In [17]: dframe_1.reindex(index=['a','b','c','d'],fill_value='Z',columns=['Ohio','Beijin','Cal'])
    Out[17]:
      Ohio Beijin Cal
    a    0      Z   2
    b    3      Z   5
    c    6      Z   8
    d    Z      Z   Z

    In [8]: dframe_1.reindex(columns=['Chengdu','Beijin','Shanghai','Guangdong'])
    Out[8]:
       Chengdu  Beijin  Shanghai  Guangdong
    a      NaN     NaN       NaN        NaN
    b      NaN     NaN       NaN        NaN
    c      NaN     NaN       NaN        NaN

    In [9]: dframe_1
    Out[9]:
       Ohio  Texas  Cal
    a     0      1    2
    b     3      4    5
    c     6      7    8

    # Change the row and column labels at once. The original post used dframe_1.ix[rows, cols];
    # ix was removed in pandas 1.0, and reindex gives the same result.
    In [10]: dframe_1.reindex(index=['a','b','c','d'],columns=['Ohio','Beijing','Guangdong'])
    Out[10]:
       Ohio  Beijing  Guangdong
    a     0      NaN        NaN
    b     3      NaN        NaN
    c     6      NaN        NaN
    d   NaN      NaN        NaN
    Reindexing a DataFrame
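
For readers on pandas 1.0 or later, where `ix` no longer exists, relabeling both axes can be sketched with `reindex` as below. The frame mirrors `dframe_1` above; the names `both` and `raised` are illustrative only and not from the original post.

```python
import numpy as np
import pandas as pd

dframe_1 = pd.DataFrame(np.arange(9).reshape((3, 3)),
                        index=['a', 'b', 'c'],
                        columns=['Ohio', 'Texas', 'Cal'])

# Relabel rows and columns in one call; labels that do not exist become NaN.
both = dframe_1.reindex(index=['a', 'b', 'c', 'd'],
                        columns=['Ohio', 'Beijing', 'Guangdong'])
print(both)

# .loc, by contrast, refuses labels that are missing from an axis
# (behaviour assumed here: pandas 1.0 or later).
try:
    dframe_1.loc[['a', 'b', 'c', 'd'], ['Ohio', 'Beijing', 'Guangdong']]
    raised = False
except KeyError:
    raised = True
print(raised)
```

The design point is that `reindex` is the tool for introducing new labels, while `loc` is for selecting labels that already exist.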

    II: Dropping entries from an axis

      The drop method removes entries by index label and returns a new object.

      1) For a Series

    In [21]: seri = pd.Series(np.arange(5),index=['a','b','c','d','e'])

    In [22]: seri
    Out[22]:
    a    0
    b    1
    c    2
    d    3
    e    4
    dtype: int32

    In [23]: seri.drop('b')
    Out[23]:
    a    0
    c    2
    d    3
    e    4
    dtype: int32

    In [24]: seri.drop(['d','e'])
    Out[24]:
    a    0
    b    1
    c    2
    dtype: int32
    Dropping entries from a Series

      2) For a DataFrame

    In [29]: dframe = pd.DataFrame(np.arange(16).reshape((4,4)),index=['Chen','Bei','Shang','Guang'],columns=['one','two','three','four'])

    In [30]: dframe
    Out[30]:
           one  two  three  four
    Chen     0    1      2     3
    Bei      4    5      6     7
    Shang    8    9     10    11
    Guang   12   13     14    15

    # Drop rows
    In [31]: dframe.drop(['Bei','Shang'])
    Out[31]:
           one  two  three  four
    Chen     0    1      2     3
    Guang   12   13     14    15

    # Drop columns
    In [33]: dframe.drop(['two','three'],axis=1)
    Out[33]:
           one  four
    Chen     0     3
    Bei      4     7
    Shang    8    11
    Guang   12    15

    # When the first argument is a single label, the list brackets can be omitted
    Dropping entries from a DataFrame
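
As a small sketch of the closing remark above (a single label needs no brackets), the snippet below also shows that `drop` leaves the original frame untouched. The `columns=` keyword is an alternative spelling of `axis=1` in recent pandas; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

dframe = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['Chen', 'Bei', 'Shang', 'Guang'],
                      columns=['one', 'two', 'three', 'four'])

# A single label can be passed without list brackets.
no_bei = dframe.drop('Bei')

# columns= is an alternative to axis=1.
no_two = dframe.drop(columns='two')

# drop returns a new object; the original frame still has all 4 x 4 entries.
print(no_bei.index.tolist())
print(no_two.columns.tolist())
print(dframe.shape)
```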

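    III: Indexing, selection, and filtering

      1) Series

        Elements can still be accessed by position, as with a list, but I do not find that ideal; it is better to access them by index label. Labels can also be used for slicing, in which case the end label is included.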

    In [4]: seri = pd.Series(np.arange(4),index=['a','b','c','d'])
    
    In [5]: seri
    Out[5]:
    a    0
    b    1
    c    2
    d    3
    dtype: int32
    
    In [6]: seri['a']
    Out[6]: 0
    
    In [7]: seri[['b','a']]       # the result follows the requested order
    Out[7]:
    b    1
    a    0
    dtype: int32
    
    
    In [18]: seri[seri<2]    # element-wise comparison used as a boolean mask
    Out[18]:
    a    0
    b    1
    dtype: int32
    
    In [11]: seri['a':'c']     # slicing by label; the end label 'c' is included
    Out[11]:
    a    0
    b    1
    c    2
    dtype: int32
    
    In [12]: seri['a':'c']='z'
    
    In [13]: seri
    Out[13]:
    a    z
    b    z
    c    z
    d    3
    dtype: object
    Selecting from a Series
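
The difference between label slices and positional slices can be sketched explicitly with `loc` and `iloc`. The Series mirrors `seri` above; `by_label` and `by_position` are names made up for this example.

```python
import numpy as np
import pandas as pd

seri = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

by_label = seri.loc['a':'c']    # label slice: the end label 'c' is included
by_position = seri.iloc[0:2]    # positional slice: position 2 is excluded

print(by_label.tolist())        # [0, 1, 2]
print(by_position.tolist())     # [0, 1]
```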

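      2) DataFrame

        This is mostly a matter of selecting one or more columns. A DataFrame can be viewed as a set of Series that share the same index, one per column, so for [] indexing the labels across the top (the column labels) act as the keys. dframe[[label_1, ..., label_n]] therefore selects several columns, and every label must be a column label, otherwise a KeyError is raised. To select rows, use a slice: dframe[:2] selects rows 0 and 1. ix[arg1 (rows), arg2 (columns)] selects along both directions at once; ix was removed in pandas 1.0, where loc (labels) and iloc (integer positions) take its place.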

    In [19]: dframe = pd.DataFrame(np.arange(16).reshape((4,4)),index=['one','two','three','four'],columns=['Bei','Shang','Guang','Sheng'])

    In [21]: dframe
    Out[21]:
           Bei  Shang  Guang  Sheng
    one      0      1      2      3
    two      4      5      6      7
    three    8      9     10     11
    four    12     13     14     15

    In [22]: dframe[['one']]         # 'one' is a row label, not a column label, hence the error described above
    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    <ipython-input-22-c2522043b676> in <module>()
    ----> 1 dframe[['one']]

    In [25]: dframe[['Bei']]
    Out[25]:
           Bei
    one      0
    two      4
    three    8
    four    12

    In [26]: dframe[['Bei','Sheng']]
    Out[26]:
           Bei  Sheng
    one      0      3
    two      4      7
    three    8     11
    four    12     15

    In [27]: dframe[:2]        # select rows
    Out[27]:
         Bei  Shang  Guang  Sheng
    one    0      1      2      3
    two    4      5      6      7

    In [32]: # Label-based indexing on both axes: the first argument selects rows, the second selects columns.
             # The original post used ix, which was removed in pandas 1.0; loc/iloc are used here instead.

    In [33]: dframe.loc[['one','two'],['Bei','Shang']]
    Out[33]:
         Bei  Shang
    one    0      1
    two    4      5

    # Each column label is also exposed as an attribute of the DataFrame instance
    In [35]: dframe.Bei
    Out[35]:
    one       0
    two       4
    three     8
    four     12
    Name: Bei, dtype: int32

    In [36]: dframe[dframe.Bei<5]
    Out[36]:
         Bei  Shang  Guang  Sheng
    one    0      1      2      3
    two    4      5      6      7

    In [38]: dframe[dframe.Bei<5].iloc[:,:2]
    Out[38]:
         Bei  Shang
    one    0      1
    two    4      5

    In [43]: dframe.loc[:'two',['Shang','Bei']]
    Out[43]:
         Shang  Bei
    one      1    0
    two      5    4
    Selecting from a DataFrame
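
The two-axis selections above can be summarised in one short sketch using `loc` (labels) and `iloc` (integer positions). The frame mirrors `dframe` above; `by_label`, `by_position` and `masked` are illustrative names.

```python
import numpy as np
import pandas as pd

dframe = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['one', 'two', 'three', 'four'],
                      columns=['Bei', 'Shang', 'Guang', 'Sheng'])

# Labels on both axes.
by_label = dframe.loc[['one', 'two'], ['Bei', 'Shang']]

# Integer positions on both axes.
by_position = dframe.iloc[:2, :2]

# Boolean row mask combined with column labels.
masked = dframe.loc[dframe['Bei'] < 5, ['Bei', 'Shang']]

# All three select the same 2 x 2 corner of the frame.
print(by_label.equals(by_position))
print(masked.equals(by_label))
```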

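    IV: Arithmetic

      1) Series

        Operands are aligned on their index labels before the operation is applied, and any label present in only one operand yields NaN. The arithmetic methods (add, sub, ...) can avoid this through their fill_value argument.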

    In [4]: seri_1 = pd.Series([1,2,3,4],index = ['a','b','c','d'])

    In [5]: seri_2 = pd.Series([5,6,7,8,9],index = ['a','c','e','g','f'])

    In [6]: seri_1 + seri_2
    Out[6]:
    a     6
    b   NaN
    c     9
    d   NaN
    e   NaN
    f   NaN
    g   NaN
    dtype: float64

    In [8]: seri_1.add(seri_2)
    Out[8]:
    a     6
    b   NaN
    c     9
    d   NaN
    e   NaN
    f   NaN
    g   NaN
    dtype: float64

    In [7]: seri_1.add(seri_2,fill_value = 0)
    Out[7]:
    a    6
    b    2
    c    9
    d    4
    e    7
    f    9
    g    8
    dtype: float64

    # With fill_value, the non-overlapping labels keep a value instead of becoming NaN
    # Methods and their operators: add: +; mul: *; sub: -; div: /
    Series arithmetic
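
The same alignment rule applies to the other arithmetic methods. As a minimal sketch, here is `sub` with and without `fill_value` on the two Series from above; `plain` and `filled` are names chosen for this example.

```python
import pandas as pd

seri_1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
seri_2 = pd.Series([5, 6, 7, 8, 9], index=['a', 'c', 'e', 'g', 'f'])

# Plain operator: labels present on only one side give NaN.
plain = seri_1 - seri_2

# Method form: the missing side is treated as 0 before subtracting.
filled = seri_1.sub(seri_2, fill_value=0)

print(plain)
print(filled)
```

Only 'a' and 'c' exist in both operands, so `plain` has five NaN entries while `filled` has none.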

      2) DataFrame

    In [10]: df_1 = pd.DataFrame(np.arange(12).reshape((3,4)),columns = list('abcd'))

    In [11]: df_2 = pd.DataFrame(np.arange(20).reshape((4,5)),columns = list('abcde'))

    In [12]: df_1 + df_2
    Out[12]:
        a   b   c   d   e
    0   0   2   4   6 NaN
    1   9  11  13  15 NaN
    2  18  20  22  24 NaN
    3 NaN NaN NaN NaN NaN

    In [13]: df_1.add(df_2)
    Out[13]:
        a   b   c   d   e
    0   0   2   4   6 NaN
    1   9  11  13  15 NaN
    2  18  20  22  24 NaN
    3 NaN NaN NaN NaN NaN

    In [14]: df_1.add(df_2, fill_value = 0)
    Out[14]:
        a   b   c   d   e
    0   0   2   4   6   4
    1   9  11  13  15   9
    2  18  20  22  24  14
    3  15  16  17  18  19
    DataFrame arithmetic

      3) Operations between a DataFrame and a Series

      These behave like broadcasting with np.array:

    In [15]: arr_1 = np.arange(12).reshape((3,4))

    In [16]: arr_1 - arr_1[0]
    Out[16]:
    array([[0, 0, 0, 0],
           [4, 4, 4, 4],
           [8, 8, 8, 8]])

    In [17]: arr_1
    Out[17]:
    array([[ 0,  1,  2,  3],
           [ 4,  5,  6,  7],
           [ 8,  9, 10, 11]])
    Broadcasting with a NumPy array
    In [18]: dframe_1 = pd.DataFrame(np.arange(12).reshape((4,3)),columns=list('bde'),index = ['Chen','Bei','Shang','Sheng'])

    In [19]: dframe_1
    Out[19]:
           b   d   e
    Chen   0   1   2
    Bei    3   4   5
    Shang  6   7   8
    Sheng  9  10  11

    In [20]: seri = dframe_1.iloc[0]      # the original post used dframe_1.ix[0]; ix was removed in pandas 1.0

    In [21]: seri
    Out[21]:
    b    0
    d    1
    e    2
    Name: Chen, dtype: int32

    In [22]: dframe_1 - seri      # matched on column labels and applied to every row
    Out[22]:
           b  d  e
    Chen   0  0  0
    Bei    3  3  3
    Shang  6  6  6
    Sheng  9  9  9

    In [23]: seri_2 = pd.Series(range(3),index=['b','e','f'])

    In [24]: dframe_1 - seri_2
    Out[24]:
           b   d   e   f
    Chen   0 NaN   1 NaN
    Bei    3 NaN   4 NaN
    Shang  6 NaN   7 NaN
    Sheng  9 NaN  10 NaN

    In [27]: seri_3 = dframe_1['d']

    In [28]: seri_3        # Note: seri_3 is indexed by the row labels, not by dframe_1's column labels, unlike the cases above
    Out[28]:
    Chen      1
    Bei       4
    Shang     7
    Sheng    10
    Name: d, dtype: int32

    In [29]: dframe_1 - seri_3
    Out[29]:
           Bei  Chen  Shang  Sheng   b   d   e
    Chen   NaN   NaN    NaN    NaN NaN NaN NaN
    Bei    NaN   NaN    NaN    NaN NaN NaN NaN
    Shang  NaN   NaN    NaN    NaN NaN NaN NaN
    Sheng  NaN   NaN    NaN    NaN NaN NaN NaN
    # Note: the result's columns are the union of the Series index and the frame's own columns

    # The axis argument of the arithmetic methods changes which axis is matched, which avoids the result above
    # axis=0 matches the Series index against the row labels; axis=1 matches it against the column labels
    In [31]: dframe_1.sub(seri_3,axis=0)
    Out[31]:
           b  d  e
    Chen  -1  0  1
    Bei   -1  0  1
    Shang -1  0  1
    Sheng -1  0  1

    In [33]: dframe_1.sub(seri_3,axis=1)
    Out[33]:
           Bei  Chen  Shang  Sheng   b   d   e
    Chen   NaN   NaN    NaN    NaN NaN NaN NaN
    Bei    NaN   NaN    NaN    NaN NaN NaN NaN
    Shang  NaN   NaN    NaN    NaN NaN NaN NaN
    Sheng  NaN   NaN    NaN    NaN NaN NaN NaN
    DataFrame and Series operations

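        Note: the axis argument can be read as follows. 0: the operand is a Series indexed by the row index [vertical axis]; 1: the operand is a Series indexed by the columns [horizontal axis].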
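
The axis rule in the note can be sketched with the string aliases `'index'` and `'columns'`, which pandas accepts in place of 0 and 1. The frame mirrors `dframe_1` above; `row`, `col`, `minus_row` and `minus_col` are illustrative names.

```python
import numpy as np
import pandas as pd

dframe_1 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                        columns=list('bde'),
                        index=['Chen', 'Bei', 'Shang', 'Sheng'])

row = dframe_1.iloc[0]   # Series indexed by the column labels b, d, e
col = dframe_1['d']      # Series indexed by the row labels

# axis='columns' (same as axis=1): match the Series index on column labels.
minus_row = dframe_1.sub(row, axis='columns')

# axis='index' (same as axis=0): match the Series index on row labels.
minus_col = dframe_1.sub(col, axis='index')

print(minus_row)
print(minus_col)
```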

    V: Applying functions

    In [6]: dframe=pd.DataFrame(np.random.randn(4,3),columns=list('bde'),index=['Chen','Bei','Shang','Sheng'])

    In [7]: dframe
    Out[7]:
                  b         d         e
    Chen   1.838620  1.023421  0.641420
    Bei    0.920563 -2.037778 -0.853871
    Shang -0.587332  0.576442  0.596269
    Sheng  0.366174 -0.689582 -1.064030

    In [8]: np.abs(dframe)       # element-wise absolute value
    Out[8]:
                  b         d         e
    Chen   1.838620  1.023421  0.641420
    Bei    0.920563  2.037778  0.853871
    Shang  0.587332  0.576442  0.596269
    Sheng  0.366174  0.689582  1.064030

    In [9]: func = lambda x: x.max() - x.min()

    In [10]: dframe.apply(func)
    Out[10]:
    b    2.425952
    d    3.061200
    e    1.705449
    dtype: float64

    In [11]: dframe.apply(func,axis=1)
    Out[11]:
    Chen     1.197200
    Bei      2.958341
    Shang    1.183602
    Sheng    1.430204
    dtype: float64

    In [12]: dframe.max()  # same as dframe.max(axis=0)
    Out[12]:
    b    1.838620
    d    1.023421
    e    0.641420
    dtype: float64

    In [15]: dframe.max(axis=1)
    Out[15]:
    Chen     1.838620
    Bei      0.920563
    Shang    0.596269
    Sheng    0.366174
    dtype: float64
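
The function passed to `apply` does not have to return a scalar. As a sketch, the helper `min_max` below (a hypothetical name, with fixed data instead of random numbers so the result is reproducible) returns a Series per column, and `Series.map` applies a function element by element.

```python
import pandas as pd

dframe = pd.DataFrame({'b': [1.5, -0.5, 2.0], 'd': [0.0, 3.0, -1.0]},
                      index=['Chen', 'Bei', 'Shang'])

def min_max(column):
    """Return the smallest and largest value of one column as a Series."""
    return pd.Series([column.min(), column.max()], index=['min', 'max'])

# apply calls min_max once per column and places the results side by side.
summary = dframe.apply(min_max)
print(summary)

# Series.map works element by element, here formatting column 'b'.
formatted = dframe['b'].map(lambda x: '%.2f' % x)
print(formatted.tolist())
```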

    VI: Sorting

      1) Sorting by index: sort_index(axis=0/1, ascending=True/False). The defaults are axis=0 (sort by the row index) and ascending=True (ascending order).

    In [16]: seri = pd.Series(range(4),index=['d','a','d','c'])

    In [17]: seri
    Out[17]:
    d    0
    a    1
    d    2
    c    3
    dtype: int64

    In [18]: seri.sort_index()
    Out[18]:
    a    1
    c    3
    d    2
    d    0
    dtype: int64
    Sorting a Series by index
    In [22]: dframe
    Out[22]:
                  c         a         b
    Chen   1.838620  1.023421  0.641420
    Bei    0.920563 -2.037778 -0.853871
    Shang -0.587332  0.576442  0.596269
    Sheng  0.366174 -0.689582 -1.064030

    In [23]: dframe.sort_index()
    Out[23]:
                  c         a         b
    Bei    0.920563 -2.037778 -0.853871
    Chen   1.838620  1.023421  0.641420
    Shang -0.587332  0.576442  0.596269
    Sheng  0.366174 -0.689582 -1.064030

    In [24]: dframe.sort_index(axis=1)
    Out[24]:
                  a         b         c
    Chen   1.023421  0.641420  1.838620
    Bei   -2.037778 -0.853871  0.920563
    Shang  0.576442  0.596269 -0.587332
    Sheng -0.689582 -1.064030  0.366174
    Sorting a DataFrame by index; axis specifies whether to sort by the row index (0, the default) or by the columns (1)

      2) Sorting by value: the sort_values method [note: the older order method is deprecated and was removed in later pandas releases]

    In [32]: seri =pd.Series([4,7,np.nan,-1,2,np.nan])

    In [33]: seri
    Out[33]:
    0     4
    1     7
    2   NaN
    3    -1
    4     2
    5   NaN
    dtype: float64

    In [34]: seri.sort_values()
    Out[34]:
    3    -1
    4     2
    0     4
    1     7
    2   NaN
    5   NaN
    dtype: float64

    # NaN values are placed at the end by default
    Sorting a Series by value
    In [38]: dframe = pd.DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})

    In [39]: dframe
    Out[39]:
       a  b
    0  0  4
    1  1  7
    2  0 -3
    3  1  2

    In [54]: dframe.sort_values('a')
    Out[54]:
       a  b
    0  0  4
    2  0 -3
    1  1  7
    3  1  2

    In [55]: dframe.sort_values('b')
    Out[55]:
       a  b
    2  0 -3
    3  1  2
    0  0  4
    1  1  7

    In [57]: dframe.sort_values(['a','b'])
    Out[57]:
       a  b
    2  0 -3
    0  0  4
    3  1  2
    1  1  7

    In [58]: dframe.sort_values(['b','a'])
    Out[58]:
       a  b
    2  0 -3
    3  1  2
    0  0  4
    1  1  7
    Sorting a DataFrame by value
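
Sort direction can also be set per key. A minimal sketch on the same frame, passing a list to `ascending` (the names `mixed` and `reversed_rows` are illustrative):

```python
import pandas as pd

dframe = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by 'a' ascending, then by 'b' descending within each value of 'a'.
mixed = dframe.sort_values(['a', 'b'], ascending=[True, False])
print(mixed)

# sort_index accepts ascending=False as well.
reversed_rows = dframe.sort_index(ascending=False)
print(reversed_rows.index.tolist())
```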

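    VII: Ranking

      The rank method assigns each value its rank in sorted order, starting at 1; by default, tied values receive the average of their ranks.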
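
The original post gives no example for this section, so here is a minimal sketch of `rank` with three tie-breaking choices. The data and variable names are made up for illustration.

```python
import pandas as pd

seri = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Default method='average': tied values share the mean of their ranks.
average_rank = seri.rank()

# method='first': ties are broken by order of appearance.
first_rank = seri.rank(method='first')

# Rank from largest to smallest; ties take the lowest rank of their group.
descending_rank = seri.rank(ascending=False, method='min')

print(average_rank.tolist())     # [6.5, 1.0, 6.5, 4.5, 3.0, 2.0, 4.5]
print(first_rank.tolist())       # [6.0, 1.0, 7.0, 4.0, 3.0, 2.0, 5.0]
print(descending_rank.tolist())  # [1.0, 7.0, 1.0, 3.0, 5.0, 6.0, 3.0]
```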

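    VIII: Descriptive statistics

      count: number of non-NaN values  describe: summary statistics for a Series or for each DataFrame column  min, max  argmin, argmax: integer positions of the minimum/maximum  idxmin, idxmax: index labels of the minimum/maximum

      sum  mean: arithmetic mean  var: sample variance  std: sample standard deviation  kurt: kurtosis  cumsum: cumulative sum  cummin/cummax: cumulative minimum/maximum  pct_change: percent change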

    In [63]: df = pd.DataFrame([[1.4,np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,-1.3]],index=['a','b','c','d'],columns=['one','two'])

    In [64]: df
    Out[64]:
        one  two
    a  1.40  NaN
    b  7.10 -4.5
    c   NaN  NaN
    d  0.75 -1.3

    In [66]: df.sum()
    Out[66]:
    one    9.25
    two   -5.80
    dtype: float64

    # Row c is all NaN; pandas 0.22 and later return 0.0 for it unless min_count=1 is passed
    In [67]: df.sum(axis=1)
    Out[67]:
    a    1.40
    b    2.60
    c     NaN
    d   -0.55
    dtype: float64

    # Mean; skipna controls whether NaN values are skipped (with skipna=False any NaN makes the result NaN)
    In [68]: df.mean(axis=1,skipna=False)
    Out[68]:
    a      NaN
    b    1.300
    c      NaN
    d   -0.275
    dtype: float64


    In [70]: df.idxmax()
    Out[70]:
    one    b
    two    d
    dtype: object

    In [71]: df.cumsum()
    Out[71]:
        one  two
    a  1.40  NaN
    b  8.50 -4.5
    c   NaN  NaN
    d  9.25 -5.8

    In [72]: df.describe()
    Out[72]:
                one       two
    count  3.000000  2.000000
    mean   3.083333 -2.900000
    std    3.493685  2.262742
    min    0.750000 -4.500000
    25%    1.075000 -3.700000
    50%    1.400000 -2.900000
    75%    4.250000 -2.100000
    max    7.100000 -1.300000
    Some descriptive statistics
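
A few of the listed methods that the session above does not exercise can be sketched on the same frame. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# count ignores NaN; idxmin returns the index label of each column's minimum.
counts = df.count()
smallest = df.idxmin()

# skipna=True (the default) averages only the values that are present.
row_means = df.mean(axis=1)

# cummax carries the running maximum down a column, leaving NaN in place.
running_max = df['one'].cummax()

print(counts.tolist())
print(smallest.tolist())
print(row_means)
print(running_max)
```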

    IX: Unique values, value counts, and membership

      unique method  value_counts: also available as the top-level function pd.value_counts  isin method

    In [74]: seri = pd.Series(['c','a','d','a','a','b','b','c','c'])

    In [75]: seri
    Out[75]:
    0    c
    1    a
    2    d
    3    a
    4    a
    5    b
    6    b
    7    c
    8    c
    dtype: object

    In [76]: seri.unique()
    Out[76]: array(['c', 'a', 'd', 'b'], dtype=object)

    In [77]: seri.value_counts()
    Out[77]:
    c    3
    a    3
    b    2
    d    1
    dtype: int64

    In [78]: pd.value_counts(seri.values,sort=False)
    Out[78]:
    a    3
    c    3
    b    2
    d    1
    dtype: int64


    In [81]: seri.isin(['b','c'])
    Out[81]:
    0     True
    1    False
    2    False
    3    False
    4    False
    5     True
    6     True
    7     True
    8     True
    dtype: bool
    Unique values, value counts, and membership
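
The boolean mask returned by `isin` is typically used to filter, which the short sketch below shows on the same Series. `subset` and `shares` are illustrative names; `normalize=True` is a standard `value_counts` option that returns proportions instead of counts.

```python
import pandas as pd

seri = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

# isin returns a boolean mask, which can select the matching elements.
subset = seri[seri.isin(['b', 'c'])]
print(subset.tolist())   # ['c', 'b', 'b', 'c', 'c']

# value_counts as a Series method; normalize=True gives proportions.
shares = seri.value_counts(normalize=True)
print(shares['d'])
```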

    X: Handling missing data

      A) Dropping NaN: the dropna method

        1) Series

          pandas treats Python's None as a missing value, just like NumPy's NaN.

    In [3]: seri = pd.Series(['aaa','bbb',np.nan,'ccc'])

    In [4]: seri[0]=None

    In [5]: seri
    Out[5]:
    0    None
    1     bbb
    2     NaN
    3     ccc
    dtype: object

    In [7]: seri.isnull()
    Out[7]:
    0     True
    1    False
    2     True
    3    False
    dtype: bool

    In [8]: seri.dropna()   # returns only the non-missing values
    Out[8]:
    1    bbb
    3    ccc
    dtype: object

    In [9]: seri
    Out[9]:
    0    None
    1     bbb
    2     NaN
    3     ccc
    dtype: object

    In [10]: seri[seri.notnull()]      # same result via a boolean mask of non-missing values
    Out[10]:
    1    bbb
    3    ccc
    dtype: object
    Missing data in a Series

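        2) DataFrame

          For a DataFrame things are slightly more involved: you may want to drop the rows or columns that are entirely NaN, or those that contain any NaN at all.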

    In [15]: df = pd.DataFrame([[1,6.5,3],[1,np.nan,np.nan],[np.nan,np.nan,np.nan],[np.nan,6.5,3]])

    In [16]: df
    Out[16]:
        0    1   2
    0   1  6.5   3
    1   1  NaN NaN
    2 NaN  NaN NaN
    3 NaN  6.5   3

    In [17]: df.dropna()   # works on rows by default (axis=0) and drops every row that contains a NaN
    Out[17]:
       0    1  2
    0  1  6.5  3

    In [19]: df.dropna(how='all') # drop only the rows that are entirely NaN
    Out[19]:
        0    1   2
    0   1  6.5   3
    1   1  NaN NaN
    3 NaN  6.5   3

    In [21]: df.dropna(axis=1,how='all')  # drop the columns that are entirely NaN (there are none here)
    Out[21]:
        0    1   2
    0   1  6.5   3
    1   1  NaN NaN
    2 NaN  NaN NaN
    3 NaN  6.5   3

    In [22]: df.dropna(axis=1)    # every column contains a NaN, so all of them are dropped
    Out[22]:
    Empty DataFrame
    Columns: []
    Index: [0, 1, 2, 3]
    Missing data in a DataFrame
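
Between "any NaN" and "all NaN" there are two finer controls, sketched here on the same frame: `thresh` keeps rows with at least that many non-NaN values, and `subset` restricts the check to chosen columns. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 6.5, 3], [1, np.nan, np.nan],
                   [np.nan, np.nan, np.nan], [np.nan, 6.5, 3]])

# Keep only the rows that hold at least two non-NaN values.
at_least_two = df.dropna(thresh=2)
print(at_least_two.index.tolist())   # [0, 3]

# Drop a row only when column 0 is NaN, whatever the other columns hold.
by_first_column = df.dropna(subset=[0])
print(by_first_column.index.tolist())   # [0, 1]
```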

      

      B) Filling NaN: the fillna method

    In [88]: df
    Out[88]:
        one  two
    a  1.40  NaN
    b  7.10 -4.5
    c   NaN  NaN
    d  0.75 -1.3

    In [90]: df.fillna(0)
    Out[90]:
        one  two
    a  1.40  0.0
    b  7.10 -4.5
    c  0.00  0.0
    d  0.75 -1.3
    Filling NaN
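
Besides a single scalar, `fillna` accepts a dict with one fill value per column, and `ffill` carries the last valid value forward. A minimal sketch on the same frame (the fill values 0.0 and -1.0 are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# A dict gives each column its own fill value.
per_column = df.fillna({'one': 0.0, 'two': -1.0})
print(per_column)

# ffill propagates the last valid value downward; a leading NaN stays NaN.
carried = df.ffill()
print(carried)
```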

    XI: Hierarchical indexing

    In [30]: seri = pd.Series(np.random.randn(10),index=[['a','a','a','b','b','b','c','c','d','d'],[1,2,3,1,2,3,1,2,2,3]])

    In [31]: seri
    Out[31]:
    a  1    0.528387
       2   -0.152286
       3   -0.776540
    b  1    0.025425
       2   -1.412776
       3    0.969498
    c  1    0.478260
       2    0.116301
    d  2    1.464144
       3    2.266069
    dtype: float64

    In [32]: seri['a']
    Out[32]:
    1    0.528387
    2   -0.152286
    3   -0.776540
    dtype: float64

    In [33]: seri.index
    Out[33]:
    MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
               labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

    In [35]: seri['a':'c']
    Out[35]:
    a  1    0.528387
       2   -0.152286
       3   -0.776540
    b  1    0.025425
       2   -1.412776
       3    0.969498
    c  1    0.478260
       2    0.116301
    dtype: float64

    In [45]: seri.unstack()
    Out[45]:
              1         2         3
    a  0.528387 -0.152286 -0.776540
    b  0.025425 -1.412776  0.969498
    c  0.478260  0.116301       NaN
    d       NaN  1.464144  2.266069

    In [46]: seri.unstack().stack()
    Out[46]:
    a  1    0.528387
       2   -0.152286
       3   -0.776540
    b  1    0.025425
       2   -1.412776
       3    0.969498
    c  1    0.478260
       2    0.116301
    d  2    1.464144
       3    2.266069
    dtype: float64
    Hierarchical indexing on a Series; the unstack method converts it into a DataFrame
    In [48]: df = pd.DataFrame(np.arange(12).reshape((4,3)),index=[['a','a','b','b'],[1,2,1,2]],columns=[['Ohio','Ohio','Colorado'],['Green','Red','Green']])

    In [49]: df
    Out[49]:
         Ohio     Colorado
        Green Red    Green
    a 1     0   1        2
      2     3   4        5
    b 1     6   7        8
      2     9  10       11

    In [50]: df.index
    Out[50]:
    MultiIndex(levels=[[u'a', u'b'], [1, 2]],
               labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

    In [51]: df.columns
    Out[51]:
    MultiIndex(levels=[[u'Colorado', u'Ohio'], [u'Green', u'Red']],
               labels=[[1, 1, 0], [0, 1, 0]])

    In [53]: df['Ohio']
    Out[53]:
         Green  Red
    a 1      0    1
      2      3    4
    b 1      6    7
      2      9   10

    # The original post used df.ix[...] below; ix was removed in pandas 1.0, so loc is shown instead
    In [57]: df.loc['a','Ohio']
    Out[57]:
       Green  Red
    1      0    1
    2      3    4

    In [61]: df.loc['a','Ohio'].loc[1,'Red']
    Out[61]: 1
    Hierarchical indexing on a DataFrame
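
A few further operations on a hierarchically indexed frame can be sketched as follows. The frame mirrors `df` above; the level names `key1` and `key2` are assumptions added for this example and do not appear in the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
df.index.names = ['key1', 'key2']   # hypothetical level names for this example

# One scalar: the full row key and the full column key, each as a tuple.
value = df.loc[('a', 1), ('Ohio', 'Red')]

# Cross-section: every row whose inner level equals 2.
inner = df.xs(2, level='key2')

# Aggregate over the outer row level.
totals = df.groupby(level='key1').sum()

print(value)
print(inner)
print(totals)
```

Naming the levels is optional, but it lets `xs` and `groupby` refer to a level by name rather than by position.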

     

  • Original post: https://www.cnblogs.com/pengsixiong/p/5041514.html