  • Pandas Beginner Study Notes 1

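    pandas is built on top of NumPy and provides higher-level data structures
    and manipulation tools, which makes it an indispensable tool.

    To keep the examples short, the following imports are assumed throughout: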

    import pandas as pd
    from pandas import Series,DataFrame
    

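    1 pandas data structures

    The two most important data structures are Series and DataFrame. Between
    them they cover the vast majority of use cases.

    1.1 Series

    A Series is a one-dimensional array-like object made up of a set of values
    and an associated set of index labels. It is normally displayed with the
    index on the left and the values on the right.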

    In [50]: obj = Series([4,7,-5,3])
    
    In [51]: obj
    Out[51]:
    0    4
    1    7
    2   -5
    3    3
    dtype: int64
    
    In [52]: obj.values
    Out[52]: array([ 4,  7, -5,  3], dtype=int64)
    
    In [53]: obj.index
    Out[53]: RangeIndex(start=0, stop=4, step=1)
    

    You can supply a custom index:

    In [54]: obj2 = Series([4,7,-5,3],index=['d','b','a','c'])
    
    In [55]: obj2
    Out[55]:
    d    4
    b    7
    a   -5
    c    3
    dtype: int64
    
    In [56]: obj2.index
    Out[56]: Index(['d', 'b', 'a', 'c'], dtype='object')
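
    As a minimal sketch of the two parts of a Series described above, the
    snippet below builds one with explicit labels and pulls the labels and
    values apart. The variable names are illustrative only, and to_numpy()
    is used as the current equivalent of the values attribute shown earlier.

```python
import pandas as pd

# Build a Series with explicit labels (same data as obj2 above).
obj = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

labels = list(obj.index)    # index labels, in the order they were given
values = obj.to_numpy()     # underlying NumPy array of values

print(labels)
print(values)
```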
    

    A Series behaves like an ordered dictionary: values can be looked up and
    assigned by label.

    In [5]: obj2['d']
    Out[5]: 4
    
    In [6]: obj2['d'] = 100
    
    In [7]: obj2
    Out[7]:
    d    100
    b      7
    a     -5
    c      3
    dtype: int64
    
    In [8]: obj2[['d','a']]
    Out[8]:
    d    100
    a     -5
    dtype: int64
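
    The dictionary-like access above can be condensed into one short script.
    This is only a sketch. The call to Series.get is an addition not used in
    the original notes; it mirrors dict.get for labels that may be absent.

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

first = s["d"]                  # single label -> scalar
s["d"] = 100                    # assignment by label changes the Series
subset = s[["d", "a"]]          # list of labels -> new Series
missing = s.get("z", None)      # like dict.get: no KeyError for absent labels
```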
    

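    Arithmetic and NumPy functions can be applied to the values of a Series
    as well. The index labels are preserved in the result.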

    In [9]: obj2 * 2
    Out[9]:
    d    200
    b     14
    a    -10
    c      6
    dtype: int64
    
    In [10]: import numpy as np
    
    In [11]: np.exp(obj2)
    Out[11]:
    d    2.688117e+43
    b    1.096633e+03
    a    6.737947e-03
    c    2.008554e+01
    dtype: float64
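
    A small sketch of the same idea, assuming the modified obj2 values from
    above. The boolean filter (s[s > 0]) is an extra step not shown in the
    original notes; it is included because it relies on the same label-preserving
    behavior.

```python
import numpy as np
import pandas as pd

s = pd.Series([100, 7, -5, 3], index=["d", "b", "a", "c"])

doubled = s * 2              # element-wise; labels are kept
positive = s[s > 0]          # boolean mask keeps only the matching entries
roots = np.sqrt(positive)    # NumPy ufuncs return a Series with the same index
```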
    

    As with a dictionary, the in operator tests the keys (the index labels),
    not the values:

    In [12]: 'a' in obj2
    Out[12]: True
    
    In [13]: 6 in obj2
    Out[13]: False
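
    Because in looks at labels, a value that is present can still test False.
    The sketch below contrasts the two checks. Testing against to_numpy() is
    one possible way to check the values and is an assumption of this example,
    not something the original notes prescribe.

```python
import pandas as pd

s = pd.Series([100, 7, -5, 3], index=["d", "b", "a", "c"])

label_present = "a" in s             # "a" is an index label
value_as_label = 7 in s              # 7 is a value, not a label
value_present = 7 in s.to_numpy()    # test the values explicitly
```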
    

    A Series can be created directly from a dictionary. The dictionary keys
    become the index of the Series.

    In [14]: sdata = {'ohio':35000, 'Text':72134,'Uthch':5000}
    
    In [15]: obj3 = Series(sdata)
    
    In [16]: obj3
    Out[16]:
    Text     72134
    Uthch     5000
    ohio     35000
    dtype: int64
    
    # Index labels with no matching key in sdata get NaN (Not a Number).
    In [17]: obj4 = Series(sdata,index=['ohio','new_index'])
    
    In [18]: obj4
    Out[18]:
    ohio         35000.0
    new_index        NaN
    dtype: float64
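
    A short sketch of the same construction, using the sdata dictionary from
    above. It shows that only the requested labels are kept and that the
    presence of NaN forces a float dtype.

```python
import pandas as pd

sdata = {"ohio": 35000, "Text": 72134, "Uthch": 5000}

# Only labels listed in `index` are kept; labels without a matching key get NaN.
picked = pd.Series(sdata, index=["ohio", "new_index"])

print(picked.dtype)    # float64, because NaN is a float
```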
    
    

    pandas provides isnull and notnull, both as functions and as Series
    methods, to detect missing data.

    In [19]: obj3.isnull()
    Out[19]:
    Text     False
    Uthch    False
    ohio     False
    dtype: bool
    
    # Without parentheses the method is not called; only the bound method is shown.
    In [20]: obj4.notnull
    Out[20]:
    <bound method NDFrame.notnull of ohio         35000.0
    new_index        NaN
    dtype: float64>
    
    In [21]: obj4.notnull()
    Out[21]:
    ohio          True
    new_index    False
    dtype: bool
    In [22]: pd.isnull(obj4)
    Out[22]:
    ohio         False
    new_index     True
    dtype: bool
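
    In practice the boolean masks are usually used to filter or fill. The
    sketch below assumes a Series shaped like obj4. dropna and fillna are
    standard Series methods that the original notes do not cover.

```python
import numpy as np
import pandas as pd

s = pd.Series([35000.0, np.nan], index=["ohio", "new_index"])

mask = s.isnull()           # same result as pd.isnull(s)
present = s[s.notnull()]    # keep only the non-missing entries
dropped = s.dropna()        # shortcut for the line above
filled = s.fillna(0.0)      # or replace missing entries with a default
```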
    

    In arithmetic operations, Series are aligned on their index labels
    automatically before the computation is carried out.

    In [24]: obj3 + obj4
    Out[24]:
    Text             NaN  # label only in obj3
    Uthch            NaN  # label only in obj3
    new_index        NaN  # label only in obj4
    ohio         70000.0
    dtype: float64
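
    If NaN for one-sided labels is not wanted, Series.add accepts a fill_value.
    The sketch below uses a simplified second Series (only the ohio entry),
    which is an assumption made to keep the example small.

```python
import pandas as pd

a = pd.Series({"ohio": 35000, "Text": 72134, "Uthch": 5000})
b = pd.Series({"ohio": 35000.0})     # simplified stand-in for obj4

plain = a + b                        # labels present on only one side -> NaN
filled = a.add(b, fill_value=0)      # treat a label missing on one side as 0
```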
    
    

    A Series also has a name attribute, comparable to a table name in a
    database. Its index has a name attribute too, which labels the index itself.

    In [27]: obj4.name
    
    In [28]: obj4.name = "obj444"
    
    In [29]: obj4
    Out[29]:
    ohio         35000.0
    new_index        NaN
    Name: obj444, dtype: float64
    
    In [30]: obj4.index.name = 'state'
    
    In [31]: obj4
    Out[31]:
    state
    ohio         35000.0
    new_index        NaN
    Name: obj444, dtype: float64
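
    The two names become useful when a Series is turned into a table. The
    following is a sketch with made-up names ("population", "state"). to_frame
    and reset_index are standard Series methods not used in the original notes.

```python
import pandas as pd

s = pd.Series([35000.0, 5000.0], index=["ohio", "Uthch"], name="population")
s.index.name = "state"

frame = s.to_frame()      # the Series name becomes the column label
flat = s.reset_index()    # the index name becomes a regular column
```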
    

    The index of a Series can be replaced in place by assignment:

    In [39]: obj1 = Series([4,7,-5,3])
    
    In [40]: obj1.index = ['Bob',"Steve","Jeff",'Ryan']
    
    In [41]: obj1
    Out[41]:
    Bob      4
    Steve    7
    Jeff    -5
    Ryan     3
    dtype: int64
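
    One detail worth knowing is that the new index must have the same length
    as the Series. The sketch below shows the error raised on a mismatch; the
    three-label list is deliberately wrong.

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3])
s.index = ["Bob", "Steve", "Jeff", "Ryan"]     # same length: accepted

try:
    s.index = ["only", "three", "labels"]      # wrong length: rejected
    error = None
except ValueError as exc:
    error = exc

print(error)
```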
    

    1.2 DataFrame

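    A DataFrame is a tabular data structure containing an ordered collection
    of columns. A typical way to create one is from a dict of equal-length lists: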

    In [43]: data = {'state':['Ohio',"Ohio",'Ohio',"Nevada",'Nevada'],
        ...:          'year':[2000,2001,2002,2001,2002],
        ...:          'pop':[1.5,1.7,3.6,2.4,2.9]}
    
    In [44]: frame = DataFrame(data)
    
    In [45]: frame  # an index is added automatically; columns are sorted by name here (newer pandas keeps the dict's order)
    Out[45]:
       pop   state  year
    0  1.5    Ohio  2000
    1  1.7    Ohio  2001
    2  3.6    Ohio  2002
    3  2.4  Nevada  2001
    4  2.9  Nevada  2002
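
    A quick way to inspect what was built is to look at shape and dtypes.
    This is a sketch using the same data dictionary; the checks avoid relying
    on a particular column order, since that depends on the pandas version.

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9],
}
frame = pd.DataFrame(data)

print(frame.shape)     # (number of rows, number of columns)
print(frame.dtypes)    # one dtype per column
```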
    

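    The column order can be specified at creation time. A column name that is
    not in the data (debt below) is created and filled with NaN: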

    In [46]: DataFrame(data, columns=['year','state','pop'])
    Out[46]:
       year   state  pop
    0  2000    Ohio  1.5
    1  2001    Ohio  1.7
    2  2002    Ohio  3.6
    3  2001  Nevada  2.4
    4  2002  Nevada  2.9
    In [47]: frame2 = DataFrame(data,columns=['year','state','pop','debt'],
        ...: index=['one','two','three','four','five'])
    
    In [48]: frame2
    Out[48]:
           year   state  pop debt
    one    2000    Ohio  1.5  NaN
    two    2001    Ohio  1.7  NaN
    three  2002    Ohio  3.6  NaN
    four   2001  Nevada  2.4  NaN
    five   2002  Nevada  2.9  NaN
    In [49]: frame2.columns
    Out[49]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
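    # A column can be retrieved with dict-like notation; a single column is a Series.
    # Note that the returned Series has its name set to 'state' and shares the DataFrame's index.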
    In [50]: frame2['state']
    Out[50]:
    one        Ohio
    two        Ohio
    three      Ohio
    four     Nevada
    five     Nevada
    Name: state, dtype: object
    
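    # A row can be retrieved by label with loc (the original notes used ix,
    # which has since been removed from pandas). The result is a Series, as for a column.
    In [52]: frame2.loc['two']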
    Out[52]:
    year     2001
    state    Ohio
    pop       1.7
    debt      NaN
    Name: two, dtype: object
    
    # Values can be assigned to the debt column
    In [53]: frame2['debt'] = 16.5  # a scalar is broadcast to every row
    
    In [54]: frame2
    Out[54]:
           year   state  pop  debt
    one    2000    Ohio  1.5  16.5
    two    2001    Ohio  1.7  16.5
    three  2002    Ohio  3.6  16.5
    four   2001  Nevada  2.4  16.5
    five   2002  Nevada  2.9  16.5
    
    In [55]: frame2['debt'] = np.arange(5.)  # an array must have the same length as the DataFrame
    
    In [56]: frame2
    Out[56]:
           year   state  pop  debt
    one    2000    Ohio  1.5   0.0
    two    2001    Ohio  1.7   1.0
    three  2002    Ohio  3.6   2.0
    four   2001  Nevada  2.4   3.0
    five   2002  Nevada  2.9   4.0
    
    # A Series can be assigned as well; it is aligned on the index and missing labels become NaN
    In [59]: frame2['debt'] = Series([-1,-2,-4],index=['one','three','five'])
    
    In [60]: frame2
    Out[60]:
           year   state  pop  debt
    one    2000    Ohio  1.5  -1.0
    two    2001    Ohio  1.7   NaN
    three  2002    Ohio  3.6  -2.0
    four   2001  Nevada  2.4   NaN
    five   2002  Nevada  2.9  -4.0
    
    # Assigning to a column that does not exist creates a new column
    In [61]: frame2['eastern'] = frame2.state == 'Ohio'
    
    In [62]: frame2
    Out[62]:
           year   state  pop  debt eastern
    one    2000    Ohio  1.5  -1.0    True
    two    2001    Ohio  1.7   NaN    True
    three  2002    Ohio  3.6  -2.0    True
    four   2001  Nevada  2.4   NaN   False
    five   2002  Nevada  2.9  -4.0   False
    
    # del removes a column
    
    In [63]: del frame2['eastern']
    
    In [64]: frame2.columns
    Out[64]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
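
    The column and row operations from the session above can be condensed
    into a short script. This is a sketch on a smaller, made-up frame. drop is
    shown as a non-mutating alternative to del, which is an addition to the
    original notes.

```python
import pandas as pd

frame = pd.DataFrame(
    {"year": [2000, 2001, 2002], "state": ["Ohio", "Ohio", "Nevada"]},
    index=["one", "two", "three"],
)

row = frame.loc["two"]                              # one row, as a Series
frame["debt"] = pd.Series([-1.0], index=["one"])    # aligned on the index
frame["eastern"] = frame["state"] == "Ohio"         # new column from a test
frame = frame.drop(columns=["eastern"])             # returns a new DataFrame
```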
    
    

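    Note: in the pandas version used for these notes, a column obtained by
    indexing is a view on the underlying data, so in-place changes to that
    Series are reflected in the source DataFrame. Newer pandas versions with
    Copy-on-Write enabled behave differently. To get an independent copy in
    either case, call the copy method of the Series explicitly.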
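
    A minimal sketch of the explicit copy, on a made-up one-column frame.
    Only the copy is modified, so the result does not depend on whether the
    installed pandas version returns views or uses Copy-on-Write.

```python
import pandas as pd

frame = pd.DataFrame({"pop": [1.5, 1.7, 3.6]}, index=["one", "two", "three"])

snapshot = frame["pop"].copy()    # independent copy of the column
snapshot["one"] = 0.0             # modifies only the copy

print(frame.loc["one", "pop"])    # the DataFrame still holds 1.5
```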

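    A DataFrame can also be created from a nested dict. The outer keys become
    the columns and the inner keys become the row index.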

    In [68]: pop = {'Nevada':{2001:2.4,2002:2.9},
        ...: 'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
    
    In [69]: frame3 = DataFrame(pop)
    
    In [70]: frame3
    Out[70]:
          Nevada  Ohio
    2000     NaN   1.5
    2001     2.4   1.7
    2002     2.9   3.6
    
    # The result can be transposed
    In [72]: frame3.T
    Out[72]:
            2000  2001  2002
    Nevada   NaN   2.4   2.9
    Ohio     1.5   1.7   3.6
    
    # The inner keys are unioned and sorted to form the index, unless an explicit index is passed
    In [73]: DataFrame(pop,index=[2001,2002,2003])
    Out[73]:
          Nevada  Ohio
    2001     2.4   1.7
    2002     2.9   3.6
    2003     NaN   NaN
    
    In [75]: pdata = {'Ohio':frame3['Ohio'][:-1],'Nevada':frame3['Nevada'][:2]}
    
    In [76]: DataFrame(pdata)
    Out[76]:
          Nevada  Ohio
    2000     NaN   1.5
    2001     2.4   1.7
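
    As a sketch of how the nesting maps onto the axes, the snippet below
    builds the same data both ways. DataFrame.from_dict with orient="index"
    is not used in the original notes; it is shown as an alternative to
    transposing.

```python
import pandas as pd

pop = {"Nevada": {2001: 2.4, 2002: 2.9},
       "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}}

by_column = pd.DataFrame(pop)                           # outer keys -> columns
by_row = pd.DataFrame.from_dict(pop, orient="index")    # outer keys -> rows
```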
    
    

    DataFrame accepts several other kinds of input as well, for example a
    two-dimensional ndarray:

    In [78]: DataFrame(np.arange(10).reshape(2,5))
    Out[78]:
       0  1  2  3  4
    0  0  1  2  3  4
    1  5  6  7  8  9
    In [80]: frame3
    Out[80]:
          Nevada  Ohio
    2000     NaN   1.5
    2001     2.4   1.7
    2002     2.9   3.6
    
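    # index and columns each have a name that is shown in the output; DataFrame
    # itself has no built-in name, so frame3.name only sets a plain Python attribute
    In [81]: frame3.name = 'states';frame3.index.name = 'year'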
    
    In [82]: frame3
    Out[82]:
          Nevada  Ohio
    year
    2000     NaN   1.5
    2001     2.4   1.7
    2002     2.9   3.6
    
    In [83]: frame3.columns.name = 'state'
    
    In [84]: frame3
    Out[84]:
    state  Nevada  Ohio
    year
    2000      NaN   1.5
    2001      2.4   1.7
    2002      2.9   3.6
    
    # values returns the data as a two-dimensional ndarray
    In [86]: frame3.values
    Out[86]:
    array([[ nan,  1.5],
           [ 2.4,  1.7],
           [ 2.9,  3.6]])
    
    # when the columns have different dtypes, the array's dtype is object
    In [87]: frame2.values
    Out[87]:
    array([[2000, 'Ohio', 1.5, -1.0],
           [2001, 'Ohio', 1.7, nan],
           [2002, 'Ohio', 3.6, -2.0],
           [2001, 'Nevada', 2.4, nan],
           [2002, 'Nevada', 2.9, -4.0]], dtype=object)
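
    The dtype of the resulting array can be checked directly. The sketch
    below uses two small made-up frames and to_numpy(), the current equivalent
    of the values attribute.

```python
import pandas as pd

numeric = pd.DataFrame({"Nevada": [2.4, 2.9], "Ohio": [1.7, 3.6]})
mixed = pd.DataFrame({"year": [2001, 2002], "state": ["Ohio", "Nevada"]})

print(numeric.to_numpy().dtype)    # float64: every column is float
print(mixed.to_numpy().dtype)      # object: the only type that fits both columns
```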
    
    

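    1.3 Index objects

    Index objects hold the axis labels and other metadata such as axis names.
    If no index is specified, one is generated automatically. Once created, an
    Index object cannot be modified.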

    In [88]: obj = Series(range(3),index=['a','b','c'])
    
    In [89]: obj.index
    Out[89]: Index(['a', 'b', 'c'], dtype='object')
    
    In [90]: index = obj.index
    
    In [91]: index[1:]
    Out[91]: Index(['b', 'c'], dtype='object')
    
    In [92]: index[1] = 'd'
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-92-676fdeb26a68> in <module>()
    ----> 1 index[1] = 'd'
    
    C:\Users\yangfl\Anaconda3\lib\site-packages\pandas\indexes\base.py in __setitem__(self, key, value)
       1243
       1244     def __setitem__(self, key, value):
    -> 1245         raise TypeError("Index does not support mutable operations")
       1246
       1247     def __getitem__(self, key):
    
    TypeError: Index does not support mutable operations
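
    Since an Index cannot be changed in place, the usual approach is to build
    a new one. The sketch below catches the TypeError and then derives a
    modified Index with delete and insert. This is one possible approach,
    not the only one.

```python
import pandas as pd

index = pd.Index(["a", "b", "c"])

try:
    index[1] = "d"        # in-place modification is rejected
    raised = False
except TypeError:
    raised = True

# Build a new Index instead of mutating the old one.
replaced = index.delete(1).insert(1, "d")
```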
    
    

    Because an Index is immutable, the same Index object can be shared safely
    among several data structures.

    In [94]: index = pd.Index(np.arange(3))
    
    In [95]: obj2 = Series([-0.2,-1,2],index=index)
    
    In [96]: obj2.index == index
    Out[96]: array([ True,  True,  True], dtype=bool)
    

    pandas has several kinds of Index object. An Index is array-like and also
    behaves like a fixed-size set, so membership tests work on it:

    
    In [97]: frame3
    Out[97]:
    state  Nevada  Ohio
    year
    2000      NaN   1.5
    2001      2.4   1.7
    2002      2.9   3.6
    
    In [98]: 2001 in frame3.index
    Out[98]: True
    
    In [99]: 2003 in frame3.index
    Out[99]: False
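
    The set-like behavior goes beyond membership tests. As a sketch, the
    snippet below applies the standard Index set methods to two made-up
    Index objects of years.

```python
import pandas as pd

years = pd.Index([2000, 2001, 2002])
other = pd.Index([2002, 2003])

has_2001 = 2001 in years               # membership test, like a set
common = years.intersection(other)     # labels present in both
combined = years.union(other)          # labels present in either
only_left = years.difference(other)    # labels in years but not in other
```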
    
    

    Accordingly, an Index has its own methods and attributes:

    To be continued...

  • Original article: https://www.cnblogs.com/felo/p/6358862.html