
pandas Data Structures

To use pandas, you need to be familiar with its two main data structures: Series and DataFrame.

Series

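A Series is a one-dimensional array-like object made up of a set of data (of any NumPy data type) and an associated set of data labels, called its index. A simple Series can be created from a set of data alone: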

In [11]: from pandas import Series,DataFrame

In [12]: import pandas as pd; import numpy as np

In [13]: obj=Series([4,-2,5,0])

In [14]: obj
Out[14]:
0    4
1   -2
2    5
3    0
dtype: int64

In [15]: type(obj)
Out[15]: pandas.core.series.Series

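The string representation of a Series shows the index on the left and the values on the right. Because no index was specified for the data, a default integer index from 0 to N-1 (where N is the length of the data) is created automatically. The values and index attributes return the array representation and the index object, respectively: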

In [16]: obj.values
Out[16]: array([ 4, -2,  5,  0], dtype=int64)

In [17]: obj.index
Out[17]: RangeIndex(start=0, stop=4, step=1)

Often you will want to create a Series with an index that labels each data point:

In [18]: obj2=Series([4,7,5,-3],index=['d','b','a','c'])

In [19]: obj2
Out[19]:
d    4
b    7
a    5
c   -3
dtype: int64

In [20]: obj2.index
Out[20]: Index(['d', 'b', 'a', 'c'], dtype='object')

Compared with a plain NumPy array, you can use index labels to select a single value or a set of values from a Series:

In [21]: obj2['a']
Out[21]: 5

In [22]: obj2['d']=6

In [23]: obj2[['c','a','d']]
Out[23]:
c   -3
a    5
d    6
dtype: int64

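NumPy array operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, preserve the link between index and values: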

In [26]: obj2[obj2>0]
Out[26]:
d    6
b    7
a    5
dtype: int64

In [27]: obj2*2
Out[27]:
d    12
b    14
a    10
c    -6
dtype: int64

In [28]: np.exp(obj2)
Out[28]:
d     403.428793
b    1096.633158
a     148.413159
c       0.049787
dtype: float64

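A Series can also be thought of as a fixed-length, ordered dict, since it is a mapping of index values to data values. It can be used in many contexts that would otherwise expect a dict: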

In [29]: 'b' in obj2
Out[29]: True

In [30]: 'e' in obj2
Out[30]: False
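
Because a Series maps labels to values, it can also be handed to code written for dict-like objects. A minimal sketch, re-creating obj2 by hand; the helper total is hypothetical and exists only for illustration:

```python
import pandas as pd

obj2 = pd.Series([6, 7, 5, -3], index=['d', 'b', 'a', 'c'])

def total(mapping):
    """Sum the values of any dict-like object that exposes .items()."""
    return sum(value for _, value in mapping.items())

as_dict = obj2.to_dict()           # plain Python dict with the same labels
series_total = total(obj2)         # the Series works directly
dict_total = total(as_dict)        # and so does the converted dict
missing = obj2.get('e', 0)         # dict-style lookup with a default
```

Both calls to total give the same result, and get returns the default for a label that is not in the index.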

If the data is stored in a Python dict, a Series can be created directly from it:

In [32]: sdata={'a':1,'b':2,'c':3}

In [33]: obj3=Series(sdata)

In [34]: obj3
Out[34]:
a    1
b    2
c    3
dtype: int64

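When only a dict is passed, the index of the resulting Series consists of the dict's keys (in sorted order in older pandas versions; newer versions keep the dict's insertion order). An explicit index can be passed as well: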

In [41]: states=['one','a','b']

In [42]: obj4=Series(sdata,index=states)

In [43]: obj4
Out[43]:
one    NaN
a      1.0
b      2.0
dtype: float64

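In this example, the 2 values in sdata whose keys match labels in states are found and placed in the corresponding positions. The label with no match, 'one', gets the missing value NaN.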

The isnull and notnull functions in pandas can be used to detect missing data:

In [44]: pd.isnull(obj4)
Out[44]:
one     True
a      False
b      False
dtype: bool

In [45]: pd.notnull(obj4)
Out[45]:
one    False
a       True
b       True
dtype: bool

Series also has these as instance methods:

In [46]: obj4.isnull()
Out[46]:
one     True
a      False
b      False
dtype: bool

One of the most important features of Series is that it automatically aligns differently indexed data by label in arithmetic operations:

In [47]: obj3
Out[47]:
a    1
b    2
c    3
dtype: int64

In [48]: obj4
Out[48]:
one    NaN
a      1.0
b      2.0
dtype: float64

In [49]: obj3+obj4
Out[49]:
a      2.0
b      4.0
c      NaN
one    NaN
dtype: float64
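
Alignment is by label, not by position, and a label present on only one side yields NaN. A minimal sketch on two small hypothetical Series (s1 and s2 are illustrative names); the add method with fill_value is shown as one way to treat a label missing from one side as 0 instead:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20], index=['b', 'd'])

# The + operator aligns on labels; only 'b' exists on both sides,
# so every other label in the result is NaN.
plain = s1 + s2

# add() with fill_value=0 substitutes 0 for a label missing on one side.
filled = s1.add(s2, fill_value=0)
```

The result index is the union of both indexes, so plain and filled each have four entries.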

Both the Series object itself and its index have a name attribute, which is closely tied to other key areas of pandas functionality:

In [50]: obj4.name='pop4'

In [51]: obj4.index.name='state4'

In [52]: obj4
Out[52]:
state4
one    NaN
a      1.0
b      2.0
Name: pop4, dtype: float64
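
One place the name attribute shows up: when a Series is turned into a DataFrame, its name becomes the column label and the index name labels the row index (or the column it is moved into). A small sketch that re-creates obj4 by hand:

```python
import pandas as pd

obj4 = pd.Series([float('nan'), 1.0, 2.0], index=['one', 'a', 'b'])
obj4.name = 'pop4'
obj4.index.name = 'state4'

# to_frame() uses the Series name as the single column label.
as_frame = obj4.to_frame()

# reset_index() moves the index into a column named after index.name.
flat = obj4.reset_index()
```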

A Series's index can be modified in place by assignment:

In [53]: obj
Out[53]:
0    4
1   -2
2    5
3    0
dtype: int64

In [54]: obj.index=['a','b','c','d']

In [55]: obj
Out[55]:
a    4
b   -2
c    5
d    0
dtype: int64

DataFrame

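A DataFrame is a tabular data structure. It contains an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and it can be thought of as a dict of Series that all share the same index. Compared with other similar data structures, row-oriented and column-oriented operations on a DataFrame are treated roughly symmetrically. Under the hood, the data is stored as one or more two-dimensional blocks.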
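
To make the "dict of Series sharing one index" picture concrete, here is a small sketch with two hypothetical Series (state and year are illustrative names) combined into one DataFrame:

```python
import pandas as pd

state = pd.Series([True, False], index=['x', 'y'])
year = pd.Series([2000, 2001], index=['x', 'y'])

frame = pd.DataFrame({'state': state, 'year': year})

# Every column is a Series that carries the frame's row index.
same_index = all(frame[col].index.equals(frame.index)
                 for col in frame.columns)

# Each column keeps its own value type.
dtypes = frame.dtypes.astype(str).to_dict()
```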

The most common way to construct a DataFrame is to pass a dict of equal-length lists or NumPy arrays:

In [65]: data={'state':[True,True,False,True,False],'year':[2000,2001,2002,2003,2004]}

In [66]: data
Out[66]:
{'state': [True, True, False, True, False],
 'year': [2000, 2001, 2002, 2003, 2004]}

In [67]: frame=DataFrame(data)

In [68]: frame
Out[68]:
   state  year
0   True  2000
1   True  2001
2  False  2002
3   True  2003
4  False  2004

If a sequence of columns is specified, the DataFrame's columns are arranged in that order:

In [69]: DataFrame(data,columns=['year','state'])
Out[69]:
   year  state
0  2000   True
1  2001   True
2  2002  False
3  2003   True
4  2004  False

As with Series, if a requested column is not found in the data, it appears filled with NA values.
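
A short sketch of that case, using the same data dict as above and asking for an extra column, 'debt', that the dict does not contain:

```python
import pandas as pd

data = {'state': [True, True, False, True, False],
        'year': [2000, 2001, 2002, 2003, 2004]}

# 'debt' is not a key of data, so the column is created and filled with NaN.
frame2 = pd.DataFrame(data, columns=['year', 'state', 'debt'])
all_missing = bool(frame2['debt'].isna().all())
```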

A column of a DataFrame can be retrieved as a Series either with dict-like notation or by attribute:

In [70]: frame['state']
Out[70]:
0     True
1     True
2    False
3     True
4    False
Name: state, dtype: bool

In [71]: frame['year']
Out[71]:
0    2000
1    2001
2    2002
3    2003
4    2004
Name: year, dtype: int64

In [72]: type(frame['year'])
Out[72]: pandas.core.series.Series

The returned Series has the same index as the DataFrame, and its name attribute is set accordingly.
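
The session above only uses the dict-like form. A minimal sketch of the attribute form, which works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method:

```python
import pandas as pd

frame = pd.DataFrame({'state': [True, True, False, True, False],
                      'year': [2000, 2001, 2002, 2003, 2004]})

by_key = frame['year']    # dict-like access
by_attr = frame.year      # attribute access to the same column
```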

Columns can be modified by assignment. For example, we can add a column 'debt' and assign it a scalar value or an array of values:

In [77]: frame['debt']=16.25

In [78]: frame
Out[78]:
   state  year   debt
0   True  2000  16.25
1   True  2001  16.25
2  False  2002  16.25
3   True  2003  16.25
4  False  2004  16.25

In [79]: frame['debt']=np.arange(5)

In [80]: frame
Out[80]:
   state  year  debt
0   True  2000     0
1   True  2001     1
2  False  2002     2
3   True  2003     3
4  False  2004     4

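When a list or array is assigned to a column, its length must match the length of the DataFrame. If a Series is assigned, its labels are matched exactly against the DataFrame's index, and any holes are filled with missing values. The session below starts from a frame whose index has already been relabeled 'one' through 'five':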

In [85]: frame
Out[85]:
       state  year  debt
one     True  2000     0
two     True  2001     1
three  False  2002     2
four    True  2003     3
five   False  2004     4

In [86]: val=Series([-1.2,-1.5,-1.7],index=['one','two','three'])

In [87]: frame['debt2']=val

In [88]: frame
Out[88]:
       state  year  debt  debt2
one     True  2000     0   -1.2
two     True  2001     1   -1.5
three  False  2002     2   -1.7
four    True  2003     3    NaN
five   False  2004     4    NaN
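
Both rules can be checked on a small hypothetical frame. This is a sketch: the exact error message depends on the pandas version, so only the exception type is relied on here.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002]},
                     index=['one', 'two', 'three'])

# A list must supply exactly one value per row; a shorter one is rejected.
try:
    frame['debt'] = [1, 2]
    length_error = None
except ValueError as exc:
    length_error = str(exc)

# A Series is aligned on labels instead: 'two' matches, 'ten' is ignored,
# and rows without a matching label get NaN.
frame['debt'] = pd.Series([-1.2, 9.9], index=['two', 'ten'])
```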

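Assigning to a column that does not exist creates a new column, and the del keyword deletes a column. Here a column 'state1', presumably created in a step not shown, is removed again: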

In [92]: del frame['state1']

In [93]: frame
Out[93]:
       state  year  debt  debt2
one     True  2000     0   -1.2
two     True  2001     1   -1.5
three  False  2002     2   -1.7
four    True  2003     3    NaN
five   False  2004     4    NaN
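
The full round trip, creating a column by assignment and then deleting it, can be sketched on a small hypothetical frame (the column name 'recent' is made up for illustration):

```python
import pandas as pd

frame = pd.DataFrame({'state': [True, False],
                      'year': [2000, 2001]})

# Assigning to a name that is not yet a column creates the column.
frame['recent'] = frame['year'] > 2000
columns_with_new = list(frame.columns)

# del removes it again, in place.
del frame['recent']
columns_after_del = list(frame.columns)
```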

Another common form of data is a nested dict:

In [94]: pop={'year':{2001:1.5,2002:1.6,2007:2},'prices':{2001:2.5,2002:3}}

If it is passed to DataFrame, pandas interprets the outer dict keys as the columns and the inner keys as the row index:

In [95]: frame3=DataFrame(pop)

In [96]: frame3
Out[96]:
      year  prices
2001   1.5     2.5
2002   1.6     3.0
2007   2.0     NaN

The result can be transposed:

In [97]: frame3.T
Out[97]:
        2001  2002  2007
year     1.5   1.6   2.0
prices   2.5   3.0   NaN

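The keys of the inner dicts are combined and sorted to form the final index. The index can also be overwritten explicitly by assignment. Note that this relabels the existing rows in place (the row formerly labeled 2007 becomes 2003) rather than selecting rows by label: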

In [109]: frame3.index=[2001,2002,2003]

In [111]: frame3
Out[111]:
      year  prices
2001   1.5     2.5
2002   1.6     3.0
2003   2.0     NaN
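
For contrast, a sketch of the two ways an explicit index can come into play with the nested dict pop. Passing index= to the constructor selects rows by label, so 2003 (absent from the data) is all NaN and 2007 is left out, while assigning to .index only renames the rows that already exist:

```python
import pandas as pd

pop = {'year': {2001: 1.5, 2002: 1.6, 2007: 2},
       'prices': {2001: 2.5, 2002: 3}}

# Selection by label at construction time.
selected = pd.DataFrame(pop, index=[2001, 2002, 2003])

# Relabeling after construction: values stay where they are.
relabeled = pd.DataFrame(pop)
relabeled.index = [2001, 2002, 2003]
```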

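Types of data that can be passed to the DataFrame constructor:

1. 2D ndarray

2. dict of arrays, lists, or tuples

3. NumPy structured/record array

4. dict of Series

5. dict of dicts

6. list of dicts or Series

7. list of lists or tuples

8. another DataFrame

9. NumPy MaskedArray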
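
A few of these input types, sketched with small made-up data (all variable, column, and row names here are illustrative):

```python
import numpy as np
import pandas as pd

# 2D ndarray: rows and columns come from the array shape.
from_array = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['a', 'b', 'c'])

# List of dicts: each dict is a row; keys missing from a row become NaN.
from_records = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'c': 4}])

# List of tuples: each tuple is a row.
from_rows = pd.DataFrame([(1, 'x'), (2, 'y')], columns=['num', 'letter'])

# Dict of Series: the Series indexes are unioned to form the row index.
from_series = pd.DataFrame({'a': pd.Series([1, 2], index=['r1', 'r2']),
                            'b': pd.Series([3], index=['r1'])})
```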

If the index and columns of a DataFrame have their name attributes set, these are displayed as well:

In [113]: frame3.index.name='year';frame3.columns.name='state'

In [114]: frame3
Out[114]:
state  year  prices
year
2001    1.5     2.5
2002    1.6     3.0
2003    2.0     NaN

Index objects

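pandas's Index objects are responsible for holding the axis labels and other metadata (such as the axis name). Any array or other sequence of labels used when constructing a Series or DataFrame is converted to an Index: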

In [116]: obj=Series(range(3),index=['b','a','c'])

In [117]: index=obj.index

In [118]: index
Out[118]: Index(['b', 'a', 'c'], dtype='object')

In [121]: index[1:]
Out[121]: Index(['a', 'c'], dtype='object')

Index objects are immutable, so they cannot be modified by the user:

In [122]: index[1]='f'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-122-c2c86828e313> in <module>()
----> 1 index[1]='f'

d:\python\python36\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   2048
   2049     def __setitem__(self, key, value):
-> 2050         raise TypeError("Index does not support mutable operations")
   2051
   2052     def __getitem__(self, key):

TypeError: Index does not support mutable operations

Immutability matters because it allows Index objects to be shared safely among multiple data structures:

In [123]: index=pd.Index(np.arange(3))

In [127]: obj2=Series([-1.5,2.6,0],index=index)

In [129]: obj2
Out[129]:
0   -1.5
1    2.6
2    0.0
dtype: float64

In [130]: obj2.index is index
Out[130]: True

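The main Index classes in pandas:

Index: the most general Index object, representing axis labels as a NumPy array of Python objects

Int64Index: a specialized Index for integer values (removed in pandas 2.0, where a plain Index holds integer dtypes directly)

MultiIndex: a hierarchical index object representing multiple levels of indexing on a single axis; can be thought of as an array of tuples

DatetimeIndex: stores nanosecond-resolution timestamps

PeriodIndex: a specialized Index for Period data (time spans)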
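
A brief sketch constructing one of each kind from small made-up labels. Integer labels are passed to pd.Index here, since which class comes back for them depends on the pandas version:

```python
import pandas as pd

ints = pd.Index([1, 2, 3])                                       # integer labels
multi = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)])  # two levels
dates = pd.DatetimeIndex(['2018-11-18', '2018-11-19'])           # timestamps
periods = pd.PeriodIndex(['2018-11', '2018-12'], freq='M')       # monthly spans
```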

Besides being array-like, an Index also behaves like a fixed-size set:

In [131]: frame3
Out[131]:
state  year  prices
year
2001    1.5     2.5
2002    1.6     3.0
2003    2.0     NaN

In [132]: 'year' in frame3.columns
Out[132]: True

In [133]: '2001'  in frame3.index
Out[133]: False

In [134]: 2001  in frame3.index
Out[134]: True

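Index methods and properties:

append: concatenate with another Index object, producing a new Index

difference: compute the set difference as a new Index (named diff in early pandas versions)

intersection: compute the set intersection

union: compute the set union

isin: compute a boolean array indicating whether each value is contained in the passed collection

delete: delete the element at position i, producing a new Index

drop: delete the passed values, producing a new Index

insert: insert an element at position i, producing a new Index

is_monotonic: returns True if each element is greater than or equal to the previous element (spelled is_monotonic_increasing in current pandas)

is_unique: returns True if the Index has no duplicate values

unique: compute the unique values in the Index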
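
The methods listed above, sketched on two small hypothetical indexes (left and right are illustrative names). The set difference is computed with difference here, and monotonicity is checked with is_monotonic_increasing, which are the spellings available in current pandas:

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

appended = left.append(right)          # concatenation, duplicates kept
only_left = left.difference(right)     # labels in left but not in right
common = left.intersection(right)      # labels in both
combined = left.union(right)           # labels in either
mask = left.isin(['a', 'c'])           # boolean array, one entry per label
without_first = left.delete(0)         # remove by position
dropped = left.drop(['b'])             # remove by value
inserted = left.insert(1, 'z')         # insert by position

# None of the calls above modify left; each returns a new object.
uniques = appended.unique()
```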


Original article: https://www.cnblogs.com/catxjd/p/9977852.html