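pandas is a powerful data-processing module built on NumPy. It has two core data structures: Series and DataFrame.
I: Series
A Series is a one-dimensional labeled array, much like a single table column: every entry has an index label and a corresponding value.
1) Creating a Series
The Series constructor takes the data as its first argument and an optional index as its second. There are two common ways to create one:
One passes lists: the first supplies the values and the second the index [if index is omitted, it defaults to integers counting up from 0].
The other passes a dict as the first argument. Without a second argument, the dict keys become the index and the dict values become the values. With a second argument, its elements are used as the index [dict keys that do not appear in index are dropped, and index labels that are not dict keys get NaN].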
import pandas as pd
obj_1 = pd.Series([1,2,3,4]) # with no index given, it defaults to integers counting up from 0

obj_1
--->0    1
1    2
2    3
3    4
dtype: int64

obj_2 = pd.Series([1,2,3,4], index=['a','b','c','d']) # explicit index

obj_2
--->a    1
b    2
c    3
d    4
dtype: int64
sdata = {'Ohio':3500,'Texas':7100,'Oregon':1600,'Utah':500}

obj_3 = pd.Series(sdata)

obj_3 # recorded on an older pandas, which sorted dict keys; recent versions keep insertion order
--->Ohio      3500
Oregon    1600
Texas     7100
Utah       500
dtype: int64

states = ['California','Ohio','Texas']

obj_4 = pd.Series(sdata,index=states)

obj_4
--->California     NaN
Ohio          3500
Texas         7100
dtype: float64
# 'California' is not a key in sdata, so it gets NaN; Oregon and Utah are not in states, so they are dropped
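The two construction styles above can be sketched side by side as follows. This is a minimal illustration, not a definitive recipe; the variable names (s_default, s_labeled, s_dict, s_aligned) are made up for this example, and the data is the same as in the listing above.

```python
import pandas as pd

# Values only: the index defaults to 0, 1, 2, ...
s_default = pd.Series([1, 2, 3, 4])

# Values plus explicit labels.
s_labeled = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# From a dict: the keys become the index.
sdata = {"Ohio": 3500, "Texas": 7100, "Oregon": 1600, "Utah": 500}
s_dict = pd.Series(sdata)

# Dict plus index: the result follows the given index.
# Labels missing from the dict become NaN, and unused dict keys are dropped.
states = ["California", "Ohio", "Texas"]
s_aligned = pd.Series(sdata, index=states)

print(s_aligned)
```

Because NaN is a floating-point value, the aligned Series is stored as float64 even though the dict held integers.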
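2) Indexing
obj_1.values # returns all the values as a NumPy array
--->array([1, 2, 3, 4], dtype=int64)
obj_1.index # returns the index
--->Int64Index([0, 1, 2, 3], dtype='int64') # recent pandas versions show RangeIndex(start=0, stop=4, step=1) instead
# changing the index
obj_4.index = ['bob','steve','jeff'] # note: the new index must have exactly as many labels as the old one, no fewer and no more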
obj_4
bob NaN
steve 3500
jeff 7100
dtype: float64
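The length rule for index reassignment can be checked with a short sketch. The Series s and the label lists here are hypothetical stand-ins, chosen only to show that a same-length list is accepted and a shorter one is rejected with a ValueError.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])

# Same number of labels: accepted, the labels are replaced in place.
s.index = ["bob", "steve", "jeff"]

# Too few labels: pandas raises ValueError and leaves the index as it was.
try:
    s.index = ["only", "two"]
except ValueError as exc:
    error_message = str(exc)
else:
    error_message = None

print(error_message)
print(list(s.index))
```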
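obj_2['a'] # look up a value by its index label
--->1
obj_2[['c','b','a']] # a list of labels selects several values at once, returned in the order given
--->c 3
b 2
a 1
dtype: int64
'b' in obj_2 # test whether a label exists in the index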
--->True
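A point worth illustrating is that the in operator tests index labels, not values. The following is a small sketch under the same data as obj_2 above; the names s, single, subset, has_b, has_2 and value_present are illustrative only.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

single = s["a"]              # one label gives back a scalar
subset = s[["c", "b", "a"]]  # a list of labels gives back a Series in that order

has_b = "b" in s             # membership is checked against the index labels
has_2 = 2 in s               # 2 is a value, not a label, so this is not a match

# To test for a value instead, compare element-wise and reduce.
value_present = bool((s == 2).any())

print(single, list(subset.index), has_b, has_2, value_present)
```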
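II: DataFrame
A DataFrame is a tabular data structure in which each column can hold a different data type. It has both a row index and a column index, and row-oriented and column-oriented operations are treated roughly symmetrically.
1) Creating a DataFrame
pd.DataFrame(data[, columns=..., index=...])
Note: if data is a dict, its keys automatically become the columns. If data is a plain list of lists, each inner list becomes one row, and both columns and index default to integers counting up from 0.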
data=[['ohio','nevada','nevada'],[2000,1000,1000],[1.5,1.7,3.6]]

frame_1 = pd.DataFrame(data)

frame_1
      0       1       2
0  ohio  nevada  nevada
1  2000    1000    1000
2   1.5     1.7     3.6

frame_2 = pd.DataFrame(data,columns=['first','second','third'])

# compare this result with the dict case: here each list defines one row, whereas with a dict each list defines one column
frame_2
  first  second   third
0  ohio  nevada  nevada
1  2000    1000    1000
2   1.5     1.7     3.6

frame_3 = pd.DataFrame(data,columns=['first','second','third'],index=['one','two','three'])

frame_3
      first  second   third
one    ohio  nevada  nevada
two    2000    1000    1000
three   1.5     1.7     3.6
data2 = {'states':['ohio','nevada','nevada'],'year':[2000,1000,1000],'pop':[1.5,1.7,3.6]}

frame_4=pd.DataFrame(data2)

frame_4 # recorded on an older pandas, which sorted the dict keys; recent versions keep insertion order
   pop  states  year
0  1.5    ohio  2000
1  1.7  nevada  1000
2  3.6  nevada  1000

frame_5=pd.DataFrame(data2,index=['one','two','three'])

frame_5
       pop  states  year
one    1.5    ohio  2000
two    1.7  nevada  1000
three  3.6  nevada  1000
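The row-versus-column difference between the two inputs can be made concrete with a sketch that builds the same table both ways. The names rows, cols, from_rows, from_cols and labeled are hypothetical; the data matches data2 above, rearranged so that each inner list is one record.

```python
import pandas as pd

# List of lists: each inner list is one ROW, so column names are passed separately.
rows = [
    ["ohio", 2000, 1.5],
    ["nevada", 1000, 1.7],
    ["nevada", 1000, 3.6],
]
from_rows = pd.DataFrame(rows, columns=["states", "year", "pop"])

# Dict of lists: each key is one COLUMN name and each list is that column's data.
cols = {
    "states": ["ohio", "nevada", "nevada"],
    "year": [2000, 1000, 1000],
    "pop": [1.5, 1.7, 3.6],
}
from_cols = pd.DataFrame(cols)

# Row labels can be supplied in either case through the index argument.
labeled = pd.DataFrame(cols, index=["one", "two", "three"])

print(from_rows)
print(labeled)
```

Laying the data out one record per inner list also keeps each column to a single type, unlike frame_1 above, where every column mixes strings and numbers.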
2) Indexing
As with a Series, the values and index attributes expose the underlying data and the row index.
(This session was recorded after the assignments in section 3, so frame_4 already holds the modified values. It was run under Python 2, hence the L suffix on integers.)
In [62]: frame_4
Out[62]:
   pop     states      year
0  1.2       ohio      2000
1  2.1  new state  new year
2  3.6     nevada      1000

In [63]: frame_4.index
Out[63]: Int64Index([0, 1, 2], dtype='int64')

In [64]: frame_4.index.name

In [65]: frame_4.index
Out[65]: Int64Index([0, 1, 2], dtype='int64')

In [66]: frame_4.values
Out[66]:
array([[1.2, 'ohio', 2000L],
       [2.1, 'new state', 'new year'],
       [3.6, 'nevada', 1000L]], dtype=object)
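Indexing by column name returns that column as a Series.
In [38]: frame_4
Out[38]:
   pop  states  year
0  1.5    ohio  2000
1  1.7  nevada  1000
2  3.6  nevada  1000

In [39]: frame_4['pop']
Out[39]:
0    1.5
1    1.7
2    3.6
Name: pop, dtype: float64
The .ix indexer returns a row as a Series. [.ix can in fact select along both axes: it takes two arguments, the first selecting rows and the second selecting columns, and returns the cells where the two selections intersect.] Note that .ix was deprecated in pandas 0.20 and removed in pandas 1.0; on current versions use .loc for label-based selection and .iloc for position-based selection.
In [40]: frame_4.ix[1]
Out[40]:
pop          1.7
states    nevada
year        1000
Name: 1, dtype: object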
In [8]: frame_4.ix[1,:1]
Out[8]:
pop 1.7
Name: 1, dtype: object
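Since .ix is not available in current pandas releases, the same selections can be sketched with .loc and .iloc. This is one possible translation of the session above, not the original author's code; the frame is rebuilt here with the columns in the order pop, states, year so that positional slices match the recorded output.

```python
import pandas as pd

frame = pd.DataFrame({
    "pop": [1.5, 1.7, 3.6],
    "states": ["ohio", "nevada", "nevada"],
    "year": [2000, 1000, 1000],
})

# One row as a Series: by label with .loc, by position with .iloc.
row_by_label = frame.loc[1]
row_by_pos = frame.iloc[1]

# Row 1, first column only: the positional counterpart of frame_4.ix[1, :1].
first_col_of_row = frame.iloc[1, :1]

# A single cell, addressed by row label and column name.
cell = frame.loc[1, "pop"]

print(row_by_label)
print(first_col_of_row)
print(cell)
```

With the default integer row index, the label 1 and the position 1 refer to the same row, so .loc[1] and .iloc[1] agree here. They differ as soon as the index holds other labels.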
3) Assignment
Column assignment
In [41]: frame_4['pop']=2.0

In [42]: frame_4
Out[42]:
   pop  states  year
0    2    ohio  2000
1    2  nevada  1000
2    2  nevada  1000
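Row assignment
(The assigning statement, In [43], is missing from the source. The output is consistent with a scalar assigned to row 1, for example frame_4.ix[1] = 'hello'.)
In [44]: frame_4
Out[44]:
     pop  states   year
0      2    ohio   2000
1  hello   hello  hello
2      2  nevada   1000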
Assignment from a Series
In [45]: val = pd.Series([1.2,2.0,3.6],index=[0,1,2])

In [46]: frame_4['pop']=val

In [47]: frame_4
Out[47]:
   pop  states   year
0  1.2    ohio   2000
1  2.0   hello  hello
2  3.6  nevada   1000
In [48]: val_2 = pd.Series([2.1,'new state','new year'],index=['pop','states','year'])

In [49]: frame_4.ix[1]=val_2

In [50]: frame_4
Out[50]:
   pop     states      year
0  1.2       ohio      2000
1  2.1  new state  new year
2  3.6     nevada      1000
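The assignment patterns above can be sketched on a current pandas with .loc in place of .ix. This is an illustrative sketch only: the frame and the helper names after_scalar and row1_missing are assumptions, and the values are chosen to keep each column's type unchanged, since recent pandas versions are stricter about writing a string into a numeric column.

```python
import pandas as pd

frame = pd.DataFrame({
    "pop": [1.5, 1.7, 3.6],
    "states": ["ohio", "nevada", "nevada"],
    "year": [2000, 1000, 1000],
})

# A scalar is broadcast to every row of the column.
frame["pop"] = 2.0
after_scalar = frame["pop"].tolist()

# A Series is aligned on the row index; rows it does not mention become NaN.
frame["pop"] = pd.Series([1.2, 3.6], index=[0, 2])
row1_missing = bool(pd.isna(frame.loc[1, "pop"]))

# Individual cells of a row are set with .loc[row_label, column_name].
frame.loc[1, "pop"] = 2.1
frame.loc[1, "states"] = "new state"

print(frame)
```

The alignment step is the reason the source passes index=[0,1,2] with val: the Series labels decide which row each value lands in, not the order of the list.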
Adding and deleting columns
In [52]: frame_4['stars']=['one','two','five'] # assigning to a column that does not exist creates it

In [53]: frame_4
Out[53]:
   pop     states      year stars
0  1.2       ohio      2000   one
1  2.1  new state  new year   two
2  3.6     nevada      1000  five

In [54]: del frame_4['stars']

In [55]: frame_4
Out[55]:
   pop     states      year
0  1.2       ohio      2000
1  2.1  new state  new year
2  3.6     nevada      1000
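Adding and removing a column can be sketched as below, together with DataFrame.drop as a non-mutating alternative to del. The frame and the names had_stars and trimmed are hypothetical and exist only for this illustration.

```python
import pandas as pd

frame = pd.DataFrame({
    "pop": [1.2, 2.1, 3.6],
    "states": ["ohio", "nevada", "nevada"],
    "year": [2000, 1000, 1000],
})

# Assigning to a new column name creates the column (one value per row).
frame["stars"] = ["one", "two", "five"]
had_stars = "stars" in frame.columns

# del removes the column from the frame in place.
del frame["stars"]

# drop returns a new frame without the column and leaves the original untouched.
trimmed = frame.drop(columns=["year"])

print(list(frame.columns))
print(list(trimmed.columns))
```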
4) Transpose: .T [returns a new, transposed DataFrame; the original frame is not modified]
In [56]: frame_4
Out[56]:
   pop     states      year
0  1.2       ohio      2000
1  2.1  new state  new year
2  3.6     nevada      1000

In [57]: frame_4.T
Out[57]:
           0          1       2
pop      1.2        2.1     3.6
states  ohio  new state  nevada
year    2000   new year    1000
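That .T leaves the original frame alone can be verified with a short sketch. A two-row frame is assumed here (rather than the three-row frame_4) so that the change in shape is visible; the names frame and flipped are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({
    "pop": [1.2, 3.6],
    "states": ["ohio", "nevada"],
    "year": [2000, 1000],
})

# .T swaps the axes: column names become row labels and row labels become columns.
flipped = frame.T

print(frame.shape)
print(flipped.shape)
print(flipped)
```

To keep the transposed layout, the result has to be bound to a name, as with flipped here, because frame itself still has its original rows and columns.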