pandas is a powerful Python data-analysis toolkit built on top of NumPy.
Main features:
- Data structures with built-in data alignment: DataFrame and Series
- Integrated time-series functionality
- A rich set of mathematical operations
- Flexible handling of missing data
Install: pip install pandas
Import: import pandas as pd
Series - one-dimensional data object
A Series is a one-dimensional, array-like object made up of a set of values and an associated set of data labels (its index).
Ways to create a Series
In [206]: import pandas as pd

In [207]: pd.Series([4,7,-5,3])
Out[207]:
0    4
1    7
2   -5
3    3
dtype: int64

In [208]: pd.Series([4,7,-5,3], index=['a','b','c','d'])
Out[208]:
a    4
b    7
c   -5
d    3
dtype: int64

In [209]: pd.Series({'a':1,'b':2})
Out[209]:
a    1
b    2
dtype: int64

In [210]: pd.Series(0, index=['a','b','c','d'])
Out[210]:
a    0
b    0
c    0
d    0
dtype: int64
Getting the value array and the index array: the values and index attributes
In [211]: a = pd.Series([4,7,-5,3], index=['a','b','c','d'])

In [212]: a.values
Out[212]: array([ 4,  7, -5,  3], dtype=int64)

In [214]: a.index
Out[214]: Index(['a', 'b', 'c', 'd'], dtype='object')
A Series is best thought of as a cross between a list (array) and a dict.
Series - usage
Series supports array-like behaviour:
- Arithmetic with a scalar: sr*2
In [217]: sr
Out[217]:
a    4
b    7
c   -5
d    3
dtype: int64

In [218]: sr * 2
Out[218]:
a     8
b    14
c   -10
d     6
dtype: int64
- Arithmetic with another Series (vector): sr1+sr2. Values are added where the labels match; where they do not, the result's index is the union of both label sets and the unmatched positions become NaN (see "Series data alignment" below)
In [221]: sr2 = pd.Series([1,2,3,4], index=['a','b','c','d'])

In [222]: sr + sr2
Out[222]:
a    5
b    9
c   -2
d    7
dtype: int64
- Indexing: sr[0], sr[[1,2,4]]
In [224]: sr
Out[224]:
a    4
b    7
c   -5
d    3
dtype: int64

In [225]: sr[0]
Out[225]: 4

In [226]: sr[[0,2,3]]
Out[226]:
a    4
c   -5
d    3
dtype: int64
- Slicing: sr[:2]
In [227]: sr[:2]
Out[227]:
a    4
b    7
dtype: int64
- Universal functions: np.abs(sr)
In [228]: sr
Out[228]:
a    4
b    7
c   -5
d    3
dtype: int64

In [229]: np.abs(sr)   # requires import numpy as np
Out[229]:
a    4
b    7
c    5
d    3
dtype: int64
- Boolean filtering
In [230]: sr
Out[230]:
a    4
b    7
c   -5
d    3
dtype: int64

In [231]: sr[sr>0]
Out[231]:
a    4
b    7
d    3
dtype: int64
Series supports dict-like behaviour (via its labels):
- Creating a Series from a dict: Series(dic)
In [232]: pd.Series({'a':1,'b':5})
Out[232]:
a    1
b    5
dtype: int64
- Membership test on labels: 'a' in sr. When iterating with a for loop, the loop yields the values, not the labels
In [233]: sr
Out[233]:
a    4
b    7
c   -5
d    3
dtype: int64

In [234]: 'a' in sr
Out[234]: True

In [236]: for i in sr:
     ...:     print(i)
4
7
-5
3
- Key indexing: sr['a'], sr[['a','b','d']]
In [237]: sr
Out[237]:
a    4
b    7
c   -5
d    3
dtype: int64

In [238]: sr['a']
Out[238]: 4

In [239]: sr[['a','c']]
Out[239]:
a    4
c   -5
dtype: int64

In [240]: sr['a':'c']   # label slicing: both the start and the end label are included
Out[240]:
a    4
b    7
c   -5
dtype: int64
Series - integer index
If the labels (keys) of a Series are themselves integers, it is easy to confuse label-based access with position-based access; with plain brackets, an integer is interpreted as a label by default.
In [40]: sr = pd.Series(np.arange(10))

In [41]: sr
Out[41]:
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32

In [44]: sr2 = sr[5:].copy()

In [45]: sr2
Out[45]:
5    5
6    6
7    7
8    8
9    9
dtype: int32

In [46]: sr2[5]   # the value whose label is 5; if 5 were a position it would be out of range and raise
Out[46]: 5

In [47]: sr2[-1]  # meant as "last element", but -1 is interpreted as a label, which does not exist
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-47-3882bebf0859> in <module>()
----> 1 sr2[-1]
...
KeyError: -1
Solution: state the access mode explicitly; loc selects by label (key), iloc selects by position (integer offset).
In [48]: sr2.loc[5]    # loc: label-based (key) access
Out[48]: 5

In [50]: sr2.iloc[4]   # iloc: position-based access
Out[50]: 9

In [51]: sr2.iloc[-1]
Out[51]: 9
Series data alignment
When pandas performs an operation on two Series objects, it aligns them by index before computing.
If the indexes of the two Series are not exactly the same, the index of the result is the union of the two operands' indexes.
If only one of the operands has a value at a given index, the result at that index is NaN (missing value).
In [6]: sr1 = pd.Series([4,9,100], index=['a','b','c'])

In [7]: sr1
Out[7]:
a      4
b      9
c    100
dtype: int64

In [8]: sr2 = pd.Series([4,5,6], index=['b','c','d'])

In [9]: sr2
Out[9]:
b    4
c    5
d    6
dtype: int64

In [10]: sr1 + sr2
Out[10]:
a      NaN
b     13.0
c    105.0
d      NaN
dtype: float64
What if we want to handle the missing values here, e.g. treat the side that has no value as 0, so that 'a' becomes 4 and 'd' becomes 6?
Use the flexible arithmetic methods: add, sub, div, mul.
sr1 + sr2 is equivalent to sr1.add(sr2), and the fill_value parameter of these methods handles the missing side.
In [10]: sr1 + sr2
Out[10]:
a      NaN
b     13.0
c    105.0
d      NaN
dtype: float64

In [11]: sr1.add(sr2)
Out[11]:
a      NaN
b     13.0
c    105.0
d      NaN
dtype: float64

In [12]: sr1.add(sr2, fill_value=0)
Out[12]:
a      4.0
b     13.0
c    105.0
d      6.0
dtype: float64
Series - missing data
Missing data is represented by NaN (Not a Number), whose value is np.nan; Python's built-in None is also treated as NaN.
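A small supplementary sketch (with made-up data) showing that both None and np.nan end up as missing values in a Series:

import numpy as np
import pandas as pd

sr = pd.Series([1, None, np.nan, 4])   # None is converted to NaN
print(sr)           # 1.0 / NaN / NaN / 4.0, dtype becomes float64
print(sr.isnull())  # boolean Series, True at the missing positions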
pandas provides several methods for dealing with missing values:
- dropna()  filters out rows whose value is NaN
- fillna()  fills in missing data
- isnull()  returns a boolean array, True where values are missing
- notnull() returns a boolean array, False where values are missing
Approach 1: drop the missing values
In [61]: sr = sr1 + sr2   # sr1 and sr2 here are a newly constructed pair of Series

In [62]: sr
Out[62]:
a    33.0
b     NaN
c    32.0
d    45.0
dtype: float64

In [63]: sr.isnull()    # True where the value is NaN
Out[63]:
a    False
b     True
c    False
d    False
dtype: bool

In [64]: sr.notnull()   # True where the value is not NaN; combine with boolean filtering to drop missing values
Out[64]:
a     True
b    False
c     True
d     True
dtype: bool

In [65]: sr[sr.notnull()]   # filter
Out[65]:
a    33.0
c    32.0
d    45.0
dtype: float64

In [66]: sr.dropna()    # Series also provides a method that drops NaN values directly
Out[66]:
a    33.0
c    32.0
d    45.0
dtype: float64
Approach 2: fill in the missing values
In [68]: sr.fillna(0)   # fill with 0; you could also fill with the mean, sr.mean(), which skips NaN values
Out[68]:
a    33.0
b     0.0
c    32.0
d    45.0
dtype: float64

In [69]: sr = sr.fillna(0)   # fillna does not modify the object in place, so reassign the result
DataFrame - two-dimensional data object
A DataFrame is a tabular data structure containing an ordered collection of columns. It can be thought of as a dict of Series that all share the same (row) index.
Ways to create a DataFrame
In [3]: pd.DataFrame({'one':[1,2,3,4], 'two':[4,3,2,1]})
Out[3]:
   one  two
0    1    4
1    2    3
2    3    2
3    4    1

In [4]: pd.DataFrame({'one': pd.Series([1,2,3], index=['a','b','c']),
   ...:               'two': pd.Series([1,2,3,4], index=['b','a','c','d'])})
Out[4]:
   one  two
a  1.0    2
b  2.0    1
c  3.0    3
d  NaN    4
# where a column has no value for some index, NaN is filled in

In [5]: pd.DataFrame({'one':[1,2,3,4], 'two':[4,3,2,1]}, index=['a','b','c','d'])
Out[5]:
   one  two
a    1    4
b    2    3
c    3    2
d    4    1
DataFrame - common attributes
- index    the row index (row labels)
- values   the underlying value array
- columns  the column index (column names)
In [5]: pd.DataFrame({'one':[1,2,3,4], 'two':[4,3,2,1]}, index=['a','b','c','d'])
Out[5]:
   one  two
a    1    4
b    2    3
c    3    2
d    4    1

In [6]: df = _5    # _5 refers to the result shown as Out[5]

In [7]: df.index
Out[7]: Index(['a', 'b', 'c', 'd'], dtype='object')

In [8]: df.values
Out[8]:
array([[1, 4],
       [2, 3],
       [3, 2],
       [4, 1]], dtype=int64)

In [9]: df.columns
Out[9]: Index(['one', 'two'], dtype='object')
- T  transpose: swap rows and columns
In [10]: df
Out[10]:
   one  two
a    1    4
b    2    3
c    3    2
d    4    1

In [11]: df.T
Out[11]:
     a  b  c  d
one  1  2  3  4
two  4  3  2  1
- describe()  quick summary statistics per column: count, mean, standard deviation, min, max, median, quartiles, etc.
In [13]: df.describe()   # statistics are computed per column
Out[13]:
            one       two
count  4.000000  4.000000    # count (NaN excluded)
mean   2.500000  2.500000    # mean
std    1.290994  1.290994    # standard deviation
min    1.000000  1.000000    # minimum
25%    1.750000  1.750000
50%    2.500000  2.500000    # median
75%    3.250000  3.250000
max    4.000000  4.000000    # maximum
DataFrame - indexing and slicing
A DataFrame is two-dimensional, so it has both a row index and a column index.
Like Series, a DataFrame can be indexed and sliced either by label or by position.
Plain-bracket indexing
Brackets select the column first, then the row; selecting only a column is supported, selecting only a row is not.
In [18]: df
Out[18]:
   one  two
a    1    4
b    2    3
c    3    2
d    4    1

In [19]: df['one']['a']   # bracket access: column first, then row
Out[19]: 1

In [20]: df['one']        # a single column
Out[20]:
a    1
b    2
c    3
d    4
Name: one, dtype: int64

In [21]: df['a']          # error: brackets cannot select just a row, because the columns are the Series objects, rows are not
Row-and-column indexing
- loc attribute: label-based access (by row label and column name)
- iloc attribute: position-based access
Usage: separate the two parts with a comma; the row selector comes first, the column selector second.
Each selector may be a plain index, a slice, a boolean index or a fancy index, in any combination.
In [22]: df
Out[22]:
   one  two
a    1    4
b    2    3
c    3    2
d    4    1

In [23]: df.loc['a',]       # loc does support selecting just a row
Out[23]:
one    1
two    4
Name: a, dtype: int64

In [24]: df.loc[['a','c'],] # fancy indexing is supported
Out[24]:
   one  two
a    1    4
c    3    2

In [25]: df.loc['a':'c', 'one']   # label slice for rows, column name after the comma
Out[25]:
a    1
b    2
c    3
Name: one, dtype: int64

In [26]: df.iloc[0]         # iloc also supports selecting just a row
Out[26]:
one    1
two    4
Name: a, dtype: int64

In [27]: df.iloc[0][1]
Out[27]: 4

In [28]: df.iloc[0, 1]
Out[28]: 4
DataFrame - data alignment and missing data
When DataFrame objects are combined in an operation, the data is aligned as well: row indexes and column indexes are aligned separately.
In [29]: pd.DataFrame({'one':[1,2,3,4], 'two':[4,5,6,7]}, index=['a','b','c','d'])
Out[29]:
   one  two
a    1    4
b    2    5
c    3    6
d    4    7

In [30]: df = _29

In [31]: df2 = pd.DataFrame({'two':[7,8,7,8], 'one':[8,9,8,8]}, index=['a','c','d','b'])

In [32]: df2
Out[32]:
   two  one
a    7    8
c    8    9
d    7    8
b    8    8

In [33]: df + df2    # rows and columns are aligned before adding
Out[33]:
   one  two
a    9   11
b   10   13
c   12   14
d   12   14
Handling missing values, option 1: fill them with fillna()
In [35]: df.loc['e', 'one'] = np.nan
In [36]: df.loc['e', 'two'] = 10
In [37]: df.loc['f', 'one'] = np.nan
In [38]: df.loc['f', 'two'] = np.nan

In [39]: df
Out[39]:
   one   two
a  1.0   4.0
b  2.0   5.0
c  3.0   6.0
d  4.0   7.0
e  NaN  10.0
f  NaN   NaN

In [40]: df.fillna(0)
Out[40]:
   one   two
a  1.0   4.0
b  2.0   5.0
c  3.0   6.0
d  4.0   7.0
e  0.0  10.0
f  0.0   0.0
Handling missing values, option 2: drop them
- dropna(axis=0, how='any')  axis chooses whether rows or columns are dropped (0 = rows, the default; 1 = columns); how chooses the condition: 'any' drops a row/column that contains any NaN, 'all' drops it only when every value in it is NaN
# note: this df2 has gained an extra row 'e' containing NaN, and df has also been rebuilt relative to the previous example
In [39]: df2.dropna()           # the default is how='any'
Out[39]:
   one  two
a  8.0  7.0
c  9.0  8.0
d  8.0  7.0
b  8.0  8.0

In [40]: df2.dropna(how='all')  # drop only the rows where every column is NaN
Out[40]:
   one   two
a  8.0   7.0
c  9.0   8.0
d  8.0   7.0
b  8.0   8.0
e  NaN  10.0

In [41]: df2.dropna(how='any')  # drop any row containing a NaN value
Out[41]:
   one  two
a  8.0  7.0
c  9.0  8.0
d  8.0  7.0
b  8.0  8.0

In [42]: df.loc['a','one'] = np.nan

In [43]: df
Out[43]:
   one  two
a  NaN    4
b  2.0    5
c  3.0    6
d  4.0    7

In [44]: df.dropna(axis=1)      # drop the columns containing NaN values
Out[44]:
   two
a    4
b    5
c    6
d    7
- isnull()
- notnull()
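As a small supplement, a sketch (made-up data in the same shape as above) showing that isnull()/notnull() on a DataFrame return a boolean DataFrame of the same shape:

import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1, np.nan, 3], 'two': [4, 5, np.nan]},
                  index=['a', 'b', 'c'])
print(df.isnull())                 # True at the missing positions
print(df.notnull())                # False at the missing positions
print(df[df['one'].notnull()])     # filter out rows where column 'one' is NaN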
pandas - common methods
- mean(axis=0, skipna=True)  mean of each column (or row); axis defaults to 0, i.e. per column
- sum(axis=0, skipna=True)   sum of each column (or row); pass axis=1 to compute per row
In [45]: df
Out[45]:
   one  two
a  NaN    4
b  2.0    5
c  3.0    6
d  4.0    7

In [46]: df.mean()          # per-column mean by default
Out[46]:
one    3.0
two    5.5
dtype: float64

In [47]: df.mean(axis=1)    # per-row mean
Out[47]:
a    4.0
b    3.5
c    4.5
d    5.5
dtype: float64

In [48]: df.sum()           # per-column sum
Out[48]:
one     9.0
two    22.0
dtype: float64
- sort_index(axis=0, ascending=True)  sort by row (or column) index; ascending=True sorts in ascending order, False in descending order
- sort_values(by, axis=0, ascending=True)  sort by the values of a column (or row); by names the column or row to sort by
In [49]: df.sort_values(by='two')                 # sort by a column's values, ascending
Out[49]:
   one  two
a  NaN    4
b  2.0    5
c  3.0    6
d  4.0    7

In [50]: df.sort_values(by='two', ascending=False)  # sort by a column's values, descending
Out[50]:
   one  two
d  4.0    7
c  3.0    6
b  2.0    5
a  NaN    4

In [52]: df.sort_values(by='a', ascending=False, axis=1)  # sort by a row's values, descending
Out[52]:
   two  one
a    4  NaN
b    5  2.0
c    6  3.0
d    7  4.0

In [53]: df.sort_values(by='one')                 # NaN values do not take part in sorting and go last
Out[53]:
   one  two
b  2.0    5
c  3.0    6
d  4.0    7
a  NaN    4

In [54]: df.sort_values(by='one', ascending=False)
Out[54]:
   one  two
d  4.0    7
c  3.0    6
b  2.0    5
a  NaN    4

In [55]: df.sort_index()                          # row index, ascending
Out[55]:
   one  two
a  NaN    4
b  2.0    5
c  3.0    6
d  4.0    7

In [56]: df.sort_index(ascending=False)           # row index, descending
Out[56]:
   one  two
d  4.0    7
c  3.0    6
b  2.0    5
a  NaN    4

In [57]: df.sort_index(axis=1)                    # sort the column index
Out[57]:
   one  two
a  NaN    4
b  2.0    5
c  3.0    6
d  4.0    7

In [58]: df.sort_index(ascending=False, axis=1)
Out[58]:
   two  one
a    4  NaN
b    5  2.0
c    6  3.0
d    7  4.0
Other methods
- apply(func, axis=0)  applies a custom function to each column (or row); func may return a scalar or a Series
- applymap(func)       applies a function to every element of a DataFrame
- map(func)            applies a function to every element of a Series
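A short sketch (reusing the df layout from above, data made up) illustrating the difference between the three:

import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [4, 5, 6, 7]},
                  index=['a', 'b', 'c', 'd'])

print(df.apply(lambda col: col.max() - col.min()))  # per column: returns a Series of scalars
print(df.applymap(lambda x: x * 10))                # element-wise over the whole DataFrame
print(df['one'].map(lambda x: x ** 2))              # element-wise over a single Series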
pandas - working with time objects
Generating an array of time objects: date_range
- start    start time
- end      end time
- periods  number of periods
- freq     frequency, 'D' (day) by default; common values include H (hour), W (week), B (business day), SM (semi-month), M (month end), T or min (minute), S (second), A (year end)
In [71]: pd.date_range('2018-01-01', periods=10)
Out[71]:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
               '2018-01-09', '2018-01-10'],
              dtype='datetime64[ns]', freq='D')

In [72]: pd.date_range('2018-01-01', '2030-01-01', freq='A')
Out[72]:
DatetimeIndex(['2018-12-31', '2019-12-31', '2020-12-31', '2021-12-31',
               '2022-12-31', '2023-12-31', '2024-12-31', '2025-12-31',
               '2026-12-31', '2027-12-31', '2028-12-31', '2029-12-31'],
              dtype='datetime64[ns]', freq='A-DEC')
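As a supplement, a small sketch (start date chosen arbitrarily) showing a few more freq values:

import pandas as pd

print(pd.date_range('2018-01-01', periods=4, freq='H'))   # hourly
print(pd.date_range('2018-01-01', periods=4, freq='W'))   # weekly (weeks ending Sunday by default)
print(pd.date_range('2018-01-01', periods=4, freq='B'))   # business days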
A time series is simply a Series or DataFrame whose index consists of time objects.
When datetime objects are used as the index, they are stored in a DatetimeIndex.
In [73]: sr = pd.Series(np.arange(20), index=pd.date_range('2018-01-01', periods=20))

In [74]: sr
Out[74]:
2018-01-01     0
2018-01-02     1
2018-01-03     2
2018-01-04     3
2018-01-05     4
2018-01-06     5
2018-01-07     6
2018-01-08     7
2018-01-09     8
2018-01-10     9
2018-01-11    10
2018-01-12    11
2018-01-13    12
2018-01-14    13
2018-01-15    14
2018-01-16    15
2018-01-17    16
2018-01-18    17
2018-01-19    18
2018-01-20    19
Freq: D, dtype: int32

In [75]: sr.index
Out[75]:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
               '2018-01-09', '2018-01-10', '2018-01-11', '2018-01-12',
               '2018-01-13', '2018-01-14', '2018-01-15', '2018-01-16',
               '2018-01-17', '2018-01-18', '2018-01-19', '2018-01-20'],
              dtype='datetime64[ns]', freq='D')
Special features of time series:
- A year ('2019') or a year-month ('2018-03') string can be passed as a slice key
In [32]: sr = pd.Series(np.arange(1000), index=pd.date_range('2018-01-01', periods=1000))

In [33]: sr['2018-03']    # select a particular month of a year
Out[33]:
2018-03-01    59
2018-03-02    60
2018-03-03    61
...
2018-03-30    88
2018-03-31    89
Freq: D, dtype: int32

In [35]: sr['2019']       # select a whole year
Out[35]:
2019-01-01    365
2019-01-02    366
2019-01-03    367
...
2019-12-30    728
2019-12-31    729
Freq: D, Length: 365, dtype: int32
- A date range can be passed as a slice
In [36]: sr['2018-11':'2019-01']        # slice by year-month range
Out[36]:
2018-11-01    304
2018-11-02    305
2018-11-03    306
...
2019-01-30    394
2019-01-31    395
Freq: D, Length: 92, dtype: int32

In [37]: sr['2018-12-03':'2019-01-01']  # slice by exact dates
Out[37]:
2018-12-03    336
2018-12-04    337
2018-12-05    338
...
2018-12-31    364
2019-01-01    365
Freq: D, dtype: int32
- Rich function support: resample(), strftime(), ...
In [38]: sr.resample('W').sum()    # weekly sums
Out[38]:
2018-01-07     21
2018-01-14     70
2018-01-21    119
2018-01-28    168
...
2020-09-20    6930
2020-09-27    5979
Freq: W-SUN, Length: 143, dtype: int32

In [39]: sr.resample('A').sum()    # yearly sums
Out[39]:
2018-12-31     66430
2019-12-31    199655
2020-12-31    233415
Freq: A-DEC, dtype: int32

In [40]: sr.resample('M').mean()   # monthly means
Out[40]:
2018-01-31     15.0
2018-02-28     44.5
2018-03-31     74.0
...
2020-08-31    958.0
2020-09-30    986.5
Freq: M, dtype: float64

In [41]: sr.truncate(before='2019-11-12')   # drop everything before that date; with slicing this powerful it is rarely needed
Out[41]:
2019-11-12    680
2019-11-13    681
2019-11-14    682
...
2020-09-25    998
2020-09-26    999
Freq: D, Length: 320, dtype: int32
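The transcript above only shows resample; strftime formats a time index as strings. A small sketch (format string chosen arbitrarily):

import numpy as np
import pandas as pd

sr = pd.Series(np.arange(5), index=pd.date_range('2018-01-01', periods=5))
print(sr.index.strftime('%Y/%m/%d'))   # the DatetimeIndex formatted as an array of strings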
pandas - file handling
Reading: read_csv
The most common data file format is CSV (values separated by some delimiter).
pandas can load data from a file name, a URL or a file object.
- read_csv    default delimiter is the comma
- read_table  default delimiter is the tab character
Parameters:
- sep        the delimiter; a regular expression such as '\s+' may be used
- index_col  which column to use as the (row) index
In [87]: df.to_csv('test.csv', header=True, index=True, na_rep='null', encoding='gbk', columns=['one','two'])   # first build a file with DataFrame's to_csv method

In [88]: pd.read_csv('test.csv')
Out[88]:
  Unnamed: 0  one   two
0          a  1.0   4.0
1          b  2.0   5.0
2          c  3.0   6.0
3          d  4.0   7.0
4          e  NaN  10.0
5          f  NaN   NaN

In [89]: pd.read_csv('test.csv', index_col=0)       # the index column can be given by its position
Out[89]:
   one   two
a  1.0   4.0
b  2.0   5.0
c  3.0   6.0
d  4.0   7.0
e  NaN  10.0
f  NaN   NaN

In [90]: pd.read_csv('test.csv', index_col='one')   # or by its column name
Out[90]:
    Unnamed: 0   two
one
1.0          a   4.0
2.0          b   5.0
3.0          c   6.0
4.0          d   7.0
NaN          e  10.0
NaN          f   NaN
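As a side note, a small sketch of the sep parameter (assuming a hypothetical whitespace-separated file test_space.txt with made-up content):

import pandas as pd

# test_space.txt is assumed to look like:
# one two
# 1   4
# 2   5
df = pd.read_csv('test_space.txt', sep=r'\s+')   # regular expression: any run of whitespace acts as the delimiter
print(df)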
If a date column is read in this way there is still a problem: even though the dates have been made the index, they are not time objects, just strings. How do we convert them into time objects?
- parse_dates  which columns should be parsed as dates; accepts a boolean or a list
pd.read_csv('test.csv', index_col='date', parse_dates=True)       # parse every column that can be interpreted as dates
pd.read_csv('test.csv', index_col='date', parse_dates=['date'])   # parse only the 'date' column
- header=None  declare that the file has no header row
- names        column names to use, as a list
If the file has no header row, a plain read would wrongly use the first row of data as the column names; to avoid that and supply names yourself, do the following:
In [106]: pd.read_csv('test.csv')            # the first data row has been wrongly taken as the header
Out[106]:
   1.0   4.0
0  2.0   5.0
1  3.0   6.0
2  4.0   7.0
3  NaN  10.0
4  NaN   NaN

In [107]: pd.read_csv('test.csv', header=None)   # tell the parser there is no header, so the first row stays data; columns default to 0, 1, ...
Out[107]:
     0     1
0  1.0   4.0
1  2.0   5.0
2  3.0   6.0
3  4.0   7.0
4  NaN  10.0
5  NaN   NaN

In [108]: pd.read_csv('test.csv', header=None, names=list('gh'))   # give the columns the names g and h, passed as a list
Out[108]:
     g     h
0  1.0   4.0
1  2.0   5.0
2  3.0   6.0
3  4.0   7.0
4  NaN  10.0
5  NaN   NaN
- na_values  value(s) or string(s) that should be read as missing (NaN)
- skiprows   row numbers to skip
In [110]: pd.read_csv('test.csv', header=None, names=list('gh'), na_values='10')   # also treat '10' as a missing value
Out[110]:
     g    h
0  1.0  4.0
1  2.0  5.0
2  3.0  6.0
3  4.0  7.0
4  NaN  NaN
5  NaN  NaN

In [111]: pd.read_csv('test.csv', header=None, names=list('gh'), na_values='10', skiprows=[4,5])   # skip rows 4 and 5 of the file
Out[111]:
     g    h
0  1.0  4.0
1  2.0  5.0
2  3.0  6.0
3  4.0  7.0
Writing: the to_csv method
- sep           the delimiter to use in the output file
- na_rep        string to write for missing values, empty string by default
- header=False  do not write the header row
- index=False   do not write the row-index column
- columns       which columns to write, as a list
In [59]: df.to_csv('test3.csv', header=False, index=False, na_rep='null', encoding='gbk',
    ...:           columns=['年份','股票代码','股票价格'])

In [60]: pd.read_csv('test3.csv', encoding='gbk')
pandas also supports other file formats: JSON, XML, HTML, databases, pickle, Excel, ...
In [68]: df.to_html('test.html', header=False, index=False, na_rep='null',
    ...:            columns=['年份','股票代码','股票价格'])

In [5]: pd.read_html('test.html', encoding='gbk')   # reading and writing these formats usually requires installing extra modules
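Excel works much the same way. A minimal sketch (file name made up; reading/writing Excel needs an extra engine such as openpyxl or xlrd installed):

import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]})
df.to_excel('test.xlsx', index=False)   # requires an Excel engine, e.g. openpyxl
print(pd.read_excel('test.xlsx'))       # reads the first sheet back into a DataFrame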