numpy处理数值,但是pandas除了处理数值之外(基于numpy),还能处理其他类型数据。
pandas常用数据类型
Series一维,带标签数组
DataFrame二维,Series容器
Series创建
In [97]: import pandas as pd
In [98]: pd.Series([1,2,3,4,5]) # 第一列对象键(index, 索引),第二列对象值(values)
Out[98]:
0 1
1 2
2 3
3 4
4 5
dtype: int64
In [99]: t = pd.Series([1,2,3,4,5])
In [100]: type(t)
Out[100]: pandas.core.series.Series
In [101]: t2 = pd.Series([1,2,3,4],index=list('abcd')) # index指定索引
In [102]: t2
Out[102]:
a 1
b 2
c 3
d 4
dtype: int64
In [105]: pd.Series(np.arange(2))
Out[105]:
0 0
1 1
dtype: int64
In [107]: t3 = pd.Series({'name':'xyp', 'age':18})
In [108]: t3
Out[108]:
name xyp
age 18
dtype: object
In [109]: t3.dtype
Out[109]: dtype('O') # o代表object
In [110]: t2.dtype
Out[110]: dtype('int64')
In [111]: t2.astype(float)
Out[111]:
a 1.0
b 2.0
c 3.0
d 4.0
dtype: float64
Series切片和索引
In [118]: t3 = pd.Series({'name':'xyp','age':18,'tel':10086})
In [119]: t3
Out[119]:
name xyp
age 18
tel 10086
dtype: object
In [120]: t3[0]
Out[120]: 'xyp'
In [121]: t3['name']
Out[121]: 'xyp'
In [122]: t3[[1,2]]
Out[122]:
age 18
tel 10086
dtype: object
In [123]: t3[[0,2]]
Out[123]:
name xyp
tel 10086
dtype: object
In [125]: t3[['age','tel']]
Out[125]:
age 18
tel 10086
dtype: object
In [128]: t3
Out[128]:
name xyp
age 18
tel 10086
dtype: object
In [129]: t3.index
Out[129]: Index(['name', 'age', 'tel'], dtype='object')
In [130]: for i in t3.index:
...: print(i)
...:
name
age
tel
In [131]: type(t3.index)
Out[131]: pandas.core.indexes.base.Index
In [132]: len(t3.index)
Out[132]: 3
In [133]: list(t3.index)
Out[133]: ['name', 'age', 'tel']
In [136]: s = pd.Series(range(5))
In [137]: s.where(s>0)
Out[137]:
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
dtype: float64
In [139]: s.where(s>1, 10) # s元素 = s元素 s元素>1 else 10,和numpy用法相反
Out[139]:
0 10
1 10
2 2
3 3
4 4
dtype: int64
In [126]: t
Out[126]:
0 1
1 2
2 3
3 4
4 5
dtype: int64
In [127]: t[t>3]
Out[127]:
3 4
4 5
dtype: int64
pandas读取外部文件
pd.read_csv
pd.read_excel
pd.read_sql
mongodb数据和mysql不同,需先连接mongodb取出数据然后放入pd.Series(data)中处理
......
DataFrame创建
In [4]: pd.DataFrame(np.arange(12).reshape(3,4)) # 第一行为行索引,第一列为列索引
Out[4]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
In [6]: pd.DataFrame(np.arange(12).reshape(3,4),index=list('abc'),columns=list('WXYZ'))
Out[6]:
W X Y Z
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
In [7]: t1 = pd.DataFrame(np.arange(12).reshape(3,4),index=list('abc'),columns=list('WXYZ'))
In [8]: t1
Out[8]:
W X Y Z
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
In [9]: d1 = {'name':['xyp','oynn'],'age':[18,20],'tel':[10086,10087]}
In [10]: pd.DataFrame(d1)
Out[10]:
name age tel
0 xyp 18 10086
1 oynn 20 10087
In [11]: t1 = pd.DataFrame(d1)
In [12]: type(t1)
Out[12]: pandas.core.frame.DataFrame
In [13]: d2 = [{'name':'xyp','age':18,'tel':10086},{'name':'oynn','age':20},{'age':16,'tel':10087}]
In [14]: t2 = pd.DataFrame(d2)
In [15]: t2
Out[15]:
name age tel
0 xyp 18 10086.0
1 oynn 20 NaN
2 NaN 16 10087.0
DataFrame基础属性
In [15]: t2
Out[15]:
name age tel
0 xyp 18 10086.0
1 oynn 20 NaN
2 NaN 16 10087.0
In [16]: t2.index
Out[16]: RangeIndex(start=0, stop=3, step=1)
In [17]: t2.columns
Out[17]: Index(['name', 'age', 'tel'], dtype='object')
In [18]: t2.values
Out[18]:
array([['xyp', 18, 10086.0],
['oynn', 20, nan],
[nan, 16, 10087.0]], dtype=object)
In [19]: t2.shape
Out[19]: (3, 3)
In [20]: t2.dtypes
Out[20]:
name object
age int64
tel float64
dtype: object
In [21]: t2.ndim # t2维度
Out[21]: 2
In [22]: t2.head(2) # 显示头部几行,默认5行
Out[22]:
name age tel
0 xyp 18 10086.0
1 oynn 20 NaN
In [23]: t2.tail(2) # 显示末尾几行,默认5行
Out[23]:
name age tel
1 oynn 20 NaN
2 NaN 16 10087.0
In [24]: t2.info() # 相关信息概览:行数,列数,列索引,列非空值个数,列类型,内存占用
Out[24]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2 # 3行,索引0 to 2
Data columns (total 3 columns):
name 2 non-null object
age 3 non-null int64
tel 2 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes # 内存
In [25]: t2.describe() # 快速综合统计结果
Out[25]:
age tel
count 3.0 2.000000 # 计数
mean 18.0 10086.500000 # 均值
std 2.0 0.707107 # 标准差
min 16.0 10086.000000 # 最小值
25% 17.0 10086.250000 # 四分之一中位数
50% 18.0 10086.500000 # 中位数
75% 19.0 10086.750000 # 四分之三中位数
max 20.0 10087.000000 # 最大值
DataFrame排序方法
In [27]: t2
Out[27]:
name age tel
0 xyp 18 10086.0
1 oynn 20 NaN
2 NaN 16 10087.0
In [30]: t2.sort_values(by='age')
Out[30]:
name age tel
2 NaN 16 10087.0
0 xyp 18 10086.0
1 oynn 20 NaN
In [31]: t2.sort_values(by='age',ascending=False)
Out[31]:
name age tel
1 oynn 20 NaN
0 xyp 18 10086.0
2 NaN 16 10087.0
In [32]: df = t2.sort_values(by='age',ascending=False)
In [33]: df.head(2)
Out[33]:
name age tel
1 oynn 20 NaN
0 xyp 18 10086.0
DataFrame的索引
In [33]: df
Out[33]:
name age tel
1 oynn 20 NaN
0 xyp 18 10086.0
2 NaN 16 10087.0
In [34]: df[:2]
Out[34]:
name age tel
1 oynn 20 NaN
0 xyp 18 10086.0
In [35]: df[:2]['age']
Out[35]:
1 20
0 18
Name: age, dtype: int64
In [36]: df['name']
Out[36]:
1 oynn
0 xyp
2 NaN
Name: name, dtype: object
In [37]: type(df['name'])
Out[37]: pandas.core.series.Series
loc通过标签索引获取数据
In [48]: t2
Out[48]:
name age tel
0 xyp 18 10086.0
1 oynn 20 NaN
2 NaN 16 10087.0
In [49]: t2.loc[0,'name'] # loc[a,b] a为行索引,b为列索引
Out[49]: 'xyp'
In [50]: t2.loc[1,'age']
Out[50]: 20
In [51]: type(t2.loc[1,'age'])
Out[51]: numpy.int64
In [55]: t2.loc[2]
Out[55]:
name NaN
age 16
tel 10087
Name: 2, dtype: object
In [56]: t2.loc[2,:]
Out[56]:
name NaN
age 16
tel 10087
Name: 2, dtype: object
In [57]: t2.loc[:,'age']
Out[57]:
0 18
1 20
2 16
Name: age, dtype: int64
In [58]: t2.loc[[0,2],:]
Out[58]:
name age tel
0 xyp 18 10086.0
2 NaN 16 10087.0
In [59]: t2.loc[:,['name','age']]
Out[59]:
name age
0 xyp 18
1 oynn 20
2 NaN 16
In [62]: t2.loc[[1,2],['name','tel']]
Out[62]:
name tel
1 oynn NaN
2 NaN 10087.0
In [63]: type(t2.loc[[1,2],['name','tel']])
Out[63]: pandas.core.frame.DataFrame
In [64]: t2.loc[0:2,['name','tel']] # loc取索引,能取到索引最后一个数据
Out[64]:
name tel
0 xyp 10086.0
1 oynn NaN
2 NaN 10087.0
In [64]: t2.loc[0:2,'name':'tel'] # loc取索引,能取到索引最后一个数据
Out[64]:
name age tel
0 xyp 18 10086.0
1 oynn 20 NaN
2 NaN 16 10087.0
iloc通过位置索引获取数据
In [66]: t2.iloc[0:2,[0,1]]
Out[66]:
name age
0 xyp 18
1 oynn 20
In [67]: t2.iloc[0:2,0:2] # iloc取索引不包括最后一个数据
Out[67]:
name age
0 xyp 18
1 oynn 20
In [73]: t2.iloc[:,1] = 18
In [74]: t2
Out[74]:
name age tel
0 xyp 18 10086.0
1 oynn 18 NaN
2 NaN 18 10087.0
布尔索引
In [79]: t2
Out[79]:
name age tel
0 xyp 18 10086.0
1 oynn 20 NaN
2 NaN 22 10087.0
In [81]: t2[t2['age']>18]
Out[81]:
name age tel
1 oynn 20 NaN
2 NaN 22 10087.0
In [84]: t2[(t2['age']>18) & (t2['age']<22)]
Out[84]:
name age tel
1 oynn 20 NaN
In [85]: t2[(t2['age']>18) | (t2['age']<22)]
Out[85]:
name age tel
0 xyp 18 10086.0
1 oynn 20 NaN
2 NaN 22 10087.0
字符串方法
In [95]: t2
Out[95]:
name age tel
0 xyp 18 10086,10087
1 oynn 20 NaN
2 NaN 22 10086,10087
In [96]: t2['tel'].str.split(',')
Out[96]:
0 [10086, 10087]
1 NaN
2 [10086, 10087]
Name: tel, dtype: object
....
pandas缺失数据处理
In [99]: t2
Out[99]:
name age tel
0 xyp 18 10086,10087
1 oynn 20 NaN
2 NaN 22 10086,10087
In [101]: pd.isnull(t2) # 判断数据是否有NaN
Out[101]:
name age tel
0 False False False
1 False False True
2 True False False
In [102]: pd.notnull(t2) # 判断数据是否有NaN
Out[102]:
name age tel
0 True True True
1 True True False
2 False True True
1. 删除NaN所在的行列
In [104]: t2.dropna(axis=0) # 默认为dropna(axis=0, how='any', inplace=False), how='all'时为数据均为NaN才会删除,inplace=True时直接修改t2数组,不需要重新赋值就能改变t2数组
Out[104]:
name age tel
0 xyp 18 10086,10087
In [106]: t2.dropna(axis=0, how='all')
Out[106]:
name age tel
0 xyp 18 10086,10087
1 oynn 20 NaN
2 NaN 22 10086,10087
2. 填充数据
In [107]: t2
Out[107]:
name age tel
0 xyp 18 10086,10087
1 oynn 20 NaN
2 NaN 22 10086,10087
In [111]: t2.fillna(t2.mean()) # t2.mean() nan不参与计算均值。name,tel不是数值所以填充失败。
Out[111]:
name age tel
0 xyp 18 10086,10087
1 oynn 20 NaN
2 NaN 22 10086,10087
In [112]: t2.fillna(0)
Out[112]:
name age tel
0 xyp 18 10086,10087
1 oynn 20 0
2 0 22 10086,10087

import pandas as pd
file_path = 'IMDB-Movie-Data.csv'
df = pd.read_csv(file_path)
# 获取电影平均评分
print(df['Rating'].mean()) # 6.7232
# 导演人数
print(len(set(df['Director'].tolist()))) # 644
print(len(df['Director'].unique())) # 644
# 获取演员人数
temp_actors_list = df['Actors'].str.split(',').tolist()
actors_list = [j for i in temp_actors_list for j in i] # 双重循环推导式
actors_num = len(set(actors_list))
print(actors_num) # 2394
获取电影平均评分
数据合并
# 数组合并join,行合并
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df1 = pd.DataFrame(np.ones((2,4)),index=['A','B'],columns=list('abcd'))
In [4]: df1
Out[4]:
a b c d
A 1.0 1.0 1.0 1.0
B 1.0 1.0 1.0 1.0
In [5]: df2 = pd.DataFrame(np.zeros((3,3)),index=['A','B','C'],columns=list('xyz'))
In [6]: df2
Out[6]:
x y z
A 0.0 0.0 0.0
B 0.0 0.0 0.0
C 0.0 0.0 0.0
In [7]: df1.join(df2)
Out[7]:
a b c d x y z
A 1.0 1.0 1.0 1.0 0.0 0.0 0.0
B 1.0 1.0 1.0 1.0 0.0 0.0 0.0
In [8]: df2.join(df1)
Out[8]:
x y z a b c d
A 0.0 0.0 0.0 1.0 1.0 1.0 1.0
B 0.0 0.0 0.0 1.0 1.0 1.0 1.0
C 0.0 0.0 0.0 NaN NaN NaN NaN
# 数组合并merge,列合并
In [12]: df1
Out[12]:
a b c d
A 1.0 1.0 1.0 1.0
B 1.0 1.0 1.0 1.0
In [14]: df3 = pd.DataFrame(np.arange(9).reshape((3,3)),columns=list('fax'))
In [15]: df3
Out[15]:
f a x
0 0 1 2
1 3 4 5
2 6 7 8
In [16]: df1.merge(df3,on='a') # on交集合并,按照a列合并,df1中a列坐标(A,a)和(B,a),和df3中a列坐标(0,a)相等,df1中共2行,合并成2行
Out[16]:
a b c d f x
0 1.0 1.0 1.0 1.0 0 2
1 1.0 1.0 1.0 1.0 0 2
In [17]: df1.loc['A','a'] = 100
In [18]: df1
Out[18]:
a b c d
A 100.0 1.0 1.0 1.0
B 1.0 1.0 1.0 1.0
In [19]: df1.merge(df3,on='a') # on交集合并,按照a列合并,df1中a列坐标(B,a),和df3中a列坐标(0,a)相等,df1中共1行,合并成1行
Out[19]:
a b c d f x
0 1.0 1.0 1.0 1.0 0 2
In [20]: df1.merge(df3,on='a',how='inner') # inner内连接,交集
Out[20]:
a b c d f x
0 1.0 1.0 1.0 1.0 0 2
In [21]: df1.merge(df3,on='a',how='outer') # outer外连接,并集
Out[21]:
a b c d f x
0 100.0 1.0 1.0 1.0 NaN NaN
1 1.0 1.0 1.0 1.0 0.0 2.0
2 4.0 NaN NaN NaN 3.0 5.0
3 7.0 NaN NaN NaN 6.0 8.0
分组聚合
df.groupby(by='') # by值可以为字符串,单个字段分组;by值为[df[''], df[''], ...]多个字段分组
......
索引和复合索引
In [5]: df1
Out[5]:
a b c d
A 1.0 1.0 1.0 1.0
B 1.0 1.0 1.0 1.0
In [6]: df1.index # 获取索引
Out[6]: Index(['A', 'B'], dtype='object')
In [8]: df1.index = ['a', 'b'] # 指定索引
In [9]: df1
Out[9]:
a b c d
a 1.0 1.0 1.0 1.0
b 1.0 1.0 1.0 1.0
In [10]: df1.index
Out[10]: Index(['a', 'b'], dtype='object')
In [11]: df1.reindex(['a','f']) # 重新设置索引,用的比较少
Out[11]:
a b c d
a 1.0 1.0 1.0 1.0
f NaN NaN NaN NaN
In [12]: df1
Out[12]:
a b c d
a 1.0 1.0 1.0 1.0
b 1.0 1.0 1.0 1.0
In [13]: df1.set_index('a') # 指定某一列作为索引
Out[13]:
b c d
a
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
In [14]: df1.set_index('a',drop=False) # drop=False不删除指定索引列
Out[14]:
a b c d
a
1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0
In [31]: df1
Out[31]:
a b c d
A 100.0 1.0 1.0 1.0
B 1.0 1.0 1.0 1.0
In [32]: df1['d'].unique() # 对某一列来取值,且值不重复
Out[32]: array([1.])
In [33]: df1['a'].unique()
Out[33]: array([100., 1.])
In [36]: df1.set_index('b').index.unique()
Out[36]: Float64Index([1.0], dtype='float64', name='b')
In [37]: len(df1.set_index('b').index.unique())
Out[37]: 1
In [38]: len(df1.set_index('b').index)
Out[38]: 2
In [39]: list(df1.set_index('b').index)
Out[39]: [1.0, 1.0]
时间序列
# 时间序列
In [40]: pd.date_range(start='20190831',end='20190930',freq='D')
Out[40]:
DatetimeIndex(['2019-08-31', '2019-09-01', '2019-09-02', '2019-09-03',
'2019-09-04', '2019-09-05', '2019-09-06', '2019-09-07',
'2019-09-08', '2019-09-09', '2019-09-10', '2019-09-11',
'2019-09-12', '2019-09-13', '2019-09-14', '2019-09-15',
'2019-09-16', '2019-09-17', '2019-09-18', '2019-09-19',
'2019-09-20', '2019-09-21', '2019-09-22', '2019-09-23',
'2019-09-24', '2019-09-25', '2019-09-26', '2019-09-27',
'2019-09-28', '2019-09-29', '2019-09-30'],
dtype='datetime64[ns]', freq='D')
In [41]: pd.date_range(start='20190831',end='20190930',freq='10D')
Out[41]: DatetimeIndex(['2019-08-31', '2019-09-10', '2019-09-20', '2019-09-30'], dtype='datetime64[ns]', freq='10D')
In [42]: pd.date_range(start='20190831',periods=10,freq='D')
Out[42]:
DatetimeIndex(['2019-08-31', '2019-09-01', '2019-09-02', '2019-09-03',
'2019-09-04', '2019-09-05', '2019-09-06', '2019-09-07',
'2019-09-08', '2019-09-09'],
dtype='datetime64[ns]', freq='D')
In [43]: pd.date_range(start='20190831',periods=10,freq='M')
Out[43]:
DatetimeIndex(['2019-08-31', '2019-09-30', '2019-10-31', '2019-11-30',
'2019-12-31', '2020-01-31', '2020-02-29', '2020-03-31',
'2020-04-30', '2020-05-31'],
dtype='datetime64[ns]', freq='M')
pd.to_datetime(df['timeStamp'],format='')
# resample重采样,是对原样本重新处理的一个方法,是一个对常规时间序列数据重新采样和频率转换的便捷的方法。
降采样:高频数据到低频数据
升采样:低频数据到高频数据
df.resample('D')
# PeriodIndex重组时间序列,主要将数据中的分离的时间字段,重组为时间序列,并指定为index
period = pd.PeriodIndex(year=df['year'],month=df['month'],day=df['day'],hour=df['hour'],freq='H')