pandas数据预处理 / pandas data pre-processing
目录
Pandas起源
Python Data Analysis Library或pandas是基于NumPy的一种工具,该工具是为了解决数据分析任务而创建的。Pandas 纳入了大量库和一些标准的数据模型,提供了高效地操作大型数据集所需的工具。pandas提供了大量能使我们快速便捷地处理数据的函数和方法。
Pandas是python的一个数据分析包,最初由AQR Capital Management于2008年4月开发,并于2009年底开源出来,目前由专注于Python数据包开发的PyData开发team继续开发和维护,属于PyData项目的一部分。Pandas最初被作为金融数据分析工具而开发出来,因此,pandas为时间序列分析提供了很好的支持。 Pandas的名称来自于面板数据(panel data)和python数据分析(data analysis)。panel data是经济学中关于多维数据集的一个术语,在Pandas中也提供了panel的数据类型。
Pandas中的数据结构
Series:
一维数组,与Numpy中的一维Array类似。二者与Python基本的数据结构List也很相近,其区别是,List中的元素可以是不同的数据类型,而Array和Series中则只允许存储相同的数据类型,这样可以更有效的使用内存,提高运算效率。
Time- Series:
以时间为索引的Series。
DataFrame:
二维的表格型数据结构。很多功能与R中的data.frame类似。可以将DataFrame理解为Series的容器。以下的内容主要以DataFrame为主。
Panel:
三维的数组,可以理解为DataFrame的容器。
Pandas中一般的数据结构构成为DataFrame -> Series -> ndarray
环境安装:
pip install pandas
2.1 常量 / Constants
pass
2.2 函数 / Function
2.2.1 read_csv()函数
函数调用: info = pd.read_csv(filename)
函数功能:读取指定的csv文件,生成一个包含csv数据的DataFrame
传入参数: filename
filename: str类型,需要读取的文件名
返回参数: info
info: DataFrame类型,读取文件生成的DataFrame
类似方法还有: read_excel / read_json / read_sql / read_html等
2.2.2 isnull()函数
函数调用: bool = pd.isnull(obj)
函数功能:返回一个包含数据是否是null的信息数据
传入参数: obj
obj: DataFrame/Series类型,待判断的数据
返回参数: bool
bool: DataFrame/Series类型,返回的判断结果,True表示null,False则不是
2.2.3 to_datetime()函数
函数调用: date = pd.to_datetime(arg)
函数功能:将传入的数据转换成日期数据格式返回
传入参数: arg
arg: int/float/srting/datetime/list/tuple/1-d array/Series类型,argument,可传入一维数组或Series,0.18.1版本中加入DataFrame和dict-like结构
返回参数: date
date: 返回的数据类型由传入的参数确定
Note: pandas中通过to_datetime函数转换的而成的数据其dtype为datetime64[ns],该数据存在的Series可以通过.dt.month/year/day获取所需要的日期信息
2.3 类 / Class
2.3.1 DataFrame类
类实例化:df = pd.DataFrame(data, index=) / pd.read_xxx(file_name)
类的功能:用于生成DataFrame
传入参数: data, index / file_name
data: ndarray类型,包含需要构建成DataFrame的数据(二维)
index: Series类型,决定作为索引的列参数
file_name: str类型,需要读取的文件名
返回参数: df
df: DataFrame类型,生成的DataFrame
2.3.1.1 dtypes属性
属性调用: fmt = df.dtypes
属性功能: 返回数据结构中每列的数据类型(由于是多个,使用dtypes,numpy中单个,使用dtype)
属性参数: fmt
fmt: Series类型,包含每个数据值的数据类型,index为列名,value为类型,其中,object类型相当于Python中的string
2.3.1.2 columns属性
属性调用: index_name = df.columns
属性功能: 返回数据结构中每列的列名
属性参数: index_name
Index_name: Index类型,<class 'pandas.core.indexes.base.Index'>,包含每列的列名
2.3.1.3 shape属性方法
属性调用: shp = df.shape
属性功能: 返回数据结构的行列参数
属性参数: shp
shp: tuple类型,(row, column),返回行列数
2.3.1.4 loc属性
属性调用: index = df.loc
属性功能: 返回一个index的类
属性参数: index
index: obj类型,<class 'pandas.core.indexing._LocIndexer'>,可用于切片获取数据信息的DataFrame,如index[0]获取第一行,index[3:7]获取3-7行的数据
2.3.1.5 head()方法
函数调用: hdf = df.head(num=5)
函数功能: 返回csv列表中的前num行数据
传入参数: num
num: int类型,需要获取的行数
返回参数: hdf
hdf: DataFrame类型,原数据的前num行数据
2.3.1.6 tail()方法
函数调用: tdf = df.tail(num=5)
函数功能: 返回csv列表中的后num行数据
传入参数: num
num: int类型,需要获取的行数
返回参数: tdf
tdf: DataFrame类型,原数据的后num行数据
2.3.1.7 describe()方法
函数调用: ddf = df.describe()
函数功能: 返回csv列表中每个列的一些统计描述参数
返回参数: 无
返回参数: ddf
ddf: DataFrame类型,包括的信息有,每一列的数量count,均值mean,标准差std,最小值min,1/4位数25%,中位数50%,3/4位数75%,最大值max
2.3.1.8 sort_values()方法
函数调用: sdf = df.sort_values(by, axis=0, ascending=True, inplace=False, kind=’quicksort’, na_position=’last’)
函数功能: 返回按参数排序的DataFrame
传入参数: by, axis, ascending, inplace, kind, na_position
by: str类型,DataFrame的行/列名
axis: int类型,0按列(第一轴)sort,1按行(最后轴)sort
ascending: bool类型,True为升序排列, False为降序排列
inplace: bool类型,True则修改原DataFrame,False则返回新的DataFrame
kind: str类型,确定sort的排序算法,包括{‘quicksort’, ‘mergesort’, ‘heapsort’}
na_position: str类型,确定na数据存在的位置,‘first’/‘last’
返回参数: sdf
sdf: DataFrame类型,重排后的DataFrame
2.3.1.9 mean ()方法
函数调用: mdf = df.mean(axis=0)
函数功能: 返回存储所有非NaN的值的平均值DataFrame
传入参数: axis
axis: int类型,0按列(第一轴)sort,1按行(最后轴)sort
返回参数: mdf
mdf: DataFrame类型,存储均值的数据类型为float
2.3.1.10 pivot_table ()方法
函数调用: cdf = df.pivot_table(index=, values=, aggfunc=)
函数功能: 根据index将数据分组,对于values列的值(相同类型)执行aggfunc函数
传入参数: index, values, aggfunc
index: str类型,进行分组的列的列名
values: str/list类型,需要计算的列的列名,多个则使用list
aggfunc: method类型,需要调用的方法
返回参数: cdf
cdf: DataFrame类型,通过自定义函数运算后得到的DataFrame
2.3.1.11 dropna ()方法
函数调用: ddf = df.dropna(axis=0, how=’any’, thresh=None, subset=None, inplace=False)
函数功能: 根据要求删除带有NaN值的行列
传入参数: axis, how, thresh, subset, inplace
axis: int/str类型,搜索方向,0/‘index’为行搜索,1/‘columns’为列搜索
how: str类型,‘any’只要出现NA值就删除该行/列数据,‘all’所有值都是NA才删除
thresh: int/None类型,表示对有效数据数量的最小要求(为2则要求该行/列至少2个有效非NA数据存在)
subset: str/list类型,表示在特定子集中寻找NA
inplace: bool类型,表示是否在原数据操作,True修改原数据,False返回新数据
返回参数: cdf
cdf: DataFrame类型,通过删除NA函数运算后得到的DataFrame
2.3.1.12 reset_index ()方法
函数调用: rdf = df.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill=’’)
函数功能: 重置(一般为经过排序后的)DataFrame的序号
传入参数: level, drop, inplace, col_level, col_fill
level: int/str/tuple/list类型,Only remove the given levels from the index. Removes all levels by default
drop: bool类型,是否删除原始的index列,True删除,False保留
inplace: bool类型,是否在原数据上操作
col_level: int/str类型,If the columns have multiple levels, determines which level the labels are inserted into. By default it is inserted into the first level
col_fill: obj类型,If the columns have multiple levels, determines how the other levels are named. If None then the index name is repeated.
返回参数: rdf
rdf: DataFrame类型,通过重排index后的DataFrame
2.3.1.13 set_index ()方法
函数调用: sdf = df.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
函数功能: 根据现有的columns参数重新设置index索引
传入参数: keys, drop, append, inplace, verify_integrity
keys: str类型,需要作为索引的列名
drop: bool类型,是否删除作为索引的列,True删除,False保留
append: bool类型,是否添加默认的index(序号索引)
inplace: bool类型,是否在原数据上操作
verify_integrity: bool类型,Check the new index for duplicates. Otherwise defer the check until necessary. Setting to False will improve the performance of this method
返回参数: sdf
sdf: DataFrame类型,通过重设index后的DataFrame
2.3.1.14 apply ()方法
函数调用: re = df.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)
函数功能: 对DataFrame相应的方向使用自定义函数
传入参数: func, axis, broadcast, raw, reduce, args, **kwds
func: method类型,用于各行/列的函数
axis: int/str类型,0/‘index’对每列使用函数,1/‘column’对每行使用函数
broadcast: bool类型,For aggregation functions, return object of same size with values propagated
raw: bool类型,If False, convert each row or column into a Series. If raw=True the passed function will receive ndarray objects instead. If you are just applying a NumPy eduction function this will achieve much better performance
reduce: bool/None类型,Try to apply reduction procedures. If the DataFrame is empty, apply will use reduce to determine whether the result should be a Series or a DataFrame. If reduce is None (the default), apply's return value will be guessed by calling func an empty Series (note: while guessing, exceptions raised by func will be ignored). If reduce is True a Series will always be returned, and if False a DataFrame will always be returned
args: tuple类型,Positional arguments to pass to function in addition to the array/series
**kwds: 其余关键字参数将会被当做参数传给调用函数
返回参数: rdf
rdf: DataFrame类型,通过重排index后的DataFrame
2.3.1.15 ix属性
属性调用: ix_obj = df.ix
属性功能: 返回一个index类的数据
属性参数: ix_obj
ix_obj: obj类型,<class 'pandas.core.indexing._IXIndexer'>
Note: 后续可通过ix_obj[rows, cols]获取DataFrame或Series,rows/cols可以是需要取的行索引/列名
2.3.2 Series类
类实例化:sr = pd.Series(data, index=) / df[colomn_name]
类的功能:用于生成Series
传入参数: data, index / column_name
data: ndarray类型,包含需要构建成Series的数据(一维)
index: Series类型,决定作为索引的列参数
column_name: str类型,需要获取Series的列名
返回参数: sr
sr: Series类型,生成的Series
2.3.2.1 values属性
属性调用: values = sr.values
属性功能: 返回Series的所有value值
属性参数: values
values: ndarray类型,Series的所有值形成的一维ndarray
2.3.2.2 tolist()方法
函数调用: list =sr.tolist()
函数功能:将Series或Index类的数据变成list形式返回
传入参数: 无
返回参数: list
list: list类型,返回的数据列表
2.3.2.3 max/min()方法
函数调用: value =sr.max/min()
函数功能:获取Series中的最大/最小值
传入参数: 无
返回参数: value
value: int/str等类型,返回的最值
2.3.2.4 sort_values()方法
函数调用: ssr = sr.sort_values(axis=0, ascending=True, inplace=False, kind=’quicksort’, na_position=’last’)
函数功能: 返回按参数排序的Series
传入参数: axis, ascending, inplace, kind, na_position
axis: int类型,0按列(第一轴)sort,1按行(最后轴)sort
ascending: bool类型,True为升序排列, False为降序排列
inplace: bool类型,True则修改原DataFrame,False则返回新的DataFrame
kind: str类型,确定sort的排序算法,包括{‘quicksort’, ‘mergesort’, ‘heapsort’}
na_position: str类型,确定na数据存在的位置,‘first’/‘last’
返回参数: ssr
ssr: Series类型,重排后的Series
2.3.2.5 mean ()方法
函数调用: msr = sr.mean()
函数功能: 返回存储所有非NaN的值的平均值Series
传入参数: 无
返回参数: msr
msr: Series类型,存储均值的数据类型为float
2.3.2.6 reset_index ()方法
函数调用: rsr = sr.reset_index(level=None, drop=False, name=None, inplace=False)
函数功能: 重置(一般为经过排序后的)Series的序号
传入参数: level, drop, name, inplace
level: int/str/tuple/list类型,Only remove the given levels from the index. Removes all levels by default
drop: bool类型,是否删除原始的index列,True删除,False保留
name: obj类型,The name of the column corresponding to the Series values
inplace: bool类型,是否在原数据上操作
返回参数: rsr
rsr: Series类型,通过重排index后的Series
2.3.2.7 value_counts ()方法
函数调用: csr = sr.value_counts(dropna=True)
函数功能: 计算Series中各个values值的数量
传入参数: dropna
dropna: bool类型,是否计算NA的数量,True不计算,False计算
返回参数: csr
csr: Series类型,各数据值为索引,数量为value的Series
首先对csv文件进行读取操作,利用read_csv函数,值得注意的是,存储的csv文件必须利用Excel另存为的方式获得,而不能以修改后缀名的方法得到。
1 import pandas as pd 2 3 # info = pd.read_csv('info.csv', encoding='latin1') 4 # info = pd.read_csv('info.csv', encoding='ISO-8859-1') 5 # info = pd.read_csv('info.csv', encoding='cp1252') 6 info = pd.read_csv('info.csv') 7 # Get the info of whole csv list, and the info of row and column 8 print(info)
输出结果为
No. Type Info Number Rank Mark. 0 1001 BUTTER_1 NaN 4.000000 A cc 1 1002 BUTTER_2 NaN NaN C dd 2 1003 BUTTER_3 NaN NaN NaN ff 3 1004 BUTTER_4 NaN NaN NaN NaN 4 1005 BUTTER_5 df 543.000000 F cx 5 1006 BUTTER_6 fa 345.000000 A cc 6 1007 BUTTER_7 jhf 67.000000 S dd 7 1008 BUTTER_8 ad 567.000000 S ff 8 1009 BUTTER_9 gdfs 34.000000 C aa 9 1010 BUTTER_10 vczx 34.000000 C cx 10 1011 BUTTER_11 as 89.000000 E cc 11 1012 BUTTER_12 cd 90.000000 D dd 12 1013 BUTTER_13 qwe 14.000000 S ff 13 1014 WATER_1 asd 186.635198 A aa 14 1015 WATER_2 as 222.000000 B cc 15 1016 WATER_3 fa 193.026806 A cc 16 1017 WATER_4 jhf 196.222611 C dd 17 1018 WATER_5 ad 199.418415 B ff 18 1019 WATER_6 gdfs 202.614219 D aa 19 1020 WATER_7 vczx 205.810023 F cx 20 1021 WATER_8 as 209.005827 A cc 21 1022 WATER_9 cd 212.201632 S dd 22 1023 WATER_10 qwe 215.397436 S ff 23 1024 WATER_11 asd 218.593240 C aa 24 1025 WATER_12 df 221.789044 C cx 25 1026 WATER_13 fa 224.984848 E cc 26 1027 WATER_14 jhf 228.180653 D dd 27 1028 WATER_15 ad 231.376457 S ff 28 1029 WATER_16 gdfs 234.572261 A aa 29 1030 WATER_17 vczx 237.768065 B cx .. ... ... ... ... ... ... 70 1071 CHEESE_11 as 368.796037 E cc 71 1072 CHEESE_12 cd 371.991842 D dd 72 1073 CHEESE_13 qwe 375.187646 S ff 73 1074 CHEESE_14 asd 378.383450 A aa 74 1075 CHEESE_15 df 381.579254 B cx 75 1076 CHEESE_16 fa 384.775058 A cc 76 1077 CHEESE_17 jhf 387.970863 C dd 77 1078 CHEESE_18 ad 391.166667 B ff 78 1079 CHEESE_19 gdfs 394.362471 D aa 79 1080 CHEESE_20 vczx 397.558275 F cx 80 1081 CHEESE_21 as 400.754079 A cc 81 1082 CHEESE_22 cd 403.949883 S dd 82 1083 CHEESE_23 qwe 407.145688 S ff 83 1084 CHEESE_24 asd 410.341492 C aa 84 1085 CHEESE_25 df 413.537296 C cx 85 1086 MILK_1 fa 416.733100 E cc 86 1087 MILK_2 jhf 419.928904 D dd 87 1088 MILK_3 ad 423.124709 S ff 88 1089 MILK_4 gdfs 426.320513 A aa 89 1090 MILK_5 vczx 429.516317 B cx 90 1091 MILK_6 as 432.712121 A cc 91 1092 MILK_7 cd 435.907925 C dd 92 1093 MILK_8 qwe 439.103730 B ff 93 1094 MILK_9 asd 442.299534 D aa 94 1095 MILK_10 df 445.495338 F cx 95 1096 MILK_11 fa 448.691142 A cc 96 1097 MILK_12 jhf 451.886946 S dd 97 1098 MILK_13 ad 455.082751 S ff 98 1099 MILK_14 gdfs 458.278555 C aa 99 1100 MILK_15 vczx 461.474359 C cx [100 rows x 6 columns]
可以看到,pandas已经将csv文件中的数据成功导入
接着可以查看导入的数据类型
1 # Get the type of info 2 print(type(info)) # <class 'pandas.core.frame.DataFrame'> 3 print('-----------') 4 # Get the type of each column(The object dtype equal to the string type in python) 5 print(info.dtypes) ''' No. int64 6 Type object 7 Info object 8 Number float64 9 Rank object 10 Mark. object 11 dtype: object '''
最后还可以利用基本函数获取前/后 n 行,列名信息以及基本描述等
1 # Get the first x row of csv list, default is 5 2 print(info.head(7)) 3 print('-----------') 4 # Get the last x row of csv list, default is 5 5 print(info.tail(7)) 6 print('-----------') 7 # Get the name of each column 8 print(info.columns) 9 print('-----------') 10 # Get the shape of csv list 11 print(info.shape) 12 print('-----------') 13 # Get the statistics parameter of cvs list(for digit data) 14 # Such as count, mean, standard deviation, min, 25%, 50%, 75%, max 15 print(info.describe())
输出结果
No. Type Info Number Rank Mark. 0 1001 BUTTER_1 NaN 4.0 A cc 1 1002 BUTTER_2 NaN NaN C dd 2 1003 BUTTER_3 NaN NaN NaN ff 3 1004 BUTTER_4 NaN NaN NaN NaN 4 1005 BUTTER_5 df 543.0 F cx 5 1006 BUTTER_6 fa 345.0 A cc 6 1007 BUTTER_7 jhf 67.0 S dd ----------- No. Type Info Number Rank Mark. 93 1094 MILK_9 asd 442.299534 D aa 94 1095 MILK_10 df 445.495338 F cx 95 1096 MILK_11 fa 448.691142 A cc 96 1097 MILK_12 jhf 451.886946 S dd 97 1098 MILK_13 ad 455.082751 S ff 98 1099 MILK_14 gdfs 458.278555 C aa 99 1100 MILK_15 vczx 461.474359 C cx ----------- Index(['No.', 'Type', 'Info', 'Number', 'Rank', 'Mark.'], dtype='object') ----------- (100, 6) ----------- No. Number count 100.000000 97.000000 mean 1050.500000 309.401389 std 29.011492 110.975188 min 1001.000000 4.000000 25% 1025.750000 240.963869 50% 1050.500000 317.663170 75% 1075.250000 391.166667 max 1100.000000 567.000000
对于pandas,由于其基本结构是基于numpy的ndarray,因此numpy的基本计算操作对于pandas的DataFrame及Series也都适用。
下面是pandas的一些基本计算方法的示例,
完整代码
1 import pandas as pd 2 3 info = pd.read_csv('info.csv') 4 # Get the certain row of csv list 5 print(info.loc[0]) 6 print(info.loc[3:7]) 7 print('----------') 8 # Get certain column(columns) by column name(name list) 9 print(info['Type']) 10 print(info[['Type', 'No.']]) 11 # Get the column name and save it as a list 12 col_names = info.columns.tolist() 13 print(col_names) 14 15 # Filter off the column name that end with '.' 16 dotList = [] 17 for n in col_names: 18 if n.endswith('.'): 19 dotList.append(n) 20 newList = info[dotList] 21 print(newList) 22 23 # Operation for column will act to each element as numpy does 24 print(info['Number'] * 10) 25 26 # Operation for two csv with same shape will act each corresponding element 27 x = info['Number'] 28 y = info['No.'] 29 print(x+y) 30 # Act for string 31 x = info['Rank'] 32 y = info['Mark.'] 33 print(x+y) 34 35 # Add a column after the tail column(the dimension of new one should be same as origin) 36 print(info.shape) 37 info['New'] = x+y 38 print(info.shape) 39 print('----------') 40 41 # Get the max/min value of a column 42 print(info['Number'].max()) 43 print(info['Number'].min()) 44 45 num = info['Number'] 46 num_null_true = pd.isnull(num) 47 # If these is a null value in DataFrame, the calculated result will be NaN 48 print(sum(info['Number'])/len(info['Number'])) # return nan 49 # Use the DataFrame == False to reverse the DataFrame 50 good_value = info['Number'][num_null_true == False] 51 print(sum(good_value)/len(good_value)) 52 print(good_value.mean()) 53 # mean method can filter off the missing data automatically 54 print(info['Number'].mean()) 55 print('---------')
分段解释
首先导入pandas及数据文件,利用loc获取pandas的某行数据,可以使用类似list的切片操作
1 import pandas as pd 2 3 info = pd.read_csv('info.csv') 4 # Get the certain row of csv list 5 print(info.loc[0]) 6 print(info.loc[3:7]) 7 print('----------') 8 # Get certain column(columns) by column name(name list) 9 print(info['Type']) 10 print(info[['Type', 'No.']])
结果如下,内容较长
No. 1001 Type BUTTER_1 Info NaN Number 4 Rank A Mark. cc Name: 0, dtype: object No. Type Info Number Rank Mark. 3 1004 BUTTER_4 NaN NaN NaN NaN 4 1005 BUTTER_5 df 543.0 F cx 5 1006 BUTTER_6 fa 345.0 A cc 6 1007 BUTTER_7 jhf 67.0 S dd 7 1008 BUTTER_8 ad 567.0 S ff ---------- 0 BUTTER_1 1 BUTTER_2 2 BUTTER_3 3 BUTTER_4 4 BUTTER_5 5 BUTTER_6 6 BUTTER_7 7 BUTTER_8 8 BUTTER_9 9 BUTTER_10 10 BUTTER_11 11 BUTTER_12 12 BUTTER_13 13 WATER_1 14 WATER_2 15 WATER_3 16 WATER_4 17 WATER_5 18 WATER_6 19 WATER_7 20 WATER_8 21 WATER_9 22 WATER_10 23 WATER_11 24 WATER_12 25 WATER_13 26 WATER_14 27 WATER_15 28 WATER_16 29 WATER_17 ... 70 CHEESE_11 71 CHEESE_12 72 CHEESE_13 73 CHEESE_14 74 CHEESE_15 75 CHEESE_16 76 CHEESE_17 77 CHEESE_18 78 CHEESE_19 79 CHEESE_20 80 CHEESE_21 81 CHEESE_22 82 CHEESE_23 83 CHEESE_24 84 CHEESE_25 85 MILK_1 86 MILK_2 87 MILK_3 88 MILK_4 89 MILK_5 90 MILK_6 91 MILK_7 92 MILK_8 93 MILK_9 94 MILK_10 95 MILK_11 96 MILK_12 97 MILK_13 98 MILK_14 99 MILK_15 Name: Type, Length: 100, dtype: object Type No. 0 BUTTER_1 1001 1 BUTTER_2 1002 2 BUTTER_3 1003 3 BUTTER_4 1004 4 BUTTER_5 1005 5 BUTTER_6 1006 6 BUTTER_7 1007 7 BUTTER_8 1008 8 BUTTER_9 1009 9 BUTTER_10 1010 10 BUTTER_11 1011 11 BUTTER_12 1012 12 BUTTER_13 1013 13 WATER_1 1014 14 WATER_2 1015 15 WATER_3 1016 16 WATER_4 1017 17 WATER_5 1018 18 WATER_6 1019 19 WATER_7 1020 20 WATER_8 1021 21 WATER_9 1022 22 WATER_10 1023 23 WATER_11 1024 24 WATER_12 1025 25 WATER_13 1026 26 WATER_14 1027 27 WATER_15 1028 28 WATER_16 1029 29 WATER_17 1030 .. ... ... 70 CHEESE_11 1071 71 CHEESE_12 1072 72 CHEESE_13 1073 73 CHEESE_14 1074 74 CHEESE_15 1075 75 CHEESE_16 1076 76 CHEESE_17 1077 77 CHEESE_18 1078 78 CHEESE_19 1079 79 CHEESE_20 1080 80 CHEESE_21 1081 81 CHEESE_22 1082 82 CHEESE_23 1083 83 CHEESE_24 1084 84 CHEESE_25 1085 85 MILK_1 1086 86 MILK_2 1087 87 MILK_3 1088 88 MILK_4 1089 89 MILK_5 1090 90 MILK_6 1091 91 MILK_7 1092 92 MILK_8 1093 93 MILK_9 1094 94 MILK_10 1095 95 MILK_11 1096 96 MILK_12 1097 97 MILK_13 1098 98 MILK_14 1099 99 MILK_15 1100 [100 rows x 2 columns]
获取pandas的列名
1 # Get the column name and save it as a list 2 col_names = info.columns.tolist() 3 print(col_names)
结果如下
['No.', 'Type', 'Info', 'Number', 'Rank', 'Mark.']
过滤出所有以‘.’结尾的列
1 # Filter off the column name that end with '.' 2 dotList = [] 3 for n in col_names: 4 if n.endswith('.'): 5 dotList.append(n) 6 newList = info[dotList] 7 print(newList)
基本计算操作会作用于pandas的Series每个值
1 # Operation for column will act to each element as numpy does 2 print(info['Number'] * 10)
对两个结构形状相同的Series,其运算会作用到每个values上
1 # Operation for two csv with same shape will act each corresponding element 2 x = info['Number'] 3 y = info['No.'] 4 print(x+y) 5 # Act for string 6 x = info['Rank'] 7 y = info['Mark.'] 8 print(x+y)
创建出一个列名为‘New’的新列,值为两个列的值之和
1 # Add a column after the tail column(the dimension of new one should be same as origin) 2 print(info.shape) 3 info['New'] = x+y 4 print(info.shape) 5 print('----------')
获取Series中的最值
1 # Get the max/min value of a column 2 print(info['Number'].max()) 3 print(info['Number'].min())
均值计算的两种方式,
- 直接求和平均,当计算中有NaN值时,计算的结果将会为NaN
- 利用mean函数进行计算,mean函数将会过自动滤掉NaN缺失数据
1 num = info['Number'] 2 num_null_true = pd.isnull(num) 3 # If these is a null value in DataFrame, the calculated result will be NaN 4 print(sum(info['Number'])/len(info['Number'])) # return nan 5 # Use the DataFrame == False to reverse the DataFrame 6 good_value = info['Number'][num_null_true == False] 7 print(sum(good_value)/len(good_value)) 8 print(good_value.mean()) 9 # mean method can filter off the missing data automatically 10 print(info['Number'].mean()) 11 print('---------')
下面介绍 pandas 中的数据类型 Series 的一些基本使用方法,
完整代码
1 import pandas as pd 2 3 info = pd.read_csv('info.csv') 4 5 # Fetch a series from DataFrame 6 rank_series = info['Rank'] 7 print(type(info)) # <class 'pandas.core.frame.DataFrame'> 8 print(type(rank_series)) # <class 'pandas.core.series.Series'> 9 print(rank_series[0:5]) 10 11 # New a series 12 from pandas import Series 13 # Build a rank series 14 rank = rank_series.values 15 print(rank) 16 # DataFrame --> Series --> ndarray 17 print(type(rank)) # <class 'numpy.ndarray'> 18 # Build a type series 19 type_series = info['Type'] 20 types = type_series.values 21 # Build a new series based on former two(type and rank) 22 # Series(values, index=) 23 series_custom = Series(rank, index=types) 24 print(series_custom) 25 # Fetch Series by key name list 26 print(series_custom[['MILK_14', 'MILK_15']]) 27 # Fetch Series by index 28 print(series_custom[0:2]) 29 30 # Sorted to Series will return a list by sorted value 31 print(sorted(series_custom, key=lambda x: 0 if isinstance(x, str) else x)) 32 33 # Re-sort by index for a Series 34 original_index = series_custom.index.tolist() 35 sorted_index = sorted(original_index) 36 sorted_by_index = series_custom.reindex(sorted_index) 37 print(sorted_by_index) 38 # Series sort function 39 print(series_custom.sort_index()) 40 print(series_custom.sort_values()) 41 42 import numpy as np 43 # Add operation for Series will add the values for each row(if the dimensions of two series are same) 44 print(np.add(series_custom, series_custom)) 45 # Apply sin funcion to each value 46 print(np.sin(info['Number'])) 47 # Return the max value(return a single value not a Series) 48 # If more than one max value exist, only return one 49 print(np.max(filter(lambda x: isinstance(x, float), series_custom))) 50 51 # Filter values in range 52 criteria_one = series_custom > 'C' 53 criteria_two = series_custom < 'S' 54 print(series_custom[criteria_one & criteria_two])
分段解释
利用列名从DataFrame中获取一个Series
1 import pandas as pd 2 3 info = pd.read_csv('info.csv') 4 5 # Fetch a series from DataFrame 6 rank_series = info['Rank'] 7 print(type(info)) # <class 'pandas.core.frame.DataFrame'> 8 print(type(rank_series)) # <class 'pandas.core.series.Series'> 9 print(rank_series[0:5])
新建一个Series的方法,先获取一个作为index的列,在获取一个作为values的列,利用Series函数生成新的Series
1 # New a series 2 from pandas import Series 3 # Build a rank series 4 rank = rank_series.values 5 print(rank) 6 # DataFrame --> Series --> ndarray 7 print(type(rank)) # <class 'numpy.ndarray'> 8 # Build a type series 9 type_series = info['Type'] 10 types = type_series.values 11 # Build a new series based on former two(type and rank) 12 # Series(values, index=) 13 series_custom = Series(rank, index=types) 14 print(series_custom)
利用列名列表或索引从DataFrame中获取多个Series
1 # Fetch Series by key name list 2 print(series_custom[['MILK_14', 'MILK_15']]) 3 # Fetch Series by index 4 print(series_custom[0:2])
利用sorted函数根据values大小重排Series,返回值为一个list
1 # Sorted to Series will return a list by sorted value 2 print(sorted(series_custom, key=lambda x: 0 if isinstance(x, str) else x))
两种sort方法对Series进行排列
1. 获取index索引值,对索引值进行排列,再使用reindex函数获取新的Series
1 # Re-sort by index for a Series 2 original_index = series_custom.index.tolist() 3 sorted_index = sorted(original_index) 4 sorted_by_index = series_custom.reindex(sorted_index) 5 print(sorted_by_index)
2.使用sort_index或sort_values函数
1 # Series sort function 2 print(series_custom.sort_index()) 3 print(series_custom.sort_values())
Series的相加/正余弦/max,利用numpy函数,将Series的对应values值进行处理
1 import numpy as np 2 # Add operation for Series will add the values for each row(if the dimensions of two series are same) 3 print(np.add(series_custom, series_custom)) 4 # Apply sin funcion to each value 5 print(np.sin(info['Number'])) 6 # Return the max value(return a single value not a Series) 7 # If more than one max value exist, only return one 8 print(np.max(filter(lambda x: isinstance(x, float), series_custom)))
利用True/False列表获取在范围内满足条件的Series
1 # Filter values in range 2 criteria_one = series_custom > 'C' 3 criteria_two = series_custom < 'S' 4 print(series_custom[criteria_one & criteria_two])
下面是一些pandas常用的函数示例
完整代码
1 import pandas as pd 2 import numpy as np 3 4 info = pd.read_csv('info.csv') 5 6 # Sort value by column 7 # inplace is True will sort value base on origin, False will return a new DataFrame 8 new = info.sort_values('Mark.', inplace=False, na_position='last') 9 print(new) 10 # Sorted by ascending order in default(ascending=True) 11 # No matter ascending or descending sort, the NaN(NA, missing value) value will be placed at tail 12 info.sort_values('Mark.', inplace=True, ascending=False) 13 print(info) 14 print('---------') 15 # Filter off the null row 16 num = info['Number'] 17 # isnull will return a list contains the status of null or not, True for null, False for not 18 num_null_true = pd.isnull(num) 19 print(num_null_true) 20 num_null = num[num_null_true] 21 print(num_null) # 12 NaN 22 print('---------') 23 24 # pivot_table function can calulate certain para that with same attribute group by using certain function 25 # index tells the method which column to group by 26 # value is the column that we want to apply the calculation to 27 # aggfunc specifies the calculation we want to perform, default function is mean 28 avg_by_rank = info.pivot_table(index='Rank', values='Number', aggfunc=np.sum) 29 print(avg_by_rank) 30 print('---------') 31 # Operate to multi column 32 sum_by_rank = info.pivot_table(index='Rank', values=['Number', 'No.'], aggfunc=np.sum) 33 print(sum_by_rank) 34 print('---------') 35 36 # dropna function can drop any row/columns that have null values 37 info = pd.read_csv('info.csv') 38 # Drop the columns that contain NaN (axis=0 for row) 39 drop_na_column = info.dropna(axis=1) 40 print(drop_na_column) 41 print('---------') 42 # Drop the row that subset certains has NaN 43 # thresh to decide how many valid value required 44 drop_na_row = info.dropna(axis=0, thresh=1, subset=['Number', 'Info', 'Rank', 'Mark.']) 45 print(drop_na_row) 46 print('---------') 47 # Locate to a certain value by its row number(plus 1 for No.) and column name 48 print(info) 49 row_77_Rank = info.loc[77, 'Rank'] 50 print(row_77_Rank) 51 row_88_Info = info.loc[88, 'Info'] 52 print(row_88_Info) 53 print('---------') 54 55 # reset_index can reset the index for sorted DataFrame 56 new_info = info.sort_values('Rank', ascending=False) 57 print(new_info[0:10]) 58 print('---------') 59 # drop=True will drop the index column, otherwise will keep former index colunn (default False) 60 reset_new_info = new_info.reset_index(drop=True) 61 print(reset_new_info[0:10]) 62 print('---------') 63 64 # Define your own function for pandas 65 # Use apply function to implement your own function 66 def hundredth_row(col): 67 hundredth_item = col.loc[99] 68 return hundredth_item 69 hundred_row = info.apply(hundredth_row, axis=0) 70 print(hundred_row) 71 print('---------') 72 # Null count 73 # The apply function will act to each column 74 def null_count(column): 75 column_null = pd.isnull(column) 76 null = column[column_null] 77 return len(null) 78 # Passing in axis para 0 to iterate over rows instead of column 79 # Note: 0 for act by row but passing by column, 1 for act by column but passing by row 80 # Passing by column can act for each column then get row 81 # Passing by row can act for each row than get column 82 column_null_count = info.apply(null_count, axis=0) 83 print(column_null_count) 84 print('---------') 85 86 # Example: classify the data by Rank, and calculate the sum for each 87 def rank_sort(row): 88 rank = row['Rank'] 89 if rank == 'S': 90 return 'Excellent' 91 elif rank == 'A': 92 return 'Great' 93 elif rank == 'B': 94 return 'Good' 95 elif rank == 'C': 96 return 'Pass' 97 else: 98 return 'Failed' 99 # Format a classified column 100 rank_info = info.apply(rank_sort, axis=1) 101 print(rank_info) 102 print('---------') 103 # Add the column to DataFrame 104 info['Rank_Classfied'] = rank_info 105 # Calculate the sum of 'Number' according to 'Rank_Classfied' 106 new_rank_number = info.pivot_table(index='Rank_Classfied', values='Number', aggfunc=np.sum) 107 print(new_rank_number) 108 109 # set_index will return a new DataFrame that is indexed by values in the specified column 110 # And will drop that column(default is True) 111 # The column set to be index will not be dropped if drop=False 112 index_type = info.set_index('Type', drop=False, append=True) 113 print(index_type) 114 print('---------') 115 116 # Use string index to slice the DataFrame 117 # Note: the index(key) should be unique 118 print(index_type['MILK_1':'MILK_7']) 119 print('---------') 120 print(index_type.loc['MILK_1':'MILK_7']) 121 # Value index is available too 122 print('---------') 123 print(index_type[-15:-8]) 124 print('---------') 125 126 # Calculate the standard deviation for each element from two different index 127 cal_list = info[['Number', 'No.']] 128 # np.std([x, y]) --> std value 129 # The lambda x is a Series 130 # cal_list.apply(lambda x: print(type(x)), axis=1) 131 print(cal_list.apply(lambda x: np.std(x), axis=1))
分段解释
首先导入模块,然后利用sort_values函数对DataFrame或Series进行排序操作
1 mport pandas as pd 2 import numpy as np 3 4 info = pd.read_csv('info.csv') 5 6 # Sort value by column 7 # inplace is True will sort value base on origin, False will return a new DataFrame 8 new = info.sort_values('Mark.', inplace=False, na_position='last') 9 print(new) 10 # Sorted by ascending order in default(ascending=True) 11 # No matter ascending or descending sort, the NaN(NA, missing value) value will be placed at tail 12 info.sort_values('Mark.', inplace=True, ascending=False) 13 print(info) 14 print('---------')
利用isnull函数对null值的数据进行过滤,可利用Series==False对isnull得到的序列进行反转
1 # Filter off the null row 2 num = info['Number'] 3 # isnull will return a list contains the status of null or not, True for null, False for not 4 num_null_true = pd.isnull(num) 5 print(num_null_true) 6 num_null = num[num_null_true] 7 print(num_null) # 12 NaN 8 print('---------')
利用pivot_table函数对相同属性分组的数据进行指定函数的计算
1 # pivot_table function can calulate certain para that with same attribute group by using certain function 2 # index tells the method which column to group by 3 # value is the column that we want to apply the calculation to 4 # aggfunc specifies the calculation we want to perform, default function is mean 5 avg_by_rank = info.pivot_table(index='Rank', values='Number', aggfunc=np.sum) 6 print(avg_by_rank) 7 print('---------') 8 # Operate to multi column 9 sum_by_rank = info.pivot_table(index='Rank', values=['Number', 'No.'], aggfunc=np.sum) 10 print(sum_by_rank) 11 print('---------')
利用dropna函数删除空值数据
1 # dropna function can drop any row/columns that have null values 2 info = pd.read_csv('info.csv') 3 # Drop the columns that contain NaN (axis=0 for row) 4 drop_na_column = info.dropna(axis=1) 5 print(drop_na_column) 6 print('---------') 7 # Drop the row that subset certains has NaN 8 # thresh to decide how many valid value required 9 drop_na_row = info.dropna(axis=0, thresh=1, subset=['Number', 'Info', 'Rank', 'Mark.']) 10 print(drop_na_row) 11 print('---------')
利用loc对数据进行定位
1 # Locate to a certain value by its row number(plus 1 for No.) and column name 2 print(info) 3 row_77_Rank = info.loc[77, 'Rank'] 4 print(row_77_Rank) 5 row_88_Info = info.loc[88, 'Info'] 6 print(row_88_Info) 7 print('---------')
利用reset_index函数对索引进行重排
1 # reset_index can reset the index for sorted DataFrame 2 new_info = info.sort_values('Rank', ascending=False) 3 print(new_info[0:10]) 4 print('---------') 5 # drop=True will drop the index column, otherwise will keep former index colunn (default False) 6 reset_new_info = new_info.reset_index(drop=True) 7 print(reset_new_info[0:10]) 8 print('---------')
利用apply函数运行自定义函数
1 # Define your own function for pandas 2 # Use apply function to implement your own function 3 def hundredth_row(col): 4 hundredth_item = col.loc[99] 5 return hundredth_item 6 hundred_row = info.apply(hundredth_row, axis=0) 7 print(hundred_row) 8 print('---------') 9 # Null count 10 # The apply function will act to each column 11 def null_count(column): 12 column_null = pd.isnull(column) 13 null = column[column_null] 14 return len(null) 15 # Passing in axis para 0 to iterate over rows instead of column 16 # Note: 0 for act by row but passing by column, 1 for act by column but passing by row 17 # Passing by column can act for each column then get row 18 # Passing by row can act for each row than get column 19 column_null_count = info.apply(null_count, axis=0) 20 print(column_null_count) 21 print('---------') 22 23 # Example: classify the data by Rank, and calculate the sum for each 24 def rank_sort(row): 25 rank = row['Rank'] 26 if rank == 'S': 27 return 'Excellent' 28 elif rank == 'A': 29 return 'Great' 30 elif rank == 'B': 31 return 'Good' 32 elif rank == 'C': 33 return 'Pass' 34 else: 35 return 'Failed' 36 # Format a classified column 37 rank_info = info.apply(rank_sort, axis=1) 38 print(rank_info) 39 print('---------')
添加一个column到DataFrame并进行计算处理
1 # Add the column to DataFrame 2 info['Rank_Classfied'] = rank_info 3 # Calculate the sum of 'Number' according to 'Rank_Classfied' 4 new_rank_number = info.pivot_table(index='Rank_Classfied', values='Number', aggfunc=np.sum) 5 print(new_rank_number)
利用set_index函数设置新的索引,利用索引进行切片操作,切片如果是列名字符串,将返回两个列名索引之间所有的数据
1 # set_index will return a new DataFrame that is indexed by values in the specified column 2 # And will drop that column(default is True) 3 # The column set to be index will not be dropped if drop=False 4 index_type = info.set_index('Type', drop=False, append=True) 5 print(index_type) 6 print('---------') 7 8 # Use string index to slice the DataFrame 9 # Note: the index(key) should be unique 10 print(index_type['MILK_1':'MILK_7']) 11 print('---------') 12 print(index_type.loc['MILK_1':'MILK_7']) 13 # Value index is available too 14 print('---------') 15 print(index_type[-15:-8]) 16 print('---------')
对两个不同索引内的元素分别进行标准差计算
1 # Calculate the standard deviation for each element from two different index 2 cal_list = info[['Number', 'No.']] 3 # np.std([x, y]) --> std value 4 # The lambda x is a Series 5 # cal_list.apply(lambda x: print(type(x)), axis=1) 6 print(cal_list.apply(lambda x: np.std(x), axis=1))
1. pandas许多函数底层是基于numpy进行的,pandas一个函数可能调用了numpy的多个函数进行实现;
2. object dtype 和 Python中的string相同;
3. pandas中如果不指定列名则默认文件中第一行为列名;
4. 基本结构包括DataFrame和Series,DataFrame可以分解为Series,DataFrame是由一系列的Series组成的,DataFrame相当于矩阵,Series相当于行或者列。
相关阅读
1. numpy 的使用