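These notes record some basic pandas methods. Being fluent with them goes a long way when you process data.
I. Creating objects
pandas has two main data structures: Series and DataFrame. Basic create, read, update and delete operations on both were covered in an earlier post; if you have questions, see https://www.cnblogs.com/ppzhang/p/13747910.html
II. Viewing data
This section mainly covers viewing the data in a two-dimensional DataFrame.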
1. head(): view rows from the top down

```python
import pandas as pd
import numpy as np


def create_numpay_dataform():
    # Scalar values ('A', 'B', 'F') are broadcast to the length of the other columns
    df2 = pd.DataFrame({'A': 1,
                        'B': pd.Timestamp('20130102'),
                        'C': pd.Series(3, index=list(range(4)), dtype='float32'),
                        'D': np.array([3] * 4, dtype='int32'),
                        'E': pd.Categorical(["test", "train", "test", "train"]),
                        'F': 'foo'})
    return df2


def print_data():
    df = create_numpay_dataform()
    print("-------original data-----")
    print(df)
    print("-------first 1 row-----")
    print(df.head(1))
    print("-------first 3 rows-----")
    print(df.head(3))


if __name__ == '__main__':
    print_data()
```

Output:

```
-------original data-----
   A          B    C  D      E    F
0  1 2013-01-02  3.0  3   test  foo
1  1 2013-01-02  3.0  3  train  foo
2  1 2013-01-02  3.0  3   test  foo
3  1 2013-01-02  3.0  3  train  foo
-------first 1 row-----
   A          B    C  D     E    F
0  1 2013-01-02  3.0  3  test  foo
-------first 3 rows-----
   A          B    C  D      E    F
0  1 2013-01-02  3.0  3   test  foo
1  1 2013-01-02  3.0  3  train  foo
2  1 2013-01-02  3.0  3   test  foo
```
2. tail(): view rows from the bottom up

```python
import pandas as pd
import numpy as np


def create_numpay_dataform():
    df2 = pd.DataFrame({'A': 1,
                        'B': pd.Timestamp('20130102'),
                        'C': pd.Series(3, index=list(range(4)), dtype='float32'),
                        'D': np.array([3] * 4, dtype='int32'),
                        'E': pd.Categorical(["test", "train", "test", "train"]),
                        'F': 'foo'})
    return df2


def print_data():
    df = create_numpay_dataform()
    print("-------original data-----")
    print(df)
    print("-------last 1 row-----")
    print(df.tail(1))
    print("-------last 2 rows-----")
    print(df.tail(2))


if __name__ == '__main__':
    print_data()
```

Output:

```
-------original data-----
   A          B    C  D      E    F
0  1 2013-01-02  3.0  3   test  foo
1  1 2013-01-02  3.0  3  train  foo
2  1 2013-01-02  3.0  3   test  foo
3  1 2013-01-02  3.0  3  train  foo
-------last 1 row-----
   A          B    C  D      E    F
3  1 2013-01-02  3.0  3  train  foo
-------last 2 rows-----
   A          B    C  D      E    F
2  1 2013-01-02  3.0  3   test  foo
3  1 2013-01-02  3.0  3  train  foo
```
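3. to_numpy(): returns the underlying data as a NumPy array.
Notes:
a. A NumPy array has a single dtype for the whole array.
b. A DataFrame has one dtype per column, and the columns may differ.
c. When a DataFrame's columns have several different dtypes, this operation is relatively expensive.
d. When to_numpy() is called, pandas looks for a NumPy dtype that can hold all of the dtypes in the DataFrame.
e. That dtype may end up being object, in which case the values in the DataFrame's columns are cast to Python objects.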

```python
import pandas as pd
import numpy as np


def create_numpay_dataform():
    dates = pd.date_range('20130101', periods=6)
    # df: six rows of random floats, one dtype for the whole frame
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    # df2: a different dtype in almost every column
    df2 = pd.DataFrame({'A': 1,
                        'B': pd.Timestamp('20130102'),
                        'C': pd.Series(3, index=list(range(4)), dtype='float32'),
                        'D': np.array([3] * 4, dtype='int32'),
                        'E': pd.Categorical(["test", "train", "test", "train"]),
                        'F': 'foo'})
    return df, df2


def print_data():
    df1, df2 = create_numpay_dataform()
    print("-------DF1-----")
    print(df1)
    print("-------DF2-----")
    print(df2)
    print("-------DF1 to_numpy-----")
    print("All values in df1 are floats, so DataFrame.to_numpy() is fast and does not copy the data.")
    print(df1.to_numpy())
    print("-------DF2 to_numpy-----")
    print("df2 contains several dtypes, so DataFrame.to_numpy() is more expensive.")
    print(df2.to_numpy())


if __name__ == '__main__':
    print_data()
```

Output:

```
-------DF1-----
                   A         B         C         D
2013-01-01  0.214933 -0.932719  0.409751 -1.579671
2013-01-02  0.857846 -0.450446  1.334220 -0.256340
2013-01-03  1.855527 -0.459457 -0.088609  1.970731
2013-01-04 -0.315940  1.216017  0.145649  0.844216
2013-01-05  1.229986 -0.307384 -0.816692 -1.266780
2013-01-06 -0.324177 -0.606538 -0.993541 -1.018344
-------DF2-----
   A          B    C  D      E    F
0  1 2013-01-02  3.0  3   test  foo
1  1 2013-01-02  3.0  3  train  foo
2  1 2013-01-02  3.0  3   test  foo
3  1 2013-01-02  3.0  3  train  foo
-------DF1 to_numpy-----
All values in df1 are floats, so DataFrame.to_numpy() is fast and does not copy the data.
[[ 0.21493314 -0.93271907  0.40975128 -1.57967127]
 [ 0.85784569 -0.45044625  1.3342199  -0.25634002]
 [ 1.85552743 -0.45945651 -0.08860859  1.97073069]
 [-0.31593997  1.2160171   0.14564932  0.8442159 ]
 [ 1.22998622 -0.30738437 -0.81669186 -1.26677969]
 [-0.3241766  -0.60653794 -0.99354086 -1.01834351]]
-------DF2 to_numpy-----
df2 contains several dtypes, so DataFrame.to_numpy() is more expensive.
[[1 Timestamp('2013-01-02 00:00:00') 3.0 3 'test' 'foo']
 [1 Timestamp('2013-01-02 00:00:00') 3.0 3 'train' 'foo']
 [1 Timestamp('2013-01-02 00:00:00') 3.0 3 'test' 'foo']
 [1 Timestamp('2013-01-02 00:00:00') 3.0 3 'train' 'foo']]
```
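The notes on dtypes can be checked directly by looking at the dtype of the array that to_numpy() returns. The following is only a minimal sketch; the two small frames and the names df_float and df_mixed are made up for illustration and are not part of the example above.

```python
import numpy as np
import pandas as pd

# Homogeneous frame: every column is float64 (illustrative data)
df_float = pd.DataFrame(np.zeros((3, 2)), columns=list("AB"))
# Mixed frame: an integer column next to a string column (illustrative data)
df_mixed = pd.DataFrame({"A": [1, 2, 3], "F": ["foo", "bar", "baz"]})

arr_float = df_float.to_numpy()
arr_mixed = df_mixed.to_numpy()

print(arr_float.dtype)  # one numeric dtype fits every column
print(arr_mixed.dtype)  # no common numeric dtype, so pandas falls back to object
```

When the result is an object array, each element is an ordinary Python object, which is why the mixed case costs more.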
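4. describe(): shows a quick statistical summary of the data. It takes three parameters:
a. percentiles: which percentiles to report for numeric columns. The default is [.25, .5, .75], i.e. the 25th, 50th and 75th percentiles. It can be changed, for example describe(percentiles=[.2, .75, .8]); the 50% row is always included.
b. include: by default only numeric columns are summarised. With include=['O'] the statistics of discrete (object) columns are computed instead, and with include='all' both numeric and discrete columns are shown.
c. exclude: the counterpart of include. Where include says which dtypes to select, exclude says which ones to leave out. By default no column is dropped, so it has no effect.
d. To show only one row of the result, index into the result by position, for example describe().iloc[4]: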

```python
import pandas as pd


def create_numpay_dataform():
    df = pd.DataFrame(data={
        'A': list('abaacdadaf'),
        'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2]
    })
    return df


def print_data():
    df1 = create_numpay_dataform()
    print("-------DF1-----")
    print(df1)
    print("-------df1.describe() with no arguments-----")
    # percentiles defaults to 25%, 50%, 75%
    # include defaults to numeric columns only
    # exclude defaults to None, so no column is dropped
    print(df1.describe())
    print("-------df1.describe() with percentiles-----")
    # Reports the 10%, 60%, 80% and 90% percentiles; 50% is always shown
    print(df1.describe(percentiles=[.1, .6, .8, .9]))
    print("-------df1.describe() with include='all'-----")
    # 'all' shows statistics for both numeric and discrete columns
    print(df1.describe(include="all"))
    print("-------df1.describe() with include='O'-----")
    # include='O' computes statistics for discrete (object) columns
    print(df1.describe(include='O'))
    print("-------df1.describe() with exclude-----")
    # exclude='O' leaves out the discrete columns
    print(df1.describe(exclude='O'))
    print("-------df1.describe(), one row of the result-----")
    print(df1.describe().iloc[4])


if __name__ == '__main__':
    print_data()
```

Output:

```
-------DF1-----
   A  B
0  a  2
1  b  4
2  a  6
3  a  3
4  c  6
5  d  2
6  a  5
7  d  8
8  a  0
9  f  2
-------df1.describe() with no arguments-----
               B
count  10.000000
mean    3.800000
std     2.440401
min     0.000000
25%     2.000000
50%     3.500000
75%     5.750000
max     8.000000
-------df1.describe() with percentiles-----
               B
count  10.000000
mean    3.800000
std     2.440401
min     0.000000
10%     1.800000
50%     3.500000
60%     4.400000
80%     6.000000
90%     6.200000
max     8.000000
-------df1.describe() with include='all'-----
          A          B
count    10  10.000000
unique    5        NaN
top       a        NaN
freq      5        NaN
mean    NaN   3.800000
std     NaN   2.440401
min     NaN   0.000000
25%     NaN   2.000000
50%     NaN   3.500000
75%     NaN   5.750000
max     NaN   8.000000
-------df1.describe() with include='O'-----
         A
count   10
unique   5
top      a
freq     5
-------df1.describe() with exclude-----
               B
count  10.000000
mean    3.800000
std     2.440401
min     0.000000
25%     2.000000
50%     3.500000
75%     5.750000
max     8.000000
-------df1.describe(), one row of the result-----
B    2.0
Name: 25%, dtype: float64
```
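Since the result of describe() is itself a DataFrame whose index holds the statistic names, a row can also be picked by its label rather than its position. A minimal sketch, reusing the B column from the example; the variable names are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({"B": [2, 4, 6, 3, 6, 2, 5, 8, 0, 2]})
summary = df.describe()

by_position = summary.iloc[4]   # fifth row of the default summary
by_label = summary.loc["25%"]   # the same row, selected by its label

print(by_label)
```

Selecting by label keeps working if the percentiles argument changes the number of rows, whereas a fixed position may then point at a different statistic.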
5. sort_index(): sort along an axis

```python
import pandas as pd


def create_numpay_dataform():
    df = pd.DataFrame(data={
        'A': list('abaacdadaf'),
        'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
        'F': [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
        'D': list('afvarfxaes'),
    })
    return df


def print_data():
    df1 = create_numpay_dataform()
    print("-------DF1-----")
    print(df1)
    print("-------DF1 sorted by index-----")
    # sort_index() sorts ascending by default; ascending=False reverses the order
    print(df1.sort_index(ascending=False))
    print("-------DF1 sorted by column name (header)-----")
    # axis=1 sorts the column labels instead of the row labels
    print(df1.sort_index(axis=1, ascending=False))


if __name__ == '__main__':
    print_data()
```

Output:

```
-------DF1-----
   A  B  F  D
0  a  2  2  a
1  b  4  6  f
2  a  6  6  v
3  a  3  7  a
4  c  6  9  r
5  d  2  2  f
6  a  5  0  x
7  d  8  8  a
8  a  0  0  e
9  f  2  2  s
-------DF1 sorted by index-----
   A  B  F  D
9  f  2  2  s
8  a  0  0  e
7  d  8  8  a
6  a  5  0  x
5  d  2  2  f
4  c  6  9  r
3  a  3  7  a
2  a  6  6  v
1  b  4  6  f
0  a  2  2  a
-------DF1 sorted by column name (header)-----
   F  D  B  A
0  2  a  2  a
1  6  f  4  b
2  6  v  6  a
3  7  a  3  a
4  9  r  6  c
5  2  f  2  d
6  0  x  5  a
7  8  a  8  d
8  0  e  0  a
9  2  s  2  f
```
6. sort_values(): sort by value

```python
import pandas as pd


def create_numpay_dataform():
    df = pd.DataFrame(data={
        'A': list('abaacdadaf'),
        'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
        'F': [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
        'D': list('afvarfxaes'),
    })
    return df


def print_data():
    df1 = create_numpay_dataform()
    print("-------DF1-----")
    print(df1)
    print("-------DF1 sorted by value-----")
    # sort_values() sorts ascending by default; ascending=False reverses the order
    print(df1.sort_values(by='B', ascending=False))


if __name__ == '__main__':
    print_data()
```

Output:

```
-------DF1-----
   A  B  F  D
0  a  2  2  a
1  b  4  6  f
2  a  6  6  v
3  a  3  7  a
4  c  6  9  r
5  d  2  2  f
6  a  5  0  x
7  d  8  8  a
8  a  0  0  e
9  f  2  2  s
-------DF1 sorted by value-----
   A  B  F  D
7  d  8  8  a
2  a  6  6  v
4  c  6  9  r
6  a  5  0  x
1  b  4  6  f
3  a  3  7  a
0  a  2  2  a
5  d  2  2  f
9  f  2  2  s
8  a  0  0  e
```
7. index: show the index (the row labels, displayed as the leftmost column)

```python
import pandas as pd
import numpy as np


def create_numpay_dataform():
    df2 = pd.DataFrame({'A': 1,
                        'B': pd.Timestamp('20130102'),
                        'C': pd.Series(3, index=list(range(4)), dtype='float32'),
                        'D': np.array([3] * 4, dtype='int32'),
                        'E': pd.Categorical(["test", "train", "test", "train"]),
                        'F': 'foo'})
    return df2


def print_data():
    df = create_numpay_dataform()
    print("-------original data-----")
    print(df)
    print("-------index-----")
    print(df.index)


if __name__ == '__main__':
    print_data()
```

Output:

```
-------original data-----
   A          B    C  D      E    F
0  1 2013-01-02  3.0  3   test  foo
1  1 2013-01-02  3.0  3  train  foo
2  1 2013-01-02  3.0  3   test  foo
3  1 2013-01-02  3.0  3  train  foo
-------index-----
Int64Index([0, 1, 2, 3], dtype='int64')
```
8. columns: show the column names (the header row at the top)

```python
import pandas as pd
import numpy as np


def create_numpay_dataform():
    df2 = pd.DataFrame({'A': 1,
                        'B': pd.Timestamp('20130102'),
                        'C': pd.Series(3, index=list(range(4)), dtype='float32'),
                        'D': np.array([3] * 4, dtype='int32'),
                        'E': pd.Categorical(["test", "train", "test", "train"]),
                        'F': 'foo'})
    return df2


def print_data():
    df = create_numpay_dataform()
    print("-------original data-----")
    print(df)
    print("-------column names-----")
    print(df.columns)


if __name__ == '__main__':
    print_data()
```

Output:

```
-------original data-----
   A          B    C  D      E    F
0  1 2013-01-02  3.0  3   test  foo
1  1 2013-01-02  3.0  3  train  foo
2  1 2013-01-02  3.0  3   test  foo
3  1 2013-01-02  3.0  3  train  foo
-------column names-----
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
```
9. T: transpose the data

```python
import pandas as pd


def create_numpay_dataform():
    df = pd.DataFrame(data={
        'A': list('abaacdadaf'),
        'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
        'F': [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
        'D': list('afvarfxaes'),
    })
    return df


def print_data():
    df1 = create_numpay_dataform()
    print("-------DF1-----")
    print(df1)
    print("-------DF1 with rows and columns swapped-----")
    print(df1.T)


if __name__ == '__main__':
    print_data()
```

Output:

```
-------DF1-----
   A  B  F  D
0  a  2  2  a
1  b  4  6  f
2  a  6  6  v
3  a  3  7  a
4  c  6  9  r
5  d  2  2  f
6  a  5  0  x
7  d  8  8  a
8  a  0  0  e
9  f  2  2  s
-------DF1 with rows and columns swapped-----
   0  1  2  3  4  5  6  7  8  9
A  a  b  a  a  c  d  a  d  a  f
B  2  4  6  3  6  2  5  8  0  2
F  2  6  6  7  9  2  0  8  0  2
D  a  f  v  a  r  f  x  a  e  s
```
III. Selecting data
1. Select a single column with ["column name"] or df.column_name

```python
import pandas as pd


def create_numpay_dataform():
    df = pd.DataFrame(data={
        'A': list('abaacdadaf'),
        'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
        'F': [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
        'D': list('afvarfxaes'),
    })
    return df


def print_data():
    df1 = create_numpay_dataform()
    print("-------DF1-----")
    print(df1)
    print("-------single column via ['column name']-----")
    print(df1["D"])
    print("-------single column via df1.column_name-----")
    print(df1.A)


if __name__ == '__main__':
    print_data()
```

Output:

```
-------DF1-----
   A  B  F  D
0  a  2  2  a
1  b  4  6  f
2  a  6  6  v
3  a  3  7  a
4  c  6  9  r
5  d  2  2  f
6  a  5  0  x
7  d  8  8  a
8  a  0  0  e
9  f  2  2  s
-------single column via ['column name']-----
0    a
1    f
2    v
3    a
4    r
5    f
6    x
7    a
8    e
9    s
Name: D, dtype: object
-------single column via df1.column_name-----
0    a
1    b
2    a
3    a
4    c
5    d
6    a
7    d
8    a
9    f
Name: A, dtype: object
```
2. Slice rows with [ ]

```python
import pandas as pd


def create_numpay_dataform():
    df = pd.DataFrame(data={
        'A': list('abaacdadaf'),
        'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
        'F': [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
        'D': list('afvarfxaes'),
    })
    return df


def print_data():
    df1 = create_numpay_dataform()
    print("-------DF1-----")
    print(df1)
    print("-------row slice-----")
    # Rows at positions 4 and 5; the end position 6 is excluded
    print(df1[4:6])


if __name__ == '__main__':
    print_data()
```

Output:

```
-------DF1-----
   A  B  F  D
0  a  2  2  a
1  b  4  6  f
2  a  6  6  v
3  a  3  7  a
4  c  6  9  r
5  d  2  2  f
6  a  5  0  x
7  d  8  8  a
8  a  0  0  e
9  f  2  2  s
-------row slice-----
   A  B  F  D
4  c  6  9  r
5  d  2  2  f
```
3. loc: select by label

```python
import pandas as pd


def create_numpay_dataform():
    df = pd.DataFrame(data={
        'A': list('abaacdadaf'),
        'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
        'F': [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
        'D': list('afvarfxaes'),
    })
    return df


def print_data():
    df1 = create_numpay_dataform()
    print("-------DF1-----")
    print(df1)
    print("-------one row by label------")
    print(df1.loc[8])
    print("-------several rows by label------")
    print(df1.loc[[1, 6, 8]])
    print("-------several columns by label------")
    print(df1.loc[:, ['A', 'D']])
    print("-------given rows and given columns by label------")
    print(df1.loc[4:7, ['A', 'D']])
    print("-------reduced dimension (one row gives a Series)------")
    print(df1.loc[5, ['A', 'D']])


if __name__ == '__main__':
    print_data()
```

Output:

```
-------DF1-----
   A  B  F  D
0  a  2  2  a
1  b  4  6  f
2  a  6  6  v
3  a  3  7  a
4  c  6  9  r
5  d  2  2  f
6  a  5  0  x
7  d  8  8  a
8  a  0  0  e
9  f  2  2  s
-------one row by label------
A    a
B    0
F    0
D    e
Name: 8, dtype: object
-------several rows by label------
   A  B  F  D
1  b  4  6  f
6  a  5  0  x
8  a  0  0  e
-------several columns by label------
   A  D
0  a  a
1  b  f
2  a  v
3  a  a
4  c  r
5  d  f
6  a  x
7  d  a
8  a  e
9  f  s
-------given rows and given columns by label------
   A  D
4  c  r
5  d  f
6  a  x
7  d  a
-------reduced dimension (one row gives a Series)------
A    d
D    f
Name: 5, dtype: object
```
4. iloc: select by position

```python
import pandas as pd


def create_numpay_dataform():
    df = pd.DataFrame(data={
        'A': list('abaacdadaf'),
        'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
        'F': [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
        'D': list('afvarfxaes'),
    })
    return df


def print_data():
    df1 = create_numpay_dataform()
    print("-------DF1-----")
    print(df1)
    print("-------one row by position------")
    print(df1.iloc[3])
    print("-------row and column slices by position------")
    print(df1.iloc[3:, :3])
    print("-------given rows and columns by position------")
    print(df1.iloc[[1, 3, 5], [0, 2]])


if __name__ == '__main__':
    print_data()
```

Output:

```
-------DF1-----
   A  B  F  D
0  a  2  2  a
1  b  4  6  f
2  a  6  6  v
3  a  3  7  a
4  c  6  9  r
5  d  2  2  f
6  a  5  0  x
7  d  8  8  a
8  a  0  0  e
9  f  2  2  s
-------one row by position------
A    a
B    3
F    7
D    a
Name: 3, dtype: object
-------row and column slices by position------
   A  B  F
3  a  3  7
4  c  6  9
5  d  2  2
6  a  5  0
7  d  8  8
8  a  0  0
9  f  2  2
-------given rows and columns by position------
   A  F
1  b  6
3  a  7
5  d  2
```
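One difference between loc and iloc is easy to miss when the index happens to be 0, 1, 2, ...: a label slice includes its end point, while a positional slice does not. A small sketch with a one-column frame made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": list("abaacdadaf")})

by_label = df.loc[4:7]       # label slice: labels 4 and 7 are both included
by_position = df.iloc[4:7]   # positional slice: stops before position 7

print(len(by_label))      # 4 rows: labels 4, 5, 6, 7
print(len(by_position))   # 3 rows: positions 4, 5, 6
```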
6. Boolean test against a single value (make sure both sides of the comparison have matching data structures; comparing mismatched structures raises an error)

```python
import pandas as pd
import numpy as np


def create_numpay_dataform():
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    return df


def print_data():
    df1 = create_numpay_dataform()
    print("-------DF1-----")
    print(df1)
    print("-------boolean indexing on one column------")
    # Keeps the rows where column B is positive
    print(df1[df1.B > 0])
    print("-------boolean indexing on the whole frame------")
    # Keeps the shape; values that fail the test become NaN
    print(df1[df1 > 0.543791])


if __name__ == '__main__':
    print_data()
```

Output:

```
-------DF1-----
                   A         B         C         D
2013-01-01 -0.571184  0.810240 -1.834513 -0.185410
2013-01-02 -0.085790 -1.009361  1.311410  0.141120
2013-01-03  0.672282  0.569641 -1.394152  0.832807
2013-01-04  0.170832 -0.882142  0.928596 -0.945374
2013-01-05 -1.100324 -1.045981  1.217005  1.420321
2013-01-06 -0.952931  0.575549 -0.164552 -1.097455
-------boolean indexing on one column------
                   A         B         C         D
2013-01-01 -0.571184  0.810240 -1.834513 -0.185410
2013-01-03  0.672282  0.569641 -1.394152  0.832807
2013-01-06 -0.952931  0.575549 -0.164552 -1.097455
-------boolean indexing on the whole frame------
                   A         B         C         D
2013-01-01       NaN  0.810240       NaN       NaN
2013-01-02       NaN       NaN  1.311410       NaN
2013-01-03  0.672282  0.569641       NaN  0.832807
2013-01-04       NaN       NaN  0.928596       NaN
2013-01-05       NaN       NaN  1.217005  1.420321
2013-01-06       NaN  0.575549       NaN       NaN
```
7. isin(): boolean test against several values

```python
import pandas as pd


def create_numpay_dataform():
    df = pd.DataFrame(data={
        'A': list('abaacdadaf'),
        'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
        'F': [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
        'D': list('afvarfxaes'),
    })
    return df


def print_data():
    df1 = create_numpay_dataform()
    print("-------DF1-----")
    print(df1)
    print("-------test one column-------")
    print(df1.F.isin(["a", "b", 6, 9]))
    print("-------test the whole frame-------")
    print(df1.isin(["a", "b", 6, 9]))


if __name__ == '__main__':
    print_data()
```

Output:

```
-------DF1-----
   A  B  F  D
0  a  2  2  a
1  b  4  6  f
2  a  6  6  v
3  a  3  7  a
4  c  6  9  r
5  d  2  2  f
6  a  5  0  x
7  d  8  8  a
8  a  0  0  e
9  f  2  2  s
-------test one column-------
0    False
1     True
2     True
3    False
4     True
5    False
6    False
7    False
8    False
9    False
Name: F, dtype: bool
-------test the whole frame-------
       A      B      F      D
0   True  False  False   True
1   True  False   True  False
2   True   True   True  False
3   True  False  False   True
4  False   True   True  False
5  False  False  False  False
6   True  False  False  False
7  False  False  False   True
8   True  False  False  False
9  False  False  False  False
```
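isin() only returns a boolean mask. To actually keep the matching rows, the mask can be passed back into [ ], in the same way as the single-value test in the previous section. A minimal sketch using two of the columns from the example; the names mask and selected are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "A": list("abaacdadaf"),
    "F": [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
})

mask = df["F"].isin([6, 9])   # True where F is 6 or 9
selected = df[mask]           # keep only the rows where the mask is True

print(selected)
```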
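8. Assignment: assignment is a very simple operation. Once label-based or position-based selection has located the data, assign to it directly. A boolean condition can also be used to decide which values get assigned.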
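The three forms of assignment mentioned here might look like the following. This is only a sketch on a small frame invented for illustration, not code from the original post:

```python
import pandas as pd

df = pd.DataFrame({"A": list("abc"), "B": [2, 4, 6]})

# By label: row label 0, column "B"
df.loc[0, "B"] = 20
# By position: second row, second column
df.iloc[1, 1] = 40
# By condition: set "A" wherever "B" is greater than 30
df.loc[df["B"] > 30, "A"] = "big"

print(df)
```

Doing the conditional assignment through a single loc call, rather than chaining two [ ] selections, makes sure the write goes to the original frame and not to a temporary copy.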
9. Missing values: pandas mainly uses np.nan to represent missing data. By default, missing values are left out of computations.

```python
import pandas as pd
import numpy as np


def create_numpay_dataform():
    df = pd.DataFrame(data={
        'A': list('abaacdadaf'),
        'B': [2, 4, 6, 3, 6, 2, 5, 8, 0, 2],
        'F': [2, 6, 6, 7, 9, 2, 0, 8, 0, 2],
        'D': list('afvarfxaes'),
    })
    return df


def print_data():
    df1 = create_numpay_dataform()
    print("-------DF1-----")
    print(df1)
    print("-------set missing values----------")
    # Rows 1 to 3, every column except the last
    df1.iloc[1:4, :-1] = np.nan
    print(df1)
    print("-------drop rows with missing values----------")
    # how="all" drops a row only when every value in it is missing;
    # how="any" drops a row as soon as any value in it is missing
    print(df1.dropna(how="all"))
    print("-------fill missing values----------")
    print(df1.fillna(value=10086))
    print("-------test for missing values----------")
    print(pd.isna(df1))


if __name__ == '__main__':
    print_data()
```

Output (no row is dropped by how="all" because column D is never missing):

```
-------DF1-----
   A  B  F  D
0  a  2  2  a
1  b  4  6  f
2  a  6  6  v
3  a  3  7  a
4  c  6  9  r
5  d  2  2  f
6  a  5  0  x
7  d  8  8  a
8  a  0  0  e
9  f  2  2  s
-------set missing values----------
     A    B    F  D
0    a  2.0  2.0  a
1  NaN  NaN  NaN  f
2  NaN  NaN  NaN  v
3  NaN  NaN  NaN  a
4    c  6.0  9.0  r
5    d  2.0  2.0  f
6    a  5.0  0.0  x
7    d  8.0  8.0  a
8    a  0.0  0.0  e
9    f  2.0  2.0  s
-------drop rows with missing values----------
     A    B    F  D
0    a  2.0  2.0  a
1  NaN  NaN  NaN  f
2  NaN  NaN  NaN  v
3  NaN  NaN  NaN  a
4    c  6.0  9.0  r
5    d  2.0  2.0  f
6    a  5.0  0.0  x
7    d  8.0  8.0  a
8    a  0.0  0.0  e
9    f  2.0  2.0  s
-------fill missing values----------
       A        B        F  D
0      a      2.0      2.0  a
1  10086  10086.0  10086.0  f
2  10086  10086.0  10086.0  v
3  10086  10086.0  10086.0  a
4      c      6.0      9.0  r
5      d      2.0      2.0  f
6      a      5.0      0.0  x
7      d      8.0      8.0  a
8      a      0.0      0.0  e
9      f      2.0      2.0  s
-------test for missing values----------
       A      B      F      D
0  False  False  False  False
1   True   True   True  False
2   True   True   True  False
3   True   True   True  False
4  False  False  False  False
5  False  False  False  False
6  False  False  False  False
7  False  False  False  False
8  False  False  False  False
9  False  False  False  False
```
IV. Statistics
1. Before computing on arrays, the data sometimes has to be aligned first; shift() is the method for that. (It is not especially efficient.)

```python
import pandas as pd
import numpy as np


def create_numpay_dataform():
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    return df


def print_data():
    df1 = create_numpay_dataform()
    print("-------DF1-----")
    print(df1)
    print("------rows shifted down by one--------")
    print(df1.shift())
    print("------columns shifted left by one--------")
    print(df1.shift(-1, axis=1))


if __name__ == '__main__':
    print_data()
```

Output:

```
-------DF1-----
                   A         B         C         D
2013-01-01 -1.494982 -1.816127 -1.557673  0.676270
2013-01-02 -0.382565 -0.772728 -2.028113 -1.000548
2013-01-03  1.024764  1.438836 -2.294408 -0.391837
2013-01-04  0.460244  1.823243 -0.183927  1.755757
2013-01-05  0.655894 -0.193546  1.155935 -0.773810
2013-01-06 -2.142355 -0.583462  1.369368  0.703252
------rows shifted down by one--------
                   A         B         C         D
2013-01-01       NaN       NaN       NaN       NaN
2013-01-02 -1.494982 -1.816127 -1.557673  0.676270
2013-01-03 -0.382565 -0.772728 -2.028113 -1.000548
2013-01-04  1.024764  1.438836 -2.294408 -0.391837
2013-01-05  0.460244  1.823243 -0.183927  1.755757
2013-01-06  0.655894 -0.193546  1.155935 -0.773810
------columns shifted left by one--------
                   A         B         C   D
2013-01-01 -1.816127 -1.557673  0.676270 NaN
2013-01-02 -0.772728 -2.028113 -1.000548 NaN
2013-01-03  1.438836 -2.294408 -0.391837 NaN
2013-01-04  1.823243 -0.183927  1.755757 NaN
2013-01-05 -0.193546  1.155935 -0.773810 NaN
2013-01-06 -0.583462  1.369368  0.703252 NaN
```
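A typical reason to align data with shift() is to put each value next to the previous one so that the two can be combined element by element. The sketch below uses a short Series with made-up values and computes the day-over-day change this way:

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 15.0, 11.0],
              index=pd.date_range("20130101", periods=4))

previous = s.shift(1)   # yesterday's value, aligned to today's date
change = s - previous   # day-over-day change; the first entry is NaN

print(change)
```

For this particular calculation the result is the same as s.diff(), which is covered further down.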
2. mean(): compute the mean

```python
import pandas as pd
import numpy as np


def create_numpay_dataform():
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    return df


def print_data():
    df1 = create_numpay_dataform()
    print("-------DF1-----")
    print(df1)
    print("-------set missing values----------")
    df1.iloc[1:4, :-1] = np.nan
    print(df1)
    print("-------fill missing values----------")
    c1 = df1.fillna(value=10086)
    print(c1)
    print("-------mean of the frame without missing values----------")
    # axis=0 gives the mean of each column, axis=1 the mean of each row; the default is 0
    print(c1.mean(axis=0))
    print("-------mean of the frame with missing values----------")
    # Missing values are left out before the mean is computed
    print(df1.mean())


if __name__ == '__main__':
    print_data()
```

Output:

```
-------DF1-----
                   A         B         C         D
2013-01-01  0.504033  0.167604  0.656164 -0.305116
2013-01-02  0.743423  1.004330 -1.858694 -0.962968
2013-01-03 -0.978681 -0.858943  1.527813  0.442333
2013-01-04 -0.447715 -1.075530  0.655507  1.271325
2013-01-05  0.877627  0.641684 -1.701115 -0.211141
2013-01-06  2.704554 -0.666753 -1.092838 -2.232137
-------set missing values----------
                   A         B         C         D
2013-01-01  0.504033  0.167604  0.656164 -0.305116
2013-01-02       NaN       NaN       NaN -0.962968
2013-01-03       NaN       NaN       NaN  0.442333
2013-01-04       NaN       NaN       NaN  1.271325
2013-01-05  0.877627  0.641684 -1.701115 -0.211141
2013-01-06  2.704554 -0.666753 -1.092838 -2.232137
-------fill missing values----------
                       A             B             C         D
2013-01-01      0.504033      0.167604      0.656164 -0.305116
2013-01-02  10086.000000  10086.000000  10086.000000 -0.962968
2013-01-03  10086.000000  10086.000000  10086.000000  0.442333
2013-01-04  10086.000000  10086.000000  10086.000000  1.271325
2013-01-05      0.877627      0.641684     -1.701115 -0.211141
2013-01-06      2.704554     -0.666753     -1.092838 -2.232137
-------mean of the frame without missing values----------
A    5043.681036
B    5043.023756
C    5042.643702
D      -0.332951
dtype: float64
-------mean of the frame with missing values----------
A    1.362071
B    0.047512
C   -0.712596
D   -0.332951
dtype: float64
```
3. diff(): compute the difference between rows (or columns)

```python
import pandas as pd
import numpy as np


def create_numpay_dataform():
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    return df


def print_data():
    df1 = create_numpay_dataform()
    print("-------DF1-----")
    print(df1)
    print("------difference between adjacent rows (current row minus the next row)--------")
    # A negative period compares each row with the one after it
    print(df1.diff(-1))
    print("------difference with a step of 2 columns (current column minus the one two to its left)--------")
    print(df1.diff(2, axis=1))


if __name__ == '__main__':
    print_data()
```

Output:

```
-------DF1-----
                   A         B         C         D
2013-01-01 -1.482337  0.735672 -0.523935  1.441714
2013-01-02 -0.293590 -1.251721 -0.532770 -0.178270
2013-01-03  0.464124  0.148478  0.647906 -0.462180
2013-01-04 -1.313573 -0.280773 -0.815059  0.449937
2013-01-05 -0.042054  0.037449 -1.380082  1.694301
2013-01-06 -0.685625  0.379272 -0.009392 -0.563834
------difference between adjacent rows (current row minus the next row)--------
                   A         B         C         D
2013-01-01 -1.188747  1.987393  0.008835  1.619984
2013-01-02 -0.757714 -1.400199 -1.180676  0.283910
2013-01-03  1.777696  0.429250  1.462965 -0.912117
2013-01-04 -1.271519 -0.318222  0.565023 -1.244364
2013-01-05  0.643571 -0.341823 -1.370691  2.258135
2013-01-06       NaN       NaN       NaN       NaN
------difference with a step of 2 columns (current column minus the one two to its left)--------
             A   B         C         D
2013-01-01 NaN NaN  0.958402  0.706042
2013-01-02 NaN NaN -0.239180  1.073451
2013-01-03 NaN NaN  0.183783 -0.610658
2013-01-04 NaN NaN  0.498513  0.730710
2013-01-05 NaN NaN -1.338028  1.656852
2013-01-06 NaN NaN  0.676234 -0.943107
```
4. sub(): subtraction between two DataFrames, or between a DataFrame and a Series

```python
import pandas as pd
import numpy as np


def create_numpay_dataform():
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    df1 = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates)
    return df, df1, s


def print_data():
    df, df1, s1 = create_numpay_dataform()
    print("-------DF-----")
    print(df)
    print("-------DF1--------")
    print(df1)
    print("-------s1--------")
    print(s1)
    print("------DataFrame minus Series--------")
    # axis="index" matches the Series to the row labels and subtracts it from every column
    print(df.sub(s1, axis="index"))
    print("------DataFrame minus DataFrame--------")
    print(df.sub(df1))


if __name__ == '__main__':
    print_data()
```

Output:

```
-------DF-----
                   A         B         C         D
2013-01-01 -1.117915 -0.387520  0.013181 -0.732305
2013-01-02  0.259804  0.943158  0.209316 -0.179862
2013-01-03 -0.681971 -1.385040  0.354760 -0.572621
2013-01-04 -0.019748 -0.703220 -0.765874 -0.584478
2013-01-05  1.187278 -0.287918 -0.215136  0.075496
2013-01-06 -1.160146 -0.882323 -0.620577  0.380190
-------DF1--------
                   A         B         C         D
2013-01-01  1.532277 -0.527844 -0.345524  0.701999
2013-01-02  0.794895 -2.042780  1.163952 -0.877180
2013-01-03 -0.489494 -0.131753  0.444089  0.789567
2013-01-04  0.440047 -0.693099 -0.243348 -0.612980
2013-01-05 -1.128350 -1.012848  0.632883 -0.023234
2013-01-06 -0.672428 -0.249193  1.676576 -1.486626
-------s1--------
2013-01-01    1.0
2013-01-02    3.0
2013-01-03    5.0
2013-01-04    NaN
2013-01-05    6.0
2013-01-06    8.0
Freq: D, dtype: float64
------DataFrame minus Series--------
                   A         B         C         D
2013-01-01 -2.117915 -1.387520 -0.986819 -1.732305
2013-01-02 -2.740196 -2.056842 -2.790684 -3.179862
2013-01-03 -5.681971 -6.385040 -4.645240 -5.572621
2013-01-04       NaN       NaN       NaN       NaN
2013-01-05 -4.812722 -6.287918 -6.215136 -5.924504
2013-01-06 -9.160146 -8.882323 -8.620577 -7.619810
------DataFrame minus DataFrame--------
                   A         B         C         D
2013-01-01 -2.650191  0.140324  0.358705 -1.434305
2013-01-02 -0.535091  2.985938 -0.954636  0.697318
2013-01-03 -0.192477 -1.253287 -0.089329 -1.362188
2013-01-04 -0.459795 -0.010121 -0.522526  0.028502
2013-01-05  2.315628  0.724930 -0.848019  0.098730
2013-01-06 -0.487718 -0.633130 -2.297154  1.866815
```
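5. apply(): the most flexible of all the functions in pandas. In older pandas versions the signature is apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds); the broadcast and reduce parameters were removed in pandas 1.0 in favour of result_type.
a. The most useful parameter is the first one. It is a function, and you implement that function yourself.
b. What gets passed to the function depends on axis. With axis=1, for example, each row is passed to your function as a Series, and inside the function you can compute across the different fields of that Series.
c. apply then walks through every row of the DataFrame in this way and finally combines all the results into a single Series, which it returns.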

```python
import pandas as pd
import numpy as np


def create_numpay_dataform():
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    return df


def print_data():
    df = create_numpay_dataform()
    print("-------DF-----")
    print(df)
    print("------apply with an existing function--------")
    # np.cumsum computes the running total down each column
    print(df.apply(np.cumsum))
    print("------apply with a custom function--------")
    # For each column: its maximum minus its third value (position 2)
    print(df.apply(lambda x: x.max() - x.iloc[2]))


if __name__ == '__main__':
    print_data()
```

Output:

```
-------DF-----
                   A         B         C         D
2013-01-01 -2.086164  0.201652  1.722858  1.071210
2013-01-02 -0.462887 -0.188189  0.733832 -1.798445
2013-01-03  0.672316 -1.359191 -0.031073  0.508793
2013-01-04 -0.624844 -0.503734  0.262923 -0.519521
2013-01-05 -1.170108 -0.308858 -0.653888  0.552537
2013-01-06 -1.408287  0.406629  0.000608  0.085242
------apply with an existing function--------
                   A         B         C         D
2013-01-01 -2.086164  0.201652  1.722858  1.071210
2013-01-02 -2.549051  0.013464  2.456690 -0.727234
2013-01-03 -1.876735 -1.345727  2.425617 -0.218442
2013-01-04 -2.501579 -1.849461  2.688541 -0.737963
2013-01-05 -3.671687 -2.158320  2.034652 -0.185426
2013-01-06 -5.079974 -1.751691  2.035260 -0.100184
------apply with a custom function--------
A    0.000000
B    1.765820
C    1.753931
D    0.562418
dtype: float64
```
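The example above applies functions column by column (the default axis=0). The row-wise case described in the notes, where each row arrives in the function as a Series, could be sketched as follows. The frame and the helper row_range are hypothetical and only serve the illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 5, 3], "B": [4, 2, 9]})


def row_range(row):
    # `row` is a Series whose index holds the column names "A" and "B"
    return row.max() - row.min()


# axis=1 calls row_range once per row and collects the results into a Series
result = df.apply(row_range, axis=1)

print(result)
```

The returned Series is indexed like the rows of the original frame, so it can be assigned straight back as a new column if needed.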
V. Merging
1. concat: concatenate several DataFrames.

```python
import pandas as pd
import numpy as np


def create_numpay_dataform():
    dates = pd.date_range('20130101', periods=9)
    df = pd.DataFrame(np.random.randn(9, 4), index=dates, columns=list('ABCD'))
    return df


def print_data():
    df = create_numpay_dataform()
    print("-------DF-----")
    print(df)
    print("-------DF after slicing-------")
    pieces = [df[:3], df[3:7], df[7:]]
    print(pieces[0])
    print(pieces[1])
    print(pieces[2])
    print("-------slices concatenated again-------")
    # The pieces are joined in the order given: first, third, second
    print(pd.concat([pieces[0], pieces[2], pieces[1]]))


if __name__ == '__main__':
    print_data()
```

Output:

```
-------DF-----
                   A         B         C         D
2013-01-01  0.276946  1.235298  0.932776 -0.565113
2013-01-02 -0.503525  0.365262  0.884855 -1.432992
2013-01-03 -0.042289  0.923140 -0.067742 -0.993290
2013-01-04 -0.560989 -0.433529 -0.339409  0.099952
2013-01-05  0.032306  0.003271  0.605058  0.398746
2013-01-06  0.033632 -1.831336  0.828554 -0.745181
2013-01-07 -0.306900  0.027087  0.387204 -1.099752
2013-01-08  0.580035 -0.305193 -0.287659 -1.204415
2013-01-09  1.077574  1.034927 -0.360812 -0.792874
-------DF after slicing-------
                   A         B         C         D
2013-01-01  0.276946  1.235298  0.932776 -0.565113
2013-01-02 -0.503525  0.365262  0.884855 -1.432992
2013-01-03 -0.042289  0.923140 -0.067742 -0.993290
                   A         B         C         D
2013-01-04 -0.560989 -0.433529 -0.339409  0.099952
2013-01-05  0.032306  0.003271  0.605058  0.398746
2013-01-06  0.033632 -1.831336  0.828554 -0.745181
2013-01-07 -0.306900  0.027087  0.387204 -1.099752
                   A         B         C         D
2013-01-08  0.580035 -0.305193 -0.287659 -1.204415
2013-01-09  1.077574  1.034927 -0.360812 -0.792874
-------slices concatenated again-------
                   A         B         C         D
2013-01-01  0.276946  1.235298  0.932776 -0.565113
2013-01-02 -0.503525  0.365262  0.884855 -1.432992
2013-01-03 -0.042289  0.923140 -0.067742 -0.993290
2013-01-08  0.580035 -0.305193 -0.287659 -1.204415
2013-01-09  1.077574  1.034927 -0.360812 -0.792874
2013-01-04 -0.560989 -0.433529 -0.339409  0.099952
2013-01-05  0.032306  0.003271  0.605058  0.398746
2013-01-06  0.033632 -1.831336  0.828554 -0.745181
2013-01-07 -0.306900  0.027087  0.387204 -1.099752
```
2. merge(): SQL-style merging, similar to a join.

```python
import pandas as pd


def create_numpay_dataform():
    df1 = pd.DataFrame({"rng": ["xiaohua", "ming", "uzi"], "age": [22, 19, 24]})
    df2 = pd.DataFrame({"rng": ["xiaohua", "ming", "uzi"], "role": ["mid", "sup", "adc"]})
    df3 = pd.DataFrame({"team": ["rng", "rng", "rng"], "name": ["xiaohua", "ming", "uzi"]})
    df4 = pd.DataFrame({"team": ["rng", "rng"], "opponent": ["LGD", "IG"]})
    return df1, df2, df3, df4


def print_data():
    df1, df2, df3, df4 = create_numpay_dataform()
    print("-------DF1-----")
    print(df1)
    print("-------DF2-----")
    print(df2)
    print("-------DF3-----")
    print(df3)
    print("-------DF4-----")
    print(df4)
    print("-------key column holds distinct values-------")
    # Each key matches exactly one row on each side
    print(pd.merge(df1, df2, on="rng"))
    print("-------key column holds one repeated value-------")
    # Every row of df3 matches every row of df4: 3 x 2 = 6 rows
    print(pd.merge(df3, df4, on="team"))


if __name__ == '__main__':
    print_data()
```

Output:

```
-------DF1-----
       rng  age
0  xiaohua   22
1     ming   19
2      uzi   24
-------DF2-----
       rng role
0  xiaohua  mid
1     ming  sup
2      uzi  adc
-------DF3-----
  team     name
0  rng  xiaohua
1  rng     ming
2  rng      uzi
-------DF4-----
  team opponent
0  rng      LGD
1  rng       IG
-------key column holds distinct values-------
       rng  age role
0  xiaohua   22  mid
1     ming   19  sup
2      uzi   24  adc
-------key column holds one repeated value-------
  team     name opponent
0  rng  xiaohua      LGD
1  rng  xiaohua       IG
2  rng     ming      LGD
3  rng     ming       IG
4  rng      uzi      LGD
5  rng      uzi       IG
```
3. append(): append data to the end of a DataFrame. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the code below needs an older pandas; pd.concat() is the replacement.

```python
import pandas as pd
import numpy as np


def create_numpay_dataform():
    dates = pd.date_range('20130101', periods=9)
    df = pd.DataFrame(np.random.randn(9, 4), index=dates, columns=list('ABCD'))
    return df


def print_data():
    df = create_numpay_dataform()
    print("-------DF-----")
    print(df)
    print("-------select N rows from DF-----")
    s = df.iloc[0:2]
    print(s)
    print("-------rows appended to the end of DF-------")
    # ignore_index=True discards the existing index and renumbers the rows 0..n-1
    # DataFrame.append() only exists in pandas versions before 2.0
    print(df.append(s, ignore_index=True))


if __name__ == '__main__':
    print_data()
```

Output:

```
-------DF-----
                   A         B         C         D
2013-01-01  0.180073  1.027674  0.699021  0.211052
2013-01-02  0.700873  0.893067  0.234802  1.378712
2013-01-03 -0.318609 -0.291524  0.123771  1.057293
2013-01-04 -0.145169  0.213432  0.285161 -0.231468
2013-01-05 -0.916774 -1.284495  1.661716 -0.258821
2013-01-06  0.460373 -2.351527 -0.462772 -0.587480
2013-01-07 -1.149013 -1.290900  0.171418 -0.076885
2013-01-08 -1.621095  0.704023 -0.706554  0.016696
2013-01-09 -0.405135 -1.019510  0.863830 -1.316628
-------select N rows from DF-----
                   A         B         C         D
2013-01-01  0.180073  1.027674  0.699021  0.211052
2013-01-02  0.700873  0.893067  0.234802  1.378712
-------rows appended to the end of DF-------
           A         B         C         D
0   0.180073  1.027674  0.699021  0.211052
1   0.700873  0.893067  0.234802  1.378712
2  -0.318609 -0.291524  0.123771  1.057293
3  -0.145169  0.213432  0.285161 -0.231468
4  -0.916774 -1.284495  1.661716 -0.258821
5   0.460373 -2.351527 -0.462772 -0.587480
6  -1.149013 -1.290900  0.171418 -0.076885
7  -1.621095  0.704023 -0.706554  0.016696
8  -0.405135 -1.019510  0.863830 -1.316628
9   0.180073  1.027674  0.699021  0.211052
10  0.700873  0.893067  0.234802  1.378712
```
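On pandas 2.0 and later, where DataFrame.append() no longer exists, the same "add rows at the end" effect can be had with pd.concat(). A minimal sketch under that assumption, using a small integer frame instead of the random data so the result is reproducible:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=4)
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=dates, columns=list("AB"))

s = df.iloc[0:2]                                  # the rows to add at the end
appended = pd.concat([df, s], ignore_index=True)  # index renumbered 0..5

print(appended)
```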
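VI. Grouping
group by: a process that involves one or more of the following steps:
a. Split: divide the data into groups according to some condition
b. Apply: apply a function to each group independently
c. Combine: combine the results into a single data structure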
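The three steps can be spelled out by hand to see what `groupby` does in one call. This is only an illustrative sketch with a small made-up frame, not how pandas implements grouping internally:

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                   "C": [1.0, 2.0, 3.0, 4.0]})

# a. Split: iterating over a GroupBy object yields (key, sub-frame) pairs.
pieces = {key: group for key, group in df.groupby("A")}

# b. Apply: run a function on each piece independently.
totals = {key: group["C"].sum() for key, group in pieces.items()}

# c. Combine: collect the per-group results into one structure.
manual = pd.Series(totals, name="C")

# The usual one-liner performs the same three steps.
builtin = df.groupby("A")["C"].sum()
print(manual.to_dict() == builtin.to_dict())
```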
import pandas as pd
import numpy as np


def create_numpay_dataform():
    df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                       'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                       'C': np.random.randn(8),
                       'D': np.random.randn(8)})
    return df


def print_data():
    df = create_numpay_dataform()
    print("-------DF-----")
    print(df)
    print("-------Group by one column, then aggregate-----")
    # Select the numeric columns explicitly: newer pandas versions no longer
    # drop the string column B automatically when summing.
    print(df.groupby('A')[['C', 'D']].sum())
    print("-------Group by several columns, then aggregate-------")
    print(df.groupby(["B", "A"]).sum())
    print("-------Same columns in the opposite order-------")
    # The order of the keys decides the order of the index levels
    print(df.groupby(["A", "B"]).sum())


if __name__ == '__main__':
    print_data()

# Output (the values are random and differ between runs):
-------DF-----
     A      B         C         D
0  foo    one  0.453100 -0.544181
1  bar    one  1.692183 -0.253889
2  foo    two -0.656308 -1.177487
3  bar  three -1.078701  1.239209
4  foo    two -0.866770 -0.949062
5  bar    two -1.305346 -1.705380
6  foo    one -0.259537 -1.492884
7  foo  three -0.669982 -0.943082
-------Group by one column, then aggregate-----
            C         D
A
bar -0.691863 -0.720059
foo -1.999496 -5.106695
-------Group by several columns, then aggregate-------
                  C         D
B     A
one   bar  1.692183 -0.253889
      foo  0.193563 -2.037065
three bar -1.078701  1.239209
      foo -0.669982 -0.943082
two   bar -1.305346 -1.705380
      foo -1.523078 -2.126549
-------Same columns in the opposite order-------
                  C         D
A   B
bar one    1.692183 -0.253889
    three -1.078701  1.239209
    two   -1.305346 -1.705380
foo one    0.193563 -2.037065
    three -0.669982 -0.943082
    two   -1.523078 -2.126549
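The grouping example only uses `sum()`. As a hedged sketch with made-up, deterministic data, the block below shows two variations that often come up: applying a different function to each column with named aggregation, and flattening the multi-level index with `reset_index()`. The result names `total_C` and `mean_D` are arbitrary labels chosen here.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "two"],
    "C": [1.0, 2.0, 3.0, 4.0],
    "D": [10.0, 20.0, 30.0, 40.0],
})

# Named aggregation: new_column=(source_column, function).
summary = df.groupby("A").agg(total_C=("C", "sum"), mean_D=("D", "mean"))
print(summary)

# reset_index() turns the group keys back into ordinary columns.
flat = df.groupby(["A", "B"])[["C", "D"]].sum().reset_index()
print(flat)
```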
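VII. Pivot tables
What is a pivot table?
A pivot table is an interactive table in which you can freely choose different combinations of fields to quickly summarize and analyze how the fields in a large dataset relate to one another. It lets you look at a data table from several angles by pivoting on different fields, and it builds cross-tabulations for viewing summaries, analysis results, and digests at different levels of the data.
What are the advantages of a pivot table?
- Quickly group and summarize numeric data, and view it by category and subcategory.
- Expand or collapse the data of interest, and drill down from a summary to its details.
- Build cross-tabulations (move rows to columns or columns to rows) to see different summaries of the data.
- Quickly compute summaries, differences, and similar statistics for numeric data.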
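pivot_table() usage: pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
The four most important parameters are index, values, columns, and aggfunc:
a. index: the column or columns whose values become the row index of the pivot table. Almost every pivot table specifies one; it can be a single-level or a multi-level index.
b. values: the column or columns to aggregate.
c. columns: like index, but sets the column-level fields. It is optional and gives a further way to split the data.
d. aggfunc: the aggregation applied to the data. It can be a function, a function name such as 'sum', or a list of them; the default is 'mean'.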
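To make the parameters concrete, here is a small sketch on made-up sales data (the frame and its column names are invented for illustration). Besides the four main parameters it uses `fill_value` to replace empty cells and `margins` to add the "All" totals named by `margins_name`:

```python
import pandas as pd

# Hypothetical data: there is no "pear" row for the north region.
sales = pd.DataFrame({
    "region": ["north", "south", "south", "south"],
    "product": ["apple", "apple", "apple", "pear"],
    "units": [10, 7, 3, 8],
})

table = pd.pivot_table(
    sales,
    values="units",     # what to aggregate
    index="region",     # row labels
    columns="product",  # column labels
    aggfunc="sum",      # how to aggregate
    fill_value=0,       # show 0 instead of NaN for missing combinations
    margins=True,       # add an "All" row and column with totals
)
print(table)
```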
import pandas as pd
import numpy as np


def create_numpay_dataform():
    df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                       'B': ['A', 'B', 'C'] * 4,
                       'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                       'D': np.random.randn(12),
                       'E': np.random.randn(12)})
    return df


def table_func(data):
    # Unused helper kept from the original post: it returns on the first
    # iteration, so it only reports whether the first element is positive.
    for i in data:
        if i > 0:
            return True
        else:
            return False


def print_data():
    df = create_numpay_dataform()
    print("-------DF-----")
    print(df)
    print("-------Pivot table-----")
    # aggfunc defaults to 'mean'
    print(pd.pivot_table(df, values='E', index=['A', 'B'], columns=["C"]))
    print("--------------------")
    # Function names are passed as strings; newer pandas versions warn
    # when np.sum and np.mean are passed directly.
    print(pd.pivot_table(df, values='E', index=['A', 'B'], columns=["C"], aggfunc=['sum', 'mean']))


if __name__ == '__main__':
    print_data()

# Output (the values are random and differ between runs):
-------DF-----
        A  B    C         D         E
0     one  A  foo  0.596744  1.260272
1     one  B  foo -0.560929  2.077597
2     two  C  foo -1.326983 -0.997230
3   three  A  bar  0.714451  0.520551
4     one  B  bar  2.378704  0.336855
5     one  C  bar -0.771644  0.109514
6     two  A  foo -2.606868 -0.279142
7   three  B  foo -0.775949 -1.383773
8     one  C  foo  0.106014 -0.840803
9     one  A  bar -0.877053  0.090785
10    two  B  bar -1.594153 -1.002086
11  three  C  bar -0.032272 -0.700847
-------Pivot table-----
C             bar       foo
A     B
one   A  0.090785  1.260272
      B  0.336855  2.077597
      C  0.109514 -0.840803
three A  0.520551       NaN
      B       NaN -1.383773
      C -0.700847       NaN
two   A       NaN -0.279142
      B -1.002086       NaN
      C       NaN -0.997230
--------------------
              sum                mean
C             bar       foo       bar       foo
A     B
one   A  0.090785  1.260272  0.090785  1.260272
      B  0.336855  2.077597  0.336855  2.077597
      C  0.109514 -0.840803  0.109514 -0.840803
three A  0.520551       NaN  0.520551       NaN
      B       NaN -1.383773       NaN -1.383773
      C -0.700847       NaN -0.700847       NaN
two   A       NaN -0.279142       NaN -0.279142
      B -1.002086       NaN -1.002086       NaN
      C       NaN -0.997230       NaN -0.997230
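One way to read a pivot table is as a `groupby` on the index and column fields followed by moving the column field out of the row index. The sketch below, on a small made-up frame, builds the same table both ways; it is meant to illustrate the relationship rather than describe pandas internals:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["one", "one", "two", "two", "two"],
    "B": ["x", "y", "x", "x", "y"],
    "C": ["foo", "bar", "foo", "foo", "bar"],
    "E": [1.0, 2.0, 3.0, 5.0, 4.0],
})

# Pivot table: rows from A and B, columns from C, mean of E in each cell.
pivot = pd.pivot_table(df, values="E", index=["A", "B"], columns="C", aggfunc="mean")

# Same result by hand: group on all three fields, then unstack C into columns.
manual = df.groupby(["A", "B", "C"])["E"].mean().unstack("C")

print(pivot)
print(manual)
```

Combinations that never occur in the data, such as ("one", "x") with "bar", show up as NaN in both results.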