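Pandas is mainly used to work with data frames, that is, tabular or heterogeneous data. It has two commonly used data structures: Series and DataFrame. The usual imports are: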
import pandas as pd
from pandas import Series, DataFrame
1. Series
Skipped for now.
2. DataFrame
2.1 Creating a DataFrame
2.1.1 Creating from a dictionary
data = {"Altitude":[1111,2222,3333,4444,5555,6666],
"Block":["a","b","c","d","e","f"],
"Color":[1,2,3,4,5,6],
"Damage":[11,22,33,44,55,66],
"Expression":[101,202,303,404,505,606],
"Forecast":[1,0,1,0,1,0]
}
df = pd.DataFrame(data)
print(df)
Altitude Block Color Damage Expression Forecast
0 1111 a 1 11 101 1
1 2222 b 2 22 202 0
2 3333 c 3 33 303 1
3 4444 d 4 44 404 0
4 5555 e 5 55 505 1
5 6666 f 6 66 606 0
Now we can take a look at the DataFrame we just created:
print("1. Column names:", df.columns)
print("2. Row labels (index):", df.index)
print("3. Values only, without row or column labels:\n", df.values)
1. Column names: Index(['Altitude', 'Block', 'Color', 'Damage', 'Expression', 'Forecast'], dtype='object')
2. Row labels (index): RangeIndex(start=0, stop=6, step=1)
3. Values only, without row or column labels:
[[1111 'a' 1 11 101 1]
[2222 'b' 2 22 202 0]
[3333 'c' 3 33 303 1]
[4444 'd' 4 44 404 0]
[5555 'e' 5 55 505 1]
[6666 'f' 6 66 606 0]]
2.2 Renaming columns and rows
df.columns = ["A","B","C","D","E","F"]
df.index = ["one","two","three","four","five","six"]
print(df)
A B C D E F
one 1111 a 1 11 101 1
two 2222 b 2 22 202 0
three 3333 c 3 33 303 1
four 4444 d 4 44 404 0
five 5555 e 5 55 505 1
six 6666 f 6 66 606 0
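Assigning to df.columns and df.index replaces every label at once. As a side note, here is a minimal sketch of renaming only some labels with rename(), which returns a new DataFrame. The small frame `demo` and the new names are made up for this illustration.

```python
import pandas as pd

# Hypothetical two-column frame, used only for this illustration
demo = pd.DataFrame({"Altitude": [1111, 2222], "Block": ["a", "b"]})

# rename() maps old labels to new ones and returns a new DataFrame;
# labels that are not mentioned keep their names
renamed = demo.rename(columns={"Altitude": "A"}, index={0: "one"})

print(renamed.columns.tolist())  # ['A', 'Block']
print(renamed.index.tolist())    # ['one', 1]
print(demo.columns.tolist())     # ['Altitude', 'Block'] (demo is untouched)
```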
2.3 Indexing
2.3.1 Selecting a column
df.A  # a single column, via attribute access
one 1111
two 2222
three 3333
four 4444
five 5555
six 6666
Name: A, dtype: int64
df["A"]
one 1111
two 2222
three 3333
four 4444
five 5555
six 6666
Name: A, dtype: int64
df.loc[:, "A"]  # select a column by row/column labels; ":" means all rows
one 1111
two 2222
three 3333
four 4444
five 5555
six 6666
Name: A, dtype: int64
df.iloc[:, 0]  # select a column by integer position; positions start at 0, not 1
one 1111
two 2222
three 3333
four 4444
five 5555
six 6666
Name: A, dtype: int64
2.3.2 Selecting a row
df.loc["one"]
A 1111
B a
C 1
D 11
E 101
F 1
Name: one, dtype: object
df.iloc[0, :]  # ":" means all columns
A 1111
B a
C 1
D 11
E 101
F 1
Name: one, dtype: object
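2.3.3 Selecting multiple rows and columns, or the values at specific positions (think of a matrix in mathematics)
To select several rows or columns, wrap the labels or positions in lists, like this: ([ ],[ ]).
Some parts can be left out and some cannot: when selecting rows, the trailing ":" for "all columns" is optional, but when selecting columns with loc or iloc the leading ":" for "all rows" is required. If you can't remember the rules, always use loc or iloc with an explicit colon and you can't go wrong. If you want shorter code, the shortcuts simply have to be memorized.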
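The rule of thumb above can be checked with a small sketch. The frame `demo` is a cut-down stand-in for df, made up for this illustration.

```python
import pandas as pd

# Hypothetical three-row frame standing in for df
demo = pd.DataFrame(
    {"A": [1111, 2222, 3333], "B": ["a", "b", "c"]},
    index=["one", "two", "three"],
)

# Rows: the trailing ":" is optional
rows_short = demo.loc[["one", "two"]]
rows_full = demo.loc[["one", "two"], :]
print(rows_short.equals(rows_full))  # True

# Columns: plain brackets are the shortcut; with loc/iloc the leading ":" is needed
cols_short = demo[["A", "B"]]
cols_loc = demo.loc[:, ["A", "B"]]
cols_iloc = demo.iloc[:, [0, 1]]
print(cols_short.equals(cols_loc) and cols_loc.equals(cols_iloc))  # True

# A single label gives a Series, a one-element list gives a DataFrame
print(type(demo["A"]).__name__)    # Series
print(type(demo[["A"]]).__name__)  # DataFrame
```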
2.3.3.1 Selecting multiple columns
df[["A","B"]]  # multiple columns
          A  B
one    1111  a
two    2222  b
three  3333  c
four   4444  d
five   5555  e
six    6666  f
df.loc[:,["A","B"]]
          A  B
one    1111  a
two    2222  b
three  3333  c
four   4444  d
five   5555  e
six    6666  f
df.iloc[:,[0, 1]]
          A  B
one    1111  a
two    2222  b
three  3333  c
four   4444  d
five   5555  e
six    6666  f
2.3.3.2 Selecting multiple rows
df.loc[["one","two"]]
        A  B  C   D    E  F
one  1111  a  1  11  101  1
two  2222  b  2  22  202  0
df.loc[["one","two"],:]
        A  B  C   D    E  F
one  1111  a  1  11  101  1
two  2222  b  2  22  202  0
df.iloc[[0,1]]
        A  B  C   D    E  F
one  1111  a  1  11  101  1
two  2222  b  2  22  202  0
df.iloc[[0,1],:]
        A  B  C   D    E  F
one  1111  a  1  11  101  1
two  2222  b  2  22  202  0
2.3.3.3 Selecting the values at specific positions
df.loc[["one","three"],["A","D"]]
          A   D
one    1111  11
three  3333  33
df.iloc[[0,2],[0,3]]
          A   D
one    1111  11
three  3333  33
df.loc[["one","three"],"A"]
one 1111
three 3333
Name: A, dtype: int64
df.iloc[0,[0,3]]
A 1111
D 11
Name: one, dtype: object
So you can pick out pretty much anything you want.
2.4 Adding a new column or changing the values of a column
df["G"] = "new"  # add a new column
print(df)
A B C D E F G
one 1111 a 1 11 101 1 new
two 2222 b 2 22 202 0 new
three 3333 c 3 33 303 1 new
four 4444 d 4 44 404 0 new
five 5555 e 5 55 505 1 new
six 6666 f 6 66 606 0 new
df["A"] = "changed"  # overwrite every value in column A
print(df)
A B C D E F G
one changed a 1 11 101 1 new
two changed b 2 22 202 0 new
three changed c 3 33 303 1 new
four changed d 4 44 404 0 new
five changed e 5 55 505 1 new
six changed f 6 66 606 0 new
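2.5 Deleting rows or columns
The key is setting the axis and inplace parameters of drop(): axis chooses rows (0) or columns (1), and inplace decides whether df itself is modified or a new DataFrame is returned.
2.5.1 Deleting columns
del df["A"]  # delete column A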
print(df)
B C D E F G
one a 1 11 101 1 new
two b 2 22 202 0 new
three c 3 33 303 1 new
four d 4 44 404 0 new
five e 5 55 505 1 new
six f 6 66 606 0 new
df.drop(["G"],axis=1)  # drop column G; without inplace=True this returns a new DataFrame and df itself is not changed
print("Compare the two:")
print(df)
print("====" * 8)
print(df.drop(["G"],axis=1))
Compare the two:
B C D E F G
one a 1 11 101 1 new
two b 2 22 202 0 new
three c 3 33 303 1 new
four d 4 44 404 0 new
five e 5 55 505 1 new
six f 6 66 606 0 new
================================
B C D E F
one a 1 11 101 1
two b 2 22 202 0
three c 3 33 303 1
four d 4 44 404 0
five e 5 55 505 1
six f 6 66 606 0
df.drop(["G"],axis=1,inplace=True)  # drop column G in place; this time df really changes
print(df)
B C D E F
one a 1 11 101 1
two b 2 22 202 0
three c 3 33 303 1
four d 4 44 404 0
five e 5 55 505 1
six f 6 66 606 0
2.5.2 Deleting rows
df.drop(["six"], axis=0, inplace=True)  # delete the row labeled "six"
print(df)
B C D E F
one a 1 11 101 1
two b 2 22 202 0
three c 3 33 303 1
four d 4 44 404 0
five e 5 55 505 1
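An alternative to inplace=True is to keep the DataFrame that drop() returns. The sketch below uses the `columns=` and `index=` keywords of drop(), which do the same job as the label list plus axis; the frame `demo` is made up for this illustration.

```python
import pandas as pd

# Hypothetical two-row frame, used only for this illustration
demo = pd.DataFrame({"B": ["a", "b"], "G": ["new", "new"]}, index=["one", "two"])

# drop() returns a new DataFrame; keep it by assigning the result
trimmed = demo.drop(columns=["G"])  # same as demo.drop(["G"], axis=1)
shorter = demo.drop(index=["two"])  # same as demo.drop(["two"], axis=0)

print(trimmed.columns.tolist())  # ['B']
print(shorter.index.tolist())    # ['one']
print(demo.shape)                # (2, 2): demo itself is unchanged
```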
2.6 Transposing a DataFrame
df.T  # note: this only shows what the transpose looks like; df itself stays as it was
   one  two  three  four  five
B    a    b      c     d     e
C    1    2      3     4     5
D   11   22     33    44    55
E  101  202    303   404   505
F    1    0      1     0     1
df
       B  C   D    E  F
one    a  1  11  101  1
two    b  2  22  202  0
three  c  3  33  303  1
four   d  4  44  404  0
five   e  5  55  505  1
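2.7 Sorting by row labels or column values: sort_index() and sort_values(), ascending or descending
Both sort along axis=0 (reordering the rows) and in ascending order (ascending=True) by default.
2.7.1 Sorting by row labels
The examples in this section use the full six-row DataFrame with columns A to F from section 2.2 again, not the reduced one left after section 2.5:
df
          A  B  C   D    E  F
one    1111  a  1  11  101  1
two    2222  b  2  22  202  0
three  3333  c  3  33  303  1
four   4444  d  4  44  404  0
five   5555  e  5  55  505  1
six    6666  f  6  66  606  0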
df.sort_index()
          A  B  C   D    E  F
five   5555  e  5  55  505  1
four   4444  d  4  44  404  0
one    1111  a  1  11  101  1
six    6666  f  6  66  606  0
three  3333  c  3  33  303  1
two    2222  b  2  22  202  0
df.sort_index(axis=0)
          A  B  C   D    E  F
five   5555  e  5  55  505  1
four   4444  d  4  44  404  0
one    1111  a  1  11  101  1
six    6666  f  6  66  606  0
three  3333  c  3  33  303  1
two    2222  b  2  22  202  0
df.sort_index(ascending=True)
          A  B  C   D    E  F
five   5555  e  5  55  505  1
four   4444  d  4  44  404  0
one    1111  a  1  11  101  1
six    6666  f  6  66  606  0
three  3333  c  3  33  303  1
two    2222  b  2  22  202  0
df.sort_index(ascending=False)  # descending order
          A  B  C   D    E  F
two    2222  b  2  22  202  0
three  3333  c  3  33  303  1
six    6666  f  6  66  606  0
one    1111  a  1  11  101  1
four   4444  d  4  44  404  0
five   5555  e  5  55  505  1
2.7.2 Sorting by column values
df.sort_values(by=["A"])
          A  B  C   D    E  F
one    1111  a  1  11  101  1
two    2222  b  2  22  202  0
three  3333  c  3  33  303  1
four   4444  d  4  44  404  0
five   5555  e  5  55  505  1
six    6666  f  6  66  606  0
Let's change the data so that the effect of sorting by several columns is easier to see:
df.A = [1111,1111,1111,4444,4444,4444]
df.sort_values(by=["A","F"],ascending=False)  # sort by A first, then by F among equal A values, both descending
          A  B  C   D    E  F
five   4444  e  5  55  505  1
four   4444  d  4  44  404  0
six    4444  f  6  66  606  0
one    1111  a  1  11  101  1
three  1111  c  3  33  303  1
two    1111  b  2  22  202  0
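When sorting by several columns, `ascending` also accepts a list with one flag per column. A minimal sketch, using a made-up frame `demo` rather than the df above:

```python
import pandas as pd

# Hypothetical data, used only for this illustration
demo = pd.DataFrame(
    {"A": [1111, 1111, 4444, 4444], "F": [0, 1, 0, 1]},
    index=["one", "two", "four", "five"],
)

# A ascending; among rows with the same A, F descending
result = demo.sort_values(by=["A", "F"], ascending=[True, False])
print(result.index.tolist())  # ['two', 'one', 'five', 'four']
```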
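2.8 Ranking along rows or columns: rank()
Ranking tells you whether a value comes first, second, and so on among the others. Unlike the sorting above, it does not reorder the data.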
df2 = pd.DataFrame({"col1":[1,2,1,3,3,5],
"col2":[777,999,888,777,777,666]})
df2
   col1  col2
0     1   777
1     2   999
2     1   888
3     3   777
4     3   777
5     5   666
df2.rank()  # by default each value gets its rank within its own column; fractional ranks appear because tied values share the average of their positions
   col1  col2
0   1.5   3.0
1   3.0   6.0
2   1.5   5.0
3   4.5   3.0
4   4.5   3.0
5   6.0   1.0
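Parameters:
- method: how tied values are ranked. "average" (default): ties get the mean of the positions they occupy. "min": ties all get the lowest position of the group, and the next group's rank skips ahead. It works like handing out awards: with two second places there is no third place, the next one is fourth. "max": ties all get the highest position of the group. "first": ties are ranked in order of appearance, so every rank is distinct.
"dense": like "min", but the ranks never skip; the next group is always exactly one rank higher. Compare the outputs below if this is unclear.
- axis: axis=0 by default, which ranks within each column. Set axis=1 to rank within each row.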
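The difference between the methods is easiest to see on a tiny Series with one tie. This is a small sketch with made-up values, not part of the df2 example:

```python
import pandas as pd

# Hypothetical scores with one tie (the two 20s)
scores = pd.Series([10, 20, 20, 30])

ranks = {}
for method in ["average", "min", "max", "first", "dense"]:
    ranks[method] = scores.rank(method=method).tolist()
    print(method, ranks[method])

# average [1.0, 2.5, 2.5, 4.0]
# min     [1.0, 2.0, 2.0, 4.0]   (rank 3 is skipped)
# max     [1.0, 3.0, 3.0, 4.0]
# first   [1.0, 2.0, 3.0, 4.0]
# dense   [1.0, 2.0, 2.0, 3.0]   (no gap after the tie)
```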
df2.rank(method="first")
   col1  col2
0   1.0   2.0
1   3.0   6.0
2   2.0   5.0
3   4.0   3.0
4   5.0   4.0
5   6.0   1.0
df2.rank(method="dense")
   col1  col2
0   1.0   2.0
1   2.0   4.0
2   1.0   3.0
3   3.0   2.0
4   3.0   2.0
5   4.0   1.0
df2.rank(method="min")
   col1  col2
0   1.0   2.0
1   3.0   6.0
2   1.0   5.0
3   4.0   2.0
4   4.0   2.0
5   6.0   1.0
df2.rank(method="max")
   col1  col2
0   2.0   4.0
1   3.0   6.0
2   2.0   5.0
3   5.0   4.0
4   5.0   4.0
5   6.0   1.0
df2.rank(axis=1)
   col1  col2
0   1.0   2.0
1   1.0   2.0
2   1.0   2.0
3   1.0   2.0
4   1.0   2.0
5   1.0   2.0
2.9 Row and column sums: sum()
df2.sum()  # sum of each column; the default is axis=0, i.e. adding down the rows
col1 15
col2 4884
dtype: int64
df2.sum(axis=1)  # sum of each row
0 778
1 1001
2 889
3 780
4 780
5 671
dtype: int64
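2.10 Row and column means: mean()
As long as a column or row is not entirely NA, the NA values are skipped automatically when the mean is computed. Set skipna=False to stop skipping them.
df2.mean()  # mean of each column (the default)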
col1 2.5
col2 814.0
dtype: float64
df2.mean(axis=1)  # mean of each row
0 389.0
1 500.5
2 444.5
3 390.0
4 390.0
5 335.5
dtype: float64
import numpy as np
df3 = pd.DataFrame({"col1":[1,2,np.nan],
"col2":[666,8888,345]})
df3
   col1  col2
0   1.0   666
1   2.0  8888
2   NaN   345
df3.mean()
col1 1.500000
col2 3299.666667
dtype: float64
df3.mean(skipna=False)
col1 NaN
col2 3299.666667
dtype: float64
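The "not entirely NA" condition can be checked with a small sketch. The frame `demo` and its column names are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one column with a single NA, one column that is all NA
demo = pd.DataFrame({
    "partly_na": [1.0, np.nan, 3.0],
    "all_na": [np.nan, np.nan, np.nan],
})

skipped = demo.mean()             # NA values are skipped (skipna=True by default)
strict = demo.mean(skipna=False)  # any NA makes the result NA

print(skipped["partly_na"])  # 2.0, the mean of 1.0 and 3.0
print(skipped["all_na"])     # nan, nothing left to average
print(strict["partly_na"])   # nan
```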
2.11 Descriptive statistics and summary functions
The iris dataset from sklearn serves as the example here.
from sklearn.datasets import load_iris
iris = load_iris()
x = iris.data
x = pd.DataFrame(x, columns = iris.feature_names)
x.describe()
# Shows the count of non-NA values, mean, standard deviation, minimum, quartiles and maximum. This covers numeric columns only; use df.describe(include=['O']) for object (string/categorical) columns, or include='all' for all columns.
       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
count         150.000000        150.000000         150.000000        150.000000
mean            5.843333          3.054000           3.758667          1.198667
std             0.828066          0.433594           1.764420          0.763161
min             4.300000          2.000000           1.000000          0.100000
25%             5.100000          2.800000           1.600000          0.300000
50%             5.800000          3.000000           4.350000          1.300000
75%             6.400000          3.300000           5.100000          1.800000
max             7.900000          4.400000           6.900000          2.500000
x.count()  # number of non-NA values in each column
sepal length (cm) 150
sepal width (cm) 150
petal length (cm) 150
petal width (cm) 150
dtype: int64
x.min()  # minimum of each column
sepal length (cm) 4.3
sepal width (cm) 2.0
petal length (cm) 1.0
petal width (cm) 0.1
dtype: float64
x.max()  # maximum of each column
sepal length (cm) 7.9
sepal width (cm) 4.4
petal length (cm) 6.9
petal width (cm) 2.5
dtype: float64
x.idxmin()  # row label of the minimum of each column
sepal length (cm) 13
sepal width (cm) 60
petal length (cm) 22
petal width (cm) 9
dtype: int64
x.idxmax()  # row label of the maximum of each column
sepal length (cm) 131
sepal width (cm) 15
petal length (cm) 118
petal width (cm) 100
dtype: int64
x.quantile(0.25)
# 0 <= q <= 1: returns the value at the given quantile. The default is q=0.5, the median. If this is unclear, try 0.75 and compare with the output of x.describe().
sepal length (cm) 5.1
sepal width (cm) 2.8
petal length (cm) 1.6
petal width (cm) 0.3
Name: 0.25, dtype: float64
x.quantile()
sepal length (cm) 5.80
sepal width (cm) 3.00
petal length (cm) 4.35
petal width (cm) 1.30
Name: 0.5, dtype: float64
x.median()  # median of each column
sepal length (cm) 5.80
sepal width (cm) 3.00
petal length (cm) 4.35
petal width (cm) 1.30
dtype: float64
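x.mad()  # mean absolute deviation around the mean: the average of |value - column mean|. Note that mad() was deprecated in pandas 1.5 and removed in 2.0; (x - x.mean()).abs().mean() gives the same result.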
sepal length (cm) 0.687556
sepal width (cm) 0.333093
petal length (cm) 1.561920
petal width (cm) 0.658933
dtype: float64
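Since mad() is not available in recent pandas versions, here is a minimal sketch of computing the mean absolute deviation by hand, on made-up numbers that are easy to check mentally:

```python
import pandas as pd

# Hypothetical one-column frame, used only for this illustration
demo = pd.DataFrame({"v": [1.0, 2.0, 3.0, 6.0]})

# Column mean is 3.0; absolute deviations are 2, 1, 0, 3; their mean is 1.5
mad = (demo - demo.mean()).abs().mean()
print(mad["v"])  # 1.5
```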
x.prod()  # product of all values in each column
sepal length (cm) 2.257440e+114
sepal width (cm) 1.197477e+72
petal length (cm) 3.774489e+76
petal width (cm) 2.972714e-12
dtype: float64
x.var()  # sample variance of each column
sepal length (cm) 0.685694
sepal width (cm) 0.188004
petal length (cm) 3.113179
petal width (cm) 0.582414
dtype: float64
x.std()  # sample standard deviation of each column
sepal length (cm) 0.828066
sepal width (cm) 0.433594
petal length (cm) 1.764420
petal width (cm) 0.763161
dtype: float64
x.skew()  # sample skewness (third moment)
sepal length (cm) 0.314911
sepal width (cm) 0.334053
petal length (cm) -0.274464
petal width (cm) -0.104997
dtype: float64
x.kurt()  # sample kurtosis (fourth moment)
sepal length (cm) -0.552064
sepal width (cm) 0.290781
petal length (cm) -1.401921
petal width (cm) -1.339754
dtype: float64
The following functions are easier to follow on a small part of x:
xx = x.head()
xx
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
xx.cumsum()  # cumulative sum down each column
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1               10.0               6.5                2.8               0.4
2               14.7               9.7                4.1               0.6
3               19.3              12.8                5.6               0.8
4               24.3              16.4                7.0               1.0
xx.cummin()  # running minimum: the smallest value seen so far in each column. For the values 3, 1, 2 the result is 3, 1, 1.
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.0                1.3               0.2
3                4.6               3.0                1.3               0.2
4                4.6               3.0                1.3               0.2
xx.cummax()  # running maximum: the largest value seen so far in each column. For the values 1, 3, 2 the result is 1, 3, 3.
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                5.1               3.5                1.4               0.2
2                5.1               3.5                1.4               0.2
3                5.1               3.5                1.5               0.2
4                5.1               3.6                1.5               0.2
xx.cumprod()  # cumulative product down each column
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0             5.1000             3.500             1.4000           0.20000
1            24.9900            10.500             1.9600           0.04000
2           117.4530            33.600             2.5480           0.00800
3           540.2838           104.160             3.8220           0.00160
4          2701.4190           374.976             5.3508           0.00032
xx.diff()  # each value minus the previous one in the same column; commonly used with time series
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                NaN               NaN                NaN               NaN
1               -0.2              -0.5                0.0               0.0
2               -0.2               0.2               -0.1               0.0
3               -0.1              -0.1                0.2               0.0
4                0.4               0.5               -0.1               0.0
xx.pct_change()  # (current value - previous value) / previous value, i.e. the relative change as a fraction
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                NaN               NaN                NaN               NaN
1          -0.039216         -0.142857           0.000000               0.0
2          -0.040816          0.066667          -0.071429               0.0
3          -0.021277         -0.031250           0.153846               0.0
4           0.086957          0.161290          -0.066667               0.0
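The formula in the comment can be checked against diff() and shift(). This is a minimal sketch on made-up values; `prices` is not part of the iris example.

```python
import numpy as np
import pandas as pd

# Hypothetical series, used only for this illustration
prices = pd.Series([100.0, 150.0, 75.0])

# (current - previous) / previous, written out by hand
by_hand = prices.diff() / prices.shift(1)
built_in = prices.pct_change()

print(built_in.tolist())  # [nan, 0.5, -0.5]
print(np.allclose(by_hand, built_in, equal_nan=True))  # True
```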
xx.corr()  # pairwise correlation between the columns; petal width is constant (0.2) in these five rows, so its correlations are NaN
                   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
sepal length (cm)           1.000000          0.680019          -0.170499               NaN
sepal width (cm)            0.680019          1.000000          -0.136590               NaN
petal length (cm)          -0.170499         -0.136590           1.000000               NaN
petal width (cm)                 NaN               NaN                NaN               NaN
xx.columns
Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
'petal width (cm)'],
dtype='object')
xx.corrwith(xx["sepal width (cm)"])  # linear correlation coefficient between sepal width (cm) and each column
sepal length (cm) 0.680019
sepal width (cm) 1.000000
petal length (cm) -0.136590
petal width (cm) NaN
dtype: float64
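The NaN entries for petal width come from that column being constant in these five rows: with zero variance the correlation is undefined. A small sketch with made-up columns shows the same effect:

```python
import pandas as pd

# Hypothetical data: y is exactly 2 * x, and one column never changes
demo = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "constant": [0.2, 0.2, 0.2, 0.2],
})

corr = demo.corr()
print(round(corr.loc["x", "y"], 6))    # 1.0, a perfect linear relationship
print(corr["constant"].isna().all())  # True, undefined for a constant column
```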
2.12 Unique values, value counts and membership
df4 = pd.DataFrame({"a":[111,222,111,333,333,333],
"b":[1,0,1,0,1,1]})
df4
     a  b
0  111  1
1  222  0
2  111  1
3  333  0
4  333  1
5  333  1
df4.a.unique()  # the distinct values, with duplicates removed
array([111, 222, 333], dtype=int64)
df4.a.value_counts()  # how many times each value occurs
333 3
111 2
222 1
Name: a, dtype: int64
df4.a.isin([111,222])  # whether each value is in the list we specify
0 True
1 True
2 True
3 False
4 False
5 False
Name: a, dtype: bool
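The boolean Series returned by isin() is typically used as a mask to filter rows. A minimal sketch, rebuilding df4 so that it runs on its own:

```python
import pandas as pd

df4 = pd.DataFrame({"a": [111, 222, 111, 333, 333, 333],
                    "b": [1, 0, 1, 0, 1, 1]})

mask = df4["a"].isin([111, 222])
subset = df4[mask]  # keep only the rows where the mask is True
rest = df4[~mask]   # "~" inverts the mask

print(subset.index.tolist())       # [0, 1, 2]
print(rest["a"].unique().tolist()) # [333]
```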
To be continued.