Official pandas documentation: https://pandas.pydata.org/pandas-docs/stable/?v=20190307135750

pandas is built on top of NumPy and is designed for working with text and tabular data. It provides two main data structures: Series, which resembles a one-dimensional NumPy array, and DataFrame, which resembles a multi-dimensional table.
pandas is the core data-analysis module in Python. It provides five major capabilities (a small sketch follows the list):
- File I/O: databases (SQL), HTML, JSON, pickle, CSV (txt, Excel), SAS, Stata, HDF, and more.
- Single-table operations: insert/delete/update/select, slicing, higher-order functions, group-by aggregation, and conversion to and from dict and list.
- Joining and merging multiple tables.
- Basic plotting.
- Basic statistical analysis.
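A minimal sketch touching several of these capabilities at once; the file name scores.csv and the sample data are hypothetical, not from the original:

```python
import pandas as pd

# build a frame from a dict, round-trip it through a CSV file,
# then aggregate per group -- file name and data are made up
df = pd.DataFrame({'name': ['a', 'b', 'a'], 'score': [1, 2, 3]})
df.to_csv('scores.csv', index=False)            # file I/O
df = pd.read_csv('scores.csv')                  # file I/O
print(df.groupby('name')['score'].sum())        # group-by aggregation
print(df.to_dict())                             # conversion to dict
```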
1. The Series data structure
A Series is a one-dimensional, array-like object made up of a set of values and an associated set of data labels (its index).
A Series behaves like a cross between a list (array) and a dict.
```python
import numpy as np
import pandas as pd
```

```python
df = pd.Series(0, index=['a', 'b', 'c', 'd'])
print(df)
```
```
a    0
b    0
c    0
d    0
dtype: int64
```

```python
print(df.index)
```

```
Index(['a', 'b', 'c', 'd'], dtype='object')
```
1.1 Series supports NumPy-style behavior (positional access)
| Feature | Example |
| --- | --- |
| Create a Series from an ndarray | Series(arr) |
| Arithmetic with a scalar | df * 2 |
| Arithmetic between two Series | df1 + df2 |
| Indexing | df[0], df[[1, 2, 4]] |
| Slicing | df[0:2] |
| Universal functions | np.abs(df) |
| Boolean filtering | df[df > 0] |
```python
arr = np.array([1, 2, 3, 4, np.nan])
print(arr)
```

```python
df = pd.Series(arr, index=['a', 'b', 'c', 'd', 'e'])
print(df)
```

```
a    1.0
b    2.0
c    3.0
d    4.0
e    NaN
dtype: float64
```
```python
# arithmetic between two Series: element-wise product
print(df * df)
```

```
a     1.0
b     4.0
c     9.0
d    16.0
e     NaN
dtype: float64
```

```python
# slicing
print(df[0:3])
```

```
a    1.0
b    2.0
c    3.0
dtype: float64
```

```python
print(df[0:2])
```

```
a    1.0
b    2.0
dtype: float64
```

```python
# universal functions
print(np.sin(df))
```

```
a    0.841471
b    0.909297
c    0.141120
d   -0.756802
e         NaN
dtype: float64
```

```python
# boolean filtering
print(df[df > 1])
```

```
b    2.0
c    3.0
d    4.0
dtype: float64
```
1.2 Series supports dict-style behavior (label access)
| Feature | Example |
| --- | --- |
| Create a Series from a dict | Series(dic) |
| in operator | 'a' in sr |
| Key-based indexing | sr['a'], sr[['a', 'b', 'd']] |
```python
df = pd.Series({'a': 1, 'b': 2})
print(df)
```

```
a    1
b    2
dtype: int64
```
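The in operator and key-based indexing from the table behave exactly as they do on a dict; a quick sketch:

```python
print('a' in df)       # True  -- membership is tested against the index
print(df['a'])         # 1     -- access by a single key
print(df[['a', 'b']])  # a sub-Series selected by a list of keys
```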
1.3 Handling missing data in a Series
| Method | Description |
| --- | --- |
| dropna() | Drop entries whose value is NaN |
| fillna() | Fill in missing values |
| isnull() | Return a boolean array that is True where values are missing |
| notnull() | Return a boolean array that is False where values are missing |
```python
df = pd.Series([1, 2, 3, 4, np.nan], index=['a', 'b', 'c', 'd', 'e'])
print(df)
```

```
a    1.0
b    2.0
c    3.0
d    4.0
e    NaN
dtype: float64
```
```python
print(df.dropna())
```

```
a    1.0
b    2.0
c    3.0
d    4.0
dtype: float64
```

```python
print(df.fillna(5))
```

```
a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
dtype: float64
```

```python
print(df.isnull())
```

```
a    False
b    False
c    False
d    False
e     True
dtype: bool
```

```python
print(df.notnull())
```

```
a     True
b     True
c     True
d     True
e    False
dtype: bool
```
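The boolean Series returned by notnull() can itself be used as a mask, which gives the same result as dropna(); a minimal sketch:

```python
# keep only the entries whose value is not missing -- same rows as df.dropna()
print(df[df.notnull()])
```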
2. The DataFrame data structure
A DataFrame is a tabular data structure that holds an ordered collection of columns.
A DataFrame can be thought of as a dict of Series that all share the same index.
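To make the "dict of Series" view concrete, a minimal sketch (the column names are arbitrary):

```python
# two Series sharing the default index 0..2 become two columns
d = {
    'c1': pd.Series([1, 2, 3]),
    'c2': pd.Series([4.0, 5.0, 6.0]),
}
print(pd.DataFrame(d))
```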
2.1 Generating an array of time points: date_range
Parameters of date_range:
| Parameter | Description |
| --- | --- |
| start | start time |
| end | end time |
| periods | number of periods to generate |
| freq | frequency; defaults to 'D' (day). Options include H (hour), W (week), B (business day), SM (semi-month), T/min (minute), S (second), A (year), ... |
```python
dates = pd.date_range('20190101', periods=6, freq='M')
print(dates)
```

```
DatetimeIndex(['2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30',
               '2019-05-31', '2019-06-30'],
              dtype='datetime64[ns]', freq='M')
```

```python
np.random.seed(1)
arr = 10 * np.random.randn(6, 4)
print(arr)
```

```
[[ 16.24345364  -6.11756414  -5.28171752 -10.72968622]
 [  8.65407629 -23.01538697  17.44811764  -7.61206901]
 [  3.19039096  -2.49370375  14.62107937 -20.60140709]
 [ -3.22417204  -3.84054355  11.33769442 -10.99891267]
 [ -1.72428208  -8.77858418   0.42213747   5.82815214]
 [-11.00619177  11.4472371    9.01590721   5.02494339]]
```

```python
df = pd.DataFrame(arr, index=dates, columns=['c1', 'c2', 'c3', 'c4'])
df
```
|   | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-31 | 16.243454 | -6.117564 | -5.281718 | -10.729686 |
| 2019-02-28 | 8.654076 | -23.015387 | 17.448118 | -7.612069 |
| 2019-03-31 | 3.190391 | -2.493704 | 14.621079 | -20.601407 |
| 2019-04-30 | -3.224172 | -3.840544 | 11.337694 | -10.998913 |
| 2019-05-31 | -1.724282 | -8.778584 | 0.422137 | 5.828152 |
| 2019-06-30 | -11.006192 | 11.447237 | 9.015907 | 5.024943 |
3. DataFrame attributes
| Attribute | Description |
| --- | --- |
| dtypes | data type of each column |
| index | the row index |
| columns | the column labels |
| values | the data as a NumPy array, without index or column labels |
| describe | per-column summary statistics (count, mean, std, min/max, quartiles); numeric columns only |
| transpose | transpose the frame; also available as T |
| sort_index | sort by index labels, along either axis |
| sort_values | sort by data values |
```python
# View the data type of each column
print(df.dtypes)
```

```
c1    float64
c2    float64
c3    float64
c4    float64
dtype: object
```
The frame itself, for reference:

|   | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-31 | 16.243454 | -6.117564 | -5.281718 | -10.729686 |
| 2019-02-28 | 8.654076 | -23.015387 | 17.448118 | -7.612069 |
| 2019-03-31 | 3.190391 | -2.493704 | 14.621079 | -20.601407 |
| 2019-04-30 | -3.224172 | -3.840544 | 11.337694 | -10.998913 |
| 2019-05-31 | -1.724282 | -8.778584 | 0.422137 | 5.828152 |
| 2019-06-30 | -11.006192 | 11.447237 | 9.015907 | 5.024943 |
```python
# View the row index
print(df.index)
```

```
DatetimeIndex(['2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30',
               '2019-05-31', '2019-06-30'],
              dtype='datetime64[ns]', freq='M')
```

```python
# View the column labels
print(df.columns)
```

```
Index(['c1', 'c2', 'c3', 'c4'], dtype='object')
```

```python
# View the underlying data without index or column labels
print(df.values)
```

```
[[ 16.24345364  -6.11756414  -5.28171752 -10.72968622]
 [  8.65407629 -23.01538697  17.44811764  -7.61206901]
 [  3.19039096  -2.49370375  14.62107937 -20.60140709]
 [ -3.22417204  -3.84054355  11.33769442 -10.99891267]
 [ -1.72428208  -8.77858418   0.42213747   5.82815214]
 [-11.00619177  11.4472371    9.01590721   5.02494339]]
```
```python
# Per-column summary statistics
df.describe()
```

|   | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
| mean | 2.022213 | -5.466424 | 7.927203 | -6.514830 |
| std | 9.580084 | 11.107772 | 8.707171 | 10.227641 |
| min | -11.006192 | -23.015387 | -5.281718 | -20.601407 |
| 25% | -2.849200 | -8.113329 | 2.570580 | -10.931606 |
| 50% | 0.733054 | -4.979054 | 10.176801 | -9.170878 |
| 75% | 7.288155 | -2.830414 | 13.800233 | 1.865690 |
| max | 16.243454 | 11.447237 | 17.448118 | 5.828152 |
```python
# Transpose (equivalently df.transpose())
df.T
```

|   | 2019-01-31 00:00:00 | 2019-02-28 00:00:00 | 2019-03-31 00:00:00 | 2019-04-30 00:00:00 | 2019-05-31 00:00:00 | 2019-06-30 00:00:00 |
| --- | --- | --- | --- | --- | --- | --- |
| c1 | 16.243454 | 8.654076 | 3.190391 | -3.224172 | -1.724282 | -11.006192 |
| c2 | -6.117564 | -23.015387 | -2.493704 | -3.840544 | -8.778584 | 11.447237 |
| c3 | -5.281718 | 17.448118 | 14.621079 | 11.337694 | 0.422137 | 9.015907 |
| c4 | -10.729686 | -7.612069 | -20.601407 | -10.998913 | 5.828152 | 5.024943 |
```python
# axis=0: sort by the row labels (the dates); ascending by default
df.sort_index(axis=0)
```
|   | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-31 | 16.243454 | -6.117564 | -5.281718 | -10.729686 |
| 2019-02-28 | 8.654076 | -23.015387 | 17.448118 | -7.612069 |
| 2019-03-31 | 3.190391 | -2.493704 | 14.621079 | -20.601407 |
| 2019-04-30 | -3.224172 | -3.840544 | 11.337694 | -10.998913 |
| 2019-05-31 | -1.724282 | -8.778584 | 0.422137 | 5.828152 |
| 2019-06-30 | -11.006192 | 11.447237 | 9.015907 | 5.024943 |
```python
# axis=1: sort by the column labels (c1..c4); ascending by default
df.sort_index(axis=1)
```
|   | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-31 | 16.243454 | -6.117564 | -5.281718 | -10.729686 |
| 2019-02-28 | 8.654076 | -23.015387 | 17.448118 | -7.612069 |
| 2019-03-31 | 3.190391 | -2.493704 | 14.621079 | -20.601407 |
| 2019-04-30 | -3.224172 | -3.840544 | 11.337694 | -10.998913 |
| 2019-05-31 | -1.724282 | -8.778584 | 0.422137 | 5.828152 |
| 2019-06-30 | -11.006192 | 11.447237 | 9.015907 | 5.024943 |
```python
# Sort by the values in column c2, ascending by default
df.sort_values(by='c2')
```
|   | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-02-28 | 8.654076 | -23.015387 | 17.448118 | -7.612069 |
| 2019-05-31 | -1.724282 | -8.778584 | 0.422137 | 5.828152 |
| 2019-01-31 | 16.243454 | -6.117564 | -5.281718 | -10.729686 |
| 2019-04-30 | -3.224172 | -3.840544 | 11.337694 | -10.998913 |
| 2019-03-31 | 3.190391 | -2.493704 | 14.621079 | -20.601407 |
| 2019-06-30 | -11.006192 | 11.447237 | 9.015907 | 5.024943 |
4. Selecting data from a DataFrame
Recall the sample frame:

|   | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-31 | 16.243454 | -6.117564 | -5.281718 | -10.729686 |
| 2019-02-28 | 8.654076 | -23.015387 | 17.448118 | -7.612069 |
| 2019-03-31 | 3.190391 | -2.493704 | 14.621079 | -20.601407 |
| 2019-04-30 | -3.224172 | -3.840544 | 11.337694 | -10.998913 |
| 2019-05-31 | -1.724282 | -8.778584 | 0.422137 | 5.828152 |
| 2019-06-30 | -11.006192 | 11.447237 | 9.015907 | 5.024943 |
4.1 Selecting by column

```python
# A single column label returns a Series
df['c2']
```

```
2019-01-31    -6.117564
2019-02-28   -23.015387
2019-03-31    -2.493704
2019-04-30    -3.840544
2019-05-31    -8.778584
2019-06-30    11.447237
Freq: M, Name: c2, dtype: float64
```

```python
# A list of column labels returns a sub-frame
df[['c2', 'c3']]
```

|   | c2 | c3 |
| --- | --- |
| 2019-01-31 | -6.117564 | -5.281718 |
| 2019-02-28 | -23.015387 | 17.448118 |
| 2019-03-31 | -2.493704 | 14.621079 |
| 2019-04-30 | -3.840544 | 11.337694 |
| 2019-05-31 | -8.778584 | 0.422137 |
| 2019-06-30 | 11.447237 | 9.015907 |
4.2 loc (selecting by row label)

```python
# Select rows by label; partial date strings slice a DatetimeIndex
df.loc['2019-01':'2019-03']
```

|   | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-31 | 16.243454 | -6.117564 | -5.281718 | -10.729686 |
| 2019-02-28 | 8.654076 | -23.015387 | 17.448118 | -7.612069 |
| 2019-03-31 | 3.190391 | -2.493704 | 14.621079 | -20.601407 |
4.3 iloc (NumPy-style positional indexing)

```python
# iloc indexes by position, as on the underlying array
df.values
```

```
array([[ 16.24345364,  -6.11756414,  -5.28171752, -10.72968622],
       [  8.65407629, -23.01538697,  17.44811764,  -7.61206901],
       [  3.19039096,  -2.49370375,  14.62107937, -20.60140709],
       [ -3.22417204,  -3.84054355,  11.33769442, -10.99891267],
       [ -1.72428208,  -8.77858418,   0.42213747,   5.82815214],
       [-11.00619177,  11.4472371 ,   9.01590721,   5.02494339]])
```

```python
# Select a single value by row and column position
print(df.iloc[2, 1])
```

```python
# A positional slice returns a sub-frame
df.iloc[1:4, 1:]
```

|   | c2 | c3 | c4 |
| --- | --- | --- | --- |
| 2019-02-28 | -23.015387 | 17.448118 | -7.612069 |
| 2019-03-31 | -2.493704 | 14.621079 | -20.601407 |
| 2019-04-30 | -3.840544 | 11.337694 | -10.998913 |
4.4 Selecting with boolean conditions

```python
# Rows where column c1 is positive
df[df['c1'] > 0]
```

|   | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-31 | 16.243454 | -6.117564 | -5.281718 | -10.729686 |
| 2019-02-28 | 8.654076 | -23.015387 | 17.448118 | -7.612069 |
| 2019-03-31 | 3.190391 | -2.493704 | 14.621079 | -20.601407 |
```python
# Combine conditions with & (and) or | (or); wrap each in parentheses
df[(df['c1'] > 0) & (df['c2'] > -8)]
```

|   | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-31 | 16.243454 | -6.117564 | -5.281718 | -10.729686 |
| 2019-03-31 | 3.190391 | -2.493704 | 14.621079 | -20.601407 |
5. Replacing values in a DataFrame
The sample frame again:

|   | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-31 | 16.243454 | -6.117564 | -5.281718 | -10.729686 |
| 2019-02-28 | 8.654076 | -23.015387 | 17.448118 | -7.612069 |
| 2019-03-31 | 3.190391 | -2.493704 | 14.621079 | -20.601407 |
| 2019-04-30 | -3.224172 | -3.840544 | 11.337694 | -10.998913 |
| 2019-05-31 | -1.724282 | -8.778584 | 0.422137 | 5.828152 |
| 2019-06-30 | -11.006192 | 11.447237 | 9.015907 | 5.024943 |
```python
# Assign into a positional slice: first three rows, first two columns
df.iloc[0:3, 0:2] = 0
df
```

|   | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-31 | 0.000000 | 0.000000 | -5.281718 | -10.729686 |
| 2019-02-28 | 0.000000 | 0.000000 | 17.448118 | -7.612069 |
| 2019-03-31 | 0.000000 | 0.000000 | 14.621079 | -20.601407 |
| 2019-04-30 | -3.224172 | -3.840544 | 11.337694 | -10.998913 |
| 2019-05-31 | -1.724282 | -8.778584 | 0.422137 | 5.828152 |
| 2019-06-30 | -11.006192 | 11.447237 | 9.015907 | 5.024943 |
```python
# The boolean mask used below
df['c3'] > 10
```

```
2019-01-31    False
2019-02-28     True
2019-03-31     True
2019-04-30     True
2019-05-31    False
2019-06-30    False
Freq: M, Name: c3, dtype: bool
```

```python
# Replace whole rows where the condition holds
df[df['c3'] > 10] = 100
df
```
|   | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-31 | 0.000000 | 0.000000 | -5.281718 | -10.729686 |
| 2019-02-28 | 100.000000 | 100.000000 | 100.000000 | 100.000000 |
| 2019-03-31 | 100.000000 | 100.000000 | 100.000000 | 100.000000 |
| 2019-04-30 | 100.000000 | 100.000000 | 100.000000 | 100.000000 |
| 2019-05-31 | -1.724282 | -8.778584 | 0.422137 | 5.828152 |
| 2019-06-30 | -11.006192 | 11.447237 | 9.015907 | 5.024943 |
```python
# Cast to integers, then replace whole rows where c3 is one of the listed values
df = df.astype(np.int32)
df[df['c3'].isin([100])] = 1000
df
```
|   | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-31 | 0 | 0 | -5 | -10 |
| 2019-02-28 | 1000 | 1000 | 1000 | 1000 |
| 2019-03-31 | 1000 | 1000 | 1000 | 1000 |
| 2019-04-30 | 1000 | 1000 | 1000 | 1000 |
| 2019-05-31 | -1 | -8 | 0 | 5 |
| 2019-06-30 | -11 | 11 | 9 | 5 |
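Besides boolean assignment, pandas also has a dedicated replace method; a minimal sketch of the same kind of substitution, not part of the original walkthrough:

```python
# replace every occurrence of 1000 with 0, returning a new frame
df.replace(1000, 0)
# a dict form maps several old values to new ones in one call
df.replace({1000: 0, -11: 0})
```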
6. Reading CSV files

```python
import pandas as pd
from io import StringIO

test_data = '''
5.1,,1.4,0.2
4.9,3.0,1.4,0.2
4.7,3.2,,0.2
7.0,3.2,4.7,1.4
6.4,3.2,4.5,1.5
6.9,3.1,4.9,
,,,
'''
test_data = StringIO(test_data)  # wrap the string so read_csv can treat it as a file
df = pd.read_csv(test_data, header=None)
df.columns = ['c1', 'c2', 'c3', 'c4']
df
```
|   | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 0 | 5.1 | NaN | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | NaN | 0.2 |
| 3 | 7.0 | 3.2 | 4.7 | 1.4 |
| 4 | 6.4 | 3.2 | 4.5 | 1.5 |
| 5 | 6.9 | 3.1 | 4.9 | NaN |
| 6 | NaN | NaN | NaN | NaN |
7. Handling missing data

```python
# True marks a missing value
df.isnull()
```

|   | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 0 | False | True | False | False |
| 1 | False | False | False | False |
| 2 | False | False | True | False |
| 3 | False | False | False | False |
| 4 | False | False | False | False |
| 5 | False | False | False | True |
| 6 | True | True | True | True |
```python
# Chaining sum() after isnull() counts the missing values in each column
print(df.isnull().sum())
```

```
c1    1
c2    2
c3    2
c4    2
dtype: int64
```
```python
# axis=0: drop every row that contains a NaN
df.dropna(axis=0)
```

|   | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 3 | 7.0 | 3.2 | 4.7 | 1.4 |
| 4 | 6.4 | 3.2 | 4.5 | 1.5 |
```python
# axis=1: drop every column that contains a NaN
df.dropna(axis=1)
```

```python
# how='all': drop only rows in which every value is NaN
df.dropna(how='all')
```
|   | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 0 | 5.1 | NaN | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | NaN | 0.2 |
| 3 | 7.0 | 3.2 | 4.7 | 1.4 |
| 4 | 6.4 | 3.2 | 4.5 | 1.5 |
| 5 | 6.9 | 3.1 | 4.9 | NaN |
```python
# thresh=4: drop rows with fewer than 4 non-NaN values
df.dropna(thresh=4)
```
|   | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 3 | 7.0 | 3.2 | 4.7 | 1.4 |
| 4 | 6.4 | 3.2 | 4.5 | 1.5 |
```python
# subset: drop rows where column c2 is NaN
df.dropna(subset=['c2'])
```
|   | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | NaN | 0.2 |
| 3 | 7.0 | 3.2 | 4.7 | 1.4 |
| 4 | 6.4 | 3.2 | 4.5 | 1.5 |
| 5 | 6.9 | 3.1 | 4.9 | NaN |
```python
# Fill every NaN with a constant
df.fillna(value=10)
```
|   | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 0 | 5.1 | 10.0 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 10.0 | 0.2 |
| 3 | 7.0 | 3.2 | 4.7 | 1.4 |
| 4 | 6.4 | 3.2 | 4.5 | 1.5 |
| 5 | 6.9 | 3.1 | 4.9 | 10.0 |
| 6 | 10.0 | 10.0 | 10.0 | 10.0 |
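fillna also accepts a Series of per-column fill values, so a common pattern (a sketch, not part of the original examples) fills each column with its own mean:

```python
# column means are computed over the non-missing values only
df.fillna(df.mean())
```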
8. Combining data

```python
df1 = pd.DataFrame(np.zeros((3, 4)))
df1
```
|   | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 |
```python
df2 = pd.DataFrame(np.ones((3, 4)))
df2
```
|   | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | 1.0 | 1.0 | 1.0 | 1.0 |
```python
# axis=0: stack vertically (concatenate along the rows)
pd.concat((df1, df2), axis=0)
```
|   | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | 1.0 | 1.0 | 1.0 | 1.0 |
```python
# axis=1: stack horizontally (concatenate along the columns)
pd.concat((df1, df2), axis=1)
```
|   | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
```python
# append stacks rows only, like concat with axis=0
# (DataFrame.append was removed in pandas 2.0; prefer pd.concat)
df1.append(df2)
```
|   | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | 1.0 | 1.0 | 1.0 | 1.0 |
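Note the duplicated row labels (0, 1, 2, 0, 1, 2) in the stacked results above; passing ignore_index=True, a small addition to the original examples, renumbers them:

```python
# renumber the rows 0..5 instead of keeping 0,1,2,0,1,2
pd.concat((df1, df2), axis=0, ignore_index=True)
```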
9. Importing and exporting data
Read a file with df = pd.read_excel(filename) and save one with df.to_excel(filename).
9.1 Reading files to import data
Main parameters of the read functions:
| Parameter | Description |
| --- | --- |
| sep | field delimiter; accepts regular expressions such as '\s+' |
| header=None | the file has no header row |
| names | column names to use |
| index_col | column to use as the row index |
| skiprows | rows to skip |
| na_values | strings to treat as missing values |
| parse_dates | whether to parse certain columns as dates; a boolean or a list |
```python
df = pd.read_excel(filename)
df = pd.read_csv(filename)
```
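Several of these parameters combined in one call; the file name data.csv and its column names are hypothetical:

```python
df = pd.read_csv(
    'data.csv',              # hypothetical file
    sep=',',                 # field delimiter
    header=None,             # the file has no header row
    names=['date', 'v1'],    # column names to assign
    index_col='date',        # use the date column as the row index
    na_values=['?', 'NA'],   # extra strings to treat as missing
    parse_dates=True,        # parse the index as dates
)
```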
9.2 Writing files to export data
Main parameters of the write functions:

| Parameter | Description |
| --- | --- |
| sep | field delimiter |
| na_rep | string to stand in for missing values; defaults to the empty string |
| header=False | don't write the column names |
| index=False | don't write the row index |
| columns | subset of columns to write, as a list |
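A matching write call, again with a hypothetical file name:

```python
df.to_csv(
    'out.csv',          # hypothetical file
    sep=',',            # field delimiter
    na_rep='NULL',      # string used for missing values
    header=True,        # keep the column names
    index=False,        # drop the row index
    columns=['v1'],     # subset of columns to write
)
```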
10. Reading JSON with pandas

```python
strtext = ('[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},'
           '{"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000},'
           '{"ttery":"min","issue":"20130801-3389","code":"5,9,1,2,9","code1":"298329129","code2":null,"time":1013395346000},'
           '{"ttery":"min","issue":"20130801-3388","code":"3,8,7,3,3","code1":"298588733","code2":null,"time":1013395286000},'
           '{"ttery":"min","issue":"20130801-3387","code":"0,8,5,2,7","code1":"298818527","code2":null,"time":1013395226000}]')

df = pd.read_json(strtext, orient='records')
df
```
|   | code | code1 | code2 | issue | time | ttery |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 8,4,5,2,9 | 297734529 | NaN | 20130801-3391 | 1013395466000 | min |
| 1 | 7,8,2,1,2 | 298058212 | NaN | 20130801-3390 | 1013395406000 | min |
| 2 | 5,9,1,2,9 | 298329129 | NaN | 20130801-3389 | 1013395346000 | min |
| 3 | 3,8,7,3,3 | 298588733 | NaN | 20130801-3388 | 1013395286000 | min |
| 4 | 0,8,5,2,7 | 298818527 | NaN | 20130801-3387 | 1013395226000 | min |
```python
df.to_excel('pandas处理json.xlsx', index=False,
            columns=["ttery", "issue", "code", "code1", "code2", "time"])
```
10.1 The five forms of the orient parameter
orient declares the format the JSON string is expected to be in. It accepts five values:
1. 'split': dict like {index -> [index], columns -> [columns], data -> [values]}
The JSON is made of an index, column labels, and a data matrix; the only allowed keys are index, columns, and data.
```python
s = '{"index":[1,2,3],"columns":["a","b"],"data":[[1,3],[2,8],[3,9]]}'
df = pd.read_json(s, orient='split')
df
```
2. 'records': list like [{column -> value}, ... , {column -> value}]
A list whose members are dicts, as in the example above: in each dict the column names are the keys, and each dict becomes one row of the DataFrame.
```python
strtext = ('[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},'
           '{"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000}]')

df = pd.read_json(strtext, orient='records')
df
```
|   | code | code1 | code2 | issue | time | ttery |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 8,4,5,2,9 | 297734529 | NaN | 20130801-3391 | 1013395466000 | min |
| 1 | 7,8,2,1,2 | 298058212 | NaN | 20130801-3390 | 1013395406000 | min |
3. 'index': dict like {index -> {column -> value}}
The keys are index labels; each value is a dict mapping column names to values. For example:
```python
s = '{"0":{"a":1,"b":2},"1":{"a":9,"b":11}}'
df = pd.read_json(s, orient='index')
df
```
4. 'columns': dict like {column -> {index -> value}}
The keys are column names; each maps to a dict keyed by index label, with the cell values as dict values. For example:
```python
s = '{"a":{"0":1,"1":9},"b":{"0":2,"1":11}}'
df = pd.read_json(s, orient='columns')
df
```
5. 'values': just the values array.
The most familiar form: a two-level nested list whose inner lists are the rows.
```python
s = '[["a",1],["b",2]]'
df = pd.read_json(s, orient='values')
df
```
11. Reading SQL queries with pandas

```python
import numpy as np
import pandas as pd
import pymysql

def conn(sql):
    # Connect to the MySQL database
    conn = pymysql.connect(
        host="localhost",
        port=3306,
        user="root",
        passwd="123",
        db="db1",
    )
    try:
        data = pd.read_sql(sql, con=conn)
        return data
    except Exception:
        print("SQL is not correct!")
    finally:
        conn.close()

sql = "select * from test1 limit 0, 10"  # the SQL statement
data = conn(sql)
print(data.columns.tolist())  # inspect the column names
print(data)                   # inspect the data
```
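Recent pandas versions warn when handed a raw DBAPI connection like the one above; the supported route is a SQLAlchemy engine. A minimal sketch with the same hypothetical credentials:

```python
import pandas as pd
from sqlalchemy import create_engine

# hypothetical credentials, mirroring the pymysql example above
engine = create_engine("mysql+pymysql://root:123@localhost:3306/db1")
data = pd.read_sql("select * from test1 limit 0, 10", con=engine)
print(data.head())
```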