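Suppose you are searching the web for the raw data a project needs, and the bad news is that the data lives in a web page with no API for fetching it.
So now you have to waste 30 minutes writing a script to scrape it (and it ends up taking 2 hours).
It isn't hard, but it is a waste of time.
The Pandas library (https://pandas.pydata.org/) has a built-in method, read_html(), for extracting tabular data from HTML pages:
import pandas as pd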
tables = pd.read_html("https://apps.sandiego.gov/sdfiredispatch/")
print(tables[0])
It's that simple! Pandas finds every significant HTML table on the page and returns a list with one DataFrame object (https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) per table.
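To see the "one DataFrame per table" behavior without depending on a live site, here is a minimal sketch that feeds read_html() an inline HTML string. The HTML content and the Station/Crew table are made up for illustration, and the example assumes an HTML parser such as lxml is installed alongside pandas.

```python
from io import StringIO

import pandas as pd

# Hypothetical page content standing in for a real web page.
HTML = """
<html><body>
<table>
  <tr><th>Unit</th><th>Call Type</th></tr>
  <tr><td>E17</td><td>Medical</td></tr>
  <tr><td>M34</td><td>Medical</td></tr>
</table>
<table>
  <tr><th>Station</th><th>Crew</th></tr>
  <tr><td>17</td><td>4</td></tr>
</table>
</body></html>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(HTML))
print(len(tables))
print(tables[0])

# match= keeps only the tables whose text matches the given string or regex.
station_tables = pd.read_html(StringIO(HTML), match="Station")
print(station_tables[0])
```

The match argument is handy when a page contains layout tables you do not care about.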
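With a little extra input, you can tell it that row 0 of the table holds the column headers and ask it to convert the text-based dates into time objects:
import pandas as pd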
calls_df, = pd.read_html("http://apps.sandiego.gov/sdfiredispatch/", header=0, parse_dates=["Call Date"])
print(calls_df)
This produces:
Call Date            Call Type  Street       Cross Streets              Unit
2017-06-02 17:27:58  Medical    HIGHLAND AV  WIGHTMAN ST/UNIVERSITY AV  E17
2017-06-02 17:27:58  Medical    HIGHLAND AV  WIGHTMAN ST/UNIVERSITY AV  M34
2017-06-02 17:23:51  Medical    EMERSON ST   LOCUST ST/EVERGREEN ST     E22
2017-06-02 17:23:51  Medical    EMERSON ST   LOCUST ST/EVERGREEN ST     M47
2017-06-02 17:23:15  Medical    MARAUDER WY  BARON LN/FROBISHER ST      E38
2017-06-02 17:23:15  Medical    MARAUDER WY  BARON LN/FROBISHER ST      M41
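The trailing comma in "calls_df, = ..." is easy to miss: it unpacks the list returned by read_html() and only works when that list holds exactly one table. The sketch below illustrates this with hand-built lists of DataFrames; the names one_table and two_tables are hypothetical stand-ins for what read_html() would return.

```python
import pandas as pd

# Hypothetical stand-ins for the list that read_html returns.
one_table = [pd.DataFrame({"Unit": ["E17", "M34"]})]
two_tables = one_table + [pd.DataFrame({"Station": [17]})]

# The trailing comma unpacks a list that holds exactly one DataFrame.
calls_df, = one_table
print(type(calls_df).__name__)

# With more than one table, the same unpacking raises ValueError.
try:
    calls_df, = two_tables
    unpack_failed = False
except ValueError:
    unpack_failed = True
print(unpack_failed)
```

If a page might contain several tables, indexing the list (for example tables[0]) is the more forgiving choice.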
With one more line of code, the data is available as JSON records:
import pandas as pd
calls_df, = pd.read_html("http://apps.sandiego.gov/sdfiredispatch/", header=0, parse_dates=["Call Date"])
print(calls_df.to_json(orient="records", date_format="iso"))
Run the code above and you get nicely formatted JSON output (even with proper ISO 8601 date formatting):
[
{
"Call Date":"2017-06-02T17:34:00.000Z",
"Call Type":"Medical",
"Street":"ROSECRANS ST",
"Cross Streets":"HANCOCK ST/ALLEY",
"Unit":"M21"
},
{
"Call Date":"2017-06-02T17:34:00.000Z",
"Call Type":"Medical",
"Street":"ROSECRANS ST",
"Cross Streets":"HANCOCK ST/ALLEY",
"Unit":"T20"
},
{
"Call Date":"2017-06-02T17:30:34.000Z",
"Call Type":"Medical",
"Street":"SPORTS ARENA BL",
"Cross Streets":"CAM DEL RIO WEST/EAST DR",
"Unit":"E20"
}
// etc...
]
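The same to_json() call can be tried offline on a small hand-built frame. This is only a sketch: sample_df and its two rows are made up to mirror the columns above, and the exact timestamp suffix (with or without a trailing "Z") may vary between pandas versions.

```python
import json

import pandas as pd

# Hypothetical two-row frame mirroring some of the columns shown above.
sample_df = pd.DataFrame({
    "Call Date": pd.to_datetime(["2017-06-02 17:34:00", "2017-06-02 17:30:34"]),
    "Unit": ["M21", "E20"],
})

# orient="records" yields a JSON array with one object per row;
# date_format="iso" writes timestamps as ISO 8601 strings.
json_text = sample_df.to_json(orient="records", date_format="iso")

# Parse it back with the standard library to inspect the structure.
records = json.loads(json_text)
print(records[0]["Unit"], records[0]["Call Date"])
```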
You can even save the data to a CSV or XLS file:
import pandas as pd
calls_df, = pd.read_html("http://apps.sandiego.gov/sdfiredispatch/", header=0, parse_dates=["Call Date"])
calls_df.to_csv("calls.csv", index=False)
Run it, then double-click calls.csv to open it in a spreadsheet.
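One thing to keep in mind is that CSV stores dates as plain text, so they need to be parsed again when the file is loaded. The following sketch shows a round trip through an in-memory buffer instead of calls.csv; sample_df is a made-up frame used only for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical frame with a datetime column, as read_html would produce
# when parse_dates is used.
sample_df = pd.DataFrame({
    "Call Date": pd.to_datetime(["2017-06-02 17:27:58", "2017-06-02 17:23:51"]),
    "Street": ["HIGHLAND AV", "EMERSON ST"],
})

# Write to an in-memory buffer so nothing touches the disk.
buffer = StringIO()
sample_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back: ask read_csv to parse the date column again.
restored_df = pd.read_csv(StringIO(csv_text), parse_dates=["Call Date"])
print(restored_df.dtypes)
```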
And of course, Pandas makes it easy to filter, sort, or otherwise process the data:
>>> calls_df.describe()
                  Call Date Call Type      Street           Cross Streets Unit
count                    69        69          69                      64   69
unique                   29         2          29                      27   60
top     2017-06-02 16:59:50   Medical  CHANNEL WY  LA SALLE ST/WESTERN ST   E1
freq                      5        66           5                       5    2
first   2017-06-02 16:36:46       NaN         NaN                     NaN  NaN
last    2017-06-02 17:41:30       NaN         NaN                     NaN  NaN
>>> calls_df.groupby("Call Type").count()
                       Call Date  Street  Cross Streets  Unit
Call Type
Medical                       66      66             61    66
Traffic Accident (L1)          3       3              3     3
>>> calls_df["Unit"].unique()
array(['E46','MR33','T40','E201','M6','E34','M34','E29','M30',
'M43','M21','T20','E20','M20','E26','M32','SQ55','E1',
'M26','BLS4','E17','E22','M47','E38','M41','E5','M19',
'E28','M1','E42','M42','E23','MR9','PD','LCCNOT','M52',
'E45','M12','E40','MR40','M45','T1','M23','E14','M2','E39',
'M25','E8','M17','E4','M22','M37','E7','M31','E9','M39',
'SQ56','E10','M44','M11'], dtype=object)
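The session above shows summarizing and grouping; filtering and sorting follow the same pattern. Here is a minimal sketch on a hypothetical three-row frame (the rows are invented for illustration and are not taken from the live dispatch page):

```python
import pandas as pd

# Hypothetical sample rows with the same column names as calls_df.
sample_df = pd.DataFrame({
    "Call Date": pd.to_datetime([
        "2017-06-02 17:27:58", "2017-06-02 17:23:51", "2017-06-02 17:30:34",
    ]),
    "Call Type": ["Medical", "Traffic Accident (L1)", "Medical"],
    "Unit": ["E17", "E22", "E20"],
})

# Filter: keep only medical calls using a boolean mask.
medical_df = sample_df[sample_df["Call Type"] == "Medical"]

# Sort: most recent call first.
newest_first = medical_df.sort_values("Call Date", ascending=False)
print(newest_first)

# Count how many calls there are of each type.
type_counts = sample_df["Call Type"].value_counts()
print(type_counts)
```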