Scraping with Python

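I want to scrape the text of Obama's weekly addresses from http://www.putclub.com/html/radio/VOA/presidentspeech/index.html

Doing this by hand means clicking into each page, then copying and saving the text, which is tedious.

Is there a way to do it in one step? With a language as capable as Python, it can be done quickly.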

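First, let's look at the source of the page.

We can see that the information we need sits in a short URL for each speech.

More specifically, we need to visit every address of the form http://www.putclub.com/html/radio/VOA/presidentspeech/2014/0928/91326.html, and these addresses have to be extracted from the index page above.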
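
To make the target concrete: the example address suggests that article pages share one shape, namely a four-digit year, a four-digit month-and-day directory, and a numeric id. The sketch below (Python 3) checks a URL against that shape. The pattern is an assumption inferred from the single example above, and `ARTICLE_PATH` and `is_article_url` are illustrative names, not part of the original script.

```python
import re

# Assumed shape, inferred from the one example URL in the text:
#   /html/radio/VOA/presidentspeech/<yyyy>/<mmdd>/<id>.html
ARTICLE_PATH = re.compile(
    r"/html/radio/VOA/presidentspeech/(\d{4})/(\d{4})/(\d+)\.html$"
)


def is_article_url(url):
    """Return True if url ends with a path shaped like the example article."""
    return ARTICLE_PATH.search(url) is not None
```

A check like this could be used to tell article links apart from the index page or from unrelated links.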

OK, let's start writing code.

First, open the index page and store its HTML in content.

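# Python 2 code: urllib.urlopen and the print statement do not exist in Python 3.
import urllib

# Index page that lists the weekly addresses.
url = "http://www.putclub.com/html/radio/VOA/presidentspeech/index.html"

wp = urllib.urlopen(url)
print "start download..."

# Raw HTML of the index page.
content = wp.read()
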
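
The snippet above targets Python 2, where `urllib.urlopen` and the `print` statement exist. For readers on Python 3, the same step might look like the following sketch. `fetch` is an illustrative helper, and the timeout value and the UTF-8 decoding are assumptions; the page's real encoding is not stated in this post, so undecodable bytes are replaced rather than raising an error.

```python
from urllib.request import urlopen

INDEX_URL = "http://www.putclub.com/html/radio/VOA/presidentspeech/index.html"


def fetch(url, timeout=10, encoding="utf-8"):
    """Download url and return its body as text.

    The encoding is an assumption; bytes that do not decode are replaced.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode(encoding, errors="replace")


# Usage (performs a network request, so it is left commented out):
# content = fetch(INDEX_URL)
```
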
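Next, extract each individual speech.

The idea is to search for "center_box" and then take whatever lies between each "href=" and "target" that follows it. To see why these two markers are used, look at the page source.

The result is the relative URL of each article; prefixing it with www.putclub.com gives the article's full address.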

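# Number of "center_box" occurrences, as a sanity check.
print content.count("center_box")

# Drop everything before the first "center_box".
content = content[content.find("center_box") + 1:]

# Take the path of the first link, assuming it is written as href="/..." target=...
# +7 skips 'href="/' and -2 drops the closing quote and the space before 'target'.
content = content[content.find("href=") + 7:content.find("target") - 2]

# Keep the relative path; it is reused later to build the file name.
filename = content
url = "http://www.putclub.com/" + content
print content
print url

# Download the article page.
wp = urllib.urlopen(url)
print "start download..."
content = wp.read()
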
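
The code above extracts only the first link after "center_box", while the goal stated earlier is to visit every article. A minimal Python 3 sketch of how all links could be collected with the same `find`-based idea follows. It assumes each article link is written as `href="..."` immediately followed by a `target` attribute, as the author's slicing implies; `extract_links` and `BASE_URL` are illustrative names.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.putclub.com/"


def extract_links(html, marker="center_box"):
    """Return absolute URLs for every href after marker that is followed by target."""
    links = []
    pos = html.find(marker)
    if pos == -1:
        return links  # marker missing: nothing to extract
    while True:
        start = html.find('href="', pos)
        if start == -1:
            break
        start += len('href="')
        end = html.find('"', start)
        if end == -1:
            break
        # Keep the link only if a target attribute comes right after it.
        if html[end + 1:end + 20].lstrip().startswith("target"):
            links.append(urljoin(BASE_URL, html[start:end]))
        pos = end
    return links
```

Each returned URL could then be downloaded and processed in a loop, in place of the single `url` handled above.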
With the article's URL in hand, the same approach is used to filter out its content.

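#print content

# Single quotes on the outside, so the double quotes of the HTML attribute need no escaping.
print content.count('<div class="content"')

#content = content[content.find('<div class="content"'):]

# The search string on this line is empty in the source; find("") returns 0, so this slice changes nothing.
content = content[content.find(""):]

# Cut everything from the dede_pages block onward.
content = content[:content.find('<div class="dede_pages"') - 1]

# Keep only the part of the path after "presidentspeech/".
filename = filename[filename.find("presidentspeech") + len("presidentspeech/"):]
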
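
One caveat with the slicing used here: `find` returns -1 when a marker is absent, and the slice then silently yields the wrong text instead of failing. The following Python 3 sketch guards against that. The two markers are taken from the code above, and `slice_between` is an illustrative helper name.

```python
def slice_between(text, start_marker, end_marker):
    """Return text from start_marker up to end_marker, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end]


# Markers as they appear in the author's code.
START = '<div class="content"'
END = '<div class="dede_pages"'
```

Returning `None` makes a changed page layout visible to the caller, who can then skip the article or report it.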
Finally, save the result and print it.

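# Replace every "/" so the path becomes a flat file name, e.g. 2014-0928-91326.html.
filename = filename.replace("/", "-")

# Write the extracted HTML to disk.
fp = open(filename, "w+")
fp.write(content)
fp.close()

print content
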
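
The file-name step can be isolated into a small function, which also makes it easy to check. This is a Python 3 sketch; `filename_from_path` and `save_text` are illustrative names, and writing the file as UTF-8 is an assumption.

```python
import os


def filename_from_path(path, marker="presidentspeech/"):
    """Flatten the part of path after marker into a file name."""
    _, found, tail = path.partition(marker)
    if not found:
        tail = path  # marker missing: fall back to the whole path
    return tail.strip("/").replace("/", "-")


def save_text(directory, name, text):
    """Write text to directory/name and return the full path."""
    full_path = os.path.join(directory, name)
    with open(full_path, "w", encoding="utf-8") as fp:
        fp.write(text)
    return full_path
```

For the example article mentioned earlier, `filename_from_path` yields `2014-0928-91326.html`.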
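Done! Save the script as a .pyw file; from then on, a double-click is all it takes to save the first weekly Obama address listed on the index page.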

Original article: https://www.cnblogs.com/babyfei/p/6992235.html