Python for Informatics, Chapter 12: Networked Programs, Part 4 (Translation)

Note: the original text is Python for Informatics by Dr. Charles Severance. The code in this article has been rewritten for Python 3.4 and tested on the translator's machine.

12.7 Parsing HTML using BeautifulSoup

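There are a number of Python libraries that can help you parse HTML and extract data from pages. Each library has its own strengths and weaknesses, and you can pick one based on your needs.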

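In the example below, we will use BeautifulSoup to parse some HTML input and extract the links. You can download the BeautifulSoup code from www.crummy.com. You can install it after downloading, or simply place the BeautifulSoup.py file in the same folder as your application. (The translator used a different installation method: pip3 install beautifulsoup4)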

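Even though HTML looks like XML, and some pages are carefully constructed to be valid XML, most HTML is broken in ways that cause an XML parser to reject the entire page as improperly formed. BeautifulSoup tolerates highly flawed HTML and still lets you easily extract the data you need. We will use urllib to read the page and then use BeautifulSoup to extract the href attributes from the anchor (a) tags.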
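To make the contrast concrete, here is a minimal sketch that feeds the same malformed markup to a strict XML parser and to BeautifulSoup. The `broken_html` string and the example.com address are made up for illustration and are not from the original book; the strict parser used here is the standard library's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

# Hypothetical, deliberately malformed HTML: <p> and <br> are never closed,
# the second href is unquoted, and the last <a> is never closed.
broken_html = '<p>Links:<br><a href="http://example.com/a">First</a><a href=/b>Second'

# A strict XML parser rejects the whole document.
try:
    ET.fromstring(broken_html)
    xml_ok = True
except ET.ParseError:
    xml_ok = False

# BeautifulSoup still builds a tree and exposes both anchor tags.
soup = BeautifulSoup(broken_html, "html.parser")
hrefs = [tag.get('href') for tag in soup('a')]

print('XML parser accepted it:', xml_ok)
print('hrefs found by BeautifulSoup:', hrefs)
```

The XML parser gives up at the first violation, while BeautifulSoup recovers both links, including the one with the unquoted attribute.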

The code is as follows:

from bs4 import BeautifulSoup
import urllib.request

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

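The program prompts for a web address, opens the page, reads the data, and passes it to the BeautifulSoup parser. It then retrieves all of the anchor (a) tags and prints the href attribute of each one.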

When the program runs, the output looks like this:

Enter - http://www.py4inf.com/book.htm
http://amzn.to/1KkULF3
http://amzn.to/1KkULF3
http://amzn.to/1hLcoBy
http://amzn.to/1KkV42z
http://amzn.to/1fNOnbd
http://amzn.to/1N74xLt
http://do1.dr-chuck.net/py4inf/EN-us/book.pdf
http://do1.dr-chuck.net/py4inf/ES-es/book.pdf
https://twitter.com/fertardio
translations/KO/book_009_ko.pdf
http://www.xwmooc.net/python/
http://fanwscu.gitbooks.io/py4inf-zh-cn/
book_270.epub
translations/ES/book_272_es4.epub
https://www.gitbook.com/download/epub/book/fanwscu/py4inf-zh-cn
html-270/
html_270.zip
http://itunes.apple.com/us/book/python-for-informatics/id554638579?mt=13
http://www-personal.umich.edu/~csev/books/py4inf/ibooks//python_for_informatics.ibooks
http://www.py4inf.com/code
http://www.greenteapress.com/thinkpython/thinkCSpy/
http://allendowney.com/
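Several of the links in the output above are relative (for example book_270.epub), and `tag.get('href', None)` returns None for an anchor that has no href at all. The following is one possible sketch of handling both cases, not part of the original program: the helper name `extract_links` and the `sample` markup are hypothetical, and the relative links are resolved with the standard library's `urllib.parse.urljoin`.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_links(html, base_url):
    """Return an absolute URL for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for tag in soup('a'):
        href = tag.get('href')  # None when the anchor has no href
        if href is None:
            continue
        # Resolve relative links against the address of the page
        links.append(urljoin(base_url, href))
    return links

# Hypothetical sample modeled on a few of the links printed above
sample = '''
<a href="http://allendowney.com/">Absolute link</a>
<a href="book_270.epub">Relative file</a>
<a href="translations/ES/book_272_es4.epub">Relative path</a>
<a name="top">No href</a>
'''

links = extract_links(sample, 'http://www.py4inf.com/book.htm')
for link in links:
    print(link)
```

Keeping the parsing in a function that takes a string also makes it possible to try the logic without a network connection.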

You can also use BeautifulSoup to pull out the various parts of each tag, as follows:

from bs4 import BeautifulSoup
import urllib.request

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('Content:', tag.contents[0])
    print('Attrs:', tag.attrs)

The output of this program is as follows:

Enter - http://www.dr-chuck.com/page1.html
TAG: <a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>
URL: http://www.dr-chuck.com/page2.htm
Content:
Second Page
Attrs: {'href': 'http://www.dr-chuck.com/page2.htm'}
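
The blank after "Content:" in the output above appears because the first child of the tag is a text node that begins with a newline. The sketch below reproduces this on a literal string so it can be run offline. The second, empty anchor is a hypothetical addition that shows why `tag.contents[0]` may need a guard: a tag with no children has an empty contents list.

```python
from bs4 import BeautifulSoup

# The first anchor matches the page in the output above;
# the second (empty) anchor is hypothetical.
html = ('<a href="http://www.dr-chuck.com/page2.htm">\nSecond Page</a>'
        '<a name="empty"></a>')
soup = BeautifulSoup(html, "html.parser")
first, second = soup('a')

# The text node keeps its leading newline, hence the line break after "Content:"
print(repr(first.contents[0]))

# get_text(strip=True) returns the text with surrounding whitespace removed
text = first.get_text(strip=True)
print(text)

# An empty tag has no children, so contents[0] would raise IndexError
safe = second.contents[0] if second.contents else None
print(safe)
```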

These examples only begin to show the power of BeautifulSoup when it comes to parsing HTML. For more detail, see the documentation and samples at www.crummy.com.

Original article: https://www.cnblogs.com/zhengsh/p/5432632.html
