Python Web Scraping: BeautifulSoup (1)

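BeautifulSoup is a library for parsing content that a crawler has already downloaded; its find and find_all methods are the ones you will use most. Parsing turns the page into a tree of tag objects, so an element can be reached by name much like looking up a key in a JSON-style key-value structure, which makes the page content considerably easier to work with.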

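Installation needs little explanation: with Python's pip available, run pip install beautifulsoup4 in cmd. (The older package name beautifulsoup refers to the discontinued version 3, which does not provide the bs4 module used below.)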
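A quick way to confirm that the install worked is to import the library and print its version. This is only a sanity check, assuming the package was installed into the same Python interpreter that runs the script.

```python
# The library is imported under the module name bs4.
import bs4
from bs4 import BeautifulSoup

# If both imports succeed, the library is available to this interpreter.
print(bs4.__version__)
```

If the import fails with ModuleNotFoundError, pip most likely installed the package for a different Python installation than the one running the script.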

To start, copy the example code from the official documentation, as follows:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc,'html.parser')

print(soup.find_all('a'))

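html_doc stands in for the page we would have crawled; for convenience, this uses the sample document provided in the official documentation.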

We parse html_doc directly, using html.parser, the parser that ships with Python's standard library.
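To make the parsing step concrete, here is a minimal sketch on a much smaller input. The fragment variable and its contents are made up for illustration; only BeautifulSoup and the html.parser argument come from the example above.

```python
from bs4 import BeautifulSoup

# A tiny, made-up fragment standing in for a downloaded page.
fragment = "<p class='story'>Hello <b>world</b></p>"

# The second argument selects the parser; 'html.parser' needs no extra install.
soup = BeautifulSoup(fragment, "html.parser")

# The result is a BeautifulSoup object whose tags can be reached by name.
print(type(soup).__name__)
print(soup.p.b.string)

# prettify() prints the parsed tree with one tag per line, indented by depth.
print(soup.prettify())
```

Printing prettify() is a convenient way to see the tree that the parser actually built before writing any queries against it.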

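After typing this into Sublime Text, press Ctrl+B to run it. (I recommend installing the SublimePythonIDE plugin package, which lets you run the script directly without going through cmd.)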

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[Finished in 0.2s]

The result is shown above: every <a> tag in the document was printed.
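Since find and find_all are the two methods used most, the sketch below contrasts them on a shortened copy of the sample document. The variable names (first, all_links, two, stories, missing) are only illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find returns only the first matching tag, or None when nothing matches.
first = soup.find("a")
missing = soup.find("table")

# find_all returns a list-like collection of every match (empty if none).
all_links = soup.find_all("a")

# limit stops the search after the given number of matches.
two = soup.find_all("a", limit=2)

# class_ (with a trailing underscore) filters on the HTML class attribute.
stories = soup.find_all("p", class_="story")

print(first["id"], len(all_links), len(two), len(stories), missing)
```

Because find can return None, it is worth checking its result before indexing into it when the page structure is not guaranteed.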

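Next, following the documentation, we query soup in other ways and print the results. (The snippet is pasted directly from the official site, so it is not explained again here.)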

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# [u'title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

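As shown above, the parsed soup can be queried by tag name or by attribute, much like key-value data: expressions such as soup.title each return the part of the page you need. (Lines starting with # show the output; the u prefix is how Python 2 displays Unicode strings and does not appear under Python 3.)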
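The tree can also be walked in other directions. The following is a small sketch, on a trimmed copy of the sample document, of moving from one tag to its parent and siblings and of reading its attributes as a dictionary; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p></body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

link = soup.find(id="link2")

# Every tag knows its parent, so we can walk upward through the tree.
parent_name = link.parent.name

# Sibling searches move sideways among tags that share the same parent.
previous_id = link.find_previous_sibling("a")["id"]
next_id = link.find_next_sibling("a")["id"]

# attrs exposes all attributes of a tag as a plain dict.
# class is multi-valued in HTML, so its value is a list.
attributes = link.attrs

print(parent_name, previous_id, next_id)
print(attributes)
```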

There are other methods as well.

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

A for loop makes it easy to process each child element of a more complex parent element. (Lines starting with # show the output.)
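The same loop can collect results into a data structure instead of printing them. The sketch below uses a made-up variant of the sample links in which the third tag deliberately has no href, to show why link.get('href') is the safer accessor; the names links and with_href are illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<a id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Map each link's text to its target.
# get() returns None for a missing attribute, whereas link["href"] would raise KeyError.
links = {}
for link in soup.find_all("a"):
    links[link.get_text()] = link.get("href")

# Passing href=True keeps only the tags that actually have an href attribute.
with_href = [a["href"] for a in soup.find_all("a", href=True)]

print(links)
print(with_href)
```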

The last example on the official site strips all the markup from the page and shows only the text. It works as follows:

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...

This conveniently prints just the text content of the page.
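By default get_text keeps all the whitespace found in the page, which explains the blank lines in the output above. As a rough sketch on a made-up fragment, the separator and strip arguments, or the stripped_strings generator, give tidier results.

```python
from bs4 import BeautifulSoup

# A made-up fragment with stray spaces around each name.
html = "<div><p> Elsie </p><p> Lacie </p><p> Tillie </p></div>"
soup = BeautifulSoup(html, "html.parser")

# Default behaviour: the text pieces are concatenated exactly as they appear.
raw = soup.get_text()

# strip=True trims each piece and drops empty ones; separator joins what is left.
joined = soup.get_text(separator="|", strip=True)

# stripped_strings yields the same trimmed pieces one at a time.
names = list(soup.stripped_strings)

print(repr(raw))
print(joined)
print(names)
```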

That covers the basic usage of BeautifulSoup.

Original post: https://www.cnblogs.com/Sample1994/p/6664834.html