BeautifulSoup is a library dedicated to parsing HTML/XML. Official site: http://www.crummy.com/software/BeautifulSoup/
Note that BS now has a 4.x release. The official site says:
Beautiful Soup 3 has been replaced by Beautiful Soup 4. You may be looking for the Beautiful Soup 4 documentation
Beautiful Soup 3 only works on Python 2.x, but Beautiful Soup 4 also works on Python 3.x. Beautiful Soup 4 is faster, has more features, and works with third-party parsers like lxml and html5lib. You should use Beautiful Soup 4 for all new projects.
On my machine, running import BeautifulSoup followed by
print BeautifulSoup.__version__
reports the version number:
3.2.1
Beautiful Soup 4 works on both Python 2 (2.6+) and Python 3.
Installation is simple: BeautifulSoup 3 consists of a single file, so copying that file into your working directory is all it takes.
from BeautifulSoup import BeautifulSoup       # For processing HTML
from BeautifulSoup import BeautifulStoneSoup  # For processing XML
import BeautifulSoup                          # To get everything
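For comparison, a minimal sketch of the same step under Beautiful Soup 4, assuming it has been installed as the beautifulsoup4 package on Python 3 (the package is then imported as bs4). The variable names are illustrative only.

```python
# Sketch, assuming Beautiful Soup 4 is installed (import name: bs4).
import bs4
from bs4 import BeautifulSoup

# "html.parser" is Python's built-in parser, so no extra dependency is needed.
soup = BeautifulSoup("<p>hello</p>", "html.parser")
version = bs4.__version__   # the installed version as a string
greeting = soup.p.string    # text of the first <p>
print(version, greeting)
```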
Creating a BeautifulSoup object
A BeautifulSoup object can be created from nothing more than a piece of HTML text.
The following code creates a BeautifulSoup object:
from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>PythonClub.org</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
Calling
print soup.prettify()
then prints:
# <html>
#  <head>
#   <title>
#    PythonClub.org
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    of ptyhonclub.org.
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    of pythonclub.org.
#   </p>
#  </body>
# </html>
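The same construction can be sketched for Beautiful Soup 4. This assumes bs4 on Python 3 with the built-in html.parser; the sample document is a variant of the one above with every tag closed explicitly, because parsers differ in how they repair unclosed tags.

```python
from bs4 import BeautifulSoup

# Variant of the sample document with all tags closed explicitly.
doc = [
    '<html><head><title>PythonClub.org</title></head>',
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>',
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>',
    '</body></html>',
]
soup = BeautifulSoup("".join(doc), "html.parser")

# prettify() returns a str with tags and text on separate lines, indented by depth.
pretty = soup.prettify()
print(pretty)
```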
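Finding a specific element in the HTML
BeautifulSoup lets you reach a specific HTML element directly with "." (dot) notation.
Searching by HTML tag: finding the HTML title
soup.html.head.title returns the title tag; its .name and .string give the tag name and the string value.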
>>> soup.html.head.title          # note: the result includes the title tag itself
<title>PythonClub.org</title>
>>> soup.html.head.title.name
u'title'
>>> soup.html.head.title.string
u'PythonClub.org'
>>>
You can also jump straight to a given HTML element with soup.title:
>>> soup.title
<title>PythonClub.org</title>
>>>
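A small sketch of dot navigation, assuming Beautiful Soup 4 (bs4) on Python 3 with html.parser. The HTML string and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>PythonClub.org</title></head><body><p>Hi</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

title = soup.html.head.title   # walk down the tree one tag name at a time
same_title = soup.title        # shortcut: the first <title> anywhere in the document
title_name = title.name
title_text = title.string
missing = soup.table           # there is no <table>, so this is None
```

Dot access always returns the first matching tag only, so it suits elements that occur once, such as title.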
Searching by HTML content: finding the whole tag that contains a particular string
The following example finds the tag content that contains "para":
>>> import re
>>> soup.findAll(text=re.compile("para"))
[u'This is paragraph ', u'This is paragraph ']
>>> soup.findAll(text=re.compile("para"))[0].parent
<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>
>>> soup.findAll(text=re.compile("para"))[0].parent.contents
[u'This is paragraph ', <b>one</b>, u' of ptyhonclub.org.']
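A hedged Beautiful Soup 4 counterpart of the text search. It assumes a bs4 release that accepts the string= keyword (older releases call it text=), Python 3, and html.parser; the HTML is a shortened stand-in for the sample document.

```python
import re
from bs4 import BeautifulSoup

html = ('<p id="firstpara">This is paragraph <b>one</b>.</p>'
        '<p id="secondpara">This is paragraph <b>two</b>.</p>')
soup = BeautifulSoup(html, "html.parser")

# string= matches text nodes rather than tags.
matches = soup.find_all(string=re.compile("para"))
first_parent = matches[0].parent                  # the <p> holding the first match
parent_ids = [m.parent["id"] for m in matches]    # id of each enclosing <p>
```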
The basic method: findAll
findAll(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
These arguments show up over and over again throughout the Beautiful Soup API. The most important arguments are name and the keyword arguments.
- The simplest usage is to just pass in a tag name. This code finds all the <B> Tags in the document:
soup.findAll('b')
# [<b>one</b>, <b>two</b>]
- You can also pass in a regular expression. This code finds all the tags whose names start with B:
import re
tagsStartingWithB = soup.findAll(re.compile('^b'))
[tag.name for tag in tagsStartingWithB]
# [u'body', u'b', u'b']
- You can pass in a list or a dictionary. These two calls find all the <TITLE> and all the <P> tags. They work the same way, but the second call runs faster:
soup.findAll(['title', 'p'])
# [<title>PythonClub.org</title>,
#  <p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>,
#  <p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.</p>]
soup.findAll({'title' : True, 'p' : True})
# [<title>PythonClub.org</title>,
#  <p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>,
#  <p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.</p>]
The keyword arguments impose restrictions on the attributes of a tag. This simple example finds all the tags which have a value of "center" for their "align" attribute:
soup.findAll(align="center")
# [<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>]
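The name filters and keyword arguments described above can be sketched with Beautiful Soup 4, where the method is spelled find_all. This assumes bs4 on Python 3 with html.parser; the dictionary form of the name argument is left out because it belongs to the Beautiful Soup 3 API shown above.

```python
import re
from bs4 import BeautifulSoup

html = ('<html><head><title>Page title</title></head><body>'
        '<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
        '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
        '</body></html>')
soup = BeautifulSoup(html, "html.parser")

bold = soup.find_all("b")                                     # by tag name
b_names = [t.name for t in soup.find_all(re.compile("^b"))]   # by regex on the name
title_and_p = soup.find_all(["title", "p"])                   # by list of names
centered = soup.find_all(align="center")                      # by attribute value
first_only = soup.find_all("p", limit=1)                      # stop after one match
```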
Searching by CSS class
The attrs argument would be a pretty obscure feature were it not for one thing: CSS. It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, class, is also a Python reserved word.
You could search by CSS class with soup.find("tagName", { "class" : "cssClass" }), but that's a lot of code for such a common operation. Instead, you can pass a string for attrs instead of a dictionary. The string will be used to restrict the CSS class.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("""Bob's <b>Bold</b> Barbeque Sauce now available in
                        <b class="hickory">Hickory</b> and <b class="lime">Lime</b>""")
soup.find("b", { "class" : "lime" })
# <b class="lime">Lime</b>
soup.find("b", "hickory")
# <b class="hickory">Hickory</b>
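A sketch of the three ways to search by CSS class, assuming Beautiful Soup 4 (bs4) on Python 3 with html.parser. The class_ keyword is the bs4 spelling; the first two forms mirror the calls above.

```python
from bs4 import BeautifulSoup

html = ("Bob's <b>Bold</b> Barbeque Sauce now available in "
        '<b class="hickory">Hickory</b> and <b class="lime">Lime</b>')
soup = BeautifulSoup(html, "html.parser")

by_dict = soup.find("b", {"class": "lime"})    # explicit attrs dictionary
by_string = soup.find("b", "hickory")          # a bare string is treated as a CSS class
by_keyword = soup.find("b", class_="lime")     # class_ sidesteps the reserved word
```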
Searching HTML content by attribute (run against the two-paragraph document created earlier)
soup.findAll(id=re.compile("para$"))
# [<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>,
#  <p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.</p>]
soup.findAll(attrs={'id' : re.compile("para$")})
# [<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>,
#  <p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.</p>]
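A sketch contrasting the keyword form with the attrs dictionary, assuming Beautiful Soup 4 (bs4) on Python 3 with html.parser. The HTML, including the data-role attribute, is invented for illustration.

```python
import re
from bs4 import BeautifulSoup

html = ('<p id="firstpara">one</p><p id="secondpara">two</p>'
        '<p id="parade">three</p><div data-role="para">four</div>')
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all(id=re.compile("para$"))            # ids ending in "para"
by_attrs = soup.find_all(attrs={"id": re.compile("para$")})   # same filter as a dict
# attrs= is needed when the attribute name is not a valid Python keyword.
by_data = soup.find_all(attrs={"data-role": "para"})
```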
Understanding BeautifulSoup in depth
Reposted from: http://www.pythonclub.org/modules/beautifulsoup/start
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
Another article:
------------------------------------
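Soup ingredients: the objects in a Soup
Tag
A tag corresponds to an HTML element, that is, a pair of HTML tags together with everything they enclose (including nested tags and text), for example: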
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
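soup.b is a tag, and soup itself can also be regarded as a tag; the whole HTML document is just tags nested inside tags.
Name
The name corresponds to the name in the HTML tag (the first item inside the angle brackets). Every tag has a name, accessed with .name; in the example above: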
tag.name == u'b'
soup.name == u'[document]'
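Attributes
Attributes correspond to the attribute part of an HTML tag (the items with an equals sign inside the angle brackets). A tag can have many attributes or none. Attributes are accessed dictionary-style, with square brackets around the attribute name. In the example above (Beautiful Soup 4 treats class as a multi-valued attribute, so its value is a list):
tag['class'] == [u'boldest']
The dictionary itself is available directly through .attrs, for example:
tag.attrs == {u'class': [u'boldest']}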
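A short sketch of name and attribute access, assuming Beautiful Soup 4 (bs4) on Python 3 with html.parser. The id attribute is added here only to contrast a single-valued attribute with the multi-valued class.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest" id="x1">Extremely bold</b>', "html.parser")
tag = soup.b

tag_name = tag.name           # name of the <b> tag
doc_name = soup.name          # the soup object reports a special document name
tag_id = tag["id"]            # single-valued attribute: a plain string
tag_classes = tag["class"]    # multi-valued attribute: a list of strings
missing = tag.get("href")     # .get() returns None instead of raising KeyError
```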
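Text
Text corresponds to the text in the HTML (the part outside the angle brackets). Text is accessed with .text; in the example above:
tag.text == u'Extremely bold'
Difference between string and text: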
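One way to see the difference is a tag with mixed content. The sketch below assumes Beautiful Soup 4 (bs4) on Python 3 with html.parser: .string is only set when a tag has a single child to take it from, while .text joins every string underneath the tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>This is <b>bold</b> text</p>", "html.parser")

b_string = soup.b.string   # <b> has exactly one text child, so .string is that text
p_string = soup.p.string   # <p> has several children, so .string is None
p_text = soup.p.text       # .text concatenates all strings inside <p>
```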
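Finding the ingredients: searching in a Soup
Parsing an HTML document is usually done to locate the interesting parts and extract them. BeautifulSoup provides the find and find_all methods for searching. find returns only the first matching tag, while find_all returns a list. Because searching is so common, BeautifulSoup offers some convenient shorthand:
tag.find_all("a")  # equivalent to tag("a"); find_all is the 4.0 name of this function
tag.find("a")      # equivalent to tag.a
When nothing is found, find_all returns an empty list and find returns None rather than raising an exception, so there is no need to worry that tag("a") or tag.a will fail when there is no match. Because of Python's rules for identifiers, the tag.a form can only search by name, since only an identifier may follow the dot; the call forms tag() and tag.find() can be used with all of the search styles below.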
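The shorthand forms and their behavior on a miss can be checked with a small sketch, assuming Beautiful Soup 4 (bs4) on Python 3 with html.parser. The HTML is invented for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><a href="/x">x</a><a href="/y">y</a></div>', "html.parser")
div = soup.div

all_links = div.find_all("a")
called = div("a")              # calling a tag is shorthand for find_all
first_link = div.find("a")
dotted = div.a                 # attribute access is shorthand for find

no_tables = div("table")       # nothing found: an empty list
no_table = div.table           # nothing found: None, no exception
```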
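Searches can use several kinds of filter: strings, lists, key-value pairs (dictionaries), regular expressions, and functions.
- String: a string matches the tag name, for example tag.a or tag("a")
- List: a list of strings matches any tag whose name equals one of them, for example tag(["h2", "p"])
- Key-value: tag(key=value) searches by tag attribute. Key-value searches have quite a few small tricks; a few are listed here:
  - class: class is a Python reserved word and cannot be used as an identifier, yet class=XXX is very common in HTML. BeautifulSoup's solution is to append an underscore and use class_ instead, as in tag(class_=XXX).
  - True: when the value is True, every tag that has the key matches, as in tag(href=True)
  - text: used as a key, text searches by the text inside tags, as in tag(text=something)
- Regular expression: for example tag(href=re.compile("elsie"))
- Function: when none of the above works, a function is the ultimate tool. Write a function that takes a single tag as its argument and pass it to find or find_all, for example:
def fun(tag):
    return tag.has_attr("class") and not tag.has_attr("id")
tag(fun)  # returns every tag that has a class attribute but no id attribute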
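The filter kinds listed above can be tried side by side. This is a sketch assuming Beautiful Soup 4 (bs4) on Python 3 with html.parser; the HTML and the helper name has_class_but_no_id are illustrative, not part of the library.

```python
import re
from bs4 import BeautifulSoup

html = ('<h2 class="head">Links</h2>'
        '<p class="intro" id="p1">See <a href="http://example.com/elsie">Elsie</a></p>'
        '<p class="note">No link here, only an <a name="anchor">anchor</a></p>')
soup = BeautifulSoup(html, "html.parser")

by_list = soup.find_all(["h2", "p"])                 # any of several tag names
by_class = soup.find_all(class_="note")              # class_ stands in for class
with_href = soup.find_all(href=True)                 # any tag that has an href
by_regex = soup.find_all(href=re.compile("elsie"))   # regex against the value

def has_class_but_no_id(tag):
    # Hypothetical helper: keep tags that carry class but not id.
    return tag.has_attr("class") and not tag.has_attr("id")

by_function = soup.find_all(has_class_but_no_id)
```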
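Another bowl: searching by document structure
HTML can be parsed into a tree of tags, so you can also search by the relationships between tags in that tree.
- Search upward for ancestors: find_parents() and find_parent()
- Search for following siblings: find_next_siblings() and find_next_sibling()
- Search for preceding siblings: find_previous_siblings() and find_previous_sibling()
  The four sibling methods above only look at siblings under the same parent.
- Search downward for descendants: this is exactly what find and find_all, described above, do
- Search for following nodes (ignoring parent/child/sibling relationships): find_all_next() and find_next()
- Search for preceding nodes (ignoring parent/child/sibling relationships): find_all_previous() and find_previous()
All of these methods take the same arguments as find, so the filters can be combined with them.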
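A sketch of the structural searches listed above, assuming Beautiful Soup 4 (bs4) on Python 3 with html.parser. The HTML and ids are invented so that each direction has an unambiguous answer.

```python
from bs4 import BeautifulSoup

html = ('<div id="box"><p id="a">one</p><p id="b">two <i>deep</i></p>'
        '<p id="c">three</p></div><span id="after">tail</span>')
soup = BeautifulSoup(html, "html.parser")
middle = soup.find(id="b")

parent_id = middle.find_parent("div")["id"]           # nearest enclosing <div>
next_sib = middle.find_next_sibling("p")["id"]        # same parent, later
prev_sib = middle.find_previous_sibling("p")["id"]    # same parent, earlier
later_sibs = [t["id"] for t in soup.find(id="a").find_next_siblings("p")]

# find_next / find_previous follow document order and ignore the parent boundary.
next_span = middle.find_next("span")["id"]
prev_p = soup.find(id="after").find_previous("p")["id"]
```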
Choosing soup by color: searching by CSS
Use the .select() method; see http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
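A few common selector shapes, as a sketch assuming a Beautiful Soup 4 release that provides both select() and select_one(), on Python 3 with html.parser. The HTML is invented for illustration.

```python
from bs4 import BeautifulSoup

html = ('<div id="menu"><p class="soup hot">Tomato</p>'
        '<p class="soup">Miso</p><a href="/order">Order</a></div>')
soup = BeautifulSoup(html, "html.parser")

by_class = [p.string for p in soup.select("p.soup")]    # tag name plus class
by_two = [p.string for p in soup.select(".soup.hot")]   # both classes required
by_child = soup.select("#menu > a")                     # direct child of #menu
first_hit = soup.select_one("p.soup")                   # first match, or None
```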
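A few tricks
- BeautifulSoup supports several parsers, such as lxml, html5lib, and html.parser. For example: BeautifulSoup("<a></b>", "html.parser")
  For how their results differ, see http://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers
- BeautifulSoup converts the text to Unicode before parsing; from_encoding specifies the encoding, for example: BeautifulSoup(markup, from_encoding="iso-8859-8")
- soup.prettify() outputs nicely laid-out HTML text; when Chinese is involved you can specify an encoding so that it displays correctly, for example soup.prettify("gbk")
- If encoding problems remain, see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#unicode-dammit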
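The parser and encoding tricks above can be sketched together. This assumes Beautiful Soup 4 (bs4) on Python 3 and uses only the built-in html.parser, since lxml and html5lib may not be installed; the gbk-encoded markup is invented for illustration.

```python
from bs4 import BeautifulSoup

# Parser choice: the second argument names the parser.
broken = BeautifulSoup("<a></b>", "html.parser")
has_a = broken.a is not None   # the <a> tag is kept
has_b = broken.b is not None   # the stray </b> does not create a <b> tag here

# Input encoding: bytes are decoded to Unicode, here with an explicit hint.
raw = "<p>中文</p>".encode("gbk")
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")
text = soup.p.string

# Output encoding: passing an encoding to prettify() returns encoded bytes.
pretty_bytes = soup.prettify("gbk")
```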
Reposted from: http://cndenis.iteye.com/blog/1746706
Two important attributes of a soup:
.contents and .children
A tag’s children are available in a list called .contents:
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]
type(head_tag.contents[0])
# <class 'BeautifulSoup.Tag'>  (the items in .contents are not plain strings but BeautifulSoup's own node types)
title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# [u"The Dormouse's story"]
The BeautifulSoup object itself has children. In this case, the <html> tag is the child of the BeautifulSoup object:
len(soup.contents)
# 1
soup.contents[0].name
# u'html'
A string does not have .contents, because it can’t contain anything:
text = title_tag.contents[0]
text.contents
# AttributeError: 'NavigableString' object has no attribute 'contents'
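In BeautifulSoup 3, if a tag contains another HTML tag, its .string is None, whether or not a string precedes that inner tag. (Beautiful Soup 4 differs in one case: when a tag's only child is another tag, .string is taken from that child.)
soup=BeautifulSoup("<head><title>The Dormouse's story</title></head>")
head=soup.head
print head.string
The output is None, which demonstrates the point.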
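For readers on Beautiful Soup 4, the sketch below shows how the same check behaves there. It assumes bs4 on Python 3 with html.parser; the second document adds a leading string purely for contrast.

```python
from bs4 import BeautifulSoup

single = BeautifulSoup("<head><title>The Dormouse's story</title></head>",
                       "html.parser")
# In bs4, <head> has exactly one child, so .string is borrowed from <title>.
single_string = single.head.string

mixed = BeautifulSoup("<head>intro<title>The Dormouse's story</title></head>",
                      "html.parser")
# Two children (a string and a tag): .string is None.
mixed_string = mixed.head.string
```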
Instead of getting them as a list, you can iterate over a tag’s children using the .children generator:
for child in title_tag.children:
    print(child)
# The Dormouse's story
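The walk-through above can be condensed into one runnable sketch, assuming Beautiful Soup 4 (bs4) on Python 3 with html.parser. Under bs4 the node classes are bs4's Tag and NavigableString rather than BeautifulSoup.Tag.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser")
head_tag = soup.head

children_list = head_tag.contents          # a real list, so it can be indexed
title_tag = children_list[0]
title_is_tag = isinstance(title_tag, Tag)
text_node = title_tag.contents[0]
text_is_string = isinstance(text_node, NavigableString)

# .children yields the same nodes one at a time instead of as a list.
from_children = [child for child in head_tag.children]
top_level_names = [node.name for node in soup.contents]
```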
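A function that recursively collects the text:
def gettextonly(self, soup):
    v = soup.string
    if v is None:
        # No single string: recurse into each child and join the pieces
        c = soup.contents
        resulttext = ''
        for t in c:
            subtext = self.gettextonly(t)
            resulttext += subtext + ' '
        return resulttext
    else:
        return v.strip()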
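A standalone variant of the same idea, as a sketch assuming Beautiful Soup 4 (bs4) on Python 3 with html.parser. It is not the method above: it is a plain function named get_text_only (a name chosen here), it tests the node type instead of .string, and it joins the pieces with single spaces.

```python
from bs4 import BeautifulSoup, NavigableString

def get_text_only(node):
    """Recursively join the stripped text of a node and its descendants."""
    if isinstance(node, NavigableString):
        return node.strip()
    parts = (get_text_only(child) for child in node.contents)
    return " ".join(part for part in parts if part)   # skip empty pieces

soup = BeautifulSoup("<div><h1>Title</h1><p>Some <b>bold</b> words.</p></div>",
                     "html.parser")
text = get_text_only(soup)
print(text)
```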
A function that splits a string into words:
def separatewords(self, text):
    splitter = re.compile(r'\W')
    return [s.lower() for s in splitter.split(text) if s != '']
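A standalone sketch of the word splitter using only the standard library on Python 3. The function name separate_words and the sample sentence are chosen here for illustration; the pattern uses \W+ so that runs of punctuation produce no empty pieces in the middle.

```python
import re

_SPLITTER = re.compile(r"\W+")   # one or more non-word characters

def separate_words(text):
    """Split text on non-word characters and lowercase each piece."""
    return [word.lower() for word in _SPLITTER.split(text) if word]

words = separate_words("The Dormouse's story, retold!")
print(words)
```

Note that the apostrophe counts as a non-word character, so "Dormouse's" is split into two pieces.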