Windows: open cmd with "Run as administrator"
Run: pip install beautifulsoup4
Demo HTML page URL: http://python123.io/ws//demo.html
File name: demo.html
Page source: HTML 5.0 code
Quick test of the Beautiful Soup installation:
>>> import requests
>>> r = requests.get("http://python123.io/ws//demo.html")
>>> r.text
'<html><head><title>This is a python demo page</title></head> <body> <p class="title"><b>The demo python introduces several python courses.</b></p> <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p> </body></html>'
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, 'html.parser')
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
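If the demo page is unreachable, the installation can still be checked offline. The following is a minimal sketch that parses a short inline string instead of fetching the page; the inline HTML is a shortened stand-in for demo.html, not the real file.

```python
# Offline sanity check: no network access is needed.
import bs4
from bs4 import BeautifulSoup

# Shortened stand-in for demo.html (assumption: only the title is kept).
html = "<html><head><title>This is a python demo page</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string
print(bs4.__version__)  # version string of the installed package
print(title_text)
```

If both lines print without an ImportError, the package is installed and the built-in parser works.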
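Basic elements of the Beautiful Soup library:
Understanding the Beautiful Soup library:
Beautiful Soup is a library for parsing, traversing, and maintaining a "tag tree".
<p>..</p> : a Tag
Importing Beautiful Soup: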
from bs4 import BeautifulSoup
import bs4
Beautiful Soup parsers:
soup = BeautifulSoup('<html>data</html>', 'html.parser')
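The second argument names the parser. The notes only use the built-in 'html.parser'; third-party parsers such as 'lxml' can be selected the same way once they are installed. A rough sketch of falling back to the built-in parser follows; make_soup is a hypothetical helper name, not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: try the preferred parser, fall back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<html><body><p>data</p></body></html>")
p_text = soup.p.string
print(p_text)
```

Different parsers can build slightly different trees from malformed HTML, so pinning one parser explicitly keeps results reproducible.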
Basic elements of the BeautifulSoup class:
<p class="title"> ... </p>
Tag:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, 'html.parser')
>>> soup.title
<title>This is a python demo page</title>
>>> tag = soup.a
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Any tag that exists in HTML syntax can be accessed as soup.<tag>. When the document contains several <tag> elements with the same name, soup.<tag> returns the first one.
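Because soup.<tag> only returns the first match, collecting every match needs a different call. A small sketch using find_all(), with an inline fragment adapted from the demo page:

```python
from bs4 import BeautifulSoup

# Fragment adapted from the demo page (two <a> tags in one paragraph).
html = (
    '<p class="course">'
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a              # only the first <a>
all_links = soup.find_all("a")   # every <a>, in document order
missing = soup.table             # None: the fragment has no <table>

link_ids = [a["id"] for a in all_links]
print(first_link["id"], link_ids, missing)
```

Note that a tag name absent from the document yields None rather than raising an error, so chained access such as soup.table.name would fail.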
Tag name:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, 'html.parser')
>>> soup.a.name
'a'
>>> soup.a.parent.name
'p'
>>> soup.a.parent.parent.name
'body'
Every <tag> has a name, available as <tag>.name; it is a string.
Tag attrs (attributes):
>>> tag = soup.a
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<class 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>
A <tag> can have zero or more attributes; .attrs is a dict.
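Since .attrs is a plain dict, a missing attribute raises KeyError when indexed. The sketch below shows the shorthand tag[...] form and the safer .get() lookup; the inline fragment is an assumption made for the example.

```python
from bs4 import BeautifulSoup

html = (
    '<p><a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
    '<b>no attributes here</b></p>'
)
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

href = tag["href"]          # shorthand for tag.attrs["href"]; KeyError if absent
title = tag.get("title")    # None: the attribute does not exist
classes = tag.get("class")  # a list, because class may hold several values
no_attrs = soup.b.attrs     # empty dict for a tag with zero attributes

print(href, title, classes, no_attrs)
```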
Tag NavigableString:
>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
'The demo python introduces several python courses.'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
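A NavigableString can span multiple levels: .string reaches the text even when it is nested inside an inner tag, as in <p><b>...</b></p>.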
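The multi-level behaviour has a limit worth seeing: .string only descends while each tag has exactly one child. A short sketch with a shortened fragment of the demo page (the paragraph text is abbreviated for the example):

```python
from bs4 import BeautifulSoup

html = (
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python courses: <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")
title_p, course_p = soup.find_all("p")

single = title_p.string          # descends through the single <b> child
multiple = course_p.string       # None: this <p> has several children
full_text = course_p.get_text()  # concatenates all strings below the tag

print(single)
print(multiple)
print(full_text)
```

When a tag mixes text and child tags, get_text() is usually the call that returns what a reader would see.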
Tag Comment:
>>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
>>> newsoup.b.string
'This is a comment'
>>> type(newsoup.b.string)
<class 'bs4.element.Comment'>
>>> newsoup.p.string
'This is not a comment'
>>> type(newsoup.p.string)
<class 'bs4.element.NavigableString'>
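Comment is a special string type: .string drops the <!-- --> markers, so only the type distinguishes a comment from ordinary text.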
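In practice this means a type check is needed before treating .string as visible text. A minimal sketch; visible_string is a hypothetical helper introduced only for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser"
)


def visible_string(tag):
    """Hypothetical helper: return tag.string, or None if it is a comment."""
    s = tag.string
    if isinstance(s, Comment):
        return None
    return s


b_text = visible_string(newsoup.b)
p_text = visible_string(newsoup.p)
# Comment is a subclass of NavigableString, so test for Comment first.
comment_is_string = isinstance(newsoup.b.string, NavigableString)

print(b_text, p_text, comment_is_string)
```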
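Tag: <tag>
HTML traversal methods based on the bs4 library:
Basic HTML structure:
<>...</> pairs express containment and so form a tree of tags.
Downward traversal of the tag tree:
The BeautifulSoup object is the root node of the tag tree.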
>>> soup = BeautifulSoup(demo, 'html.parser')
>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents
[' ', <p class="title"><b>The demo python introduces several python courses.</b></p>, ' ', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, ' ']
>>> len(soup.body.contents)
5
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>
Iterating over child nodes:
for child in soup.body.children:
    print(child)
Iterating over all descendant nodes:
for child in soup.body.descendants:
    print(child)
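Both iterators yield whitespace strings as well as tags, which is why .contents above has more entries than there are <p> elements. The sketch below filters for Tag objects to make the difference between the two traversals visible; the inline body is a shortened stand-in for the demo page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the demo page body, whitespace between tags kept.
html = (
    '<body> <p class="title"><b>The demo python introduces several python courses.</b></p> '
    '<p class="course">Python <a id="link1">Basic Python</a></p> </body>'
)
soup = BeautifulSoup(html, "html.parser")

# .children yields direct children only (tags and strings).
child_count = len(list(soup.body.children))
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]

# .descendants walks the whole subtree in document order.
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(child_count, child_tags, descendant_tags)
```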
Upward traversal of the tag tree:
soup = BeautifulSoup(demo, 'html.parser')
for parent in soup.a.parents:  # upward traversal of the tag tree
    if parent is None:
        print(parent)
    else:
        print(parent.name)
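The loop visits every ancestor, including the soup object itself (whose own parent is None), so the code guards against None before reading .name.
Output: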
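For reference, a self-contained sketch of the same upward traversal on a reduced version of the demo page (only the chain html > body > p > a is kept, which is an assumption made for brevity):

```python
from bs4 import BeautifulSoup

html = '<html><body><p class="course"><a id="link1">Basic Python</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Collect the name of every ancestor of the first <a>, nearest first.
ancestor_names = [parent.name for parent in soup.a.parents]
root_parent = soup.parent  # the root of the tree has no parent

for name in ancestor_names:
    print(name)
# prints: p, body, html, [document] (one per line)
```

The last name, '[document]', belongs to the BeautifulSoup object itself, which sits above <html> as the root of the tree.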
Sibling traversal of the tag tree:
Sibling traversal happens only among nodes that share the same parent.
Checks during traversal:
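A sibling is not necessarily a tag: text between two tags is a sibling too, so a type check is usually needed. A brief sketch using the .next_sibling, .previous_sibling, and .next_siblings attributes, on a shortened fragment of the demo page:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<p class="course">Python courses: <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.a

next_node = first.next_sibling      # the text ' and ', not the second <a>
prev_node = first.previous_sibling  # the text before the first <a>

# Keep only siblings that are tags.
later_tag_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]

# A node with no following sibling returns None.
no_sibling = soup.p.next_sibling

print(repr(next_node), repr(prev_node), later_tag_ids, no_sibling)
```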
Displaying HTML content in a more readable way:
The bs4 prettify() method:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get("http://python123.io/ws//demo.html")
>>> demo = r.text
>>> demo
'<html><head><title>This is a python demo page</title></head> <body> <p class="title"><b>The demo python introduces several python courses.</b></p> <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p> </body></html>'
>>> soup = BeautifulSoup(demo, 'html.parser')
>>> soup.prettify()
'<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n    Basic Python\n   </a>\n   and\n   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
.prettify() adds a newline ('\n') after each <> tag and each piece of content in the HTML text.
.prettify() also works on a single tag: <tag>.prettify()
>>> print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>
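To see exactly what prettify() changes, it helps to compare it with the plain str() form of the same tree. A small sketch on a made-up two-level fragment:

```python
from bs4 import BeautifulSoup

# Made-up fragment with no whitespace between tags.
soup = BeautifulSoup("<div><p>hi</p></div>", "html.parser")

compact = str(soup)       # serialised as parsed, no added newlines
pretty = soup.prettify()  # a str with one tag or text item per line

# Strip the indentation to look only at the line structure.
stripped_lines = [line.strip() for line in pretty.splitlines()]

print(compact)
print(pretty)
```

prettify() returns an ordinary string, so the added whitespace becomes part of the text if the result is parsed again; it is best kept for display rather than for further processing.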
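Encoding in bs4:
bs4 converts any HTML input to Unicode and encodes output as UTF-8 by default. Python 3.x strings are Unicode and UTF-8 is its default source encoding, so non-ASCII text parses without extra steps.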
>>> soup = BeautifulSoup("<p>中文</p>", 'html.parser')
>>> soup.p.string
'中文'
>>> print(soup.p.prettify())
<p>
 中文
</p>
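The example above starts from a Python str, which is already Unicode. The conversion matters more when the input is raw bytes in another encoding. The sketch below assumes GBK-encoded input and passes the encoding explicitly through the from_encoding argument rather than relying on auto-detection.

```python
from bs4 import BeautifulSoup

# Simulate bytes received from a page served in GBK (assumption for the example).
raw = "<p>中文</p>".encode("gbk")

# Telling bs4 the source encoding avoids guessing on very short inputs.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string                # decoded to a normal Unicode string
utf8_bytes = soup.p.encode("utf-8")  # serialise the tag back out as UTF-8 bytes

print(text)
print(utf8_bytes.decode("utf-8"))
```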