I. The Beautiful Soup Library
from bs4 import BeautifulSoup
soup = BeautifulSoup(demo, "html.parser")
Basic elements: Tag, Name, Attributes, NavigableString, Comment
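A minimal sketch of the five basic elements, assuming a small made-up HTML fragment (not the demo page used in these notes). It only shows where each element type can be reached from a parsed tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical sample markup, chosen only to show each element type once.
html = '<p class="title" id="intro"><b>Hello</b><!--a note--></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                 # Tag: the first <p> element in the document
name = tag.name              # Name: the tag's name as a string
attrs = tag.attrs            # Attributes: a dict (class is multi-valued, so a list)
text = tag.b.string          # NavigableString: the text inside <b>...</b>
comment = tag.contents[1]    # Comment: a special kind of NavigableString

print(isinstance(tag, Tag), name, attrs)
print(type(text).__name__, repr(str(text)))
print(type(comment).__name__, repr(str(comment)))
```

Because Comment is a subclass of NavigableString, code that must tell real text apart from comments should check the type explicitly.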
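.contents: list of child nodes; stores all direct children of <tag> in a list
.children: iterator over the direct child nodes
.descendants: iterator over all descendant nodes
.parent: the node's parent tag
.parents: iterator over the node's ancestor tags
.next_sibling(s): the next sibling node(s) in HTML document order
.previous_sibling(s): the previous sibling node(s) in HTML document order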
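The navigation attributes above can be tried on a small tree. This is a sketch on a made-up fragment; the markup is written without whitespace between tags so that no whitespace-only text nodes show up among the children and siblings.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the demo page from these notes.
html = "<div><p>one</p><p>two</p><p>three <b>bold</b></p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div
first_p = div.contents[0]

# Downward: direct children, then all descendants (text nodes have name None).
child_names = [c.name for c in div.contents]
descendant_tags = [d.name for d in div.descendants if d.name]

# Upward: the parent, then every ancestor up to the document object.
parent_name = first_p.parent.name
ancestor_names = [p.name for p in soup.b.parents]

# Sideways: siblings share the same parent.
next_text = first_p.next_sibling.get_text()
later_texts = [s.get_text() for s in first_p.next_siblings]
prev_of_first = first_p.previous_sibling   # None: there is no earlier sibling

print(child_names, descendant_tags)
print(parent_name, ancestor_names)
print(next_text, later_texts, prev_of_first)
```

On real pages siblings and children are often newline strings rather than tags, so it is usually worth checking the node type before using tag-only attributes.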
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> demo
'<html><head><title>This is a python demo page</title></head> <body> <p class="title"><b>The demo python introduces several python courses.</b></p> <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p> </body></html>'
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.prettify()
'<html>
<head>
<title>
This is a python demo page
</title>
</head>
<body>
<p class="title">
<b>
The demo python introduces several python courses.
</b>
</p>
<p class="course">
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
Advanced Python
</a>
.
</p>
</body>
</html>'
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
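II. Information Organization and Extraction
1. Three forms of information markup:
XML: tags written in angle brackets
JSON: typed key-value pairs
YAML: untyped key-value pairs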
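To make the difference concrete, here is the same made-up record in all three forms. The sketch uses only the standard library, so the XML and JSON versions are parsed while the YAML version is shown as text only (parsing YAML needs a third-party library).

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record, used only for illustration.
xml_text = "<course><name>Basic Python</name><weeks>9</weeks></course>"
json_text = '{"name": "Basic Python", "weeks": 9}'
yaml_text = "name: Basic Python\nweeks: 9\n"   # no quotes or type markers in the markup

# XML: structure comes from tags; every value arrives as text.
root = ET.fromstring(xml_text)
xml_record = {child.tag: child.text for child in root}

# JSON: the markup itself distinguishes strings from numbers.
json_record = json.loads(json_text)

print(xml_record)    # {'name': 'Basic Python', 'weeks': '9'}
print(json_record)   # {'name': 'Basic Python', 'weeks': 9}
print(yaml_text)
```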
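3. General methods of information extraction
(1) Fully parse the markup format of the information, then extract the key information
(2) Ignore the markup format and search for the key information directly
(3) Combined method: parse the markup and search within the parsed result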
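The three methods can be compared side by side on one small input. This is a sketch with made-up markup and URLs; the regular expressions are illustrative assumptions and would need adjusting for real pages.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup with made-up URLs.
html = ('<p>See <a href="http://example.com/a" id="link1">A</a> and '
        '<a href="http://example.com/b" id="link2">B</a>.</p>')

# (1) Parse the markup completely, then read values from the tree.
soup = BeautifulSoup(html, "html.parser")
parsed = [a.get("href") for a in soup.find_all("a")]

# (2) Ignore the markup and search the raw text directly.
searched = re.findall(r'href="([^"]+)"', html)

# (3) Combined: parse to locate the tags, then search within their values.
combined = [a["href"] for a in soup.find_all("a") if re.search(r"/b$", a["href"])]

print(parsed)
print(searched)
print(combined)
```

Method (1) is accurate but heavier, method (2) is fast but depends on how the text happens to be written, and method (3) trades between the two.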
Example: extract every link URL from the demo page.
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> for link in soup.find_all('a'):
    print(link.get('href'))

http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
Note: the parser name is "html.parser", with a dot. Typing "html,parser" raises bs4.FeatureNotFound, and leaving the colon off the for statement raises SyntaxError.
4. HTML content search methods in the bs4 library
<>.find_all(name, attrs, recursive, string, **kwargs)
Returns a list that stores the search results.
name: search string matched against tag names
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a','b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> for tag in soup.find_all(True):
    print(tag.name)

html
head
title
body
p
b
p
a
a
>>> import re
>>> for tag in soup.find_all(re.compile('b')):
    print(tag.name)

body
b
attrs: search string matched against tag attribute values; a specific attribute can be named as a keyword
>>> soup.find_all('p','course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses: <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all(id='link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> soup.find_all(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
recursive: whether to search all descendants (default True); with False only direct children are searched
>>> soup.find_all('a',recursive=False)
[]
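string: search string matched against the text between <>...</>
>>> soup.find_all(string = 'Basic Python')
['Basic Python']
>>> soup.find_all(string = re.compile("python"))
['This is a python demo page', 'The demo python introduces several python courses.']
Shorthand: <tag>(..) is equivalent to <tag>.find_all(..), so soup(..) is equivalent to soup.find_all(..)
>>> soup(string = 'Basic Python')
['Basic Python']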
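Extended methods (same parameters as find_all(); the singular forms return only the first match):
<>.find(), <>.find_parents(), <>.find_parent(), <>.find_next_siblings(), <>.find_next_sibling(), <>.find_previous_siblings(), <>.find_previous_sibling()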
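A short sketch of a few of these extended methods on a made-up fragment. The ids and hrefs are hypothetical and only loosely mirror the demo page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html = ('<p class="course">Intro <a id="link1" href="#1">Basic</a> and '
        '<a id="link2" href="#2">Advanced</a>.</p>')
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                        # first match only, not a list
first_id = first["id"]

parent_name = first.find_parent("p").name     # nearest matching ancestor
parent_names = [t.name for t in first.find_parents("p")]

second = first.find_next_sibling("a")         # skips the text node " and "
second_id = second["id"]
back_id = second.find_previous_sibling("a")["id"]

missing = soup.find("table")                  # no match: find() returns None

print(first_id, parent_name, parent_names)
print(second_id, back_id, missing)
```

Since find() returns None when nothing matches, calling a method on its result without a check is a common source of AttributeError.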
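III. A Targeted Crawler for Chinese University Rankings
Technical approach: requests + bs4
Feasibility: check the site's robots protocol (robots.txt) before crawling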
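One way to do the feasibility check in code is the standard library's urllib.robotparser. The sketch below parses made-up rules from a list so that it runs offline; the rules and the example.com URLs are assumptions, not the real robots.txt of the ranking site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied inline so no network is needed.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# can_fetch(useragent, url) answers whether that agent may crawl the URL.
allowed = parser.can_fetch("*", "http://example.com/rankings.html")
blocked = parser.can_fetch("*", "http://example.com/private/data.html")

print(allowed, blocked)
```

For a live site the usual pattern is parser.set_url(".../robots.txt") followed by parser.read() instead of parse().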
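Step 1: fetch the ranking page content, getHTMLText()
Step 2: extract the information into a suitable data structure, fillUnivList()
Step 3: use the data structure to display the results, printUnivList()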
import requests
from bs4 import BeautifulSoup
import bs4


def getHTMLText(url):
    # Fetch the page and return its text; return "" if the request fails.
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def fillUnivList(ulist, html):
    # Append [rank, school name, total score] for each table row.
    soup = BeautifulSoup(html, 'html.parser')
    tbody = soup.find('tbody')
    if tbody is None:  # empty or unexpected page: nothing to extract
        return
    for tr in tbody.children:
        if isinstance(tr, bs4.element.Tag):  # filter out non-tag children such as newline strings
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])


def printUnivList(ulist, num):
    # Print at most num entries, centered in fixed-width columns.
    print("{:^10} {:^6} {:^10}".format("排名", "学校名称", "总分"))
    for u in ulist[:num]:
        print("{:^10} {:^6} {:^10}".format(u[0], u[1], u[2]))


def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)


main()
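The isinstance filter in fillUnivList() matters because the children of <tbody> include the newline strings between rows. The offline sketch below shows this on a made-up two-row table; the column layout (rank, school, region, total score) is an assumption chosen to match the tds[0], tds[1], tds[3] indexing.

```python
import bs4
from bs4 import BeautifulSoup

# Hypothetical table with made-up rows: rank, school, region, total score.
html = """
<table><tbody>
<tr><td>1</td><td>Univ A</td><td>Region X</td><td>94.6</td></tr>
<tr><td>2</td><td>Univ B</td><td>Region Y</td><td>76.5</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")
children = list(soup.find("tbody").children)

# The newline strings between <tr> rows appear as NavigableString children.
kinds = [type(c).__name__ for c in children]

rows = []
for tr in children:
    if isinstance(tr, bs4.element.Tag):   # keep only real <tr> tags
        tds = tr("td")                    # shorthand for tr.find_all("td")
        rows.append([str(tds[0].string), str(tds[1].string), str(tds[3].string)])

print(kinds)
print(rows)
```

Without the filter, calling tr("td") on a NavigableString would fail, because strings are not callable like tags.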
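Optimized version: the school-name column is padded with the full-width CJK space chr(12288) instead of the ASCII space, so that columns containing Chinese text line up.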
import requests
from bs4 import BeautifulSoup
import bs4


def getHTMLText(url):
    # Fetch the page and return its text; return "" if the request fails.
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def fillUnivList(ulist, html):
    # Append [rank, school name, total score] for each table row.
    soup = BeautifulSoup(html, 'html.parser')
    tbody = soup.find('tbody')
    if tbody is None:  # empty or unexpected page: nothing to extract
        return
    for tr in tbody.children:
        if isinstance(tr, bs4.element.Tag):  # filter out non-tag children such as newline strings
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])


def printUnivList(ulist, num):
    # Field {3} supplies the fill character for the school-name column.
    tplt = "{0:^10} {1:{3}^10} {2:^10}"
    print(tplt.format("排名", "学校名称", "总分", chr(12288)))
    for u in ulist[:num]:
        print(tplt.format(u[0], u[1], u[2], chr(12288)))


def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)


main()
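The padding trick can be checked in isolation. This sketch formats one made-up row with the same template, using "|" as a separator (an assumption made here so the fields are easy to split), and compares the middle field under both fill characters.

```python
FULL_WIDTH_SPACE = chr(12288)   # U+3000, the CJK full-width space

# Same idea as the template above: field {3} is the fill character for column 1.
tplt = "{0:^10}|{1:{3}^10}|{2:^10}"
row = tplt.format("1", "清华大学", "94.6", FULL_WIDTH_SPACE)
middle = row.split("|")[1]

# Default padding uses the ASCII space, which is narrower than a CJK character.
plain_middle = "{:^10}".format("清华大学")

print(row)
print(repr(middle))
print(repr(plain_middle))
```

Both fields are 10 characters long, but only the full-width padding occupies the same display width as the Chinese characters, which is why the columns line up in a terminal with a CJK-aware font.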