The BeautifulSoup Module in Python

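The BeautifulSoup module
- Overview
  BeautifulSoup is a tool for extracting the data we need from HTML source. For this job it is usually more convenient and more robust than hand-written regular expressions.
  The current package is published as beautifulsoup4 and imported as bs4, which is why BeautifulSoup is often simply called bs4.
- Installation
  pip install beautifulsoup4
- A simple example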
```
import requests
import bs4

url = 'https://www.lagou.com/'
res = requests.get(url)
res.raise_for_status()
no = bs4.BeautifulSoup(res.text)
print(type(no))
```
bs4.BeautifulSoup('a string containing the HTML content') returns a BeautifulSoup object.
Running the code above as-is produces a warning:

D:/JavaSoft/pycharm-professional-2019.3/WorkSpace/python_learning/python_base/webcrawle/webcrawle_demo3.py:11: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 11 of the file D:/JavaSoft/pycharm-professional-2019.3/WorkSpace/python_learning/python_base/webcrawle/webcrawle_demo3.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.

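In short: the warning does not mean that a parser is missing. No parser was named explicitly, so bs4 picked the best HTML parser available on this system ("lxml"). That is usually harmless, but on another system or in a different virtual environment bs4 may pick a different parser and behave differently.
Summary: make sure the lxml module is installed (pip install lxml), then pass the parser name as the second argument when constructing the object, and the warning goes away.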
```
import requests
import bs4

url = 'https://www.lagou.com/'
res = requests.get(url)
res.raise_for_status()
no = bs4.BeautifulSoup(res.text, 'lxml')
print(type(no))
```
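To see the effect of naming the parser without touching the network, the following is a minimal sketch that parses a small inline string. The HTML snippet and the helper name parser_warnings are made up for illustration, and it uses 'html.parser' (the parser bundled with Python) rather than lxml, so it should run even where lxml is not installed.

```python
import warnings

import bs4

# Hypothetical HTML snippet, used only for this illustration
HTML = "<html><head><title>demo</title></head><body><p>hello</p></body></html>"


def parser_warnings(*args):
    """Build a soup and collect any 'no parser specified' warnings it raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        soup = bs4.BeautifulSoup(HTML, *args)
    hits = [w for w in caught
            if "No parser was explicitly specified" in str(w.message)]
    return soup, hits


# Parser left for bs4 to guess: the warning from the article is raised
soup_guessed, guessed = parser_warnings()
# Parser named explicitly: no such warning
soup_explicit, explicit = parser_warnings("html.parser")

print(len(guessed), len(explicit))
print(type(soup_explicit), soup_explicit.title.string)
```

Whichever parser you choose, naming it keeps the behaviour the same across machines, which is exactly what the warning is about.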
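- Purpose of the select method: once bs4 has loaded the whole HTML source into the object, you can call select() with a CSS selector to find the elements and data you need.
  How it works: according to the source code, each call to select() returns a ResultSet object from the element module. ResultSet inherits from list, so what comes back is effectively a list of Tag objects. A Tag can be passed to str() to get its HTML, and it has an attrs attribute that stores all of the tag's HTML attributes as a dictionary.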
```
import requests
import bs4

# Download the page from lagou.com and save it to a local file (written as bytes)
# url = 'https://www.lagou.com/'
# res = requests.get(url)
# res.raise_for_status()
# with open('lagou.txt', 'wb') as op:
#     for chunk in res.iter_content(1000):
#         op.write(chunk)
# Read the saved page back; the with block closes the file once parsing is done
with open('lagou.txt', 'r', encoding='utf-8') as file:
    soup = bs4.BeautifulSoup(file, 'lxml')
print(type(soup))
elems = soup.select('#search_input')
print(elems)
print(type(elems))
print(len(elems))
print(elems[0])
print(type(elems[0]))
print(elems[0].getText())
print(elems[0].attrs)
print(elems[0].get('placeholder'))
```
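Output after running the code:

<class 'bs4.BeautifulSoup'>
[<input maxlength="64" placeholder="搜索职位、公司或地点" type="text" id="search_input" class="search_input" autocomplete="off" tabindex="1" value=""/>]
<class 'bs4.element.ResultSet'>
1
<input maxlength="64" placeholder="搜索职位、公司或地点" type="text" id="search_input" class="search_input" autocomplete="off" tabindex="1" value=""/>
<class 'bs4.element.Tag'>

{'maxlength': '64', 'placeholder': '搜索职位、公司或地点', 'type': 'text', 'id': 'search_input', 'class': ['search_input'], 'autocomplete': 'off', 'tabindex': '1', 'value': ''}
搜索职位、公司或地点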
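
The same behaviour can be reproduced offline. The sketch below is only an illustration: the markup is a made-up stand-in modelled on the search box, not the real page, and it uses 'html.parser' so that lxml is not required. It also shows select_one(), which the code above does not use; it returns the first match or None, so an empty result does not end in an IndexError the way elems[0] would.

```python
import bs4

# Hypothetical markup modelled on the search box above; not the real page
HTML = """
<form>
  <input id="search_input" class="search_input" type="text"
         placeholder="Search jobs, companies or places" value="">
</form>
"""

soup = bs4.BeautifulSoup(HTML, "html.parser")

elems = soup.select("#search_input")   # CSS id selector
print(isinstance(elems, list))         # ResultSet inherits from list
tag = elems[0]
print(str(tag))                        # a Tag can be passed to str()
print(tag.attrs["class"])              # multi-valued attributes come back as lists
print(tag.get("placeholder"))          # shortcut for tag.attrs.get("placeholder")
print(repr(tag.getText()))             # an <input> has no text content

# select_one() returns the first match, or None when nothing matches
missing = soup.select_one("#no_such_id")
print(missing is None)
```

Note that class comes back as a list while the other attributes are plain strings, which matches the attrs dictionary printed above.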

Original article: https://www.cnblogs.com/myfaith-feng/p/12727973.html