Python Crawling for Beginners (23): Getting Started with the pyquery Parsing Library

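Life is short, I use Python

Previous posts in this series:

Python Crawling for Beginners (1): Introduction

Python Crawling for Beginners (2): Preparation (1): Installing the Basic Libraries

Python Crawling for Beginners (3): Preparation (2): Linux Basics

Python Crawling for Beginners (4): Preparation (3): Docker Basics

Python Crawling for Beginners (5): Preparation (4): Database Basics

Python Crawling for Beginners (6): Preparation (5): Installing Crawler Frameworks

Python Crawling for Beginners (7): HTTP Basics

Python Crawling for Beginners (8): Web Page Basics

Python Crawling for Beginners (9): Crawler Basics

Python Crawling for Beginners (10): Session and Cookies

Python Crawling for Beginners (11): urllib Basics (1)

Python Crawling for Beginners (12): urllib Basics (2)

Python Crawling for Beginners (13): urllib Basics (3)

Python Crawling for Beginners (14): urllib Basics (4)

Python Crawling for Beginners (15): urllib Basics (5)

Python Crawling for Beginners (16): urllib in Practice: Scraping Meizitu

Python Crawling for Beginners (17): Requests Basics

Python Crawling for Beginners (18): Requests Advanced Usage

Python Crawling for Beginners (19): XPath Basics

Python Crawling for Beginners (20): XPath Advanced

Python Crawling for Beginners (21): The Beautiful Soup Parsing Library (Part 1)

Python Crawling for Beginners (22): The Beautiful Soup Parsing Library (Part 2)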

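Introduction

The previous post showed that Beautiful Soup supports CSS selectors, but its CSS selector support turned out to be less powerful than one might hope.

This post introduces a library that is much friendlier to CSS selectors: pyquery. Its syntax stays very close to jQuery, which should be welcome news for back-end developers.

As usual, here are the official links first:

Official documentation: https://pyquery.readthedocs.io/en/latest/

PyPI: https://pypi.org/project/pyquery/

GitHub: https://github.com/gawel/pyquery

When in doubt, go to the official sources. That advice never fails.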

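Initialization

First, make sure pyquery is installed. If it is not, look back at the earlier preparation posts, where I covered how to install it.

Let's start with a simple initialization example (reusing the HTML from the previous post; laziness really is incurable):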
```
from pyquery import PyQuery

# Sample document (the same HTML used in the previous post)
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
'''

# Initialize from an HTML string, then select every <p> node
d = PyQuery(html)
print(d('p'))
```
The output is:
```
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
```
The example above initializes PyQuery directly from a string. It can also be initialized by passing in a URL:
```
d_url = PyQuery(url='https://www.geekdigging.com/', encoding='UTF-8')
print(d_url('title'))
```
The output is:
```
<title>极客挖掘机</title>
```
Written this way, PyQuery first requests the URL and then initializes itself from the HTML in the response. It is equivalent to the following:
```
import requests

r = requests.get('https://www.geekdigging.com/')
r.encoding = 'UTF-8'
d_requests = PyQuery(r.text)
print(d_requests('title'))
```
CSS Selectors

Let's get a quick feel for CSS selectors first. They really are simple and convenient:
```
d_css = PyQuery(html)
print(d_css('.story .sister'))
print(type(d_css('.story .sister')))
```
The output is:
```
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
<class 'pyquery.pyquery.PyQuery'>
```
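This selector first finds the nodes whose class is story, and then searches their descendants for nodes whose class is sister.

The last line of the output shows that the result is still of type pyquery.pyquery.PyQuery, which means we can keep querying it.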

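Finding Nodes

Next, let's look at the commonly used lookup functions. The best thing about them is that they work just like their jQuery counterparts.
- find(): searches all descendants of a node.
- children(): searches direct children only.
- parent(): returns the parent node.
- parents(): returns the ancestor nodes.
- siblings(): returns the sibling nodes.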
Here are a few simple examples:
```
# Find descendant nodes
items = d('body')
print('Descendants:', items.find('p'))
print(type(items.find('p')))

# Find the parent node
items = d('#link1')
print('Parent:', items.parent())
print(type(items.parent()))

# Find sibling nodes
items = d('#link1')
print('Siblings:', items.siblings())
print(type(items.siblings()))
```
The output is:
```
Descendants: <p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>

<class 'pyquery.pyquery.PyQuery'>
Parent: <p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<class 'pyquery.pyquery.PyQuery'>
Siblings: <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
<class 'pyquery.pyquery.PyQuery'>
```
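Iteration

As the examples above show, a selection that matches several nodes is still a single PyQuery object rather than a plain list of results as in Beautiful Soup. To work with the matched nodes one by one, iterate over the selection with items():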
```
a = d('a')
for item in a.items():
    print(item)
```
The output is:
```
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
```
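Calling items() returns a generator. Iterating over it yields the a nodes one at a time, and each of them is itself a PyQuery object. That means every a node can use the methods described earlier, such as searching its children or looking up one of its ancestors, which makes this very flexible.

Extracting Information

Once we have the nodes, the next step is to pull out the information we need.

There are two main kinds of information to extract: the text of a node and the attributes of a node.

Getting Text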
```
a_1 = d('#link1')
print(a_1.text())
```
The output is:
```
Elsie
```
To get the HTML inside a node, use the html() method:
```
a_2 = d('.story')
print(a_2.html())
```
The output is:
```
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
```
Getting Attributes

Once we have a node, we can use attr() to read its attributes:
```
attr_1 = d('#link1')
print(attr_1.attr('href'))
```
The output is:
```
http://example.com/elsie
```
Besides the attr() method, pyquery also provides an attr property, so the example above can also be written like this:
```
print(attr_1.attr.href)
```
The result is the same as in the example above.

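Summary

That wraps up the parsers we installed in the preparation posts. To compare them: Beautiful Soup is friendly to beginners and can be picked up without much background knowledge, although parsing a complex DOM still calls for some grounding in CSS selectors. If you are fluent in XPath, using lxml directly is the most convenient option. And if, like me, you are comfortable with both jQuery and CSS selectors, pyquery is a very good choice.

Next, I plan to share a few simple hands-on projects, so stay tuned.

Sample Code

All the code for this series is kept in repositories on GitHub and Gitee so that it is easy to grab.

Sample code on GitHub

Sample code on Gitee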
Original article: https://www.cnblogs.com/babycomeon/p/12071342.html
