Common Tools for Python Web Scraping: PyQuery

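PyQuery is a commonly used library for parsing pages. It gives Python a jQuery-style API, built on top of lxml.
Below is a piece of code that parses a basic page. More complex or practical usages will be added later as they come up.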

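from pyquery import PyQuery as pq


# Source given as a string
html_str = "<html></html>"

# Source given as a URL (the http:// prefix is required)
your_url = "http://www.baidu.com"

# Source given as a file
path_to_html_file = "hello123.html"

# Pass any one of these sources to pq() to get the parsed page
# d = pq(html_str)
# d = pq(etree.fromstring(html_str))  # needs: from lxml import etree
# d = pq(url=your_url)
# d = pq(url=your_url,
#        opener=lambda url, **kw: urlopen(url).read())
#        (urlopen comes from urllib2 in Python 2, urllib.request in Python 3)
d = pq(filename=path_to_html_file)

# 'd' now plays the role of jQuery's '$': a selector function that finds
# elements by tag, id, class, and so on

# Select by id
table = d("#my_table")

# Select by tag
head = d("head")

# Select by class; to match any of several classes, list the selectors
# together separated by commas, e.g. d(".a, .b")
p = d(".p_font")

# Get the text inside the element
text = p.text()
print(text)

# Get an attribute value
t_class = table.attr('class')
print(t_class)

# Iterate over the selection
# Print the text of the td cells in the table
for item in table.items():
    # Runs only once, because table matches a single element;
    # item is still a PyQuery object, so text() is a method
    print(item.text())

for item in table('td'):
    # Runs once per td; item is an lxml element, so text is an attribute
    print(item.text)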

The HTML used for testing:

<html>
    <head>
        <title>Test</title>
    </head>
    <body>
        <h1>Parse me!</h1>
        <img src = "" />
        <p>A paragraph.</p>
        <p class = "p_font">A paragraph with class.</p>
        <!-- comment -->
        <div>
            <p>A paragraph in div.</p>
        </div>
        <table id = "my_table" class = "test-table">
        <thead>
        </thead>
        <tbody>
            <tr>
                <td>Month</td>
                <td>Savings</td>
            </tr>
            <tr>
                <td>January</td>
                <td>$100</td>
            </tr>
        </tbody>
        </table>
    </body>
</html>

Parsing this HTML produces the following output:

A paragraph with class.
test-table
Month Savings January $100
Month
Savings
January
$100

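Because this code runs on Python 2, some pages fetched directly with requests cause encoding problems when passed to pq(). Converting the content with unicode() first fixes this. Part of the code is shown below: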

import requests
from pyquery import PyQuery as pq

r = requests.get('http://www.baidu.com')
# d = pq(r.content)
u = unicode(r.content, 'utf-8')
d = pq(u)


Original article: https://www.cnblogs.com/masako/p/6627468.html