lmth1: a handy web information extraction tool written in Python

0, Why lmth1?

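Nine out of ten Python users have used urllib, and nine out of ten people who scrape data have used BeautifulSoup. I am no exception: almost all of my scraping is done with BeautifulSoup.
BeautifulSoup is capable, but its API is clumsy and never feels smooth to use. I prefer jQuery's fluent API to a conventional one. So I spent two evenings building two libraries on top of BeautifulSoup, lmth and lmth1: lmth provides the basic functionality and handles Hpath parsing, while lmth1 provides the fluent API for data extraction.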

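The interface of lmth1 is very simple, and its implementation is simpler still: fewer than 300 lines of code. It is nonetheless powerful. As you will soon see, lmth1 can do in one line what takes ten lines with BeautifulSoup, and the result is easier to read.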

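1, Introduction

As the title says.
Before use, place lmth.py, lmth1.py and beautifulsoup.py in your Python environment directory, somewhere on the module search path.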

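2, Hpath

Hpath is an HTML path query expression that I defined, similar to XPath. Its syntax is very simple; a few examples are enough to explain it. For a strict definition, see the BNF in section 2.2.

2.1 Examples

Note that in the examples below, the elements retrieved are always those found under the target node.

Sample HTML used: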

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" >
<head>
    <title>Untitled Page</title>
</head>
<body>
<h1 id="title">Page list</h1>
<div id="content" class="sites">
    <a href="http://www.google.com/" class="good">Google</a>
    <a href="http://www.yahoo.com/" class="good">Yahoo</a>
    <a href="http://www.baidu.com/" class="asshole">Baidu</a>
    <a href="http://www.bing.com/" class="excellent">Bing</a>
</div>
<div id="tbl">
    <ul>
    <li class="odd">1</li>
    <li class="even">2</li>
    <li class="odd">3</li>
    <li class="even">4</li>
    <li class="odd">5</li>
    <li class="even">6</li>
    </ul>
</div>
</body>
</html>

2.1.1 Basic expressions

li
Purpose: get all li elements
Result:
[
     <li class="odd">1</li>,
     <li class="even">2</li>,
     <li class="odd">3</li>,
     <li class="even">4</li>,
     <li class="odd">5</li>,
     <li class="even">6</li>
]

div[id=tbl]
Purpose: get all div elements whose id attribute is tbl
Tip: filter on attributes for a more precise search
Result:
<div id="tbl">
<ul>
<li class="odd">1</li>
<li class="even">2</li>
<li class="odd">3</li>
<li class="even">4</li>
<li class="odd">5</li>
<li class="even">6</li>
</ul>
</div>

div[id=content, class=sites]
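Purpose: get all div elements whose id attribute is content and whose class attribute is sites
Tip: you can set several attribute values at once, separating the attribute pairs with commas
Result: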
<div id="content" class="sites">
<a href="http://www.google.com/" class="good">Google</a>
<a href="http://www.yahoo.com/" class="good">Yahoo</a>
<a href="http://www.baidu.com/" class="asshole">Baidu</a>
<a href="http://www.bing.com/" class="excellent">Bing</a>
</div>

div[@id]
Purpose: get the id attribute value of every div element
Tip: prefix the attribute you want to fetch with an @ sign
Result:
[
     'content',
     'tbl'
]

div[id=content]/a[@href]
Purpose: get the href attribute value of every a element under the div element whose id attribute is content
Result:
[
     'http://www.google.com',
     'http://www.yahoo.com',
     'http://www.baidu.com',
     'http://www.bing.com'
]
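
To make the semantics of the basic expressions concrete, here is a minimal sketch of how a single-step query such as div[@id] or a[class=good, @href] could be answered with Python's standard html.parser module. It is not lmth's actual implementation (which builds on BeautifulSoup); the names AttrCollector and collect, and the shortened sample markup, are assumptions made for illustration.

```python
from html.parser import HTMLParser

# A shortened version of the sample HTML (assumption: trimmed for brevity).
SAMPLE = """
<div id="content" class="sites">
    <a href="http://www.google.com/" class="good">Google</a>
    <a href="http://www.baidu.com/" class="asshole">Baidu</a>
</div>
<div id="tbl"><ul><li class="odd">1</li><li class="even">2</li></ul></div>
"""


class AttrCollector(HTMLParser):
    """Collect one attribute from every start tag that matches a tag name
    and a set of exact-match attribute predicates (hypothetical helper)."""

    def __init__(self, tag, wanted, preds=None):
        super().__init__()
        self.tag = tag            # element name, e.g. "div"
        self.wanted = wanted      # attribute to fetch, e.g. "id"
        self.preds = preds or {}  # exact-match filters, e.g. {"class": "good"}
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag != self.tag:
            return
        if all(attrs.get(k) == v for k, v in self.preds.items()):
            if self.wanted in attrs:
                self.values.append(attrs[self.wanted])


def collect(html, tag, wanted, preds=None):
    """Return the wanted attribute of every matching element in the document."""
    parser = AttrCollector(tag, wanted, preds)
    parser.feed(html)
    return parser.values


print(collect(SAMPLE, "div", "id"))                     # roughly div[@id]
print(collect(SAMPLE, "a", "href", {"class": "good"}))  # roughly a[class=good, @href]
```

The sketch handles only one path step and exact-match predicates; multi-step paths and regex values need more machinery.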

2.1.2 高级表达式

a[class={excellent|good}, @class, @href]
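Purpose: get the class and href attributes of all a elements whose class attribute is excellent or good
Tip: the part inside the braces is a regular expression, which makes an OR possible
Result: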
[
     {
         'href': 'http://www.google.com',
         'class': 'good'
     },
     {
         'href': 'http://www.yahoo.com',
         'class': 'good'
     },
     {
         'href': 'http://www.bing.com',
         'class': 'excellent'
     }
]

div[id={con.+}]/a[class={ass.+}, @class, @#]
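Purpose: get the class attribute and the content of every a element whose class attribute starts with ass, under the div elements whose id attribute starts with con
Tip: @# stands for the content (inner text) of the element
Result: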
{
     '#': 'Baidu',
     'class': 'asshole'
}

ul/li[class={e.+}, @#]
Purpose: get the content of every li element whose class attribute starts with e, under the ul elements
Tip: regular expressions can also be used for fuzzy matching
Result:
[
     '2',
     '4',
     '6'
]
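
The brace syntax can be pictured as a small matching rule: a value wrapped in {...} is treated as a regular expression, anything else as a literal. The sketch below illustrates that rule with Python's re module. The function names are hypothetical, and matching the whole attribute value (re.fullmatch) is an assumption; the original does not say whether lmth anchors the pattern.

```python
import re


def make_matcher(raw):
    """Turn a predicate value into a function that tests an attribute value.

    '{...}' is read as a regular expression; anything else is a literal.
    Assumption: the regex must match the entire attribute value.
    """
    raw = raw.strip()
    if raw.startswith("{") and raw.endswith("}"):
        pattern = re.compile(raw[1:-1])
        return lambda s: s is not None and pattern.fullmatch(s) is not None
    return lambda s: s == raw


def matches(attrs, preds):
    """True if every predicate in preds holds for the element's attrs dict."""
    return all(make_matcher(value)(attrs.get(name)) for name, value in preds.items())


print(matches({"class": "good"}, {"class": "{excellent|good}"}))     # True
print(matches({"class": "asshole"}, {"class": "{excellent|good}"}))  # False
print(matches({"id": "content"}, {"id": "{con.+}"}))                 # True
```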

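2.2 BNF definition of Hpath

If you have never worked with compilers, feel free to skip this section.
If you have, it will be clear at a glance.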

hpath ::= hpart {"/" hpart}
hpart ::= ele_name [ "[" attrs "]" ]
attrs ::= pred_attrs [ "," get_attrs ]
pred_attrs ::= pred_attr { "," pred_attr }
get_attrs ::= get_attr { "," get_attr }
get_attr ::= "@"string [ "(" attr_alias ")"]
attr_alias ::= string
pred_attr ::= string "=" value
value ::= string | regex_value
regex_value ::= "{" string "}"
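
For readers who prefer code to grammar, the following is a rough sketch of a parser for expressions of this shape. It is not the parser shipped in lmth; parse_hpath, parse_hpart and split_outside are hypothetical names. The sketch is slightly more permissive than the BNF: it accepts brackets that hold only @ attributes and steps with no element name, since both appear in the examples of this article.

```python
import re

GET_ATTR = re.compile(r"@(?P<name>[^()\s]+)(?:\((?P<alias>[^)]*)\))?$")
HPART = re.compile(r"^([^\[\]]*)(?:\[(.*)\])?$")


def split_outside(text, sep):
    """Split text on sep, ignoring separators nested inside [...] or {...}."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch in "[{":
            depth += 1
        elif ch in "]}":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_hpart(part):
    """Parse one step, e.g. 'a[class={ass.+}, @class, @#(text)]'."""
    m = HPART.match(part.strip())
    if m is None:
        raise ValueError("malformed step: %r" % part)
    name, body = m.group(1).strip(), m.group(2)
    preds, gets = {}, []
    items = split_outside(body, ",") if body else []
    for item in items:
        item = item.strip()
        if item.startswith("@"):
            # get_attr: "@" name, optionally followed by "(alias)"
            g = GET_ATTR.match(item)
            if g is None:
                raise ValueError("malformed attribute: %r" % item)
            gets.append((g.group("name"), g.group("alias") or g.group("name")))
        else:
            # pred_attr: name "=" value, where value may be a {regex}
            key, sep, value = item.partition("=")
            if not sep:
                raise ValueError("malformed predicate: %r" % item)
            preds[key.strip()] = value.strip()
    return {"element": name, "preds": preds, "gets": gets}


def parse_hpath(expr):
    """Parse a whole Hpath into a list of steps."""
    return [parse_hpart(p) for p in split_outside(expr, "/")]


for step in parse_hpath("div[id={con.+}]/a[class={ass.+}, @class, @#(text)]"):
    print(step)
```

Each step comes back as a dict with the element name, the predicates to filter on, and the (attribute, alias) pairs to fetch, which is enough input for an evaluator to walk the document step by step.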

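3, Selecting elements

lmth1 provides a very simple API for retrieving HTML elements.
For convenience, enter the following code first:
from lmth1 import Url

This saves typing the lmth1 prefix, which looks a little odd :)

The examples here use the file at https://files.cnblogs.com/figure9/test.xml (its content is the same as the sample HTML above):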

3.1 Selecting a single element

Url('https://files.cnblogs.com/figure9/test.xml').elem('div')
Purpose: get the first div element from https://files.cnblogs.com/figure9/test.xml.
Result:
<div id="content" class="sites">
<a href="http://www.google.com/" class="good">Google</a>
<a href="http://www.yahoo.com/" class="good">Yahoo</a>
<a href="http://www.baidu.com/" class="asshole">Baidu</a>
<a href="http://www.bing.com/" class="excellent">Bing</a>
</div>

3.2 Selecting multiple elements

Url('https://files.cnblogs.com/figure9/test.xml').elems('li')
Purpose: get all li elements from https://files.cnblogs.com/figure9/test.xml.
Result:
[
     <li class="odd">1</li>,
     <li class="even">2</li>,
     <li class="odd">3</li>,
     <li class="even">4</li>,
     <li class="odd">5</li>,
     <li class="even">6</li>
]

3.3 Chained selection

Url('https://files.cnblogs.com/figure9/test.xml').elem('div').elem('a')
Purpose: get the first a element under the first div element from https://files.cnblogs.com/figure9/test.xml.
Tip: this is only to demonstrate chaining; a better choice is Url('https://files.cnblogs.com/figure9/test.xml').elem('div/a'), which has the same effect.
Result:
<a href="http://www.google.com/" class="good">Google</a>

Url('https://files.cnblogs.com/figure9/test.xml').elems('div')[-1].elems('li[class=odd]')[-1]
Purpose: from https://files.cnblogs.com/figure9/test.xml, get the last li element with class odd inside the last div element.
Tip: combining elems with Python list operations gives you a lot of expressive power.
Result:
<li class="odd">5</li>

Url('https://files.cnblogs.com/figure9/test.xml').elems('div')[-1].elems('li')[::2]
Purpose: from https://files.cnblogs.com/figure9/test.xml, get the odd-numbered li elements (1st, 3rd, 5th) of the last div element.
Tip: Don't forget the slices!
Result:
[
     <li class="odd">1</li>,
     <li class="odd">3</li>,
     <li class="odd">5</li>
]
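
The indexing and slicing used in these examples is ordinary Python list behaviour. As a quick reminder, here it is on a plain list that stands in for the six li elements (the strings are placeholders, not lmth1 objects):

```python
# Stand-in for the six li elements of the sample page (their inner text).
items = ["1", "2", "3", "4", "5", "6"]

last = items[-1]         # last element
odd_ones = items[::2]    # every second element from the first: 1st, 3rd, 5th
even_ones = items[1::2]  # every second element from the second: 2nd, 4th, 6th

print(last)       # 6
print(odd_ones)   # ['1', '3', '5']
print(even_ones)  # ['2', '4', '6']
```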

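4, Getting attributes

Sometimes we need to work with local HTML files, so I introduced a Path class in lmth for handling local files.

Please save the file at https://files.cnblogs.com/figure9/test.xml locally; here it is assumed to be saved as d:\test.xml.

As before, for convenience, enter the following code:
from lmth1 import Path
This saves typing the lmth1 prefix, which looks a little odd :)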

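4.1 Getting a single attribute

attr = Path(r'd:\test.xml').attr('li[class=even, @#(content)]')
Purpose: from d:\test.xml, get the content of the first li element whose class attribute is even, and name it content.
Tip: a leading @ marks an attribute to fetch, the parentheses hold its alias, and # stands for the element's content; the value can then be read directly by name.
Result: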
attr =>
{
     content:'2'
}

attr.content =>
'2'
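
Reading a value as attr.content rather than attr['content'] is a common Python idiom. One way to get that behaviour is a dict subclass that falls back to key lookup for attribute access. The sketch below is only an illustration of the idiom; the class name AttrDict is an assumption, and the original does not say how lmth1 implements its result objects.

```python
class AttrDict(dict):
    """A dict whose keys can also be read as attributes (illustrative only)."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so dict methods
        # such as .keys() keep working as usual.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


attr = AttrDict(content="2")
print(attr.content)     # 2
print(attr["content"])  # 2
```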

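4.2 Getting multiple attributes

attrs = Path(r'd:\test.xml').attrs('a[@href(link), @class(category), @#(title)]')
Purpose: from d:\test.xml, get the href attribute (aliased as link), the class attribute (aliased as category) and the content (aliased as title) of every a element.
Tip: a leading @ marks an attribute to fetch, the parentheses hold its alias, and # stands for the element's content; the values can then be read directly by name.
Result: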

attrs =>
[
     {
         category:u'good',
         link:u'http://www.google.com',
         title:u'Google'
     },
     {
         category:u'good',
         link:u'http://www.yahoo.com',
         title:u'Yahoo'
     },
     {
         category:u'asshole',
         link:u'http://www.baidu.com',
         title:u'Baidu'
     },
     {
         category:u'excellent',
         link:u'http://www.bing.com',
         title:u'Bing'
     }
]

attrs[-1] =>

{
     category:u'excellent',
     link:u'http://www.bing.com',
     title:u'Bing'
}

print attrs[2].title, 'is an', attrs[2].category =>

Baidu is an asshole

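4.3 Chained selection

The attr and attrs selectors can be applied after the elem and elems selectors. Note that no other selector can be applied after attr or attrs.

Path(r'd:\test.xml').elems('div[id=content]/a')[::2].attrs('[@href(link), @class(category)]')
Purpose: from d:\test.xml, get the href attribute (aliased as link) and the class attribute (aliased as category) of the odd-numbered a elements (1st, 3rd, ...) under the div element whose id attribute is content.
Tip: when an Hpath has no element name and consists only of the attributes to fetch, the attributes are taken from the current element.

Result: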
[
     {
         category:u'good',
         link:u'http://www.google.com'
     },
     {
         category:u'asshole',
         link:u'http://www.baidu.com'
     }
]
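
Calling .attrs(...) on the result of a slice only works if slicing returns a chainable object rather than a plain list. In Python 3, slicing an instance of a list subclass yields a plain list unless __getitem__ is overridden. The sketch below shows one way to keep the chain alive; the Elements class, its attrs method and the dict-based stand-in elements are all assumptions for illustration, not lmth1's code.

```python
class Elements(list):
    """A list of elements that stays chainable after slicing (illustrative)."""

    def __getitem__(self, index):
        result = super().__getitem__(index)
        # A slice of a list subclass comes back as a plain list; re-wrap it
        # so that further selector calls remain available.
        return Elements(result) if isinstance(index, slice) else result

    def attrs(self, *names):
        """Fetch the named attributes from every element (dicts in this sketch)."""
        return [{name: elem[name] for name in names} for elem in self]


# Stand-ins for the four a elements of the sample page.
links = Elements([
    {"href": "http://www.google.com/", "class": "good"},
    {"href": "http://www.yahoo.com/", "class": "good"},
    {"href": "http://www.baidu.com/", "class": "asshole"},
    {"href": "http://www.bing.com/", "class": "excellent"},
])

odd_links = links[::2].attrs("href")
print(odd_links)
```

Here links[::2] is still an Elements instance, so the call to attrs succeeds, while links[0] returns a single element as usual.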

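5, Other features

5.1 URL generation

Everyday HTML fetching often requires generating large numbers of URLs. Although this takes only a line or two in Python, to avoid needless repetition lmth1 provides a Urls class with two basic methods for creating Urls objects.
A Urls object holds a number of Url instances, and a selection can be run directly on each of them.

Run the following code first:
from lmth1 import Urls
Generating Urls in bulk from numbers:
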
Urls.from_indice('http://www.bing.com/page/', 1, 5)
Purpose: using http://www.bing.com/page/ as the prefix, generate Urls with suffixes from 1 to 5
Result:
http://www.bing.com/page/1
http://www.bing.com/page/2
http://www.bing.com/page/3
http://www.bing.com/page/4
http://www.bing.com/page/5

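Urls.from_indice('http://www.bing.com/page/', 1, 5, 3)
Purpose: using http://www.bing.com/page/ as the prefix, generate Urls with suffixes from 1 to 5, padded with zeros to a width of 3
Tip: the fourth argument sets the width of the number, which some sites require
Result:
http://www.bing.com/page/001
http://www.bing.com/page/002
http://www.bing.com/page/003
http://www.bing.com/page/004
http://www.bing.com/page/005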

Generating Urls in bulk from suffixes:
Urls.from_postfixes('http://www.baidu.com/', ['isfool', 'isasshole', 'ismoron'])
Purpose: using http://www.baidu.com/ as the prefix, iterate over the list of suffixes to generate Urls in bulk
Result:

http://www.baidu.com/isfool
http://www.baidu.com/isasshole
http://www.baidu.com/ismoron
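
As noted above, this kind of URL generation is a line or two of plain Python. The sketch below shows the string-building part only. The function names and the example.com prefix are assumptions, the inclusive upper bound is inferred from the examples in this section, and the functions return plain strings rather than Url objects.

```python
def urls_from_indices(prefix, start, end, width=0):
    """Build prefix + number for start..end (inclusive), zero-padded to width."""
    return [prefix + str(i).zfill(width) for i in range(start, end + 1)]


def urls_from_postfixes(prefix, postfixes):
    """Build prefix + postfix for every postfix in the list."""
    return [prefix + postfix for postfix in postfixes]


for url in urls_from_indices("http://example.com/page/", 1, 3, width=3):
    print(url)

for url in urls_from_postfixes("http://example.com/", ["a", "b"]):
    print(url)
```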

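5.2 Character encoding

lmth1 defaults to UTF-8, which suits the vast majority of sites. Some Chinese sites still come out garbled, however, so lmth1 lets you set the encoding manually. For garbled Chinese sites, switching the encoding to gb18030 can solve the problem.

Url('http://www.bing.com/', 'gb18030')
Purpose: create a Url object with gb18030 encoding
Tip: the default encoding is UTF-8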

Urls.from_indice('http://www.bing.com/page/', 1, 5, code_str='gb18030')
Purpose: create Url objects with gb18030 encoding in bulk
Tip: the default encoding is UTF-8
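
To see why the wrong encoding produces garbage, the sketch below encodes a short Chinese string as gb18030 and then decodes it both ways. It also shows a simple fallback helper that tries candidate encodings in order. That helper (decode_page) is a hypothetical convenience, not something lmth1 is stated to do; lmth1 expects the encoding to be set manually as shown above.

```python
def decode_page(raw, encodings=("utf-8", "gb18030")):
    """Try each encoding in turn and return the first clean decode.

    Hypothetical helper: falls back to a lossy decode with the first
    encoding if none of the candidates succeeds.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode(encodings[0], errors="replace")


raw = "中文网页".encode("gb18030")  # bytes as a gb18030 site would send them

garbled = raw.decode("utf-8", errors="replace")  # wrong codec: replacement characters
correct = raw.decode("gb18030")                  # right codec: original text

print(garbled)
print(correct)
print(decode_page(raw))
```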

6, References

1, BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
2, Martin Fowler: Domain-Specific Languages.
3, Internal DSL: http://en.wikipedia.org/wiki/Domain-specific_language
4, Fluent Interface: http://en.wikipedia.org/wiki/Fluent_interface

Source code download:
https://files.cnblogs.com/figure9/lmth1withBS.7z

Original article: https://www.cnblogs.com/figure9/p/2353299.html
