HtmlParser Design Analysis (1): The Interpreter Pattern


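There are already many detailed introductions to using HtmlParser. I recently read through the HtmlParser source code, and here I summarize its design to share and discuss. The focus is on design only.

I. Filter design

NodeFilter is HtmlParser's main mechanism for extracting nodes. Its structure is flexible: by composing interpreters, a client can locate any node on a page.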

1. First, a test case:

```java
/**
 * Test and filtering.
 */
public void testAnd () throws ParserException
{
    String guts;
    String html;
    NodeList list;

    // As printed here, the markup contains no <b> element, so
    // HasChildFilter(TagNameFilter("b")) cannot match; the fixture's <b> tags
    // were presumably lost when the article was rendered.
    guts = "<body>Now is the <a id=one>time</a> for all good <a id=two>men</a>..</body>";
    html = "<html>" + guts + "</html>";
    createParser (html);
    list = parser.extractAllNodesThatMatch (
        new AndFilter (
            new HasChildFilter (
                new TagNameFilter ("b")),
            new HasChildFilter (
                new StringFilter ("men")))
        );
    assertEquals ("only one element", 1, list.size ());
    assertType ("should be LinkTag", LinkTag.class, list.elementAt (0));
    LinkTag link = (LinkTag)list.elementAt (0);
    assertEquals ("attribute value", "two", link.getAttribute ("id"));
}
```
2. NodeFilter structure diagram

[Figure: class diagram of the NodeFilter hierarchy]

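3. The design pattern used

The NodeFilter interface decides whether a given node is one the client is looking for and returns a boolean. As the diagram above also shows, the interface has a single method:

 boolean accept (Node node); // takes a single argument of type Node

Here the HtmlParser author uses the Interpreter pattern to implement this filtering mechanism.

Let us first review the Interpreter pattern, then read the author's source code against it to appreciate how flexible the design is.

The Interpreter pattern defines a representation for a language's grammar and provides an interpreter along with it. Clients use the interpreter to interpret sentences in that language.

Some key points about the Interpreter pattern:

(1) Deciding when to apply the Interpreter pattern is the hard part. It is a good fit only when the business rules change frequently, similar patterns keep recurring, and the problem is easy to abstract into grammar rules.

(2) The pattern represents grammar rules as classes, so object-oriented techniques can be used to extend the grammar easily.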
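To make the two points above concrete, here is a minimal sketch of the pattern on its own, outside HtmlParser. Every name in it (Expr, Contains, And, Not, InterpreterSketch) is hypothetical and chosen only for illustration; the "language" is a toy one whose sentences are boolean tests on a string.

```java
/** Abstract expression: every grammar rule can be evaluated against a context. */
interface Expr {
    boolean interpret(String context);
}

/** Terminal expression: true when the context contains the given word. */
class Contains implements Expr {
    private final String word;
    Contains(String word) { this.word = word; }
    public boolean interpret(String context) { return context.contains(word); }
}

/** Non-terminal expression: logical AND of two sub-expressions. */
class And implements Expr {
    private final Expr left;
    private final Expr right;
    And(Expr left, Expr right) { this.left = left; this.right = right; }
    public boolean interpret(String context) {
        return left.interpret(context) && right.interpret(context);
    }
}

/** Non-terminal expression: logical NOT of one sub-expression. */
class Not implements Expr {
    private final Expr inner;
    Not(Expr inner) { this.inner = inner; }
    public boolean interpret(String context) { return !inner.interpret(context); }
}

class InterpreterSketch {
    public static void main(String[] args) {
        // The sentence: contains "good" AND NOT contains "bad"
        Expr rule = new And(new Contains("good"), new Not(new Contains("bad")));
        System.out.println(rule.interpret("all good men"));  // true
        System.out.println(rule.interpret("good and bad"));  // false
    }
}
```

Adding a new rule to this toy grammar means adding one more class that implements Expr; nothing that already exists has to change.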

4. Applying the Interpreter pattern in HtmlParser's NodeFilter

The abstract expression role:
```java
public interface NodeFilter extends Serializable, Cloneable {
    /**
     * Predicate to determine whether or not to keep the given node.
     * The behaviour based on this outcome is determined by the context
     * in which it is called. It may lead to the node being added to a list
     * or printed out. See the calling routine for details.
     * @return <code>true</code> if the node is to be kept, <code>false</code>
     * if it is to be discarded.
     * @param node The node to test.
     */
    boolean accept (Node node);
}
```
Next, look at the implementation of a logical AND. It combines two filters into a single boolean expression that is true only when both accept the node. The code is as follows:

```java
/**
 * Accepts nodes matching all of its predicate filters (AND operation).
 */
public class AndFilter implements NodeFilter {
    protected NodeFilter[] mPredicates;

    /**
     * Creates an AndFilter that accepts nodes acceptable to both filters.
     *
     * @param left One filter.
     * @param right The other filter.
     */
    public AndFilter(NodeFilter left, NodeFilter right) {
        NodeFilter[] predicates;

        predicates = new NodeFilter[2];
        predicates[0] = left;
        predicates[1] = right;
        setPredicates(predicates);
    }

    public void setPredicates(NodeFilter[] predicates) {
        if (null == predicates)
            predicates = new NodeFilter[0];
        mPredicates = predicates;
    }

    public boolean accept(Node node) {
        boolean ret;

        ret = true;

        // Stop at the first sub-filter that rejects the node.
        for (int i = 0; ret && (i < mPredicates.length); i++)
            if (!mPredicates[i].accept(node)) // delegate to each composed sub-filter (interpreter) in turn
                ret = false;

        return (ret);
    }
}
```
Now for the other filters used in the test case. The code for HasChildFilter is as follows:

```java
public class HasChildFilter implements NodeFilter {
    protected NodeFilter mChildFilter;

    protected boolean mRecursive;

    public HasChildFilter(NodeFilter filter) {
        this(filter, false);
    }

    public HasChildFilter(NodeFilter filter, boolean recursive) {
        mChildFilter = filter;
        mRecursive = recursive;
    }

    public boolean accept(Node node) {
        CompositeTag tag;
        NodeList children;
        boolean ret;

        ret = false;
        if (node instanceof CompositeTag) {
            tag = (CompositeTag) node;
            children = tag.getChildren();
            if (null != children) {
                for (int i = 0; !ret && i < children.size(); i++)
                    if (mChildFilter.accept(children.elementAt(i))) // does any direct child match?
                        ret = true;
                // do recursion after all children are checked
                // to get breadth first traversal
                if (!ret && mRecursive) // search the next level down
                    for (int i = 0; !ret && i < children.size(); i++)
                        if (accept(children.elementAt(i)))
                            ret = true;
            }
        }

        return (ret);
    }
}
```
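The effect of the recursive flag is easier to see on a tiny tree. The sketch below is not HtmlParser code: Elem and HasChildSketch are hypothetical stand-ins, and java.util.function.Predicate takes the place of NodeFilter. It only mirrors the two-pass structure of accept() above, where direct children are checked first and deeper levels are searched afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical stand-in for a parsed node: a name plus child nodes. */
class Elem {
    final String name;
    final List<Elem> children = new ArrayList<>();

    Elem(String name, Elem... kids) {
        this.name = name;
        for (Elem kid : kids) children.add(kid);
    }
}

class HasChildSketch {
    /** True if a child (or, when recursive, any descendant) satisfies childTest. */
    static boolean hasChild(Elem node, Predicate<Elem> childTest, boolean recursive) {
        // First pass: direct children only.
        for (Elem child : node.children)
            if (childTest.test(child)) return true;
        // Second pass: descend only after every direct child has been checked.
        if (recursive)
            for (Elem child : node.children)
                if (hasChild(child, childTest, true)) return true;
        return false;
    }

    public static void main(String[] args) {
        // Models <a><b>men</b></a>: "men" is a grandchild of <a>, not a direct child.
        Elem a = new Elem("a", new Elem("b", new Elem("men")));
        Predicate<Elem> isMen = e -> e.name.equals("men");
        System.out.println(hasChild(a, isMen, false)); // false
        System.out.println(hasChild(a, isMen, true));  // true
    }
}
```

With the single-argument constructor, HasChildFilter defaults to the non-recursive behaviour, so only direct children are considered.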
The code for TagNameFilter is as follows:

```java
import java.util.Locale;

public class TagNameFilter implements NodeFilter {
    protected String mName;

    public TagNameFilter(String name) {
        // Names are compared in upper case.
        mName = name.toUpperCase(Locale.ENGLISH);
    }

    public boolean accept(Node node) {
        return ((node instanceof Tag)
                && !((Tag) node).isEndTag()
                && ((Tag) node).getTagName().equals(mName));
    }
}
```
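NodeFilter's other 13 subclasses follow the same approach, each wrapping a different piece of matching logic. It is also very easy to add a subclass that implements a new "grammar" rule.

Client code can then assemble these interpreters freely and run them, which lets users write their own logic for finding any node in an HTML document.

How HtmlParser stores the HTML structure is not explored here; it is enough to know that it provides an iterator for walking the nodes. (Internally, HtmlParser scans the input character by character, maps it to Node objects, and records each character's position as column and row.)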
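As an illustration of how cheaply a new rule can be added and then composed by the client, here is a self-contained sketch. It does not use the HtmlParser library: SketchNode and SketchFilter are simplified stand-ins for Node and NodeFilter, and XorFilter and TextContainsFilter are hypothetical rules invented for this example.

```java
/** Simplified stand-in for HtmlParser's Node: only exposes its text. */
interface SketchNode {
    String getText();
}

/** Simplified stand-in for NodeFilter: the single accept() method. */
interface SketchFilter {
    boolean accept(SketchNode node);
}

/** Leaf rule: accepts nodes whose text contains the given substring. */
class TextContainsFilter implements SketchFilter {
    private final String needle;
    TextContainsFilter(String needle) { this.needle = needle; }
    public boolean accept(SketchNode node) { return node.getText().contains(needle); }
}

/** Hypothetical new rule: accepts a node when exactly one sub-filter accepts it. */
class XorFilter implements SketchFilter {
    private final SketchFilter left;
    private final SketchFilter right;
    XorFilter(SketchFilter left, SketchFilter right) {
        this.left = left;
        this.right = right;
    }
    public boolean accept(SketchNode node) {
        return left.accept(node) ^ right.accept(node);
    }
}

class XorFilterDemo {
    public static void main(String[] args) {
        // The client composes filters exactly as the test case composes AndFilter.
        SketchFilter rule = new XorFilter(
            new TextContainsFilter("time"),
            new TextContainsFilter("men"));
        SketchNode first = () -> "time";
        SketchNode second = () -> "time for all good men";
        System.out.println(rule.accept(first));  // true: only the first sub-filter matches
        System.out.println(rule.accept(second)); // false: both sub-filters match
    }
}
```

The new class touches nothing else in the hierarchy, which is the extensibility the pattern is chosen for.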

5. How the client call works in HtmlParser

Now look at extractAllNodesThatMatch() in the Parser class used by the test case.

Parser:
```java
public class Parser implements Serializable {
    // ... ....

    /**
     * Extract all nodes matching the given filter.
     */
    public NodeList extractAllNodesThatMatch (NodeFilter filter) throws ParserException {
        NodeIterator e;
        NodeList ret;

        ret = new NodeList ();
        // elements() returns a simple iterator over the nodes;
        // nested nodes are reached through collectInto
        for (e = elements (); e.hasMoreNodes (); )
            e.nextNode ().collectInto (ret, filter);

        return (ret);
    }
    // ... ...
}
```
AbstractNode:
```java
public abstract class AbstractNode implements Node, Serializable {
    // ... ...
    public void collectInto (NodeList list, NodeFilter filter) {
        if (filter.accept (this))
            list.add (this);
    }
    // ... ...
}
```
CompositeTag:
```java
public class CompositeTag extends TagNode { // TagNode extends AbstractNode, AbstractNode implements Node
    // ... ...
    public void collectInto (NodeList list, NodeFilter filter) {
        super.collectInto (list, filter); // AbstractNode.collectInto: test this node itself
        for (SimpleNodeIterator e = children(); e.hasMoreNodes ();) {
            // e.nextNode() returns a Node; calling collectInto() on it recurses
            // through all descendants, filtering each node and adding every
            // match to the result NodeList
            e.nextNode ().collectInto (list, filter);
        }
        if ((null != getEndTag ()) && (this != getEndTag ()))
            getEndTag ().collectInto (list, filter);
    }
    // ... ...
}
```
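The recursion in collectInto can be sketched without the library. In the example below, TreeNode and CollectIntoDemo are hypothetical stand-ins and Predicate replaces NodeFilter; the end-tag handling of the real CompositeTag is left out. What it keeps is the shape of the traversal: test the node itself, then hand the same list and filter to each child.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical stand-in for a node in the parsed tree. */
class TreeNode {
    final String name;
    final List<TreeNode> children = new ArrayList<>();

    TreeNode(String name, TreeNode... kids) {
        this.name = name;
        for (TreeNode kid : kids) children.add(kid);
    }

    /** Test this node first, then descend into each child (pre-order). */
    void collectInto(List<TreeNode> out, Predicate<TreeNode> filter) {
        if (filter.test(this)) out.add(this);
        for (TreeNode child : children) child.collectInto(out, filter);
    }
}

class CollectIntoDemo {
    public static void main(String[] args) {
        // html > body > (a, p > a)
        TreeNode root = new TreeNode("html",
            new TreeNode("body",
                new TreeNode("a"),
                new TreeNode("p", new TreeNode("a"))));
        List<TreeNode> hits = new ArrayList<>();
        root.collectInto(hits, n -> n.name.equals("a"));
        System.out.println(hits.size()); // 2
    }
}
```

Because the filter is passed down unchanged, any composed filter is applied to every node in the tree, not only to the top-level ones.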

Original article: https://www.cnblogs.com/wycg1984/p/1722395.html
