  • A Brief Look at HtmlParser

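      After crawling the pages you need with Heritrix, you still have to classify and otherwise process their content, and that is where htmlparser comes in. Using htmlparser is not all that easy, though: documentation is scarce, and many features have to be explored and discovered by the developer.

      That said, here is a fairly good site hosting the htmlparser API: http://tool.oschina.net/apidocs/apidoc?api=HTMLParser. The API documentation is in English, so readers whose English is weak will have to push themselves through it.

      The core of HTMLParser is the org.htmlparser.Parser class, which does the actual work of analyzing an HTML page. It has the following constructors: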

    public Parser ();
    public Parser (Lexer lexer, ParserFeedback fb);
    public Parser (URLConnection connection, ParserFeedback fb) throws ParserException;
    public Parser (String resource, ParserFeedback feedback) throws ParserException;
    public Parser (String resource) throws ParserException;
    public Parser (Lexer lexer);
    public Parser (URLConnection connection) throws ParserException;

    and one static factory method:

    public static Parser createParser (String html, String charset);

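      Most users initialize a Parser from a URLConnection or from a string holding the page content, or use the static function to create a Parser object. ParserFeedback's code is simple; it exists for debugging and tracing the parse and generally needs no changes. Using a Lexer is a relatively advanced topic, left for a later discussion.
      One interesting point: if you need to set the page encoding and do not use a Lexer, the static function is the only one of these entry points that lets you do so. For most Chinese pages, this is probably the one used most.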
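
      Why the encoding matters can be seen without HTMLParser at all. The following is a minimal sketch using only the JDK; the class name CharsetDemo and its decode helper are made up for illustration and are not part of HTMLParser. It decodes the same page bytes once with the right charset and once with the wrong one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class (not part of HTMLParser): shows what happens when
// page bytes are decoded with the wrong charset.
class CharsetDemo {
    // Decode raw page bytes using the given charset.
    static String decode(byte[] pageBytes, Charset charset) {
        return new String(pageBytes, charset);
    }

    public static void main(String[] args) {
        String title = "\u767e\u5ea6"; // the title of the sample page in this article
        byte[] bytes = title.getBytes(StandardCharsets.UTF_8);

        // Correct charset: the original text comes back.
        System.out.println(decode(bytes, StandardCharsets.UTF_8).equals(title)); // true
        // Wrong charset: each of the 6 bytes becomes one Latin-1 character.
        System.out.println(decode(bytes, StandardCharsets.ISO_8859_1).length()); // 6
    }
}
```

      A parser that guesses the wrong charset ends up in the second situation, which is why being able to state the charset explicitly is useful for Chinese pages.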

    Below is an example of initializing a Parser by opening a page's URL (the OpenFile method in the middle is for reading a local HTML file).

    [Page loaded: index.html]

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html>
        <head>
            <meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/>
            <title>百度</title>
            <link href = "a_1.css" rel = "stylesheet" type = "text/css"/>
        </head>
        <body>
            <div  align = "center" class = "photo" >
                <img src = "../image/baidu.PNG" >
            </div>
            <div align = "center" class = "body">
                <table cellpadding="8">
                    <td>
                        <a href = "#" target = _blank title = "欢迎来到&#10百度网站">新闻</a>
                    </td>
                    <td>
                        <font color = "black">网页</font>
                    </td>
                    <td>
                        <a href = "#" target = _blank title = "欢迎来到&#10百度网站">贴吧</a>
                    </td>
                    <td>
                        <a href = "#" target = _blank title = "欢迎来到&#10百度网站">知道</a>
                    </td>
                    <td>
                        <a href = "#" target = _blank title = "欢迎来到&#10百度网站">音乐</a>
                    </td>
                    <td>
                        <a href = "#" target = _blank title = "欢迎来到&#10百度网站">图片</a>
                    </td>
                    <td>
                        <a href = "#" target = _blank title = "欢迎来到&#10百度网站">视频</a>
                    </td>
                    <td>
                        <a href = "#" target = _blank title = "欢迎来到&#10百度网站">地图</a>
                    </td>
                </table>
                <input class = "input" >
            </div>
        </body>
    
    </html>

    [Source: htmlparser_1.java]

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import org.htmlparser.Parser;
    import org.htmlparser.visitors.TextExtractingVisitor;

    public class Main {
        private static final String ENCODE = "GBK";

        /*
         * Print a message: encode it as ENCODE bytes, then decode those bytes
         * with the platform default encoding (file.encoding).
         */
        private static void message(String msg) {
            try {
                System.out.println(new String(msg.getBytes(ENCODE), System
                        .getProperty("file.encoding")));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        /*
         * Read a local file and return its content ("" on failure).
         */
        public static String OpenFile(String FileName) {
            StringBuilder mContent = new StringBuilder();
            try (BufferedReader mBufferedReader = new BufferedReader(
                    new InputStreamReader(new FileInputStream(FileName), ENCODE))) {
                String mTemp;
                while ((mTemp = mBufferedReader.readLine()) != null) {
                    mContent.append(mTemp).append("\n");
                }
            } catch (Exception e) {
                e.printStackTrace();
                return "";
            }
            // Return the accumulated file content.
            return mContent.toString();
        }

        /*
         * main method
         */
        public static void main(String[] args) {
            // String mContent = OpenFile("");
            try {
                // Parse the page served at the given URL.
                Parser mParser = new Parser((HttpURLConnection) (new URL(
                        "http://127.0.0.1/HtmlParser/index.html")).openConnection());
                // Visit every node and collect the text of the page.
                TextExtractingVisitor mExtractingVisitor = new TextExtractingVisitor();
                mParser.visitAllNodesWith(mExtractingVisitor);
                String textInPage = mExtractingVisitor.getExtractedText();
                message(textInPage);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

    }

    Test output:

     1     
     2         
     3         百度
     4         
     5     
     6     
     7         
     8             
     9         
    10         
    11             
    12                 
    13                     新闻
    14                 
    15                 
    16                     网页
    17                 
    18                 
    19                     贴吧
    20                 
    21                 
    22                     知道
    23                 
    24                 
    25                     音乐
    26                 
    27                 
    28                     图片
    29                 
    30                 
    31                     视频
    32                 
    33                 
    34                     地图
    35                 
    36             
    37             
    38         
    39     

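     HTMLParser stores the parsed information as a tree. Node is the basic data type that holds it.

    Here is the definition of Node:
    public interface Node extends Cloneable;

    Node's methods fall into several groups.

    Methods for traversing the tree, which are the easiest to understand:

    Node getParent (): returns the parent node
    NodeList getChildren (): returns the list of child nodes
    Node getFirstChild (): returns the first child node
    Node getLastChild (): returns the last child node
    Node getPreviousSibling (): returns the previous sibling node
    Node getNextSibling (): returns the next sibling node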
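
    How these navigation methods relate to each other can be sketched with a tiny tree model. The code below is only an illustration using the JDK: MiniNode and TreeNavigationDemo are hypothetical stand-ins and not the real org.htmlparser.Node, although the method names are chosen to mirror the list above.

```java
import java.util.ArrayList;
import java.util.List;

class TreeNavigationDemo {
    // Build html -> (head, body) and return the root.
    static MiniNode sampleTree() {
        return new MiniNode("html").add(new MiniNode("head")).add(new MiniNode("body"));
    }

    public static void main(String[] args) {
        MiniNode html = sampleTree();
        MiniNode head = html.getFirstChild();
        System.out.println(head.name);                            // head
        System.out.println(head.getNextSibling().name);           // body
        System.out.println(html.getLastChild().getParent().name); // html
    }
}

// Simplified stand-in for a parsed node; NOT the real org.htmlparser.Node.
class MiniNode {
    final String name;
    private MiniNode parent;
    private final List<MiniNode> children = new ArrayList<>();

    MiniNode(String name) { this.name = name; }

    // Attach a child and return this node so calls can be chained.
    MiniNode add(MiniNode child) {
        child.parent = this;
        children.add(child);
        return this;
    }

    MiniNode getParent() { return parent; }
    List<MiniNode> getChildren() { return children; }
    MiniNode getFirstChild() { return children.isEmpty() ? null : children.get(0); }
    MiniNode getLastChild() { return children.isEmpty() ? null : children.get(children.size() - 1); }

    // Siblings are found through the parent's child list.
    MiniNode getNextSibling() { return sibling(+1); }
    MiniNode getPreviousSibling() { return sibling(-1); }

    private MiniNode sibling(int offset) {
        if (parent == null) return null;
        int i = parent.children.indexOf(this) + offset;
        return (i >= 0 && i < parent.children.size()) ? parent.children.get(i) : null;
    }
}
```

    In this model a node without a parent, or at the end of its parent's child list, simply returns null for the missing neighbor.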

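     Methods for getting a Node's content:

    String getText (): returns the text
    String toPlainTextString(): returns the plain-text content
    String toHtml (): returns the HTML (the original HTML)
    String toHtml (boolean verbatim): returns the HTML (the original HTML)
    String toString (): returns a string representation of the node (its type, position, and content, as in the toString output further down)
    Page getPage (): returns the Page object this Node belongs to
    int getStartPosition (): returns the start position of this Node in the HTML page
    int getEndPosition (): returns the end position of this Node in the HTML page

    Method used for Filter-based filtering:

    void collectInto (NodeList list, NodeFilter filter): filters this node against the filter's condition and puts the matching nodes into list.

     Method used for Visitor-based traversal:

    void accept (NodeVisitor visitor): applies visitor to this Node

    Methods for modifying content, which are used less often:

    void setPage (Page page): sets the Page object this Node belongs to
    void setText (String text): sets the text
    void setChildren (NodeList children): sets the list of child nodes

    Other methods:

    void doSemanticAction (): performs the action associated with this Node (only a few Tags have one)
    Object clone (): clones the node (Node extends Cloneable).

     In practice HTMLParser is mostly used to process HTML pages, so the Filter- or Visitor-related functions are indispensable, and the first and second groups are the ones used most. The first group is easy to understand; the example below illustrates the second.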

    [Source: htmlparser_2.java]

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import org.htmlparser.Node;
    import org.htmlparser.Parser;
    import org.htmlparser.util.NodeIterator;

    public class Main {
        private static final String ENCODE = "utf-8";

        /*
         * Print a message: encode it as ENCODE bytes, then decode those bytes
         * with the platform default encoding (file.encoding).
         */
        private static void message(String msg) {
            try {
                System.out.println(new String(msg.getBytes(ENCODE), System
                        .getProperty("file.encoding")));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        /*
         * Read a local file and return its content ("" on failure).
         */
        public static String OpenFile(String FileName) {
            StringBuilder mContent = new StringBuilder();
            try (BufferedReader mBufferedReader = new BufferedReader(
                    new InputStreamReader(new FileInputStream(FileName), ENCODE))) {
                String mTemp;
                while ((mTemp = mBufferedReader.readLine()) != null) {
                    mContent.append(mTemp).append("\n");
                }
            } catch (Exception e) {
                e.printStackTrace();
                return "";
            }
            // Return the accumulated file content.
            return mContent.toString();
        }

        /*
         * main method
         */
        public static void main(String[] args) {
            // String mContent = OpenFile("");
            try {
                Parser mParser = new Parser((HttpURLConnection) (new URL(
                        "http://127.0.0.1/HtmlParser/index.html")).openConnection());

                // Iterate over the top-level nodes and print each content accessor.
                for (NodeIterator i = mParser.elements(); i.hasMoreNodes();) {
                    Node node = i.nextNode();
                    message("getText:" + node.getText());
                    message("getPlainText:" + node.toPlainTextString());
                    message("toHtml:" + node.toHtml());
                    message("toHtml(true):" + node.toHtml(true));
                    message("tohtml(false):" + node.toHtml(false));
                    message("toString:" + node.toString());
                    message("==============================");
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    Test output:

      1 getText:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
      2 getPlainText:
      3 toHtml:<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
      4 toHtml(true):<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
      5 tohtml(false):<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
      6 toString:Doctype Tag : !DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd; begins at : 0; ends at : 121
      7 ==============================
      8 getText:
      9 
     10 getPlainText:
     11 
     12 toHtml:
     13 
     14 toHtml(true):
     15 
     16 tohtml(false):
     17 
     18 toString:Txt (121[0,121],123[1,0]): 
    
     19 ==============================
     20 getText:html
     21 getPlainText:
     22     
     23         
     24         百度
     25         
     26     
     27     
     28         
     29             
     30         
     31         
     32             
     33                 
     34                     新闻
     35                 
     36                 
     37                     网页
     38                 
     39                 
     40                     贴吧
     41                 
     42                 
     43                     知道
     44                 
     45                 
     46                     音乐
     47                 
     48                 
     49                     图片
     50                 
     51                 
     52                     视频
     53                 
     54                 
     55                     地图
     56                 
     57             
     58             
     59         
     60     
     61 
     62 
     63 toHtml:<html>
     64     <head>
     65         <meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/>
     66         <title>百度</title>
     67         <link href = "a_1.css" rel = "stylesheet" type = "text/css"/>
     68     </head>
     69     <body>
     70         <div  align = "center" class = "photo" >
     71             <img src = "../image/baidu.PNG" >
     72         </div>
     73         <div align = "center" class = "body">
     74             <table cellpadding="8">
     75                 <td>
     76                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">新闻</a>
     77                 </td>
     78                 <td>
     79                     <font color = "black">网页</font>
     80                 </td>
     81                 <td>
     82                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">贴吧</a>
     83                 </td>
     84                 <td>
     85                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">知道</a>
     86                 </td>
     87                 <td>
     88                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">音乐</a>
     89                 </td>
     90                 <td>
     91                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">图片</a>
     92                 </td>
     93                 <td>
     94                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">视频</a>
     95                 </td>
     96                 <td>
     97                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">地图</a>
     98                 </td>
     99             </table>
    100             <input class = "input" >
    101         </div>
    102     </body>
    103 
    104 </html>
    105 toHtml(true):<html>
    106     <head>
    107         <meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/>
    108         <title>百度</title>
    109         <link href = "a_1.css" rel = "stylesheet" type = "text/css"/>
    110     </head>
    111     <body>
    112         <div  align = "center" class = "photo" >
    113             <img src = "../image/baidu.PNG" >
    114         </div>
    115         <div align = "center" class = "body">
    116             <table cellpadding="8">
    117                 <td>
    118                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">新闻</a>
    119                 </td>
    120                 <td>
    121                     <font color = "black">网页</font>
    122                 </td>
    123                 <td>
    124                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">贴吧</a>
    125                 </td>
    126                 <td>
    127                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">知道</a>
    128                 </td>
    129                 <td>
    130                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">音乐</a>
    131                 </td>
    132                 <td>
    133                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">图片</a>
    134                 </td>
    135                 <td>
    136                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">视频</a>
    137                 </td>
    138                 <td>
    139                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">地图</a>
    140                 </td>
    141             </table>
    142             <input class = "input" >
    143         </div>
    144     </body>
    145 
    146 </html>
    147 tohtml(false):<html>
    148     <head>
    149         <meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/>
    150         <title>百度</title>
    151         <link href = "a_1.css" rel = "stylesheet" type = "text/css"/>
    152     </head>
    153     <body>
    154         <div  align = "center" class = "photo" >
    155             <img src = "../image/baidu.PNG" >
    156         </div>
    157         <div align = "center" class = "body">
    158             <table cellpadding="8">
    159                 <td>
    160                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">新闻</a>
    161                 </td>
    162                 <td>
    163                     <font color = "black">网页</font>
    164                 </td>
    165                 <td>
    166                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">贴吧</a>
    167                 </td>
    168                 <td>
    169                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">知道</a>
    170                 </td>
    171                 <td>
    172                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">音乐</a>
    173                 </td>
    174                 <td>
    175                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">图片</a>
    176                 </td>
    177                 <td>
    178                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">视频</a>
    179                 </td>
    180                 <td>
    181                     <a href = "#" target = _blank title = "欢迎来到&#10百度网站">地图</a>
    182                 </td>
    183             </table>
    184             <input class = "input" >
    185         </div>
    186     </body>
    187 
    188 </html>
    189 toString:Tag (123[1,0],129[1,6]): html
    190   Txt (129[1,6],132[2,1]): 
    	
    191   Tag (132[2,1],138[2,7]): head
    192     Txt (138[2,7],142[3,2]): 
    		
    193     Tag (142[3,2],216[3,76]): meta http-equiv = "Content-Type" content = "text/ht...
    194     Txt (216[3,76],220[4,2]): 
    		
    195     Tag (220[4,2],227[4,9]): title
    196       Txt (227[4,9],229[4,11]): 百度
    197       End (229[4,11],237[4,19]): /title
    198     Txt (237[4,19],241[5,2]): 
    		
    199     Tag (241[5,2],302[5,63]): link href = "a_1.css" rel = "stylesheet" type = "te...
    200     Txt (302[5,63],305[6,1]): 
    	
    201     End (305[6,1],312[6,8]): /head
    202   Txt (312[6,8],315[7,1]): 
    	
    203   Tag (315[7,1],321[7,7]): body
    204     Txt (321[7,7],325[8,2]): 
    		
    205     Tag (325[8,2],365[8,42]): div  align = "center" class = "photo" 
    206       Txt (365[8,42],370[9,3]): 
    			
    207       Tag (370[9,3],403[9,36]): img src = "../image/baidu.PNG" 
    208       Txt (403[9,36],407[10,2]): 
    		
    209       End (407[10,2],413[10,8]): /div
    210     Txt (413[10,8],417[11,2]): 
    		
    211     Tag (417[11,2],454[11,39]): div align = "center" class = "body"
    212       Txt (454[11,39],459[12,3]): 
    			
    213       Tag (459[12,3],482[12,26]): table cellpadding="8"
    214         Txt (482[12,26],488[13,4]): 
    				
    215         Tag (488[13,4],492[13,8]): td
    216           Txt (492[13,8],499[14,5]): 
    					
    217           Tag (499[14,5],552[14,58]): a href = "#" target = _blank title = "欢迎来到&#10百度网站"
    218             Txt (552[14,58],554[14,60]): 新闻
    219             End (554[14,60],558[14,64]): /a
    220           Txt (558[14,64],564[15,4]): 
    				
    221           End (564[15,4],569[15,9]): /td
    222         Txt (569[15,9],575[16,4]): 
    				
    223         Tag (575[16,4],579[16,8]): td
    224           Txt (579[16,8],586[17,5]): 
    					
    225           Tag (586[17,5],608[17,27]): font color = "black"
    226           Txt (608[17,27],610[17,29]): 网页
    227           End (610[17,29],617[17,36]): /font
    228           Txt (617[17,36],623[18,4]): 
    				
    229           End (623[18,4],628[18,9]): /td
    230         Txt (628[18,9],634[19,4]): 
    				
    231         Tag (634[19,4],638[19,8]): td
    232           Txt (638[19,8],645[20,5]): 
    					
    233           Tag (645[20,5],698[20,58]): a href = "#" target = _blank title = "欢迎来到&#10百度网站"
    234             Txt (698[20,58],700[20,60]): 贴吧
    235             End (700[20,60],704[20,64]): /a
    236           Txt (704[20,64],710[21,4]): 
    				
    237           End (710[21,4],715[21,9]): /td
    238         Txt (715[21,9],721[22,4]): 
    				
    239         Tag (721[22,4],725[22,8]): td
    240           Txt (725[22,8],732[23,5]): 
    					
    241           Tag (732[23,5],785[23,58]): a href = "#" target = _blank title = "欢迎来到&#10百度网站"
    242             Txt (785[23,58],787[23,60]): 知道
    243             End (787[23,60],791[23,64]): /a
    244           Txt (791[23,64],797[24,4]): 
    				
    245           End (797[24,4],802[24,9]): /td
    246         Txt (802[24,9],808[25,4]): 
    				
    247         Tag (808[25,4],812[25,8]): td
    248           Txt (812[25,8],819[26,5]): 
    					
    249           Tag (819[26,5],872[26,58]): a href = "#" target = _blank title = "欢迎来到&#10百度网站"
    250             Txt (872[26,58],874[26,60]): 音乐
    251             End (874[26,60],878[26,64]): /a
    252           Txt (878[26,64],884[27,4]): 
    				
    253           End (884[27,4],889[27,9]): /td
    254         Txt (889[27,9],895[28,4]): 
    				
    255         Tag (895[28,4],899[28,8]): td
    256           Txt (899[28,8],906[29,5]): 
    					
    257           Tag (906[29,5],959[29,58]): a href = "#" target = _blank title = "欢迎来到&#10百度网站"
    258             Txt (959[29,58],961[29,60]): 图片
    259             End (961[29,60],965[29,64]): /a
    260           Txt (965[29,64],971[30,4]): 
    				
    261           End (971[30,4],976[30,9]): /td
    262         Txt (976[30,9],982[31,4]): 
    				
    263         Tag (982[31,4],986[31,8]): td
    264           Txt (986[31,8],993[32,5]): 
    					
    265           Tag (993[32,5],1046[32,58]): a href = "#" target = _blank title = "欢迎来到&#10百度网站"
    266             Txt (1046[32,58],1048[32,60]): 视频
    267             End (1048[32,60],1052[32,64]): /a
    268           Txt (1052[32,64],1058[33,4]): 
    				
    269           End (1058[33,4],1063[33,9]): /td
    270         Txt (1063[33,9],1069[34,4]): 
    				
    271         Tag (1069[34,4],1073[34,8]): td
    272           Txt (1073[34,8],1080[35,5]): 
    					
    273           Tag (1080[35,5],1133[35,58]): a href = "#" target = _blank title = "欢迎来到&#10百...
    274             Txt (1133[35,58],1135[35,60]): 地图
    275             End (1135[35,60],1139[35,64]): /a
    276           Txt (1139[35,64],1145[36,4]): 
    				
    277           End (1145[36,4],1150[36,9]): /td
    278         Txt (1150[36,9],1155[37,3]): 
    			
    279         End (1155[37,3],1163[37,11]): /table
    280       Txt (1163[37,11],1168[38,3]): 
    			
    281       Tag (1168[38,3],1192[38,27]): input class = "input" 
    282       Txt (1192[38,27],1196[39,2]): 
    		
    283       End (1196[39,2],1202[39,8]): /div
    284     Txt (1202[39,8],1205[40,1]): 
    	
    285     End (1205[40,1],1212[40,8]): /body
    286   Txt (1212[40,8],1216[42,0]): 
    
    
    287   End (1216[42,0],1223[42,7]): /html
    288 
    289 ==============================

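       The first Node corresponds to the first line, <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">. The output also reveals the tree structure of the content, or rather a forest: each top-level item of the Page, such as DOCTYPE and html, forms its own top-level Node. (The second Node may look odd. It is simply the line break between the DOCTYPE and <html>: HTMLParser turns the line breaks, spaces, and tabs in the page content into text nodes of their own, which is why such a Node appears. Small in content but high in rank.)

      toPlainTextString includes everything the user can see. Two things are interesting. First, the title inside <head> is part of the plain text; presumably being visible in the title bar counts as visible. Second, as noted above, the line breaks and similar whitespace in the HTML also end up in the plain text, which seems logically questionable.

      You may also have noticed that toHtml, toHtml(true), and toHtml(false) produce the same result. That is indeed so. Tracing the HTMLParser code shows that AbstractNode, which implements Node, implements toHtml() by calling toHtml(false) directly, and that none of the toHtml(boolean verbatim) implementations in AbstractNode's three subclasses RemarkNode, TagNode, and TextNode handles the verbatim parameter, so all three calls return identical results. Unless you need special handling of your own, plain toHtml is enough.

    [Figure: class hierarchy of HTMLParser's Node types (copied from another article)]

    The page is organized as a forest of three trees, of which the tree rooted at the <html> tag is the tallest. [Figure: tree structure of the page]

      Note in particular that within the html tree HTMLParser treats every line break as a node.

      AbstractNode is the direct implementation of Node and is itself abstract. It has three direct subclasses. RemarkNode holds comments; in toString output it appears in the form "Rem (345[6,2],356[6,13]): 这是注释" (the comment text means "this is a comment"; the sample page above contains no comments, so no such line appears in its output). TextNode is just as simple: it holds the text the user sees. TagNode is the most complex: it covers all the tags of HTML and can be extended (to let HTMLParser handle custom tags). TagNodes come in two kinds. A simple Tag cannot contain other tags and can only be a leaf node. A CompositeTag can contain other tags and is a branch node.

      After parsing, HTMLParser holds the page content as a tree (forest). There are two ways to access the result: with a Filter or with a Visitor.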

    (1) Filter classes
      As the name suggests, a Filter filters the result to extract what you need. HTMLParser defines 16 Filters in the org.htmlparser.filters package, which fall into a few groups.
      Predicate Filters:

    TagNameFilter
    HasAttributeFilter
    HasChildFilter
    HasParentFilter
    HasSiblingFilter
    IsEqualFilter

       Logical Filters:

    AndFilter
    NotFilter
    OrFilter
    XorFilter

      Other Filters:

    NodeClassFilter
    StringFilter
    LinkStringFilter
    LinkRegexFilter
    RegexFilter
    CssSelectorNodeFilter

     All Filter classes implement the org.htmlparser.NodeFilter interface, which has a single main function: boolean accept (Node node);

    Each implementation uses this function to decide whether the given Node meets the Filter's condition, returning true if it does and false otherwise.
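
    The accept-based design can be sketched in a few lines of plain Java. This is an illustration only: Tag, TagFilter, and collect are hypothetical stand-ins written for this sketch, not HTMLParser's own classes, and the sample tree is a cut-down version of the page used in this article.

```java
import java.util.ArrayList;
import java.util.List;

class FilterSketch {
    // Simplified stand-in for a parsed tag; NOT the real org.htmlparser.Node.
    static final class Tag {
        final String name;
        final List<Tag> children = new ArrayList<>();
        Tag(String name, Tag... kids) {
            this.name = name;
            for (Tag k : kids) children.add(k);
        }
    }

    // Mirrors the shape of NodeFilter: one method deciding whether a node is kept.
    interface TagFilter {
        boolean accept(Tag tag);
    }

    // Walk the tree depth-first and gather every tag the filter accepts.
    static List<Tag> collect(Tag root, TagFilter filter) {
        List<Tag> out = new ArrayList<>();
        collectInto(root, filter, out);
        return out;
    }

    private static void collectInto(Tag tag, TagFilter filter, List<Tag> out) {
        if (filter.accept(tag)) out.add(tag);
        for (Tag child : tag.children) collectInto(child, filter, out);
    }

    // Cut-down sample page: html > (head, body > (div, div)).
    static Tag samplePage() {
        return new Tag("html", new Tag("head"),
                new Tag("body", new Tag("div"), new Tag("div")));
    }

    public static void main(String[] args) {
        // A name-based filter that compares tag names ignoring case.
        TagFilter divFilter = tag -> tag.name.equalsIgnoreCase("DIV");
        System.out.println(collect(samplePage(), divFilter).size()); // 2
    }
}
```

    Because the traversal only ever calls accept, any new condition can be added by writing another implementation of the one-method interface, without touching the tree-walking code.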

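    (2) Predicate Filters
      2.1 TagNameFilter

      TagNameFilter is the easiest Filter to understand: it filters by tag name.

     [Source: htmlparser_3.java] (only the main method is shown here; the rest of the code is the same as above)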

    // Additional imports needed: org.htmlparser.Node, org.htmlparser.NodeFilter,
    // org.htmlparser.filters.TagNameFilter, org.htmlparser.util.NodeList

    /*
     * main method
     */
    public static void main(String[] args) {
        // String mContent = OpenFile("");
        try {
            Parser mParser = new Parser((HttpURLConnection) (new URL(
                    "http://127.0.0.1/HtmlParser/index.html")).openConnection());

            // Keep only the nodes whose tag name is DIV.
            NodeFilter mNodeFilter = new TagNameFilter("DIV");
            NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);
            if (mNodeList != null) {
                for (int i = 0; i < mNodeList.size(); i++) {
                    Node textNode = mNodeList.elementAt(i);
                    message("getText:" + textNode.getText());
                    message("===================================");
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

     Test output:

    getText:div  align = "center" class = "photo" 
    ===================================
    getText:div align = "center" class = "body"
    ===================================

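    As you can see, both div nodes in the file were extracted. You can now operate on these two DIV nodes.

      2.2 HasChildFilter

      Now for HasChildFilter. When I first saw this Filter, I assumed it returned the tags that have children, and simply initialized one:
      NodeFilter filter = new HasChildFilter();
      But the call NodeList nodes = parser.extractAllNodesThatMatch(filter); threw a NullPointerException inside HasChildFilter. Reading the HasChildFilter code showed that it actually returns nodes that have a child matching a condition, and that it needs another Filter as the parameter used to test the children. The default constructor does create an object, but because the child Filter is null, an exception occurs as soon as it is used. In this respect HTMLParser's code still has room for improvement.

     Revised code: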

    /*
     * main method (also needs org.htmlparser.filters.HasChildFilter)
     */
    public static void main(String[] args) {
        // String mContent = OpenFile("");
        try {
            Parser mParser = new Parser((HttpURLConnection) (new URL(
                    "http://127.0.0.1/HtmlParser/index.html")).openConnection());
            // Match nodes that have a DIV among their direct children.
            NodeFilter mInnerFilter = new TagNameFilter("DIV");
            NodeFilter mNodeFilter = new HasChildFilter(mInnerFilter);
            NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);
            if (mNodeList != null) {
                for (int i = 0; i < mNodeList.size(); i++) {
                    Node textNode = mNodeList.elementAt(i);
                    message("getText:" + textNode.getText());
                    message("===================================");
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

     Test output:

    getText:body
    ===================================

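     Here the output is the Tag node that contains a DIV child tag (body has the child node DIV "<div  align = "center" class = "photo" >").

    Note that HasChildFilter has another constructor: public HasChildFilter (NodeFilter filter, boolean recursive)

    If recursive is false, only the first-level children are tested. In the example above, body has DIV nodes among its first-level children, so it matched. If we instead call it like this:

    NodeFilter filter = new HasChildFilter( innerFilter, true );

     Test output:

    getText:html
    ===================================
    getText:body
    ===================================

     The output now contains an extra html, which is the node for the whole HTML page (the root). It has no DIV node directly beneath it, but its child body does, so it is matched as well.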
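
    The difference between the two modes can be reproduced with a small stand-alone sketch. It is an illustration under simplifying assumptions, not HTMLParser's implementation: Tag, hasChild, and matches are hypothetical names, the inner condition is reduced to a tag-name comparison, and the tree is a cut-down version of the sample page.

```java
import java.util.ArrayList;
import java.util.List;

class HasChildSketch {
    // Simplified stand-in for a parsed tag; NOT the real HTMLParser classes.
    static final class Tag {
        final String name;
        final List<Tag> children = new ArrayList<>();
        Tag(String name, Tag... kids) {
            this.name = name;
            for (Tag k : kids) children.add(k);
        }
    }

    // True if some direct child (or, when recursive, some deeper descendant)
    // is named childName.
    static boolean hasChild(Tag tag, String childName, boolean recursive) {
        for (Tag child : tag.children) {
            if (child.name.equals(childName)) return true;
            if (recursive && hasChild(child, childName, true)) return true;
        }
        return false;
    }

    // Names of all tags in the tree, in document order, that pass the test.
    static List<String> matches(Tag root, String childName, boolean recursive) {
        List<String> out = new ArrayList<>();
        walk(root, childName, recursive, out);
        return out;
    }

    private static void walk(Tag tag, String childName, boolean recursive, List<String> out) {
        if (hasChild(tag, childName, recursive)) out.add(tag.name);
        for (Tag child : tag.children) walk(child, childName, recursive, out);
    }

    // Cut-down sample page: html > (head, body > (div, div)).
    static Tag samplePage() {
        return new Tag("html", new Tag("head"),
                new Tag("body", new Tag("div"), new Tag("div")));
    }

    public static void main(String[] args) {
        System.out.println(matches(samplePage(), "div", false)); // [body]
        System.out.println(matches(samplePage(), "div", true));  // [html, body]
    }
}
```

    With recursive set to false only body qualifies; with true, html qualifies too because the search descends through body.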

      2.3 HasAttributeFilter

      HasAttributeFilter有3个构造函数:
      public HasAttributeFilter ();
      public HasAttributeFilter (String attribute);
      public HasAttributeFilter (String attribute, String value);
      这个Filter可以匹配出包含制定名字的属性,或者制定属性为指定值的节点。还是用例子说明比较容易。

     调用方法1:

    1             NodeFilter mNodeFilter = new HasAttributeFilter();
    2             NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);

     输出结果:

    什么也没有输出

    调用方法2:

    1             NodeFilter mNodeFilter = new HasAttributeFilter("class");
    2             NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);

     输出结果:

    1 getText:div  align = "center" class = "photo" 
    2 ===================================
    3 getText:div align = "center" class = "body"
    4 ===================================
    5 getText:input class = "input" 
    6 ===================================
    View Code

     调用方法3:

    1             NodeFilter mNodeFilter = new HasAttributeFilter("class","photo");
    2             NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);

     Output:

    1 getText:div  align = "center" class = "photo" 
    2 ===================================

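      2.4 Other predicate Filters
      HasParentFilter and HasSiblingFilter work much like HasChildFilter; trying them out yourself should make their behavior clear.

      The constructor of IsEqualFilter takes a Node:
      public IsEqualFilter (Node node) {
        mNode = node;
      }
      The accept function is just as simple:
      public boolean accept (Node node) {
        return (mNode == node);
      }
      It compares by reference, so it matches only that exact Node object.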

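    (3) Logical Filters

      The Filters introduced so far are simple ones that each test a single kind of condition. HTMLParser lets you combine simple Filters to express complex conditions, on the same principle as the logical operators of ordinary programming languages.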
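The composition principle can be sketched without the library. Everything below is hypothetical illustration code: ToyFilter, hasAttr, and, or, not and the three attribute maps are invented for the sketch and only loosely modelled on the sample page; they are not HTMLParser classes.

```java
import java.util.List;
import java.util.Map;

public class LogicFilterSketch {
    /** Hypothetical stand-in for a node filter: a single accept method. */
    interface ToyFilter {
        boolean accept(Map<String, String> attrs);
    }

    /** Simple filter: the tag has an attribute with this name. */
    static ToyFilter hasAttr(String name) {
        return attrs -> attrs.containsKey(name);
    }

    static ToyFilter and(ToyFilter a, ToyFilter b) {
        return attrs -> a.accept(attrs) && b.accept(attrs);
    }

    static ToyFilter or(ToyFilter a, ToyFilter b) {
        return attrs -> a.accept(attrs) || b.accept(attrs);
    }

    static ToyFilter not(ToyFilter a) {
        return attrs -> !a.accept(attrs);
    }

    /** Three toy tags, each reduced to its attribute map. */
    static List<Map<String, String>> sampleTags() {
        return List.of(
                Map.of("align", "center", "class", "photo"),
                Map.of("class", "input"),
                Map.of("src", "../image/baidu.PNG"));
    }

    /** Number of toy tags the filter accepts. */
    static long count(ToyFilter filter) {
        return sampleTags().stream().filter(filter::accept).count();
    }

    public static void main(String[] args) {
        ToyFilter hasClass = hasAttr("class");
        ToyFilter hasAlign = hasAttr("align");
        System.out.println(count(and(hasClass, hasAlign)));     // 1
        System.out.println(count(or(hasClass, hasAlign)));      // 2
        System.out.println(count(not(or(hasClass, hasAlign)))); // 1
    }
}
```

Because each combinator returns another filter, combined filters can themselves be combined, which is what allows an expression such as a NOT wrapped around an OR.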

      3.1 AndFilter

      AndFilter combines two Filters; only Nodes that satisfy both conditions are matched.
      Test code:

    NodeFilter mNodeFilterLeft = new HasAttributeFilter("class");
    NodeFilter mNodeFilterRight = new HasAttributeFilter("align");
    NodeFilter mNodeFilter = new AndFilter(mNodeFilterLeft, mNodeFilterRight);
    NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);

    Test output:

    1 getText:div  align = "center" class = "photo" 
    2 ===================================
    3 getText:div align = "center" class = "body"
    4 ===================================

      3.2 OrFilter
      Replace the AndFilter above with an OrFilter.

       Test code:

    NodeFilter mNodeFilterLeft = new HasAttributeFilter("class");
    NodeFilter mNodeFilterRight = new HasAttributeFilter("align");
    NodeFilter mNodeFilter = new OrFilter(mNodeFilterLeft, mNodeFilterRight);
    NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);

      Test output:

    1 getText:div  align = "center" class = "photo" 
    2 ===================================
    3 getText:div align = "center" class = "body"
    4 ===================================
    5 getText:input class = "input" 
    6 ===================================

      3.3 NotFilter
      Wrap the OrFilter above in a NotFilter, so that it matches every node the OrFilter rejected.

      Test code:

    NodeFilter mNodeFilterLeft = new HasAttributeFilter("class");
    NodeFilter mNodeFilterRight = new HasAttributeFilter("align");
    NodeFilter mNodeFilter = new NotFilter(new OrFilter(mNodeFilterLeft, mNodeFilterRight));
    NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);

      Test output:

      1 getText:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
      2 ===================================
      3 getText:
      4 
      5 ===================================
      6 getText:html
      7 ===================================
      8 getText:
      9     
     10 ===================================
     11 getText:head
     12 ===================================
     13 getText:
     14         
     15 ===================================
     16 getText:meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/
     17 ===================================
     18 getText:
     19         
     20 ===================================
     21 getText:title
     22 ===================================
     23 getText:百度
     24 ===================================
     25 getText:/title
     26 ===================================
     27 getText:
     28         
     29 ===================================
     30 getText:link href = "a_1.css" rel = "stylesheet" type = "text/css"/
     31 ===================================
     32 getText:
     33     
     34 ===================================
     35 getText:/head
     36 ===================================
     37 getText:
     38     
     39 ===================================
     40 getText:body
     41 ===================================
     42 getText:
     43         
     44 ===================================
     45 getText:
     46             
     47 ===================================
     48 getText:img src = "../image/baidu.PNG" 
     49 ===================================
     50 getText:
     51         
     52 ===================================
     53 getText:/div
     54 ===================================
     55 getText:
     56         
     57 ===================================
     58 getText:
     59             
     60 ===================================
     61 getText:table cellpadding="8"
     62 ===================================
     63 getText:
     64                 
     65 ===================================
     66 getText:td
     67 ===================================
     68 getText:
     69                     
     70 ===================================
     71 getText:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
     72 ===================================
     73 getText:新闻
     74 ===================================
     75 getText:/a
     76 ===================================
     77 getText:
     78                 
     79 ===================================
     80 getText:/td
     81 ===================================
     82 getText:
     83                 
     84 ===================================
     85 getText:td
     86 ===================================
     87 getText:
     88                     
     89 ===================================
     90 getText:font color = "black"
     91 ===================================
     92 getText:网页
     93 ===================================
     94 getText:/font
     95 ===================================
     96 getText:
     97                 
     98 ===================================
     99 getText:/td
    100 ===================================
    101 getText:
    102                 
    103 ===================================
    104 getText:td
    105 ===================================
    106 getText:
    107                     
    108 ===================================
    109 getText:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
    110 ===================================
    111 getText:贴吧
    112 ===================================
    113 getText:/a
    114 ===================================
    115 getText:
    116                 
    117 ===================================
    118 getText:/td
    119 ===================================
    120 getText:
    121                 
    122 ===================================
    123 getText:td
    124 ===================================
    125 getText:
    126                     
    127 ===================================
    128 getText:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
    129 ===================================
    130 getText:知道
    131 ===================================
    132 getText:/a
    133 ===================================
    134 getText:
    135                 
    136 ===================================
    137 getText:/td
    138 ===================================
    139 getText:
    140                 
    141 ===================================
    142 getText:td
    143 ===================================
    144 getText:
    145                     
    146 ===================================
    147 getText:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
    148 ===================================
    149 getText:音乐
    150 ===================================
    151 getText:/a
    152 ===================================
    153 getText:
    154                 
    155 ===================================
    156 getText:/td
    157 ===================================
    158 getText:
    159                 
    160 ===================================
    161 getText:td
    162 ===================================
    163 getText:
    164                     
    165 ===================================
    166 getText:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
    167 ===================================
    168 getText:图片
    169 ===================================
    170 getText:/a
    171 ===================================
    172 getText:
    173                 
    174 ===================================
    175 getText:/td
    176 ===================================
    177 getText:
    178                 
    179 ===================================
    180 getText:td
    181 ===================================
    182 getText:
    183                     
    184 ===================================
    185 getText:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
    186 ===================================
    187 getText:视频
    188 ===================================
    189 getText:/a
    190 ===================================
    191 getText:
    192                 
    193 ===================================
    194 getText:/td
    195 ===================================
    196 getText:
    197                 
    198 ===================================
    199 getText:td
    200 ===================================
    201 getText:
    202                     
    203 ===================================
    204 getText:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
    205 ===================================
    206 getText:地图
    207 ===================================
    208 getText:/a
    209 ===================================
    210 getText:
    211                 
    212 ===================================
    213 getText:/td
    214 ===================================
    215 getText:
    216             
    217 ===================================
    218 getText:/table
    219 ===================================
    220 getText:
    221             
    222 ===================================
    223 getText:
    224         
    225 ===================================
    226 getText:/div
    227 ===================================
    228 getText:
    229     
    230 ===================================
    231 getText:/body
    232 ===================================
    233 getText:
    234 
    235 
    236 ===================================
    237 getText:/html
    238 ===================================

      3.4 XorFilter (example not yet written)
      Replace the AndFilter above with an XorFilter.

      Test code: ……

      Test output: ……

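    (4) Other Filters
      4.1 NodeClassFilter

      This Filter tests whether a node is of one particular Node type. The different Node types were covered earlier, and this Filter lets you select by type.

      Test code:

      Test output:

      4.2 StringFilter

      This Filter matches nodes whose displayable string contains the specified text. Note that only displayable strings are searched; text in non-displayable content (comments, links, and so on) is not matched.

      Test code: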

    NodeFilter mNodeFilter = new StringFilter("贴吧");
    NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);

      Test output:

    1 getText:贴吧
    2 ===================================

      4.3 LinkStringFilter

      This Filter tests whether a link contains a particular string, so it can pick out the links that point to a particular website.

      Test code:

    NodeFilter mNodeFilter = new LinkStringFilter("http://tieba.baidu.com/");
    NodeList mNodeList = mParser.extractAllNodesThatMatch(mNodeFilter);

      Test output (the sample HTML has to be changed for this test; the modified line is: <a href = "http://tieba.baidu.com/" target = _blank title = "欢迎来到&#10百度网站">贴吧</a>):

    1 getText:a href = "http://tieba.baidu.com/" target = _blank title = "欢迎来到&#10百度网站"
    2 ===================================

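      4.4 The remaining Filters

      The remaining Filters also test strings against different fields; the main difference from the ones above is that they support regular expressions. They are outside the scope of this article, but you can experiment with them yourself.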
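To show what regular-expression support adds over a plain "contains" test, here is a small sketch that uses only java.util.regex from the JDK. It does not call any HTMLParser filter; containsText and matchesPattern are hypothetical helper names, and the zhidao and map links are made-up sample values.

```java
import java.util.regex.Pattern;

public class RegexMatchSketch {
    /** Plain substring test: true if the link contains the given text. */
    static boolean containsText(String link, String needle) {
        return link.contains(needle);
    }

    /** Regex test: true if the pattern is found anywhere in the link. */
    static boolean matchesPattern(String link, String regex) {
        return Pattern.compile(regex).matcher(link).find();
    }

    public static void main(String[] args) {
        // One pattern covers two subdomains and both http and https.
        String regex = "https?://(tieba|zhidao)\\.baidu\\.com/";

        System.out.println(containsText("http://tieba.baidu.com/", "tieba.baidu.com")); // true
        System.out.println(matchesPattern("http://tieba.baidu.com/", regex));           // true
        System.out.println(matchesPattern("http://map.baidu.com/", regex));             // false
    }
}
```

A substring test needs one check per site, while a single pattern can describe a whole family of links.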

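      After HTMLParser has traversed the page content, it stores the result as a tree (forest). There are two ways to access that result: with a Filter or with a Visitor.
      The following describes how to access the content with a Visitor.

      5.1 NodeVisitor

      Put simply, a Filter selects the Nodes that meet some condition so that you can process them afterwards, whereas a Visitor walks every node of the content tree and processes the ones that qualify. Either approach can produce the same result.
      Below is the most common kind of NodeVisitor example.

      Test code: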

    public static void main(String[] args) {
        try {
            Parser mParser = new Parser(
                    (HttpURLConnection) (new URL(
                            "http://127.0.0.1/HtmlParser/index.html"))
                            .openConnection());
            // Arguments are (recurseChildren, recurseSelf); both are false here
            NodeVisitor mNodeVisitor = new NodeVisitor(false, false) {
                @Override
                public void visitTag(Tag tag) {
                    message("This is Tag:" + tag.getText());
                }

                @Override
                public void visitStringNode(Text string) {
                    // Concatenation uses Text.toString(), which includes position info
                    message("This is Text:" + string);
                }

                @Override
                public void visitRemarkNode(Remark remark) {
                    message("This is Remark:" + remark.getText());
                }

                @Override
                public void beginParsing() {
                    message("begin Parsing");
                }

                @Override
                public void visitEndTag(Tag tag) {
                    message("visitEndTag:" + tag.getText());
                }

                @Override
                public void finishedParsing() {
                    message("finishedParsing!");
                }
            };
            mParser.visitAllNodesWith(mNodeVisitor);
        } catch (Exception e) {
            // Report the failure instead of swallowing it silently
            e.printStackTrace();
        }
    }

      Test output:

    1 begin Parsing
    2 This is Tag:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
    3 This is Text:Txt (121[0,121],123[1,0]): 
    
    4 finishedParsing!

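      As you can see, beginParsing is called before any node is traversed, the Nodes in between are processed next, and finishedParsing is called last, when the traversal ends. Because recurseChildren and recurseSelf are both false here, the Visitor visits neither the child nodes nor the root node itself. The two items printed in between (the DOCTYPE tag and the line break that follows it) are the top-level nodes discussed earlier in the section on initializing the Parser.

    Let's first set recurseSelf to true and see what happens.

    NodeVisitor mNodeVisitor = new NodeVisitor(false, true)

       Output: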

    1 begin Parsing
    2 This is Tag:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
    3 This is Text:Txt (121[0,121],123[1,0]): 
    
    4 This is Tag:html
    5 finishedParsing!

      As you can see, the first-level nodes of the HTML page are now all visited, including the html tag.

      Now let's try the following call:

    NodeVisitor mNodeVisitor = new NodeVisitor(true, false)

      Output:

     1 begin Parsing
     2 This is Tag:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
     3 This is Text:Txt (121[0,121],123[1,0]): 
    
     4 This is Text:Txt (129[1,6],132[2,1]): 
    	
     5 This is Text:Txt (138[2,7],142[3,2]): 
    		
     6 This is Tag:meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/
     7 This is Text:Txt (216[3,76],220[4,2]): 
    		
     8 This is Remark:<title>百度</title>
     9 This is Text:Txt (244[4,26],248[5,2]): 
    		
    10 This is Tag:link href = "a_1.css" rel = "stylesheet" type = "text/css"/
    11 This is Text:Txt (309[5,63],312[6,1]): 
    	
    12 visitEndTag:/head
    13 This is Text:Txt (319[6,8],322[7,1]): 
    	
    14 This is Text:Txt (328[7,7],332[8,2]): 
    		
    15 This is Text:Txt (372[8,42],377[9,3]): 
    			
    16 This is Tag:img src = "../image/baidu.PNG" 
    17 This is Text:Txt (410[9,36],414[10,2]): 
    		
    18 visitEndTag:/div
    19 This is Text:Txt (420[10,8],424[11,2]): 
    		
    20 This is Text:Txt (461[11,39],466[12,3]): 
    			
    21 This is Text:Txt (489[12,26],495[13,4]): 
    				
    22 This is Text:Txt (499[13,8],506[14,5]): 
    					
    23 This is Text:Txt (559[14,58],561[14,60]): 新闻
    24 visitEndTag:/a
    25 This is Text:Txt (565[14,64],571[15,4]): 
    				
    26 visitEndTag:/td
    27 This is Text:Txt (576[15,9],582[16,4]): 
    				
    28 This is Text:Txt (586[16,8],593[17,5]): 
    					
    29 This is Tag:font color = "black"
    30 This is Text:Txt (615[17,27],617[17,29]): 网页
    31 visitEndTag:/font
    32 This is Text:Txt (624[17,36],630[18,4]): 
    				
    33 visitEndTag:/td
    34 This is Text:Txt (635[18,9],641[19,4]): 
    				
    35 This is Text:Txt (645[19,8],652[20,5]): 
    					
    36 This is Text:Txt (727[20,80],729[20,82]): 贴吧
    37 visitEndTag:/a
    38 This is Text:Txt (733[20,86],739[21,4]): 
    				
    39 visitEndTag:/td
    40 This is Text:Txt (744[21,9],750[22,4]): 
    				
    41 This is Text:Txt (754[22,8],761[23,5]): 
    					
    42 This is Text:Txt (814[23,58],816[23,60]): 知道
    43 visitEndTag:/a
    44 This is Text:Txt (820[23,64],826[24,4]): 
    				
    45 visitEndTag:/td
    46 This is Text:Txt (831[24,9],837[25,4]): 
    				
    47 This is Text:Txt (841[25,8],848[26,5]): 
    					
    48 This is Text:Txt (901[26,58],903[26,60]): 音乐
    49 visitEndTag:/a
    50 This is Text:Txt (907[26,64],913[27,4]): 
    				
    51 visitEndTag:/td
    52 This is Text:Txt (918[27,9],924[28,4]): 
    				
    53 This is Text:Txt (928[28,8],935[29,5]): 
    					
    54 This is Text:Txt (988[29,58],990[29,60]): 图片
    55 visitEndTag:/a
    56 This is Text:Txt (994[29,64],1000[30,4]): 
    				
    57 visitEndTag:/td
    58 This is Text:Txt (1005[30,9],1011[31,4]): 
    				
    59 This is Text:Txt (1015[31,8],1022[32,5]): 
    					
    60 This is Text:Txt (1075[32,58],1077[32,60]): 视频
    61 visitEndTag:/a
    62 This is Text:Txt (1081[32,64],1087[33,4]): 
    				
    63 visitEndTag:/td
    64 This is Text:Txt (1092[33,9],1098[34,4]): 
    				
    65 This is Text:Txt (1102[34,8],1109[35,5]): 
    					
    66 This is Text:Txt (1162[35,58],1164[35,60]): 地图
    67 visitEndTag:/a
    68 This is Text:Txt (1168[35,64],1174[36,4]): 
    				
    69 visitEndTag:/td
    70 This is Text:Txt (1179[36,9],1184[37,3]): 
    			
    71 visitEndTag:/table
    72 This is Text:Txt (1192[37,11],1197[38,3]): 
    			
    73 This is Tag:input class = "input" 
    74 This is Text:Txt (1221[38,27],1225[39,2]): 
    		
    75 visitEndTag:/div
    76 This is Text:Txt (1231[39,8],1234[40,1]): 
    	
    77 visitEndTag:/body
    78 This is Text:Txt (1241[40,8],1245[42,0]): 
    
    
    79 visitEndTag:/html
    80 finishedParsing!

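      As you can see, all the child nodes now appear, but the container tags themselves, such as This is Tag:html, This is Tag:head and This is Tag:body, are missing, because recurseSelf is false.

      To make them appear as well, just use

    NodeVisitor mNodeVisitor = new NodeVisitor(true, true)

       Output: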

      1 begin Parsing
      2 This is Tag:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
      3 This is Text:Txt (121[0,121],123[1,0]): 
    
      4 This is Tag:html
      5 This is Text:Txt (129[1,6],132[2,1]): 
    	
      6 This is Tag:head
      7 This is Text:Txt (138[2,7],142[3,2]): 
    		
      8 This is Tag:meta http-equiv = "Content-Type" content = "text/html; charset = utf-8"/
      9 This is Text:Txt (216[3,76],220[4,2]): 
    		
     10 This is Remark:<title>百度</title>
     11 This is Text:Txt (244[4,26],248[5,2]): 
    		
     12 This is Tag:link href = "a_1.css" rel = "stylesheet" type = "text/css"/
     13 This is Text:Txt (309[5,63],312[6,1]): 
    	
     14 visitEndTag:/head
     15 This is Text:Txt (319[6,8],322[7,1]): 
    	
     16 This is Tag:body
     17 This is Text:Txt (328[7,7],332[8,2]): 
    		
     18 This is Tag:div  align = "center" class = "photo" 
     19 This is Text:Txt (372[8,42],377[9,3]): 
    			
     20 This is Tag:img src = "../image/baidu.PNG" 
     21 This is Text:Txt (410[9,36],414[10,2]): 
    		
     22 visitEndTag:/div
     23 This is Text:Txt (420[10,8],424[11,2]): 
    		
     24 This is Tag:div align = "center" class = "body"
     25 This is Text:Txt (461[11,39],466[12,3]): 
    			
     26 This is Tag:table cellpadding="8"
     27 This is Text:Txt (489[12,26],495[13,4]): 
    				
     28 This is Tag:td
     29 This is Text:Txt (499[13,8],506[14,5]): 
    					
     30 This is Tag:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
     31 This is Text:Txt (559[14,58],561[14,60]): 新闻
     32 visitEndTag:/a
     33 This is Text:Txt (565[14,64],571[15,4]): 
    				
     34 visitEndTag:/td
     35 This is Text:Txt (576[15,9],582[16,4]): 
    				
     36 This is Tag:td
     37 This is Text:Txt (586[16,8],593[17,5]): 
    					
     38 This is Tag:font color = "black"
     39 This is Text:Txt (615[17,27],617[17,29]): 网页
     40 visitEndTag:/font
     41 This is Text:Txt (624[17,36],630[18,4]): 
    				
     42 visitEndTag:/td
     43 This is Text:Txt (635[18,9],641[19,4]): 
    				
     44 This is Tag:td
     45 This is Text:Txt (645[19,8],652[20,5]): 
    					
     46 This is Tag:a href = "http://tieba.baidu.com/" target = _blank title = "欢迎来到&#10百度网站"
     47 This is Text:Txt (727[20,80],729[20,82]): 贴吧
     48 visitEndTag:/a
     49 This is Text:Txt (733[20,86],739[21,4]): 
    				
     50 visitEndTag:/td
     51 This is Text:Txt (744[21,9],750[22,4]): 
    				
     52 This is Tag:td
     53 This is Text:Txt (754[22,8],761[23,5]): 
    					
     54 This is Tag:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
     55 This is Text:Txt (814[23,58],816[23,60]): 知道
     56 visitEndTag:/a
     57 This is Text:Txt (820[23,64],826[24,4]): 
    				
     58 visitEndTag:/td
     59 This is Text:Txt (831[24,9],837[25,4]): 
    				
     60 This is Tag:td
     61 This is Text:Txt (841[25,8],848[26,5]): 
    					
     62 This is Tag:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
     63 This is Text:Txt (901[26,58],903[26,60]): 音乐
     64 visitEndTag:/a
     65 This is Text:Txt (907[26,64],913[27,4]): 
    				
     66 visitEndTag:/td
     67 This is Text:Txt (918[27,9],924[28,4]): 
    				
     68 This is Tag:td
     69 This is Text:Txt (928[28,8],935[29,5]): 
    					
     70 This is Tag:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
     71 This is Text:Txt (988[29,58],990[29,60]): 图片
     72 visitEndTag:/a
     73 This is Text:Txt (994[29,64],1000[30,4]): 
    				
     74 visitEndTag:/td
     75 This is Text:Txt (1005[30,9],1011[31,4]): 
    				
     76 This is Tag:td
     77 This is Text:Txt (1015[31,8],1022[32,5]): 
    					
     78 This is Tag:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
     79 This is Text:Txt (1075[32,58],1077[32,60]): 视频
     80 visitEndTag:/a
     81 This is Text:Txt (1081[32,64],1087[33,4]): 
    				
     82 visitEndTag:/td
     83 This is Text:Txt (1092[33,9],1098[34,4]): 
    				
     84 This is Tag:td
     85 This is Text:Txt (1102[34,8],1109[35,5]): 
    					
     86 This is Tag:a href = "#" target = _blank title = "欢迎来到&#10百度网站"
     87 This is Text:Txt (1162[35,58],1164[35,60]): 地图
     88 visitEndTag:/a
     89 This is Text:Txt (1168[35,64],1174[36,4]): 
    				
     90 visitEndTag:/td
     91 This is Text:Txt (1179[36,9],1184[37,3]): 
    			
     92 visitEndTag:/table
     93 This is Text:Txt (1192[37,11],1197[38,3]): 
    			
     94 This is Tag:input class = "input" 
     95 This is Text:Txt (1221[38,27],1225[39,2]): 
    		
     96 visitEndTag:/div
     97 This is Text:Txt (1231[39,8],1234[40,1]): 
    	
     98 visitEndTag:/body
     99 This is Text:Txt (1241[40,8],1245[42,0]): 
    
    
    100 visitEndTag:/html
    101 finishedParsing!

      Now the calling sequence is clear; just add your own code at the points where you need to process nodes.
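The effect of the two flags can be summarized with a toy model that reproduces the pattern of the four outputs above. This is a simplified sketch, not the library's implementation: ToyNode, accept, visitAll and samplePage are hypothetical names, text nodes are left out, and the page is reduced to a DOCTYPE leaf plus html > body > img.

```java
import java.util.ArrayList;
import java.util.List;

public class VisitorFlagsSketch {
    /** Hypothetical stand-in for a parsed node; a container has children and an end tag. */
    static final class ToyNode {
        final String name;
        final boolean container;
        final List<ToyNode> children = new ArrayList<>();

        ToyNode(String name, boolean container, ToyNode... kids) {
            this.name = name;
            this.container = container;
            for (ToyNode kid : kids) {
                children.add(kid);
            }
        }
    }

    /** Records callbacks in order, gated by the two flags. */
    static void accept(ToyNode node, boolean recurseChildren, boolean recurseSelf, List<String> log) {
        if (!node.container) {
            log.add("tag:" + node.name); // leaf nodes are always reported
            return;
        }
        if (recurseSelf) {
            log.add("tag:" + node.name); // the container's own start tag
        }
        if (recurseChildren) {
            for (ToyNode child : node.children) {
                accept(child, recurseChildren, recurseSelf, log);
            }
            log.add("end:/" + node.name); // end tag is reached through the children pass
        }
    }

    /** Walks the top-level nodes, bracketed by begin and finished entries. */
    static List<String> visitAll(List<ToyNode> roots, boolean recurseChildren, boolean recurseSelf) {
        List<String> log = new ArrayList<>();
        log.add("begin");
        for (ToyNode root : roots) {
            accept(root, recurseChildren, recurseSelf, log);
        }
        log.add("finished");
        return log;
    }

    /** Top level: a DOCTYPE leaf, then html > body > img. */
    static List<ToyNode> samplePage() {
        ToyNode img = new ToyNode("img", false);
        ToyNode body = new ToyNode("body", true, img);
        ToyNode html = new ToyNode("html", true, body);
        return List.of(new ToyNode("!DOCTYPE", false), html);
    }

    public static void main(String[] args) {
        System.out.println(visitAll(samplePage(), false, false)); // [begin, tag:!DOCTYPE, finished]
        System.out.println(visitAll(samplePage(), false, true));  // [begin, tag:!DOCTYPE, tag:html, finished]
        System.out.println(visitAll(samplePage(), true, false));  // [begin, tag:!DOCTYPE, tag:img, end:/body, end:/html, finished]
        System.out.println(visitAll(samplePage(), true, true));   // [begin, tag:!DOCTYPE, tag:html, tag:body, tag:img, end:/body, end:/html, finished]
    }
}
```

In this model recurseSelf decides whether a container reports its own start tag, and recurseChildren decides whether the walk descends into it, which is why the (true, false) run shows leaf tags and end tags but no html or body start tag.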

      5.2 Other Visitors

    ……

    That, I feel, is as far as my journey with htmlparser goes. Next stop: JSoup!

     ===========================References===========================

    http://www.blogjava.net/amigoxie/archive/2008/01/18/176200.html

    http://www.cnblogs.com/loveyakamoz/archive/2011/07/27/2118937.html

    http://blog.csdn.net/witsmakemen/article/details/8778979

     ===========================References===========================

  • Original article: https://www.cnblogs.com/zhjsll/p/4251153.html