  • HTML parser: parsing a table with jericho-html-3.3

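    Part of this post originally came from other blogs on the Internet, but that was a long time ago and I have forgotten which ones I used as references. My apologies.

    First, here is a sample HTML page: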

    <html>
    <head>
    <meta http-equiv="content-type" content="text/html;charset=GBK">
    <title>HTML Parser</title>
    <meta name="generator" content="Namo WebEditor">
    </head>
    <body>
    <table width=620 border=0 cellpadding=1 cellspacing=0 bgcolor=#0066cc>
    	<tr>
    		<td width=100%>
    		<table width=100% border=0 cellpadding=4 cellspacing=0 bgcolor=#D3E5FB>
    			<tr bgcolor=#D3E5FB>
    				<td width=20%><font size="2" face="Arial,Verdana"><b>想学习
    				Name</b></font><br>
    				</td>
    				<td width=13%><font size="2" face="Arial,Verdana"><b>Result</b></font><br>
    				</td>
    				<td width=8%><font size="2" face="Arial,Verdana"><b>Time</b></font><br>
    				</td>
    				<td width=59%><font size="2" face="Arial,Verdana"><b>Synopsis</b></font><br>
    				</td>
    			</tr>
    			<tr bgcolor=#eeeeee>
    				<td width=20%><font size="1" face="Arial,Verdana"><b>9</b>
    				想学习</font><br>
    				</td>
    				<td width=13%><font size="1" face="Arial,Verdana"><font
    					color=#ff0033>+FAIL</font> <a
    					href="v4_wireless_802.1x_full/cdrouter_dhcp_20.txt">想学习</a></font><br>
    				</td>
    				<td width=8%><font size="1" face="Arial,Verdana">12:31</font><br>
    				</td>
    				<td width=59%><font size="1" face="Arial,Verdana">想学习</font><br>
    				</td>
    			</tr>
    			<tr bgcolor=#ffffff>
    				<td width=20%><font size="1" face="Arial,Verdana"><b>1</b>
    				cdrouter_basic_1</font><br>
    				</td>
    				<td width=13%><font size="1" face="Arial,Verdana">Pass <a
    					href="v4_wireless_802.1x_full/cdrouter_basic_1.txt">想学习</a></font><br>
    				</td>
    				<td width=8%><font size="1" face="Arial,Verdana">00:00</font><br>
    				</td>
    				<td width=59%><font size="1" face="Arial,Verdana">想学习</font><br>
    				</td>
    			</tr>
    		</table>
    		</td>
    	</tr>
    </table>
    </body>
    </html>

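    For this page, I want to extract the text content of every td. How should I do that? With regular expressions, I would find it hard to write a correct pattern that parses out the result I want.

    Searching online, I found the jericho-html-3.3 library, which is indeed very convenient for parsing tables.

    The code is as follows: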

    package com.xxx.hbuassys.test;
    
    import java.net.URL;
    import java.util.Iterator;
    import java.util.List;
    
    import net.htmlparser.jericho.Element;
    import net.htmlparser.jericho.HTMLElementName;
    import net.htmlparser.jericho.Segment;
    import net.htmlparser.jericho.Source;
     
    public class HtmlParser
    {
        public static void main(String[] args) throws Exception
        {
            String sourceUrlString="test.html";
            
            if(sourceUrlString.indexOf(':') == -1)
                sourceUrlString ="file:"+sourceUrlString;
            Source source=new Source(new URL(sourceUrlString));
            List Elements_TABLE=source.getAllElements(HTMLElementName.TABLE);
            Elements_TABLE.remove(0); // The tables are nested and we need the second (inner) one, so drop the first
            Iterator it_TABLE = Elements_TABLE.iterator();
            while(it_TABLE.hasNext())
            {
                Element Element_TABLE = (Element)it_TABLE.next();
    //        	System.out.println("**"+Element_TABLE.toString()+"\n**");
                Segment getContent_TABLE = (Segment)Element_TABLE.getContent();
                List Elements_TR = getContent_TABLE.getAllElements(HTMLElementName.TR);
                Iterator it_TR = Elements_TR.iterator();
                while(it_TR.hasNext())
                {
                    Element Element_TR = (Element)it_TR.next();
                    Segment getContent_TR = (Segment)Element_TR.getContent();
                    List Elements_FONT = getContent_TR.getAllElements(HTMLElementName.FONT);
                    Iterator it_FONT = Elements_FONT.iterator();
                    int i = 1;
                    while(it_FONT.hasNext())
                    {
                        Element Element_FONT = (Element)it_FONT.next();
                        Segment getContent_FONT = (Segment)Element_FONT.getContent();
                        String a1 = getContent_FONT.toString();
                        System.out.println(i + " = " + Element_FONT.getContent().getTextExtractor().toString());
                        i++;
                    }
                    System.out.println();
                }
            }
        }
    }
    
    Result:

    1 = 想学习 Name
    2 = Result
    3 = Time
    4 = Synopsis


    1 = 9 想学习
    2 = +FAIL 想学习
    3 = +FAIL
    4 = 12:31
    5 = 想学习


    1 = 1 cdrouter_basic_1
    2 = Pass 想学习
    3 = 00:00
    4 = 想学习


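    The general idea is: first get all the table tags, then parse the table you need, take the tr elements out of it, and then take the td elements out of each tr. That gives us the content we need. Note that the code above actually collects the font elements inside each tr rather than the td elements themselves. Because one cell in the first data row nests a font inside another font, that row prints five entries, with "+FAIL" appearing twice.

    If I stopped here, this post would be no different from what others have already written online.

    Because of a project requirement, I ran into a problem while using this library:

    If the HTML page's encoding is UTF-8, the parsed content comes out garbled. Re-encoding the garbled string directly, for example with new String(str.getBytes(),"GBK") and similar operations, does not solve the problem. I have tested this myself.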

    For example, suppose the HTML page is changed to:

    <html>
    <head>
    <meta http-equiv="content-type" content="text/html;charset=UTF-8">
    <title>HTML Parser</title>
    <meta name="generator" content="Namo WebEditor">
    </head>
    <body>
    <table width=620 border=0 cellpadding=1 cellspacing=0 bgcolor=#0066cc>
    	<tr>
    		<td width=100%>
    		<table width=100% border=0 cellpadding=4 cellspacing=0 bgcolor=#D3E5FB>
    			<tr bgcolor=#D3E5FB>
    				<td width=20%><font size="2" face="Arial,Verdana"><b>想学习
    				Name</b></font><br>
    				</td>
    				<td width=13%><font size="2" face="Arial,Verdana"><b>Result</b></font><br>
    				</td>
    				<td width=8%><font size="2" face="Arial,Verdana"><b>Time</b></font><br>
    				</td>
    				<td width=59%><font size="2" face="Arial,Verdana"><b>Synopsis</b></font><br>
    				</td>
    			</tr>
    			<tr bgcolor=#eeeeee>
    				<td width=20%><font size="1" face="Arial,Verdana"><b>9</b>
    				想学习</font><br>
    				</td>
    				<td width=13%><font size="1" face="Arial,Verdana"><font
    					color=#ff0033>+FAIL</font> <a
    					href="v4_wireless_802.1x_full/cdrouter_dhcp_20.txt">想学习</a></font><br>
    				</td>
    				<td width=8%><font size="1" face="Arial,Verdana">12:31</font><br>
    				</td>
    				<td width=59%><font size="1" face="Arial,Verdana">想学习</font><br>
    				</td>
    			</tr>
    			<tr bgcolor=#ffffff>
    				<td width=20%><font size="1" face="Arial,Verdana"><b>1</b>
    				cdrouter_basic_1</font><br>
    				</td>
    				<td width=13%><font size="1" face="Arial,Verdana">Pass <a
    					href="v4_wireless_802.1x_full/cdrouter_basic_1.txt">想学习</a></font><br>
    				</td>
    				<td width=8%><font size="1" face="Arial,Verdana">00:00</font><br>
    				</td>
    				<td width=59%><font size="1" face="Arial,Verdana">想学习</font><br>
    				</td>
    			</tr>
    		</table>
    		</td>
    	</tr>
    </table>
    </body>
    </html>

    The result is then:

    1 = ???

    ? Name
    2 = Result
    3 = Time
    4 = Synopsis


    1 = 9 ???

    ?
    2 = +FAIL ?

    ???
    3 = +FAIL
    4 = 12:31
    5 = ?

    ?

    ??


    1 = 1 cdrouter_basic_1
    2 = Pass ??

    ??
    3 = 00:00
    4 = ?

    ?

    ??
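    One plausible reading of the "?" characters above, stated here as an assumption rather than something the post confirms, is that the bytes of test.html are really GBK while the meta tag declares UTF-8. Under that assumption, decoding replaces the malformed byte sequences with U+FFFD, and the original bytes are lost from the string. The following sketch uses only the JDK (no Jericho) and explicit charsets, so it is a reproducible variant of the new String(str.getBytes(),"GBK") attempt, not the exact call. MojibakeDemo and its methods are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it is not part of the original post.
final class MojibakeDemo {
    // Assumption: this JDK includes the GBK charset (standard distributions do).
    static final Charset GBK = Charset.forName("GBK");

    // Decodes bytes with the given charset; malformed input becomes U+FFFD.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Attempts a "repair" by re-encoding the text as UTF-8 and decoding it as GBK.
    static String reinterpret(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.UTF_8), GBK);
    }

    public static void main(String[] args) {
        // The sample cell text from the page, written as Unicode escapes.
        String original = "\u60F3\u5B66\u4E60";
        byte[] gbkBytes = original.getBytes(GBK);

        // Wrong charset: at least one byte sequence is not valid UTF-8.
        String garbled = decode(gbkBytes, StandardCharsets.UTF_8);
        System.out.println("contains U+FFFD: " + (garbled.indexOf('\uFFFD') >= 0));

        // Re-encoding the damaged string does not bring the original back.
        System.out.println("repaired: " + original.equals(reinterpret(garbled)));

        // Decoding the untouched bytes with the right charset does.
        System.out.println("correct decode: " + original.equals(decode(gbkBytes, GBK)));
    }
}
```

    The point of the sketch is that the fix has to happen where the bytes are first decoded. Once the replacement characters are in the string, no later conversion can restore the text.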




    The fix I used is to change <meta http-equiv="content-type" content="text/html;charset=UTF-8"> to <meta http-equiv="content-type" content="text/html;charset=GBK">.

    For the details, see the following code:

    package com.xxx.hbuassys.test;
    
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.Iterator;
    import java.util.List;
    
    import net.htmlparser.jericho.Element;
    import net.htmlparser.jericho.HTMLElementName;
    import net.htmlparser.jericho.Segment;
    import net.htmlparser.jericho.Source;
     
    public class HtmlParser
    {
        public static void main(String[] args) throws Exception
        {
        	// No charset is given, so InputStreamReader decodes with the platform default charset.
        	BufferedReader reader=new BufferedReader(new InputStreamReader(new FileInputStream(new File("test.html"))));
    //    	BufferedReader reader=new BufferedReader(new FileReader(new File("test.html")));
        	StringBuilder sbf=new StringBuilder();
        	String str=null;
        	while((str=reader.readLine())!=null){
        		sbf.append(str).append("\n");
        	}
        	reader.close();
        	// Workaround for the garbled Chinese text: rewrite the declared charset
        	String html=sbf.toString().replace("<meta http-equiv=\"content-type\" content=\"text/html;charset=UTF-8\">", "<meta http-equiv=\"content-type\" content=\"text/html;charset=GBK\">");
    //    	System.out.println(html);
            Source source=new Source(html);
            List Elements_TABLE=source.getAllElements(HTMLElementName.TABLE);
            Elements_TABLE.remove(0); // The tables are nested and we need the second (inner) one, so drop the first
            Iterator it_TABLE = Elements_TABLE.iterator();
            while(it_TABLE.hasNext())
            {
                Element Element_TABLE = (Element)it_TABLE.next();
    //        	System.out.println("**"+Element_TABLE.toString()+"\n**");
                Segment getContent_TABLE = (Segment)Element_TABLE.getContent();
                List Elements_TR = getContent_TABLE.getAllElements(HTMLElementName.TR);
                Iterator it_TR = Elements_TR.iterator();
                while(it_TR.hasNext())
                {
                    Element Element_TR = (Element)it_TR.next();
                    Segment getContent_TR = (Segment)Element_TR.getContent();
                    List Elements_FONT = getContent_TR.getAllElements(HTMLElementName.FONT);
                    Iterator it_FONT = Elements_FONT.iterator();
                    int i = 1;
                    while(it_FONT.hasNext())
                    {
                        Element Element_FONT = (Element)it_FONT.next();
                        Segment getContent_FONT = (Segment)Element_FONT.getContent();
                        String a1 = getContent_FONT.toString();
                        System.out.println(i + " = " + Element_FONT.getContent().getTextExtractor().toString());
                        i++;
                    }
                    System.out.println();
                }
            }
        }
    }
    

    The result is as follows:

    1 = 想学习 Name
    2 = Result
    3 = Time
    4 = Synopsis


    1 = 9 想学习
    2 = +FAIL 想学习
    3 = +FAIL
    4 = 12:31
    5 = 想学习


    1 = 1 cdrouter_basic_1
    2 = Pass 想学习
    3 = 00:00
    4 = 想学习
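    The listing above builds its InputStreamReader without naming a charset, so the result depends on the platform default. As a hedged alternative, the sketch below decodes the file with a charset the caller states explicitly and returns a String, which could then be passed to new Source(html) as in the listing. It assumes the caller knows the file's real encoding. HtmlFileReader, read, and roundTrip are hypothetical names, and the temporary file exists only to make the example self-contained.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; it is not part of the original post.
final class HtmlFileReader {
    // Reads the whole file and decodes it with the charset the caller names,
    // instead of relying on the platform default.
    static String read(Path file, Charset actualCharset) {
        try {
            return new String(Files.readAllBytes(file), actualCharset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes html to a temporary file in one charset and reads it back in another.
    static String roundTrip(String html, Charset writtenAs, Charset readAs) {
        try {
            Path tmp = Files.createTempFile("test", ".html");
            try {
                Files.write(tmp, html.getBytes(writtenAs));
                return read(tmp, readAs);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK
        String html = "<td>\u60F3\u5B66\u4E60</td>";
        // Reading with the same charset the file was written in preserves the text.
        System.out.println(html.equals(roundTrip(html, gbk, gbk)));
    }
}
```

    With this approach the parser receives already-decoded characters, so the result no longer depends on the machine the program runs on.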




    Copyright notice: this is an original article by the blog author and may not be reproduced without permission.

  • Original article: https://www.cnblogs.com/bhlsheji/p/4878370.html