  • Parsing HTML Files

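    Parsing an HTML file and extracting data from it with the .NET Framework classes is not the easiest of tasks. You can use framework classes such as StreamReader to walk through the file line by line, but the API exposed by XmlReader does not work "out of the box" on HTML, because HTML is often not well-formed. Regular expressions are another option, but if you are not comfortable writing them they can be hard to get started with.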
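    To make the "out of the box" problem concrete, here is a minimal sketch that uses only the standard System.Xml classes. The class and method names (StrictXmlDemo, IsWellFormedXml) are hypothetical and exist only for this illustration. It shows that a strict XML reader rejects markup that browsers accept without complaint, such as an unclosed <br> tag.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper for illustration; not part of the original article.
public static class StrictXmlDemo
{
    // Returns true when the markup parses as well-formed XML, false otherwise.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(markup)))
            {
                // Walk the whole document; a well-formedness error surfaces as XmlException.
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

    Calling StrictXmlDemo.IsWellFormedXml("<p>Hello<br>world</p>") returns false because the <br> element is never closed, while the same markup written with <br/> parses cleanly. This is the gap that an SGML-aware reader is meant to fill.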

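    Microsoft XML guru Chris Lovett recently released a new SGML parser named SgmlReader on http://www.gotdotnet.com. It can parse HTML files and even convert them into a well-formed structure. SgmlReader derives from XmlReader, which means you can parse HTML files the same way you would parse XML files with a class such as XmlTextReader. In this article I show how to use the SgmlReader class to parse an HTML file and produce well-formed HTML, so that you can extract data with XPath statements.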

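    Creating an SgmlReader instance to parse HTML
    Before you start using SgmlReader, download it from gotdotnet.com and place the assembly in your application's bin folder. Once the assembly is available, write the code that reads the HTML you want to parse. The example in this article uses the HttpWebRequest and HttpWebResponse objects to fetch a remote HTML file:

        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(uri);
        HttpWebResponse res = (HttpWebResponse)req.GetResponse();
        StreamReader sReader = new StreamReader(res.GetResponseStream());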

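    Once you have the remote HTML file, you can create an instance of the SgmlReader class. Setting its DocType property to "HTML" tells the reader that it is working with an HTML document:

        SgmlReader reader = new SgmlReader();
        reader.DocType = "HTML";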

    The HTML response stream is handed to the SgmlReader instance for parsing through its InputStream property. First load the HTML stream into a TextReader object, then assign that TextReader to InputStream:

        reader.InputStream = new StringReader(sReader.ReadToEnd());

    Now you can parse the HTML file by calling SgmlReader's Read() method and copying each node to an XmlTextWriter:

        StringWriter sw = new StringWriter();
        XmlTextWriter writer = new XmlTextWriter(sw);
        writer.Formatting = Formatting.Indented;
        while (reader.Read()) {
            if (reader.NodeType != XmlNodeType.Whitespace)
                writer.WriteNode(reader, true);
        }
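    One detail of this loop is worth spelling out: WriteNode() does not only write the current node, it also advances the reader past it, so a following Read() call moves one node further. The sketch below is a variant that only calls Read() itself when it skips whitespace. It uses a plain XmlReader as a stand-in for SgmlReader (any XmlReader-derived class can be passed in), and the names ReaderCopyDemo and CopyToIndentedXml are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper for illustration; works with any XmlReader-derived class.
public static class ReaderCopyDemo
{
    // Copies every non-whitespace top-level node from the reader into an indented XML string.
    public static string CopyToIndentedXml(XmlReader reader)
    {
        StringWriter sw = new StringWriter();
        XmlTextWriter writer = new XmlTextWriter(sw);
        writer.Formatting = Formatting.Indented;

        reader.Read(); // position on the first node
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Whitespace)
            {
                reader.Read(); // skip whitespace manually
            }
            else
            {
                // WriteNode copies the node (with its subtree) and advances the reader.
                writer.WriteNode(reader, true);
            }
        }
        writer.Flush();
        return sw.ToString();
    }
}
```

    For example, CopyToIndentedXml(XmlReader.Create(new StringReader("<html><body><p>Hi</p></body></html>"))) yields the same document with each nested element on its own indented line. For a document whose only top-level node is the root element, the article's loop and this variant produce the same output.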

    Because SgmlReader produces well-formed HTML, you can use XPath statements to read individual nodes. The following code loads the output generated by SgmlReader into an XPathNavigator and then queries the HTML structure with an XPath statement:

        StringBuilder sb = new StringBuilder();
        XPathDocument doc = new XPathDocument(new StringReader(sw.ToString()));
        XPathNavigator nav = doc.CreateNavigator();
        XPathNodeIterator nodes = nav.Select(xpath);
        while (nodes.MoveNext()) {
            sb.Append(nodes.Current.Value);
        }
        return sb.ToString();
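    As a worked example of the query step, the sketch below runs an XPath expression against a small, already well-formed HTML string, standing in for the text SgmlReader would produce. The names XPathDemo and SelectValues and the sample markup are assumptions made for this illustration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.XPath;

// Hypothetical helper for illustration; the input must already be well-formed.
public static class XPathDemo
{
    // Returns the string values of all nodes matched by the XPath expression, separated by spaces.
    public static string SelectValues(string wellFormedHtml, string xpath)
    {
        XPathDocument doc = new XPathDocument(new StringReader(wellFormedHtml));
        XPathNavigator nav = doc.CreateNavigator();
        XPathNodeIterator nodes = nav.Select(xpath);

        StringBuilder sb = new StringBuilder();
        while (nodes.MoveNext())
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append(nodes.Current.Value);
        }
        return sb.ToString();
    }
}
```

    With the input "<html><body><a href=\"a.html\">A</a><a href=\"b.html\">B</a></body></html>", the expression "//a/@href" returns "a.html b.html" and "//a" returns "A B". XPath name tests are case-sensitive, so the element names in the expression must match the case used in the well-formed output.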


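    If you are already familiar with the XPath language and with the different XML parsing APIs in the .NET Framework, you will find it easy to parse HTML and extract data with the SgmlReader class.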

    Code excerpt (C#)

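        // Requires: using System; using System.IO; using System.Net; using System.Text;
        //           using System.Xml; using System.Xml.XPath;
        //           using Sgml;  (namespace assumed for the SgmlReader assembly)
        private string GetWellFormedHTML(string uri, string xpath) {
            HttpWebResponse res = null;
            StreamReader sReader = null;
            StringWriter sw = null;
            SgmlReader reader = null;
            XmlTextWriter writer = null;
            try {
                if (String.IsNullOrEmpty(uri)) uri = "http://www.XMLforASP.NET";
                // Fetch the remote HTML document.
                HttpWebRequest req = (HttpWebRequest)WebRequest.Create(uri);
                res = (HttpWebResponse)req.GetResponse();
                sReader = new StreamReader(res.GetResponseStream());
                // Configure SgmlReader to treat the input as HTML.
                reader = new SgmlReader();
                reader.DocType = "HTML";
                reader.InputStream = new StringReader(sReader.ReadToEnd());
                // Copy the parsed nodes into an indented, well-formed document.
                sw = new StringWriter();
                writer = new XmlTextWriter(sw);
                writer.Formatting = Formatting.Indented;
                //writer.WriteStartElement("Test");
                while (reader.Read()) {
                    if (reader.NodeType != XmlNodeType.Whitespace) {
                        writer.WriteNode(reader, true);
                    }
                }
                //writer.WriteEndElement();
                writer.Flush();
                if (xpath == null) {
                    // No filter requested: return the whole well-formed document.
                    return sw.ToString();
                } else { // Filter out nodes from HTML
                    StringBuilder sb = new StringBuilder();
                    XPathDocument doc = new XPathDocument(new StringReader(sw.ToString()));
                    XPathNavigator nav = doc.CreateNavigator();
                    XPathNodeIterator nodes = nav.Select(xpath);
                    while (nodes.MoveNext()) {
                        sb.Append(nodes.Current.Value + " ");
                    }
                    return sb.ToString();
                }
            } catch (Exception exp) {
                return exp.Message;
            } finally {
                // Release resources on success and on failure; any of them may still
                // be null if an exception was thrown early.
                if (writer != null) writer.Close();
                if (reader != null) reader.Close();
                if (sw != null) sw.Close();
                if (sReader != null) sReader.Close();
                if (res != null) res.Close();
            }
        }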

  • Original article: https://www.cnblogs.com/soundcode/p/3785157.html