  • Treating HTML like XML using HtmlAgilityPack, and doing it inside of an XSLT too [Repost]

    I was not able to post this on Simon Mourier's blog due to the HTML and XSLT tags, so here it is on mine:

    Maybe someone has done this already, but I don't see it in the comments.

    I created an XSLT extension object based on HtmlAgilityPack. The class is tiny:

    using System;
    using System.Collections.Generic;
    using System.Text;
    using HtmlAgilityPack;
    using System.Xml;
    using System.Xml.XPath;
    using System.IO;

    namespace HtmlAgilityPack
    {
        public class XslExtension
        {
            public XmlDocument loadhtmlasxml(string url)
            {
                // Create an instance of the HtmlWeb object
                HtmlWeb web = new HtmlWeb();
                // Buffer the converted markup in memory
                MemoryStream m = new MemoryStream();
                XmlTextWriter xtw = new XmlTextWriter(m, null);
                // Fetch the page and write it to the writer as well-formed XML
                web.LoadHtmlAsXml(url, xtw);
                // Make sure everything buffered has reached the stream, then rewind it
                xtw.Flush();
                m.Position = 0;
                // Create, fill, and return the XML document
                XmlDocument xdoc = new XmlDocument();
                xdoc.Load(m);
                return xdoc;
            }
        }
    }
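
    Before wiring the class into a style sheet, it can be exercised on its own. The following is only a minimal sketch: the XslExtensionSmokeTest class name is made up for illustration, the URL is the one used elsewhere in this post, and the call needs network access.

    ```csharp
    using System;
    using System.Xml;
    using HtmlAgilityPack;

    // Hypothetical console harness; not part of the original extension class.
    public static class XslExtensionSmokeTest
    {
        public static void Main()
        {
            // Fetch the page and convert it to an XmlDocument
            XslExtension ext = new XslExtension();
            XmlDocument doc = ext.loadhtmlasxml("http://www.cnn.com");

            // Same idea as count(//*) in XPath: how many elements were parsed
            Console.WriteLine(doc.SelectNodes("//*").Count);
        }
    }
    ```

    The printed number depends on the live page, so no particular value should be expected.
    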


    Then I used NXSLT (from http://www.xmllab.net) to load the custom extension function from the command line, so that the following XSL style sheet can call it directly:

    <xsl:stylesheet
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:hap="http://smourier.blogspot.com"
     xmlns:msxsl="urn:schemas-microsoft-com:xslt"
          version="1.0">

     <xsl:output method="html" omit-xml-declaration="yes" indent="no"/>

     <xsl:template match="/">

      <h1>BEGIN TEST OF HtmlAgilityPack.XslExtension</h1>

      <h2>First, connect to http://www.cnn.com and load its node set into a local variable</h2>   

      <xsl:variable name="cnn"><xsl:copy-of select="hap:loadhtmlasxml('http://www.cnn.com')" /></xsl:variable>

      <h3>CNN.com has this many nodes:</h3>

      <xsl:value-of select="count(msxsl:node-set($cnn)//*)" />
      <h2>Now, process all the A tags within the "Special Coverage" stories inside the div with class "cnnLSSpecialCovBoxContent" that have an HREF that starts with /2005/.</h2>
       <h3>Special Coverage</h3>
        <xsl:for-each select="msxsl:node-set($cnn)//div[@class='cnnLSSpecialCovBoxContent']//a[starts-with(@href, '/2005/')]">
       <div>
        <h3><xsl:copy-of select="." /></h3>
        <!-- Now get the images from each story if they exist -->
        <h5>Connecting to: <xsl:value-of select="concat('http://www.cnn.com', @href)" /> to retrieve image if it exists</h5>
        <xsl:copy-of select="hap:loadhtmlasxml(concat('http://www.cnn.com', @href))//img[@height = '168']" />
       <br /><br />
       </div>
       </xsl:for-each>
      <h1>END TEST OF HtmlAgilityPack.XslExtension</h1>
     </xsl:template>

    </xsl:stylesheet>


    The command for NXSLT to perform this is:


    nxslt2.exe source.xml source.xsl -ext hap:HtmlAgilityPack.XslExtension xmlns:hap="http://smourier.blogspot.com" -af .\HtmlAgilityPackXslExtension.dll
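
    For anyone who would rather not depend on the command-line tool, a rough programmatic counterpart is sketched below. It assumes .NET's XslCompiledTransform and XsltArgumentList, reuses the file names and namespace URI from the command above, and uses a made-up RunTransform class name. It is not taken from NXSLT itself.

    ```csharp
    using System;
    using System.Xml;
    using System.Xml.Xsl;
    using HtmlAgilityPack;

    // Hypothetical host program; the class name is illustrative only.
    public static class RunTransform
    {
        public static void Main()
        {
            // Must match the xmlns:hap declaration in the style sheet
            const string hapNamespace = "http://smourier.blogspot.com";

            // Compile the style sheet
            XslCompiledTransform xslt = new XslCompiledTransform();
            xslt.Load("source.xsl");

            // Bind the extension object to the hap namespace
            XsltArgumentList xsltArgs = new XsltArgumentList();
            xsltArgs.AddExtensionObject(hapNamespace, new XslExtension());

            // Honor the xsl:output settings when writing the result to stdout
            using (XmlReader input = XmlReader.Create("source.xml"))
            using (XmlWriter output = XmlWriter.Create(Console.Out, xslt.OutputSettings))
            {
                xslt.Transform(input, xsltArgs, output);
            }
        }
    }
    ```

    The namespace URI passed to AddExtensionObject is what ties the hap: prefix in the style sheet to the C# object, so the two strings have to be identical.
    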

    The style sheet connects to CNN.com using the syntax:

    select="hap:loadhtmlasxml('http://www.cnn.com')"

    Then, further down, for each selected A element it connects to the linked story and retrieves any images with a height of 168, copying them into the HTML result tree.

    This approach could follow links to any depth. I haven't worked out an automatic form processor yet, but I think that could be an XSLT extension too...
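
    To make the idea of following links to a given depth concrete, here is one possible sketch written directly in C# rather than in XSLT. The LinkFollower class, its Crawl method, and the depth parameter are all hypothetical. It only collects the URLs it loads and does no error handling for pages that fail to download.

    ```csharp
    using System;
    using System.Collections.Generic;
    using HtmlAgilityPack;

    // Hypothetical helper; not part of the original extension.
    public static class LinkFollower
    {
        // Loads startUrl, then follows its links down to maxDepth levels.
        // Returns every distinct absolute URL that was loaded.
        public static HashSet<string> Crawl(string startUrl, int maxDepth)
        {
            HashSet<string> visited = new HashSet<string>();
            Visit(new Uri(startUrl), maxDepth, new HtmlWeb(), visited);
            return visited;
        }

        private static void Visit(Uri url, int depth, HtmlWeb web, HashSet<string> visited)
        {
            // Stop when the depth budget is used up or the page was already seen
            if (depth < 0 || !visited.Add(url.AbsoluteUri)) return;

            HtmlDocument doc = web.Load(url.AbsoluteUri);

            // SelectNodes returns null when nothing matches
            HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
            if (links == null) return;

            foreach (HtmlNode link in links)
            {
                Uri next;
                // Resolve relative hrefs such as /2005/... against the current page
                if (Uri.TryCreate(url, link.GetAttributeValue("href", ""), out next)
                    && (next.Scheme == Uri.UriSchemeHttp || next.Scheme == Uri.UriSchemeHttps))
                {
                    Visit(next, depth - 1, web, visited);
                }
            }
        }
    }
    ```

    A call such as LinkFollower.Crawl("http://www.cnn.com", 1) would load the front page and each page it links to, so the depth should be kept small on link-heavy sites.
    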

    Let me know what you think...
    http://blogs.wdevs.com/ultravioletconsulting/archive/2005/09/10/10506.aspx

  • Original post: https://www.cnblogs.com/shanyou/p/HtmlAgilityPack.html