  • Treating HTML like XML using HtmlAgilityPack, and doing it inside of an XSLT too [Repost]

    I was not able to post this on Simon Mourier's blog due to the HTML and XSLT tags, so here it is on mine:

    Maybe someone has done this already, but I don't see it in the comments.

    I created an XSLT extension object based on HtmlAgilityPack. The class is tiny:

    using System;
    using System.Collections.Generic;
    using System.Text;
    using HtmlAgilityPack;
    using System.Xml;
    using System.Xml.XPath;
    using System.IO;

    namespace HtmlAgilityPack
    {
        public class XslExtension
        {
            // Fetches the page at url and returns it as a well-formed XML document
            public XmlDocument loadhtmlasxml(string url)
            {
                // Create an instance of the HtmlWeb object
                HtmlWeb web = new HtmlWeb();
                // Declare necessary stream and writer objects
                MemoryStream m = new MemoryStream();
                XmlTextWriter xtw = new XmlTextWriter(m, null);
                // Load the content into the writer
                web.LoadHtmlAsXml(url, xtw);
                // Make sure anything still buffered in the writer reaches the stream
                xtw.Flush();
                // Rewind the memory stream
                m.Position = 0;
                // Create, fill, and return the xml document
                XmlDocument xdoc = new XmlDocument();
                xdoc.LoadXml((new StreamReader(m)).ReadToEnd());
                return xdoc;
            }
        }
    }

    Then I used NXSLT from http://www.xmllab.net to load the custom extension object from the command line, so that the following XSL style sheet can be used directly:


     <!-- The hap namespace URI must match the xmlns:hap binding passed to nxslt -->
     <xsl:stylesheet version="1.0"
         xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
         xmlns:msxsl="urn:schemas-microsoft-com:xslt"
         xmlns:hap="http://smourier.blogspot.com">

     <xsl:output method="html" omit-xml-declaration="yes" indent="no"/>

     <xsl:template match="/">

      <h1>BEGIN TEST OF HtmlAgilityPack.XslExtension</h1>

      <h2>First, connect to http://www.cnn.com and load its node set into a local variable</h2>

      <xsl:variable name="cnn"><xsl:copy-of select="hap:loadhtmlasxml('http://www.cnn.com')" /></xsl:variable>

      <h3>CNN.com has this many nodes:</h3>

      <xsl:value-of select="count(msxsl:node-set($cnn)//*)" />
      <h2>Now, process all the A tags within the "Special Coverage" stories inside the div with class="cnnLSSpecialCovBoxContent" that have an HREF that starts with /2005/.</h2>
      <h3>Special Coverage</h3>
      <xsl:for-each select="msxsl:node-set($cnn)//div[@class='cnnLSSpecialCovBoxContent']//a[starts-with(@href, '/2005/')]">
        <h3><xsl:copy-of select="." /></h3>
        <!-- Now get the images from each story if they exist -->
        <h5>Connecting to: <xsl:value-of select="concat('http://www.cnn.com', @href)" /> to retrieve image if it exists</h5>
        <xsl:copy-of select="hap:loadhtmlasxml(concat('http://www.cnn.com', @href))//img[@height = '168']" />
        <br /><br />
      </xsl:for-each>
      <h1>END TEST OF HtmlAgilityPack.XslExtension</h1>

     </xsl:template>

     </xsl:stylesheet>


    The command for NXSLT to perform this is:

    nxslt2.exe source.xml source.xsl -ext hap:HtmlAgilityPack.XslExtension xmlns:hap="http://smourier.blogspot.com" -af .\HtmlAgilityPackXs
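
    For readers who would rather not depend on nxslt, the same wiring can probably be done directly in C# with XslCompiledTransform and XsltArgumentList.AddExtensionObject. The following is only a minimal sketch under stated assumptions, not how nxslt itself is implemented: StubExtension and ExtensionDemo are hypothetical names, and the stub returns a canned document (as an XPathNavigator) so that the example runs without HtmlAgilityPack or network access. In practice you would pass an instance of the XslExtension class shown earlier.

    ```csharp
    using System;
    using System.IO;
    using System.Xml;
    using System.Xml.XPath;
    using System.Xml.Xsl;

    // Hypothetical stand-in for HtmlAgilityPack.XslExtension: same method name,
    // but it returns a fixed document instead of fetching the URL.
    public class StubExtension
    {
        public XPathNavigator loadhtmlasxml(string url)
        {
            XmlDocument doc = new XmlDocument();
            doc.LoadXml("<html><body><a href='/2005/a'>A</a><a href='/x'>B</a></body></html>");
            return doc.CreateNavigator();
        }
    }

    public static class ExtensionDemo
    {
        // Same namespace URI as the xmlns:hap binding on the nxslt command line.
        const string HapNs = "http://smourier.blogspot.com";

        public static string Run()
        {
            // Tiny stylesheet that calls the extension function and counts the A elements.
            string xslt =
                "<xsl:stylesheet version='1.0'" +
                " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'" +
                " xmlns:hap='" + HapNs + "'>" +
                "<xsl:output method='text'/>" +
                "<xsl:template match='/'>" +
                "<xsl:value-of select=\"count(hap:loadhtmlasxml('http://example.invalid')//a)\"/>" +
                "</xsl:template>" +
                "</xsl:stylesheet>";

            XslCompiledTransform transform = new XslCompiledTransform();
            using (XmlReader styleReader = XmlReader.Create(new StringReader(xslt)))
            {
                transform.Load(styleReader);
            }

            // Bind the extension object to the namespace the stylesheet uses.
            XsltArgumentList args = new XsltArgumentList();
            args.AddExtensionObject(HapNs, new StubExtension());

            // The source document is irrelevant here, just as source.xml is for the test above.
            StringWriter output = new StringWriter();
            using (XmlReader input = XmlReader.Create(new StringReader("<dummy/>")))
            {
                transform.Transform(input, args, output);
            }
            return output.ToString();
        }
    }
    ```

    Calling ExtensionDemo.Run() returns the transform output as a string. The key point is that the namespace URI given to AddExtensionObject has to be the same one the stylesheet binds to the hap prefix.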

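    The style sheet connects to CNN.com using the syntax:

    <xsl:variable name="cnn"><xsl:copy-of select="hap:loadhtmlasxml('http://www.cnn.com')" /></xsl:variable>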

    Then, further down, as it processes each of the selected A elements, it connects to the linked story and retrieves any images with a height of 168, copying them into the HTML result tree.
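
    The two XPath filters used by the style sheet can be tried in isolation. Here is a small sketch that runs them with XmlDocument.SelectNodes against an invented, well-formed page. The markup, the XPathFilterDemo class, and its Select helper are all assumptions made for illustration; the real CNN markup is not reproduced here.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Xml;

    public static class XPathFilterDemo
    {
        // Invented stand-in for the XML that loadhtmlasxml might return.
        const string Page =
            "<html><body>" +
            "<div class='cnnLSSpecialCovBoxContent'>" +
            "<a href='/2005/WORLD/story1/'>Story 1</a>" +
            "<a href='/SPECIALS/other/'>Other</a>" +
            "</div>" +
            "<img height='168' src='lead.jpg'/>" +
            "<img height='50' src='thumb.jpg'/>" +
            "</body></html>";

        // Returns the given attribute of every element matched by xpath.
        public static List<string> Select(string xpath, string attribute)
        {
            XmlDocument doc = new XmlDocument();
            doc.LoadXml(Page);
            List<string> values = new List<string>();
            foreach (XmlNode node in doc.SelectNodes(xpath))
            {
                values.Add(node.Attributes[attribute].Value);
            }
            return values;
        }
    }
    ```

    With this sample page, Select("//div[@class='cnnLSSpecialCovBoxContent']//a[starts-with(@href, '/2005/')]", "href") yields only /2005/WORLD/story1/, and Select("//img[@height = '168']", "src") yields only lead.jpg. Note that @height = '168' is a string comparison, so an attribute written as height="168px" would not match.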

    This could allow links to be followed to any depth. I haven't worked out the automatic form processor yet, but I think that could perhaps be an XSLT extension too...
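
    If links really were followed to arbitrary depth, some bound on depth and some memory of visited pages would likely be needed to keep the process from looping. The sketch below shows one way that might look. It is not part of the extension above: LinkFollower, Crawl, and the in-memory Pages dictionary are hypothetical, and the "site" is a handful of invented pages so that the example runs without any network access.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Xml;

    public static class LinkFollower
    {
        // Hypothetical in-memory "site": URL -> well-formed markup.
        static readonly Dictionary<string, string> Pages = new Dictionary<string, string>
        {
            { "/",  "<html><a href='/a'/><a href='/b'/></html>" },
            { "/a", "<html><a href='/c'/><a href='/'/></html>" },
            { "/b", "<html/>" },
            { "/c", "<html><a href='/d'/></html>" },
            { "/d", "<html/>" }
        };

        // Returns the pages reached from start, following links at most maxDepth levels deep.
        public static List<string> Crawl(string start, int maxDepth)
        {
            List<string> visited = new List<string>();
            Visit(start, maxDepth, visited);
            return visited;
        }

        static void Visit(string url, int depthLeft, List<string> visited)
        {
            // Skip pages already seen (prevents cycles) and unknown URLs.
            if (visited.Contains(url) || !Pages.ContainsKey(url)) return;
            visited.Add(url);
            if (depthLeft == 0) return;

            XmlDocument doc = new XmlDocument();
            doc.LoadXml(Pages[url]);
            foreach (XmlNode link in doc.SelectNodes("//a/@href"))
            {
                Visit(link.Value, depthLeft - 1, visited);
            }
        }
    }
    ```

    With these sample pages, Crawl("/", 1) visits /, /a and /b, while Crawl("/", 2) also reaches /c but stops before /d. The link from /a back to / is ignored because / has already been visited.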

    Let me know what you think...


  • Original article: https://www.cnblogs.com/shanyou/p/HtmlAgilityPack.html