  • Parsing html markup text using MSHTML

    Introduction:

    I often work with content in the form of HTML and need to manipulate it programmatically. Until now I did this by using regular expressions to "parse" the HTML, which let me find tags that carry particular attributes.

    This works well enough, but some people aren't familiar with regular expression syntax and struggle to maintain and extend the code for manipulating the markup.

    A much simpler and more developer-friendly option is to reference the mshtml assembly. I will illustrate its use with an oversimplified example. I am going to mention regular expressions, but I'm not going to go into the syntax or even show any statements, since that is a different subject altogether.

    Problem scenario:

    The pages on my website contain elements with formatting attributes hard-coded on them, instead of having all the formatting set through a class reference to a stylesheet.

    This means that an element will have its bgcolor attribute set to "blue" and its border attribute set to "1". For example:

    <p bgcolor="blue" color="red" border="1">bla di bla bla</p>

    I want to set a class attribute on every element that has both of these attributes with these values, that is, any element with a bgcolor of "blue" and a border of "1". The following element qualifies too:

    <td bgcolor=blue id="mytd" onclick="alert('clicked');" border="1">Hello</td>

    So how can I find every tag in the markup that has these two attributes with the correct values? A plain string search will not suffice. A regular expression can do it, but when the order of border and bgcolor is switched, the expression becomes considerably more complex. For example:

    <td border="1" id="mytd" onclick="alert('clicked');" bgcolor=blue>Hello</td>

    Now we can't assume that the bgcolor attribute will be found first and then the border attribute. And what about when we want to search on three attributes?


     

    Solution

    What we want to do is loop through the HTML elements in the markup and look for elements that satisfy our requirements. We check this by looking up each attribute by name, independent of the order in which the attributes were written. If all the required attributes match, the tag qualifies for the update.

    We need a way to tell our method which attributes to look for, their corresponding values, and the new attribute key/value pairs to set on each qualifying element.
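    The order-independent check described above can be sketched without MSHTML at all. The following is only an illustration, assuming each element's attributes are already available as a name/value dictionary; the AttributeMatch class and its Qualifies method are hypothetical names and are not part of the article's code or of the mshtml assembly.

```csharp
using System;
using System.Collections.Generic;

public static class AttributeMatch
{
    // Hypothetical helper: returns true when every required name/value pair is
    // present on the element, regardless of the order the attributes were written in.
    public static bool Qualifies(
        IDictionary<string, string> elementAttributes,
        IDictionary<string, string> required)
    {
        foreach (KeyValuePair<string, string> req in required)
        {
            string actual;
            if (!elementAttributes.TryGetValue(req.Key, out actual) ||
                !string.Equals(actual, req.Value, StringComparison.OrdinalIgnoreCase))
            {
                return false; // attribute missing or value differs
            }
        }
        return true;
    }

    public static void Main()
    {
        var required = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "bgcolor", "blue" }, { "border", "1" }
        };

        // The same attributes as the two <td> examples, in both orders.
        var bgcolorFirst = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "bgcolor", "blue" }, { "id", "mytd" }, { "border", "1" }
        };
        var borderFirst = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "border", "1" }, { "id", "mytd" }, { "bgcolor", "blue" }
        };
        var wrongValues = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "bgcolor", "yellow" }, { "border", "2" }
        };

        Console.WriteLine(Qualifies(bgcolorFirst, required)); // True
        Console.WriteLine(Qualifies(borderFirst, required));  // True
        Console.WriteLine(Qualifies(wrongValues, required));  // False
    }
}
```

    Because the lookup is by name, adding a third required attribute only means adding one more entry to the required dictionary; the matching logic does not change.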

    Code

    First we have to add a reference to the Microsoft.mshtml assembly:

    In the Solution Explorer, highlight the project to which you want to add the parsing functionality
    In the menu, click on Project -> Add Reference
    In the dialog box that is shown, under the .NET tab, choose the Microsoft.mshtml assembly
    Click the Select button and then click the OK button

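    Now we can import its namespace (System.Collections is also needed, because the method below uses ArrayList):

    using System.Collections;
    using mshtml;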

    Our class will contain one method that takes three parameters:
    a string containing the markup to parse, an ArrayList of key/value pairs that must be present on an element for it to qualify for the update, and an ArrayList of new key/value pairs to set on the qualifying elements.
    We also have a struct that serves as a container for our attribute key/value pairs.
    
    namespace MarkupOps
        {
            public class ServerParse
            {
                /// <summary>
                /// Searches the markup for tags that have all the key/value pairs in the searchList ArrayList.
                /// When it finds such a tag, it sets all the key/value pairs contained in setList on it.
                /// </summary>
                /// <param name="inMarkup">The markup to search and replace in</param>
                /// <param name="searchList">An ArrayList of KeyValue objects that a tag must have before qualifying for the
                /// properties to be set</param>
                /// <param name="setList">A list of attributes to set on the qualifying objects</param>
                /// <returns>The updated markup, or the original markup if searchList is empty</returns>
                public static string UpdateAttributes(string inMarkup, ArrayList searchList, ArrayList setList)
                {
                    if (searchList.Count > 0)
                    {
                        //reads the html into an html document to enable parsing
                        IHTMLDocument2 doc = new HTMLDocumentClass ();
                        doc.write (new object [] {inMarkup});
                        doc.close ();
                        //loops through each element in the document to check if it qualifies for the attributes to be set
                        foreach(IHTMLElement el in (IHTMLElementCollection)doc.body.all)
                        {
                            // check to see if all the desired attributes were found with the correct values
                            bool qualify =true;
                            foreach(KeyValue att in searchList)
                            {
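                                //stop at the first required attribute that is missing or has the wrong value
                                //(getAttribute can return null for an absent attribute, so guard before ToString)
                                object attVal = el.getAttribute(att.key, 0);
                                if(attVal == null || attVal.ToString().ToLower() != att.val.ToLower())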
                                {
                                    qualify=false;
                                    break;
                                }
                            }
                            // if all the required attributes matched - we can proceed in setting the values 
                            if(qualify)
                            {
                                foreach(KeyValue setAtt in setList)
                                {
                                    //sets the value on the object, if the att already exists, it's overwritten 
                                    el.setAttribute(setAtt.key, setAtt.val, 0);
                                }
                            }
                        }
                        return doc.body.outerHTML;
                    }
                    return inMarkup;
                 }
    
            }
    
    
            //this just helps to contain the key values
            public struct KeyValue
            {
                public string key;
                public string val;
                public KeyValue(string key, string val)
                {
                    this.key=key;
                    this.val=val;
                }
            }
        }					
    					
    Using the code
    Our example html text
    <table bgcolor="red">
    <tr>
    <td bgcolor="yellow" border="2">Name</td>
    <td id="qualify1" border="1" bgcolor=blue></td>
    </tr>
    <tr>
    <td><p id="qualify2" bgcolor="blue" border="1">Surname</p></td>
    <td></td>
    </tr>
    <tr>
    <td>address</td>
    <td></td>
    </tr>
    </table>					
    					

    We want to parse this HTML and look for tags (of any kind) that have the following attributes:

    1. bgcolor=blue
    2. border=1


    When a qualifying tag is found, its className property (the DOM name for the HTML class attribute) will be set to "blueBorder".

    
    //populates an arraylist with the keyvalue pairs which will qualify the tags
    ArrayList searchList = new ArrayList();
    KeyValue kv = new KeyValue("bgcolor", "blue");
    searchList.Add(kv);
    kv = new KeyValue("border", "1");
    searchList.Add(kv);
    
    //populates an arraylist with the keyvalue pairs which will be set on any qualified tags
    ArrayList setList = new ArrayList();
    kv = new KeyValue("className", "blueBorder");
    setList.Add(kv);
    
    
    //assume the markupContent variable contains the example html above
    //we pass the variables into the method in order to get the parsed and updated markup back
    
    markupContent = ServerParse.UpdateAttributes(markupContent, searchList, setList);					
    					
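    The resulting text (shown in the original formatting for comparison; because the method returns doc.body.outerHTML, the actual string is wrapped in the body element, and MSHTML may normalize tag case and attribute quoting)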
    
    <table bgcolor="red">
    <tr>
    <td bgcolor="yellow" border="2">Name</td>
    <td id="qualify1" border="1" class="blueBorder" bgcolor=blue></td>
    </tr>
    <tr>
    <td><p id="qualify2" class="blueBorder" bgcolor="blue" border="1">Surname</p></td>
    <td></td>
    </tr>
    <tr>
    <td>address</td>
    <td></td>
    </tr>
    </table>					
    					
    Conclusion

    You can use this approach anywhere you want to manipulate markup based on a search,
    and it is a much simpler process than using regular expressions.

    It can also be used to process markup in a Windows application.

  • Original article: https://www.cnblogs.com/jintan/p/1998878.html