  • Using XPath to select tags from an HTML document

    1. HTML page content:
    <div class="mnr-c _yE">
        <div class="_kk _wI">In the news</div>
        <li class="card-section _df g _mZd">
            <div class="_K2 _SYd">
                <div style="overflow:hidden;width:134px;height:100px" class="thumb">
                <a href="http://www.bbc.co.uk/news/uk-30172110" onmousedown="return rwt(this,'','','','2','AFQjCNG3I0r8D75WjgjZODuobF8ne7wCNw','','0CCwQpwIwAQ','','',event)">
                    <img height="100" id="uid_0" src="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==" width="134" border="0">
                </a>
                </div>
            </div>
            <div class="_I2">
                <a class="_Dk" href="http://www.bbc.co.uk/news/uk-30172110" onmousedown="return rwt(this,'','','','2','AFQjCNG3I0r8D75WjgjZODuobF8ne7wCNw','','0CC0QqQIwAQ','','',event)">
                Google case over online abuse settled</a>
                <div class="_Ck kv">
                    <cite>BBC News</cite><span class="f"> - </span>
                    <span class="f" style="white-space:nowrap">21 hours ago
                    </span>
                </div>
            </div>
            <span class="_dwd st s std" style="margin-left:144px">
            A UK businessman who took <em>Google</em> to court over malicious web postings about him&nbsp;...</span>
        </li>
        <div>
            <li class="g _Nn _wbb card-section">
                <a class="_Dk" href="http://www.pcworld.com/article/2851812/google-to-apps-users-take-more-responsibility-for-protecting-your-accounts.html" onmousedown="return rwt(this,'','','','3','AFQjCNH0fmBCNMjPanXErfX6GQmDNsZK7Q','','0CC8QqQIwAg','','',event)">
                New Google Apps dashboard helps users protect accounts</a>
                <div class="_Ck kv">
                    <cite>PCWorld</cite><span class="f"> - </span>
                    <span class="f" style="white-space:nowrap">5 hours ago</span>
                </div>
            </li>
            <li class="g _Nn _Abb card-section">
                <a class="_Dk" href="http://www.forbes.com/sites/georgeanders/2014/11/24/google-and-facebook-rewire-the-internet-as-fcc-dithers/" onmousedown="return rwt(this,'','','','4','AFQjCNGcPEbPFsUfSxeCneg_aFYBX65fNQ','','0CDEQqQIwAw','','',event)">
                Google And Facebook Rewire The Internet As FCC Dithers</a>
                <div class="_Ck kv">
                    <cite>Forbes</cite><span class="f"> - </span>
                    <span class="f" style="white-space:nowrap">8 hours ago</span>
                </div>
            </li>
        </div>

    2. Selecting the tags:

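    // Select every <li> in the document whose class attribute is "g" or contains "g".
    // Note: contains() is a substring test, so the @class='g' check is redundant.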

    var allLiNodes = htmlDoc.DocumentNode.SelectNodes(@"//li[@class='g' or contains(@class,'g')]");
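
Because `contains(@class,'g')` matches any class value with the letter "g" in it, it can select more than intended. The sketch below is not the HtmlAgilityPack code above; it evaluates the same XPath 1.0 expression with Java's standard `javax.xml.xpath` API and compares it with a commonly used whole-token idiom. The class name `ClassTokenSketch`, the trimmed markup, and the `page-item` row are assumptions made for illustration. The standard Java parser needs well-formed XML, so the sample is reduced to a few `<li>` elements.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ClassTokenSketch {
    // Well-formed stand-in for the page. The first two class values come from
    // the sample; the "page-item" row is hypothetical and has no "g" class.
    static final String SAMPLE =
        "<ul>"
        + "<li class=\"card-section _df g _mZd\">BBC News</li>"
        + "<li class=\"g _Nn _wbb card-section\">PCWorld</li>"
        + "<li class=\"page-item\">not a result</li>"
        + "</ul>";

    // Counts the nodes that an XPath 1.0 expression selects in SAMPLE.
    static int count(String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(SAMPLE)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Substring test: "page-item" also contains the letter g.
        System.out.println(count("//li[@class='g' or contains(@class,'g')]")); // 3
        // Whole-token test: pad with spaces, then look for " g ".
        System.out.println(count(
            "//li[contains(concat(' ', normalize-space(@class), ' '), ' g ')]")); // 2
    }
}
```

The whole-token expression is plain XPath 1.0, so it should also work as the argument to `SelectNodes`, though that is not something the original post tested.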

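    // Starting from the current node, select the first descendant <img> that has an
    // ancestor <a> with an href attribute (SelectSingleNode returns null if none matches).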

    var imageNode = aImageTagNode.SelectSingleNode(@".//img[./ancestor::a/@href]");
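
To see what the `ancestor::` predicate filters out, here is a small sketch that runs the same expression through Java's standard `javax.xml.xpath` API instead of HtmlAgilityPack. The class name `AncestorAxisSketch` and the two extra `<img>` elements (`no_href` and `bare`) are hypothetical, added only so that there is something to exclude. The `<img>` tags are self-closed because the standard Java parser needs well-formed XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AncestorAxisSketch {
    // Only the first <img> mirrors the sample page; the other two are hypothetical.
    static final String SAMPLE =
        "<div class=\"thumb\">"
        + "<a href=\"http://www.bbc.co.uk/news/uk-30172110\"><img id=\"uid_0\"/></a>"
        + "<a><img id=\"no_href\"/></a>"
        + "<img id=\"bare\"/>"
        + "</div>";

    // Evaluates the expression relative to the root <div>, which plays the role
    // of the current node, and returns the id of every selected element.
    static List<String> imageIds(String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(SAMPLE)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(xpath, doc.getDocumentElement(), XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                ids.add(((Element) nodes.item(i)).getAttribute("id"));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Every <img> below the current node, in document order.
        System.out.println(imageIds(".//img"));                      // [uid_0, no_href, bare]
        // Only images wrapped in a link that actually has an href.
        System.out.println(imageIds(".//img[./ancestor::a/@href]")); // [uid_0]
    }
}
```

The sketch returns every match, whereas `SelectSingleNode` in the original would return only the first one.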

    3. w3school examples:

    http://www.w3school.com.cn/xpath/xpath_axes.asp

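    This was my first time using XPath. I mainly used it to parse crawler output, then stored, tested, and published the results; the accuracy turned out to be quite high.

    Also, to get an XPath directly from a browser:

    Press F12 to open the developer tools, locate the element, right-click its tag, and choose "Copy XPath" to copy the expression.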

  • Original article: https://www.cnblogs.com/shy-huang/p/4140127.html