  • [PHP] Extracting web page data with XPath

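    To parse HTML content with XPath, PHP ships with two built-in classes: DOMDocument and DOMXPath. During initialization, loadHTML() usually emits many warnings on real-world markup. They do not affect usage, so they are suppressed with the @ operator.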

        /**
         * Initialize the DOMXPath object.
         *
         * @param string $content Page content (HTML)
         * @param array  $patinfo Match configuration
         *
         * @return void
         */
        private function _createXpathObj($content, $patinfo)
        {
            // Skip initialization when no XPath rules are configured
            if (!$this->_existsXpathParse($patinfo)) {
                return;
            }
            try {
                $dom = new DOMDocument();
                // @ silences the parser warnings that malformed markup triggers
                @$dom->loadHTML($content);
                $dom->normalize();
                $xpath = new DOMXPath($dom);
                $this->xpathObj = $xpath;
            } catch (Exception $e) {
                getService('logger')->warning('Parse html fail', ['content' => $content]);
            }
        }
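
    As an alternative to the @ operator, libxml can be told to buffer parser errors instead of raising warnings. The following is a minimal sketch, not the original author's code: loadHtmlQuietly() is a hypothetical helper name and the sample markup is made up for illustration.

```php
<?php
// Sketch: load HTML without @ by letting libxml collect errors internally.
// loadHtmlQuietly() is a hypothetical helper, not part of the original class.
function loadHtmlQuietly(string $content): array
{
    // Buffer libxml errors and remember the previous setting
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($content);

    // Fetch what the parser reported, then empty the buffer
    $errors = libxml_get_errors();
    libxml_clear_errors();

    // Restore whatever error handling was active before
    libxml_use_internal_errors($previous);

    return [$loaded ? new DOMXPath($dom) : null, $errors];
}

// Assumed sample input; real pages are usually far messier
[$xpath, $errors] = loadHtmlQuietly(
    '<html><body><div><p>p1</p><unknown>x</unknown></div></body></html>'
);
```

    Each entry in $errors is a LibXMLError object, so fields such as $error->message and $error->line can be logged when diagnosing a bad page, which the @ operator would hide.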
    

    In the method below, $node is a DOMElement object.

        /**
         * Get the value matched by an XPath expression.
         *
         * @param string $pat XPath expression
         *
         * @return string
         */
        private function _getXpathField($pat)
        {
            $objs = $this->xpathObj->query($pat);
            // query() returns false for a malformed expression
            if ($objs !== false && $objs->length > 0) {
                $node = $objs->item(0);
                $outerHTML = $node->ownerDocument->saveHTML($node);
                return trim($outerHTML);
                // Alternative: return the inner HTML
                //$innerHTML = '';
                //foreach ($node->childNodes as $childNode){
                //     $innerHTML .= $childNode->ownerDocument->saveHTML($childNode);
                //}
                //return $innerHTML;
                // Alternative: return the text without tags
                //return $node->textContent; // or $node->nodeValue
            }
            return '';
        }
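
    The method above returns only the first match. When every matching node is needed, the node list can be iterated directly. A minimal sketch under assumptions: xpathTexts() is a hypothetical standalone function and the markup is sample data, neither comes from the original class.

```php
<?php
// Sketch: collect the trimmed text of every node matched by an expression.
// xpathTexts() is a hypothetical helper name.
function xpathTexts(DOMXPath $xpath, string $pat): array
{
    $nodes = $xpath->query($pat);
    // query() returns false when the expression is malformed
    if ($nodes === false) {
        return [];
    }

    $texts = [];
    foreach ($nodes as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Assumed sample document
$dom = new DOMDocument();
$dom->loadHTML('<html><body><div><p>p1</p><p>p2</p></div></body></html>');
$xpath = new DOMXPath($dom);

$texts = xpathTexts($xpath, '//div/p');
```

    Returning an empty array for both "no match" and "bad expression" keeps the caller simple. If the two cases need to be told apart, the false check could throw instead.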
    

    Example

    <?php
            $dom = new DOMDocument('1.0','UTF-8');
            $dom->loadHTML('<html><body><div><p>p1</p><p>p2</p></div></body></html>');        
            $node = $dom->getElementsByTagName('div')->item(0);        
            $outerHTML = $node->ownerDocument->saveHTML($node);        
            $innerHTML = '';
            foreach ($node->childNodes as $childNode){
                    $innerHTML .= $childNode->ownerDocument->saveHTML($childNode);
            }
            echo '<h2>outerHTML: </h2>';
            echo htmlspecialchars($outerHTML);
            echo '<h2>innerHTML: </h2>';
            echo htmlspecialchars($innerHTML);        
    ?>
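
    The example above reaches the node with getElementsByTagName(). The same kind of extraction can be done through DOMXPath, which also gives access to attributes and to scalar results via evaluate(). This is a sketch only: the markup, the id "list", and the variable names are assumptions made for illustration.

```php
<?php
// Assumed sample markup with two links inside a container
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML(
    '<html><body><div id="list">'
    . '<a href="/a">First</a><a href="/b">Second</a>'
    . '</div></body></html>'
);
$xpath = new DOMXPath($dom);

// query() yields a node list; each matched <a> is a DOMElement
$links = [];
foreach ($xpath->query('//div[@id="list"]/a') as $a) {
    $links[] = [
        'href' => $a->getAttribute('href'),
        'text' => trim($a->textContent),
    ];
}

// evaluate() can return scalars for expressions such as count() or string()
$count = $xpath->evaluate('count(//div[@id="list"]/a)');
$firstText = $xpath->evaluate('string(//div[@id="list"]/a[1])');
```

    query() fits expressions that select nodes, while evaluate() is convenient when only a number or a string is wanted and no node handling is required.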




  • Original article: https://www.cnblogs.com/wangluochong/p/13222665.html