  • Learning Qt: Parsing HTML in C++

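    When I used to write crawlers, I parsed pages in a crude way: I simply looked for patterns in the markup. Later I came across gumbo online, tried it, and found that it works really well, so I am recommending it here.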

    The following is reposted from: http://blog.csdn.net/whyistao/article/details/37919581

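    1. C++ does not seem to have many HTML parsing libraries to choose from. I ended up integrating htmlcxx into a Qt project. At first I wrote INCLUDEPATH += <path> in the .pro file, but that alone did not work (INCLUDEPATH only tells the compiler where to find headers; it does not compile the library's source files).
    It turned out that the htmlcxx .cc and .h files simply need to be added to SOURCES and HEADERS with +=, like this: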
    SOURCES += main.cpp \
            html/utils.cc \
            html/Uri.cc \
            html/ParserSax.cc \
            html/ParserDom.cc \
            html/Node.cc \
            html/Extensions.cc
    HEADERS  += mainwindow.h \
            html/utils.h \
            html/Uri.h \
            html/tree.h \
            html/ParserSax.h \
            html/ParserDom.h \
            html/Node.h \
            html/Extensions.h \
            html/debug.h \
            html/ci_string.h \
            html/wincstring.h \
            html/tld.h
    
    Reference:   htmlcxx for qt(mingw)      http://blog.chinaunix.net/uid-21525518-id-1824657.html
    
    
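    2. Parsing with gumbo
    Add the .c and .h files to the project the same way as above. Here are notes on the gumbo types used most often.
    GumboOutput    parse result
    gumbo_parse() parses the HTML source and returns a GumboOutput; output->root is the root node.
    GumboOutput* output = gumbo_parse(htmlString.c_str());
    GumboNode* node = output->root;
    GumboNode    node
    GumboNode* node;
    Accessing what a node holds:
    node->v.text.text           // text of the node (valid for text nodes only)
    node->v.element.children    // list of child nodes (valid for element nodes only)
    node->type                  // type of the node
    GumboVector    node container
    For example,   GumboVector* children = &node->v.element.children;   gets the list of a node's children.
    (GumboNode*) ( children->data[i] )     // the i-th node in that list
    GumboAttribute    node attribute
    GumboAttribute* href;
    if (node->v.element.tag == GUMBO_TAG_A &&   (href = gumbo_get_attribute(&node->v.element.attributes, "href")))
    {    std::cout << href->value << std::endl;  }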
    
    
    Node types (general DOM node types)
      ELEMENT_NODE: ordinary element node, e.g. <html>, <p>, <div>, <span>, <img>
      ATTRIBUTE_NODE: element attribute
      TEXT_NODE: text node
      CDATA_SECTION_NODE: i.e. <![CDATA[ ]]>
      ENTITY_REFERENCE_NODE: entity reference, e.g. &amp;
      ENTITY_NODE: entity, e.g. <!ENTITY copyright "Copyright 2010, impng. All rights reserved">
      PROCESSING_INSTRUCTION_NODE: PI, processing instruction, e.g. <?xml version="1.0"?>
      COMMENT_NODE: comment, <!-- -->
      DOCUMENT_NODE: root node, i.e. document.nodeType
      DOCUMENT_TYPE_NODE: DTD, document type, <!DOCTYPE >
      DOCUMENT_FRAGMENT_NODE: document fragment
      NOTATION_NODE: a notation defined in the DTD
    
    In code, a gumbo node can have one of the following types           (usage:       node->type ==  GUMBO_NODE_ELEMENT )
    typedef enum {
      /** Document node.  v will be a GumboDocument. */
      GUMBO_NODE_DOCUMENT,
      /** Element node.  v will be a GumboElement. */
      GUMBO_NODE_ELEMENT,
      /** Text node.  v will be a GumboText. */
      GUMBO_NODE_TEXT,
      /** CDATA node. v will be a GumboText. */
      GUMBO_NODE_CDATA,
      /** Comment node.  v. will be a GumboText, excluding comment delimiters. */
      GUMBO_NODE_COMMENT,
      /** Text node, where all contents is whitespace.  v will be a GumboText. */
      GUMBO_NODE_WHITESPACE
    } GumboNodeType;
    
    Tag types:                           (usage:    node->v.element.tag != GUMBO_TAG_SCRIPT   )
    typedef enum {
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/semantics.html#the-root-element
      GUMBO_TAG_HTML,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/semantics.html#document-metadata
      GUMBO_TAG_HEAD,
      GUMBO_TAG_TITLE,
      GUMBO_TAG_BASE,
      GUMBO_TAG_LINK,
      GUMBO_TAG_META,
      GUMBO_TAG_STYLE,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/scripting-1.html#scripting-1
      GUMBO_TAG_SCRIPT,
      GUMBO_TAG_NOSCRIPT,
      GUMBO_TAG_TEMPLATE,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/sections.html#sections
      GUMBO_TAG_BODY,
      GUMBO_TAG_ARTICLE,
      GUMBO_TAG_SECTION,
      GUMBO_TAG_NAV,
      GUMBO_TAG_ASIDE,
      GUMBO_TAG_H1,
      GUMBO_TAG_H2,
      GUMBO_TAG_H3,
      GUMBO_TAG_H4,
      GUMBO_TAG_H5,
      GUMBO_TAG_H6,
      GUMBO_TAG_HGROUP,
      GUMBO_TAG_HEADER,
      GUMBO_TAG_FOOTER,
      GUMBO_TAG_ADDRESS,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/grouping-content.html#grouping-content
      GUMBO_TAG_P,
      GUMBO_TAG_HR,
      GUMBO_TAG_PRE,
      GUMBO_TAG_BLOCKQUOTE,
      GUMBO_TAG_OL,
      GUMBO_TAG_UL,
      GUMBO_TAG_LI,
      GUMBO_TAG_DL,
      GUMBO_TAG_DT,
      GUMBO_TAG_DD,
      GUMBO_TAG_FIGURE,
      GUMBO_TAG_FIGCAPTION,
      GUMBO_TAG_MAIN,
      GUMBO_TAG_DIV,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/text-level-semantics.html#text-level-semantics
      GUMBO_TAG_A,
      GUMBO_TAG_EM,
      GUMBO_TAG_STRONG,
      GUMBO_TAG_SMALL,
      GUMBO_TAG_S,
      GUMBO_TAG_CITE,
      GUMBO_TAG_Q,
      GUMBO_TAG_DFN,
      GUMBO_TAG_ABBR,
      GUMBO_TAG_DATA,
      GUMBO_TAG_TIME,
      GUMBO_TAG_CODE,
      GUMBO_TAG_VAR,
      GUMBO_TAG_SAMP,
      GUMBO_TAG_KBD,
      GUMBO_TAG_SUB,
      GUMBO_TAG_SUP,
      GUMBO_TAG_I,
      GUMBO_TAG_B,
      GUMBO_TAG_U,
      GUMBO_TAG_MARK,
      GUMBO_TAG_RUBY,
      GUMBO_TAG_RT,
      GUMBO_TAG_RP,
      GUMBO_TAG_BDI,
      GUMBO_TAG_BDO,
      GUMBO_TAG_SPAN,
      GUMBO_TAG_BR,
      GUMBO_TAG_WBR,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/edits.html#edits
      GUMBO_TAG_INS,
      GUMBO_TAG_DEL,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/embedded-content-1.html#embedded-content-1
      GUMBO_TAG_IMAGE,
      GUMBO_TAG_IMG,
      GUMBO_TAG_IFRAME,
      GUMBO_TAG_EMBED,
      GUMBO_TAG_OBJECT,
      GUMBO_TAG_PARAM,
      GUMBO_TAG_VIDEO,
      GUMBO_TAG_AUDIO,
      GUMBO_TAG_SOURCE,
      GUMBO_TAG_TRACK,
      GUMBO_TAG_CANVAS,
      GUMBO_TAG_MAP,
      GUMBO_TAG_AREA,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/the-map-element.html#mathml
      GUMBO_TAG_MATH,
      GUMBO_TAG_MI,
      GUMBO_TAG_MO,
      GUMBO_TAG_MN,
      GUMBO_TAG_MS,
      GUMBO_TAG_MTEXT,
      GUMBO_TAG_MGLYPH,
      GUMBO_TAG_MALIGNMARK,
      GUMBO_TAG_ANNOTATION_XML,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/the-map-element.html#svg-0
      GUMBO_TAG_SVG,
      GUMBO_TAG_FOREIGNOBJECT,
      GUMBO_TAG_DESC,
      // SVG title tags will have GUMBO_TAG_TITLE as with HTML.
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/tabular-data.html#tabular-data
      GUMBO_TAG_TABLE,
      GUMBO_TAG_CAPTION,
      GUMBO_TAG_COLGROUP,
      GUMBO_TAG_COL,
      GUMBO_TAG_TBODY,
      GUMBO_TAG_THEAD,
      GUMBO_TAG_TFOOT,
      GUMBO_TAG_TR,
      GUMBO_TAG_TD,
      GUMBO_TAG_TH,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/forms.html#forms
      GUMBO_TAG_FORM,
      GUMBO_TAG_FIELDSET,
      GUMBO_TAG_LEGEND,
      GUMBO_TAG_LABEL,
      GUMBO_TAG_INPUT,
      GUMBO_TAG_BUTTON,
      GUMBO_TAG_SELECT,
      GUMBO_TAG_DATALIST,
      GUMBO_TAG_OPTGROUP,
      GUMBO_TAG_OPTION,
      GUMBO_TAG_TEXTAREA,
      GUMBO_TAG_KEYGEN,
      GUMBO_TAG_OUTPUT,
      GUMBO_TAG_PROGRESS,
      GUMBO_TAG_METER,
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/interactive-elements.html#interactive-elements
      GUMBO_TAG_DETAILS,
      GUMBO_TAG_SUMMARY,
      GUMBO_TAG_MENU,
      GUMBO_TAG_MENUITEM,
      // Non-conforming elements that nonetheless appear in the HTML5 spec.
      // http://www.whatwg.org/specs/web-apps/current-work/multipage/obsolete.html#non-conforming-features
      GUMBO_TAG_APPLET,
      GUMBO_TAG_ACRONYM,
      GUMBO_TAG_BGSOUND,
      GUMBO_TAG_DIR,
      GUMBO_TAG_FRAME,
      GUMBO_TAG_FRAMESET,
      GUMBO_TAG_NOFRAMES,
      GUMBO_TAG_ISINDEX,
      GUMBO_TAG_LISTING,
      GUMBO_TAG_XMP,
      GUMBO_TAG_NEXTID,
      GUMBO_TAG_NOEMBED,
      GUMBO_TAG_PLAINTEXT,
      GUMBO_TAG_RB,
      GUMBO_TAG_STRIKE,
      GUMBO_TAG_BASEFONT,
      GUMBO_TAG_BIG,
      GUMBO_TAG_BLINK,
      GUMBO_TAG_CENTER,
      GUMBO_TAG_FONT,
      GUMBO_TAG_MARQUEE,
      GUMBO_TAG_MULTICOL,
      GUMBO_TAG_NOBR,
      GUMBO_TAG_SPACER,
      GUMBO_TAG_TT,
      // Used for all tags that don't have special handling in HTML.
      GUMBO_TAG_UNKNOWN,
      // A marker value to indicate the end of the enum, for iterating over it.
      // Also used as the terminator for varargs functions that take tags.
      GUMBO_TAG_LAST,
    } GumboTag;
    
    
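    3. While using gumbo I hit the error RtlWerpReportException failed with status code :-1073741823.
    At first I thought it was a stack overflow, but it turned out that my own code logic was wrong. It is best to follow the usage in the official demo: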
    if (node->v.element.tag == GUMBO_TAG_A &&      (href = gumbo_get_attribute(&node->v.element.attributes, "href"))) 
    {    std::cout << href->value << std::endl;  }
    
    
    4. Compiling gumbo produced this error:
     error: 'for' loop initial declarations are only allowed in C99 mode
    so these two lines need to be added to the project's .pro configuration:
    QMAKE_CFLAGS_DEBUG +=  --std=c99
    QMAKE_CFLAGS_RELEASE +=  --std=c99

    Please credit the source when reposting: http://www.cnblogs.com/fnlingnzb-learner/p/5835428.html

  • Original article: https://www.cnblogs.com/fnlingnzb-learner/p/5835428.html