  • HTML parser in C (gumbo-parser); interfaces for other languages are also available

    Download:
    git clone https://github.com/google/gumbo-parser.git

    Install gcc and the other build tools beforehand:
    sudo apt-get install libtool

    $ cd gumbo-parser/
    $ ./autogen.sh
    $ ./configure
    $ make
    $ sudo make install
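    • The example code is under examples/. Running make builds the examples automatically and places the binaries in the gumbo-parser/ directory.

    Note that all commands are run from the gumbo-parser/ directory.

    You can modify an example and rebuild it yourself: from the gumbo-parser/ directory, run make followed by the program name (without the .cc suffix). For example, after editing examples/find_links.cc, rebuild it with make find_links. The resulting executable is placed in the repository root.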

    • To compile and link gumbo into your own program, the required flags can be printed with pkg-config:
    • $ pkg-config --cflags --libs gumbo 
    • $ gcc my_program.c `pkg-config --cflags --libs gumbo`

    gtest can be integrated as well. The official make check did not work for me.

    git clone gtest and enter its directory.

    sudo cmake  CMakeLists.txt

    make # builds two static libraries: libgtest.a and libgtest_main.a

    sudo cp ./lib/libgtest*.a  /usr/lib

    Test code:

    #include <gtest/gtest.h>

    int add(int a, int b) {
      return a + b;
    }

    TEST(testCase, test0) {
      EXPECT_EQ(add(2, 3), 5);
    }

    int main(int argc, char** argv) {
      testing::InitGoogleTest(&argc, argv);
      return RUN_ALL_TESTS();
    }
    
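    Author: bowen_4ae0
    Link: https://www.jianshu.com/p/96158afbb91d
    Source: Jianshu
    Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source.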

    In a terminal opened in the directory containing this file (sample.cpp), enter the compile command:

             $ g++ -o sample sample.cpp -lgtest -lpthread

             $ ./sample

    Reference: https://www.jianshu.com/p/96158afbb91d

    Note that the order in which the libraries are listed matters: pthread must come last!

    Reference: https://github.com/google/gumbo-parser

    The example code, modified to traverse and print all text nodes:

    // Copyright 2013 Google Inc. All Rights Reserved.
    //
    // Licensed under the Apache License, Version 2.0 (the "License");
    // you may not use this file except in compliance with the License.
    // You may obtain a copy of the License at
    //
    //     http://www.apache.org/licenses/LICENSE-2.0
    //
    // Unless required by applicable law or agreed to in writing, software
    // distributed under the License is distributed on an "AS IS" BASIS,
    // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    // See the License for the specific language governing permissions and
    // limitations under the License.
    //
    // Author: jdtang@google.com (Jonathan Tang)
    //
    // Finds the URLs of all links in the page.
    
    #include <stdlib.h>
    
    #include <fstream>
    #include <iostream>
    #include <string>
    
    #include "gumbo.h"
    
    static void search_for_links(GumboNode* node) {
      if (node->type != GUMBO_NODE_ELEMENT) {
        return;
      }
      GumboAttribute* href;
      if (node->v.element.tag == GUMBO_TAG_A &&
          (href = gumbo_get_attribute(&node->v.element.attributes, "href"))) {
        std::cout << href->value << std::endl;
      }
    
      GumboVector* children = &node->v.element.children;
      for (unsigned int i = 0; i < children->length; ++i) {
        search_for_links(static_cast<GumboNode*>(children->data[i]));
      }
    }
    
    
    // Recursively prints the contents of every text node below `node`.
    static void search_for_text(GumboNode* node) {
      if (node->type == GUMBO_NODE_TEXT) {
        std::cout << node->v.text.text << std::endl;
        return;
      }
      GumboVector* children = NULL;
      if (node->type == GUMBO_NODE_ELEMENT ||
          node->type == GUMBO_NODE_TEMPLATE) {
        if (node->type == GUMBO_NODE_TEMPLATE) {
          std::cout << "=== GUMBO_NODE_TEMPLATE ===" << std::endl;
        }
        children = &node->v.element.children;
      } else if (node->type == GUMBO_NODE_DOCUMENT) {
        // A document node keeps its children in v.document, not v.element.
        std::cout << "=== GUMBO_NODE_DOCUMENT ===" << std::endl;
        children = &node->v.document.children;
      } else {
        return;  // Other node types (comments, whitespace, ...) have no children.
      }
      for (unsigned int i = 0; i < children->length; ++i) {
        search_for_text(static_cast<GumboNode*>(children->data[i]));
      }
    }
    
    int main(int argc, char** argv) {
      if (argc != 2) {
        std::cout << "Usage: find_links <html filename>.\n";
        exit(EXIT_FAILURE);
      }
      const char* filename = argv[1];
    
      std::ifstream in(filename, std::ios::in | std::ios::binary);
      if (!in) {
        std::cout << "File " << filename << " not found!\n";
        exit(EXIT_FAILURE);
      }
    
      std::string contents;
      in.seekg(0, std::ios::end);
      contents.resize(in.tellg());
      in.seekg(0, std::ios::beg);
      in.read(&contents[0], contents.size());
      in.close();
    
      GumboOutput* output = gumbo_parse(contents.c_str());
      //search_for_links(output->root);
      search_for_text(output->root);
      gumbo_destroy_output(&kGumboDefaultOptions, output);
    }
  • Original article: https://www.cnblogs.com/bigben0123/p/14031624.html