Main-Content Extraction: Fetching Web Pages with curl

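I have recently been writing a main-content extraction program in C++ on Linux. The overall pipeline is: fetch the web page --> parse the page --> build a modified DOM tree --> run the content-extraction algorithm --> emit structured output.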

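The first stage is finished, and I am now debugging the second and third. Many pages on the Internet are written by "unlicensed" programmers and are far from well-formed, so the parser needs a fair amount of fault tolerance, which takes time. On top of that, I did not previously understand character encodings, so converting between encodings during parsing is also a hard problem for me. I expect to solve it gradually as I keep learning, which should also fill a gap in my technical skills.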
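Before any encoding conversion can happen, the program has to find out which charset a page declares. The following is a minimal sketch of that first step, assuming the response headers (or the beginning of the HTML) are already available as a string. The helper name `declared_charset` is hypothetical and not part of the original program, and the scan is deliberately naive: it takes the first `charset=` it finds.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not from the original post): returns the charset
// named in text such as "Content-Type: text/html; charset=GBK" or
// "<meta charset="utf-8">", lower-cased, or "" when none is declared.
std::string declared_charset(const std::string &text)
{
    // Lower-case a copy so the search is case-insensitive.
    std::string lower(text);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::string key = "charset=";
    std::string::size_type pos = lower.find(key);
    if (pos == std::string::npos) {
        return "";
    }
    pos += key.size();
    if (pos < lower.size() && (lower[pos] == '"' || lower[pos] == '\'')) {
        ++pos;  // skip an opening quote
    }
    // The value ends at the first delimiter (or at the end of the text).
    const std::string::size_type end = lower.find_first_of("\"'; \t\r\n>", pos);
    return lower.substr(pos, end == std::string::npos ? std::string::npos : end - pos);
}
```

The returned name could then be handed to a converter such as iconv. Pages that declare no charset, or declare the wrong one, would still need separate handling.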

Fetching a page can be done with the curl library (libcurl), and it is simple. There are five main functions:

　　1. CURL *curl_easy_init( )

This function must be the first function to call, and it returns a CURL easy handle that you must use as input to other easy-functions.

　　In short: call it first, and pass the CURL pointer it returns as the easy handle to the other easy functions.
CURL *curl = curl_easy_init();
　　2. CURLcode curl_easy_setopt(CURL *handle, CURLoption option, parameter)

　　curl_easy_setopt() is used to tell libcurl how to behave. By using the appropriate options to curl_easy_setopt, you can change libcurl’s behavior.

　　In short: this function configures how libcurl handles the transfer, and different options produce different behavior.
CURLcode code;
code = curl_easy_setopt(curl, CURLOPT_ERRORBUFFER, error);       /* store error messages in error (at least CURL_ERROR_SIZE bytes) */
curl_easy_setopt(curl, CURLOPT_VERBOSE, 1L);                     /* non-zero: print verbose details about the transfer */
code = curl_easy_setopt(curl, CURLOPT_URL, url);                 /* the URL to fetch */
code = curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);       /* non-zero: follow redirects sent in a "Location:" response header */
code = curl_easy_setopt(curl, CURLOPT_HEADERFUNCTION, writer);   /* callback writer is invoked for each response header line */
code = curl_easy_setopt(curl, CURLOPT_HEADERDATA, &header);      /* last argument passed to the header callback (formerly CURLOPT_WRITEHEADER) */
code = curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, writer);    /* callback writer is invoked for each chunk of the response body */
code = curl_easy_setopt(curl, CURLOPT_WRITEDATA, &content);      /* last argument passed to the body callback */
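The calls above use `writer`, `header`, `content`, and `error` without showing their definitions. The sketch below is one plausible shape for the callback, assuming `header` and `content` are `std::string` objects; the original post does not say how they are declared. The function signature is the one libcurl documents for both CURLOPT_WRITEFUNCTION and CURLOPT_HEADERFUNCTION.

```cpp
#include <cstddef>
#include <string>

// Callback shape libcurl expects for CURLOPT_WRITEFUNCTION and
// CURLOPT_HEADERFUNCTION. It receives size * nmemb bytes (not
// NUL-terminated) and returns how many bytes it handled.
std::size_t writer(char *data, std::size_t size, std::size_t nmemb, void *userdata)
{
    if (userdata == nullptr) {
        // Reporting fewer bytes than were delivered makes libcurl abort the transfer.
        return 0;
    }
    // Assumption: userdata is the std::string passed through
    // CURLOPT_WRITEDATA or CURLOPT_HEADERDATA.
    std::string *buffer = static_cast<std::string *>(userdata);
    const std::size_t bytes = size * nmemb;
    buffer->append(data, bytes);
    return bytes;
}
```

Because libcurl may deliver a response in many chunks, the callback appends rather than assigns. Using the same function for headers and body works here only because both targets are plain strings.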
　　3. CURLcode curl_easy_perform(CURL *handle);

　　This function is called after the init and all the curl_easy_setopt(3) calls are made, and will perform the transfer as described in the options.

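　　In short: call this after the curl_easy_setopt calls. It carries out the transfer according to the options that were set (adding HTTP header fields, handling the response headers and body, and so on).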
code = curl_easy_perform(curl);
　　4. void curl_easy_cleanup(CURL *handle);

　　This function must be the last function to call for an easy session. It is the opposite of the curl_easy_init(3) function and must be called with the same handle as input that the curl_easy_init call returned.

　　In short: this must be the last call in each easy session. It is the counterpart of curl_easy_init and must receive the same handle that curl_easy_init returned.
curl_easy_cleanup(curl);
　　5. CURLcode curl_easy_getinfo(CURL *curl, CURLINFO info, ... );

　　Request internal information from the curl session with this function.

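　　This function retrieves internal information from a curl session. Call it before curl_easy_cleanup, while the handle is still valid.
long retcode = 0;  /* CURLINFO_RESPONSE_CODE expects a pointer to long */
code = curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &retcode);
if ((code == CURLE_OK) && retcode == 200) {
    ....
}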
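　　That is roughly how curl is used. I am now working on the content extraction itself, and the main parts are starting to show results, but there is still work to do before reaching a 100% extraction rate with 90% accuracy.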

　　Still working on it...

Original post: https://www.cnblogs.com/geekma/p/2640270.html