
Demo: Scraping News with Gecco, a Lightweight Java Crawler

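Preface

I recently came across the Gecco crawler and found it simple and pleasant to use, so I wrote a demo to try it out. The demo crawls http://zj.zjol.com.cn/home.html and extracts the title and publication time of each news article. Gecco selects HTML nodes with jQuery-style CSS selectors, which is very convenient, and it matches URLs through annotations, so the code stays concise and tidy.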
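
To make the idea of annotation-based URL matching more concrete, here is a minimal sketch of how a template such as http://zj.zjol.com.cn/news/{code}.html could be turned into a regular expression and matched against a URL. This is an illustration only, not Gecco's actual implementation; the class name UrlTemplate and its match method are hypothetical and use nothing but the JDK.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Gecco. */
final class UrlTemplate {
    // Finds placeholders such as {code} or {pageIndex}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    private final Pattern regex;
    private final List<String> names = new ArrayList<>();

    UrlTemplate(String template) {
        StringBuilder sb = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            // Quote literal text so that '.', '?' and similar characters match verbatim.
            sb.append(Pattern.quote(template.substring(last, m.start())));
            // A placeholder captures one run of characters that are not URL delimiters.
            sb.append("([^/&?#]+)");
            names.add(m.group(1));
            last = m.end();
        }
        sb.append(Pattern.quote(template.substring(last)));
        regex = Pattern.compile(sb.toString());
    }

    /** Returns the placeholder values, or null when the URL does not match. */
    Map<String, String> match(String url) {
        Matcher m = regex.matcher(url);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> values = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            values.put(names.get(i), m.group(i + 1));
        }
        return values;
    }

    public static void main(String[] args) {
        UrlTemplate detail = new UrlTemplate("http://zj.zjol.com.cn/news/{code}.html");
        System.out.println(detail.match("http://zj.zjol.com.cn/news/123456.html"));
    }
}
```

Under this sketch, a crawler would try each registered template in turn and hand the captured values to the matching bean, which is roughly the role that the matchUrl attribute plays in the annotated classes of this demo.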

Gecco on GitHub
https://github.com/xtuhcy/gecco
Gecco author's blog
http://my.oschina.net/u/2336761/blog?fromerr=ZuKKo3fH

Adding the Maven dependency

<dependency>
      <groupId>com.geccocrawler</groupId>
      <artifactId>gecco</artifactId>
      <version>1.0.8</version>
</dependency>

Crawling the list page

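import java.util.List;

import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Request;
import com.geccocrawler.gecco.annotation.RequestParameter;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.spider.HrefBean;
import com.geccocrawler.gecco.spider.HtmlBean;

// Bean for one page of the news list; pageIndex and pageSize are bound from the URL.
@Gecco(matchUrl = "http://zj.zjol.com.cn/home.html?pageIndex={pageIndex}&pageSize={pageSize}", pipelines = "zJNewsListPipelines")
public class ZJNewsGeccoList implements HtmlBean {
    @Request
    private HttpRequest request;
    @RequestParameter
    private int pageIndex;
    @RequestParameter
    private int pageSize;
    @HtmlField(cssPath = "#content > div > div > div.con_index > div.r.main_mod > div > ul > li  > dl > dt > a")
    private List<HrefBean> newList;

    // Getters and setters for all fields are omitted here; the pipeline calls the getters.
}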

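import com.geccocrawler.gecco.annotation.PipelineName;
import com.geccocrawler.gecco.pipeline.Pipeline;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.scheduler.SchedulerContext;
import com.geccocrawler.gecco.spider.HrefBean;

@PipelineName("zJNewsListPipelines")
public class ZJNewsListPipelines implements Pipeline<ZJNewsGeccoList> {
    @Override
    public void process(ZJNewsGeccoList zjNewsGeccoList) {
        HttpRequest request = zjNewsGeccoList.getRequest();
        for (HrefBean bean : zjNewsGeccoList.getNewList()) {
            // Queue each article's detail page for crawling
            SchedulerContext.into(request.subRequest("http://zj.zjol.com.cn" + bean.getUrl()));
        }
        int page = zjNewsGeccoList.getPageIndex() + 1;
        String nextUrl = "http://zj.zjol.com.cn/home.html?pageIndex=" + page + "&pageSize=100";
        // Queue the next list page
        SchedulerContext.into(request.subRequest(nextUrl));
    }
}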
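
The list pipeline builds its URLs by string concatenation and always queues another page. As a possible refinement, the sketch below resolves each link against the site root with java.net.URI, so that absolute and relative links are both handled, and stops paginating after a chosen page. The class ListPageUrls, its method names, and the maxPage cap are assumptions made for illustration; they are not part of Gecco or of the original demo.

```java
import java.net.URI;

/** Hypothetical URL helpers for the list pipeline; uses only the JDK. */
final class ListPageUrls {
    // Site root taken from the article.
    static final String BASE = "http://zj.zjol.com.cn/";

    /** Resolves a link against the site root; absolute links pass through unchanged. */
    static String resolve(String href) {
        return URI.create(BASE).resolve(href.trim()).toString();
    }

    /**
     * Returns the URL of the next list page, or null once maxPage is reached.
     * maxPage is an assumed, caller-chosen cap so that the crawl terminates.
     */
    static String nextPageUrl(int pageIndex, int pageSize, int maxPage) {
        if (pageIndex >= maxPage) {
            return null;
        }
        return BASE + "home.html?pageIndex=" + (pageIndex + 1) + "&pageSize=" + pageSize;
    }

    public static void main(String[] args) {
        System.out.println(resolve("/news/123456.html"));
        System.out.println(nextPageUrl(1, 100, 50));
    }
}
```

In the pipeline, one would pass bean.getUrl() through resolve before calling request.subRequest, and queue the next page only when nextPageUrl returns a non-null value.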

Crawling the detail page

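import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.spider.HtmlBean;

// Bean for a single news article page.
@Gecco(matchUrl = "http://zj.zjol.com.cn/news/{code}.html", pipelines = "zjNewsDetailPipeline")
public class ZJNewsDetail implements HtmlBean {

    // Article headline, extracted as plain text
    @Text
    @HtmlField(cssPath = "#headline")
    private String title;

    // Publication time, extracted as plain text
    @Text
    @HtmlField(cssPath = "#content > div > div.news_con > div.news-content > div:nth-child(1) > div > p.go-left.post-time.c-gray")
    private String createTime;

    // Getters and setters are omitted here; the pipeline calls the getters.
}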

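import com.geccocrawler.gecco.annotation.PipelineName;
import com.geccocrawler.gecco.pipeline.Pipeline;

@PipelineName("zjNewsDetailPipeline")
public class ZJNewsDetailPipeline implements Pipeline<ZJNewsDetail> {
    @Override
    public void process(ZJNewsDetail zjNewsDetail) {
        // Print the title and publication time of each crawled article
        System.out.println(zjNewsDetail.getTitle() + "  " + zjNewsDetail.getCreateTime());
    }
}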

Starting the crawler from main

import com.geccocrawler.gecco.GeccoEngine;

public class Main {
    public static void main(String[] args) {
        GeccoEngine.create()
                // Package of this project (contains the beans and pipelines above)
                .classpath("com.zhaochao.gecco.zj")
                // URL at which crawling starts
                .start("http://zj.zjol.com.cn/home.html?pageIndex=1&pageSize=100")
                // Number of crawler threads
                .thread(10)
                // Pause each crawler takes after finishing a request
                .interval(10)
                // Use a desktop (PC) User-Agent
                .mobile(false)
                // Start crawling
                .run();
    }
}

Crawl results

(Screenshot of the crawl output; the image is not available in this copy.)

Complete project code

http://git.oschina.net/whzhaochao/geccoDemo

Original article: https://www.cnblogs.com/lr393993507/p/5629376.html