A Few Excellent Java Crawler Projects

Java crawler projects

Large:

Nutch apache/nutch · GitHub

Well suited to building a search engine; distributed crawling is only one of its features.

Heritrix internetarchive/heritrix3 · GitHub

A fairly mature crawler.

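Small:

Crawler4j yasserg/crawler4j · GitHub

WebCollector CrawlScript/WebCollector · GitHub (by a Chinese developer)

Its goal is to let you write a crawler in under 5 minutes, and it draws on crawler4j. Getting started will certainly take more than 5 minutes, so it pays off mainly if you write crawlers often or need to write many of them. Its drawback is limited customizability.

WebMagic code4craft/webmagic · GitHub (by a Chinese developer, recommended)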

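A vertical, full-stack, modular crawler that is better suited to scraping information from a specific domain. It includes modules for downloading, scheduling, persistence, and page processing. You can implement any of these modules yourself or use the implementation it already provides, which makes it highly customizable.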
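
To make the modular idea concrete, here is a minimal sketch in plain Java. It is not WebMagic's API: the interfaces `Fetcher`, `Extractor`, and `Sink` and the class `MiniCrawler` are hypothetical stand-ins for the download, page-processing, and persistence modules, and scheduling is reduced to a queue with duplicate removal. The "site" is an in-memory map so that the sketch runs without network access.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical module interfaces for illustration; WebMagic's real ones differ.
interface Fetcher {                       // download module
    String fetch(String url);
}

interface Extractor {                     // page-processing module
    // Returns the extracted fields and appends newly found URLs to newUrls.
    Map<String, String> extract(String url, String body, List<String> newUrls);
}

interface Sink {                          // persistence module
    void save(Map<String, String> fields);
}

class MiniCrawler {
    private final Fetcher fetcher;
    private final Extractor extractor;
    private final Sink sink;
    // Scheduling module, reduced to a FIFO queue plus a set of seen URLs.
    private final Queue<String> queue = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    MiniCrawler(Fetcher fetcher, Extractor extractor, Sink sink) {
        this.fetcher = fetcher;
        this.extractor = extractor;
        this.sink = sink;
    }

    MiniCrawler addUrl(String url) {
        if (seen.add(url)) {              // enqueue each URL at most once
            queue.add(url);
        }
        return this;
    }

    void run() {
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String body = fetcher.fetch(url);
            if (body == null) {
                continue;                 // nothing downloaded, skip this URL
            }
            List<String> discovered = new ArrayList<>();
            Map<String, String> fields = extractor.extract(url, body, discovered);
            for (String next : discovered) {
                addUrl(next);
            }
            if (fields != null) {
                sink.save(fields);
            }
        }
    }
}

class MiniCrawlerDemo {
    // Crawls a fake three-page site where each page body lists its outgoing links.
    static List<Map<String, String>> crawlFakeSite() {
        Map<String, String> site = new LinkedHashMap<>();
        site.put("a", "b c");
        site.put("b", "c");
        site.put("c", "");

        List<Map<String, String>> saved = new ArrayList<>();
        Fetcher fetcher = site::get;
        Extractor extractor = (url, body, newUrls) -> {
            for (String link : body.split(" ")) {
                if (!link.isEmpty()) {
                    newUrls.add(link);
                }
            }
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("url", url);
            return fields;
        };
        Sink sink = saved::add;

        new MiniCrawler(fetcher, extractor, sink).addUrl("a").run();
        return saved;
    }

    public static void main(String[] args) {
        System.out.println(crawlFakeSite());
    }
}
```

Because each stage sits behind its own interface, replacing the in-memory `Fetcher` with an HTTP client or the list-backed `Sink` with a database writer would leave the other stages untouched. That independence is the customizability described above.
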
Here are its examples:

Writing your first crawler

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {

    // Site-level settings: retry failed downloads 3 times, sleep 100 ms between requests.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(100);

    @Override
    public void process(Page page) {
        // Queue every link that looks like a repository URL (backslashes are doubled in Java strings).
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
        // The first path segment of the URL is the author.
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
        if (page.getResultItems().get("name") == null) {
            // Not a repository page, so skip it.
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        // Start from the code4craft profile page and crawl with 5 threads.
        Spider.create(new GithubRepoPageProcessor()).addUrl("https://github.com/code4craft").thread(5).run();
    }
}
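
The two URL patterns in `process` decide which links are followed and how the author is derived. As a quick check that does not need WebMagic, the sketch below applies the same patterns with the standard `java.util.regex` package. The class name `GithubUrlPatterns`, its method names, and the sample URLs are made up for illustration, and it uses whole-string matching, which may differ in detail from how WebMagic's `regex(...)` selector applies a pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GithubUrlPatterns {
    // In Java source a regex backslash must be written twice: "\\." and "\\w".
    static final Pattern REPO = Pattern.compile("(https://github\\.com/\\w+/\\w+)");
    static final Pattern AUTHOR = Pattern.compile("https://github\\.com/(\\w+)/.*");

    // True when the whole URL has the form https://github.com/<owner>/<repo>.
    static boolean isRepoUrl(String url) {
        return REPO.matcher(url).matches();
    }

    // Returns the first path segment, or null when the URL has no second segment.
    static String author(String url) {
        Matcher m = AUTHOR.matcher(url);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String repo = "https://github.com/code4craft/webmagic";
        String profile = "https://github.com/code4craft";
        System.out.println(isRepoUrl(repo));     // true
        System.out.println(isRepoUrl(profile));  // false
        System.out.println(author(repo));        // code4craft
    }
}
```

Note that `\w` matches only letters, digits, and underscores, so with whole-string matching a repository name containing a hyphen is not accepted by this pattern.
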

Writing a crawler with annotations

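import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.ConsolePageModelPipeline;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.ExtractByUrl;
import us.codecraft.webmagic.model.annotation.HelpUrl;
import us.codecraft.webmagic.model.annotation.TargetUrl;

// Pages matching @TargetUrl are extracted into this model; @HelpUrl pages are only used to find links.
@TargetUrl("https://github.com/\\w+/\\w+")
@HelpUrl("https://github.com/\\w+")
public class GithubRepo {

    // notNull = true: pages where this field cannot be extracted are skipped.
    @ExtractBy(value = "//h1[@class='entry-title public']/strong/a/text()", notNull = true)
    private String name;

    @ExtractByUrl("https://github\\.com/(\\w+)/.*")
    private String author;

    @ExtractBy("//div[@id='readme']/tidyText()")
    private String readme;

    public static void main(String[] args) {
        OOSpider.create(Site.me().setSleepTime(1000)
                , new ConsolePageModelPipeline(), GithubRepo.class)
                .addUrl("https://github.com/code4craft").thread(5).run();
    }
}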
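
For readers wondering how a field annotation can drive extraction at all, the following is a minimal sketch of the general mechanism using only the Java standard library. It is not WebMagic's implementation: the annotation `@FromUrl` and the classes `RepoModel` and `UrlBinder` are hypothetical names invented here. The idea is that the annotation stores a regex, and a binder reads it by reflection at runtime and writes the first capture group into the field.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical annotation: a regex whose first capture group becomes the field value.
@Retention(RetentionPolicy.RUNTIME)       // must be kept at runtime for reflection to see it
@Target(ElementType.FIELD)
@interface FromUrl {
    String value();
}

class RepoModel {
    @FromUrl("https://github\\.com/(\\w+)/.*")
    String author;

    @FromUrl("https://github\\.com/\\w+/(\\w+)")
    String name;
}

class UrlBinder {
    // Creates a model instance and fills every @FromUrl field from the given URL.
    static <T> T bind(Class<T> type, String url) {
        try {
            T model = type.getDeclaredConstructor().newInstance();
            for (Field field : type.getDeclaredFields()) {
                FromUrl rule = field.getAnnotation(FromUrl.class);
                if (rule == null) {
                    continue;             // field is not annotated, leave it alone
                }
                Matcher m = Pattern.compile(rule.value()).matcher(url);
                if (m.find()) {
                    field.setAccessible(true);
                    field.set(model, m.group(1));
                }
            }
            return model;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot bind " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        RepoModel repo = bind(RepoModel.class, "https://github.com/code4craft/webmagic");
        System.out.println(repo.author + "/" + repo.name);  // code4craft/webmagic
    }
}
```

The appeal of the annotation style is that the model class carries its own extraction rules, so a new page type needs only a new annotated class rather than a new processing method.
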

Either approach can be used to crawl GitHub projects.

Original author: 偉少

Original article: https://www.cnblogs.com/chinway/p/5466028.html