Integrating Nutch 2.x with ElasticSearch: crawling + indexing

http://blog.csdn.net/eryk86/article/details/14111811

Import the Nutch project into IntelliJ from https://github.com/apache/nutch.git

Configure ivy.xml, plus gora.properties and nutch-site.xml under conf

Edit ivy/ivy.xml

Change the elasticsearch version:

<dependency org="org.elasticsearch" name="elasticsearch" rev="0.90.5" conf="*->default"/>

Uncomment the following line:

<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

Change the log4j version from 1.2.15 to 1.2.16, which fixes the failure to import some packages:

<dependency org="log4j" name="log4j" rev="1.2.16" conf="*->master" />
Edit gora.properties

Comment out the following lines:

#gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
#gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
#gora.sqlstore.jdbc.user=sa
#gora.sqlstore.jdbc.password=

Add one line:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
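As a quick sanity check of the edited file, the sketch below loads gora.properties-style text with java.util.Properties and confirms that the commented-out SQL store keys are no longer active while gora.datastore.default points at HBase. The class GoraPropsCheck and its methods are hypothetical helpers written for this illustration; they are not part of Gora or Nutch.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical helper, not part of Gora: inspects gora.properties-style text. */
final class GoraPropsCheck {

    /** Parses properties text; lines starting with '#' are comments and are skipped. */
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // Reading from an in-memory string is not expected to fail.
            throw new IllegalStateException(e);
        }
        return props;
    }

    /** True when the default datastore is HBase and no SQL store driver is active. */
    static boolean usesHBaseOnly(Properties props) {
        String store = props.getProperty("gora.datastore.default");
        return "org.apache.gora.hbase.store.HBaseStore".equals(store)
            && props.getProperty("gora.sqlstore.jdbc.driver") == null;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
            "#gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver",
            "gora.datastore.default=org.apache.gora.hbase.store.HBaseStore");
        System.out.println(usesHBaseOnly(load(text)));
    }
}
```

In practice the text would be read from conf/gora.properties instead of an inline string.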
Edit nutch-site.xml and add the following properties:

<property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
</property>
<property>
    <name>http.agent.name</name>
    <value>NutchCrawler</value>
</property>
<property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
</property>
<property>
    <name>http.accept.language</name>
    <value>ja-jp, en-us, zh-cn,en-gb,en;q=0.7,*;q=0.3</value>
</property>
<property>
    <name>generate.batch.id</name>
    <value>1</value>
</property>
Add the HBase configuration file hbase-site.xml to nutch/conf:

<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>file:///data/hbase</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/data/zookeeper</value>
    </property>
</configuration>
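Both nutch-site.xml and hbase-site.xml use the same configuration/property/name/value layout, so a typo in a property name is easy to miss. The following is a minimal sketch, using only the JDK's DOM parser, that lists the name/value pairs in such a file. SiteXmlReader is a hypothetical helper for this post and does not reproduce Hadoop's own Configuration loading (for example, it ignores defaults and variable substitution).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: reads name/value pairs from a Hadoop-style site XML string. */
final class SiteXmlReader {

    /** Returns the properties in document order; later duplicates overwrite earlier ones. */
    static Map<String, String> read(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed site XML", e);
        }
        Map<String, String> props = new LinkedHashMap<>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element property = (Element) nodes.item(i);
            String name = childText(property, "name");
            if (name != null) {
                props.put(name, childText(property, "value"));
            }
        }
        return props;
    }

    /** Trimmed text of the first matching child element, or null if there is none. */
    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<configuration><property><name>hbase.rootdir</name>"
            + "<value>file:///data/hbase</value></property></configuration>";
        System.out.println(read(xml));
    }
}
```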
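Edit nutch/src/bin/nutch and add the following at the top of the file:

  NUTCH_JAVA_HOME=/usr/local/jdk

Change line 109 of org.apache.nutch.indexer.elastic.ElasticWriter under src so that it works with ES 0.90.5:

  item.isFailed()

Delete all template files under nutch/conf

Build Nutch:

  ant clean
  ant runtime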

Edit nutch-site.xml again and point plugin.folders at the built plugins directory:

<property>
    <name>plugin.folders</name>
    <value>/home/eryk/workspace/nutch/runtime/local/plugins</value>
</property>

Configure IntelliJ: add nutch/conf and nutch/runtime/local/lib to the classpath

File -> Project Structure -> Dependencies: add the nutch/conf and nutch/runtime/local/lib directories

Add these dependencies to pom.xml:

<dependency>
    <groupId>net.sourceforge.nekohtml</groupId>
    <artifactId>nekohtml</artifactId>
    <version>1.9.15</version>
</dependency>
<dependency>
    <groupId>org.ccil.cowan.tagsoup</groupId>
    <artifactId>tagsoup</artifactId>
    <version>1.2</version>
</dependency>
<dependency>
    <groupId>rome</groupId>
    <artifactId>rome</artifactId>
    <version>1.0</version>
</dependency>

Change the ES version in pom.xml:

<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch</artifactId>
    <version>0.90.5</version>
    <optional>true</optional>
</dependency>

Resolve the version conflicts between dependencies by excluding the transitive Jackson artifacts:

<dependency>
    <groupId>org.restlet.jse</groupId>
    <artifactId>org.restlet.ext.jackson</artifactId>
    <version>2.0.5</version>
    <exclusions>
        <exclusion>
            <artifactId>jackson-core-asl</artifactId>
            <groupId>org.codehaus.jackson</groupId>
        </exclusion>
        <exclusion>
            <artifactId>jackson-mapper-asl</artifactId>
            <groupId>org.codehaus.jackson</groupId>
        </exclusion>
    </exclusions>
    <optional>true</optional>
</dependency>
<dependency>
    <groupId>org.apache.gora</groupId>
    <artifactId>gora-core</artifactId>
    <version>0.3</version>
    <exclusions>
        <exclusion>
            <artifactId>jackson-mapper-asl</artifactId>
            <groupId>org.codehaus.jackson</groupId>
        </exclusion>
    </exclusions>
    <optional>true</optional>
</dependency>

Modify org.apache.nutch.crawl.Crawler under src to add the -elasticindex and -batchId arguments:

Map<String,Object> argMap = ToolUtil.toArgMap(
    Nutch.ARG_THREADS, threads,
    Nutch.ARG_DEPTH, depth,
    Nutch.ARG_TOPN, topN,
    Nutch.ARG_SOLR, solrUrl,
    ElasticConstants.CLUSTER, elasticSearchAddr,  // build the index with ES
    Nutch.ARG_SEEDDIR, seedDir,
    Nutch.ARG_NUMTASKS, numTasks,
    Nutch.ARG_BATCH, batchId,                     // fixes the NullPointerException
    GeneratorJob.BATCH_ID, batchId);              // also meant to fix the NullPointerException; seems to have no effect
run(argMap);
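The snippet above only shows the argument map, not how -elasticindex and -batchId get read from the command line. As a rough, standalone illustration of that step, the sketch below parses the same flags into a map. CrawlArgs, its key names, and the int/long value types are assumptions made for this example; the real Crawler uses the Nutch.ARG_* constants and its own parsing loop.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the argument loop in Crawler; not Nutch code. */
final class CrawlArgs {

    /** Turns the command line into a map keyed by option name (without the dash). */
    static Map<String, Object> parse(String[] args) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            if (!arg.startsWith("-")) {
                out.put("seedDir", arg); // a bare argument is treated as the seed directory
                continue;
            }
            if (i + 1 >= args.length) {
                throw new IllegalArgumentException("Missing value for " + arg);
            }
            String value = args[++i];
            switch (arg) {
                case "-threads":
                case "-depth":
                    out.put(arg.substring(1), Integer.parseInt(value));
                    break;
                case "-topN":
                    out.put("topN", Long.parseLong(value));
                    break;
                case "-elasticindex": // host:port or cluster name, kept as a string
                case "-batchId":
                    out.put(arg.substring(1), value);
                    break;
                default:
                    throw new IllegalArgumentException("Unknown option " + arg);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parse(args));
    }
}
```

Rejecting unknown options and missing values up front keeps a mistyped flag from silently turning into a null further down the job chain.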
Modify org.apache.nutch.indexer.elastic.ElasticWriter so that -elasticindex accepts an ip:port value:

// Additional imports needed:
//   org.elasticsearch.client.transport.TransportClient
//   org.elasticsearch.common.transport.InetSocketTransportAddress
public void open(TaskAttemptContext job) throws IOException {
  String clusterName = job.getConfiguration().get(ElasticConstants.CLUSTER);
  if (clusterName != null && !clusterName.contains(":")) {
    // a plain cluster name: join that cluster as a client node
    node = nodeBuilder().clusterName(clusterName).client(true).node();
  } else {
    node = nodeBuilder().client(true).node();
  }
  LOG.info(String.format("clusterName=[%s]", clusterName));

  // null check added: the original called contains() on a possibly null value
  if (clusterName != null && clusterName.contains(":")) {
    // an ip:port value: connect to that address with a transport client
    String[] addr = clusterName.split(":");
    client = new TransportClient()
        .addTransportAddress(new InetSocketTransportAddress(addr[0], Integer.parseInt(addr[1])));
  } else {
    client = node.client();
  }

  bulk = client.prepareBulk();
  defaultIndex = job.getConfiguration().get(ElasticConstants.INDEX, "index");
  maxBulkDocs = job.getConfiguration().getInt(
      ElasticConstants.MAX_BULK_DOCS, DEFAULT_MAX_BULK_DOCS);
  maxBulkLength = job.getConfiguration().getInt(
      ElasticConstants.MAX_BULK_LENGTH, DEFAULT_MAX_BULK_LENGTH);
}
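The branching in open() depends entirely on whether the configured value looks like ip:port or like a cluster name. A minimal sketch of that decision, isolated from ElasticSearch so it can be run and tested with the JDK alone, might look like the following. EsTarget and its fields are hypothetical names invented for this example, not part of Nutch or the ES client.

```java
/** Hypothetical helper: classifies an -elasticindex value as host:port or cluster name. */
final class EsTarget {
    final String host;        // non-null only for host:port values
    final int port;           // -1 when no port was given
    final String clusterName; // non-null only for plain cluster names

    private EsTarget(String host, int port, String clusterName) {
        this.host = host;
        this.port = port;
        this.clusterName = clusterName;
    }

    /** True when the value named a transport address such as a2:9300. */
    boolean isTransportAddress() {
        return host != null;
    }

    /** Parses the raw value; null or blank yields neither a host nor a cluster name. */
    static EsTarget parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return new EsTarget(null, -1, null);
        }
        String value = raw.trim();
        int colon = value.lastIndexOf(':');
        if (colon < 0) {
            return new EsTarget(null, -1, value);
        }
        String host = value.substring(0, colon);
        // A non-numeric port makes parseInt throw NumberFormatException.
        int port = Integer.parseInt(value.substring(colon + 1));
        return new EsTarget(host, port, null);
    }

    public static void main(String[] args) {
        EsTarget target = EsTarget.parse("a2:9300");
        System.out.println(target.host + " " + target.port);
    }
}
```

With a helper like this, open() would only need one null-safe check before choosing between the transport client and the node client.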
Create a urls directory under the nutch directory, create seed.txt inside the urls directory, and write the seed URLs to crawl into it

Run Crawler

Pass these arguments:

urls -elasticindex a2:9300 -threads 10 -depth 3 -topN 5 -batchId 1

Check the log in nutch/hadoop.log:

2013-11-03 22:57:36,682 INFO  elasticsearch.node - [Ikonn] started
2013-11-03 22:57:36,682 INFO  elastic.ElasticWriter - clusterName=[a2:9300]
2013-11-03 22:57:36,692 INFO  elasticsearch.plugins - [Electron] loaded [], sites []
2013-11-03 22:57:36,863 INFO  basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2013-11-03 22:57:36,864 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2013-11-03 22:57:36,864 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2013-11-03 22:57:36,865 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2013-11-03 22:57:37,946 INFO  elastic.ElasticWriter - Processing remaining requests [docs = 86, length = 130314, total docs = 86]
2013-11-03 22:57:37,988 INFO  elastic.ElasticWriter - Processing to finalize last execute
2013-11-03 22:57:41,986 INFO  elastic.ElasticWriter - Previous took in ms 1590, including wait 3998
2013-11-03 22:57:42,020 INFO  elasticsearch.node - [Ikonn] stopping ...
2013-11-03 22:57:42,032 INFO  elasticsearch.node - [Ikonn] stopped
2013-11-03 22:57:42,032 INFO  elasticsearch.node - [Ikonn] closing ...
2013-11-03 22:57:42,039 INFO  elasticsearch.node - [Ikonn] closed
2013-11-03 22:57:42,041 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2013-11-03 22:57:42,057 INFO  elastic.ElasticIndexerJob - Done
Query ES:

http://a2:9200/_search?q=%E7%BE%8E%E5%A5%B3&pretty=true
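The q parameter in that URL is the percent-encoded UTF-8 form of a two-character Chinese search term. As a small sketch of how such a URL can be assembled, the code below builds it with java.net.URLEncoder. EsSearchUrl is a hypothetical helper for this post; the host a2 and HTTP port 9200 are taken from the URL above, and the search term is written with Unicode escapes so the source file stays ASCII.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper: builds a URI search request against the ES HTTP port. */
final class EsSearchUrl {

    /** Percent-encodes the query as UTF-8 and appends pretty=true for readable JSON. */
    static String build(String host, int httpPort, String query) {
        try {
            String encoded = URLEncoder.encode(query, "UTF-8");
            return "http://" + host + ":" + httpPort + "/_search?q=" + encoded + "&pretty=true";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // \u7f8e\u5973 is the search term from the URL in this post
        System.out.println(build("a2", 9200, "\u7f8e\u5973"));
    }
}
```

Note that the HTTP search goes to port 9200, while the -elasticindex argument used for indexing points at the transport port 9300.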

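The query returns results, which shows the whole pipeline works. Looking in HBase, the table has been created automatically and the crawled data has been stored in it.

References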

http://www.blogjava.net/paulwong/archive/2013/08/31/403513.html

http://my.oschina.net/mynote/blog/152845

http://www.searchtech.pro/nutch2.1-elasticsearch-mysql-local-Integrate

http://blog.csdn.net/laigood/article/details/7625862

Original article: https://www.cnblogs.com/lvfeilong/p/35435wr234324.html