  • Nutch 2.x with ElasticSearch: crawling and indexing

    http://blog.csdn.net/eryk86/article/details/14111811
     
    Import the Nutch project into IntelliJ from https://github.com/apache/nutch.git
     
    Configure ivy.xml, plus gora.properties and nutch-site.xml under conf
    Edit ivy/ivy.xml
         Change the elasticsearch version
     
    <dependency org="org.elasticsearch" name="elasticsearch" rev="0.90.5" conf="*->default"/>
         Uncomment the following line
     
    <dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
         Change the log4j version from 1.2.15 to 1.2.16, which fixes some packages failing to import
     
    <dependency org="log4j" name="log4j" rev="1.2.16" conf="*->master" />
    Edit gora.properties
         Comment out the following lines
     
    #gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
    #gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
    #gora.sqlstore.jdbc.user=sa
    #gora.sqlstore.jdbc.password=
         Add one line
     
    gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
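     
    As a quick sanity check on the edited file, the sketch below loads properties text with the JDK's java.util.Properties and confirms that gora.datastore.default points at HBaseStore. The class GoraConfigCheck and its method are hypothetical helpers written for illustration; they are not part of Nutch or Gora.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of Nutch or Gora.
final class GoraConfigCheck {
    static final String KEY = "gora.datastore.default";
    static final String HBASE_STORE = "org.apache.gora.hbase.store.HBaseStore";

    // Returns true when the properties text selects HBaseStore as the default datastore.
    // Lines starting with '#' are comments and are ignored by Properties.load.
    static boolean usesHBaseStore(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return HBASE_STORE.equals(props.getProperty(KEY, "").trim());
    }

    public static void main(String[] args) {
        String sample =
            "#gora.sqlstore.jdbc.user=sa\n"
            + "#gora.sqlstore.jdbc.password=\n"
            + "gora.datastore.default=org.apache.gora.hbase.store.HBaseStore\n";
        System.out.println(usesHBaseStore(sample));
    }
}
```

    In practice the text would come from conf/gora.properties rather than an inline string.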
     
    Edit nutch-site.xml and add the following properties
     
    <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.hbase.store.HBaseStore</value>
    </property>
    <property>
        <name>http.agent.name</name>
        <value>NutchCrawler</value>
    </property>
    <property>
        <name>parser.character.encoding.default</name>
        <value>utf-8</value>
    </property>
    <property>
        <name>http.accept.language</name>
        <value>ja-jp, en-us, zh-cn,en-gb,en;q=0.7,*;q=0.3</value>
    </property>
    <property>
        <name>generate.batch.id</name>
        <value>1</value>
    </property>


     
    Add the HBase configuration file hbase-site.xml to nutch/conf
     
    <configuration>
        <property>
            <name>hbase.rootdir</name>
            <value>file:///data/hbase</value>
        </property>
        <property>
            <name>hbase.zookeeper.property.dataDir</name>
            <value>/data/zookeeper</value>
        </property>
    </configuration>
     
    Edit nutch/src/bin/nutch and add at the top of the file
          NUTCH_JAVA_HOME=/usr/local/jdk
     
    Edit line 109 of org.apache.nutch.indexer.elastic.ElasticWriter under src so that it works with ES 0.90.5
          item.isFailed()
     
    Delete all template files under nutch/conf
     
    Build Nutch
          ant clean
          ant runtime
     
    Edit nutch-site.xml
     
    <property>
        <name>plugin.folders</name>
        <value>/home/eryk/workspace/nutch/runtime/local/plugins</value>
    </property>
    Set up IntelliJ so that nutch/conf and nutch/runtime/local/lib are on the classpath
    File->Project Structure->Dependencies: add the nutch/conf and nutch/runtime/local/lib directories
     
    Add these dependencies to pom.xml
     
    <dependency>
        <groupId>net.sourceforge.nekohtml</groupId>
        <artifactId>nekohtml</artifactId>
        <version>1.9.15</version>
    </dependency>
    <dependency>
        <groupId>org.ccil.cowan.tagsoup</groupId>
        <artifactId>tagsoup</artifactId>
        <version>1.2</version>
    </dependency>
    <dependency>
        <groupId>rome</groupId>
        <artifactId>rome</artifactId>
        <version>1.0</version>
    </dependency>
     
    Change the ES version in pom.xml
     
    <dependency>
        <groupId>org.elasticsearch</groupId>
        <artifactId>elasticsearch</artifactId>
        <version>0.90.5</version>
        <optional>true</optional>
    </dependency>
     
    Resolve the dependency version conflicts
     
    <dependency>
        <groupId>org.restlet.jse</groupId>
        <artifactId>org.restlet.ext.jackson</artifactId>
        <version>2.0.5</version>
        <exclusions>
            <exclusion>
                <artifactId>jackson-core-asl</artifactId>
                <groupId>org.codehaus.jackson</groupId>
            </exclusion>
            <exclusion>
                <artifactId>jackson-mapper-asl</artifactId>
                <groupId>org.codehaus.jackson</groupId>
            </exclusion>
        </exclusions>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.apache.gora</groupId>
        <artifactId>gora-core</artifactId>
        <version>0.3</version>
        <exclusions>
            <exclusion>
                <artifactId>jackson-mapper-asl</artifactId>
                <groupId>org.codehaus.jackson</groupId>
            </exclusion>
        </exclusions>
        <optional>true</optional>
    </dependency>
    Modify org.apache.nutch.crawl.Crawler under src to add the -elasticindex and -batchId arguments
     
    Map<String,Object> argMap = ToolUtil.toArgMap(
        Nutch.ARG_THREADS, threads,
        Nutch.ARG_DEPTH, depth,
        Nutch.ARG_TOPN, topN,
        Nutch.ARG_SOLR, solrUrl,
        ElasticConstants.CLUSTER, elasticSearchAddr,   // build the index with ES
        Nutch.ARG_SEEDDIR, seedDir,
        Nutch.ARG_NUMTASKS, numTasks,
        Nutch.ARG_BATCH, batchId,                      // avoids a NullPointerException
        GeneratorJob.BATCH_ID, batchId);               // meant to avoid the NullPointerException too, but seems to have no effect
    run(argMap);
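     
    The values elasticSearchAddr and batchId above have to be read from the command line first. The sketch below shows one way to collect "-name value" pairs from an argument array. It is a standalone illustration, not the actual parsing code in Crawler; the class CrawlArgs and the map key "seedDir" are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical standalone parser, not the code used by org.apache.nutch.crawl.Crawler.
final class CrawlArgs {
    // Collects "-name value" pairs; a bare argument is treated as the seed directory.
    static Map<String, String> parse(String[] args) {
        Map<String, String> parsed = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            if (arg.startsWith("-")) {
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("Missing value for " + arg);
                }
                // Store the option without its leading dash and consume its value.
                parsed.put(arg.substring(1), args[++i]);
            } else {
                parsed.put("seedDir", arg);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        String[] sample = {
            "urls", "-elasticindex", "a2:9300", "-threads", "10",
            "-depth", "3", "-topN", "5", "-batchId", "1"
        };
        Map<String, String> parsed = parse(sample);
        System.out.println("elasticindex = " + parsed.get("elasticindex"));
        System.out.println("batchId      = " + parsed.get("batchId"));
    }
}
```

    The parsed "elasticindex" and "batchId" values correspond to what is passed on as elasticSearchAddr and batchId in the argMap.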
     
    Modify org.apache.nutch.indexer.elastic.ElasticWriter so that -elasticindex accepts an ip:port value
     
    public void open(TaskAttemptContext job) throws IOException {
      String clusterName = job.getConfiguration().get(ElasticConstants.CLUSTER);
      // A value without ':' is treated as a cluster name; otherwise start a default client node.
      if (clusterName != null && !clusterName.contains(":")) {
        node = nodeBuilder().clusterName(clusterName).client(true).node();
      } else {
        node = nodeBuilder().client(true).node();
      }
      LOG.info(String.format("clusterName=[%s]", clusterName));

      // A value of the form host:port is used as a transport address.
      // The null check avoids a NullPointerException when no value was passed.
      if (clusterName != null && clusterName.contains(":")) {
        String[] addr = clusterName.split(":");
        client = new TransportClient()
            .addTransportAddress(new InetSocketTransportAddress(addr[0], Integer.parseInt(addr[1])));
      } else {
        client = node.client();
      }

      bulk = client.prepareBulk();
      defaultIndex = job.getConfiguration().get(ElasticConstants.INDEX, "index");
      maxBulkDocs = job.getConfiguration().getInt(
          ElasticConstants.MAX_BULK_DOCS, DEFAULT_MAX_BULK_DOCS);
      maxBulkLength = job.getConfiguration().getInt(
          ElasticConstants.MAX_BULK_LENGTH, DEFAULT_MAX_BULK_LENGTH);
    }
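     
    The host:port handling in open() can be isolated and tested without Elasticsearch. The sketch below is one possible way to do that: it validates the string before it would be handed to a transport client. The class EsAddress is a hypothetical helper, not part of Nutch or the Elasticsearch API.

```java
import java.util.Optional;

// Hypothetical helper, not part of Nutch or Elasticsearch.
final class EsAddress {
    final String host;
    final int port;

    private EsAddress(String host, int port) {
        this.host = host;
        this.port = port;
    }

    // Returns the parsed address, or empty when the value is not of the form host:port.
    static Optional<EsAddress> parse(String value) {
        if (value == null) {
            return Optional.empty();
        }
        int colon = value.lastIndexOf(':');
        // Reject a missing colon, an empty host, or an empty port.
        if (colon <= 0 || colon == value.length() - 1) {
            return Optional.empty();
        }
        try {
            int port = Integer.parseInt(value.substring(colon + 1));
            if (port < 1 || port > 65535) {
                return Optional.empty();
            }
            return Optional.of(new EsAddress(value.substring(0, colon), port));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("a2:9300").isPresent());
        System.out.println(parse("mycluster").isPresent());
    }
}
```

    An empty result would correspond to the branch that treats the value as a cluster name, while a present result would correspond to the TransportClient branch.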
     
    Create a urls directory under the nutch directory, create seed.txt inside urls, and write the seed URLs to crawl into it
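     
    Creating the seed file by hand is enough, but for repeatable setups it can also be scripted. The following is a minimal sketch using java.nio.file; the class SeedFile is a hypothetical helper, and the seed URL in main is only a placeholder, not one taken from the original post.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch.
final class SeedFile {
    // Creates <nutchDir>/urls/seed.txt with one seed URL per line and returns its path.
    static Path write(Path nutchDir, List<String> seedUrls) {
        try {
            Path urlsDir = Files.createDirectories(nutchDir.resolve("urls"));
            return Files.write(urlsDir.resolve("seed.txt"), seedUrls, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The first argument is the nutch directory; defaults to the current directory.
        Path nutchDir = Paths.get(args.length > 0 ? args[0] : ".");
        // Placeholder seed URL; replace with the sites to crawl.
        Path seed = write(nutchDir, Arrays.asList("http://nutch.apache.org/"));
        System.out.println("Wrote " + seed);
    }
}
```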
     
    Run Crawler
         Pass these arguments
         urls -elasticindex a2:9300 -threads 10 -depth 3 -topN 5 -batchId 1
         Watch the nutch/hadoop.log log
     
    2013-11-03 22:57:36,682 INFO  elasticsearch.node - [Ikonn] started
    2013-11-03 22:57:36,682 INFO  elastic.ElasticWriter - clusterName=[a2:9300]
    2013-11-03 22:57:36,692 INFO  elasticsearch.plugins - [Electron] loaded [], sites []
    2013-11-03 22:57:36,863 INFO  basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
    2013-11-03 22:57:36,864 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
    2013-11-03 22:57:36,864 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
    2013-11-03 22:57:36,865 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
    2013-11-03 22:57:37,946 INFO  elastic.ElasticWriter - Processing remaining requests [docs = 86, length = 130314, total docs = 86]
    2013-11-03 22:57:37,988 INFO  elastic.ElasticWriter - Processing to finalize last execute
    2013-11-03 22:57:41,986 INFO  elastic.ElasticWriter - Previous took in ms 1590, including wait 3998
    2013-11-03 22:57:42,020 INFO  elasticsearch.node - [Ikonn] stopping ...
    2013-11-03 22:57:42,032 INFO  elasticsearch.node - [Ikonn] stopped
    2013-11-03 22:57:42,032 INFO  elasticsearch.node - [Ikonn] closing ...
    2013-11-03 22:57:42,039 INFO  elasticsearch.node - [Ikonn] closed
    2013-11-03 22:57:42,041 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
    2013-11-03 22:57:42,057 INFO  elastic.ElasticIndexerJob - Done
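    Query ES
         Results come back, which shows the whole pipeline works. In HBase, the table has been created automatically and already holds the crawled data.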
     
  • Original URL: https://www.cnblogs.com/lvfeilong/p/35435wr234324.html