  • Writing a custom Extractor extension for Heritrix 3.x

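    1. Introduction

      Heritrix 3.x differs substantially from Heritrix 1.x: it introduces a completely new configuration model and changes the extension interfaces. Combined with the scarcity of documentation, this leaves Heritrix developers confused. Earlier articles covered configuring, deploying, and running Heritrix; this one walks through a worked example of extending the Extractor in Heritrix 3.x.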

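    2. Configuration

      The Heritrix 3.x WebUI has changed: instead of the old option-selection pages, the configuration file is now edited directly online. For a custom Extractor to take part in a crawl, first edit the configuration file so that the custom Extractor is added to Heritrix's Processor chain. The fetch-chain section of the configuration file, with the custom additions marked, is as follows:

      2.1 Configuration file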

    <!-- FETCH CHAIN -->
    <!-- processors declared as named beans -->
    <bean id="preselector" class="org.archive.crawler.prefetch.Preselector">
      <!-- ... -->
    </bean>
    <bean id="preconditions" class="org.archive.crawler.prefetch.PreconditionEnforcer">
      <!-- ... -->
    </bean>
    <bean id="fetchDns" class="org.archive.modules.fetcher.FetchDNS">
      <!-- ... -->
    </bean>
    <bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
      <!-- ... -->
    </bean>
    <bean id="extractorHttp" class="org.archive.modules.extractor.ExtractorHTTP">
    </bean>
    <!-- ===== custom Extractor ===== -->
    <bean id="SohuNewsExtractor" class="my.SohuNewsExtractor">
    </bean>
    <!-- ===== end of custom Extractor ===== -->
    <bean id="extractorHtml" class="org.archive.modules.extractor.ExtractorHTML">
      <!-- ... -->
    </bean>
    <bean id="extractorCss" class="org.archive.modules.extractor.ExtractorCSS">
    </bean>
    <bean id="extractorJs" class="org.archive.modules.extractor.ExtractorJS">
    </bean>
    <bean id="extractorSwf" class="org.archive.modules.extractor.ExtractorSWF">
    </bean>
    <!-- assembled into ordered FetchChain bean -->
    <bean id="fetchProcessors" class="org.archive.modules.FetchChain">
      <property name="processors">
        <list>
          <!-- recheck scope, if so enabled... -->
          <ref bean="preselector"/>
          <!-- ...then verify or trigger prerequisite URIs fetched, allow crawling... -->
          <ref bean="preconditions"/>
          <!-- ...fetch if DNS URI... -->
          <ref bean="fetchDns"/>
          <!-- ...fetch if HTTP URI... -->
          <ref bean="fetchHttp"/>
          <!-- ...extract outlinks from HTTP headers... -->
          <ref bean="extractorHttp"/>
          <!-- ===== custom Extractor ===== -->
          <!-- ...extract outlinks from HTTP content... -->
          <ref bean="SohuNewsExtractor"/>
          <!-- ===== end of custom Extractor ===== -->
          <!-- ...extract outlinks from HTML content... -->
          <ref bean="extractorHtml"/>
          <!-- ...extract outlinks from CSS content... -->
          <ref bean="extractorCss"/>
          <!-- ...extract outlinks from Javascript content... -->
          <ref bean="extractorJs"/>
          <!-- ...extract outlinks from Flash content... -->
          <ref bean="extractorSwf"/>
        </list>
      </property>
    </bean>

      2.2 Adding the bean and registering it in the processor list

    <bean id="extractorHttp" class="org.archive.modules.extractor.ExtractorHTTP">
    </bean>
    <!-- ===== custom Extractor: bean declaration ===== -->
    <bean id="SohuNewsExtractor" class="my.SohuNewsExtractor">
    </bean>
    ...
    <!-- ===== custom Extractor: entry in the processors list ===== -->
    <!-- ...extract outlinks from HTTP content... -->
    <ref bean="SohuNewsExtractor"/>

      Two additions are needed: the bean declaration, and a matching ref entry in the processors list of the fetchProcessors bean. With both in place, the custom Extractor is scheduled as part of the Processor chain.

    3. Code notes

      3.1 The Extractor base class

      The Extractor base class has changed; there is a new method to implement:

        @Override
        protected boolean shouldProcess(CrawlURI uri) {
            // IDE-generated stub: while this returns false, extract() is never called.
            return false;
        }

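      This method must be implemented, and it must return true for the URIs the Extractor should handle; otherwise the custom Extractor's void extract(CrawlURI uri) is never invoked.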
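
      The gating role of shouldProcess can be illustrated with a small self-contained sketch. SketchUri, SketchProcessor, and SketchSohuExtractor are hypothetical stand-ins written for this illustration, not Heritrix classes; the only assumption carried over is that the framework asks shouldProcess first and calls extract only when it returns true. The sohu.com host check is likewise an assumed filter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for CrawlURI: it only carries the URI string.
class SketchUri {
    final String uri;

    SketchUri(String uri) {
        this.uri = uri;
    }
}

// Hypothetical stand-in for the processor base class: extract() is only
// reached when shouldProcess() returns true.
abstract class SketchProcessor {
    protected abstract boolean shouldProcess(SketchUri uri);

    protected abstract void extract(SketchUri uri);

    final void process(SketchUri uri) {
        if (shouldProcess(uri)) {
            extract(uri);
        }
    }
}

// Custom extractor that opts in for a subset of URIs (assumed filter).
class SketchSohuExtractor extends SketchProcessor {
    final List<String> seen = new ArrayList<>();

    @Override
    protected boolean shouldProcess(SketchUri uri) {
        // Returning false unconditionally here would skip every URI.
        return uri.uri.contains("sohu.com");
    }

    @Override
    protected void extract(SketchUri uri) {
        // Record the URI so the effect of the gate is observable.
        seen.add(uri.uri);
    }
}

class ShouldProcessDemo {
    public static void main(String[] args) {
        SketchSohuExtractor extractor = new SketchSohuExtractor();
        extractor.process(new SketchUri("http://news.sohu.com/a.shtml"));
        extractor.process(new SketchUri("http://example.com/b.html"));
        // Only URIs accepted by shouldProcess() reach extract().
        System.out.println(extractor.seen);
    }
}
```

      In a real extractor the same decision would be made on the CrawlURI passed in by Heritrix; which properties to test (host, content type, fetch status) depends on what the extractor is meant to handle.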

      3.2 Constructor

      In 1.x the constructor looked like this:

        public Extractor(String name, String description) {
            super(name, description);
            // TODO Auto-generated constructor stub
        }

      In 3.x the constructor takes no parameters; the default (no-argument) constructor is used.
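
      A bean declared with only an id and a class, as in section 2, is created through its no-argument constructor, and any settings are applied afterwards through setters. The sketch below mimics that pattern with plain reflection standing in for the container. SketchExtractor and its maxLinks property are hypothetical names invented for this illustration.

```java
import java.lang.reflect.Constructor;

// Hypothetical extractor-like class: no constructor arguments; settings
// arrive through setters after construction.
class SketchExtractor {
    private int maxLinks = 100; // default used when no property is set

    public SketchExtractor() {
        // Nothing to pass in, unlike the 1.x (name, description) form.
    }

    public void setMaxLinks(int maxLinks) {
        this.maxLinks = maxLinks;
    }

    public int getMaxLinks() {
        return maxLinks;
    }
}

class NoArgConstructorDemo {
    // Creates an instance the way a bean container would: by looking up
    // the no-argument constructor of the class and invoking it.
    static SketchExtractor create() {
        try {
            Constructor<SketchExtractor> ctor =
                    SketchExtractor.class.getDeclaredConstructor();
            return ctor.newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("no usable no-arg constructor", e);
        }
    }

    public static void main(String[] args) {
        SketchExtractor extractor = create();
        // Equivalent in spirit to a <property name="maxLinks" value="10"/> entry.
        extractor.setMaxLinks(10);
        System.out.println(extractor.getMaxLinks());
    }
}
```

      A class that only offers a constructor with arguments would make the reflective lookup fail, which is why a custom Extractor declared this way needs a no-argument constructor.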

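    4. Open questions

      protected void extract(CrawlURI curi) {
          // 1. What processing should be done here?
          // 2. How can the subsequent download behaviour be controlled,
          //    so that only the wanted content is downloaded?
      }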
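
      One possible direction for both questions, offered as a sketch and not as the author's solution: inside extract, scan the fetched page for links and keep only those that match the pattern of the wanted pages, so that nothing else is queued by this extractor. The example below shows only the filtering step in plain Java. NewsLinkFilter, the naive href regex, and the assumed article URL shape are all hypothetical; how the surviving links are handed back to the crawler depends on the Heritrix version and is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NewsLinkFilter {
    // Naive href matcher: enough for a sketch, not a full HTML parser.
    private static final Pattern HREF =
            Pattern.compile("href=\"([^\"]+)\"");

    // Assumed shape of a wanted article URL (hypothetical).
    private static final Pattern WANTED =
            Pattern.compile("^http://news\\.sohu\\.com/.+\\.shtml$");

    // Returns the links found in the page that match the wanted pattern,
    // in document order.
    static List<String> wantedLinks(CharSequence html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String link = m.group(1);
            if (WANTED.matcher(link).matches()) {
                links.add(link);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<a href=\"http://news.sohu.com/20131226/n1.shtml\">news</a>"
                + "<a href=\"http://example.com/x.html\">other</a>";
        // Only the link matching the assumed article pattern is kept.
        System.out.println(wantedLinks(page));
    }
}
```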

  • Original article: https://www.cnblogs.com/hadoopdev/p/3493439.html