  • Nutch: writing an IndexingFilter plugin

    Reference: http://blog.csdn.net/amuseme_lu/article/details/6780244


    1 Create a package structure similar to that of urlfilter-regex

    How to generate the code path: http://www.cnblogs.com/i80386/archive/2012/09/04/2670670.html


    2 Write the IndexingFilter implementation

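    package org.apache.nutch.indexer.myfield; // matches the class attribute in plugin.xml

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.indexer.lucene.LuceneWriter;
    import org.apache.nutch.parse.Parse;

    // Indexing filter that adds a custom "mt" field to every document.
    public class MyIndexingFilter implements IndexingFilter {

        public static final Log LOG = LogFactory.getLog(MyIndexingFilter.class);
        private Configuration conf;

        // Store the "mt" field and index it tokenized.
        public void addIndexBackendOptions(Configuration conf) {
            LuceneWriter.addFieldOptions("mt", LuceneWriter.STORE.YES,
                    LuceneWriter.INDEX.TOKENIZED, conf);
        }

        private NutchDocument addMyField(NutchDocument doc) {
            // A fixed value is used here; in practice the target field
            // should be extracted from the HTML.
            String value = "银河系";
            System.out.println(value); // debug output
            doc.add("mt", value);
            return doc;
        }

        public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                CrawlDatum datum, Inlinks inlinks) throws IndexingException {
            addMyField(doc);
            return doc;
        }

        public Configuration getConf() {
            return this.conf;
        }

        public void setConf(Configuration conf) {
            this.conf = conf;
        }
    }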
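
    The filter above writes a constant into the "mt" field and notes that the value should really come from the HTML. The following is a minimal sketch of that extraction step, assuming the target value sits in a <meta name="mt" content="..."> tag. The class MetaFieldExtractor, its method, and the sample HTML are hypothetical and not part of Nutch; how the filter gets hold of the raw HTML is left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Nutch class): pulls one <meta> value out of raw HTML.
class MetaFieldExtractor {

    /** Returns the content attribute of <meta name="NAME" content="...">, or null if absent. */
    static String extractMeta(String html, String name) {
        // Assumption: name comes before content and both attributes use double quotes.
        Pattern p = Pattern.compile(
                "<meta\\s+name=\"" + Pattern.quote(name) + "\"\\s+content=\"([^\"]*)\"",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up sample page for illustration only.
        String html = "<html><head><meta name=\"mt\" content=\"galaxy\"></head><body></body></html>";
        System.out.println(extractMeta(html, "mt"));
    }
}
```

    In addMyField, the extracted string would then be passed to doc.add("mt", value) in place of the constant. A regex is only adequate for simple, predictable markup; a real page may need a proper HTML parser.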

    3 Build the jar package (build fat jar)

    4 Create plugin.xml

    <plugin
       id="index-myfield"
       name="my Indexing Filter"
       version="1.0.0"
       provider-name="nutch.org">
    
    
       <runtime>
          <library name="myfield.jar">
             <export name="*"/>
          </library>
       </runtime>
    
       <requires>
          <import plugin="nutch-extensionpoints"/>
       </requires>
    
       <extension id="org.apache.nutch.indexer.myfield"
                  name="Nutch My Indexing Filter"
                  point="org.apache.nutch.indexer.IndexingFilter">
          <implementation id="MyIndexingFilter"
                          class="org.apache.nutch.indexer.myfield.MyIndexingFilter"/>
       </extension>
    
    </plugin>
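
    A mistyped class attribute in plugin.xml is an easy error to make, so it can help to check the descriptor before deploying. The following is a small sketch using only the JDK's XML parser: it reads the class attribute of the first <implementation> element so it can be compared with the filter's fully qualified name. PluginXmlCheck is a hypothetical helper, not a Nutch tool.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sanity check for a plugin.xml descriptor.
class PluginXmlCheck {

    /** Returns the class attribute of the first <implementation> element, or null if there is none. */
    static String implementationClass(String xml) {
        try {
            Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element impl = (Element) d.getElementsByTagName("implementation").item(0);
            return impl == null ? null : impl.getAttribute("class");
        } catch (Exception e) {
            throw new IllegalArgumentException("plugin.xml could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // Abbreviated copy of the descriptor from step 4.
        String xml = "<plugin id=\"index-myfield\">"
                + "<extension point=\"org.apache.nutch.indexer.IndexingFilter\">"
                + "<implementation id=\"MyIndexingFilter\" "
                + "class=\"org.apache.nutch.indexer.myfield.MyIndexingFilter\"/>"
                + "</extension></plugin>";
        System.out.println(implementationClass(xml));
    }
}
```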

    5 Finally, put the built jar and plugin.xml into the E:\nutch\src\plugin\index-myfield folder

    6 Edit conf\nutch-site.xml

    <configuration>
        <property>
            <name>searcher.dir</name>
            <value>E:/crawl_2</value>
        </property>
        <property>
            <name>plugin.includes</name>
            <value>protocol-http|urlfilter-(regex|prefix|my)|parse-(html|tika)|index-(basic|anchor|myfield)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
            <description>Regular expression naming plugin directory names to
            include.  Any plugin not matching this expression is excluded.
            In any case you need at least include the nutch-extensionpoints plugin. By
            default Nutch includes crawling just HTML and plain text via HTTP,
            and basic indexing and search plugins. In order to use HTTPS please enable
            protocol-httpclient, but be aware of possible intermittent problems with the
            underlying commons-httpclient library.
            </description>
        </property>
    </configuration>
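
    Since plugin.includes is a regular expression, a typo in it silently excludes the plugin. The following sketch checks plugin names against the value above, assuming the expression has to match the whole plugin name (an assumption about how Nutch applies it). PluginIncludesCheck is a hypothetical helper, not part of Nutch.

```java
import java.util.regex.Pattern;

// Hypothetical check of plugin names against the plugin.includes value from nutch-site.xml.
class PluginIncludesCheck {

    static final String INCLUDES =
            "protocol-http|urlfilter-(regex|prefix|my)|parse-(html|tika)"
            + "|index-(basic|anchor|myfield)|scoring-opic|urlnormalizer-(pass|regex|basic)";

    /** True if the whole plugin name matches the plugin.includes expression. */
    static boolean isIncluded(String pluginName) {
        return Pattern.matches(INCLUDES, pluginName);
    }

    public static void main(String[] args) {
        System.out.println(isIncluded("index-myfield")); // true
        System.out.println(isIncluded("index-more"));    // false: not listed in the index-(...) group
    }
}
```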

    7 Start Nutch

    8 Search in Solr

    9 The field we need can now be retrieved
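
    One way to confirm steps 8 and 9 is to query the "mt" field directly. The following sketch only builds the query URL. The base URL http://localhost:8983/solr and the /select path are assumptions about the Solr setup, and SolrQueryUrl is a hypothetical helper.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that builds a Solr field query URL; it does not send the request.
class SolrQueryUrl {

    /** Builds BASE/select?q=FIELD:"VALUE" with the query string URL-encoded as UTF-8. */
    static String fieldQueryUrl(String baseUrl, String field, String value) {
        try {
            String q = field + ":\"" + value + "\"";
            return baseUrl + "/select?q=" + URLEncoder.encode(q, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // \u94f6\u6cb3\u7cfb is the fixed value the filter writes into "mt".
        System.out.println(fieldQueryUrl("http://localhost:8983/solr", "mt", "\u94f6\u6cb3\u7cfb"));
    }
}
```

    Opening the printed URL in a browser should return the documents that carry the field, provided the Solr instance really is reachable at the assumed address.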


    Note: if, instead of building the jar by hand and placing it in the index-myfield folder, I only edit nutch-site.xml and add index-(basic|anchor|myfield)

  • Original article: https://www.cnblogs.com/i80386/p/2678466.html