Heritrix 3.1.0 Source Code Analysis (19)

    This article continues the analysis of the source code related to Heritrix 3.1.0's processors.

    As usual, let's first take a look at the class UML diagram.

    All processors inherit from the abstract parent class Processor, whose most important method is shown below:

    /**
         * Processes the given URI.  First checks {@link #ENABLED} and
         * {@link #DECIDE_RULES}.  If ENABLED is false, then nothing happens.
         * If the DECIDE_RULES indicate REJECT, then the 
         * {@link #innerRejectProcess(ProcessorURI)} method is invoked, and
         * the process method returns.
         * 
         * <p>Next, the {@link #shouldProcess(ProcessorURI)} method is 
         * consulted to see if this Processor knows how to handle the given
         * URI.  If it returns false, then nothing further occurs.
         * 
         * <p>FIXME: Should innerRejectProcess be called when ENABLED is false,
         * or when shouldProcess returns false?  The previous Processor 
         * implementation didn't handle it that way.
         * 
         * <p>Otherwise, the URI is considered valid.  This processor's count
         * of handled URIs is incremented, and the 
         * {@link #innerProcess(ProcessorURI)} method is invoked to actually
         * perform the process.
         * 
         * @param uri  The URI to process
         * @throws  InterruptedException   if the thread is interrupted
         */
        public ProcessResult process(CrawlURI uri) 
        throws InterruptedException {
            if (!getEnabled()) {
                return ProcessResult.PROCEED;
            }
            
            if (getShouldProcessRule().decisionFor(uri) == DecideResult.REJECT) {
                innerRejectProcess(uri);
                return ProcessResult.PROCEED;
            }
            
            if (shouldProcess(uri)) {
                uriCount.incrementAndGet();
                return innerProcessResult(uri);
            } else {
                return ProcessResult.PROCEED;
            }
        }

    After the ENABLED flag and the decide rules have been checked, process() asks whether this processor should handle the URI at all. shouldProcess(CrawlURI uri) is an abstract method implemented by the subclasses: each concrete processor decides for itself whether the current CrawlURI object needs to pass through it.

    If so, process() goes on to call the ProcessResult innerProcessResult(CrawlURI uri) method (some subclasses override this method):

    protected ProcessResult innerProcessResult(CrawlURI uri) 
        throws InterruptedException {
            innerProcess(uri);
            return ProcessResult.PROCEED;
        }

    This in turn calls the void innerProcess(CrawlURI uri) method, which is abstract and implemented by the subclasses:

    /**
         * Actually performs the process.  By the time this method is invoked,
         * it is known that the given URI passes the {@link #ENABLED}, the 
         * {@link #DECIDE_RULES} and the {@link #shouldProcess(ProcessorURI)}
         * tests.  
         * 
         * @param uri    the URI to process
         * @throws InterruptedException   if the thread is interrupted
         */
        protected abstract void innerProcess(CrawlURI uri) 
        throws InterruptedException;
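
    To make the template pattern concrete, here is a minimal hypothetical subclass (my own illustration, not a Heritrix class): shouldProcess() decides which URIs the processor cares about, and innerProcess() does the actual work.

    import org.archive.modules.CrawlURI;
    import org.archive.modules.Processor;

    // Hypothetical example only -- not part of Heritrix.
    // A processor that handles http/https URIs and records their length.
    public class UriLengthProcessor extends Processor {

        @Override
        protected boolean shouldProcess(CrawlURI uri) {
            // Restrict this processor to http/https; for anything else,
            // process() simply returns ProcessResult.PROCEED without
            // calling innerProcess().
            String scheme = uri.getUURI().getScheme();
            return "http".equals(scheme) || "https".equals(scheme);
        }

        @Override
        protected void innerProcess(CrawlURI uri) throws InterruptedException {
            // A real processor would fetch, extract or persist here; this
            // sketch just stashes a value in the CrawlURI's data map.
            uri.getData().put("uriLength", uri.getUURI().toString().length());
        }
    }

    Because process() already handles the ENABLED flag, the decide rules and the URI counter, a subclass normally only needs these two methods; innerProcessResult() is overridden only by processors that must influence how the chain continues.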

    The subclasses of Processor logically fall into several different categories of processors. At runtime they belong to different processor chains, and within the class hierarchy each has its own place.

    In this article and the ones that follow I can only pick out some of these processors for analysis.

    The CandidatesProcessor: this processor holds a CandidateChain candidateChain member and invokes the processors of that chain.

    CrawlURI cURI objects that pass through this processor are ultimately added to the BDB database via BdbFrontier's schedule(CrawlURI cURI) method.

     /**
         * Candidate chain
         */
        protected CandidateChain candidateChain;
        public CandidateChain getCandidateChain() {
            return this.candidateChain;
        }
        @Autowired
        public void setCandidateChain(CandidateChain candidateChain) {
            this.candidateChain = candidateChain;
        }
        
        /**
         * The frontier to use.
         */
        protected Frontier frontier;
        public Frontier getFrontier() {
            return this.frontier;
        }
        @Autowired
        public void setFrontier(Frontier frontier) {
            this.frontier = frontier;
        }

    The processor method that actually gets invoked is the following:

    /* (non-Javadoc)
         * @see org.archive.modules.Processor#innerProcess(org.archive.modules.CrawlURI)
         */
        @Override
        protected void innerProcess(final CrawlURI curi) throws InterruptedException {
            // Handle any prerequisites when S_DEFERRED for prereqs
            if (curi.hasPrerequisiteUri() && curi.getFetchStatus() == S_DEFERRED) {
                CrawlURI prereq = curi.getPrerequisiteUri();
                prereq.setFullVia(curi); 
                sheetOverlaysManager.applyOverlaysTo(prereq);
                try {
                    KeyedProperties.clearOverridesFrom(curi); 
                    KeyedProperties.loadOverridesFrom(prereq);
    
                    getCandidateChain().process(prereq, null);
                    
                    if(prereq.getFetchStatus()>=0) {
                        
                        frontier.schedule(prereq);
                    } else {
                        curi.setFetchStatus(S_PREREQUISITE_UNSCHEDULABLE_FAILURE);
                    }
                } finally {
                    KeyedProperties.clearOverridesFrom(prereq); 
                    KeyedProperties.loadOverridesFrom(curi);
                }
                return;
            }
    
            // Don't consider candidate links of error pages
            if (curi.getFetchStatus() < 200 || curi.getFetchStatus() >= 400) {
                curi.getOutLinks().clear();
                return;
            }
    
            for (Link wref: curi.getOutLinks()) {
                CrawlURI candidate;
                try {
                    candidate = curi.createCrawlURI(curi.getBaseURI(),wref);
                    // at least for duration of candidatechain, offer
                    // access to full CrawlURI of via
                    candidate.setFullVia(curi); 
                } catch (URIException e) {
                    loggerModule.logUriError(e, curi.getUURI(), 
                            wref.getDestination().toString());
                    continue;
                }
                sheetOverlaysManager.applyOverlaysTo(candidate);
                try {
                    KeyedProperties.clearOverridesFrom(curi); 
                    KeyedProperties.loadOverridesFrom(candidate);
                    
                    if(getSeedsRedirectNewSeeds() && curi.isSeed() 
                            && wref.getHopType() == Hop.REFER
                            && candidate.getHopCount() < SEEDS_REDIRECT_NEW_SEEDS_MAX_HOPS) {
                        candidate.setSeed(true);                     
                    }
                    getCandidateChain().process(candidate, null); 
                    if(candidate.getFetchStatus()>=0) {
                        if(checkForSeedPromotion(candidate)) {
                            /*
                             * We want to guarantee crawling of seed version of
                             * CrawlURI even if same url has already been enqueued,
                             * see https://webarchive.jira.com/browse/HER-1891
                             */
                            candidate.setForceFetch(true);                        
                            getSeeds().addSeed(candidate);
                        } else {                        
                            frontier.schedule(candidate);
                        }
                        curi.getOutCandidates().add(candidate);
                    }
                    
                } finally {
                    KeyedProperties.clearOverridesFrom(candidate); 
                    KeyedProperties.loadOverridesFrom(curi);
                }
            }
            curi.getOutLinks().clear();
        }
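
    Note the candidate.getFetchStatus() >= 0 gate above: a processor placed inside the candidate chain keeps a URI out of the frontier simply by giving it a negative fetch status (this is essentially what the CandidateScoper in the configuration below does for out-of-scope URIs). A minimal hypothetical sketch of such a chain processor follows; it is my own illustration, not Heritrix code, and the literal -5 merely stands in for Heritrix's dedicated negative status codes.

    import org.archive.modules.CrawlURI;
    import org.archive.modules.Processor;

    // Hypothetical candidate-chain processor -- not a Heritrix class.
    // Rejecting a candidate means assigning it a negative fetch status;
    // CandidatesProcessor then skips frontier.schedule() for that URI.
    public class DropQueryStringCandidates extends Processor {

        @Override
        protected boolean shouldProcess(CrawlURI uri) {
            // Only look at URIs that carry a query string.
            return uri.getUURI().toString().contains("?");
        }

        @Override
        protected void innerProcess(CrawlURI uri) throws InterruptedException {
            // Any negative status keeps the candidate out of the frontier;
            // Heritrix defines specific negative codes (e.g. out-of-scope) for this.
            uri.setFetchStatus(-5);
        }
    }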

    Looking at the crawl job configuration file crawler-beans.cxml, the processors that make up the CandidateChain candidateChain are as follows:

    <!-- CANDIDATE CHAIN --> 
     <!-- first, processors are declared as top-level named beans -->
     <bean id="candidateScoper" class="org.archive.crawler.prefetch.CandidateScoper">
     </bean>
     <bean id="preparer" class="org.archive.crawler.prefetch.FrontierPreparer">
      <!-- <property name="preferenceDepthHops" value="-1" /> -->
      <!-- <property name="preferenceEmbedHops" value="1" /> -->
      <!-- <property name="canonicalizationPolicy"> 
            <ref bean="canonicalizationPolicy" />
           </property> -->
      <!-- <property name="queueAssignmentPolicy"> 
            <ref bean="queueAssignmentPolicy" />
           </property> -->
      <!-- <property name="uriPrecedencePolicy"> 
            <ref bean="uriPrecedencePolicy" />
           </property> -->
      <!-- <property name="costAssignmentPolicy"> 
            <ref bean="costAssignmentPolicy" />
           </property> -->
     </bean>
     <!-- now, processors are assembled into ordered CandidateChain bean -->
     <bean id="candidateProcessors" class="org.archive.modules.CandidateChain">
      <property name="processors">
       <list>
        <!-- apply scoping rules to each individual candidate URI... -->
        <ref bean="candidateScoper"/>
        <!-- ...then prepare those ACCEPTed to be enqueued to frontier. -->
        <ref bean="preparer"/>
       </list>
      </property>
     </bean>
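
    The candidateProcessors bean is just an ordered list of processors; conceptually, running the chain amounts to calling process() on each processor in turn until one of them asks to stop. The following is a simplified sketch of that idea only, not the actual ProcessorChain/CandidateChain implementation; the import paths follow the org.archive.modules classes referenced earlier.

    import java.util.List;

    import org.archive.modules.CrawlURI;
    import org.archive.modules.ProcessResult;
    import org.archive.modules.Processor;

    // Conceptual sketch of chain execution -- the real ProcessorChain is more involved.
    public class SimpleChainSketch {

        private final List<Processor> processors;

        public SimpleChainSketch(List<Processor> processors) {
            this.processors = processors;
        }

        public void run(CrawlURI uri) throws InterruptedException {
            for (Processor p : processors) {
                // Each processor applies its own ENABLED / decide-rule / shouldProcess
                // checks inside process(); PROCEED means "go on to the next processor".
                ProcessResult result = p.process(uri);
                if (result != ProcessResult.PROCEED) {
                    break; // simplified: any non-PROCEED result ends the chain here
                }
            }
        }
    }

    Applied to the configuration above, this means candidateScoper runs first to apply the scoping rules, and preparer then prepares the ACCEPTed URIs for being enqueued to the frontier.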

    ---------------------------------------------------------------------------

    This Heritrix 3.1.0 Source Code Analysis series is my own original work.

    When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯.

    Permalink: http://www.cnblogs.com/chenying99/archive/2013/04/23/3036954.html
