  • Heritrix 3.1.0 Source Code Analysis (30)

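    As for the lifecycle of a CrawlURI uri object in the processor chains, I think it logically begins at the FrontierPreparer processor and then continues through the subsequent processors. (In fact, a CrawlURI uri object's lifecycle already takes shape while the extractor processors are handling its parent CrawlURI uri object; the lifecycles of a parent CrawlURI uri object and its child CrawlURI uri objects are interleaved. I described the processor flow in an earlier post.)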

    Only after the FrontierPreparer processor has handled a CrawlURI uri object does the object reach the BdbFrontier object's schedule method, which adds it to a BdbWorkQueue work queue.

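    This processor mainly initializes the CrawlURI uri object's settings: it sets the scheduling directive, canonicalizes the URL, generates the classKey, sets the holderCost, and applies the precedence policy, all in preparation for scheduling by the BdbFrontier object.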
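
    The five preparation steps listed above can be pictured with a small, self-contained sketch. MiniUri and every value in it are hypothetical stand-ins rather than Heritrix classes: canonicalization is reduced to lowercasing, the class key to the host name, and the cost to a constant 1. Only the order of the steps is taken from the description above.

```java
import java.net.URI;
import java.util.Locale;

public class PrepareSketch {

    /** Hypothetical stand-in for CrawlURI with only the fields touched here. */
    public static class MiniUri {
        public final String uri;
        public int schedulingDirective = 3; // assumed numeric value for NORMAL
        public String canonicalString;
        public String classKey;
        public int holderCost;
        public int precedence;

        public MiniUri(String uri) {
            this.uri = uri;
        }
    }

    /** Runs the preparation steps in the order the article lists them. */
    public static MiniUri prepare(MiniUri curi) {
        // 1. Scheduling directive: this sketch keeps the value already set.
        // 2. Canonical form: simplified here to lowercasing the whole URL.
        curi.canonicalString = curi.uri.toLowerCase(Locale.ROOT);
        // 3. Class key: simplified here to the host name.
        curi.classKey = URI.create(curi.canonicalString).getHost();
        // 4. Holder cost: a unit cost for every URI.
        curi.holderCost = 1;
        // 5. Precedence: copied from the cost.
        curi.precedence = curi.holderCost;
        return curi;
    }

    public static void main(String[] args) {
        MiniUri u = prepare(new MiniUri("HTTP://Example.com/A"));
        System.out.println(u.canonicalString + " " + u.classKey + " " + u.precedence);
    }
}
```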

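    When I covered the processors associated with the CandidateChain candidateChain processor chain in Heritrix 3.1.0 Source Code Analysis (20), I mentioned the FrontierPreparer processor but did not analyze what it does, so let's review it now.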

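    First, it sets the CrawlURI curi object's scheduling directive, based on the object's pathFromSeed property (the sequence of hops from the seed to the current CrawlURI curi, in which each link type has its own code).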

    /**
         * Calculate the coarse, original 'schedulingDirective' prioritization
         * for the given CrawlURI
         * 
         * @param curi
         * @return
         */
        protected int getSchedulingDirective(CrawlURI curi) {
            if(StringUtils.isNotEmpty(curi.getPathFromSeed())) {
                char lastHop = curi.getPathFromSeed().charAt(curi.getPathFromSeed().length()-1);
                if(lastHop == 'R') {
                    // refer
                    return getPreferenceDepthHops() >= 0 ? HIGH : MEDIUM;
                } 
            }
            if (getPreferenceDepthHops() == 0) {
                return HIGH;
                // this implies seed redirects are treated as path
                // length 1, which I belive is standard.
                // curi.getPathFromSeed() can never be null here, because
                // we're processing a link extracted from curi
            } else if (getPreferenceDepthHops() > 0 && 
                curi.getPathFromSeed().length() + 1 <= getPreferenceDepthHops()) {
                return HIGH;
            } else {
                // optionally preferencing embeds up to MEDIUM
                int prefHops = getPreferenceEmbedHops(); 
                if (prefHops > 0) {
                    int embedHops = curi.getTransHops();
                    if (embedHops > 0 && embedHops <= prefHops
                            && curi.getSchedulingDirective() == SchedulingConstants.NORMAL) {
                        // number of embed hops falls within the preferenced range, and
                        // uri is not already MEDIUM -- so promote it
                        return MEDIUM;
                    }
                }
                // Everything else stays as previously assigned
                // (probably NORMAL, at least for now)
                return curi.getSchedulingDirective();
            }
        }
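
    To make the branches easier to follow, here is a standalone sketch of the same decision logic on plain strings and ints. It is an illustration under stated assumptions, not the Heritrix implementation: the numeric values of HIGH, MEDIUM and NORMAL are assumed, 'L' and 'E' are assumed to be the hop codes for navigation links and embeds, and transHops is assumed to count the hops after the last 'L'. Only the 'R' check and the branch structure come from the method quoted above.

```java
public class SchedulingDirectiveSketch {
    // Assumed numeric values, used only inside this sketch.
    public static final int HIGH = 1;
    public static final int MEDIUM = 2;
    public static final int NORMAL = 3;

    /** Counts the trailing hops after the last 'L' (assumed meaning of getTransHops). */
    public static int transHops(String path) {
        int count = 0;
        for (int i = path.length() - 1; i >= 0 && path.charAt(i) != 'L'; i--) {
            count++;
        }
        return count;
    }

    /** Mirrors the branch structure of getSchedulingDirective for a non-null path. */
    public static int directive(String path, int current, int depthHops, int embedHops) {
        if (!path.isEmpty() && path.charAt(path.length() - 1) == 'R') {
            return depthHops >= 0 ? HIGH : MEDIUM;
        }
        if (depthHops == 0) {
            return HIGH;
        }
        if (depthHops > 0 && path.length() + 1 <= depthHops) {
            return HIGH;
        }
        int trans = transHops(path);
        if (embedHops > 0 && trans > 0 && trans <= embedHops && current == NORMAL) {
            return MEDIUM; // embed within the preferred range gets promoted
        }
        return current; // everything else keeps its previous directive
    }

    public static void main(String[] args) {
        // Settings chosen for illustration: no depth preference (-1), one embed hop preferred.
        System.out.println(directive("LR", NORMAL, -1, 1)); // prints 2 (MEDIUM)
        System.out.println(directive("LE", NORMAL, -1, 1)); // prints 2 (MEDIUM)
        System.out.println(directive("LL", NORMAL, -1, 1)); // prints 3 (NORMAL)
    }
}
```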

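    UriCanonicalizationPolicy, which I'll call the URL canonicalization policy class, is an abstract class that declares an abstract method for canonicalizing a URL; concrete subclasses implement it.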

    /**
     * URI Canonicalizatioon Policy
     * 
     * @contributor stack
     * @contributor gojomo
     */
    public abstract class UriCanonicalizationPolicy {
        public abstract String canonicalize(String uri);
    }

    The RulesCanonicalizationPolicy class extends the abstract class UriCanonicalizationPolicy and implements the URL canonicalization method.

    /**
     * URI Canonicalizatioon Policy
     * 
     * @contributor stack
     * @contributor gojomo
     */
    public class RulesCanonicalizationPolicy 
        extends UriCanonicalizationPolicy
        implements HasKeyedProperties {
        private static Logger logger =
            Logger.getLogger(RulesCanonicalizationPolicy.class.getName());
        
        protected KeyedProperties kp = new KeyedProperties();
        public KeyedProperties getKeyedProperties() {
            return kp;
        }
        
        {
            setRules(getDefaultRules());
        }
        @SuppressWarnings("unchecked")
        public List<CanonicalizationRule> getRules() {
            return (List<CanonicalizationRule>) kp.get("rules");
        }
        public void setRules(List<CanonicalizationRule> rules) {
            kp.put("rules", rules);
        }
        
        /**
         * Run the passed uuri through the list of rules.
         * @param context Url to canonicalize.
         * @param rules Iterator of canonicalization rules to apply (Get one
         * of these on the url-canonicalizer-rules element in order files or
         * create a list externally).  Rules must implement the Rule interface.
         * @return Canonicalized URL.
         */
        public String canonicalize(String before) {
            String canonical = before;
            if (logger.isLoggable(Level.FINER)) {
                logger.finer("Canonicalizing: "+before);
            }
            for (CanonicalizationRule rule : getRules()) {
                if(rule.getEnabled()) {
                    canonical = rule.canonicalize(canonical);
                }
                if (logger.isLoggable(Level.FINER)) {
                    logger.finer(
                        "Rule " + rule.getClass().getName() + " "
                        + (rule.getEnabled()
                                ? canonical :" (disabled)"));
                }
            }
            return canonical;
        }
        
        /**
         * A reasonable set of default rules to use, if no others are
         * provided by operator configuration.
         */
        public static List<CanonicalizationRule> getDefaultRules() {
            List<CanonicalizationRule> rules = new ArrayList<CanonicalizationRule>(6);
            rules.add(new LowercaseRule());
            rules.add(new StripUserinfoRule());
            rules.add(new StripWWWNRule());
            rules.add(new StripSessionIDs());
            rules.add(new StripSessionCFIDs());
            rules.add(new FixupQueryString());
            return rules;
        }
    }

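    The canonicalize method iterates over the list of CanonicalizationRule objects and calls the String canonicalize(String url) method of each enabled rule in turn, feeding the output of one rule into the next.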

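    CanonicalizationRule is an interface that declares the String canonicalize(String url) method; its implementations include the classes added in the static method List<CanonicalizationRule> getDefaultRules() above. This approach somewhat resembles a combination of the Composite and Iterator patterns, except that the branch node and the leaf nodes do not implement a common interface type.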
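
    The chaining idea can be tried out in isolation with a minimal sketch. The Rule interface and the two rules below are hypothetical simplifications written for this example; they are not the Heritrix LowercaseRule or StripWWWNRule classes, whose actual behavior is not shown in this post.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class RuleChainSketch {

    /** Hypothetical, simplified counterpart of CanonicalizationRule. */
    public interface Rule {
        String canonicalize(String url);

        default boolean isEnabled() {
            return true;
        }
    }

    /** Lowercases the whole URL. */
    public static final Rule LOWERCASE = url -> url.toLowerCase(Locale.ROOT);

    /** Drops a leading "www." or "wwwN." from an http(s) URL. */
    public static final Rule STRIP_WWW =
            url -> url.replaceFirst("^(https?://)www\\d*\\.", "$1");

    /** Applies every enabled rule in list order, each to the previous result. */
    public static String canonicalize(String before, List<Rule> rules) {
        String canonical = before;
        for (Rule rule : rules) {
            if (rule.isEnabled()) {
                canonical = rule.canonicalize(canonical);
            }
        }
        return canonical;
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(LOWERCASE, STRIP_WWW);
        // prints http://example.com/path
        System.out.println(canonicalize("HTTP://WWW2.Example.COM/Path", rules));
    }
}
```

    Because each rule receives the previous rule's output, order matters: in this sketch, running STRIP_WWW before LOWERCASE would leave "www2." in place, since the pattern only matches lowercase input.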

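    The QueueAssignmentPolicy class is the policy that generates the classKey for a URL object. It is likewise an abstract class and provides the method for generating the classKey (the identifier of a work queue is derived from this classKey).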

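    The default classKey policy is the SurtAuthorityQueueAssignmentPolicy implementation, which builds the key string from the URL object's domain name. All URL objects from the same domain therefore share a single classKey, and hence a single work queue.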
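
    The effect of a host-based key can be illustrated with a rough sketch. The exact key format Heritrix produces is not shown in this post; the code below assumes a SURT-style key made of the host labels in reverse order separated by commas, and ignores ports, schemes and the other cases a real policy would have to handle.

```java
import java.net.URI;

public class SurtKeySketch {

    /**
     * Builds an assumed SURT-style key from the host of an absolute URL,
     * e.g. host "www.example.com" becomes "com,example,www,".
     */
    public static String surtAuthority(String url) {
        String host = URI.create(url).getHost(); // non-null for the absolute URLs used here
        String[] labels = host.split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            key.append(labels[i]).append(',');
        }
        return key.toString();
    }

    public static void main(String[] args) {
        // Two URLs on the same host map to the same key, and so to the same queue.
        System.out.println(surtAuthority("http://www.example.com/a/b"));
        System.out.println(surtAuthority("http://www.example.com/c?d=1"));
    }
}
```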

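    We can extend the classKey generation policy. A classic approach is to use the ELFHash algorithm to assign a key to the CrawlURI curi object. As an example, I create a MyQueueAssignmentPolicy class that extends the abstract class QueueAssignmentPolicy; the relevant source is as follows: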

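    public class MyQueueAssignmentPolicy extends QueueAssignmentPolicy {

        private static final long serialVersionUID = 1L;

        @Override
        public String getClassKey(CrawlURI cauri)
        {
            String uri = cauri.getURI().toString();
            long hash = ELFHash(uri); // hash the URI with the ELFHash algorithm
            // modulo 50 yields one of 50 keys, sized here to match 50 threads
            String a = Long.toString(hash % 50);
            return a;
        }

        /** ELF hash of the given string; the result is always non-negative. */
        public long ELFHash(String str)
        {
            long hash = 0;
            long x = 0;
            for(int i = 0; i < str.length(); i++)
            {
                // shift the running hash left by four bits and add the next character
                hash = (hash << 4) + str.charAt(i);
                // if any of bits 28-31 are set, fold them back into the low bits
                if((x = hash & 0xF0000000L) != 0)
                {
                    hash ^= (x >> 24); // XOR the high four bits into bits 4-7
                    hash &= ~x;        // then clear the high four bits
                }
            }
            return (hash & 0x7FFFFFFF);
        }
    }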

    Then, in the crawler-beans.cxml configuration file, we just set the queueAssignmentPolicy property of the FrontierPreparer processor bean to a bean of our MyQueueAssignmentPolicy class.
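
    As a minimal sketch of that change, the bean definition could look like the fragment below. The package name com.example.crawler is hypothetical, and the fragment assumes the preparer picks the policy up through a bean with the id queueAssignmentPolicy; adjust both to match your own build and configuration.

```xml
<!-- QUEUE ASSIGNMENT POLICY: com.example.crawler is a hypothetical package name -->
<bean id="queueAssignmentPolicy"
  class="com.example.crawler.MyQueueAssignmentPolicy" />
```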

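    The UriPrecedencePolicy class is the precedence policy for CrawlURI curi objects. It too is an abstract class, and it declares an abstract method for setting a CrawlURI curi object's precedence.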

    abstract public class UriPrecedencePolicy implements Serializable {
    
        /**
         * Add a precedence value to the supplied CrawlURI, which is being 
         * scheduled onto a frontier queue for the first time. 
         * @param curi CrawlURI to assign a precedence value
         */
        abstract public void uriScheduled(CrawlURI curi);
    
    }

    The default is the CostUriPrecedencePolicy class, which sets the precedence from the CrawlURI curi object's holder cost.

    /**
     * UriPrecedencePolicy which sets a URI's precedence to its 'cost' -- which
     * simulates the in-queue sorting order in Heritrix 1.x, where cost 
     * contributed the same bits to the queue-insert-key that precedence now does.
     */
    public class CostUriPrecedencePolicy extends UriPrecedencePolicy {
        private static final long serialVersionUID = -8164425278358540710L;
    
        /* (non-Javadoc)
         * @see org.archive.crawler.frontier.precedence.UriPrecedencePolicy#uriScheduled(org.archive.crawler.datamodel.CrawlURI)
         */
        @Override
        public void uriScheduled(CrawlURI curi) {
            curi.setPrecedence(curi.getHolderCost()); 
        }
    }

    The policies used by the FrontierPreparer processor bean are configured in the crawler-beans.cxml configuration file as follows:

     <!-- 
       OPTIONAL BEANS
        Uncomment and expand as needed, or if non-default alternate 
        implementations are preferred.
      -->
      
     <!-- CANONICALIZATION POLICY -->
     <bean id="canonicalizationPolicy" 
       class="org.archive.modules.canonicalize.RulesCanonicalizationPolicy">
       <property name="rules">
        <list>
         <bean class="org.archive.modules.canonicalize.LowercaseRule" />
         <bean class="org.archive.modules.canonicalize.StripUserinfoRule" />
         <bean class="org.archive.modules.canonicalize.StripWWWNRule" />
         <bean class="org.archive.modules.canonicalize.StripSessionIDs" />
         <bean class="org.archive.modules.canonicalize.StripSessionCFIDs" />
         <bean class="org.archive.modules.canonicalize.FixupQueryString" />
        </list>
      </property>
     </bean> 
    
     <!-- QUEUE ASSIGNMENT POLICY -->
     <bean id="queueAssignmentPolicy" 
       class="org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy">
      <property name="forceQueueAssignment" value="" />
      <property name="deferToPrevious" value="true" />
      <property name="parallelQueues" value="1" />
     </bean>
     
     <!-- URI PRECEDENCE POLICY -->
     <bean id="uriPrecedencePolicy" 
       class="org.archive.crawler.frontier.precedence.CostUriPrecedencePolicy">
     </bean>
     
     <!-- COST ASSIGNMENT POLICY -->
     <bean id="costAssignmentPolicy" 
       class="org.archive.crawler.frontier.UnitCostAssignmentPolicy">
     </bean>

    ---------------------------------------------------------------------------

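    This Heritrix 3.1.0 Source Code Analysis series is my own original work.

    When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯

    Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/29/3050992.html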

  • Original article: https://www.cnblogs.com/chenying99/p/3050992.html