  • Heritrix 3.1.0 Source Code Analysis (30)

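    Regarding the life cycle of a CrawlURI uri object in the processor chains, I think it logically starts at the FrontierPreparer processor and then continues through the subsequent processors. (Strictly speaking, a CrawlURI's life cycle already begins to take shape when the extractor processors handle its parent CrawlURI, so the life cycles of a parent CrawlURI and its child CrawlURIs are interleaved. I have described the processor flow in earlier posts.)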

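    Only after a CrawlURI uri object has been handled by the FrontierPreparer processor does it move on to the schedule method of the BdbFrontier object, which adds it to a BdbWorkQueue work queue.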

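    This processor mainly initializes the settings of the CrawlURI uri object: it sets the scheduling directive, canonicalizes the URL, generates the classKey, sets the holderCost, and applies the precedence policy, so that the BdbFrontier object can schedule it.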

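    In Heritrix 3.1.0 Source Code Analysis (20), I already mentioned the FrontierPreparer processor while analyzing the processors associated with the CandidateChain candidateChain processor chain, but that post did not analyze what the processor does, so let us review it now.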

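    First, the processor sets the scheduling directive of the CrawlURI curi object, based on the object's pathFromSeed property (the hops from the seed to the current CrawlURI curi; each link type has its own hop code):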

    /**
         * Calculate the coarse, original 'schedulingDirective' prioritization
         * for the given CrawlURI
         * 
         * @param curi
         * @return
         */
        protected int getSchedulingDirective(CrawlURI curi) {
            if(StringUtils.isNotEmpty(curi.getPathFromSeed())) {
                char lastHop = curi.getPathFromSeed().charAt(curi.getPathFromSeed().length()-1);
                if(lastHop == 'R') {
                    // refer
                    return getPreferenceDepthHops() >= 0 ? HIGH : MEDIUM;
                } 
            }
            if (getPreferenceDepthHops() == 0) {
                return HIGH;
                // this implies seed redirects are treated as path
                // length 1, which I belive is standard.
                // curi.getPathFromSeed() can never be null here, because
                // we're processing a link extracted from curi
            } else if (getPreferenceDepthHops() > 0 && 
                curi.getPathFromSeed().length() + 1 <= getPreferenceDepthHops()) {
                return HIGH;
            } else {
                // optionally preferencing embeds up to MEDIUM
                int prefHops = getPreferenceEmbedHops(); 
                if (prefHops > 0) {
                    int embedHops = curi.getTransHops();
                    if (embedHops > 0 && embedHops <= prefHops
                            && curi.getSchedulingDirective() == SchedulingConstants.NORMAL) {
                        // number of embed hops falls within the preferenced range, and
                        // uri is not already MEDIUM -- so promote it
                        return MEDIUM;
                    }
                }
                // Everything else stays as previously assigned
                // (probably NORMAL, at least for now)
                return curi.getSchedulingDirective();
            }
        }
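
    The branching above can be tried outside of Heritrix with a small stand-alone sketch. The class SchedulingDirectiveSketch, its parameter list, and the numeric values of HIGH, MEDIUM and NORMAL are assumptions made for illustration (Heritrix takes the real values from SchedulingConstants); only the order of the checks mirrors the method quoted above.

```java
/** Stand-alone sketch of the branching in getSchedulingDirective(). */
class SchedulingDirectiveSketch {
    // Illustrative values only; the real ones live in SchedulingConstants.
    static final int HIGH = 1;
    static final int MEDIUM = 2;
    static final int NORMAL = 3;

    /**
     * @param pathFromSeed  hop string such as "LLE" (empty for a seed)
     * @param transHops     stand-in for curi.getTransHops()
     * @param current       directive already assigned to the URI
     * @param prefDepthHops stand-in for getPreferenceDepthHops()
     * @param prefEmbedHops stand-in for getPreferenceEmbedHops()
     */
    static int directive(String pathFromSeed, int transHops, int current,
                         int prefDepthHops, int prefEmbedHops) {
        if (!pathFromSeed.isEmpty()
                && pathFromSeed.charAt(pathFromSeed.length() - 1) == 'R') {
            // The last hop has the code 'R'.
            return prefDepthHops >= 0 ? HIGH : MEDIUM;
        }
        if (prefDepthHops == 0) {
            return HIGH;
        }
        if (prefDepthHops > 0 && pathFromSeed.length() + 1 <= prefDepthHops) {
            // Close enough to the seed to be preferred.
            return HIGH;
        }
        if (prefEmbedHops > 0 && transHops > 0 && transHops <= prefEmbedHops
                && current == NORMAL) {
            // Embed within the preferred range: promote to MEDIUM.
            return MEDIUM;
        }
        // Everything else keeps its previous directive.
        return current;
    }

    public static void main(String[] args) {
        // Sample settings: depth preference off (-1), one embed hop preferred.
        System.out.println(directive("LR", 0, NORMAL, -1, 1)); // 2 (MEDIUM)
        System.out.println(directive("LE", 1, NORMAL, -1, 1)); // 2 (MEDIUM)
        System.out.println(directive("LL", 0, NORMAL, -1, 1)); // 3 (NORMAL)
        System.out.println(directive("L", 0, NORMAL, 3, 1));   // 1 (HIGH)
    }
}
```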

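    UriCanonicalizationPolicy, which I will call the URL canonicalization policy class, is an abstract class. It declares the abstract method for canonicalizing a URL, which concrete subclasses implement.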

    /**
     * URI Canonicalizatioon Policy
     * 
     * @contributor stack
     * @contributor gojomo
     */
    public abstract class UriCanonicalizationPolicy {
        public abstract String canonicalize(String uri);
    }

    The RulesCanonicalizationPolicy class extends the abstract class UriCanonicalizationPolicy and implements the canonicalization method.

    /**
     * URI Canonicalizatioon Policy
     * 
     * @contributor stack
     * @contributor gojomo
     */
    public class RulesCanonicalizationPolicy 
        extends UriCanonicalizationPolicy
        implements HasKeyedProperties {
        private static Logger logger =
            Logger.getLogger(RulesCanonicalizationPolicy.class.getName());
        
        protected KeyedProperties kp = new KeyedProperties();
        public KeyedProperties getKeyedProperties() {
            return kp;
        }
        
        {
            setRules(getDefaultRules());
        }
        @SuppressWarnings("unchecked")
        public List<CanonicalizationRule> getRules() {
            return (List<CanonicalizationRule>) kp.get("rules");
        }
        public void setRules(List<CanonicalizationRule> rules) {
            kp.put("rules", rules);
        }
        
        /**
         * Run the passed uuri through the list of rules.
         * @param context Url to canonicalize.
         * @param rules Iterator of canonicalization rules to apply (Get one
         * of these on the url-canonicalizer-rules element in order files or
         * create a list externally).  Rules must implement the Rule interface.
         * @return Canonicalized URL.
         */
        public String canonicalize(String before) {
            String canonical = before;
            if (logger.isLoggable(Level.FINER)) {
                logger.finer("Canonicalizing: "+before);
            }
            for (CanonicalizationRule rule : getRules()) {
                if(rule.getEnabled()) {
                    canonical = rule.canonicalize(canonical);
                }
                if (logger.isLoggable(Level.FINER)) {
                    logger.finer(
                        "Rule " + rule.getClass().getName() + " "
                        + (rule.getEnabled()
                                ? canonical :" (disabled)"));
                }
            }
            return canonical;
        }
        
        /**
         * A reasonable set of default rules to use, if no others are
         * provided by operator configuration.
         */
        public static List<CanonicalizationRule> getDefaultRules() {
            List<CanonicalizationRule> rules = new ArrayList<CanonicalizationRule>(6);
            rules.add(new LowercaseRule());
            rules.add(new StripUserinfoRule());
            rules.add(new StripWWWNRule());
            rules.add(new StripSessionIDs());
            rules.add(new StripSessionCFIDs());
            rules.add(new FixupQueryString());
            return rules;
        }
    }

    The canonicalization method iterates over the list of CanonicalizationRule objects and calls the String canonicalize(String url) method of each member in turn.

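    CanonicalizationRule is an interface that declares the String canonicalize(String url) method. Its implementations include the classes added in the static method List<CanonicalizationRule> getDefaultRules() above. This approach somewhat resembles a combination of the Composite and Iterator patterns, except that the branch node and the leaf nodes do not implement a common interface type.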
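
    The chaining idea can be illustrated with a minimal sketch. The Rule interface and the two rules below are simplified, hypothetical stand-ins (they are not Heritrix's LowercaseRule or StripWWWNRule); the point is only that each rule receives the output of the previous one, so the order of the list matters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class CanonicalizeSketch {
    /** Hypothetical stand-in for Heritrix's CanonicalizationRule. */
    interface Rule {
        String canonicalize(String url);
    }

    /** Simplified rule: lowercase the whole URL. */
    static final Rule LOWERCASE = url -> url.toLowerCase(Locale.ROOT);

    /** Simplified rule: drop a leading "www." right after the scheme. */
    static final Rule STRIP_WWW =
            url -> url.replaceFirst("^(https?://)www\\.", "$1");

    /** Feed the output of each rule into the next one, in list order. */
    static String canonicalize(String before, List<Rule> rules) {
        String canonical = before;
        for (Rule rule : rules) {
            canonical = rule.canonicalize(canonical);
        }
        return canonical;
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(LOWERCASE, STRIP_WWW);
        // prints http://example.com/index.html
        System.out.println(
                canonicalize("HTTP://WWW.Example.COM/Index.html", rules));
    }
}
```

    With the two rules swapped, STRIP_WWW would see the upper-case "HTTP://WWW." and leave it alone, which is why the rule list is ordered.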

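    QueueAssignmentPolicy is the policy that generates the classKey for a URL object. It is also an abstract class and provides the method for generating the classKey (the identifier of a work queue is derived from this classKey).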

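    The default classKey policy is the SurtAuthorityQueueAssignmentPolicy implementation, which generates the key string from the URL's domain name. All URLs of a site under the same domain name therefore share one classKey, which means they share a single work queue.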
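
    The effect of a host-based key can be shown with a rough sketch. It assumes the key is simply the lowercased host, which is a simplification of what SurtAuthorityQueueAssignmentPolicy actually produces; HostQueueSketch and its methods are hypothetical names.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class HostQueueSketch {
    /** Simplified class key: the lowercased host of the URL. */
    static String classKey(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "" : host.toLowerCase(Locale.ROOT);
    }

    /** Group URLs into one queue per class key, keeping insertion order. */
    static Map<String, List<String>> assign(List<String> urls) {
        Map<String, List<String>> queues = new LinkedHashMap<>();
        for (String url : urls) {
            queues.computeIfAbsent(classKey(url), k -> new ArrayList<>())
                  .add(url);
        }
        return queues;
    }

    public static void main(String[] args) {
        Map<String, List<String>> queues = assign(Arrays.asList(
                "http://example.com/a",
                "http://example.com/b",
                "http://example.org/c"));
        // Both example.com URLs end up in the same queue.
        System.out.println(queues);
    }
}
```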

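    We can extend the classKey generation policy. A classic approach is to use the ELFHash algorithm to assign a key to each CrawlURI curi object. As an example, I create a MyQueueAssignmentPolicy class that extends the abstract class QueueAssignmentPolicy. The relevant source code is as follows: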

    import org.archive.crawler.frontier.QueueAssignmentPolicy;
    import org.archive.modules.CrawlURI;

    public class MyQueueAssignmentPolicy extends QueueAssignmentPolicy {

        private static final long serialVersionUID = 1L;

        @Override
        public String getClassKey(CrawlURI cauri) {
            String uri = cauri.getURI().toString();
            // Use the ELFHash algorithm to derive a key from the URI
            long hash = ELFHash(uri);
            // Modulo 50 yields 50 distinct keys, matching 50 threads
            return Long.toString(hash % 50);
        }

        public long ELFHash(String str) {
            long hash = 0;
            long x = 0;
            for (int i = 0; i < str.length(); i++) {
                // Shift the hash left by four bits and add the next character
                hash = (hash << 4) + str.charAt(i);
                // If any of bits 28-31 are set, fold them back into the low bits
                if ((x = hash & 0xF0000000L) != 0) {
                    hash ^= (x >> 24);
                    hash &= ~x; // then clear bits 28-31
                }
            }
            return (hash & 0x7FFFFFFF);
        }
    }
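
    The hashing step can also be tried on its own, without any Heritrix classes. The sketch below copies the same ELFHash logic into a hypothetical ElfHashDemo class and maps a few sample URIs to one of 50 keys; the class name, the queueKey helper and the sample URIs are assumptions for illustration.

```java
class ElfHashDemo {
    /** ELF hash of a string, kept within 28 bits after every step. */
    static long elfHash(String str) {
        long hash = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash << 4) + str.charAt(i);
            long x = hash & 0xF0000000L;
            if (x != 0) {
                hash ^= (x >> 24); // fold the high bits back in
                hash &= ~x;        // and clear them
            }
        }
        return hash & 0x7FFFFFFF;
    }

    /** Map a URI to one of `queues` keys, returned as a decimal string. */
    static String queueKey(String uri, int queues) {
        return Long.toString(elfHash(uri) % queues);
    }

    public static void main(String[] args) {
        String[] uris = {
            "http://example.com/",
            "http://example.com/news/1.html",
            "http://example.com/news/2.html"
        };
        for (String uri : uris) {
            System.out.println(queueKey(uri, 50) + "  " + uri);
        }
    }
}
```

    Because the whole URI is hashed, URIs of one host are spread over up to 50 queues. This gives more threads something to work on, but it probably also means that any per-queue politeness delay no longer limits the request rate to a single host, so the setting should be used with care.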

    Then, in the crawler-beans.cxml configuration file, we only need to set the queueAssignmentPolicy property of the FrontierPreparer processor bean to a bean of our MyQueueAssignmentPolicy class.

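    UriPrecedencePolicy is the precedence policy for CrawlURI curi objects. It is also an abstract class and provides the abstract method for setting the precedence of a CrawlURI curi object.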

    abstract public class UriPrecedencePolicy implements Serializable {
    
        /**
         * Add a precedence value to the supplied CrawlURI, which is being 
         * scheduled onto a frontier queue for the first time. 
         * @param curi CrawlURI to assign a precedence value
         */
        abstract public void uriScheduled(CrawlURI curi);
    
    }

    The default is the CostUriPrecedencePolicy class, which sets the precedence of a CrawlURI curi object according to its holder cost.

    /**
     * UriPrecedencePolicy which sets a URI's precedence to its 'cost' -- which
     * simulates the in-queue sorting order in Heritrix 1.x, where cost 
     * contributed the same bits to the queue-insert-key that precedence now does.
     */
    public class CostUriPrecedencePolicy extends UriPrecedencePolicy {
        private static final long serialVersionUID = -8164425278358540710L;
    
        /* (non-Javadoc)
         * @see org.archive.crawler.frontier.precedence.UriPrecedencePolicy#uriScheduled(org.archive.crawler.datamodel.CrawlURI)
         */
        @Override
        public void uriScheduled(CrawlURI curi) {
            curi.setPrecedence(curi.getHolderCost()); 
        }
    }
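
    How cost and precedence relate can be sketched with a few stand-in types. Uri, unitCost and prepare below are hypothetical simplifications, not Heritrix classes; the sketch assumes a unit cost policy that charges every URI a cost of 1, and it follows the order described earlier (holderCost first, then precedence).

```java
class PrecedenceSketch {
    /** Hypothetical, minimal stand-in for CrawlURI. */
    static class Uri {
        final String url;
        int holderCost;
        int precedence;

        Uri(String url) {
            this.url = url;
        }
    }

    /** Stand-in for a unit cost policy: every URI costs 1. */
    static int unitCost(Uri uri) {
        return 1;
    }

    /** Stand-in for CostUriPrecedencePolicy: precedence = holder cost. */
    static void uriScheduled(Uri uri) {
        uri.precedence = uri.holderCost;
    }

    /** Assign the cost first, then derive the precedence from it. */
    static Uri prepare(String url) {
        Uri uri = new Uri(url);
        uri.holderCost = unitCost(uri);
        uriScheduled(uri);
        return uri;
    }

    public static void main(String[] args) {
        Uri uri = prepare("http://example.com/");
        // With a unit cost, every URI gets the same precedence.
        System.out.println(uri.url + " cost=" + uri.holderCost
                + " precedence=" + uri.precedence);
    }
}
```

    Under these assumptions all URIs receive the same precedence, so the precedence only starts to differentiate URIs once a cost policy with varying costs is configured.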

    The policies used by the FrontierPreparer processor bean are configured in the crawler-beans.cxml configuration file as follows:

     <!-- 
       OPTIONAL BEANS
        Uncomment and expand as needed, or if non-default alternate 
        implementations are preferred.
      -->
      
     <!-- CANONICALIZATION POLICY -->
     <bean id="canonicalizationPolicy" 
       class="org.archive.modules.canonicalize.RulesCanonicalizationPolicy">
       <property name="rules">
        <list>
         <bean class="org.archive.modules.canonicalize.LowercaseRule" />
         <bean class="org.archive.modules.canonicalize.StripUserinfoRule" />
         <bean class="org.archive.modules.canonicalize.StripWWWNRule" />
         <bean class="org.archive.modules.canonicalize.StripSessionIDs" />
         <bean class="org.archive.modules.canonicalize.StripSessionCFIDs" />
         <bean class="org.archive.modules.canonicalize.FixupQueryString" />
        </list>
      </property>
     </bean> 
    
     <!-- QUEUE ASSIGNMENT POLICY -->
     <bean id="queueAssignmentPolicy" 
       class="org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy">
      <property name="forceQueueAssignment" value="" />
      <property name="deferToPrevious" value="true" />
      <property name="parallelQueues" value="1" />
     </bean>
     
     <!-- URI PRECEDENCE POLICY -->
     <bean id="uriPrecedencePolicy" 
       class="org.archive.crawler.frontier.precedence.CostUriPrecedencePolicy">
     </bean>
     
     <!-- COST ASSIGNMENT POLICY -->
     <bean id="costAssignmentPolicy" 
       class="org.archive.crawler.frontier.UnitCostAssignmentPolicy">
     </bean>

    ---------------------------------------------------------------------------

    This Heritrix 3.1.0 Source Code Analysis series is my original work.

    Please credit the source when reposting: 博客园 (cnblogs), 刺猬的温驯

    Article link: http://www.cnblogs.com/chenying99/archive/2013/04/29/3050992.html

  • Original article: https://www.cnblogs.com/chenying99/p/3050992.html