  • Heritrix 3.1.0 Source Code Analysis (Part 26)


    The classes in the Heritrix 3.1.0 package org.archive.modules.credential all deal with the credentials used to authenticate requests.



        KeyedProperties kp = new KeyedProperties();
        public KeyedProperties getKeyedProperties() {
            return kp;
        }

        /**
         * Credentials used by heritrix authenticating. See
         * http://crawler.archive.org/proposals/auth/ for background.
         *
         * @see http://crawler.archive.org/proposals/auth/
         */
        {
            setCredentials(new HashMap<String, Credential>());
        }
        public Map<String,Credential> getCredentials() {
            return (Map<String,Credential>) kp.get("credentials");
        }
        public void setCredentials(Map<String,Credential> map) {
            kp.put("credentials", map);
        }

        /**
         * List of possible credential types as a List.
         *
         * These types are inner classes of this credential type so they cannot
         * be created without their being associated with a credential list.
         */
        private static final List<Class<?>> credentialTypes;
        // Initialize the credentialType data member.
        static {
            // Array of all known credential types.
            Class<?> [] tmp = {HtmlFormCredential.class, HttpAuthenticationCredential.class};
            credentialTypes = Collections.unmodifiableList(Arrays.asList(tmp));
        }

        /**
         * Constructor.
         */
        public CredentialStore() {
        }

        /**
         * @return Unmodifiable list of credential types.
         */
        public static List<Class<?>> getCredentialTypes() {
            return CredentialStore.credentialTypes;
        }

        /**
         * @param context Pass a ProcessorURI.  Used to set
         * context.
         * @return An iterator or null.
         */
        public Collection<Credential> getAll() {
            Map<String,Credential> map = getCredentials();
            return map.values();
        }

        /**
         * @param context  Used to set context.
         * @param name Name to give the manufactured credential.  Should be unique
         * else the add of the credential to the list of credentials will fail.
         * @return Returns <code>name</code>'d credential.
         * @throws AttributeNotFoundException
         * @throws MBeanException
         * @throws ReflectionException
         */
        public Credential get(/*StateProvider*/Object context, String name) {
            return getCredentials().get(name);
        }

        /**
         * Return set made up of all credentials of the passed
         * <code>type</code>.
         *
         * @param context  Used to set context.
         * @param type Type of the list to return.  Type is some superclass of
         * credentials.
         * @param rootUri RootUri to match.  May be null.  In this case we return
         * all.  Currently we expect the CrawlServer name to equate to root Uri.
         * It's not.  Currently it doesn't distinguish between servers of same name
         * but different ports (e.g. http and https).
         * @return Unmodifiable sublist of all elements of passed type.
         */
        public Set<Credential> subset(CrawlURI context, Class<?> type, String rootUri) {
            Set<Credential> result = null;
            for (Credential c: getAll()) {
                // Skip credentials that are not of the requested type.
                if (!type.isInstance(c)) {
                    continue;
                }
                // When a root URI is given, skip credentials for other domains.
                if (rootUri != null) {
                    String cd = c.getDomain();
                    if (cd == null) {
                        continue;
                    }
                    if (!rootUri.equalsIgnoreCase(cd)) {
                        continue;
                    }
                }
                // Create the result set lazily, on the first match.
                if (result == null) {
                    result = new HashSet<Credential>();
                }
                result.add(c);
            }
            return result;
        }


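    (Note that the final subset method does not appear to use its CrawlURI context parameter. It returns only the credentials of the given type, restricted to the given domain when rootUri is not null, and it returns null when nothing matches.)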
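    The filtering rule of subset can be tried in isolation. The following is a minimal sketch, not the Heritrix code: the classes Cred, FormCred and HttpAuthCred are hypothetical stand-ins for Credential and its two subclasses, and the list argument replaces the store's getAll().

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SubsetSketch {
    /** Hypothetical stand-in for Credential: only the domain matters here. */
    static class Cred {
        final String domain;
        Cred(String domain) { this.domain = domain; }
    }
    static class FormCred extends Cred {
        FormCred(String domain) { super(domain); }
    }
    static class HttpAuthCred extends Cred {
        HttpAuthCred(String domain) { super(domain); }
    }

    /**
     * Keeps credentials of the given type; when rootUri is non-null, also
     * requires a case-insensitive domain match. Returns null if nothing matches.
     */
    static Set<Cred> subset(List<Cred> all, Class<?> type, String rootUri) {
        Set<Cred> result = null;
        for (Cred c : all) {
            if (!type.isInstance(c)) {
                continue;
            }
            // equalsIgnoreCase(null) is false, so a null domain never matches.
            if (rootUri != null && !rootUri.equalsIgnoreCase(c.domain)) {
                continue;
            }
            if (result == null) {
                result = new HashSet<Cred>();
            }
            result.add(c);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Cred> all = new ArrayList<Cred>();
        all.add(new FormCred("example.com"));
        all.add(new FormCred("other.org"));
        all.add(new HttpAuthCred("example.com"));
        Set<Cred> forms = subset(all, FormCred.class, "EXAMPLE.COM");
        System.out.println("form credentials for example.com: " + forms.size());
        System.out.println("no match gives null: " + (subset(all, FormCred.class, "nomatch.org") == null));
    }
}
```

    Callers of a method shaped like this have to check for null before iterating, because an empty result is signalled by null rather than by an empty set.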

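    As its static initializer block shows, the system provides two credential types, HtmlFormCredential.class and HttpAuthenticationCredential.class. The former is used for HTML form authentication, the latter for Basic/Digest HTTP authentication.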


        /**
         * The root domain this credential goes against: E.g. www.archive.org
         */
        String domain = "";

        /**
         * @param context Context to use when searching for credential domain.
         * @return The domain/root URI this credential is to go against.
         * @throws AttributeNotFoundException If attribute not found.
         */
        public String getDomain() {
            return this.domain;
        }
        public void setDomain(String domain) {
            this.domain = domain;
        }

        /**
         * Adds this credential to the given CrawlURI.
         * Attach this credentials avatar to the passed <code>curi</code> .
         * Override if credential knows internally what it wants to attach as
         * payload.  Otherwise, if payload is external, use the below
         * {@link #attach(CrawlURI, String)}.
         * @param curi CrawlURI to load with credentials.
         */
        public void attach(CrawlURI curi) {
            curi.getCredentials().add(this);
        }

        /**
         * Removes this credential from the given CrawlURI.
         * Detach this credential from passed curi.
         * @param curi
         * @return True if we detached a Credential reference.
         */
        public boolean detach(CrawlURI curi) {
            return curi.getCredentials().remove(this);
        }

        /**
         * Removes all credentials of this type from the given CrawlURI.
         * Detach all credentials of this type from passed curi.
         * @param curi
         * @return True if we detached references.
         */
        public boolean detachAll(CrawlURI curi) {
            boolean result = false;
            Iterator<Credential> iter = curi.getCredentials().iterator();
            while (iter.hasNext()) {
                Credential cred = iter.next();
                if (cred.getClass() ==  this.getClass()) {
                    iter.remove();
                    result = true;
                }
            }
            return result;
        }

        /**
         * Determines whether the given CrawlURI is itself the prerequisite
         * (for example, the login page) of this credential.
         * @param curi CrawlURI to look at.
         * @return True if this credential IS a prerequisite for passed
         * CrawlURI.
         */
        public abstract boolean isPrerequisite(CrawlURI curi);

        /**
         * Determines whether an authentication URI exists for the given CrawlURI.
         * @param curi CrawlURI to look at.
         * @return True if this credential HAS a prerequisite for passed CrawlURI.
         */
        public abstract boolean hasPrerequisite(CrawlURI curi);

        /**
         * Returns the authentication URI for the given CrawlURI.
         * Return the authentication URI, either absolute or relative, that serves
         * as prerequisite the passed <code>curi</code>.
         * @param curi CrawlURI to look at.
         * @return Prerequisite URI for the passed curi.
         */
        public abstract String getPrerequisite(CrawlURI curi);

        /**
         * Returns the key that identifies this credential.
         * @param context Context to use when searching for credential domain.
         * @return Key that is unique to this credential type.
         * @throws AttributeNotFoundException
         */
        public abstract String getKey();

        /**
         * Determines whether this credential must be offered on every request.
         * @return True if this credential is of the type that needs to be offered
         * on each visit to the server (e.g. Rfc2617 is such a type).
         */
        public abstract boolean isEveryTime();

        /**
         * Adds authentication parameters to the given HttpMethod.
         * @param curi CrawlURI to ask for context.
         * @param http Instance of httpclient.
         * @param method Method to populate.
         * @return True if added a credentials.
         */
        public abstract boolean populate(CrawlURI curi, HttpClient http,
            HttpMethod method);

        /**
         * @param curi CrawlURI to look at.
         * @return True if this credential is to be posted.  Return false if the
         * credential is to be GET'd or if POST'd or GET'd are not pertinent to this
         * credential type.
         */
        public abstract boolean isPost();

        /**
         * Checks whether the name of the CrawlServer for the given CrawlURI
         * matches the domain of this credential (used to exclude CrawlURIs
         * that do not need this credential).
         * Test passed curi matches this credentials rootUri.
         * @param controller
         * @param curi CrawlURI to test.
         * @return True if domain for credential matches that of the passed curi.
         */
        public boolean rootUriMatch(ServerCache cache, 
                CrawlURI curi) {
            String cd = getDomain();
            CrawlServer serv = cache.getServerFor(curi.getUURI());
            String serverName = serv.getName();
    //        String serverName = controller.getServerCache().getServerFor(curi).
    //            getName();
            logger.fine("RootURI: Comparing " + serverName + " " + cd);
            return cd != null && serverName != null &&
                serverName.equalsIgnoreCase(cd);
        }

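    The methods above add the current credential to a CrawlURI, remove it again, add credential parameters to an HttpMethod object, and check whether the domain of the CrawlURI matches the domain of the credential, among other things.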
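    The attach/detach bookkeeping and the domain check can be illustrated without Heritrix on the classpath. This is a minimal sketch under stated assumptions: Uri is a hypothetical stand-in for CrawlURI that only holds a credential set, Cred stands in for Credential, and the case-insensitive comparison in domainMatches is an assumption chosen to mirror the equalsIgnoreCase call in CredentialStore.subset.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

class CredentialLifecycleSketch {
    /** Hypothetical stand-in for CrawlURI: just a holder for credentials. */
    static class Uri {
        final Set<Cred> credentials = new HashSet<Cred>();
    }

    /** Hypothetical stand-in for the abstract Credential class. */
    static abstract class Cred {
        String domain = "";

        void attach(Uri uri) {
            uri.credentials.add(this);
        }

        boolean detach(Uri uri) {
            return uri.credentials.remove(this);
        }

        /** Removes every credential whose class is exactly this class. */
        boolean detachAll(Uri uri) {
            boolean result = false;
            Iterator<Cred> iter = uri.credentials.iterator();
            while (iter.hasNext()) {
                if (iter.next().getClass() == this.getClass()) {
                    iter.remove(); // safe removal while iterating
                    result = true;
                }
            }
            return result;
        }

        /** Assumed comparison: server name equals domain, ignoring case. */
        boolean domainMatches(String serverName) {
            return domain != null && serverName != null
                && serverName.equalsIgnoreCase(domain);
        }
    }

    static class FormCred extends Cred { }
    static class HttpAuthCred extends Cred { }

    public static void main(String[] args) {
        Uri uri = new Uri();
        Cred form1 = new FormCred();
        Cred form2 = new FormCred();
        Cred auth = new HttpAuthCred();
        form1.attach(uri);
        form2.attach(uri);
        auth.attach(uri);
        System.out.println("removed form credentials: " + form1.detachAll(uri));
        System.out.println("credentials left: " + uri.credentials.size());
    }
}
```

    Removing through the iterator matters here: calling remove on the set itself inside the loop would throw a ConcurrentModificationException on the next iteration step.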

    HtmlFormCredential extends the Credential class above and provides HTML form authentication for a CrawlURI. The relevant methods are implemented as follows:

        /**
         * Full URI of page that contains the HTML login form we're to apply these
         * credentials too: E.g. http://www.archive.org
         */
        String loginUri = "";
        public String getLoginUri() {
            return this.loginUri;
        }
        public void setLoginUri(String loginUri) {
            this.loginUri = loginUri;
        }

        /**
         * Form items.
         */
        Map<String,String> formItems = new HashMap<String,String>();
        public Map<String,String> getFormItems() {
            return this.formItems;
        }
        public void setFormItems(Map<String,String> formItems) {
            this.formItems = formItems;
        }

        enum Method {
            GET,
            POST
        }

        /**
         * GET or POST.
         */
        Method httpMethod = Method.POST;
        public Method getHttpMethod() {
            return this.httpMethod;
        }
        public void setHttpMethod(Method method) {
            this.httpMethod = method; 
        }

        /**
         * Constructor.
         */
        public HtmlFormCredential() {
        }

        public boolean isPrerequisite(final CrawlURI curi) {
            boolean result = false;
            String curiStr = curi.getUURI().toString();
            String loginUri = getPrerequisite(curi);
            if (loginUri != null) {
                try {
                    // login URL
                    UURI uuri = UURIFactory.getInstance(curi.getUURI(), loginUri);
                    if (uuri != null && curiStr != null &&
                            uuri.toString().equals(curiStr)) {
                        result = true;
                        if (!curi.isPrerequisite()) {
                            curi.setPrerequisite(true);
                            logger.fine(curi + " is prereq.");
                        }
                    }
                } catch (URIException e) {
                    logger.severe("Failed to uuri: " + curi + ", " +
                        e.getMessage());
                }
            }
            return result;
        }

        public boolean hasPrerequisite(CrawlURI curi) {
            return getPrerequisite(curi) != null;
        }

        public String getPrerequisite(CrawlURI curi) {
            return getLoginUri();
        }

        public String getKey() {
            return getLoginUri();
        }

        public boolean isEveryTime() {
            // This authentication is one time only.
            return false;
        }

        public boolean populate(CrawlURI curi, HttpClient http, HttpMethod method) {
            // http is not used
            boolean result = false;
            Map<String,String> formItems = getFormItems();
            if (formItems == null || formItems.size() <= 0) {
                try {
                    logger.severe("No form items for " + method.getURI());
                } catch (URIException e) {
                    logger.severe("No form items and exception getting uri: " +
                        e.getMessage());
                }
                return result;
            }

            NameValuePair[] data = new NameValuePair[formItems.size()];
            int index = 0;
            String key = null;
            for (Iterator<String> i = formItems.keySet().iterator(); i.hasNext();) {
                key = i.next();
                data[index++] = new NameValuePair(key, (String)formItems.get(key));
            }
            if (method instanceof PostMethod) {
                ((PostMethod)method).setRequestBody(data);
                result = true;
            } else if (method instanceof GetMethod) {
                // Append these values to the query string.
                // Get current query string, then add data, then get it again
                // only this time its our data only... then append.
                HttpMethodBase hmb = (HttpMethodBase)method;
                String currentQuery = hmb.getQueryString();
                hmb.setQueryString(data);
                String newQuery = hmb.getQueryString();
                hmb.setQueryString(
                    ((StringUtils.isNotEmpty(currentQuery))
                        ? currentQuery + "&"
                        : "")
                    + newQuery);
                result = true;
            } else {
                logger.severe("Unknown method type: " + method);
            }
            return result;
        }

        public boolean isPost() {
            return Method.POST.equals(getHttpMethod());
        }

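    The purpose of each of these methods is already described in the comments on the corresponding abstract methods of Credential above, so I will not repeat it here.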
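    The GET branch of populate appends the form items to whatever query string the request already has. The following minimal sketch shows that step with JDK classes only. The helper names encode and appendToQuery are hypothetical, and URLEncoder is used as a stand-in for the encoding that HttpClient's setQueryString performs, so the exact escaping may differ from what Heritrix sends.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class FormQuerySketch {
    /** Encodes form items as name=value pairs joined by '&'. */
    static String encode(Map<String, String> items) {
        StringBuilder sb = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : items.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(URLEncoder.encode(e.getKey(), "UTF-8"));
                sb.append('=');
                sb.append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }

    /** Keeps the existing query, if any, and appends the form items. */
    static String appendToQuery(String currentQuery, Map<String, String> items) {
        String newQuery = encode(items);
        boolean hasCurrent = currentQuery != null && !currentQuery.isEmpty();
        return (hasCurrent ? currentQuery + "&" : "") + newQuery;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output is predictable.
        Map<String, String> items = new LinkedHashMap<String, String>();
        items.put("login", "mylogin");
        items.put("password", "my pass");
        System.out.println(appendToQuery("a=1", items));
        System.out.println(appendToQuery(null, items));
    }
}
```

    HtmlFormCredential stores its form items in a HashMap, so the order of the parameters in the real request is not guaranteed; the sketch uses a LinkedHashMap only to make the result reproducible.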

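    In addition, the HttpAuthenticationCredential class provides Basic/Digest HTTP authentication. I will not analyze its source in detail; it is easy to understand by comparing it with the HtmlFormCredential class.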
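    For readers unfamiliar with Basic authentication, the header that such a credential ultimately leads to can be built by hand. This is a sketch of the scheme itself, not of HttpAuthenticationCredential, which hands the login and password to HttpClient rather than building headers; the class and method names below are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class BasicAuthSketch {
    /** Builds the value of an Authorization header for the Basic scheme. */
    static String basicAuthorization(String login, String password) {
        String pair = login + ":" + password;
        byte[] bytes = pair.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        // Example user and password taken from the HTTP authentication RFC.
        System.out.println(basicAuthorization("Aladdin", "open sesame"));
        // prints "Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=="
    }
}
```

    Because the header carries the credentials on every request, this kind of credential is the "offered on each visit" case that the isEveryTime comment mentions, in contrast to the one-time form login.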


    <bean id="credentialStore"
          class="org.archive.modules.credential.CredentialStore">
        <property name="credentials">
            <map>
                <entry key="formCredential" value-ref="formCredential" />
            </map>
        </property>
    </bean>

    <bean id="formCredential"
          class="org.archive.modules.credential.HtmlFormCredential">
        <property name="domain" value="example.com" /> 
        <property name="loginUri" value="http://example.com/login"/> 
        <property name="formItems">
            <map>
                <entry key="login" value="mylogin"/>
                <entry key="password" value="mypassword"/>
                <entry key="submit" value="submit"/>
            </map>
        </property>
    </bean>

    <bean id="credential"
          class="org.archive.modules.credential.HttpAuthenticationCredential">
        <property name="domain"><value>domain</value></property> 
        <property name="realm"><value>myrealm</value></property> 
        <property name="login"><value>mylogin</value></property> 
        <property name="password"><value>mypassword</value></property> 
    </bean>


    This Heritrix 3.1.0 source code analysis series is my own original work.

    When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯

    Article link: http://www.cnblogs.com/chenying99/archive/2013/04/28/3049042.html

  • Original article: https://www.cnblogs.com/chenying99/p/3049042.html