  • Configuring Nutch to Simulate a Browser and Bypass Anti-Crawler Restrictions

    Original link: http://yangshangchuan.iteye.com/blog/2030741

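    When we configure Nutch to crawl http://yangshangchuan.iteye.com, the content of every fetched page is: "Your access request was denied ......". This is the simplest anti-crawler strategy: the site just reads the value of the HTTP request header User-Agent to decide whether the client is a person (a browser) or a crawler. To get around this restriction, we only need to configure Nutch to simulate a web browser.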
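To illustrate the kind of check described above, here is a minimal sketch of a User-Agent filter as a server might implement it. The actual logic used by the target site is unknown; the class name `UserAgentFilterSketch`, the method `looksLikeCrawler`, and the marker list are all hypothetical and chosen only for illustration.

```java
import java.util.Locale;

// Hypothetical server-side filter: NOT the real site's logic, only an
// illustration of a check that looks at nothing but the User-Agent header.
final class UserAgentFilterSketch {

    // Assumed substrings that a naive filter might treat as crawler markers.
    private static final String[] BOT_MARKERS = {"nutch", "bot", "crawler", "spider"};

    // Returns true when the header is missing or contains a crawler marker.
    static boolean looksLikeCrawler(String userAgent) {
        if (userAgent == null || userAgent.trim().isEmpty()) {
            return true;
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        for (String marker : BOT_MARKERS) {
            if (ua.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A crawler-style header and a browser-style header.
        System.out.println(looksLikeCrawler("MyCrawler/Nutch-1.7"));
        System.out.println(looksLikeCrawler(
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0"));
    }
}
```

A filter of this kind never verifies that the client really is a browser, which is why sending a browser-style header is enough to pass it.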

     

    Five settings in nutch-default.xml are related to User-Agent:

     

    <property>
      <name>http.agent.description</name>
      <value></value>
      <description>Further description of our bot- this text is used in
      the User-Agent header.  It appears in parenthesis after the agent name.
      </description>
    </property>
    <property>
      <name>http.agent.url</name>
      <value></value>
      <description>A URL to advertise in the User-Agent header.  This will 
       appear in parenthesis after the agent name. Custom dictates that this
       should be a URL of a page explaining the purpose and behavior of this
       crawler.
      </description>
    </property>
    <property>
      <name>http.agent.email</name>
      <value></value>
      <description>An email address to advertise in the HTTP 'From' request
       header and User-Agent header. A good practice is to mangle this
       address (e.g. 'info at example dot com') to avoid spamming.
      </description>
    </property>
    <property>
      <name>http.agent.name</name>
      <value></value>
      <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
      please set this to a single word uniquely related to your organization.
      NOTE: You should also check other related properties:
    	http.robots.agents
    	http.agent.description
    	http.agent.url
    	http.agent.email
    	http.agent.version
      and set their values appropriately.
      </description>
    </property>
    <property>
      <name>http.agent.version</name>
      <value>Nutch-1.7</value>
      <description>A version string to advertise in the User-Agent 
       header.</description>
    </property>

     

    The class nutch1.7/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java shows how these five settings are combined into the User-Agent:

     

    this.userAgent = getAgentString( conf.get("http.agent.name"), 
            conf.get("http.agent.version"), 
            conf.get("http.agent.description"), 
            conf.get("http.agent.url"), 
            conf.get("http.agent.email") );

     

      private static String getAgentString(String agentName,
                                           String agentVersion,
                                           String agentDesc,
                                           String agentURL,
                                           String agentEmail) {
        
        if ( (agentName == null) || (agentName.trim().length() == 0) ) {
          // TODO : NUTCH-258
          if (LOGGER.isErrorEnabled()) {
            LOGGER.error("No User-Agent string set (http.agent.name)!");
          }
        }
        
        StringBuffer buf= new StringBuffer();
        
        buf.append(agentName);
        if (agentVersion != null) {
          buf.append("/");
          buf.append(agentVersion);
        }
        if ( ((agentDesc != null) && (agentDesc.length() != 0))
        || ((agentEmail != null) && (agentEmail.length() != 0))
        || ((agentURL != null) && (agentURL.length() != 0)) ) {
          buf.append(" (");
          
          if ((agentDesc != null) && (agentDesc.length() != 0)) {
            buf.append(agentDesc);
            if ( (agentURL != null) || (agentEmail != null) )
              buf.append("; ");
          }
          
          if ((agentURL != null) && (agentURL.length() != 0)) {
            buf.append(agentURL);
            if (agentEmail != null)
              buf.append("; ");
          }
          
          if ((agentEmail != null) && (agentEmail.length() != 0))
            buf.append(agentEmail);
          
          buf.append(")");
        }
        return buf.toString();
      }
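To make the resulting format concrete, the following standalone sketch mirrors the concatenation logic above without depending on Nutch. The class name `AgentStringSketch`, the helper `hasText`, and the sample values in `main` are illustrative assumptions, not part of Nutch.

```java
// Standalone sketch of how the User-Agent string is assembled.
// AgentStringSketch and all sample values are hypothetical.
final class AgentStringSketch {

    private static boolean hasText(String s) {
        return s != null && !s.isEmpty();
    }

    // Follows the same branching as getAgentString: name, then "/" + version,
    // then an optional " (description; url; email)" suffix.
    static String build(String name, String version,
                        String desc, String url, String email) {
        StringBuilder buf = new StringBuilder();
        buf.append(name);
        if (version != null) {
            buf.append('/').append(version);
        }
        if (hasText(desc) || hasText(url) || hasText(email)) {
            buf.append(" (");
            if (hasText(desc)) {
                buf.append(desc);
                if (url != null || email != null) {
                    buf.append("; ");
                }
            }
            if (hasText(url)) {
                buf.append(url);
                if (email != null) {
                    buf.append("; ");
                }
            }
            if (hasText(email)) {
                buf.append(email);
            }
            buf.append(')');
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // prints: MyCrawler/Nutch-1.7
        System.out.println(build("MyCrawler", "Nutch-1.7", "", "", ""));
        // prints: MyCrawler/Nutch-1.7 (test crawler; http://example.com/bot; bot at example dot com)
        System.out.println(build("MyCrawler", "Nutch-1.7", "test crawler",
                "http://example.com/bot", "bot at example dot com"));
    }
}
```

The point to notice is that the name and version are joined by a single "/", and nothing else is appended when the description, URL, and email are all empty.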

     

    The class nutch1.7/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java sends this value as the User-Agent request header; the userAgent returned by http.getUserAgent() here is the userAgent built in HttpBase.java:

     

    String userAgent = http.getUserAgent();
    if ((userAgent == null) || (userAgent.length() == 0)) {
    	if (Http.LOG.isErrorEnabled()) { Http.LOG.error("User-agent is not set!"); }
    } else {
    	reqStr.append("User-Agent: ");
    	reqStr.append(userAgent);
    	reqStr.append("\r\n");
    }
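As a rough illustration of where this header ends up, the sketch below assembles a simplified raw HTTP request in which each header line is terminated by CRLF. It is not Nutch's actual request-building code; the class `RequestHeaderSketch`, the method `buildRequest`, and the host and path values are assumptions made for the example.

```java
// Simplified sketch of a raw HTTP request carrying a User-Agent header.
// RequestHeaderSketch and its sample arguments are hypothetical.
final class RequestHeaderSketch {

    static String buildRequest(String host, String path, String userAgent) {
        StringBuilder req = new StringBuilder();
        req.append("GET ").append(path).append(" HTTP/1.0\r\n");
        req.append("Host: ").append(host).append("\r\n");
        // Like the snippet above, omit the header when no agent is configured.
        if (userAgent != null && !userAgent.isEmpty()) {
            req.append("User-Agent: ").append(userAgent).append("\r\n");
        }
        // An empty line marks the end of the header section.
        req.append("\r\n");
        return req.toString();
    }

    public static void main(String[] args) {
        System.out.print(buildRequest("example.com", "/", "MyCrawler/Nutch-1.7"));
    }
}
```

Whatever string the configuration produces is sent verbatim on the `User-Agent:` line, so the server sees exactly what the five properties were combined into.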

     

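    From the analysis above, adding any one of the following configurations to nutch-site.xml is enough to imitate a specific browser. Because http.agent.name and http.agent.version are joined with a "/", each browser's User-Agent string is split at a "/" between the two properties, while http.agent.description, http.agent.url, and http.agent.email stay empty so that nothing is appended: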

     

    1. Simulate Firefox:

     

    <property>
    	<name>http.agent.name</name>
    	<value>Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko</value>
    </property>
    <property>
    	<name>http.agent.version</name>
    	<value>20100101 Firefox/27.0</value>
    </property>

     

    2. Simulate Internet Explorer:

     

    <property>
    	<name>http.agent.name</name>
    	<value>Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident</value>
    </property>
    <property>
    	<name>http.agent.version</name>
    	<value>6.0)</value>
    </property>

     

    3. Simulate Chrome:

     

    <property>
    	<name>http.agent.name</name>
    	<value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari</value>
    </property>
    <property>
    	<name>http.agent.version</name>
    	<value>537.36</value>
    </property>

     

    4. Simulate Safari:

     

    <property>
    	<name>http.agent.name</name>
    	<value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari</value>
    </property>
    <property>
    	<name>http.agent.version</name>
    	<value>534.57.2</value>
    </property>

     

     

    5. Simulate Opera:

     

    <property>
    	<name>http.agent.name</name>
    	<value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36 OPR</value>
    </property>
    <property>
    	<name>http.agent.version</name>
    	<value>19.0.1326.59</value>
    </property>
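The same approach extends to other browsers. The sketch below takes a full User-Agent string and splits it at its last "/" into a name/version pair, then prints the two properties. The class `AgentConfigSketch` and the method `split` are hypothetical helpers, not part of Nutch, and the sketch assumes the description, URL, and email properties remain empty. Any "/" in the string would work as the split point; the last one is simply the easiest to find.

```java
// Hypothetical helper that derives http.agent.name and http.agent.version
// from a complete browser User-Agent string.
final class AgentConfigSketch {

    // Returns {name, version}; name + "/" + version reproduces the input.
    static String[] split(String fullUserAgent) {
        int slash = fullUserAgent.lastIndexOf('/');
        if (slash <= 0 || slash == fullUserAgent.length() - 1) {
            throw new IllegalArgumentException("no usable '/' in: " + fullUserAgent);
        }
        return new String[] {
            fullUserAgent.substring(0, slash),
            fullUserAgent.substring(slash + 1)
        };
    }

    public static void main(String[] args) {
        String ua = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36";
        String[] parts = split(ua);
        // Print the two properties in nutch-site.xml form.
        System.out.println("<property>\n\t<name>http.agent.name</name>\n\t<value>"
                + parts[0] + "</value>\n</property>");
        System.out.println("<property>\n\t<name>http.agent.version</name>\n\t<value>"
                + parts[1] + "</value>\n</property>");
    }
}
```

For the Chrome string used in `main`, the split yields the same name and version values as configuration 3 above.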

     

     

    Postscript: ways to view your User-Agent:

    1. http://www.useragentstring.com

    2. http://whatsmyuseragent.com

    3. http://www.enhanceie.com/ua.aspx

     

  • Source: https://www.cnblogs.com/zhjsll/p/4443851.html