Limiting frequent spider crawling with nginx

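Crawl volume from spiders spiked and drove the server load very high. In the end I used nginx's ngx_http_limit_req_module to limit how often Baiduspider can crawl: Baiduspider is allowed 200 requests per minute, and requests beyond that get a 503.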

nginx configuration:

# Global configuration (http block)
limit_req_zone $anti_spider zone=anti_spider:60m rate=200r/m;

# Inside a server block
limit_req zone=anti_spider burst=5 nodelay;
if ($http_user_agent ~* "baiduspider") {
    set $anti_spider $http_user_agent;
}

# Reference for limiting other crawlers
if ($http_user_agent ~* "qihoobot|Mediapartners-Google|Adsbot-Google|Feedfetcher-Google|Yahoo! Slurp|Yahoo! Slurp China|YoudaoBot|Sosospider|Sogou spider|Sogou web spider|MSNBot|ia_archiver|Tomato Bot") {
    set $anti_spider $http_user_agent;
}

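Parameter notes:
rate=200r/m in the limit_req_zone directive means requests for a given key are processed at an average rate of 200 per minute. nginx tracks this at millisecond granularity, so it works out to roughly one request every 300 ms rather than any 200 requests within a minute.
burst=5 in the limit_req directive is not a concurrency limit. It lets up to 5 requests in excess of the rate be accepted before nginx starts rejecting.
nodelay in the limit_req directive means requests within the burst allowance are served immediately instead of being delayed to match the rate. Requests beyond burst are rejected with 503 whether or not nodelay is set.
The if blocks check whether the request's user agent is Baiduspider's. If it is, the variable $anti_spider is assigned a value. For every other request the variable stays empty, and requests with an empty key are not accounted by limit_req_zone, so only Baiduspider is limited.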
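
The key selection done by the if block can be modeled outside nginx to make the "empty key means no limit" behavior concrete. The following is a minimal Python sketch, not nginx code: the function name limit_key and the sample user agent strings are made up for illustration, and only the "baiduspider" pattern comes from the configuration above.

```python
import re

# Case-insensitive pattern, mirroring `~* "baiduspider"` in the config.
BAIDU_RE = re.compile(r"baiduspider", re.IGNORECASE)


def limit_key(user_agent: str) -> str:
    """Return the rate-limit key for a request (hypothetical helper).

    A matching user agent becomes the key, like `set $anti_spider
    $http_user_agent;`. Anything else yields an empty key, which
    limit_req_zone does not account.
    """
    if BAIDU_RE.search(user_agent or ""):
        return user_agent
    return ""


if __name__ == "__main__":
    spider = "Mozilla/5.0 (compatible; Baiduspider/2.0)"
    browser = "Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"
    print(repr(limit_key(spider)))   # the full user agent string
    print(repr(limit_key(browser)))  # empty string: not limited
```

Because the key is the whole user agent string, each distinct spider user agent string gets its own bucket in the shared zone.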

For detailed parameter descriptions, see the official documentation.
http://nginx.org/en/docs/http/ngx_http_limit_req_module.html#limit_req_zone

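This module limits requests using the leaky bucket algorithm.
For details on the leaky bucket algorithm, see http://baike.baidu.com/view/2054741.htm
For the relevant code, see the nginx source file src/http/modules/ngx_http_limit_req_module.c
The core of the code is the ngx_http_limit_req_lookup function.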
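
To make the leaky bucket idea concrete, here is a simplified Python model of a single bucket with rate=200r/m and burst=5 in nodelay mode. It is a sketch under assumptions, not a port of ngx_http_limit_req_lookup: the class and field names are invented, and only the general approach (integer "excess" in thousandths of a request, drained by elapsed milliseconds) follows what the nginx source does.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeakyBucket:
    """Simplified model of one limit_req bucket for one key (hypothetical)."""

    rate_per_min: int               # e.g. 200 for rate=200r/m
    burst: int                      # e.g. 5 for burst=5
    excess_milli: int = 0           # requests above the rate, in 1/1000 units
    last_ms: Optional[int] = None   # arrival time of the last accepted request

    def allow(self, now_ms: int) -> bool:
        """Return True if a request arriving at now_ms is accepted."""
        if self.last_ms is None:
            # First request for this key starts with an empty bucket.
            self.last_ms = now_ms
            self.excess_milli = 0
            return True

        # Requests per second, scaled by 1000 to stay in integer math.
        rate_milli = self.rate_per_min * 1000 // 60
        elapsed_ms = max(0, now_ms - self.last_ms)

        # The bucket drains with time; this request adds one whole unit.
        drained = rate_milli * elapsed_ms // 1000
        excess = max(0, self.excess_milli - drained + 1000)

        if excess > self.burst * 1000:
            # Over the burst allowance: nginx would answer 503 here.
            # State is left untouched for rejected requests.
            return False

        self.excess_milli = excess
        self.last_ms = now_ms
        return True


if __name__ == "__main__":
    bucket = LeakyBucket(rate_per_min=200, burst=5)
    # Ten requests arriving in the same millisecond.
    print([bucket.allow(0) for _ in range(10)])
    # One second later the bucket has drained enough to accept again.
    print(bucket.allow(1000))
```

In this model, of ten simultaneous requests the first six are accepted (the first one plus burst=5) and the remaining four are rejected, which illustrates why burst is an allowance above the rate and not a concurrency cap.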

Original article: https://www.cnblogs.com/xiewenming/p/8108703.html