
    Blocking crawlers with nginx

    1. Use the "robots.txt" standard

    Create a file named "robots.txt" in the site's root directory and save the following content into it. It tells the listed search-engine crawlers they may crawl everything, and asks every other crawler to stay away.

    User-agent: BaiduSpider
    Disallow:
    User-agent: YisouSpider
    Disallow:
    User-agent: 360Spider
    Disallow:
    User-agent: Sosospider
    Disallow:
    User-agent: SogouSpider
    Disallow:
    User-agent: YodaoBot
    Disallow:
    User-agent: Googlebot
    Disallow:
    User-agent: bingbot
    Disallow:
    User-agent: *
    Disallow: /
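
    Keep in mind that robots.txt is purely advisory: well-behaved crawlers honor it, abusive ones ignore it, which is what the nginx rules in the next section are for. If a catch-all rewrite or proxy rule swallows the file, a small location block can make sure nginx answers /robots.txt directly. This is a minimal sketch under assumptions about your layout, not part of the original article; the root path is a placeholder.

    # Hypothetical: serve /robots.txt from disk and keep it out of the access log
    location = /robots.txt {
        root /var/www/html;     # placeholder; point this at your site's document root
        access_log off;
        log_not_found off;
    }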


    2. nginx configuration

    Add a map under the http block to flag matching User-Agents:

    map $http_user_agent $limit_bots {
        default 0;
        ~*(baiduspider|qqspider|google|soso|bing|yandex|sogou|bingbot|yahoo|sohu-search|yodao|YoudaoBot|robozilla|msnbot|MJ12bot|NHN|Twiceler) 1;
        ~*(AltaVista|Googlebot|Slurp|BlackWidow|Bot|ChinaClaw|Custo|DISCo|Download|Demon|eCatch|EirGrabber|EmailSiphon|EmailWolf|SuperHTTP|Surfbot|WebWhacker) 1;
        ~*(Express|WebPictures|ExtractorPro|EyeNetIE|FlashGet|GetRight|GetWeb!|Go!Zilla|Go-Ahead-Got-It|GrabNet|Grafula|HMView|Go!Zilla|Go-Ahead-Got-It) 1;
        ~*(rafula|HMView|HTTrack|Stripper|Sucker|Indy|InterGET|Ninja|JetCar|Spider|larbin|LeechFTP|Downloader|tool|Navroad|NearSite|NetAnts|tAkeOut|WWWOFFLE) 1;
        ~*(GrabNet|NetSpider|Vampire|NetZIP|Octopus|Offline|PageGrabber|Foto|pavuk|pcBrowser|RealDownload|ReGet|SiteSnagger|SmartDownload|SuperBot|WebSpider) 1;
        ~*(Teleport|VoidEYE|Collector|WebAuto|WebCopier|WebFetch|WebGo|WebLeacher|WebReaper|WebSauger|eXtractor|Quester|WebStripper|WebZIP|Wget|Widow|Zeus) 1;
        ~*(Twengabot|htmlparser|libwww|Python|perl|urllib|scan|Curl|email|PycURL|Pyth|PyQ|WebCollector|WebCopy|webcraw) 1;
    }
    Then add an if check under the server block or a location block:

    if ($limit_bots = 1) {
        return 403;
    }
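
    Put together, the placement looks roughly like the fragment below. This is a sketch of a typical layout, not the original article's file; the server_name is a placeholder and the map list is shortened for illustration.

    http {
        # the map is evaluated per request and sets $limit_bots to 1 for matching User-Agents
        map $http_user_agent $limit_bots {
            default 0;
            ~*(baiduspider|yandex|sogou|bingbot|MJ12bot|Wget|Curl) 1;   # shortened list
        }

        server {
            listen 80;
            server_name www.xxxxx.com;    # placeholder

            location / {
                if ($limit_bots = 1) {
                    return 403;
                }
            }
        }
    }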


    Add the snippets below inside the "location / { }" block, for example alongside your pseudo-static (rewrite) rules.

    # Block crawling by tools such as Scrapy
    if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
        return 403;
    }

    # Block the listed User-Agents as well as requests with an empty User-Agent
    if ($http_user_agent ~ "BaiduSpider|JiKeSpider|YandexBot|Bytespider|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|Ezooms|^$") {
        return 404;
    }

    # Block requests using methods other than GET|HEAD|POST (~ is a case-sensitive regex match, ~* is case-insensitive)
    if ($request_method !~ ^(GET|HEAD|POST)$) {
        return 403;
    }
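
    If the only goal is restricting methods, nginx's built-in limit_except directive is an alternative to the if check. It is only valid inside a location block; the sketch below is an assumption about where it would sit, not part of the original configuration (allowing GET also allows HEAD).

    location / {
        # every method except GET, HEAD (implied by GET) and POST is answered with 403
        limit_except GET POST {
            deny all;
        }
    }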

    if ($http_user_agent ~ "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)" ) {
    return 404;
    }
    if ($http_user_agent ~ "Mozilla/5.0+(compatible;+Baiduspider/2.0;++http://www.baidu.com/search/spider.html)") {
    return 404;
    }

    if ($http_user_agent ~ "Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X)(compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)") {
    return 404;
    }
    if ($http_user_agent ~ "Mozilla/5.0 (Linux; Android 10; VCE-AL00 Build/HUAWEIVCE-AL00; wv)") {
    return 404;
    }


    Test it:

    curl -I -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.6967.1704 Safari/537.36; YandexBot" http://www.xxxxx.com

    A 403 response means the setup works!
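
    To see which blocked requests actually hit the server, the access_log directive's if= parameter (available since nginx 1.7.0) can write only the requests flagged by the map to a separate log. The log path below is an assumption for illustration; place the line in the server block.

    # Hypothetical: log only requests where $limit_bots is 1 (logging is skipped when the value is "0" or empty)
    access_log /var/log/nginx/blocked_bots.log combined if=$limit_bots;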

  • Original article: https://www.cnblogs.com/walkersss/p/14766129.html