What Is the Robots Protocol?

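In the age of data, web crawling carries some legal risk, but we can largely avoid it by following the protocol and knowing which data must not be crawled.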

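Most websites publish a Robots protocol file. If a site has none, the protocol itself places no restrictions on what may be crawled there.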

　　The full name is the Robots Exclusion Standard, the standard protocol for excluding web crawlers.

Purpose:

　　It tells web crawlers which pages may be crawled and which may not.

Form:

　　A robots.txt file in the website's root directory.
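Since the file always sits at the root, its address can be derived from any page URL on the same site. A minimal sketch, assuming Python's standard urllib.parse module; the function name robots_url and the sample page URL are made up for illustration:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; drop the page's own path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL, used only to show the call.
print(robots_url("https://www.youtube.com/results?search_query=python"))
```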

Take YouTube as an example:

　　https://www.youtube.com/robots.txt

Opening it shows the following:
# robots.txt file for YouTube
# Created in the distant future (the year 2000) after
# the robotic uprising of the mid 90's which wiped out all humans.

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /channel/*/community
Disallow: /comment
Disallow: /get_video
Disallow: /get_video_info
Disallow: /live_chat
Disallow: /login
Disallow: /results
Disallow: /signup
Disallow: /t/terms
Disallow: /timedtext_video
Disallow: /user/*/community
Disallow: /verify_age
Disallow: /watch_ajax
Disallow: /watch_fragments_ajax
Disallow: /watch_popup
Disallow: /watch_queue_ajax

Sitemap: https://www.youtube.com/sitemaps/sitemap.xml
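　　Here # starts a comment, * is a wildcard that matches everything (any crawler in a User-agent line, any run of characters in a path), and / stands for the site's root directory.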

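　　User-agent names the crawler, matched against the User-Agent header it sends, to which the Disallow rules below it apply. An empty Disallow value means nothing is off limits for that crawler.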
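To see how these rules are read in practice, here is a small sketch using Python's standard urllib.robotparser module. It parses a handful of the rules from the YouTube file, given inline so nothing is downloaded. The crawler name MyCrawler and the test URLs are made up. As far as I know, this parser matches Disallow values as plain path prefixes and does not expand * inside a path, so the wildcard rules are left out here.

```python
from urllib.robotparser import RobotFileParser

# A few prefix-only rules copied from the YouTube example.
RULES = """\
User-agent: *
Disallow: /comment
Disallow: /login
Disallow: /results
Disallow: /signup
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(RULES)

# "MyCrawler" is a hypothetical user agent; it falls under the "*" group.
for url in (
    "https://www.youtube.com/results?search_query=python",
    "https://www.youtube.com/watch?v=abc",
):
    print(url, parser.can_fetch("MyCrawler", url))
```

The first URL starts with the disallowed prefix /results, so can_fetch reports False for it; the second matches no Disallow line and is reported as allowed.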

Finally, the Robots protocol is advisory rather than technically enforced: a crawler can ignore it, but doing so carries legal risk.
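Because nothing enforces the rules on the server side, the check has to live in the crawler itself. One possible way to do that, sketched with the standard urllib.robotparser module, is to filter the crawl queue before any request is sent. The helper name allowed_urls, the agent name ExampleCrawler and the queue contents are all illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Keep only the URLs that the given robots.txt text lets user_agent fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Two rules taken from the YouTube example; the queue is made up.
rules = "User-agent: *\nDisallow: /login\nDisallow: /signup\n"
queue = [
    "https://www.youtube.com/login",
    "https://www.youtube.com/watch?v=abc",
]
print(allowed_urls(rules, "ExampleCrawler", queue))
```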

　　Please follow the Robots protocol and help keep the web a healthy environment for everyone.

Original article: https://www.cnblogs.com/hao11/p/12609348.html
