
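How to Determine Whether a Website May Be Crawled

The Robots Protocol

Binding force: the Robots protocol is advisory, not binding. A crawler can ignore it, but doing so carries legal risk.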

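A website has two ways to keep crawlers out:

Source review (typically, inspecting the User-Agent header of incoming requests)
Robots protocol notice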
Purpose: the website tells crawlers which pages may be fetched and which may not.
Form: a robots.txt file in the website's root directory.
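
Because the file always sits in the root directory, its address can be derived from any page URL on the same site. The snippet below is a minimal sketch of that idea; the helper name `robots_url` and the sample page path are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url.

    Hypothetical helper, not part of any library.
    """
    parts = urlsplit(page_url)
    # Keep only the scheme and host; drop the page path, query and fragment,
    # because robots.txt lives at the root of the site.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The page path here is an invented example.
print(robots_url("https://www.jd.com/pop/example.html?x=1"))
# prints https://www.jd.com/robots.txt
```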

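Viewing JD.com's Robots Protocol

https://www.jd.com/robots.txt

The file contains the following. (Not every website publishes a Robots protocol; a site without one places no Robots-based restrictions on crawlers.)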

User-agent: *                  # the rules below apply to every crawler
Disallow: /?*                  # no crawler should fetch paths that begin with /?
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider         # this crawler may not fetch anything on JD.com
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /
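
A crawler can check rules like these before requesting a page. The sketch below feeds the rules quoted above to Python's standard `urllib.robotparser` without touching the network. The agent name `MyCrawler` is an assumption for illustration. The standard-library parser matches rules by path prefix and may not interpret the `*` wildcards the way the site intends, so only the unambiguous cases are checked here.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the listing above (first two crawler-specific records).
ROBOTS_TXT = """\
User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no HTTP request)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# EtaoSpider is named explicitly and barred from the whole site.
print(parser.can_fetch("EtaoSpider", "https://www.jd.com/"))  # prints False

# "MyCrawler" is a hypothetical agent; it falls under the "*" record,
# which does not block the home page.
print(parser.can_fetch("MyCrawler", "https://www.jd.com/"))   # prints True
```

To check the live file instead, `RobotFileParser` also offers `set_url()` followed by `read()`, which downloads the rules before `can_fetch()` is called.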

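Reposted material is for study only and will not be used commercially.
Reposting original content is welcome; please include a link to the article.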


Original article: https://www.cnblogs.com/xdd1997/p/13535581.html