1. Identifying the technology a website uses
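The technology used to build a website also affects how we crawl it. A very useful tool for checking which technologies a site is built with is the builtwith module, which can be installed as follows: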
pip install builtwith
The module takes a URL as its argument, downloads and analyzes that URL, and returns the technologies the website uses.
>>> import builtwith
>>> builtwith.parse('http://123.127.249.126:8081/manage/login')
{'programming-languages': ['Java'],
 'web-servers': ['Nginx'],
 'web-frameworks': ['Twitter Bootstrap'],
 'javascript-frameworks': ['jQuery']}
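From the result above, we can see that the site uses Java on the server side, Nginx as its web server, and Twitter Bootstrap (a powerful front-end framework) together with jQuery on the front end.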
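Because the result is an ordinary dictionary mapping category names to lists of technology names, it can also be inspected programmatically. The following is a minimal sketch, not part of builtwith itself: the helper names list_technologies and uses_technology are hypothetical, and the sample dictionary is copied from the output above so that the example runs without network access. In real use the dictionary would presumably come from builtwith.parse(url).

```python
def list_technologies(result):
    """Flatten a builtwith-style result into sorted (category, technology) pairs."""
    return sorted(
        (category, tech)
        for category, techs in result.items()
        for tech in techs
    )


def uses_technology(result, name):
    """Return True if `name` appears in any category (case-insensitive match)."""
    wanted = name.lower()
    return any(
        tech.lower() == wanted
        for techs in result.values()
        for tech in techs
    )


# Sample data copied from the output shown above (an assumption for this
# sketch); in real use this would be the return value of builtwith.parse(url).
sample = {
    'programming-languages': ['Java'],
    'web-servers': ['Nginx'],
    'web-frameworks': ['Twitter Bootstrap'],
    'javascript-frameworks': ['jQuery'],
}

# Print one "category: technology" line per detected technology.
for category, tech in list_technologies(sample):
    print(f"{category}: {tech}")

# Check for a specific technology before deciding how to crawl.
print("Uses jQuery:", uses_technology(sample, "jquery"))
```

A check like uses_technology can be one input when deciding how to crawl a site, for example whether content is likely to be rendered by front-end JavaScript.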
2. Finding the website's owner
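For some websites, we may care who the owner is. For example, if we know that the owner blocks web crawlers, we had better keep our download rate more conservative. To find out who owns a website, we can use the WHOIS protocol to look up who registered the domain. Python has a wrapper library for this protocol, documented at https://pypi.python.org/pypi/python-whois, which we can install with pip: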
pip install python-whois
Below is the result returned by a WHOIS query for http://www.baidu.com using this module.
>>> import whois
>>> print(whois.whois('http://www.baidu.com'))
{
  "domain_name": [
    "BAIDU.COM",
    "baidu.com"
  ],
  "registrar": "MarkMonitor, Inc.",
  "whois_server": "whois.markmonitor.com",
  "referral_url": null,
  "updated_date": [
    "2019-05-09 04:30:46",
    "2019-05-08 20:59:33"
  ],
  "creation_date": [
    "1999-10-11 11:05:17",
    "1999-10-11 04:05:17"
  ],
  "expiration_date": [
    "2026-10-11 11:05:17",
    "2026-10-11 00:00:00"
  ],
  "name_servers": [
    "NS1.BAIDU.COM",
    "NS2.BAIDU.COM",
    "NS3.BAIDU.COM",
    "NS4.BAIDU.COM",
    "NS7.BAIDU.COM",
    "ns7.baidu.com",
    "ns2.baidu.com",
    "ns1.baidu.com",
    "ns3.baidu.com",
    "ns4.baidu.com"
  ],
  "status": [
    "clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited",
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clientUpdateProhibited https://icann.org/epp#clientUpdateProhibited",
    "serverDeleteProhibited https://icann.org/epp#serverDeleteProhibited",
    "serverTransferProhibited https://icann.org/epp#serverTransferProhibited",
    "serverUpdateProhibited https://icann.org/epp#serverUpdateProhibited",
    "clientUpdateProhibited (https://www.icann.org/epp#clientUpdateProhibited)",
    "clientTransferProhibited (https://www.icann.org/epp#clientTransferProhibited)",
    "clientDeleteProhibited (https://www.icann.org/epp#clientDeleteProhibited)",
    "serverUpdateProhibited (https://www.icann.org/epp#serverUpdateProhibited)",
    "serverTransferProhibited (https://www.icann.org/epp#serverTransferProhibited)",
    "serverDeleteProhibited (https://www.icann.org/epp#serverDeleteProhibited)"
  ],
  "emails": [
    "abusecomplaints@markmonitor.com",
    "whoisrequest@markmonitor.com"
  ],
  "dnssec": "unsigned",
  "name": null,
  "org": "Beijing Baidu Netcom Science Technology Co., Ltd.",
  "address": null,
  "city": null,
  "state": "Beijing",
  "zipcode": null,
  "country": "CN"
}
The output shows that the domain belongs to Baidu, which is indeed the case.
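To connect the WHOIS result back to the idea of crawling more conservatively for certain owners, here is a minimal sketch. It assumes the result can be read like a dictionary, as the printed output suggests, and that a field may be null, a single string, or a list. The helpers as_list and choose_delay, the delay values, and the keyword list are all hypothetical and used only for illustration; the sample record is a trimmed copy of the output above so that the example runs without network access.

```python
def as_list(value):
    """Normalize a WHOIS field that may be None, a single string, or a list."""
    if value is None:
        return []
    if isinstance(value, (list, tuple, set)):
        return [str(v) for v in value]
    return [str(value)]


def choose_delay(record, cautious_keywords, default_delay=1.0, cautious_delay=5.0):
    """Pick a download delay (in seconds) based on who owns the domain.

    If any keyword appears in the record's "org" or "domain_name" fields
    (case-insensitive substring match), return the more conservative delay.
    """
    owner_text = " ".join(
        as_list(record.get("org")) + as_list(record.get("domain_name"))
    ).lower()
    if any(keyword.lower() in owner_text for keyword in cautious_keywords):
        return cautious_delay
    return default_delay


# Trimmed copy of the WHOIS output shown above (an assumption for this sketch);
# in real use this would be the value returned by whois.whois(url).
sample_record = {
    "domain_name": ["BAIDU.COM", "baidu.com"],
    "registrar": "MarkMonitor, Inc.",
    "name": None,
    "org": "Beijing Baidu Netcom Science Technology Co., Ltd.",
    "country": "CN",
}

# Hypothetical list of owners we have decided to crawl more slowly.
cautious_keywords = ["baidu"]

delay = choose_delay(sample_record, cautious_keywords)
print(f"Delay between downloads: {delay} seconds")
```

The keyword list here is only an example of owners one has chosen to treat cautiously; it does not say anything about how a given site actually handles crawlers.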