The content I wanted to scrape does not appear in the page's HTML source; it is delivered by a separate JSON file. After locating that endpoint and parsing it, the result is as follows.
Code:
#coding=utf-8
import json
import urllib.request

url = 'http://www.tianyancha.com/expanse/holder.json?id=9519792&ps=20&pn=1'  # the JSON endpoint to parse

def getPage(url):  # fetch the raw JSON body
    response = urllib.request.urlopen(url).read()
    z_response = response.decode('UTF-8')  # decode the bytes so the Chinese text reads correctly
    return z_response

names = json.loads(getPage(url))
# Sample response:
# {"state":"ok","message":"","special":"","data":{"total":4,"result":[{"amount":3528.5705,"id":2277807374,"capitalActl":[],"type":2,"capital":[{"amomon":"3,528.5705万元","percent":"54.29%"}],"name":"马化腾"},{"amount":1485.7115,"id":1925786094,"capitalActl":[],"type":2,"capital":[{"amomon":"1,485.7115万元","percent":"22.86%"}],"name":"张志东"},{"amount":742.859,"id":2246944474,"capitalActl":[],"type":2,"capital":[{"amomon":"742.859万元","percent":"11.43%"}],"name":"陈一丹"},{"amount":742.859,"id":2171369795,"capitalActl":[],"type":2,"capital":[{"amomon":"742.859万元","percent":"11.43%"}],"name":"许晨晔"}]}}

for i in range(names['data']['total']):
    print(names['data']['result'][i]['name'])
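The sample response also carries each holder's stake under the capital key, so the same loop can print the percentage alongside the name. A small extension, grounded only in the sample shown above:

for item in names['data']['result']:
    # capital is a list; the sample shows one entry holding 'amomon' and 'percent'
    percent = item['capital'][0]['percent'] if item['capital'] else 'n/a'
    print(item['name'], percent)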
Working around short-term rate limits:
Approach 1: a small number of sites have weak defenses, and you can get past the block simply by faking your apparent IP, i.e. setting the X-Forwarded-For request header.
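A minimal sketch of the header trick, reusing the url defined in the code above; the spoofed address here is an arbitrary placeholder:

import urllib.request

req = urllib.request.Request(url)  # url as defined above
# Hypothetical spoofed origin; a weakly defended site may trust this header as the client IP
req.add_header('X-Forwarded-For', '1.2.3.4')
page = urllib.request.urlopen(req).read().decode('UTF-8')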
For most sites, though, frequent scraping still calls for multiple IPs. My preferred solution is an overseas VPS provisioned with several IPs, switching between them by changing the default gateway. That is far more efficient than HTTP proxies, and probably more efficient than ADSL redialing in most cases.
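Gateway switching happens at the OS routing level, but if the VPS's extra IPs are configured as local addresses, a per-connection alternative (not the gateway method itself) is to pin the outgoing socket to one of them via http.client's source_address parameter. A sketch, assuming a hypothetical list of locally bound IPs:

import http.client
import random

LOCAL_IPS = ['203.0.113.10', '203.0.113.11']  # hypothetical addresses assigned to the VPS

def fetch(host, path):
    # Bind the outgoing socket to a randomly chosen local IP (port 0 = any free port)
    conn = http.client.HTTPConnection(host, source_address=(random.choice(LOCAL_IPS), 0))
    conn.request('GET', path)
    body = conn.getresponse().read().decode('UTF-8')
    conn.close()
    return body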
Approach 2: imitate real user behavior as much as possible: 1. rotate the User-Agent regularly; 2. leave longer gaps between requests and randomize the interval; 3. visit pages in a random order as well.
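A sketch combining all three points, assuming pn in the query string is the page number (ps looks like the page size) and using a hypothetical pair of User-Agent strings:

import random
import time
import urllib.request

USER_AGENTS = [  # hypothetical examples; use current real browser strings in practice
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6)',
]

pages = ['http://www.tianyancha.com/expanse/holder.json?id=9519792&ps=20&pn=%d' % n
         for n in range(1, 5)]
random.shuffle(pages)  # 3. visit pages in a random order

for page in pages:
    req = urllib.request.Request(page)
    req.add_header('User-Agent', random.choice(USER_AGENTS))  # 1. rotate the User-Agent
    print(urllib.request.urlopen(req).read().decode('UTF-8'))
    time.sleep(random.uniform(3, 10))  # 2. long, randomized gap between requests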