  • Getting Started with Web Scraping

    The content I wanted to scrape was not in the page's HTML source, so I located the JSON file that actually delivers the data, parsed it, and got the result below:
    Code:
    # coding=utf-8
    import json
    import urllib.request


    def getPage(url):     # fetch the raw JSON document from the given URL
        response = urllib.request.urlopen(url).read()
        z_response = response.decode('UTF-8')    # decode the bytes as UTF-8 so the Chinese text comes through correctly
        return z_response

    # Hypothetical placeholder: put the actual JSON endpoint found in the browser's network panel here
    url = 'http://example.com/shareholders.json'
    names = json.loads(getPage(url))
    #{"state":"ok","message":"","special":"","data":{"total":4,"result":[{"amount":3528.5705,"id":2277807374,"capitalActl":[],"type":2,"capital":[{"amomon":"3,528.5705万元","percent":"54.29%"}],"name":"马化腾"},{"amount":1485.7115,"id":1925786094,"capitalActl":[],"type":2,"capital":[{"amomon":"1,485.7115万元","percent":"22.86%"}],"name":"张志东"},{"amount":742.859,"id":2246944474,"capitalActl":[],"type":2,"capital":[{"amomon":"742.859万元","percent":"11.43%"}],"name":"陈一丹"},{"amount":742.859,"id":2171369795,"capitalActl":[],"type":2,"capital":[{"amomon":"742.859万元","percent":"11.43%"}],"name":"许晨晔"}]}}
    for i in range(0, names['data']['total']):
        print(names['data']['result'][i]['name'])
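    Run against the JSON shown in the comment above, this prints the four shareholder names: 马化腾, 张志东, 陈一丹, 许晨晔.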
     
    Dealing with being rate-limited after too many requests in a short time:
    Method 1: a small number of sites have fairly weak defenses; you can disguise your IP simply by setting the X-Forwarded-For header (I think that's how it's spelled...) and get around the limit.
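    A minimal sketch of attaching that header with urllib (the target URL here is a made-up placeholder, not from the original post):

    import random
    import urllib.request

    # Hypothetical target URL; the original post does not name one
    url = 'http://example.com/api/data.json'

    # Weakly protected sites sometimes trust X-Forwarded-For as the client IP,
    # so sending a fabricated address can slip past a per-IP limit.
    fake_ip = '.'.join(str(random.randint(1, 254)) for _ in range(4))
    req = urllib.request.Request(url, headers={'X-Forwarded-For': fake_ip})
    response = urllib.request.urlopen(req).read().decode('UTF-8')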

    For most sites, though, frequent scraping generally still requires multiple IPs. The solution I prefer is an overseas VPS with several extra IPs, switching between them by changing the default gateway; that is far more efficient than an HTTP proxy, and probably also more efficient than ADSL reconnecting in most cases.
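    Gateway switching happens at the OS level rather than in Python, but for comparison, here is a minimal sketch of the HTTP-proxy route using urllib's ProxyHandler (the proxy addresses and URL are hypothetical):

    import urllib.request

    # Hypothetical proxy pool; in practice these would be the extra IPs you control
    proxies = ['http://10.0.0.2:8080', 'http://10.0.0.3:8080']

    def openViaProxy(url, proxy):
        # Build an opener that routes both http and https traffic through the given proxy
        handler = urllib.request.ProxyHandler({'http': proxy, 'https': proxy})
        opener = urllib.request.build_opener(handler)
        return opener.open(url).read().decode('UTF-8')

    for proxy in proxies:
        page = openViaProxy('http://example.com/api/data.json', proxy)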

    Method 2: imitate real user behavior as much as possible: 1) rotate the User-Agent frequently; 2) make the interval between requests longer and randomize it; 3) the order in which pages are visited can also be randomized. A sketch combining the three follows.
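    A minimal sketch of all three ideas together (the user-agent strings and page URLs are made-up examples):

    import random
    import time
    import urllib.request

    # A small pool of user-agent strings to rotate through (illustrative, not exhaustive)
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6)',
        'Mozilla/5.0 (X11; Linux x86_64)',
    ]

    # Hypothetical list of pages to fetch; visit them in a random order
    urls = ['http://example.com/page/%d' % n for n in range(1, 6)]
    random.shuffle(urls)

    for url in urls:
        req = urllib.request.Request(url, headers={'User-Agent': random.choice(user_agents)})
        page = urllib.request.urlopen(req).read().decode('UTF-8')
        time.sleep(random.uniform(2, 8))    # longer, randomized pause between requests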
  • Original post: https://www.cnblogs.com/to-creat/p/6743985.html