  • Getting Started with Crawlers

    The content I wanted to scrape was missing from the page's HTML source. I located the JSON file that delivers the content, parsed it, and got the result below:
    Code:
    # coding=utf-8
    import json
    import urllib.request


    def getPage(url):     # fetch the raw JSON text
        response = urllib.request.urlopen(url).read()
        z_response = response.decode('UTF-8')    # decode bytes so the Chinese text reads correctly
        return z_response

    url = '...'    # the JSON endpoint found via the browser's network panel (original URL not shown)
    names = json.loads(getPage(url))
    #{"state":"ok","message":"","special":"","data":{"total":4,"result":[{"amount":3528.5705,"id":2277807374,"capitalActl":[],"type":2,"capital":[{"amomon":"3,528.5705万元","percent":"54.29%"}],"name":"马化腾"},{"amount":1485.7115,"id":1925786094,"capitalActl":[],"type":2,"capital":[{"amomon":"1,485.7115万元","percent":"22.86%"}],"name":"张志东"},{"amount":742.859,"id":2246944474,"capitalActl":[],"type":2,"capital":[{"amomon":"742.859万元","percent":"11.43%"}],"name":"陈一丹"},{"amount":742.859,"id":2171369795,"capitalActl":[],"type":2,"capital":[{"amomon":"742.859万元","percent":"11.43%"}],"name":"许晨晔"}]}}
    for i in range(0, names['data']['total']):
        print(names['data']['result'][i]['name'])
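Since the JSON payload is captured in the comment above, the parsing step can be tried offline by loading that string directly; this sketch uses a trimmed two-entry copy of the sample response:

```python
import json

# A trimmed copy of the JSON response captured above (two of the four entries).
sample = '''{"state":"ok","data":{"total":2,"result":[
  {"amount":3528.5705,"name":"马化腾"},
  {"amount":1485.7115,"name":"张志东"}]}}'''

names = json.loads(sample)
for i in range(0, names['data']['total']):
    print(names['data']['result'][i]['name'])
```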
     
    Working around rate limits on rapid, repeated requests:
    Method 1: A small number of sites have weak defenses; you can fake the client IP by setting the X-Forwarded-For header, which is enough to get past them.

    For most sites, though, frequent scraping really calls for multiple IPs. My preferred setup is a foreign VPS with several IPs attached, switching IPs by changing the default gateway; this is far more efficient than HTTP proxies, and probably more efficient than ADSL redialing in most cases.
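A minimal sketch of method 1, assuming the target actually trusts X-Forwarded-For; the target URL and spoofed address here are placeholder examples, not from the original post:

```python
import urllib.request

# Build a request that claims to come from a different client IP.
# Both the URL and the header value are arbitrary example values.
req = urllib.request.Request(
    'http://example.com/',                    # hypothetical target URL
    headers={'X-Forwarded-For': '8.8.8.8'},   # spoofed client IP
)

# urllib normalizes stored header names, so look it up as 'X-forwarded-for'.
print(req.get_header('X-forwarded-for'))

# response = urllib.request.urlopen(req)      # uncomment to actually send it
```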

    Method 2: Mimic real user behavior as much as possible: 1. change the User-Agent regularly; 2. make the interval between requests long, and randomize it; 3. visit pages in a random order.
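The three points of method 2 can be sketched like this; the User-Agent strings and URL list are placeholder assumptions, and `polite_fetch` is a hypothetical helper name:

```python
import random
import time
import urllib.request

# 1. A pool of User-Agent strings to rotate through (example values).
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]

def polite_fetch(url):
    # 2. Wait a long, randomized interval before each request.
    time.sleep(random.uniform(3, 10))
    req = urllib.request.Request(
        url, headers={'User-Agent': random.choice(USER_AGENTS)})
    return urllib.request.urlopen(req).read()

# 3. Visit the pages in a random order (hypothetical URL list).
pages = ['http://example.com/page/%d' % i for i in range(1, 6)]
random.shuffle(pages)
```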
  • Original post: https://www.cnblogs.com/to-creat/p/6743985.html