zoukankan      html  css  js  c++  java
  • 爬虫的简单运用

    我运用的库为requests

    python没有自带需自行安装

    安装代码 pip install requests   或 pip3 install requests

    以访问百度页面为例

    代码如下:

    
    

    import requests
    def gethtml(url):
    try:
    r = requests.get(url,timeout=30)
    r.raise_for_status()#如果状态不是200,引发异常
    r.encoding='utf-8'#无论原来是什么编码,都改成utf-8
    return r.text,r.content
    except:
    return ""
    url="http://www.baidu.com/"
    str1,str2=gethtml(url)
    print(str1)
    print(len(str1))
    print(len(str2))

    
    

    注意爬取网站与读入文件一样要做异常处理及使用try-except语句

    结果如下:

    <!DOCTYPE html>
    <!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class="head_wrapper"> <div class="s_form"> <div class="s_form_wrapper"> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class="fm"> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class="s_ipt" value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class="mnav">新闻</a> <a href=http://www.hao123.com name=tj_trhao123 class="mnav">hao123</a> <a href=http://map.baidu.com name=tj_trmap class="mnav">地图</a> <a href=http://v.baidu.com name=tj_trvideo class="mnav">视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class="mnav">贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class="lb">登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon class="bri" style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class="cp-feedback">意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

    2287
    2381

    多次访问只需要运用循环语句多次调用函数即可

  • 相关阅读:
    剑指Offer-30.连续子数组的最大和(C++/Java)
    剑指Offer-29.最小的K个数(C++/Java)
    UVA 1616 Caravan Robbers 商队抢劫者(二分)
    UVA 10570 Meeting with Aliens 外星人聚会
    UVA 11093 Just Finish it up 环形跑道 (贪心)
    UVA 12673 Erratic Expansion 奇怪的气球膨胀 (递推)
    UVA 10954 Add All 全部相加 (Huffman编码)
    UVA 714 Copying Books 抄书 (二分)
    UVALive 3523 Knights of the Round Table 圆桌骑士 (无向图点双连通分量)
    codeforecs Gym 100286B Blind Walk
  • 原文地址:https://www.cnblogs.com/DXL123/p/10909048.html
Copyright © 2011-2022 走看看