  • Web Scraping with Python, Chapter 1

    1. Getting to know urllib

    urllib is part of Python's standard library. It provides a rich set of functions for requesting data from web servers, handling cookies, and more. In Python 2 the corresponding library was urllib2; unlike urllib2, Python 3's urllib is split into several submodules: urllib.request, urllib.parse, urllib.error, and so on. For the full reference, see https://docs.python.org/3/library/urllib.html

    from urllib.request import urlopen
    html = urlopen("http://pythonscraping.com/pages/page1.html")
    print(html.read())
    
    b'<html>
    <head>
    <title>A Useful Page</title>
    </head>
    <body>
    <h1>An Interesting Title</h1>
    <div>
    Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    </div>
    </body>
    </html>
    '
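    Note that read() returns raw bytes (hence the b'...' prefix above), which must be decoded before being treated as text. A minimal sketch of that step, using a data: URL so it runs without network access (the content is made up for illustration):

    ```python
    from urllib.request import urlopen

    # urlopen returns a file-like response object: read() gives raw bytes
    # that need decoding before string operations. A data: URL is used so
    # this runs offline; the content is illustrative only.
    resp = urlopen("data:text/html,<h1>Hello</h1>")
    body = resp.read()           # bytes, e.g. b'<h1>Hello</h1>'
    text = body.decode("utf-8")  # str, ready for a parser
    print(text)
    ```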
    

    2. Getting to know BeautifulSoup

    The BeautifulSoup library parses HTML text and turns it into a BeautifulSoup object.

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
    bsObj = BeautifulSoup(html.read(),"lxml")
    print(bsObj.h1)
    
    <h1>An Interesting Title</h1>
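    Beyond grabbing a whole tag like bsObj.h1, the object exposes a few common accessors. A short sketch on an inline copy of the page above, so it runs offline (html.parser is used because it ships with Python; swap in "lxml" if it is installed):

    ```python
    from bs4 import BeautifulSoup

    # Inline HTML mirroring the example page, so no network is needed.
    doc = ("<html><head><title>A Useful Page</title></head>"
           "<body><h1>An Interesting Title</h1></body></html>")
    bsObj = BeautifulSoup(doc, "html.parser")

    print(bsObj.h1)             # the whole tag
    print(bsObj.h1.get_text())  # just its text: An Interesting Title
    print(bsObj.title.string)   # text of the <title> tag: A Useful Page
    ```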
    

    The BeautifulSoup constructor needs to be told which parser to use. The table below lists the common parsers with their strengths and weaknesses:

    Parser                  | Usage                              | Strengths                                   | Weaknesses
    ------------------------|------------------------------------|---------------------------------------------|------------------------------------------
    Python standard library | BeautifulSoup(html, "html.parser") | built into Python; decent speed             | less tolerant of malformed markup
    lxml HTML parser        | BeautifulSoup(html, "lxml")        | very fast; tolerant of malformed markup     | requires installing lxml and C libraries
    lxml XML parser         | BeautifulSoup(html, "xml")         | very fast; the only XML parser listed here  | requires C libraries
    html5lib                | BeautifulSoup(html, "html5lib")    | parses the way a browser does; most lenient | slow
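    To see the error-tolerance column in practice, consider a snippet where the <b> tag is never closed and a second <p> implicitly ends the first. Each parser repairs such markup in its own way; this sketch uses html.parser because it needs no install, and "lxml" or "html5lib" can be swapped in to compare their repairs:

    ```python
    from bs4 import BeautifulSoup

    # Deliberately malformed HTML: unclosed <b>, unclosed <p> tags.
    broken = "<p>first<b>bold<p>second"
    soup = BeautifulSoup(broken, "html.parser")
    # prettify() shows where the parser decided to close the open tags
    print(soup.prettify())
    ```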

    3. Writing reliable scrapers

    Web requests often fail: the site may return 404 Page Not Found, or the server may be temporarily down. In either case urlopen raises an exception that stops the program, so we can use the urllib.error module to handle it.

    from urllib.request import urlopen
    from urllib.error import URLError
    
    try:
        html = urlopen("https://www.baid.com/")   #url is wrong
    except URLError as e:
        print(e)
    
    <urlopen error [Errno 111] Connection refused>
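    It is worth knowing that HTTPError (the server answered with an error status such as 404) is a subclass of URLError (the server could not be reached at all), so the more specific HTTPError must be caught first. A sketch that constructs the exceptions directly, so it runs without touching the network:

    ```python
    from urllib.error import HTTPError, URLError

    def describe(exc):
        try:
            raise exc
        except HTTPError as e:   # must come before URLError (its superclass)
            return "HTTP error %d: %s" % (e.code, e.msg)
        except URLError as e:
            return "server unreachable: %s" % e.reason

    print(describe(HTTPError("http://example.com/", 404, "Not Found", None, None)))
    print(describe(URLError("Connection refused")))
    ```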
    

    Even with a reliable connection, parsing the HTML with BeautifulSoup can still fail: after a site redesign, a tag we expect may no longer exist, and accessing it raises an exception.

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
    try:
        bsObj = BeautifulSoup(html.read(),"lxml")
        li = bsObj.ul.li
        print(li)
    except AttributeError as e:
        print(e)
    
    'NoneType' object has no attribute 'li'
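    An alternative to catching AttributeError is to use find(), which returns None when a tag is absent, so each step of the chain can be checked explicitly. Inline HTML mirroring the page above keeps the sketch runnable offline:

    ```python
    from bs4 import BeautifulSoup

    doc = "<html><body><h1>An Interesting Title</h1></body></html>"
    bsObj = BeautifulSoup(doc, "html.parser")

    ul = bsObj.find("ul")  # no <ul> on this page -> None
    li = ul.find("li") if ul is not None else None
    print(li)              # None, with no exception raised
    ```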
    

    4. Your first scraper

    from urllib.request import urlopen
    from urllib.error import HTTPError
    from bs4 import BeautifulSoup
    
    def getTitle(url):
        try:
            html = urlopen(url)
        except HTTPError as e:
            return None
        
        try:
            bsObj = BeautifulSoup(html.read(),"lxml")
            title = bsObj.body.h1
        except AttributeError as e:
            return None
        
        return title
    
    title = getTitle("http://www.pythonscraping.com/pages/page1.html")
    if title is None:
        print("Title could not be found.")
    else:
        print(title)
    
    
    <h1>An Interesting Title</h1>
  • Original post: https://www.cnblogs.com/dxs959229640/p/8660834.html