  • Scraper: BeautifulSoup and LXML

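    Besides regular expressions, a scraper can parse pages with the BeautifulSoup package or the LXML module. This post introduces each in turn.
    1. The BeautifulSoup package
    BeautifulSoup is more capable than regular expressions, and code written with it is more concise and easier to read. However, because the package is written in pure Python, it tends to be somewhat slower.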

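    # Data extraction: the BeautifulSoup package
    '''
    Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
    '''
    # BeautifulSoup can cope with malformed HTML
    from bs4 import BeautifulSoup
    broken_html = '<ul class=country><li>Area<li>Population</ul>'
    soup = BeautifulSoup(broken_html, "html.parser")
    # Repair the HTML (quote the attribute value, add the missing closing tags).
    # Note: with html.parser the second <li> ends up nested inside the first;
    # the "lxml" or "html5lib" parsers produce sibling <li> items instead.
    fixed_html = soup.prettify()
    # print fixed_html
    # Look up elements
    ul = soup.find('ul', attrs={'class': 'country'})
    # print ul.find('li')
    # print ul.find_all('li')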
    
    # Now use this approach to extract the country's area
    import urllib2
    def download(url, user_agent="wswp", num_retries=2):
        print "Download :", url
        # The header name needs a hyphen; "User_agent" is not sent as the User-Agent header
        headers = {"User-agent": user_agent}
        request = urllib2.Request(url, headers=headers)
        try:
            html = urllib2.urlopen(request).read()
        except urllib2.URLError as e:
            print "Download Error :", e.reason
            html = None
            if num_retries > 0:
                if hasattr(e, "code") and 500 <= e.code < 600:
                    return download(url, user_agent, num_retries-1)
        return html
    if __name__ == "__main__":
        url = "http://example.webscraping.com/view/United-Kingdom-239"
        html = download(url)
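        if html is not None:  # download() returns None on failure
            soup = BeautifulSoup(html, "html.parser", from_encoding="utf-8")
            # First find the parent element (the table row)
            tr = soup.find(attrs={'id':'places_area__row'})
            # Then find the child element that holds the area
            td = tr.find(attrs={'class':'w2p_fw'})
            # Finally print the text of that child element
            area = td.text
            print area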
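
The listing above targets Python 2 (print statements, urllib2). For readers on Python 3, the same retry-on-5xx logic could be sketched roughly as follows. This is an illustrative port rather than the author's code: the `opener` parameter is a hypothetical addition that lets the function be exercised without network access, and everything else mirrors the original `download`.

```python
import urllib.error
import urllib.request


def download(url, user_agent="wswp", num_retries=2,
             opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if the download fails.

    `opener` is a hypothetical hook (not in the original post); it defaults
    to urllib.request.urlopen and can be replaced by a stub in tests.
    """
    print("Downloading:", url)
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        return opener(request).read()
    except urllib.error.URLError as e:
        print("Download error:", e.reason)
        # Only HTTPError carries a status code; retry on 5xx server errors.
        if num_retries > 0 and 500 <= getattr(e, "code", 0) < 600:
            return download(url, user_agent, num_retries - 1, opener)
        return None
```

Calling `download("http://example.webscraping.com/view/United-Kingdom-239")` would then return the page bytes, or `None` once the retries are exhausted or the error is not a server error.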
    
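    # Summary: the BeautifulSoup version is longer than the regular-expression one, but it is not hard to follow and is easier to construct and understand. It also copes better with small layout changes such as extra whitespace or changed tag attributes.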
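
To illustrate the point about layout changes, here is a small Python 3 sketch that runs the same two-step lookup on two fragments holding the same data in different layouts. Both fragments are made up for this example and only imitate the shape of the row used above; `extract_area` is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Two made-up fragments: same data, different whitespace, quoting and attributes.
COMPACT = ('<table><tr id="places_area__row">'
           '<td class="w2p_fl">Area: </td>'
           '<td class="w2p_fw">244,820 square kilometres</td></tr></table>')
LOOSE = """
<table>
  <tr   class="odd" id='places_area__row'>
    <td class="w2p_fl">Area: </td>
    <td style="color: black" class="w2p_fw">244,820 square kilometres</td>
  </tr>
</table>
"""


def extract_area(html):
    """Return the text of the area cell, or None if the row or cell is missing."""
    soup = BeautifulSoup(html, "html.parser")
    # Step 1: find the parent row by its id.
    tr = soup.find(attrs={"id": "places_area__row"})
    # Step 2: find the value cell inside that row by its class.
    td = tr.find(attrs={"class": "w2p_fw"}) if tr is not None else None
    return td.get_text(strip=True) if td is not None else None


print(extract_area(COMPACT))
print(extract_area(LOOSE))
```

Both calls yield the same string, because the lookup depends on the `id` and `class` values rather than on the exact character layout that a regular expression would have to match.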

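    2. The LXML module

    This module supports CSS selectors. Before using them, install the cssselect package (pip install cssselect); otherwise the cssselect() call fails with an error.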

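    # Data extraction: the lxml module
    '''
    lxml is a Python wrapper around the libxml2 XML parsing library. It parses faster than BeautifulSoup because the underlying library is written in C.
    '''
    
    # The first step is to parse potentially invalid HTML into a consistent format.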
    import lxml.html
    import urllib2
    '''
    broken_html = '<ul class=country><li>Area<li>Population</ul>'
    # Parse the HTML
    tree = lxml.html.fromstring(broken_html)
    fixed_html = lxml.html.tostring(tree, pretty_print=True)
    '''
    
    # print fixed_html
    def download(url, user_agent="wswp", num_retries=2):
        print "Download :", url
        # The header name needs a hyphen; "User_agent" is not sent as the User-Agent header
        headers = {"User-agent": user_agent}
        request = urllib2.Request(url, headers=headers)
        try:
            html = urllib2.urlopen(request).read()
        except urllib2.URLError as e:
            print "Download Error :", e.reason
            html = None
            if num_retries > 0:
                if hasattr(e, "code") and 500 <= e.code < 600:
                    return download(url, user_agent, num_retries - 1)
        return html
    
    if __name__ == "__main__":
        url = "http://example.webscraping.com/view/United-Kingdom-239"
        html = download(url)
        if html is not None:  # download() returns None on failure
            tree = lxml.html.fromstring(html)
            # Recent lxml releases no longer bundle cssselect; install it separately: pip install cssselect
            td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]
            area = td.text_content()
            print area
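
As a complement, the Python 3 sketch below shows lxml repairing the malformed list from the first section, and then extracts the area with an XPath expression, which lxml supports without the cssselect package. The sample fragment is made up to imitate the row used above, and `extract_area` is a hypothetical helper name; swapping in XPath is an alternative to the CSS selector in the post, not a replacement the author proposed.

```python
import lxml.html

# lxml closes the unclosed <li> tags, giving two sibling items.
broken_html = '<ul class=country><li>Area<li>Population</ul>'
ul = lxml.html.fromstring(broken_html)
items = [li.text for li in ul.findall("li")]
print(items)

# Made-up fragment shaped like the row the post extracts.
SAMPLE = ('<table><tr id="places_area__row">'
          '<td class="w2p_fl">Area: </td>'
          '<td class="w2p_fw">244,820 square kilometres</td></tr></table>')


def extract_area(html):
    """Return the area cell's text via XPath, or None if it is not found."""
    tree = lxml.html.fromstring(html)
    # Roughly equivalent to the CSS selector 'tr#places_area__row > td.w2p_fw',
    # except that @class="w2p_fw" matches the whole attribute value exactly.
    cells = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')
    return cells[0].text_content() if cells else None


print(extract_area(SAMPLE))
```

Checking that the result list is non-empty before indexing avoids the IndexError that `[0]` would raise when the page layout changes and the selector matches nothing.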
  • Original post: https://www.cnblogs.com/llhy1178/p/6834792.html