  • Scraper_compare('NoneType' object has no attribute 'group')

    The three approaches to parsing a web page each have their own uses and their own strengths; comparing them side by side makes it clearer which method to use in which situation. When running the code, you may run into a bug:

    results[field] = re.search('<tr id="places_%s__row">.*?<td class="w2p_fw">(.*?)</td>' % field, html).group()
    AttributeError: 'NoneType' object has no attribute 'group'

    This happens when one of the fields in FIELDS has no matching row in the downloaded page: re.search() then returns None, and calling .group() on None raises the AttributeError. Remove the offending field and the script runs fine.
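
    As an illustration, one defensive way to avoid the crash is to check the match object before calling .group(). This is a minimal sketch (the extract_field helper is my own, not part of the original script):

    import re

    def extract_field(html, field):
        # re.search() returns None when the row is absent, so guard before .group()
        match = re.search('<tr id="places_%s__row">.*?<td class="w2p_fw">(.*?)</td>' % field, html)
        if match is None:
            return None  # field not present on this page
        return match.groups()[0]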

    # Compare the relative performance of the three parsing approaches
    import urllib2
    import re
    from bs4 import BeautifulSoup
    import lxml.html
    import time
    def download(url, user_agent="wswp", num_retries=2):
        print "Download :", url
        headers = {"User_agent": user_agent}
        request = urllib2.Request(url, headers=headers)
        try:
            html = urllib2.urlopen(request).read()
        except urllib2.URLError as e:
            print "Download Error :", e.reason
            html = None
            if num_retries > 0:
                if hasattr(e, "code") and 500 <= e.code < 600:
                    return download(url, user_agent, num_retries-1)
        return html
    # If running raises "AttributeError: 'NoneType' object has no attribute
    # 'group'", simply remove the "iso" element from FIELDS below.
    FIELDS = ("area", "population", "iso", "country", "capital", "continent", "tld", "currency_code", "currency_name", "phone", "postal_code_format", "postal_code_regex", "languages", "neighbours")
    
    # Regular expressions
    def re_scraper(html):
        results = {}
        for field in FIELDS:
            results[field] = re.search('<tr id="places_%s__row">.*?<td class="w2p_fw">(.*?)</td>' % field, html).groups()[0]
        return results
    
    # BeautifulSoup
    def bs_scraper(html):
        soup = BeautifulSoup(html, "html.parser", from_encoding="utf8")
        results = {}
        for field in FIELDS:
            # tr = soup.find(attrs={'id': 'places_%s__row' % field})
            # td = tr.find(attrs={'class':'w2p_fw'})
            # results[field] = td.text
            results[field] = soup.find('table').find('tr', id='places_%s__row' % field).find('td', class_='w2p_fw').text
        return results
    
    # The lxml module
    def lxml_scraper(html):
        tree = lxml.html.fromstring(html)
        results = {}
        for field in FIELDS:
            results[field] = tree.cssselect('tr#places_%s__row > td.w2p_fw' % field)[0].text_content()
        return results
    
    # Benchmark the relative performance
    NUM_ITERATIONS = 1000
    html = download("http://example.webscraping.com/view/United-Kingdom-239")
    for name, scraper in [("Regular expressions", re_scraper), ("BeautifulSoup", bs_scraper), ("Lxml", lxml_scraper)]:
        start = time.time()
        for i in range(NUM_ITERATIONS):
            if scraper == re_scraper:
                # Purge the regex cache so the comparison is fair
                re.purge()
            result = scraper(html)
            assert (result['area'] == '244,820 square kilometres')
    
        end = time.time()
        print "%s: %.2f seconds" % (name, end - start)

    The output is:

    Regular expressions: 3.82 seconds
    BeautifulSoup: 25.92 seconds
    Lxml: 4.33 seconds

    As you can see, BeautifulSoup is the slowest of the three, but it is also the simplest to use. In most cases, however, lxml is the best choice for scraping data, because it is both fast and robust, while regular expressions and BeautifulSoup are only useful in certain specific scenarios.
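
    If you prefer XPath to CSS selectors, a roughly equivalent lxml scraper might look like the sketch below (it assumes the same page structure; the xpath_scraper name is my own):

    def xpath_scraper(html):
        # Same fields as lxml_scraper above, selected via XPath instead of CSS
        tree = lxml.html.fromstring(html)
        results = {}
        for field in FIELDS:
            results[field] = tree.xpath('//tr[@id="places_%s__row"]/td[@class="w2p_fw"]' % field)[0].text_content()
        return results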

  • Original post: https://www.cnblogs.com/llhy1178/p/6844214.html