  • Performance comparison of crawler parsers: XPath, re, bs4, and others


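    Original address of this article: https://sitoi.cn/posts/23470.html

    Approach

    Test page: http://baijiahao.baidu.com/s?id=1644707202199076031

    Extract the same data from the same page with each parser, repeat the extraction 500 times, sum the elapsed times, and compare the totals.

    Test example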

    # -*- coding: utf-8 -*-
    import re
    import time
    
    import scrapy
    from bs4 import BeautifulSoup
    
    
    class NewsSpider(scrapy.Spider):
        name = 'news'
        allowed_domains = ['baidu.com']
        start_urls = ['http://baijiahao.baidu.com/s?id=1644707202199076031']
    
        def parse(self, response):
            re_time_list = []
            xpath_time_list = []
            lxml_time_list = []
            bs4_lxml_time_list = []
            html5lib_time_list = []
            bs4_html5lib_time_list = []
            for _ in range(500):
                # re
                re_start_time = time.time()
                news_title = re.findall(pattern="<title>(.*?)</title>", string=response.text)[0]
                news_content = "".join(re.findall(pattern='<span class="bjh-p">(.*?)</span>', string=response.text))
                re_time_list.append(time.time() - re_start_time)
                # xpath
                xpath_start_time = time.time()
                news_title = response.xpath("//div[@class='article-title']/h2/text()").extract_first()
                news_content = response.xpath('string(//*[@id="article"])').extract_first()
                xpath_time_list.append(time.time() - xpath_start_time)
                # bs4 + html5lib, extraction only (building the BeautifulSoup object is not timed)
                soup = BeautifulSoup(response.text, "html5lib")
                html5lib_start_time = time.time()
                news_title = soup.select_one("div.article-title > h2").text
                news_content = soup.select_one("#article").text
                html5lib_time_list.append(time.time() - html5lib_start_time)
                # bs4 + html5lib, including the time to build the BeautifulSoup object
                bs4_html5lib_start_time = time.time()
                soup = BeautifulSoup(response.text, "html5lib")
                news_title = soup.select_one("div.article-title > h2").text
                news_content = soup.select_one("#article").text
                bs4_html5lib_time_list.append(time.time() - bs4_html5lib_start_time)
    
                # bs4 + lxml, extraction only (building the BeautifulSoup object is not timed)
                soup = BeautifulSoup(response.text, "lxml")
                lxml_start_time = time.time()
                news_title = soup.select_one("div.article-title > h2").text
                news_content = soup.select_one("#article").text
                lxml_time_list.append(time.time() - lxml_start_time)
    
                # bs4 + lxml, including the time to build the BeautifulSoup object
                bs4_lxml_start_time = time.time()
                soup = BeautifulSoup(response.text, "lxml")
                news_title = soup.select_one("div.article-title > h2").text
                news_content = soup.select_one("#article").text
                bs4_lxml_time_list.append(time.time() - bs4_lxml_start_time)
            re_result = sum(re_time_list)
            xpath_result = sum(xpath_time_list)
            lxml_result = sum(lxml_time_list)
            html5lib_result = sum(html5lib_time_list)
            bs4_lxml_result = sum(bs4_lxml_time_list)
            bs4_html5lib_result = sum(bs4_html5lib_time_list)
    
            print(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n")
            print(f"re time: {re_result}")
            print(f"xpath time: {xpath_result}")
            print(f"lxml (extraction only) time: {lxml_result}")
            print(f"html5lib (extraction only) time: {html5lib_result}")
            print(f"bs4_lxml (conversion + extraction) time: {bs4_lxml_result}")
            print(f"bs4_html5lib (conversion + extraction) time: {bs4_html5lib_result}")
            print("\n>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n")
            print(f"xpath/re :{xpath_result / re_result}")
            print(f"lxml/re :{lxml_result / re_result}")
            print(f"html5lib/re :{html5lib_result / re_result}")
            print(f"bs4_lxml/re :{bs4_lxml_result / re_result}")
            print(f"bs4_html5lib/re :{bs4_html5lib_result / re_result}")
            # The two ratios below appear in the recorded output, so they are printed here as well.
            print(f"lxml/xpath :{lxml_result / xpath_result}")
            print(f"html5lib/xpath :{html5lib_result / xpath_result}")
            print("\n>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
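
    The spider above repeats one start/stop timing pattern for every parser. As a minimal sketch, that pattern could be factored into a helper and tried outside Scrapy. The names SAMPLE_HTML, total_time and extract_with_re are hypothetical, the sample markup is a tiny stand-in for the real page, and time.perf_counter() is swapped in for time.time() because it is the clock the standard library provides for measuring short intervals.

```python
import re
import time

# Hypothetical stand-in for response.text; the real test page is much larger.
SAMPLE_HTML = (
    "<html><head><title>Sample title</title></head><body>"
    '<span class="bjh-p">first </span><span class="bjh-p">second</span>'
    "</body></html>"
)


def total_time(func, repeat=500):
    """Call func() `repeat` times and return the summed elapsed seconds."""
    total = 0.0
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        total += time.perf_counter() - start
    return total


def extract_with_re(html=SAMPLE_HTML):
    """Apply the same two regex extractions as the spider to a plain string."""
    title = re.findall(r"<title>(.*?)</title>", html)[0]
    content = "".join(re.findall(r'<span class="bjh-p">(.*?)</span>', html))
    return title, content


if __name__ == "__main__":
    print(extract_with_re())
    print(f"re total over 500 runs: {total_time(extract_with_re):.6f} s")
```

    The other parsers could be measured the same way by passing a different function to total_time; absolute numbers will differ from the ones reported below because the sample document is far smaller.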
    
    

    Test results:

    Run 1

    >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    
    re time: 0.018010616302490234
    xpath time: 0.19927382469177246
    lxml (extraction only) time: 0.3410227298736572
    html5lib (extraction only) time: 0.3842911720275879
    bs4_lxml (conversion + extraction) time: 1.6482152938842773
    bs4_html5lib (conversion + extraction) time: 6.744122505187988
    
    >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    
    xpath/re :11.064242408196765
    lxml/re :18.934539726245003
    html5lib/re :21.336925154218847
    bs4_lxml/re :91.51354213550078
    bs4_html5lib/re :374.4526223822509
    lxml/xpath :1.7113272673976896
    html5lib/xpath :1.9284578525152096
    
    >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    

    Run 2

    >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    
    re time: 0.023047208786010742
    xpath time: 0.18992280960083008
    lxml (extraction only) time: 0.3522317409515381
    html5lib (extraction only) time: 0.418229341506958
    bs4_lxml (conversion + extraction) time: 1.710503101348877
    bs4_html5lib (conversion + extraction) time: 7.1153998374938965
    
    >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    
    xpath/re :8.24059917034769
    lxml/re :15.28305419636484
    html5lib/re :18.14663742538819
    bs4_lxml/re :74.21736476770769
    bs4_html5lib/re :308.7315216154427
    lxml/xpath :1.8546047296364272
    html5lib/xpath :2.2021016979791463
    
    >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    

    Run 3

    >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    
    re time: 0.014002561569213867
    xpath time: 0.18992352485656738
    lxml (extraction only) time: 0.3783881664276123
    html5lib (extraction only) time: 0.39995455741882324
    bs4_lxml (conversion + extraction) time: 1.751767873764038
    bs4_html5lib (conversion + extraction) time: 7.1871068477630615
    
    >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    
    xpath/re :13.563484360899695
    lxml/re :27.022781835827757
    html5lib/re :28.56295653062267
    bs4_lxml/re :125.10338662716453
    bs4_html5lib/re :513.2708620660298
    lxml/xpath :1.9923185751389976
    html5lib/xpath :2.1058716013241323
    
    >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    

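    Analysis of results:

    Analysis based on the mean of the three runs

    Each cell is the column parser's mean time divided by the row parser's mean time. "lxml" and "html5lib" are the bs4 extraction-only timings; "lxml(bs4)" and "html5lib(bs4)" also include building the BeautifulSoup object.

                     re    xpath    lxml    html5lib    lxml(bs4)    html5lib(bs4)
    re                1    10.52   19.46       21.84        92.82           382.25
    xpath             -        1    1.85        2.08         8.82            36.34
    lxml              -        -       1        1.12         4.77            19.64
    html5lib          -        -       -           1         4.25            17.50
    lxml(bs4)         -        -       -           -            1             4.12
    html5lib(bs4)     -        -       -           -            -                1
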
    • xpath/re :10.52
    • lxml/re :19.46
    • html5lib/re :21.84
    • bs4_lxml/re :92.82
    • bs4_html5lib/re :382.25
    • lxml/xpath :1.85
    • html5lib/xpath :2.08
    • bs4_lxml/xpath :8.82
    • bs4_html5lib/xpath :36.34
    • html5lib/lxml :1.12
    • bs4_lxml/lxml :4.77
    • bs4_html5lib/lxml :19.64
    • bs4_lxml/html5lib :4.25
    • bs4_html5lib/html5lib :17.50
    • bs4_html5lib/bs4_lxml :4.12
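
    The ratios listed above can be reproduced from the three runs' totals: average each parser's time over the runs, then divide. The sketch below does that with the figures copied from the recorded output; the names RUN_TOTALS and ratio are hypothetical.

```python
from statistics import mean

# Summed times in seconds for runs 1-3, copied from the recorded output.
RUN_TOTALS = {
    "re": [0.018010616302490234, 0.023047208786010742, 0.014002561569213867],
    "xpath": [0.19927382469177246, 0.18992280960083008, 0.18992352485656738],
    "lxml": [0.3410227298736572, 0.3522317409515381, 0.3783881664276123],
    "html5lib": [0.3842911720275879, 0.418229341506958, 0.39995455741882324],
    "bs4_lxml": [1.6482152938842773, 1.710503101348877, 1.751767873764038],
    "bs4_html5lib": [6.744122505187988, 7.1153998374938965, 7.1871068477630615],
}


def ratio(slower, faster, totals=RUN_TOTALS):
    """Mean time of `slower` divided by mean time of `faster`, to 2 decimals."""
    return round(mean(totals[slower]) / mean(totals[faster]), 2)


if __name__ == "__main__":
    names = list(RUN_TOTALS)
    # Print every pair once, fastest parser as the denominator.
    for i, faster in enumerate(names):
        for slower in names[i + 1:]:
            print(f"{slower}/{faster} :{ratio(slower, faster)}")
```

    Note that these are ratios of the mean times, not means of the per-run ratios, which is why xpath/re comes out as 10.52 rather than the average of 11.06, 8.24 and 13.56.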

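    Comparison of the three extraction approaches

                    re          xpath               bs4
    Installation    built-in    third-party         third-party
    Syntax          regex       path expressions    object-oriented
    Ease of use     hard        fairly hard         easy
    Performance     highest     moderate            lowest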

    Conclusion

    re > xpath > bs4

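    • re is roughly 10 times as fast as xpath

      Although re far outperforms both xpath and bs4, it is considerably harder to use than either, and the resulting code is considerably harder to maintain later on.

    • xpath is roughly 1.8 times as fast as bs4

      Comparing extraction alone, xpath is about 1.8 times as fast as bs4 (the lxml/xpath ratio of 1.85 above). In practice bs4 must also convert the page into a BeautifulSoup object first, which raises the ratio to 8.82 with lxml and 36.34 with html5lib, so for deeply nested pages in large volumes xpath is far more efficient than bs4.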
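
    The maintenance cost of re mentioned in the first point can be illustrated with the content pattern from the test spider. The sketch below is only an illustration under assumed markup changes: the three HTML strings are hypothetical and are not taken from the test page.

```python
import re

# The content pattern used in the test spider.
PATTERN = r'<span class="bjh-p">(.*?)</span>'

# Hypothetical markup variants a site might serve.
original = '<span class="bjh-p">hello</span>'
extra_attr = '<span class="bjh-p" data-id="1">hello</span>'
multiline = '<span class="bjh-p">hello\nworld</span>'


def extract(html, flags=0):
    """Return every captured span body for the given markup."""
    return re.findall(PATTERN, html, flags)


if __name__ == "__main__":
    print(extract(original))              # the markup the pattern was written for
    print(extract(extra_attr))            # an added attribute stops the literal match
    print(extract(multiline))             # "." does not match a newline by default
    print(extract(multiline, re.DOTALL))  # re.DOTALL lets "." span lines
```

    A selector such as //span[@class='bjh-p'] keys on the parsed element rather than on the exact character sequence, so it would not be affected by the added attribute or by the line break, which is part of why the slower parsers are easier to maintain.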

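    Overall, xpath combined with scrapy-redis for distributed crawling is more than enough to meet performance requirements, so xpath is the recommended choice.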

  • Source: https://www.cnblogs.com/sitoi/p/11819580.html