Plain crawler vs. multithreaded crawler vs. framework crawler: a Python comparison

Preface

The text and images in this article are sourced from the internet and are for learning and exchange only; they serve no commercial purpose. If any issue arises, please contact us promptly so we can address it.

     

Free video case tutorials on Python crawlers, data analysis, website development, and more are available online:

    https://space.bilibili.com/523606542

Basic development environment

• Python 3.6
• PyCharm
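
The third-party libraries used below, requests and parsel for the first two crawlers and scrapy for the last one, can be installed with pip:

    pip install requests parsel scrapy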

Target page analysis

For the target, let's go with fabiaoqing.com (发表情), a meme-sharing site.


The site serves static pages, and all the data sits in div tags, so it isn't hard to scrape.


All we need to do is extract each meme's image URL and title from those tags.

     

Plain crawler implementation

    import requests
    import parsel
    import re


    def change_title(title):
        """Replace characters that are illegal in filenames."""
        pattern = re.compile(r'[/\\:*?"<>|]')  # / \ : * ? " < > |
        new_title = re.sub(pattern, "_", title)  # swap each for an underscore
        return new_title


    for page in range(1, 201):
        url = f'https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html'
        headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
        }
        response = requests.get(url=url, headers=headers)
        selector = parsel.Selector(response.text)
        divs = selector.css('.tagbqppdiv')
        for div in divs:
            img_url = div.css('a img::attr(data-original)').get()
            title_ = '.' + img_url.split('.')[-1]  # file extension, e.g. '.jpg'
            title = div.css('a img::attr(title)').get()
            new_title = change_title(title) + title_
            img_content = requests.get(url=img_url, headers=headers).content
            path = 'img\\' + new_title  # the img folder must already exist
            with open(path, mode='wb') as f:
                f.write(img_content)
                print(title)

     

A quick walkthrough of the code:

1. Sanitizing the title. Some image titles contain special characters that can't appear in a filename, so a regex replaces any that might show up with underscores:

    def change_title(title):
        pattern = re.compile(r'[/\\:*?"<>|]')  # / \ : * ? " < > |
        new_title = re.sub(pattern, "_", title)  # swap each for an underscore
        return new_title

     

2. Paging through the list and mimicking a browser request:

    for page in range(1, 201):
        url = f'https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html'
        headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
        }
        response = requests.get(url=url, headers=headers)

To find the pagination pattern, click through a few pages and watch how the URL changes. The site uses GET, so requests.get is enough to fetch a page. Adding a user-agent request header disguises the request as coming from a browser; without it, a site can recognize the request as a Python crawler, although for this particular site it makes little difference either way.

     

3. Parsing the response and extracting the data we want:

    selector = parsel.Selector(response.text)
    divs = selector.css('.tagbqppdiv')
    for div in divs:
        img_url = div.css('a img::attr(data-original)').get()
        title = div.css('a img::attr(title)').get()

Here we use the parsel parsing library with CSS selectors: the matching content is pulled out by tag and attribute.

     

4. Saving the data:

    img_content = requests.get(url=img_url, headers=headers).content
    path = 'img\\' + new_title  # the img folder must already exist
    with open(path, mode='wb') as f:
        f.write(img_content)
        print(title)

Requesting the meme's URL and taking .content returns the binary payload; images, video, and files are all saved as binary data. (For text you would use .text instead.)

path is where the file is saved; because the data is binary, the file is opened in 'wb' mode.

     

Multithreaded crawler implementation

     

    import requests
    import parsel
    import re
    import concurrent.futures


    def get_response(html_url):
        """Request the URL like a browser and return the response object."""
        headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
        }
        response = requests.get(url=html_url, headers=headers)
        return response


    def change_title(title):
        """Replace characters that are illegal in filenames."""
        pattern = re.compile(r'[/\\:*?"<>|]')  # / \ : * ? " < > |
        new_title = re.sub(pattern, "_", title)  # swap each for an underscore
        return new_title


    def save(img_url, title):
        """Download one image and save it to the local img folder."""
        img_content = get_response(img_url).content
        path = 'img\\' + title  # the img folder must already exist
        with open(path, mode='wb') as f:
            f.write(img_content)
            print(title)


    def main(html_url):
        """Parse one list page and save every meme on it."""
        response = get_response(html_url)
        selector = parsel.Selector(response.text)
        divs = selector.css('.tagbqppdiv')
        for div in divs:
            img_url = div.css('a img::attr(data-original)').get()
            title_ = '.' + img_url.split('.')[-1]  # file extension, e.g. '.jpg'
            title = div.css('a img::attr(title)').get()
            new_title = change_title(title) + title_
            save(img_url, new_title)


    if __name__ == '__main__':
        executor = concurrent.futures.ThreadPoolExecutor(max_workers=5)
        for page in range(1, 201):
            url = f'https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html'
            executor.submit(main, url)
        executor.shutdown()

A quick walkthrough:

This was already foreshadowed above: the multithreaded crawler just wraps each piece of work in its own function, so every block of code has a single job, and then launches the work through a thread pool.

    executor = concurrent.futures.ThreadPoolExecutor(max_workers=5)

max_workers=5 caps the pool at five worker threads. Each executor.submit(main, url) schedules one list page on the pool, and executor.shutdown() blocks until every submitted page has finished.
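
As a minimal, self-contained illustration of the same submit/shutdown pattern (a toy task standing in for main, not part of the crawler):

    import concurrent.futures


    def task(n):
        """Toy stand-in for main(): just square a number."""
        return n * n


    # Using the executor as a context manager calls shutdown() on exit
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(task, n) for n in range(10)]
        print([f.result() for f in futures])  # result() waits for each task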

     

Scrapy framework crawler implementation

I won't go over creating a Scrapy project again here; an earlier article explains it in detail, and you can find it via the link below:

Simple batch collection of website data with the Scrapy crawler framework
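
For reference, the standard Scrapy commands that produce this project skeleton (the project and spider names are taken from the settings and spider code below):

    scrapy startproject biaoqingbao
    cd biaoqingbao
    scrapy genspider biaoqing fabiaoqing.com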

     

    items.py
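
The spider below yields BiaoqingbaoItem objects with exactly two fields, so a minimal items.py only needs to declare them; roughly:

    import scrapy


    class BiaoqingbaoItem(scrapy.Item):
        # The two fields the spider fills in for every meme
        img_url = scrapy.Field()
        title = scrapy.Field()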


     

    middlewares.py
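
settings.py enables biaoqingbao.middlewares.BiaoqingbaoDownloaderMiddleware, the downloader middleware that scrapy startproject generates. A minimal hand-written version could simply attach the same browser user-agent the earlier crawlers used (a sketch, not the generated file):

    class BiaoqingbaoDownloaderMiddleware:
        """Attach a browser user-agent to every outgoing request."""

        def process_request(self, request, spider):
            request.headers['User-Agent'] = (
                'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                '(KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
            )
            return None  # continue normal request processing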

    
    

    pipelines.py
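
settings.py registers DownloadPicturePipeline and sets IMAGES_STORE, which points at Scrapy's built-in ImagesPipeline machinery. A sketch of a pipeline that downloads each item's image and names the file after its title (the exact body is an assumption):

    import scrapy
    from scrapy.pipelines.images import ImagesPipeline


    class DownloadPicturePipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            # Ask Scrapy's image machinery to fetch this item's picture
            yield scrapy.Request(url=item['img_url'], meta={'title': item['title']})

        def file_path(self, request, response=None, info=None, *, item=None):
            # Save under IMAGES_STORE, named after the meme's title
            suffix = request.url.split('.')[-1]  # keep the original extension
            return f"{request.meta['title']}.{suffix}"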


     

settings.py

    BOT_NAME = 'biaoqingbao'

    SPIDER_MODULES = ['biaoqingbao.spiders']
    NEWSPIDER_MODULE = 'biaoqingbao.spiders'

    # Enable the project's downloader middleware
    DOWNLOADER_MIDDLEWARES = {
       'biaoqingbao.middlewares.BiaoqingbaoDownloaderMiddleware': 543,
    }
    # Route scraped items through the image-download pipeline
    ITEM_PIPELINES = {
       'biaoqingbao.pipelines.DownloadPicturePipeline': 300,
    }
    # Folder where the images pipeline stores downloaded files
    IMAGES_STORE = './images'

     

spiders/biaoqing.py

    import scrapy

    from ..items import BiaoqingbaoItem


    class BiaoqingSpider(scrapy.Spider):
        name = 'biaoqing'
        allowed_domains = ['fabiaoqing.com']
        # Queue all 200 list pages up front
        start_urls = [f'https://www.fabiaoqing.com/biaoqing/lists/page/{page}.html' for page in range(1, 201)]

        def parse(self, response):
            divs = response.css('#bqb div.ui.segment.imghover div')
            for div in divs:
                img_url = div.css('a img::attr(data-original)').get()
                title = div.css('a img::attr(title)').get()
                # Hand each meme off to the pipeline as an item
                yield BiaoqingbaoItem(img_url=img_url, title=title)
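
With the files in place, the crawl is started from the project root by spider name, per standard Scrapy usage:

    scrapy crawl biaoqing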

     

A quick summary:

The biggest difference between the three programs is crawl speed. If you measure by the time it takes to write the code, though, the plain crawler is the simplest: a static site like this needs essentially no debugging, so you can write it straight through from top to bottom, and it comes to only about 29 lines including blank ones.

Original article: https://www.cnblogs.com/hhh188764/p/14279595.html