  • Multi-process crawler

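    Introduction to multiprocessing

    A process is one running instance of a program: running a script file starts one process, and multiprocessing means running several such programs at the same time.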
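To make "one running script is one process" concrete, here is a minimal sketch that starts two child processes and prints the process ID of each. The helper names `report` and `worker` are hypothetical and chosen only for this illustration.

```python
import os
from multiprocessing import Process


def report(label):
    """Return a line naming the process that executed this call."""
    return f"{label}: pid={os.getpid()}, parent pid={os.getppid()}"


def worker(label):
    # Runs inside a child process, so it reports a different pid.
    print(report(label))


if __name__ == "__main__":
    # The script itself is already one process.
    print(report("main script"))

    # Start two more processes that run alongside it.
    children = [Process(target=worker, args=(f"child {i}",)) for i in range(2)]
    for child in children:
        child.start()
    for child in children:
        child.join()
```

Each child should print its own pid, with the main script's pid as its parent.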


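    Why learn multiprocessing

    To make the crawler faster: several pages can be fetched at the same time instead of one after another.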


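    Difference between multiprocessing and multithreading

    Factory ==> workshop ==> worker

    Think of the computer as a factory, each process as a workshop inside it, and each thread as a worker inside a workshop. Workers in the same workshop share its space (memory); separate workshops do not.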
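In the factory analogy, threads are workers inside one workshop and share its memory, while each process is a separate workshop with its own copy. The sketch below illustrates that difference with a module-level counter; the names `counter`, `increment`, `run_in_thread` and `run_in_process` are hypothetical.

```python
import multiprocessing
import threading

# State that lives in the memory of the current process.
counter = {"value": 0}


def increment():
    counter["value"] += 1


def run_in_thread():
    # A thread shares memory with its parent, so the change is visible here.
    t = threading.Thread(target=increment)
    t.start()
    t.join()
    return counter["value"]


def run_in_process():
    # A child process changes only its own copy; the parent's counter is untouched.
    p = multiprocessing.Process(target=increment)
    p.start()
    p.join()
    return counter["value"]


if __name__ == "__main__":
    print("after thread:", run_in_thread())    # after thread: 1
    print("after process:", run_in_process())  # after process: 1
```

This isolation is why `pool.map` hands results back as return values rather than through shared variables.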


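    How to use multiprocessing

    from multiprocessing import Pool
    pool = Pool(processes=4)    # create a pool of 4 worker processes
    pool.map(func, iterable)    # call func on every item of iterable, spread across the workers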
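The three lines above leave `func` and `iterable` undefined. As a runnable sketch, assume a toy function `square` (hypothetical, standing in for a scraper function) and a short list of numbers:

```python
from multiprocessing import Pool


def square(n):
    """Toy stand-in for the real work done per item."""
    return n * n


if __name__ == "__main__":
    numbers = [1, 2, 3, 4, 5]

    # The with-block shuts the pool down once the work is finished.
    with Pool(processes=4) as pool:
        # map blocks until every item is processed and keeps the input order.
        results = pool.map(square, numbers)

    print(results)  # [1, 4, 9, 16, 25]
```

The `if __name__ == "__main__":` guard matters on platforms that start workers by re-importing the script, such as Windows; without it, each worker would try to create a pool of its own.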
     

    Performance comparison

    URL crawled: https://www.qiushibaike.com/8hr/page/1/

    import re
    import time
    from multiprocessing import Pool

    import requests

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'
    }

    def re_scraper(url):
        # Download one page and extract the fields with regular expressions.
        res = requests.get(url, headers=headers)
        names = re.findall('<h2>(.*?)</h2>', res.text, re.S)
        contents = re.findall('<div class="content">.*?<span>(.*?)</span>', res.text, re.S)
        laughs = re.findall(r'<i class="number">(\d+)</i>', res.text, re.S)
        # Same pattern as laughs, so both lists hold the same matches.
        comments = re.findall(r'<i class="number">(\d+)</i>', res.text, re.S)
        infos = list()
        for name, content, laugh, comment in zip(names, contents, laughs, comments):
            info = {
                'name': name,
                'content': content,
                'laugh': laugh,
                'comment': comment
            }
            infos.append(info)
        return infos

    if __name__ == "__main__":
        urls = ['https://www.qiushibaike.com/8hr/page/{}/'.format(str(i)) for i in range(1, 36)]

        # Baseline: fetch the 35 pages one after another.
        start_1 = time.time()
        for url in urls:
            re_scraper(url)
        end_1 = time.time()
        print('Serial crawler time:', end_1 - start_1)

        # Same work spread across 2 worker processes.
        start_2 = time.time()
        with Pool(processes=2) as pool:
            pool.map(re_scraper, urls)
        end_2 = time.time()
        print('2-process crawler time:', end_2 - start_2)

        # Same work spread across 4 worker processes.
        start_3 = time.time()
        with Pool(processes=4) as pool:
            pool.map(re_scraper, urls)
        end_3 = time.time()
        print('4-process crawler time:', end_3 - start_3)

    Run result:

    [Running] python "f:\WWW\test_py\compare_test.py"
    Serial crawler time: 14.95523715019226
    2-process crawler time: 11.39123272895813
    4-process crawler time: 4.0303635597229

    [Done] exited with code=0 in 32.827 seconds
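To read the timings as speedups, divide the serial time by each parallel time. The sketch below does that arithmetic on the three figures from the run above; the helper `speedup` and the `timings` dictionary are hypothetical names introduced for the calculation.

```python
def speedup(serial_seconds, parallel_seconds):
    """How many times faster the parallel run was than the serial one."""
    return serial_seconds / parallel_seconds


# Timings copied from the run result above, in seconds.
timings = {
    "serial": 14.95523715019226,
    "2 processes": 11.39123272895813,
    "4 processes": 4.0303635597229,
}

for label, seconds in timings.items():
    ratio = speedup(timings["serial"], seconds)
    print(f"{label}: {seconds:.2f} s, speedup {ratio:.2f}x")
```

This gives roughly 1.31x for 2 processes and 3.71x for 4 processes. The figures come from a single run against a live site, so network variation probably accounts for part of the gap between the two.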
  • Original article: https://www.cnblogs.com/xuxaut-558/p/10166642.html