Python：怎样用线程将任务并行化？

zoukankan html css js c++ java

Python：怎样用线程将任务并行化？
如果待处理任务满足：
1. 可拆分，即任务可以被拆分为多个子任务，或任务是多个相同的任务的集合；
2. 任务不是CPU密集型的，如任务涉及到较多IO操作（如文件读取和网络数据处理）
则使用多线程将任务并行运行，能够提高运行效率。

假设待处理的任务为：有很多文件目录，对于每个文件目录，搜索匹配一个给定字符串的文件的所有行（相当于是实现grep的功能）。则此处子任务为：给定一个目录，搜索匹配一个给定字符串的文件的所有行。总的任务为处理所有目录。

将子任务表示为一个函数T，如下所示：
def T(dir, pattern): print('searching pattern %s in dir %s' % (pattern, dir)) ...
为每个子任务创建一个线程

要实现并行化，最简单的方法是为每一个子任务创建一个thread，thread处理完后退出。

from threading import Thread from time import sleep def T(dir, pattern): "This is just a stub that simulate a dir operation" sleep(1) print('searching pattern %s in dir %s' % (pattern, dir)) threads = [] dirs = ['a/b/c', 'a/b/d', 'b/c', 'd/f'] pattern = 'hello' for dir in dirs: thread = Thread(target=T, args=(dir, pattern)) 1 thread.start() 2 threads.append(thread) for thread in threads: thread.join() 3 print('Main thread end here')

1 ：创建一个Thread对象，target参数指定这个thread待执行的函数，args参数指定target函数的输入参数

2 ：启动这个thread。 T(dir, pattern)将被调用

3 ：等待，直到这个thread结束。整个for循环表示主进程会等待所有子线程结束后再退出

程序的运行结果为：

searching pattern hello in dir a/b/csearching pattern hello in dir d/f searching pattern hello in dir b/c searching pattern hello in dir a/b/d Main thread end here

可以看出由于线程是并行运行的，部分输出会交叠。但主进程的打印总在最后。

以上例子中对于每个dir都需要创建一个thread。如果dir的数目较多，则会创建太多的thread，影响运行效率。较好的方式是限制总线程的数目。
限制线程数目

可以使用信号量（semaphore）来限制同时运行的最大线程数目。如下所示：

from threading import Thread, BoundedSemaphore from time import sleep def T(dir, pattern): "This is just a stub that simulate a dir operation" sleep(1) print('searching pattern %s in dir %s' % (pattern, dir)) threads = [] dirs = ['a/b/c', 'a/b/d', 'b/c', 'd/f'] pattern = 'hello' maxjobs = BoundedSemaphore(2) 1 def wrapper(dir, pattern): T(dir, pattern) maxjobs.release() 2 for dir in dirs: maxjobs.acquire() 3 thread = Thread(target=wrapper, args=(dir, pattern)) thread.start() threads.append(thread) for thread in threads: thread.join() print('Main thread end here')

1 ：创建一个有2个资源的信号量。一个信号量代表总的可用的资源数目，这里表示同时运行的最大线程数目为2。

2 ：在线程结束时释放资源。运行在子线程中。

3 ：在启动一个线程前，先获取一个资源。如果当前已经有2个线程在运行，则会阻塞，直到其中一个线程结束。运行在主线程中。

当限制了最大运行线程数为2后，由于只有2个线程同时运行，程序的输出更加有序，几乎总是为：

searching pattern hello in dir a/b/c searching pattern hello in dir a/b/d searching pattern hello in dir b/c searching pattern hello in dir d/f Main thread end here

以上实现中为每个子任务创建一个线程进行处理，然后通过信号量限制同时运行的线程的数目。如果子任务很多，这种方法会创建太多的线程。更好的方法是使用线程池。
使用线程池（Thread Pool）

即预先创建一定数目的线程，形成一个线程池。每个线程持续处理多个子任务（而不是处理一个就退出）。这样做的好处是：创建的线程数目会比较固定。

那么，每个线程处理哪些子任务呢？一种方法为：预先将所有子任务均分给每个线程。如下所示：

from threading import Thread from time import sleep def T(dir, pattern): "This is just a stub that simulate a dir operation" sleep(1) print('searching pattern %s in dir %s' % (pattern, dir)) dirs = ['a/b/c', 'a/b/d', 'b/c', 'd/f'] pattern = 'hello' def wrapper(dirs, pattern): 1 for dir in dirs: T(dir, pattern) threadsPool = [ 2 Thread(target=wrapper, args=(dirs[0:2], pattern)), Thread(target=wrapper, args=(dirs[2:], pattern)), ] for thread in threadsPool: 3 thread.start() for thread in threadsPool: thread.join() print('Main thread end here')

1 ：这个函数能够处理多个dir，将作为线程的target函数

2 ：创建一个有2个线程的线程池。并事先分配子任务给每个线程。线程1处理前两个dir，线程2处理后两个dir

3 ：启动线程池中所有线程

程序的输出结果为：

searching pattern hello in dir a/b/csearching pattern hello in dir b/c searching pattern hello in dir d/f searching pattern hello in dir a/b/d Main thread end here

这种方法存在以下问题：

子任务分配可能不均。导致每个线程运行时间差别可能较大，则整体运行时长可能被拖长

只能处理所有子任务都预先知道的情况，无法处理子任务实时出现的情况

如果有一种方法，能够让线程知道当前所有的待处理子任务，线程一旦空闲，便可以从中获取一个任务进行处理，则以上问题都可以解决。任务队列便是解决方案。
使用消息队列

可以使用Queue实现一个任务队列，用于在线程间传递子任务。主线程将所有待处理子任务放置在队列中，子线程从队列中获取子任务去处理。如下所有（注：以下代码只运行于Python 2，因为Queue只存在于Python 2）：

from threading import Thread from time import sleep import Queue def T(dir, pattern): "This is just a stub that simulate a dir operation" sleep(1) print('searching pattern %s in dir %s' % (pattern, dir)) dirs = ['a/b/c', 'a/b/d', 'b/c', 'd/f'] pattern = 'hello' taskQueue = Queue.Queue() 1 def wrapper(): while True: try: dir = taskQueue.get(True, 0.1) 2 T(dir, pattern) except Queue.Empty: continue threadsPool = [Thread(target=wrapper) for i in range(2)] 3 for thread in threadsPool: thread.start() 4 for dir in dirs: taskQueue.put(dir) 5 for thread in threadsPool: thread.join() print('Main thread end here')

1 ：创建一个任务队列

2 ：子线程从任务队列中获取一个任务。第一个参数为True，表示如果没有任务，会等待。第二个参数表示最长等待0.1秒如果在0.1秒后仍然没有任务，则会抛出一个Queue.Empty的异常

3 ：创建有2个线程的线程池。注意target函数wrapper没有任何参数

4 ：启动所有线程

5 ：主线程将所有子任务放置在任务队列中，以供子线程获取处理。由于子线程已经被启动，则子线程会立即获取到任务并处理

程序的输出为：

searching pattern hello in dir a/b/c searching pattern hello in dir a/b/d searching pattern hello in dir b/c searching pattern hello in dir d/f

从中可以看出主进程的打印结果并没有出来，程序会一直运行，而不退出。这个问题的原因是：目前的实现中，子线程为一个无限循环，因此其永远不会终止。因此，必须有一种机制来结束子进程。
终止子进程

一种简单方法为，可以在任务队列中放置一个特殊元素，作为终止符。当子线程从任务队列中获取这个终止符后，便自行退出。如下所示，使用None作为终止符。

from threading import Thread from time import sleep import Queue def T(dir, pattern): "This is just a stub that simulate a dir operation" sleep(1) print('searching pattern %s in dir %s' % (pattern, dir)) dirs = ['a/b/c', 'a/b/d', 'b/c', 'd/f'] pattern = 'hello' taskQueue = Queue.Queue() def wrapper(): while True: try: dir = taskQueue.get(True, 0.1) if dir is None: 1 taskQueue.put(dir) 2 break T(dir, pattern) except Queue.Empty: continue threadsPool = [Thread(target=wrapper) for i in range(2)] for thread in threadsPool: thread.start() for dir in dirs: taskQueue.put(dir) taskQueue.put(None) 3 for thread in threadsPool: thread.join() print('Main thread end here')

1 ：如果任务为终止符（此处为None），则退出

2 ：将这个终止符重新放回任务队列。因为只有一个终止符，如果不放回，则其它子线程获取不到，也就无法终止

3 ：将终止符放在任务队列。注意必须放置在末尾，否则终止符后的任务无法得到处理

修改过后，程序能够正常运行，主进程能够正常退出了。

searching pattern hello in dir a/b/csearching pattern hello in dir a/b/d searching pattern hello in dir b/c searching pattern hello in dir d/f Main thread end here
总结

要并行化处理子任务，最简单的方法是为每个子任务创建一个线程去处理。这种方法的缺点是：如果子任务非常多，则需要创建的线程数目会非常多。并且同时运行的线程数目也会较多。通过使用信号量来限制同时运行的线程数目，通过线程池来避免创建过多的线程。

与每个线程处理一个任务不同，线程池中每个线程会处理多个子任务。这带来一个问题：每个子线程如何知道要处理哪些子任务。一种方法是预先将所有子任务均分给每个线程，而更灵活的方法则是通过任务队列，由子线程自行决定要处理哪些任务。

使用线程池时，线程主函数通常实现为一个无限循环，因此需要考虑如何终止线程。可以在任务队列中放置一个终止符来告诉线程没有更多任务，因此其可以终止。

备注：我的博客即将搬运同步至腾讯云+社区，邀请大家一同入驻： https://cloud.tencent.com/developer/support-plan?invite_code=2qako5hpuv0g8
查看全文

相关阅读:
maskrcnn_benchmark代码分析(2)
NoSQL现状
 CAP理论
 svn revert
在SpringMVC中使用Jackson并格式化时间
 找了一个api管理工具
 SpringBoot读取application.properties文件
 MySQL性能优化的21个最佳实践和 mysql使用索引
 Cannot subclass final class class com.sun.proxy.$Proxy
AOP拦截器表达式写法

原文地址：https://www.cnblogs.com/astropeak/p/9035300.html

Python：怎样用线程将任务并行化？

为每个子任务创建一个线程

限制线程数目

使用线程池（Thread Pool）

使用消息队列

终止子进程

总结