(1) The memoize pattern is transparent to use and is evaluated lazily.
Joblib is a set of tools to provide lightweight pipelining in Python. In particular:
- transparent disk-caching of functions and lazy re-evaluation (memoize pattern)
- simple parallel computing
Joblib is optimized to be fast and robust on large data in particular and has specific optimizations for numpy arrays. It is BSD-licensed.
(2) Transparent persistence to disk -- minimal impact on existing programs, and a run can be restarted and resumed after a crash.
The vision is to provide tools to easily achieve better performance and reproducibility when working with long running jobs.
- Avoid computing the same thing twice: code is often rerun again and again, for instance when prototyping computationally heavy jobs (as in scientific development), but hand-crafted solutions to alleviate this issue are error-prone and often lead to unreproducible results.
- Persist to disk transparently: efficiently persisting arbitrary objects containing large data is hard. Using joblib’s caching mechanism avoids hand-written persistence and implicitly links the file on disk to the execution context of the original Python object. As a result, joblib’s persistence is good for resuming an application status or computational job, e.g. after a crash.
Joblib addresses these problems while leaving your code and your flow control as unmodified as possible (no framework, no new paradigms).
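To make the two points above concrete, here is a minimal sketch of the memoize-to-disk idea using only the standard library. It is not how joblib implements caching; the decorator name `disk_memoize`, the helper `slow_square`, and the SHA-256 key scheme are all assumptions made for illustration.

```python
import functools
import hashlib
import os
import pickle
import tempfile


def disk_memoize(cache_dir):
    """Hypothetical decorator: store results on disk, keyed by the call arguments."""
    os.makedirs(cache_dir, exist_ok=True)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Build a cache key from the function name and its pickled arguments.
            payload = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            key = hashlib.sha256(payload).hexdigest()
            path = os.path.join(cache_dir, key + ".pkl")
            if os.path.exists(path):
                # Cache hit: load the stored result instead of recomputing.
                with open(path, "rb") as fh:
                    return pickle.load(fh)
            result = func(*args, **kwargs)
            with open(path, "wb") as fh:
                pickle.dump(result, fh)
            return result

        return wrapper

    return decorator


calls = []  # records every real evaluation


@disk_memoize(tempfile.mkdtemp())
def slow_square(x):
    calls.append(x)
    return x * x


first = slow_square(4)   # computed and written to disk
second = slow_square(4)  # read back from disk, no recomputation
```

Because the results live on disk rather than in memory, a later process pointed at the same directory would find them again, which is the property that makes resuming after a crash possible.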
Main features
Transparent and fast disk-caching of output value: a memoize or make-like functionality for Python functions that works well for arbitrary Python objects, including very large numpy arrays. Separate persistence and flow-execution logic from domain logic or algorithmic code by writing the operations as a set of steps with well-defined inputs and outputs: Python functions. Joblib can save their computation to disk and rerun it only if necessary:
    >>> from joblib import Memory
    >>> cachedir = 'your_cache_dir_goes_here'
    >>> mem = Memory(cachedir)
    >>> import numpy as np
    >>> a = np.vander(np.arange(3)).astype(float)
    >>> square = mem.cache(np.square)
    >>> b = square(a)
    ________________________________________________________________________________
    [Memory] Calling square...
    square(array([[0., 0., 1.],
           [1., 1., 1.],
           [4., 2., 1.]]))
    ___________________________________________________________square - 0...s, 0.0min
    >>> c = square(a)
    >>> # The above call did not trigger an evaluation
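The same caching can be applied to your own functions by using `Memory.cache` as a decorator. The sketch below assumes a throwaway cache directory from `mkdtemp`; the function `costly_sum` and the `evaluations` list are hypothetical names used only to show when the function body really runs.

```python
from tempfile import mkdtemp

from joblib import Memory

# verbose=0 silences the "[Memory] Calling ..." log lines.
memory = Memory(mkdtemp(), verbose=0)

evaluations = []  # records each time the function body actually runs


@memory.cache
def costly_sum(n):
    evaluations.append(n)
    return sum(range(n))


r1 = costly_sum(1000)  # evaluated and stored on disk
r2 = costly_sum(1000)  # same argument: served from the cache
r3 = costly_sum(10)    # new argument: evaluated again
```

After these three calls, `evaluations` holds only the two distinct arguments, since the repeated call never reached the function body.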
Embarrassingly parallel helper: to make it easy to write readable parallel code and debug it quickly:
    >>> from joblib import Parallel, delayed
    >>> from math import sqrt
    >>> Parallel(n_jobs=1)(delayed(sqrt)(i**2) for i in range(10))
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
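The example above runs with a single worker. A small variation with two workers might look like the sketch below; `cube` is a hypothetical helper, and `prefer="threads"` is chosen here only to keep the sketch lightweight (omit it to use joblib's default process-based backend).

```python
from joblib import Parallel, delayed


def cube(x):
    """Hypothetical task: any picklable function of one argument would do."""
    return x ** 3


# delayed(cube)(i) captures the call without running it;
# Parallel dispatches the captured calls to two workers.
results = Parallel(n_jobs=2, prefer="threads")(
    delayed(cube)(i) for i in range(5)
)
```

The results come back as a list in the same order as the inputs, regardless of which worker finished first.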
Fast compressed Persistence: a replacement for pickle to work efficiently on Python objects containing large data (joblib.dump & joblib.load).
A simple example
First create a temporary directory:
    >>> from tempfile import mkdtemp
    >>> savedir = mkdtemp()
    >>> import os
    >>> filename = os.path.join(savedir, 'test.joblib')
Then create an object to be persisted:
    >>> import numpy as np
    >>> to_persist = [('a', [1, 2, 3]), ('b', np.arange(10))]
which is saved into filename:
    >>> import joblib
    >>> joblib.dump(to_persist, filename)
    ['...test.joblib']
The object can then be reloaded from the file:
    >>> joblib.load(filename)
    [('a', [1, 2, 3]), ('b', array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))]
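Since the feature list mentions compressed persistence, here is a sketch of the same round trip with the `compress` argument of `joblib.dump`. The file name `data.joblib`, the dictionary contents, and the compression level 3 are arbitrary choices for illustration.

```python
import os
from tempfile import mkdtemp

import joblib
import numpy as np

savedir = mkdtemp()
path = os.path.join(savedir, "data.joblib")

# A mixed object: a NumPy array next to plain Python data.
data = {"weights": np.zeros((100, 100)), "label": "demo"}

# compress takes an integer level from 0 (none) to 9 (smallest file, slowest).
written = joblib.dump(data, path, compress=3)

# Loading needs no extra argument; the compression is detected from the file.
restored = joblib.load(path)
```

`joblib.dump` returns the list of files it wrote, which is why the earlier session printed a one-element list.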
Pipelining -- based on the Parallel API
The Parallel call returns only after all tasks have finished; the code that follows it then continues to run.
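    # n_vectors, backend, stochastic_function_seeded and print_vector are defined elsewhere.
    # Draw one integer seed per task.
    random_state = np.random.randint(np.iinfo(np.int32).max, size=n_vectors)

    # First run: each task receives its own seed.
    random_vector = Parallel(n_jobs=2, backend=backend)(
        delayed(stochastic_function_seeded)(10, rng) for rng in random_state)
    print_vector(random_vector, backend)

    # Second run: the same call repeated with the same seeds.
    random_vector = Parallel(n_jobs=2, backend=backend)(
        delayed(stochastic_function_seeded)(10, rng) for rng in random_state)
    print_vector(random_vector, backend)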
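The fragment above depends on names that are not shown in these notes. A self-contained sketch under stated assumptions follows: the body of `stochastic_function_seeded` is a guess at what such a helper might do (build a private `RandomState` from the seed it receives), `run_once` is a hypothetical wrapper, and the `"threading"` backend is picked only to keep the example light.

```python
import numpy as np
from joblib import Parallel, delayed


def stochastic_function_seeded(max_value, seed):
    """Assumed helper: draw 5 integers below max_value from a generator built from seed."""
    rng = np.random.RandomState(seed)
    return rng.randint(max_value, size=5)


n_vectors = 3
# One integer seed per task.
seeds = np.random.randint(np.iinfo(np.int32).max, size=n_vectors)


def run_once(seeds, backend="threading"):
    """Hypothetical wrapper: run every seeded task and collect the vectors."""
    return Parallel(n_jobs=2, backend=backend)(
        delayed(stochastic_function_seeded)(10, s) for s in seeds
    )


first_run = run_once(seeds)
second_run = run_once(seeds)  # same seeds, so the same vectors are expected
```

Under these assumptions each task's output depends only on the seed it was handed, so repeating the call reproduces the same vectors no matter which worker picks up which task.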