12.8 Simple Parallel Programming

Problem

You have a program that performs a lot of CPU-intensive work, and you want to make it run faster by having it take advantage of multiple CPUs.

Solution

The concurrent.futures library provides a ProcessPoolExecutor class that can be used to execute computationally intensive functions in a separately running instance of the Python interpreter. However, in order to use it, you first need to have some computationally intensive work. Let's illustrate with a simple yet practical example. Suppose you have an entire directory of gzip-compressed Apache web server logs:

    logs/
       20120701.log.gz
       20120702.log.gz
       20120703.log.gz
       20120704.log.gz
       20120705.log.gz
       20120706.log.gz
       ...
Further suppose each log file contains lines like this:

    124.115.6.12 - - [10/Jul/2012:00:18:50 -0500] "GET /robots.txt ..." 200 71
    210.212.209.67 - - [10/Jul/2012:00:18:51 -0500] "GET /ply/ ..." 200 11875
    210.212.209.67 - - [10/Jul/2012:00:18:51 -0500] "GET /favicon.ico ..." 404 369
    61.135.216.105 - - [10/Jul/2012:00:20:04 -0500] "GET /blog/atom.xml ..." 304 -
    ...

Here is a simple script that takes this data and identifies all hosts that have accessed the robots.txt file:
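The script itself did not survive extraction here; the following is a minimal reconstruction. The module name findrobots.py, the function names find_robots and find_all_robots, and the position of the request path in the split log line are assumptions based on the surrounding description:

```python
# findrobots.py -- a reconstruction; names and log-field layout are assumptions
import gzip
import io
import glob

def find_robots(filename):
    '''Find all of the hosts that accessed robots.txt in a single log file.'''
    robots = set()
    with gzip.open(filename) as f:
        for line in io.TextIOWrapper(f, encoding='ascii'):
            fields = line.split()
            # In Apache common log format, field 6 is the requested path
            if fields[6] == '/robots.txt':
                robots.add(fields[0])
    return robots

def find_all_robots(logdir):
    '''Find all hosts across an entire directory of compressed log files.'''
    files = glob.glob(logdir + '/*.log.gz')
    all_robots = set()
    for robots in map(find_robots, files):
        all_robots.update(robots)
    return all_robots

if __name__ == '__main__':
    for ipaddr in find_all_robots('logs'):
        print(ipaddr)
```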

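To spread such a scan across multiple CPUs, only the inner map() call needs to change to ProcessPoolExecutor.map(). A self-contained sketch, again with assumed function names and log-field positions rather than the book's exact listing:

```python
import gzip
import io
import glob
from concurrent.futures import ProcessPoolExecutor

def find_robots(filename):
    '''Find all of the hosts that accessed robots.txt in a single log file.'''
    robots = set()
    with gzip.open(filename) as f:
        for line in io.TextIOWrapper(f, encoding='ascii'):
            fields = line.split()
            if fields[6] == '/robots.txt':
                robots.add(fields[0])
    return robots

def find_all_robots(logdir):
    '''Scan the log files in parallel, one file per worker task.'''
    files = glob.glob(logdir + '/*.log.gz')
    all_robots = set()
    with ProcessPoolExecutor() as pool:
        # pool.map() farms each filename out to a separate worker process
        for robots in pool.map(find_robots, files):
            all_robots.update(robots)
    return all_robots
```

Because each log file is processed independently, the work decomposes cleanly and the speedup scales with the number of available CPUs.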
If you manually submit a job, the result is an instance of Future. To obtain the actual result, you call its result() method. This blocks until the result is computed and returned by the pool. Instead of blocking, you can also arrange to have a callback function triggered upon completion. For example:

    def when_done(r):
        print('Got:', r.result())

    with ProcessPoolExecutor() as pool:
        future_result = pool.submit(work, arg)
        future_result.add_done_callback(when_done)
The user-supplied callback function receives an instance of Future that must be used to obtain the actual result (i.e., by calling its result() method).

Although process pools can be easy to use, there are a number of important considerations to be made in designing larger programs. In no particular order:

  • This technique for parallelization only works well for problems that can be trivially decomposed into independent parts.

  • Work must be submitted in the form of simple functions. Parallel execution of instance methods, closures, or other kinds of constructs is not supported.

  • Function arguments and return values must be compatible with pickle. Work is carried out in a separate interpreter using interprocess communication. Thus, data exchanged between interpreters has to be serialized.

  • Functions submitted for work should not maintain persistent state or have side effects. With the exception of simple things such as logging, you don't really have any control over the behavior of child processes once started. Thus, to preserve your sanity, it is probably best to keep things simple and carry out work in pure functions that don't alter their environment.

  • Process pools are created by calling the fork() system call on Unix. This makes a clone of the Python interpreter, including all of the program state at the time of the fork. On Windows, an independent copy of the interpreter that does not clone state is launched. The actual forking process does not occur until the first pool.map() or pool.submit() method is called.

  • Great care should be taken when combining process pools and programs that use threads. In particular, you should probably create and launch process pools prior to the creation of any threads (e.g., create the pool in the main thread at program startup).
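The restriction to simple, pickle-compatible functions can be demonstrated directly. A short sketch (the function name square is illustrative only):

```python
from concurrent.futures import ProcessPoolExecutor

def square(x):
    # A plain module-level function: picklable, so it can be sent to workers
    return x * x

if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(square, [1, 2, 3])))   # [1, 4, 9]

        # By contrast, a lambda (or any closure) cannot be pickled, so
        # collecting its result raises an error:
        # pool.submit(lambda x: x * x, 2).result()   # PicklingError
```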
