"Do child processes spawned via multiprocessing share objects created earlier in the program?"

No.

Processes have independent memory space.

Solution 1

To make best use of a large structure with lots of workers, do this.

Write each worker as a "filter" -- reads intermediate results from stdin, does work, writes intermediate results on stdout.

Connect all the workers as a pipeline:

process1 <source | process2 | process3 | ... | processn >result

Each process reads, does work and writes.

This is remarkably efficient since all processes are running concurrently. The writes and reads pass directly through shared buffers between the processes.

Solution 2

In some cases, you have a more complex structure -- often a "fan-out" structure. In this case you have a parent with multiple children.

Parent opens source data. Parent forks a number of children.
Parent reads source, farms parts of the source out to each concurrently running child.
When parent reaches the end, close the pipe. Child gets end of file and finishes normally.

The child parts are pleasant to write because each child simply reads sys.sydin.

The parent has a little bit of fancy footwork in spawning all the children and retaining the pipes properly, but it's not too bad.

Fan-in is the opposite structure. A number of independently running processes need to interleave their inputs into a common process. The collector is not as easy to write, since it has to read from many sources.

Reading from many named pipes is often done using the select module to see which pipes have pending input.

Solution 3

Shared lookup is the definition of a database.

Solution 3A -- load a database. Let the workers process the data in the database.

Solution 3B -- create a very simple server using werkzeug (or similar) to provide WSGI applications that respond to HTTP GET so the workers can query the server.

Solution 4

Shared filesystem object. Unix OS offers shared memory objects. These are just files that are mapped to memory so that swapping I/O is done instead of more convention buffered reads.

You can do this from a Python context in several ways

Write a startup program that (1) breaks your original gigantic object into smaller objects, and (2) starts workers, each with a smaller object. The smaller objects could be pickled Python objects to save a tiny bit of file reading time.

Write a startup program that (1) reads your original gigantic object and writes a page-structured, byte-coded file using seek operations to assure that individual sections are easy to find with simple seeks. This is what a database engine does -- break the data into pages, make each page easy to locate via a seek.

Spawn workers with access this this large page-structured file. Each worker can seek to the relevant parts and do their work there.

http://www.doughellmann.com/PyMOTW/multiprocessing/mapreduce.html

Implementing MapReduce with multiprocessing¶

The Pool class can be used to create a simple single-server MapReduce implementation. Although it does not give the full benefits of distributed processing, it does illustrate how easy it is to break some problems down into distributable units of work.

SimpleMapReduce¶

In MapReduce, input data is broken down into chunks for processing by different worker instances. Each chunk of input data is mapped to an intermediate state using a simple transformation. The intermediate data is then collected together and partitioned based on a key value so that all of the related values are together. Finally, the partitioned data is reduced to a result set.

import collections
import itertools
import multiprocessing

class SimpleMapReduce(object):
    
    def __init__(self, map_func, reduce_func, num_workers=None):
        """
        map_func

          Function to map inputs to intermediate data. Takes as
          argument one input value and returns a tuple with the key
          and a value to be reduced.
        
        reduce_func

          Function to reduce partitioned version of intermediate data
          to final output. Takes as argument a key as produced by
          map_func and a sequence of the values associated with that
          key.
         
        num_workers

          The number of workers to create in the pool. Defaults to the
          number of CPUs available on the current host.
        """
        self.map_func = map_func
        self.reduce_func = reduce_func
        self.pool = multiprocessing.Pool(num_workers)
    
    def partition(self, mapped_values):
        """Organize the mapped values by their key.
        Returns an unsorted sequence of tuples with a key and a sequence of values.
        """
        partitioned_data = collections.defaultdict(list)
        for key, value in mapped_values:
            partitioned_data[key].append(value)
        return partitioned_data.items()
    
    def __call__(self, inputs, chunksize=1):
        """Process the inputs through the map and reduce functions given.
        
        inputs
          An iterable containing the input data to be processed.
        
        chunksize=1
          The portion of the input data to hand to each worker.  This
          can be used to tune performance during the mapping phase.
        """
        map_responses = self.pool.map(self.map_func, inputs, chunksize=chunksize)
        partitioned_data = self.partition(itertools.chain(*map_responses))
        reduced_values = self.pool.map(self.reduce_func, partitioned_data)
        return reduced_values

Counting Words in Files¶

The following example script uses SimpleMapReduce to counts the “words” in the reStructuredText source for this article, ignoring some of the markup.

import multiprocessing
import string

from multiprocessing_mapreduce import SimpleMapReduce

def file_to_words(filename):
    """Read a file and return a sequence of (word, occurances) values.
    """
    STOP_WORDS = set([
            'a', 'an', 'and', 'are', 'as', 'be', 'by', 'for', 'if', 'in', 
            'is', 'it', 'of', 'or', 'py', 'rst', 'that', 'the', 'to', 'with',
            ])
    TR = string.maketrans(string.punctuation, ' ' * len(string.punctuation))

    print multiprocessing.current_process().name, 'reading', filename
    output = []

    with open(filename, 'rt') as f:
        for line in f:
            if line.lstrip().startswith('..'): # Skip rst comment lines
                continue
            line = line.translate(TR) # Strip punctuation
            for word in line.split():
                word = word.lower()
                if word.isalpha() and word not in STOP_WORDS:
                    output.append( (word, 1) )
    return output


def count_words(item):
    """Convert the partitioned data for a word to a
    tuple containing the word and the number of occurances.
    """
    word, occurances = item
    return (word, sum(occurances))


if __name__ == '__main__':
    import operator
    import glob

    input_files = glob.glob('*.rst')
    
    mapper = SimpleMapReduce(file_to_words, count_words)
    word_counts = mapper(input_files)
    word_counts.sort(key=operator.itemgetter(1))
    word_counts.reverse()
    
    print '\nTOP 20 WORDS BY FREQUENCY\n'
    top20 = word_counts[:20]
    longest = max(len(word) for word, count in top20)
    for word, count in top20:
        print '%-*s: %5s' % (longest+1, word, count)

Each input filename is converted to a sequence of (word, 1) pairs by file_to_words. The data is partitioned by SimpleMapReduce.partition() using the word as the key, so the partitioned data consists of a key and a sequence of 1 values representing the number of occurrences of the word. The reduction phase converts that to a pair of (word, count) values by calling count_words for each element of the partitioned data set.

$ python multiprocessing_wordcount.py

PoolWorker-2 reading communication.rst
PoolWorker-2 reading index.rst
PoolWorker-1 reading basics.rst
PoolWorker-1 reading mapreduce.rst

TOP 20 WORDS BY FREQUENCY

process         :    75
multiprocessing :    40
worker          :    35
after           :    30
running         :    29
start           :    28
processes       :    26
python          :    26
literal         :    25
header          :    25
pymotw          :    25
end             :    25
daemon          :    23
now             :    21
consumer        :    19
starting        :    18
exiting         :    16
event           :    15
value           :    14
run             :    13

See also

MapReduce - Wikipedia: Overview of MapReduce on Wikipedia.
MapReduce: Simplified Data Processing on Large Clusters: Google Labs presentation and paper on MapReduce.
operator: Operator tools such as itemgetter().

Answer by: S.Lott

Related topics: python multiprocessing

Warning
Some of this package’s functionality requires a functioning shared semaphore implementation on the host operating system. Without one, the multiprocessing.synchronize module will be disabled, and attempts to import it will result in an ImportError. See issue 3770 for additional information.
Note
Functionality within this package requires that the __main__ method be importable by the children. This is covered in Programming guidelines however it is worth pointing out here. This means that some examples, such as the multiprocessing.Pool examples will not work in the interactive interpreter. For example:
Extended Slices
Ever since Python 1.4, the slicing syntax has supported an optional third ``step'' or ``stride'' argument. For example, these are all legal Python syntax: L[1:10:2], L[:-1:1], L[::-1]. This was added to Python at the request of the developers of Numerical Python, which uses the third argument extensively. However, Python's built-in list, tuple, and string sequence types have never supported this feature, raising a TypeError if you tried it. Michael Hudson contributed a patch to fix this shortcoming.
For example, you can now easily extract the elements of a list that have even indexes:
>>> L = range(10) >>> L[::2] [0, 2, 4, 6, 8]
Negative values also work to make a copy of the same list in reverse order:
>>> L[::-1] [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
This also works for tuples, arrays, and strings:
>>> s='abcd' >>> s[::2] 'ac' >>> s[::-1] 'dcba'
If you have a mutable sequence such as a list or an array you can assign to or delete an extended slice, but there are some differences between assignment to extended and regular slices. Assignment to a regular slice can be used to change the length of the sequence:
>>> a = range(3) >>> a [0, 1, 2] >>> a[1:3] = [4, 5, 6] >>> a [0, 4, 5, 6]
Extended slices aren't this flexible. When assigning to an extended slice, the list on the right hand side of the statement must contain the same number of items as the slice it is replacing:
>>> a = range(4) >>> a [0, 1, 2, 3] >>> a[::2] [0, 2] >>> a[::2] = [0, -1] >>> a [0, 1, -1, 3] >>> a[::2] = [0,1,2] Traceback (most recent call last): File "<stdin>", line 1, in ? ValueError: attempt to assign sequence of size 3 to extended slice of size 2
Deletion is more straightforward:
>>> a = range(4) >>> a [0, 1, 2, 3] >>> a[::2] [0, 2] >>> del a[::2] >>> a [1, 3]
One can also now pass slice objects to the __getitem__ methods of the built-in sequences:
>>> range(10).__getitem__(slice(0, 5, 2)) [0, 2, 4]
Or use slice objects directly in subscripts:
>>> range(10)[slice(0, 5, 2)] [0, 2, 4]
To simplify implementing sequences that support extended slicing, slice objects now have a method indices(length) which, given the length of a sequence, returns a (start, stop, step) tuple that can be passed directly to range(). indices() handles omitted and out-of-bounds indices in a manner consistent with regular slices (and this innocuous phrase hides a welter of confusing details!). The method is intended to be used like this:
class FakeSeq: ... def calc_item(self, i): ... def __getitem__(self, item): if isinstance(item, slice): indices = item.indices(len(self)) return FakeSeq([self.calc_item(i) for i in range(*indices)]) else: return self.calc_item(i)
From this example you can also see that the built-in slice object is now the type object for the slice type, and is no longer a function. This is consistent with Python 2.2, where int, str, etc., underwent the same change.

concurrency processing mutiple file python solution

Some Notes on Tim Bray's Wide Finder Benchmark

The Problem #

A Single-Threaded Python Solution #

Compiling the RE #

Skipping lines that cannot match #

Reading files in binary mode (Windows) #

The Code #

A Multi-Threaded Python Solution #

A Multi-Processor Python Solution #

Memory Mapping #

Summary #

Addenda #

Technology Answers

Python multiprocessing: sharing a large read-only object between processes?

Answer

Implementing MapReduce with multiprocessing¶

SimpleMapReduce¶

Counting Words in Files¶

Extended Slices