13. Optimization: Solutions – Expert Python Programming

Chapter 13. Optimization: Solutions

Optimizing a program is not a magical process. It is done by following a simple process, synthesized by Stefan Schwarzer at Europython 2006 in an original pseudo-code example:

def optimize():
    """Recommended optimization"""
    assert got_architecture_right(), "fix architecture"
    assert made_code_work(bugs=None), "fix bugs"
    while code_is_too_slow():
        wbn = find_worst_bottleneck(just_guess=False,
                                    profile=True)
        is_faster = try_to_optimize(wbn,
                                    run_unit_tests=True,
                                    new_architecture=False)
        if not is_faster:
            undo_last_code_changes()

# By Stefan Schwarzer, Europython 2006

This chapter presents some solutions to optimize your program through:

  • Reducing the complexity

  • Multithreading

  • Multiprocessing

  • Caching

Reducing the Complexity

There are many definitions of what makes a program complex, and many ways to express it. But at the code level, where we want to make an isolated sequence of statements faster, there is a limited number of techniques to quickly detect the lines responsible for a bottleneck.

The two main techniques are:

  • Measuring the cyclomatic complexity of the code

  • Measuring the Landau notation, also called Big-O notation

From there, the optimization process will consist of reducing the complexity so that the code is fast enough. This section provides simple tips for this work by simplifying loops. But first of all, let's learn how to measure complexity.

Measuring Cyclomatic Complexity

Cyclomatic complexity is a metric introduced by McCabe that measures the number of linearly independent paths through the code. All if statements and for and while loops are counted to come up with a measure.
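A minimal sketch of that counting idea, using the standard library's ast module (this is not PyMetrics; real tools also count boolean operators and exception handlers, which this illustration ignores):

```python
import ast

# Approximation: cyclomatic complexity = 1 + the number of
# decision points (if, for, while) found in the parse tree.
BRANCH_NODES = (ast.If, ast.For, ast.While)

def cyclomatic_complexity(source):
    tree = ast.parse(source)
    branches = sum(isinstance(node, BRANCH_NODES)
                   for node in ast.walk(tree))
    return 1 + branches

code = """
def f(items):
    for item in items:
        if item > 0:
            print(item)
"""
print(cyclomatic_complexity(code))  # 3: one for, one if, plus one
```

A straight-line module with no branches scores 1, the bottom of the "not complex" range.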

The code can then be categorized as follows:

Cyclomatic Complexity    What it means
1 to 10                  Not complex
11 to 20                 Moderately complex
21 to 50                 Really complex
More than 50             Too complex

In Python, this can be done automatically by parsing the AST (Abstract Syntax Tree); see http://en.wikipedia.org/wiki/Abstract_Syntax_Tree.

The PyMetrics project from Reg Charney ( http://sourceforge.net/projects/pymetrics) provides a nice script to calculate the cyclomatic complexity.

Measuring the Big-O Notation

The complexity of a function can be expressed by the Big-O notation (see http://en.wikipedia.org/wiki/Big_O_notation). This metric defines how an algorithm is affected by the size of the input data. For instance, does the algorithm scale linearly with the size of the input data or quadratically?

Calculating the Big-O notation manually for an algorithm is the best approach to optimize your code, as it gives you the ability to detect and focus on the parts that will really slow down the code.

To measure the Big-O notation, all constants and low-order terms are removed in order to focus on the portion that really weighs when the input data grows. The idea is to try to categorize the algorithm in one of these categories, even if the fit is only approximate:

Notation      Type
O(1)          Constant. Does not depend on the input data.
O(n)          Linear. Will grow as "n" grows.
O(n log n)    Quasi linear
O(n²)         Quadratic complexity
O(n³)         Cubic complexity
O(n!)         Factorial complexity

For instance, a dict look up is O(1) (pronounced "order 1") and is considered constant regardless of how many elements are in the dict, whereas looking through a list of items for a particular item is O(n).
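That difference can be observed directly. Here is a hypothetical measurement sketch using the standard library's timeit; the absolute numbers will vary by machine, but the gap between the two lookups will not:

```python
from timeit import timeit

# Membership in a list scans the elements one by one (O(n));
# membership in a set hashes the key (O(1)).
items = list(range(10000))
as_list = items
as_set = set(items)

# Look up the worst case for the list: the last element.
list_time = timeit(lambda: 9999 in as_list, number=1000)
set_time = timeit(lambda: 9999 in as_set, number=1000)
print(list_time > set_time)  # True on any realistic run
```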

Let's take another example:

>>> def function(n):
...     for i in range(n):
...         print i

In that case, the loop speed will depend on "n", and the Big-O notation will be O(n).

If the function has conditions, the notation to keep is the highest one:

>>> def function(n):
...     if some_test:
...         print 'something'
...     else:
...         for i in range(n):
...             print i

In this example, the function can be O(1) or O(n), depending on the test. So the worst case is O(n), which is the notation to keep.

Let's take another example:

>>> def function(n):
...     for i in range(n):
...         for j in range(n):
...             print i, j

A nested loop introduces a quadratic complexity O(n²), which is very expensive when n is big.

Of course, the notation needs to take the called functions into account:

>>> def function(n):
...     for i in range(n):
...         print i
>>> def other_function(n):
...     if some_test:
...         for i in range(n):
...             function(n)
...     else:
...         function(n)

The other_function either calls function, which has O(n) complexity, or calls it in a loop that itself has O(n) complexity, so the worst case is O(n*n): O(n²).

As said earlier, constants and low-order terms should be removed when calculating the notation because they don't really matter when data size is getting big:

>>> def function(n):
...     for i in range(n*2):
...         print i

This function is O(2n), but since constants are removed, we just say O(n). That said, this simplification should be kept in mind when you are comparing several algorithms.

Be careful: although we usually assume that an O(n²) (quadratic) function will be faster than an O(n³) (cubic) function, this may not always be the case. Sometimes, for smaller values of n, the cubic function is faster, while for larger values of n the quadratic function catches up and is faster. For instance, O(100*n²), which is simplified to O(n²), is not necessarily faster than O(5*n³), which corresponds to O(n³). That is why you should optimize only once profiling has shown where to do it.
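The arithmetic behind that warning can be checked directly. The cost models below are the illustrative 100*n² and 5*n³ figures from the paragraph above; the crossover sits exactly at n = 20:

```python
# Illustrative cost models: the "slower" cubic algorithm actually
# wins while n is small, because of its smaller constant factor.
def quadratic_cost(n):
    return 100 * n ** 2

def cubic_cost(n):
    return 5 * n ** 3

print(cubic_cost(10) < quadratic_cost(10))    # True: 5000 < 10000
print(cubic_cost(20) == quadratic_cost(20))   # True: crossover at n = 20
print(cubic_cost(100) < quadratic_cost(100))  # False: 5000000 > 1000000
```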

If you want to practice with Big-O, you can work through the exercises at http://pages.cs.wisc.edu/~hasti/cs367-common/notes/COMPLEXITY.html#bigO.


The Big-O notation is a great way to improve your algorithms, but beware that:

  • Its calculation implies approximation.

  • It's accurate only for pure Python code, which does not depend on external resources.

When you are unable to calculate the complexity of an algorithm, for instance if it involves C code that is not easy to dig into, fall back on tools such as timeit or the profile decorator that was presented in the previous chapter, with enough input data to test the algorithm's efficiency.


To reduce the complexity of an algorithm, the way data is stored is fundamental. You should pick your data structure carefully. This section provides a few examples.

Searching in a List

If you need to provide a search algorithm for a list instance, a binary search over a sorted version will reduce the complexity from O(n) to O(log n). The bisect module can be used for this since, given a value, it uses a binary search to return the next insertion index for a sorted sequence:

>>> from bisect import bisect
>>> def find(seq, el):
...     pos = bisect(seq, el)
...     if pos == 0 or seq[pos - 1] != el:
...         return -1
...     return pos - 1
>>> seq = [2, 3, 7, 8, 9]
>>> find(seq, 9)
4
>>> find(seq, 10)
-1
>>> find(seq, 0)
-1
>>> find(seq, 7)
2

Of course, this means that either the list is already sorted or you need to sort it. On the other hand, if you already have a sorted list, you can also insert new items into that list using bisect without needing to re-sort the list (i.e. insertion sort; see http://en.wikipedia.org/wiki/Insertion_sort).
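A small sketch of that insertion technique: bisect.insort places a new item at the right spot in an already sorted list, so binary searches stay valid without a full re-sort.

```python
from bisect import insort, bisect_left

# Keep a list sorted as items arrive, instead of re-sorting.
scores = [2, 3, 7, 8, 9]
insort(scores, 5)              # binary search + single insert
print(scores)                  # [2, 3, 5, 7, 8, 9]
print(bisect_left(scores, 7))  # 3: index where 7 now lives
```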

Using a Set Instead of a List

When you need to build a sequence of distinct values out of a given sequence, the first algorithm that comes to mind is:

>>> seq = ['a', 'a', 'b', 'c', 'c', 'd']
>>> res = []
>>> for el in seq:
...     if el not in res:
...         res.append(el)
>>> res
['a', 'b', 'c', 'd']

The complexity comes from the lookup in the res list with the in operator, which costs at most O(n). That lookup is itself performed inside the global loop, which costs O(n). So the overall complexity is mostly quadratic.

Using a set type for the same work will be faster, because the stored values are looked up using hashes, as with the dict type. In other words, for each value in seq, the time taken to see if it is already in the set will be constant:

>>> seq = ['a', 'a', 'b', 'c', 'c', 'd']
>>> res = set(seq)
>>> res
set(['a', 'c', 'b', 'd'])

This lowers the complexity to O(n).

Of course, this assumes that the rest of the algorithm can use a set object, which ignores duplicates.


When you try to reduce the complexity of an algorithm, carefully consider your data structures. There are a range of built-in types, so pick the right one.

It is often better to transform your data before the algorithm is called than to try changing the algorithm to make it faster on the original data.
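As an illustration of that advice, here is a sketch (the data is made up) where converting two sequences to sets once replaces a quadratic nested scan with hashed lookups:

```python
# Transforming the data first: build sets once, then intersect.
invoices = ['a12', 'b44', 'c07', 'a12']
payments = ['c07', 'd91', 'a12']

# O(n*m) version: scans payments once per invoice.
slow = [i for i in invoices if i in payments]

# O(n+m) version: one conversion each, then constant-time lookups.
fast = set(invoices) & set(payments)
print(sorted(fast))  # ['a12', 'c07']
```

Note that, as with the set example above, the transformation drops duplicates, so the rest of the algorithm must be able to live with that.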

Cut the External Calls, Reduce the Workload

Part of the complexity is introduced by calls to other functions, methods, or classes. In general, get as much of the code out of loops as possible. This is doubly important for nested loops. Don't recalculate inside a loop things that can be calculated before the loop even begins. Inner loops should be tight.
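A small sketch of that advice, with made-up helper names: the loop-invariant len() call and the bound-method lookup are computed once before the loop instead of on every iteration:

```python
# Both functions divide each value by the element count; the fast
# version hoists the loop-invariant work out of the loop.
def normalize_slow(values):
    result = []
    for v in values:
        result.append(v / float(len(values)))  # len() re-evaluated each pass
    return result

def normalize_fast(values):
    result = []
    total = float(len(values))   # hoisted: computed once
    append = result.append       # local alias avoids repeated attribute lookup
    for v in values:
        append(v / total)
    return result

print(normalize_fast([2, 4, 6]) == normalize_slow([2, 4, 6]))  # True
```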

Using Collections

The collections module provides alternatives to built-in container types. They are available in three types:

  • deque: A list-like type with extra features

  • defaultdict: A dict-like type with a built-in default factory feature

  • namedtuple: A tuple-like type that assigns keys for members (2.6 only)


deque

A deque is an alternative implementation for lists. Whereas a list is based on arrays, a deque is based on a doubly linked list. Hence, a deque is much faster when you need to insert something into its middle or head, but much slower when you need to access an arbitrary index. Of course, modern hardware does memory copies so quickly that the downsides of a list aren't as severe as one might imagine. So be sure to profile your code before switching from a list to a deque.

For example, if you want to remove two elements located at a given position in the middle of a sequence without creating a second list instance, a deque object will be faster than a slice assignment on a list:

>>> from pbp.scripts.profiler import profile, stats
>>> from collections import deque
>>> my_list = range(100000)
>>> my_deque = deque(range(100000))
>>> @profile('by_list')
... def by_list():
...     my_list[500:502] = []
>>> @profile('by_deque')
... def by_deque():
...     my_deque.rotate(500)
...     my_deque.pop()
...     my_deque.pop()
...     my_deque.rotate(-500)
>>> by_list();by_deque()
>>> print stats['by_list']
{'stones': 47.836141152815379, 'memory': 396, 
 'time': 1.0523951053619385}
>>> print stats['by_deque']
{'stones': 19.198688593777742, 'memory': 552, 
 'time': 0.42237114906311035}

deque also provides more efficient append and pop methods that work at the same speed from both ends of the sequence. This makes it a perfect type for queues.

For example, a FIFO (First In First Out) queue will be much more efficient with a deque:

>>> from collections import deque
>>> from pbp.scripts.profiler import profile, stats
>>> import sys
>>> queue = deque()
>>> def d_add_data(data):
...     queue.appendleft(data)
>>> def d_process_data():
...     queue.pop()
>>> BIG_N = 1000000
>>> @profile('deque')
... def sequence():
...     for i in range(BIG_N):
...         d_add_data(i)
...     for i in range(BIG_N/2):
...         d_process_data()
...     for i in range(BIG_N):
...         d_add_data(i)
>>> lqueue = []
>>> def l_add_data(data):
...     lqueue.append(data)
>>> def l_process_data():
...     lqueue.pop(-1)
>>> @profile('list')
... def lsequence():
...     for i in range(BIG_N):
...         l_add_data(i)
...     for i in range(BIG_N/2):
...         l_process_data()
...     for i in range(BIG_N):
...         l_add_data(i)
>>> sequence(); lsequence()
>>> print stats['deque']
{'stones': 86.521963988031672, 'memory': 17998920, 'time': 1.9380919933319092}
>>> print stats['list']
{'stones': 222.34191851956504, 'memory': 17994312, 'time': 4.9804589748382568}


Python 2.6 provides new useful queue classes such as a LIFO queue (LifoQueue) and a priority queue (PriorityQueue) in the Queue module (which has been renamed to queue in Python 3k).


defaultdict

The defaultdict type is similar to the dict type, but adds a default factory for new keys. This avoids writing an extra test to initialize the mapping entry, and is more efficient than the dict.setdefault method.

The Python documentation provides a usage example for this feature, which runs almost three times faster than dict.setdefault:

>>> from collections import defaultdict
>>> from pbp.scripts.profiler import profile, stats
>>> s = [('yellow', 1), ('blue', 2), ('yellow', 3), 
...      ('blue', 4), ('red', 1)]
>>> @profile('defaultdict')
... def faster():
...     d = defaultdict(list)
...     for k, v in s:
...         d[k].append(v)
>>> @profile('dict')
... def slower():
...     d = {}
...     for k, v in s:
...         d.setdefault(k, []).append(v)
>>> slower(); faster()
>>> stats['dict']
{'stones': 16.587882671716077, 'memory': 396, 
 'time': 0.35166311264038086}
>>> stats['defaultdict']
{'stones': 6.5733464259021686, 'memory': 552, 
 'time': 0.13935494422912598}

The defaultdict type takes a factory as a parameter, and can therefore be used with built-in types or classes whose constructor does not take arguments:

>>> lg = defaultdict(long)
>>> lg['one']
0L


namedtuple

namedtuple is a class factory that takes a type name and a list of attributes, and creates a class out of it. The class can then be used to instantiate a tuple-like object and provides accessors for its elements:

>>> from collections import namedtuple 
>>> Customer = namedtuple('Customer', 
...                       'firstname lastname')
>>> c = Customer(u'Tarek', u'Ziadé')
>>> c.firstname
u'Tarek'

It can be used to create records that are easier to write, compared to a custom class that would require some boiler-plate code to initialize values. The generated class can be subclassed to add more operations.
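For instance, a minimal sketch of that subclassing technique; the Customer type reuses the example above, and display is a made-up method added for illustration:

```python
from collections import namedtuple

# Subclassing the generated class to attach behavior to the record.
class Customer(namedtuple('Customer', 'firstname lastname')):
    def display(self):
        return '%s %s' % (self.firstname, self.lastname)

c = Customer('Tarek', 'Ziade')
print(c.display())   # Tarek Ziade
print(c.lastname)    # Ziade
```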


Reducing the complexity is done by storing the data in an efficient data structure that works well with the way the algorithm will use it.

That said, when the solution is not obvious, you should consider dropping and rewriting the offending part instead of killing the code readability for the sake of performance.

Often, the Python code can be readable and fast. So try to find a good way to perform the work instead of trying to work around a flawed design.


Multithreading

Threading is often considered a complex topic by developers. While this is true, Python provides high-level classes and functions that ease the use of threading for the specific use cases this section will present.

To summarize this section, threading should be considered when some tasks can be performed in the background while the main program is doing something else.

What is Multithreading?

A thread is short for a thread of execution. A programmer can split his or her work into threads that run simultaneously and share the same memory context. Unless your code depends on third-party resources, multi-threading will not speed it up on a mono-processor machine, and will even add some overhead for thread management. Multi-threading will benefit from a multiprocessor or multi-core machine and will parallelize each thread execution on each CPU, thus making the program faster.

The fact that the same context is shared among threads means you must protect data from concurrent accesses. If two threads update the same data without any protection, a race condition occurs. Also called a race hazard, this leads to unexpected results because the code run by each thread makes false assumptions about the state of the data.

Lock mechanisms help in protecting data, and thread programming has always been a matter of making sure that the resources are accessed by threads in a safe way. This can be quite hard, and thread programming often leads to bugs that are hard to debug, since they are hard to reproduce. The worst problem occurs when, due to a wrong code design, two threads each lock a resource and try to acquire the resource that the other thread has locked. They will wait for each other forever. This is called a deadlock and is quite hard to debug. Reentrant locks help a bit here by making sure a thread doesn't block itself by attempting to acquire a resource it has already locked.

Nevertheless, when threads are used for isolated needs with tools that were built for them, they may increase the speed of programs.

Multithreading is often implemented at the kernel level. When the machine has one single processor with a single core, the system uses a timeslicing mechanism. Here, the CPU switches from one thread to another so fast that there is an illusion of parallelization. This is done for processes as well. On multiprocessor or multi-core machines, even if timeslicing is used, processes and threads are distributed among CPUs, making the program really fast.

How Python Deals with Threads

Unlike some other languages, Python uses multiple kernel-level threads that can each run any of the interpreter-level threads. However, all threads accessing Python objects are serialized by one global lock. This is done because much of the interpreter code, as well as third-party C code, is not thread-safe and needs to be protected.

This mechanism is called the Global Interpreter Lock (GIL), and some developers have started to ask for its removal from Python 3k. But, as Guido stated, this would involve too much work and make the Python implementation more complex. So the GIL stays.


Stackless Python or Stackless is an experimental implementation of the Python programming language, so named because it avoids depending on the C call stack for its stack. The language supports generators, micro-threads, and coroutines, and provides the benefits of thread-based programming without the performance and complexity problems associated with conventional threads.

See: http://www.stackless.com.

Although some developers are frustrated by this limitation, many developers understand the fundamental difficulty of doing multithreaded programming correctly. Hence, many Python programmers will often opt to use multiple processes instead of multiple threads. Because processes have separate memory contexts, they aren't quite as susceptible to data corruption as threads are.

So what is the point of multithreading in Python?

When threads contain only pure Python code, there is no point in using threads to speed up the program, since the GIL will serialize their execution. However, multiple threads can do IO operations or execute C code in certain third-party extensions in parallel.

For non-pure code blocks where external resources are used or C code involved, multithreading is useful to wait for a third-party resource to return results. This is because a sleeping thread that has explicitly unlocked the GIL can stand by and wake up when results are back. Last, whenever a program needs to provide a responsive interface, multithreading is the answer even if it uses timeslicing. The program can interact with the user while doing some heavy computing in the so-called background.
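A quick sketch of that behavior, with time.sleep standing in for a blocking third-party call: because sleep releases the GIL, four 0.2-second waits overlap instead of adding up.

```python
import threading
import time

def fake_io():
    time.sleep(0.2)  # stands in for a blocking external call

start = time.time()
threads = [threading.Thread(target=fake_io) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
print(elapsed < 0.8)  # True: closer to 0.2s than to 4 * 0.2s
```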


See Shannon Behrens' article on concurrency at Dr. Dobb's for more details: http://ddj.com/linux-open-source/206103078.

The next section tries to cover common use cases.

When Should Threading Be Used?

Despite the GIL limitation, threads can be really useful in some cases. They can help in:

  • Building responsive interfaces

  • Delegating work

  • Building multi-user applications

Building Responsive Interfaces

Let's say you ask your system to copy files from one folder to another through a graphical user interface. The task will possibly be pushed into the background and the window will be constantly refreshed by the main thread, so you get live feedback on the operation. You will also be able to cancel the operation. This is less irritating than a raw cp or copy command that does not provide any feedback until the whole work is finished, and that has to be stopped with Ctrl+C.

A responsive interface also allows a user to work on several tasks. For instance, Gimp will let you play around with a picture while another one is being filtered, since the two tasks are independent.


When you are building a user interface, try to push long running tasks into the background, or at least try to provide constant feedback to the user.

Delegating Work

If your process depends on third-party resources, threads might really speed up everything.

Let's take the case of a function that indexes files in a folder and pushes the built indexes into a database. Depending on the type of file, the function calls a different external program. One is specialized in PDF and another one in OpenOffice files, for example.

Instead of treating each file in sequence, by calling the right program and then storing the result into the database, your function can set up a thread for each converter and push jobs to them through a queue. The overall time taken by the function will be closer to the time of the slowest converter than to the sum of all the work.

This producer-consumer pattern is often used to provide a shared space for threads, and will be presented in this chapter.

Converter threads can be initialized from the start and the code in charge of pushing the result into the database can also be a thread that consumes available results in the queue.

Multi-User Applications

Threading is also used as a design pattern for multi-user applications. For instance, a web server will push a user request into a new thread and then will idle, waiting for new requests. Having a thread dedicated to each request simplifies a lot of work, but requires the developer to take care of locking the resources. But this is not a problem when all shared data is pushed into a relational database that takes care of the concurrency matters. So threads in a multi-user application act almost like separate processes, and live under the same process only to ease their management at the application level.

For instance, a web server will be able to put all requests in a queue and wait for a thread to be available to send the work to it. Furthermore, threads allow memory sharing, which can speed up some work and reduce the memory load.

Using processes costs more resources since it loads a new interpreter for each one. Sharing data between processes also requires more work.


Consider using threads for any multi-user application.

The Twisted framework, which comes with a callback-based programming philosophy, has ready-to-use patterns for server programming.

Last, eventlet (see http://wiki.secondlife.com/wiki/Eventlet) is another interesting approach, probably simpler than Twisted.

Simple Example

Let's take a small example of an application that recursively scans a directory to process text files. Each text file is opened and processed by an external converter. Using threads will possibly make it faster, because the indexing work can be done simultaneously on several files.

The external converter is a small Python program that does some complex work:


for i in range(100000):
    i = str(i) + "y"*10000

This script, saved as converter.py, takes around 25 kpystones, which is around half a second on a MacBook Intel Core Duo 2.

In a multi-threaded solution, the main program deals with a pool of threads. Each thread takes its work from a queue. In this use case, threads are called workers. The queue is the shared resource where the main program adds files it has found walking in the directory. The workers take the files out of the queue and process them.

The Queue module (which will be renamed queue in Python 3k) from the standard library provides the perfect class for our program: a multi-consumer, multi-producer FIFO queue that internally uses a deque instance and is thread-safe.

So if we want to process files that are in that queue, we just use the get method together with task_done, which lets the Queue instance know that the task has been finished, so that join can work:

>>> from Queue import Queue
>>> import logging
>>> import time 
>>> import subprocess
>>> q = Queue()
>>> def index_file(filename):
...     logging.info('indexing %s' % filename)
...     f = open(filename)
...     try:
...         content = f.read()
...         # external process is here
...         subprocess.call(['converter.py'])
...         time.sleep(0.5)
...     finally:
...         f.close()
>>> def worker():
...     while True:
...         index_file(q.get())
...         q.task_done()

The worker function, which will be called through a thread, takes file names from the queue and processes them by calling the index_file function. The sleep call simulates the processing done by an external program, and makes the thread wait for the results, therefore releasing the GIL.

The main program can then launch workers, scan for files to be processed, and feed the queue with them.

This is done by creating Thread instances with the worker function as their target. The setDaemon call is necessary so that the threads automatically get shut down when the program exits. Otherwise, the program would hang forever waiting for them to exit. This could be managed manually, but it is not useful here.

At the end of the index_files function, the join method will wait for the queue to be fully processed.

Let's create a full script called indexer.py that runs a multithreaded version, and a single thread to index a directory structure containing text files:

from threading import Thread
import os
import subprocess
from Queue import Queue
import logging
import time 
import sys
from pbp.scripts.profiler import profile, print_stats

dirname = os.path.realpath(os.path.dirname(__file__))
CONVERTER = os.path.join(dirname, 'converter.py')

q = Queue()

def index_file(filename):
    f = open(filename)
    try:
        content = f.read()
        # external process is here
        subprocess.call([CONVERTER])
    finally:
        f.close()

def worker():
    while True:
        index_file(q.get())
        q.task_done()

def index_files(files, num_workers):
    for i in range(num_workers):
        t = Thread(target=worker)
        t.setDaemon(True)
        t.start()
    for file in files:
        q.put(file)
    q.join()

def get_text_files(dirname):
    for root, dirs, files in os.walk(dirname):
        for file in files:
            if os.path.splitext(file)[-1] != '.txt':
                continue
            yield os.path.join(root, file)

@profile('process')
def process(dirname, numthreads):
    dirname = os.path.realpath(dirname)
    if numthreads > 1:
        index_files(get_text_files(dirname), numthreads)
    else:
        for f in get_text_files(dirname):
            index_file(f)

if __name__ == '__main__':
    process(sys.argv[1], int(sys.argv[2]))
    print_stats()

This script can be used with any directory as long as it contains text files. It takes two parameters:

  1. The name of the directory

  2. The number of threads

When the number of threads is set to 1, no worker threads are launched and the directory is processed in the main thread.

Let's run it on the same MacBook, on a directory containing 36 files, 19 of which are text files, spread over a structure of 6 directories:

$ python indexer.py zc.buildout-1.0.6-py2.5.egg 1
process : 301.83 kstones, 6.821 seconds, 396 bytes
$ python indexer.py zc.buildout-1.0.6-py2.5.egg 2
process : 155.28 kstones, 3.509 seconds, 2496 bytes
$ python indexer.py zc.buildout-1.0.6-py2.5.egg 4
process : 150.42 kstones, 3.369 seconds, 4584 bytes
$ python indexer.py zc.buildout-1.0.6-py2.5.egg 8
process : 153.96 kstones, 3.418 seconds, 8760 bytes
$ python indexer.py zc.buildout-1.0.6-py2.5.egg 12
process : 154.18 kstones, 3.454 seconds, 12948 bytes
$ python indexer.py zc.buildout-1.0.6-py2.5.egg 24
process : 161.84 kstones, 3.593 seconds, 25524 bytes

It appears that two threads are twice as fast as one thread, and that adding more threads changes nothing. Twenty-four threads are even a bit slower than 12 threads, due to the overhead.

These results may vary depending on the number of files since the disk access is also adding some overhead. But we can safely say that multithreading made the code two times faster, when used on a dual-core.


Multithreading should be used to build responsive interfaces and to delegate some work to third-party applications.

Since memory is shared, the danger of data corruption and race conditions is always present. This danger is greatly mitigated if you use the queue module as the only way for the threads to communicate and pass data to one another.

It's a reasonable policy to never let two threads touch the same mutable data.
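When sharing mutable data really cannot be avoided, guard it with a lock so that updates are serialized. A minimal sketch (the counter is made up for illustration):

```python
import threading

# Without the lock, concurrent "counter += 1" operations can lose
# updates; the Lock makes each read-modify-write atomic.
counter = 0
lock = threading.Lock()

def increment(times):
    global counter
    for _ in range(times):
        with lock:          # only one thread mutates at a time
            counter += 1

threads = [threading.Thread(target=increment, args=(10000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000
```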


Multiprocessing

The GIL limitation makes it impossible to speed up programs that make heavy use of pure Python that is CPU bound. The only way to achieve it is to use separate processes. This is usually done by forking the program at some point. A fork is a system call available through os.fork, which will create a new child process. The two processes then continue the program independently from the point of the fork:

>>> import os
>>> a = []
>>> def some_work():
...     a.append(2)
...     child_pid = os.fork()
...     if child_pid == 0:
...         a.append(3)
...         print "hey, i am the child process"
...         print "my pid is %d" % os.getpid()
...         print str(a)
...     else:
...         a.append(4)
...         print "hey, i am the parent"
...         print "the child is pid %d" % child_pid 
...         print "I am the pid %d " % os.getpid()
...         print str(a)
>>> some_work()
hey, i am the parent
the child is pid 25513
I am the pid 25411 
[2, 4]
hey, i am the child process
my pid is 25513
[2, 3]


Be careful: Running this example at the prompt will lead to a messed-up session.

The memory context is also copied at the fork and then each process deals with its own address space. To communicate, processes need to work with system-wide resources or use low-level tools like signals.

Unfortunately, os.fork is not available under Windows, where a new interpreter needs to be spawned in order to mimic the fork feature. So the code may vary depending on the platform.

When the processes are created, they might need to communicate. If the processes are used to do some isolated job using a relational database (for instance), a shared space is usually the best pick.

Working with signals is painful. Shared memory, pipes, or sockets are simpler to work with. This is usually the case when processes are not one-shot workers, but rather interactive.

There is one library that makes working with processes really easy: pyprocessing.


pyprocessing

pyprocessing provides a portable way to work with processes as if they were threads.

To install it, look for processing with easy_install:

$ easy_install processing

The tool provides a Process class that is very similar to the Thread class, and can be used on any platform:

>>> from processing import Process
>>> import os
>>> def work():
...     print 'hey i am a process, id: %d' % os.getpid()
>>> ps = []
>>> for i in range(4):
...     p = Process(target=work)
...     ps.append(p)
...     p.start()
hey i am a process, id: 27457
hey i am a process, id: 27458
hey i am a process, id: 27460
hey i am a process, id: 27459
>>> ps
[<Process(Process-1, stopped)>, <Process(Process-2, stopped)>, <Process(Process-3, stopped)>, <Process(Process-4, stopped)>]
>>> for p in ps:
...     p.join()

When the processes are created, the memory is forked. The most efficient usage of processes is to let them work on their own after they have been created to avoid overhead, and check on their states from the main thread. Besides the memory state that is copied, the Process class also provides an extra args argument in its constructor so that data can be passed along.

pyprocessing also provides a queue-like class that can be used to share data among processes in a shared memory space fully managed by the package.


processing.sharedctypes also provides functions to share objects from ctypes amongst processes.

See http://pyprocessing.berlios.de/doc/sharedctypes.html.

The previous worker example can, therefore, use processes instead of threads as long as the Queue instance is replaced by a processing.Queue one.

Another nice feature of pyprocessing is the Pool class, which automatically generates and manages a collection of workers. If not provided, the number of workers will default to the number of CPUs available on the computer, given by the cpuCount API:

>>> import processing
>>> import Queue
>>> print 'this machine has %d CPUs' \
...         % processing.cpuCount()
this machine has 2 CPUs
>>> def worker():
...     file = q.get_nowait()
...     return 'worked on ' + file
>>> q = processing.Queue()
>>> pool = processing.Pool()
>>> for i in ('f1', 'f2', 'f3', 'f4', 'f5'):
...     q.put(i)
>>> while True:
...     try:
...         result = pool.apply_async(worker)
...         print result.get(timeout=1)
...     except Queue.Empty:
...         break
worked on f1
worked on f2
worked on f3
worked on f4
worked on f5

The apply_async method calls the worker function through the pool and immediately returns a result object, which can be used by the main process to get back the result. The get method can be used to wait for a result, with an optional timeout.

Last, the Array and Value classes provide shared memory spaces. However, their usage should be avoided where possible, since they introduce bottlenecks in the parallelization and increase code complexity.


The pyprocessing website has a lot of code examples that are worth a read to reuse this package in your programs. By the time this book was written, it had been proposed on python-dev that this package become part of the standard library, and this was made effective in the 2.6 series, under the name multiprocessing. So pyprocessing is definitely the recommended multiprocessing tool.


Caching

The result of a function or a method that is expensive to run can be cached, as long as:

  • The function is deterministic and results have the same value every time, given the same input.

  • The return value of the function continues to be useful and valid for some period of time (non-deterministic).


A deterministic function always returns the same result for the same set of arguments, whereas a non-deterministic one returns results that may vary.

Good candidates for caching are usually:

  • Results from callables that query databases

  • Results from callables that render static values, like file content, web requests, or PDF rendering

  • Results from deterministic callables that perform complex calculations

  • Global mappings that keep track of values with expiration times, like web session objects

  • Some data that needs to be accessed often and quickly

Deterministic Caching

A simple example of deterministic caching is a function that calculates a square. Keeping track of the results allows you to speed it up:

>>> import random, timeit
>>> from pbp.scripts.profiler import profile, print_stats
>>> cache = {}
>>> def square(n):
...     return n * n
>>> def cached_factory(n):
...     if n not in cache:
...         cache[n] = square(n)
...     return cache[n]
>>> @profile('not cached')
... def factory_calls():
...     for i in xrange(100):
...         square(random.randint(1, 10))
>>> @profile('cached')
... def cached_factory_calls():
...     n = [random.randint(1, 10) for i in range(100)]
...     ns = [cached_factory(i) for i in n]
>>> factory_calls(); cached_factory_calls();
>>> print_stats()
cached : 6.07 kstones, 0.142 seconds, 480 bytes
not cached : 20.51 kstones, 0.340 seconds, 396 bytes

Of course, such caching is efficient as long as the time taken to interact with the cache is less than the time taken by the function. If it's faster to simply re-calculate the value, by all means do so! Caches can also be dangerous if used incorrectly. For instance, you might end up using stale data, or gobbling memory with an ever-growing cache.

That's why setting up a cache has to be done only if it's worth it; setting it up properly has a cost.

In the preceding example, we used an argument to the function as the key for the cache. This only works if the arguments are hashable. For instance, this works with int and str, but not with dict or list. When arguments become complex and are not necessarily hashable, they have to be manually processed and translated into a unique key for the cache:

>>> def cache_me(a, b, c, d):
...     # we don't care about d for the key 
...     key = 'cache_me:%s:%s:%s' % (a, b, c)
...     if key not in cache:
...         print 'caching'
...         cache[key] = complex_calculation(a, b, c, d)
...     print d    # d is just used for display
...     return cache[key]

It is possible, of course, to automatically create the key by looping over each argument. But there are many special cases where we will need to calculate the key manually, as in the example above.

This behavior is called memoizing and can be turned into a simple decorator:

>>> cache = {}
>>> def get_key(function, *args, **kw):
...     key = '%s.%s:' % (function.__module__,
...                       function.__name__)
...     hash_args = [str(arg) for arg in args]
...     # of course, will work only if v is hashable
...     hash_kw = ['%s:%s' % (k, hash(v)) 
...                for k, v in kw.items()]
...     return '%s::%s::%s' % (key, hash_args, hash_kw)    
>>> def memoize(get_key=get_key, cache=cache):
...     def _memoize(function):
...         def __memoize(*args, **kw):
...             key = get_key(function, *args, **kw)
...             try:
...                 return cache[key] 
...             except KeyError:
...                 cache[key] = function(*args, **kw)
...                 return cache[key]
...         return __memoize
...     return _memoize
>>> @memoize()
... def factory(n):
...     return n * n
>>> factory(4)
16
>>> factory(4)
16
>>> factory(3)
9
>>> cache
{"__main__.factory:::['3']::[]": 9,
 "__main__.factory:::['4']::[]": 16}

The decorator uses a callable to calculate a key, and the default get_key does argument introspection. It will raise an exception if a keyword argument value is not hashable. Nevertheless, this function can be adapted to special cases. The mapping that stores values is also made configurable.
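In more recent Python versions, this memoizing pattern is available out of the box as functools.lru_cache (added in Python 3.2), which also bounds the cache size; the maxsize value below is just an illustrative choice:

```python
from functools import lru_cache

@lru_cache(maxsize=128)        # keep at most 128 distinct results
def factory(n):
    return n * n

factory(4)                     # computed and stored
factory(4)                     # served from the cache
info = factory.cache_info()    # hit/miss statistics for the cache
```

The cache_info() counters make it easy to check that the cache is actually paying off before keeping it in place.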

A common practice is to calculate the MD5 hash (or a SHA hash) of the arguments. But beware that such a hash has a real cost: the function itself needs to be slower than the key calculation for the cache to be useful. For our factory function, this is barely the case:

>>> import md5 
>>> def get_key(function_called, n):
...     return md5.md5(str(n)).hexdigest()
>>> @memoize(get_key)
... def cached_factory(n):
...     return n * n
>>> factory_calls(); cached_factory_calls();
>>> print_stats()
cached : 6.96 kstones, 0.143 seconds, 1068 bytes
not cached : 7.61 kstones, 0.157 seconds, 552 bytes
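The md5 module used above was later deprecated in favor of hashlib. In current Python, a hash-based key function can be sketched as follows; the exact key layout (module, name, repr of the arguments) is an illustrative assumption, not a fixed convention:

```python
import hashlib

def get_key(function, *args, **kw):
    # build a stable string form of the call; repr() of the arguments
    # must be unambiguous for this to be a safe cache key
    raw = '%s.%s:%r:%r' % (function.__module__, function.__name__,
                           args, sorted(kw.items()))
    # hexdigest() gives a fixed-length, hashable key string
    return hashlib.md5(raw.encode('utf-8')).hexdigest()
```

The same function called with the same arguments always yields the same 32-character key, while different arguments yield different keys.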

Non-Deterministic Caching

Non-deterministic functions are functions that may produce a different output even when given the same input. For example, database queries are sometimes cached for a given amount of time. For instance, if an SQL table holds information on users, caching the queries for all functions that display user data is a good practice, as long as this table is not updated often. Another example is a server configuration file that does not change after the server has started: putting those values in a cache is a good practice. Many servers can be sent a signal telling them to clear their cache and re-read their configuration files. Last, a web server will probably cache complete pages and the logo used on all page headers for at least a few hours, using a cache server such as Squid. In the logo case, the client-side browser also maintains a local cache, but checks with Squid to know whether the logo has been modified.


The cache duration is set according to the average update time of the data.

The memoize cache can have an extra age argument in order to invalidate cached values that are too old:

def memoize(get_key=get_key, storage=cache, age=0):
    def _memoize(function):
        def __memoize(*args, **kw):
            key = get_key(function, *args, **kw)
            try:
                value_age, value = storage[key]
                deprecated = (age != 0 and
                              (value_age + age) < time.time())
            except KeyError:
                deprecated = True
            if not deprecated:
                return value

            storage[key] = time.time(), function(*args, **kw)
            return storage[key][1]
        return __memoize
    return _memoize
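As a check, here is a self-contained, runnable sketch of the age-based memoize in current Python syntax. The repr-based key function and the calls list (used only to count real computations) are illustrative additions:

```python
import time

cache = {}

def get_key(function, *args, **kw):
    # illustrative key: repr of the call (assumes stable argument reprs)
    return '%s.%s:%r:%r' % (function.__module__, function.__name__,
                            args, sorted(kw.items()))

def memoize(get_key=get_key, storage=cache, age=0):
    def _memoize(function):
        def __memoize(*args, **kw):
            key = get_key(function, *args, **kw)
            try:
                value_age, value = storage[key]
                # age == 0 means "never expires"
                deprecated = age != 0 and value_age + age < time.time()
            except KeyError:
                deprecated = True
            if not deprecated:
                return value
            # store (timestamp, value) so the age can be checked later
            storage[key] = time.time(), function(*args, **kw)
            return storage[key][1]
        return __memoize
    return _memoize

calls = []                     # records actual computations

@memoize(age=30)
def compute(n):
    calls.append(n)
    return n * n
```

Calling compute(4) twice within 30 seconds runs the body only once; the second call is served from the cache.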

Let's say we have a function that displays the current time, dropping the seconds. We can cache it with a 30-second age to get a reasonably accurate result:

>>> from datetime import datetime
>>> @memoize(age=30)
... def what_time():
...     return datetime.now().strftime('%H:%M')
>>> what_time()
'19:36'
>>> cache
{'__main__.what_time:::[]::[]': (1212168961.613435, '19:36')}

Of course, the cache invalidation could be done asynchronously by another function that removes expired keys to speed up the memoize function. This is common for web applications that need to occasionally expire old sessions.
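A sketch of such a clean-up function, assuming the (timestamp, value) storage layout used by the age-based memoize above:

```python
import time

def purge_expired(storage, age):
    """Remove (timestamp, value) entries older than `age` seconds."""
    now = time.time()
    for key in list(storage):      # list() so we can delete while iterating
        value_age, _ = storage[key]
        if value_age + age < now:
            del storage[key]

# demonstration on a hand-built cache
cache = {'old': (time.time() - 100, 1), 'fresh': (time.time(), 2)}
purge_expired(cache, 30)
```

Run periodically (for instance from a background thread or a cron job), this keeps the memoize lookups fast by keeping the mapping small.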

Pro-Active Caching

There are a lot of caching strategies to speed up an application. For instance, if an intranet gets a high load every morning from its users who read the news posted the previous afternoon, it makes sense to cache the results of rendering the articles so that they don't have to be rendered for every new web request. A good caching strategy would be to mimic these users once at night time through a cron job, to fill the cache with data with a maximum age of 12 hours.


If you want to be serious about caching, Memcached (see http://www.danga.com/memcached) is the tool you want to use. This cache server is used by big applications such as Facebook or Wikipedia to scale their websites. Besides simple caching features, it has clustering capabilities that make it possible to set up a very efficient distributed cache system in no time.

The tool is Unix-based, but can be driven from any platform and from many languages. The Python client is really simple, and our memoize function can be adapted to work with it very easily.


Beaker is a WSGI implementation of a caching middleware using Memcached. See http://pypi.python.org/pypi/Beaker.

Caching can save your day, but it should not be used to hide the slowness of a badly designed or poorly implemented function.

It is often safer to simply improve the code so that you don't have to worry about stale caches, infinite memory growth, or a bad design. Caching has to be used only on code that cannot be optimized any further.

Cache size should always be controlled by a maximum age and/or by a maximum size.
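One simple way to bound a cache by size is to evict the oldest entry once a maximum is reached. This sketch builds on OrderedDict; the class name and max_size default are illustrative choices:

```python
from collections import OrderedDict

class BoundedCache(OrderedDict):
    """A cache mapping that evicts its oldest entry past max_size."""

    def __init__(self, max_size=100):
        super().__init__()
        self.max_size = max_size

    def __setitem__(self, key, value):
        if key not in self and len(self) >= self.max_size:
            self.popitem(last=False)   # drop the oldest inserted key
        super().__setitem__(key, value)

# demonstration: the third insert pushes out the first
cache = BoundedCache(max_size=2)
for key, value in (('a', 1), ('b', 2), ('c', 3)):
    cache[key] = value
```

Handing such a mapping to the memoize decorator via its storage argument caps memory use without touching the decorator itself.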

Also, Memcached should be used when efficient, distributed caching is needed.


Summary

In this chapter we have learned:

  • How to measure the complexity of the code, and some approaches to reduce it

  • How threads work in Python and what they are good for

  • A simple way to use processes

  • A bit of caching theory and how to use it

The next chapter is dedicated to design patterns.