12. Optimization: General Principles and Profiling Techniques – Expert Python Programming

Chapter 12. Optimization: General Principles and Profiling Techniques


"Premature optimization is the root of all evil in programming."

 --Donald Knuth

This chapter is about optimization and provides a set of general principles and profiling techniques. It gives the three rules of optimization every developer should be aware of, and provides guidelines on optimization. Last, it focuses on how to find bottlenecks.

The Three Rules of Optimization

Optimization has a price, no matter what the results are. When a piece of code works, it might be better (sometimes) to leave it alone than to try making it faster at all costs. They are a few rules to keep in mind when doing optimization:

  • Make it work first.

  • Work from the user's point of view.

  • Keep the code readable.

Make It Work First

A very common mistake is to try to optimize the code while you are writing it. This is impossible because the real bottlenecks are often located where you would have never thought they would be.

An application is composed of very complex interactions, and it is impossible to get a full picture of what is going on before it is really used.

Of course, this is not a reason to write a function or a method without trying to make it as fast as possible. You should be careful to lower its complexity as much as possible and avoid useless repetition. But the first goal is to make it work. This goal should not be hindered by optimization matters.

For line-level code, the Python philosophy is that there's one and preferably only one way to do it. So as long as you stick with a Pythonic syntax, described in Chapters 2 and 3, your code should be fine. Often, writing less code is better and faster than writing more code.

Don't do any of these things until you have gotten your code working and you are ready to profile:

  • Start to write a global dictionary to cache data for a function

  • Think about externalizing a part of the code in C or hybrid languages such as Pyrex

  • Look for external libraries to do some basic calculation

For very specialized programs, such as scientific calculation programs or games, the usage of specialized libraries and externalization might be unavoidable from the beginning. On the other hand using libraries like Numeric might ease the development of specific features and produce a simpler and faster code at the end. Furthermore, you should not rewrite a function if there is a good library that does it for you.

For instance Soya 3D, which is a game engine on the top of OpenGL (see http://home.gna.org/oomadness/en/soya3d/index.html), uses C and Pyrex for fast matrix operations when rendering real-time 3D.


Optimization is carried out on programs that already work.

"Make it work, then make it right, then make it fast"Kent Beck

Work from the User's Point of View

I have seen teams working on optimizing the startup time of an application server that worked really fine when it was started. Once they finished speeding it, they promoted that work to their customers. They were a bit frustrated to notice that the customers didn't really care about it. This was because the speed-up work was not motivated by the user feedback, but by the developer's point of view. The people who built the system were launching the server everyday. So the startup time mattered to them but not the customer.

While making a program start faster is a good thing from an absolute point of view, teams should be careful to prioritize the optimization work and ask themselves the following questions:

  • Have I been asked to make it faster?

  • Who finds the program slow?

  • Is it really slow, or acceptable?

  • How much will it cost to make it go faster? Is it worth it? What parts need to be fast?

Remember that optimization has a cost, and that the developer's point of view is meaningless to customers, unless you are writing a framework or a library and the customer is a developer too.


Optimization is not a game. It should be done only when necessary.

Keep the Code Readable(and thus maintainable)

Even if Python tries to make the common code patterns the fastest, optimization work might obfuscate your code and make it really hard to read. There's a balance to keep between producing readable, and therefore maintainable code, and defacing it in order to make it faster.

When you have reached 90% of your optimization objectives, and the 10% left to be done makes your code completely unreadable, it might be a good idea to stop the work there or to look for other solutions.


Optimization should not make your code unreadable. If it happens, you should look for alternative solutions such as externalization or redesign. However, there is always a good compromise between readability and speed.

Optimization Strategy

Let's say your program has a real speed problem you need to resolve. Do not try to guess how to make it faster. Bottlenecks are often hard to find by looking at the code, and a set of tools is needed to find the real problem.

A good optimization strategy can start with three steps:

  • Find another culprit: Make sure a third-party server or resource is not faulty.

  • Scale the hardware: Make sure the resources are sufficient.

  • Write a speed test: Create a scenario with speed objectives.

Find Another Culprit

Often, a performance problem occurs at production level and the customer alerts you that it is not working as it used to when the software was being tested. Performance problems might occur because the application was not planned to work in the real world with a high number of users and an increase of data size.

But if the application interacts with other applications, the first thing to do is to check if the bottlenecks are located on those interactions. For instance, a database server or an LDAP server might be responsible for extra overhead and make everything slower.

The physical links between applications should also be considered: Maybe the network link between your application server and another server in the intranet is getting really slow due to a misconfiguration, or a paranoid anti-virus that scans all TCP packets and slows everything down.

The design documentation should provide a diagram of all interactions and the nature of each link to get an overall picture of the system, and get help when trying to resolve a speed problem.


If your application uses third-party servers of resources, every interaction should be audited to make sure the bottleneck is not located there.

Scale the Hardware

When there is no more volatile memory available, the system starts to use the hard disk to store data. This is swapping.

This involves a lot of overhead and the performances drop drastically. From a user's point of view, the system is considered dead at this stage. So, it is important to scale the hardware to prevent this.

While having enough memory on a system is important, it is also important to make sure that the applications are not acting crazy and eating too much memory. For instance, if a program works on big video files that can weigh in at several hundreds of megabytes, it should not load them entirely in memory, but rather work on chunks or use disk streams.

Disk usage is also important. A full partition might really slow down your application, if the I/O errors are hidden in the code that tries to write repeatedly on the disk. Furthermore, even if the code only tries the write once, the hardware and OS might try to write multiple times.


Munin is a great system-monitoring tool that you can use to get a snapshot of the system health: http://munin.projects.linpro.no

Make sure the system is healthy and fits the application need. But make sure the application is not consuming memory and disk space like an ogre.

Write a Speed Test

When starting with optimization work, it is important to work on tests' side rather than running some manual tests continually. A good practice is to dedicate a test module in the application, where the sequence of calls that are to be optimized is written. Having this scenario will help you to track your progress while you are optimizing the application.

You can even write a few assertions where you set some speed objectives. To prevent speed regression, these tests can be left after the code has been optimized:

>>> def test_speed():
...    import time
...    start = time.time()
...    the_code()
...    end = time.time() - start
...    assert end < 10, \
...    "sorry this code should not take 10 seconds !"


Measuring execution speed depends on the power of the CPU used. But we will see in the next section how to write universal duration measures.

Finding Bottlenecks

Finding bottlenecks is done by:

  • Profiling CPU usage

  • Profiling memory usage

  • Profiling network usage

Profiling CPU Usage

The first source of bottlenecks is your code. The standard library provides all the tools needed to perform code profiling. They are based on a deterministic approach.


A deterministic profiler measures the time spent in each function by adding a timer at the lowest level. This introduces a bit of overhead, but provides a good idea on where the time is consumed. A statistical profiler, on the other hand, samples the instruction pointer usage and does not instrument the code. The latter is less accurate, but allows running the target program at full speed.

There are two ways to profile the code:

  • Macro-profiling: Profiles the whole program while it is being used, and generates statistics.

  • Micro-profiling: Measures a precise part of the program by instrumenting it manually.


Macro-profiling is done by running the application in a special mode, where the interpreter is instrumented to collect statistics on the code usage. Python provides several tools for this:

  • profile: A pure Python implementation

  • cProfile: A C implementation that provides the same interface as that of the profile tool, but has less overhead

  • hotshot: Another C implementation, which is probably going to be removed from the standard library

Unless your program runs with a Python version below 2.5, the recommended profiler is cProfile.

The following is a myapp.py module with a main function:

import time
def medium():
def light():
def heavy():
   for i in range(100):
def main():
   for i in range(2):
if __name__ == '__main__':

The module can be called directly from the prompt and the results are summarized here:

$ python -m cProfile myapp.py
         1212 function calls in 10.120 CPU seconds
   Ordered by: standard name
   ncalls  tottime  cumtime  percall  file
        1    0.000   10.117   10.117  myapp.py:16(main)
      400    0.004    4.077    0.010  myapp.py:3(lighter)
      200    0.002    2.035    0.010  myapp.py:6(light)
        2    0.005   10.117    5.058  myapp.py:9(heavy)
        3    0.000    0.000    0.000  {range}
      602   10.106   10.106    0.017  {time.sleep}

The statistics provided are a print view of a statistic object filled by the profiler. A manual call of the tool can be:

>>> from myapp import main
>>> import cProfile
>>> profiler = cProfile.Profile()
>>> profiler.runcall(main)
>>> profiler.print_stats()
         1209 function calls in 10.140 CPU seconds
   Ordered by: standard name
   ncalls  tottime  cumtime  percall  file
        1    0.000   10.140   10.140  myapp.py:16(main)
      400    0.005    4.093    0.010  myapp.py:3(medium)
      200    0.002    2.042    0.010  myapp.py:6(light)
        2    0.005   10.140    5.070  myapp.py:9(heavy)
        3    0.000    0.000    0.000  {range}
      602   10.128   10.128    0.017  {time.sleep}

The statistics can also be saved in a file and then read by the pstats module. This module provides a class that knows how to handle profile files, and gives a few helpers to play with them like sort methods:

>>> cProfile.run('main()', 'myapp.stats')
>>> import pstats
>>> p = pstats.Stats('myapp.stats')
>>> p.total_calls
>>> p.sort_stats('time').print_stats(3)
Thu Jun 19 23:56:08 2008    myapp.stats
         1210 function calls in 10.240 CPU seconds
   Ordered by: internal time
   List reduced from 8 to 3 due to restriction <3>
   ncalls  tottime  cumtime  percall  filename:lineno(function)
      602   10.231   10.231    0.017  {time.sleep}
        2    0.004   10.240    5.120  myapp.py:9(heavy)
      400    0.004    4.159    0.010  myapp.py:3(medium)

From there, you can browse the code by printing out the callers and callees for each function:

>>> p.print_callees('medium')
   Ordered by: internal time
   List reduced from 8 to 1 due to restriction <'medium'>
Function                  called...
                        ncalls tottime cumtime
myapp.py:3(lighter) ->  400   4.155   4.155 {time.sleep}
>>> p.print_callers('light')
   Ordered by: internal time
   List reduced from 8 to 2 due to restriction <'light'>
Function                 was called by...
                        ncalls tottime cumtime
myapp.py:3(medium) <-   400   0.004   4.159 myapp.py:9(heavy)
myapp.py:6(light) <-    200   0.002   2.073 myapp.py:9(heavy)

Being able to sort the output allows working on different views to find the bottlenecks. For instance:

  • When the number of calls is really high and makes most of the global time, the function or method is probably in a loop and an optimization might be tried to get it out.

  • When one function is taking very long, a cache might be a good option if possible.

Another great way to visualize bottlenecks from profiling data is to transform them into diagrams. Gprof2Dot (http://code.google.com/p/jrfonseca/wiki/Gprof2Dot) can be used to turn profiler data into a dot graph. You can download this simple script from http://jrfonseca.googlecode.com/svn/trunk/gprof2dot/gprof2dot.py and use it on the stats as long as Graphviz (see http://www.graphviz.org/) is installed in your box:

$ wget http://jrfonseca.googlecode.com/svn/trunk/gprof2dot/ gprof2dot.py
$ python2.5 gprof2dot.py -f pstats myapp.stats | dot -Tpng -o output.png


KcacheGrind is also a great vizualization tool to display profile data.

See http://kcachegrind.sourceforge.net/cgi-bin/show.cgi.

Macro-profiling is a good way to detect the function that has a problem, or at least its neighborhood. When you have found it, you can jump to micro-profiling.


When the slow function is found, it is sometimes necessary to do more profiling work that tests just a part of the program. This is done by manually instrumenting a part of the code in a speed test.

For instance, the cProfile module can be used from a decorator:

>>> import tempfile, os, cProfile, pstats
>>> def profile(column='time', list=5):
...    def _profile(function):
...       def __profile(*args, **kw):
...          s = tempfile.mktemp()
...          profiler = cProfile.Profile()
...          profiler.runcall(function, *args, **kw)
...          profiler.dump_stats(s)
...          p = pstats.Stats(s)
...          p.sort_stats(column).print_stats(list)
...       return __profile
...    return _profile
>>> from myapp import main
>>> @profile('time', 6)
... def main_profiled():
...    return main()
>>> main_profiled()
Fri Jun 20 00:30:36 2008 ...
            1210 function calls in 10.129 CPU seconds
   Ordered by: internal time
   List reduced from 8 to 6 due to restriction <6>
   ncalls tottime cumtime percall filename:lineno(function)
      602  10.118  10.118   0.017 {time.sleep}
        2   0.005  10.129   5.065 myapp.py:9(heavy)
      400   0.004   4.080   0.010 myapp.py:3(lighter)
      200   0.002   2.044   0.010 myapp.py:6(light)
        1   0.000  10.129  10.129 myapp.py:16(main)
        3   0.000   0.000   0.000 {range}
>>> from myapp import lighter
>>> p = profile()(lighter)
>>> p()
Fri Jun 20 00:32:40 2008 /var/folders/31/31iTrMYWHny8cxfjH5VuTk+++TI/-Tmp-/tmpQjstAG
            3 function calls in 0.010 CPU seconds
   Ordered by: internal time
   ncalls tottime cumtime percall filename:lineno(function)
        1   0.010   0.010   0.010 {time.sleep}
        1   0.000   0.010   0.010 myapp.py:3(lighter)

This approach allows testing parts of the application and sharpens the statistics output.

But at this stage, having a list of callees is probably not interesting, as the function has already been pointed out as the one to optimize. The only interesting information is to know how fast it is, and then enhance it.

timeit fits this need better by providing a simple way to measure the execution time of a small code snippet, with the best underlying timer the host system provides (time.time or time.clock ):

>>> from myapp import light
>>> import timeit
>>> t = timeit.Timer('main()')
>>> t.timeit(number=5)
10000000 loops, best of 3: 0.0269 usec per loop
10000000 loops, best of 3: 0.0268 usec per loop
10000000 loops, best of 3: 0.0269 usec per loop
10000000 loops, best of 3: 0.0268 usec per loop
10000000 loops, best of 3: 0.0269 usec per loop

The module allows to repeat the call, and is oriented to try out isolated code snippets. This is very useful outside the application context, in a prompt for instance, but is not really handy to use within an existing application.


A deterministic profiler will provide results depending on what the computer is doing, and so results may vary each time. Repeating the same test multiple times and making averages provides more accurate results. Furthermore, some computers have special CPU features, such as SpeedStep, that might change the results if the computer is idling when the test is launched (see http://en.wikipedia.org/wiki/SpeedStep). So continually repeating the test is a good practice for small code snippets. There are also various caches to keep in mind such as DNS caches or CPU caches.

A decorator similar to the one above is an easier way to time a part of an application. This decorator collects durations, though :

>>> import time
>>> import sys
>>> if sys.platform == 'win32':   # same condition in timeit
...    timer = time.clock
... else:
...    timer = time.time
>>> stats = {}
>>> def duration(name='stats', stats=stats):
...    def _duration(function):
...       def __duration(*args, **kw):
...          start_time = timer()
...          try:
...             return function(*args, **kw)
...          finally:
...             stats[name] = timer() - start_time
...       return __duration
...    return _duration
>>> from myapp import heavy
>>> heavy = duration('this_func_is')(heavy)
>>> heavy()
>>> print stats['this_func_is']

The global stats dictionary is filled by the decorator when the code is executed, and can be read after the function has finished its work.

Using such a decorator allows to add inline instrumentation to the application code without disrupting the application itself.

>>> stats = {}
>>> from myapp import light
>>> import myapp
>>> myapp.light = duration('myapp.light')(myapp.light)
>>> myapp.main()
>>> stats
{'myapp.light': 0.05014801025390625}

This can be done in the context of speed tests.

Measuring Pystones

When measuring execution time, the result depends on the computer hardware. To be able to produce a universal measure, the simplest way is to benchmark the speed of a fixed sequence of code and calculate a ratio out of it. From there, the time taken by a function can be translated to a universal value that can be compared on any computer.


A lot of benchmarking tools are available. For instance, Whetstone was created in 1972 and back then, it provided a computer performance analyzer in Algol 60 (see http://en.wikipedia.org/wiki/Whetstone_%28benchmark%29). It is used to measure the Millions Of Whetstone Instructions Per Second (MWIPS). A table of results is maintained at http://freespace.virgin.net/roy.longbottom/whetstone%20results.htm.

Python provides a benchmark utility in its test package that measures the duration of a sequence of well-chosen operations. The result is a number of pystones per second the computer is able to perform, and the time used to perform the benchmark, which is generally around one second on modern hardware:

>>> from test import pystone
>>> pystone.pystones()
(1.0500000000000007, 47619.047619047589)

The rate can be used to translate a profile duration into a number of pystones:

>>> from test import pystone
>>> benchtime, pystones = pystone.pystones()
>>> def seconds_to_kpystones(seconds):
...    return (pystones*seconds) / 1000
>>> seconds_to_kpystones(0.03)
>>> seconds_to_kpystones(1)
>>> seconds_to_kpystones(2)

The seconds_to_kpystones returns the number of kilo pystones. This conversion can be included in the duration decorator to yield a value in stones.

>>> def duration(name='stats', stats=stats):
...    def _duration(function):
...       def __duration(*args, **kw):
...          start_time = timer()
...          try:
...             return function(*args, **kw)
...          finally:
...             total = timer() - start_time
...             kstones = seconds_to_kpystones(total)
...             stats[name] = total, kstones
...       return __duration
...    return _duration
>>> @duration()
... def some_code():
...    time.sleep(0.5)
>>> some_code()
>>> stats
{'stats': (0.50012803077697754, 24.278059746455238)}

Having pystones will allow using this decorator in tests so that you can set assertions on execution times. These tests will be runnable on any computer and will allow developers to prevent speed regressions. When a part of the application has been optimized, they will be able to set its maximum execution time in tests and make sure it won't be breached by further changes.

Profiling Memory Usage

Another problems is memory consumption. If a program that runs starts to eat so much memory that the system swaps, there is probably a place in your application where too many objects are created. This is often easy to detect through classical profiling because consuming enough memory to make a system swap involves a lot of CPU work that can be detected. But sometimes it is not obvious, and the memory usage has to be profiled.

How Python Deals with Memory

Memory usage is probably the hardest thing to profile in Python when you use the CPython implementation. While languages such as C allow you to get the memory size of any element, Python will never let you know how much a given object consumes. This is due to the dynamic nature of the language, and the fact that there is an automatic management of object instantiation: the garbage collector. For instance, two variables that point to the same string value might or might not point to the same string-object instance in memory.

The approach of such a memory manager is roughly based on a simple statement: If a given object is not referenced anymore, it is removed. In other words, all local references in a function are removed after the interpreter:

  • Leaves the function

  • Makes sure the object is not being used anymore.


Under normal conditions, the collector will do a nice job. But a del call can be used to help the garbage collector by manually removing the references to an object manually.

So objects that remain in memory are:

  • Global objects

  • Objects that are still referenced in some way

Be careful of the argument inbound outbound edge-case. If an object is created within the arguments, the argument reference will still be alive if the function returns the object. This can lead to unexpected results if it is used as a default value:

>>> def my_function(argument={}): # bad practice
...    if '1' in argument:
...       argument['1'] = 2
...    argument['3'] = 4
...    return argument
>>> my_function()
{'3': 4}
>>> res = my_function()
>>> res['4'] = 'I am still alive!'
>>> print my_function()
{'3': 4, '4': 'I am still alive!'}

That is why non-mutable objects should always be used, like this:

>>> def my_function(argument=None): # better practice
...    if argument is None:
...       argument = {} # a fresh dict is created everytime
...    if '1' in argument:
...       argument['1'] = 2
...       argument['3'] = 4
...       return argument
>>> my_function()
{'3': 4}
>>> res = my_function()
>>> res['4'] = 'I am still alive!'
>>> print my_function()
{'3': 4}

Garbage collection is handy and avoids keeping track of objects and therefore we have to manually destroy them. Although this introduces another problem, since developers never clean up instances in memory, it might grow in an uncontrolled way if developers don't pay attention to the way they use data structure.

Usual memory eaters are:

  • Caches that grow uncontrolled

  • Object factories that register instances globally and do not keep track of their usage, such as a database connector creator, used on the fly every time a query is called

  • Threads that are not properly finished

  • Objects with a __del__ method and involved in a cycle are also memory eaters. The Python garbage collector will not break the cycle since it cannot be sure which object should be deleted first. Hence, you will leak memory. Using this method is a bad idea in any case.

Profiling Memory

Knowing how many objects are controlled by the garbage collector and their real size is a bit tricky. For instance, knowing how much a given object weighs in bytes would involve crawling down all its attributes, dealing with cross-references, and then summing up everything. It's a pretty difficult problem if you consider the way objects tend to refer to each other. The gc module does not provide high-level functions for this, and would require Python to be compiled in debug mode to have a full set of information.

Often, programmers just ask the system about the memory usage of their application after and before a given operation has been performed. But this measure is approximative and depends a lot on how the memory is managed at system level. Using the top command under Linux or the Task Manager under Windows, for instance, makes it possible to detect memory problems when they are obvious. But it requires painful work on the code side to track down the faulty code block.

Fortunately, there are a few tools available to make memory snapshots, and calculate the number and size of loaded objects. But let's keep in mind that Python does not release memory easily, and prefers to hold on to it in case it is needed again.

A Guppy-PE Primer

Guppy-PE ( http://guppy-pe.sourceforge.net) is a framework that provides a memory profiler called Heap, among other features.

Guppy is easy_install-able:

$ sudo easy_install guppy

From there, an hpy function is available under the guppy namespace. It returns an object that knows how to display a snapshot of the memory:

>>> from guppy import hpy
>>> profiler = hpy()
>>> profiler.heap()
Partition of a set of 22723 objects. Total size = 1660748 bytes.
  Index Count  %   Size  % Cumulative % Kind (class / dict of class)
      0  9948 44 775680 47     775680 47 str
      1  5162 23 214260 13     989940 60 tuple
      2  1404  6  95472  6    1085412 65 types.CodeType
      3    61  0  91484  6    1176896 71 dict of module
      4   152  1  84064  5    1260960 76 dict of type
      5  1333  6  79980  5    1340940 81 function
      6   168  1  72620  4    1413560 85 type
      7   119  1  68884  4    1482444 89 dict of class
      8    76  0  51728  3    1534172 92 dict (no owner)
      9   959  4  38360  2    1572532 95 __builtin__.wrapper_descriptor
<43 more rows. Type e.g. '_.more' to view.>

The output provides memory usage ordered by size, and grouped by the type of objects. The object has many attributes to define how the list is displayed and is used for memory the way pstats is used with time.

We can measure how much a specific object weighs using the iso method:

>>> import random
>>> def eat_memory():
...    memory = []
...    def _get_char():
...       return chr(random.randint(97, 122))
...    for i in range(100):
...       size = random.randint(20, 150)
...       data = [_get_char() for i in xrange(size)]
...       memory.append(''.join(data))
...    return '\n'.join(memory)
>>> profiler.iso(eat_memory())
Partition of a set of 1 object. Total size = 8840 bytes.
  Index Count  % Size   %  Cumulative % Kind
      0    1 100 8840 100       8840 100 str
>>> profiler.iso(eat_memory()+eat_memory())
Partition of a set of 1 object. Total size = 17564 bytes.
  Index Count   %  Size   %  Cumulative  %  Kind
      0     1 100 17564 100       17564 100 str

A setrelheap method is also available to reset the profiler, so that it can be used to track down the memory usage of an isolated block of code using heap. But this initialization is not perfect because the context will always have a few elements loaded. That is why the size of a freshly-initialized hpy instance varies a bit and is not 0:

>>> g = hpy()
>>> g.setrelheap()
>>> g.heap().size
>>> g.heap().size
>>> g.heap().size

The heap method returns a Usage object that provides information such as:

  • size: The total consumed size in bytes

  • get_rp(): A nice traversal method that will tell where the object code is located in the modules.

  • Many ways to sort the results, such as bytype

Tracking Memory Usage with Heapy

Heapy is not easy to use and needs a bit of practice. This section will just discuss some simple functions that can be used to provide memory usage information. From there, tracking down the code has to be done by browsing objects through the Heapy API.


The pkgcore project's website has a nice example on how Heapy was used to track down memory usage. See http://www.pkgcore.org/trac/pkgcore/doc/dev-notes/heapy.rst

It is a good starting point when playing with the tool.

The duration decorator (proposed earlier) can be changed to provide the size of the memory used by a function in bytes. This information can be retrieved with the duration and the pystone values, in a dictionary. The full decorator put in a profiler.py module looks like this:

import time
import sys
from test import pystone
from guppy import hpy
benchtime, stones = pystone.pystones()
def secs_to_kstones(seconds):
   return (stones*seconds) / 1000
stats = {}
if sys.platform == 'win32':
   timer = time.clock
   timer = time.time
def profile(name='stats', stats=stats):
   """Calculates a duration and a memory size."""
   def _profile(function):
      def __profile(*args, **kw):
         start_time = timer()
         profiler = hpy()
         # 12 corresponds to the initial memory size
         # after a setref call
         start = profiler.heap().size + 12
            return function(*args, **kw)
            total = timer() - start_time
            kstones = secs_to_kstones(total)
            memory = profiler.heap().size - start
            stats[name] = {'time': total,
                           'stones': kstones,
                           'memory': profiler.heap().size}
      return __profile
   return _profile

The start variable is used to make sure that the calculated memory does not include the memory consumed by Heapy when the setrelheap call is made.

Using the decorator with eat_memory will provide, besides the duration in seconds and pystones, the memory consumed by the function:

>>> import profiler
>>> import random
>>> eat_it = profiler.profile('you bad boy!')(eat_memory)
>>> please = eat_it()
>>> profiler.stats
{'you bad boy!': {'stones': 14.306935999128555, 'memory': 8680, 'time': 0.30902981758117676}}

Of course, running it several times will lead to different memory sizes when non-mutable objects are involved. But it is still a good indicator.

Another interesting usage of Heapy is to check if a function does not free the used memory, for instance if it produces cached or registered elements. This can be done by repeating function calls and observing whether the memory used grows:

A simple function for this can be:

>>> def memory_grow(function, *args, **kw):
...    """checks if a function makes the memory grows"""
...    profiler = hpy()
...    profiler.setref()
...    # 12 corresponds to the initial memory size
...    # after a setref call
...    start = profiler.heap().size + 12
...    for i in range(REPEAT):
...       function(*args, **kw)
...    return profiler.heap().size - start
>>> def stable():
...    return "some"*10000
>>> d = []
>>> def greedy():
...    for i in range(100):
...       d.append('garbage data'*i)
>>> memory_grow(stable)
>>> memory_grow(greedy)

C Code Memory Leaks

If the Python code seems perfectly fine and the memory still increases when you loop through the isolated function, the leak might be located on the C side. This happens for instance when a Py_DECREF call is missing.

The Python core code is pretty robust and tested for leaks. If you use packages that have C extensions, they might be a good place to look at first.

Profiling Network Usage

As I said earlier, an application that communicates with third-party programs such as a database or an LDAP server can be slowed down when those applications are slow. This can be tracked with a regular code profiling method on the application side. But if the third-party software works fine on its own, the culprit is probably the network.

The problem might be a misconfigured hub, a low-bandwidth network link, or even a high number of traffic collisions that makes computers send the same packets several times.

Here are a few elements to get you in. To find out what is going on, there are three fields to investigate:

If you want to go further on network performance issues, you might also want to read Network Performance Open Source Toolkit by Richard Blum (Wiley). This book exposes strategies to tune the applications that are heavily using the network, and provides a tutorial to scan complex network problems.

High Performance MySQL by Jeremy Zawodny (O'Reilly) is also a good book to read when writing an application that uses MySQL.


We have seen in this chapter:

  • The three rules of optimization:

    • Make it work first.

    • Take the user's point of view.

    • Keep the code readable.

  • An optimization strategy based on writing a scenario with speed objectives.

  • How to profile code memory, and a few tips for network profiling.

Now that you know how to find problems, the next chapter provides solutions to get rid of them.