4. Choosing Good Names – Expert Python Programming

Chapter 4. Choosing Good Names

Most of the standard library was built with usability in mind. Working with built-in types, for instance, feels natural because they were designed to be easy to use. In that respect, Python can be compared to the pseudo-code you sketch when thinking about a program: most of the code can be read out loud. For instance, this snippet is understandable by anyone:

>>> if 'd' not in my_list:
...     my_list.append('d')
... 

This is one of the reasons why writing Python is so easy when compared to other languages. When you are writing a program, the flow of your thoughts is quickly translated into lines of code.

This chapter focuses on the best practices to write code that is easy to understand and use, through:

  • The usage of naming conventions, described in PEP 8, and a set of naming best practices

  • Namespace refactoring

  • Working on APIs, from their initial shape to their refactoring

PEP 8 and Naming Best Practices

PEP 8 ( http://www.python.org/dev/peps/pep-0008) provides a style guide for writing Python code. Besides some basic rules such as space indentation, maximum line length, and other details concerning the code layout, PEP 8 also provides a section on naming conventions that most code bases follow.

This section provides a quick summary of this PEP, and adds to it a naming best-practice guide for each kind of element.

Naming Styles

The different naming styles used in Python are:

  • CamelCase, where words are capitalized and grouped

  • mixedCase, which is like CamelCase, but starts with a lower case character

  • UPPERCASE, and UPPER_CASE_WITH_UNDERSCORES

  • lowercase and lower_case_with_underscores

  • _leading and trailing_ underscores, and sometimes __doubled__

Lower case and upper case elements are often a single word, and sometimes a few words concatenated. With underscores, they are usually abbreviated phrases. Using a single word is better. The leading and trailing underscores are used to mark private and special elements.

These styles are applied to:

  • Variables

  • Functions and methods

  • Properties

  • Classes

  • Modules

  • Packages

Variables

There are two kinds of variables in Python:

  • Constants

  • Public and private variables

Constants

For constant global variables, upper case with underscores is used. It informs developers that the given variable represents a constant value.

Note

There are no real constants in Python like those in C++ where const can be used. You can change the value of any variable. That's why Python uses a naming convention to mark a variable as a constant.

For example, the doctest module provides a list of option flags and directives (see http://docs.python.org/lib/doctest-options.html) that are small sentences, clearly defining what each option is intended for:

>>> from doctest import IGNORE_EXCEPTION_DETAIL
>>> from doctest import REPORT_ONLY_FIRST_FAILURE

These variable names seem rather long, but it is important to clearly describe them. Their usage is mostly located in initialization code rather than in the body of the code itself, so this verbosity is not annoying.

Note

Abbreviated names obfuscate the code most of the time. Don't be afraid of using complete words when an abbreviation seems unclear.

Some constants' names are also driven by the underlying technology. For instance, the os module uses some constants that are defined on the C side, such as the EX_XXX series, which defines process exit codes.

>>> import os
>>> import sys
>>> def bail_out():
...     print 'internal software error'
...     sys.exit(os.EX_SOFTWARE)
... 

A good practice when using constants is to gather them at the top of a module that uses them, and combine them under new variables when they are intended for such operations:

>>> import doctest
>>> TEST_OPTIONS = (doctest.ELLIPSIS |
...                 doctest.NORMALIZE_WHITESPACE | 
...                 doctest.REPORT_ONLY_FIRST_FAILURE)

Naming and Usage

Constants are used to define a set of values the program relies on, such as the default configuration file name.

A good practice is to gather all the constants in a single file in the package. That is how Django, for instance, works. A module named config.py provides all the constants:

# config.py
SQL_USER = 'tarek'
SQL_PASSWORD = 'secret'
SQL_URI = 'postgres://%s:%s@localhost/db' % \
              (SQL_USER, SQL_PASSWORD)
MAX_THREADS = 4

Another approach is to use a configuration file that can be parsed with the ConfigParser module, or an advanced tool such as ZConfig, which is the parser used in Zope to describe its configuration files. But some people argue that it is overkill to use another file format in a language such as Python, where a source file can be edited and changed as easily as a text file.
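
As an illustration, reading the same settings from an INI file might look like this sketch, using the Python 3 spelling of the module name (ConfigParser in Python 2), with made-up section and option names:

```python
import configparser   # named ConfigParser in Python 2

config = configparser.ConfigParser()
# read_string is the Python 3 API; Python 2 used readfp with a file object
config.read_string("""
[database]
sql_user = tarek
max_threads = 4
""")

SQL_USER = config.get('database', 'sql_user')
MAX_THREADS = config.getint('database', 'max_threads')
```

The constants keep their upper-case names; only their source changes.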

For options that act like flags, a good practice is to combine them with Boolean operations, as the doctest and re modules do. The pattern taken from doctest is quite simple:

>>> OPTIONS = {}
>>> def register_option(name):
...     return OPTIONS.setdefault(name, 1 << len(OPTIONS))
... 
>>> def has_option(options, name):
...     return bool(options & name)
... 
>>> # now defining options
... 
>>> BLUE = register_option('BLUE')
>>> RED = register_option('RED')
>>> WHITE = register_option('WHITE')
>>> 
>>> # let's try them
... 
>>> SET = BLUE | RED
>>> has_option(SET, BLUE)
True
>>> has_option(SET, WHITE)
False

When such a new set of constants is created, avoid using a common prefix for them, unless the module has several sets. The module name itself is a common prefix.
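
For instance (module and constant names are made up here), in a hypothetical colors.py module:

```python
# colors.py -- the module name is already the common prefix
BLUE = 1          # read as colors.BLUE: clear
RED = 2           # read as colors.RED: clear
COLOR_BLUE = 1    # read as colors.COLOR_BLUE: redundant
```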

Note

Using binary bit-wise operations to combine options is common in Python. The inclusive OR (|) operator will let you combine several options in a single integer, and the AND (&) operator will let you check that an option is present in the integer (see the has_option function).

This works only if each option value is a distinct power of two, obtained by shifting 1 with the << operator, so that the options stay distinct from one another in the combined integer (see the register_option function).

Public and Private Variables

For global variables that are mutable and public, lower case with underscores should be used. These kinds of variables are not used frequently, though, since the module usually provides getters and setters to work with them when they need to be protected. A leading underscore, in that case, can mark the variable as a private element of the package:

>>> _observers = []
>>> def add_observer(observer):
...     _observers.append(observer)
... 
>>> def get_observers():
...     """Makes sure _observers cannot be modified."""
...     return tuple(_observers)
...

Variables that are located in functions and methods follow the same rules, and are never marked as private since they are local to the context.

For class or instance variables, using the private marker (the leading underscore) has to be done only if making the variable a part of the public signature does not bring any useful information, or is redundant.

In other words, if the variable is used internally in the method to provide a public feature, and is dedicated to this role, it is better to make it private.

For instance, the attributes that are powering a property are good private citizens:

>>> class Citizen(object):
...     def __init__(self):
...         self._message = 'Go boys'
...     def _get_message(self):
...         return self._message
...     kane = property(_get_message)
... 
>>> Citizen().kane
'Go boys'

Another example would be a variable that keeps an internal state. This value is not useful for the rest of the code, but participates in the behavior of the class:

>>> class MeanElephant(object):
...     def __init__(self):
...         self._people_to_kill = []
...     def is_slapped_on_the_butt_by(self, name):
...         self._people_to_kill.append(name)
...         print 'Ouch!'
...     def revenge(self):
...         print '10 years later...'
...         for person in self._people_to_kill:
...             print 'Me kill %s' % person  
... 
>>> joe = MeanElephant()
>>> joe.is_slapped_on_the_butt_by('Tarek')
Ouch!
>>> joe.is_slapped_on_the_butt_by('Bill')
Ouch!
>>> joe.revenge()
10 years later...
Me kill Tarek
Me kill Bill

Note

Never bet on how your class might be subclassed.

Functions and Methods

Functions and methods should be in lower case with underscores. This rule is not always followed in the standard library, though, and you can find some modules with mixedCase such as currentThread in the threading module (which will probably change in Python 3000).

This way of writing methods was common before the lower case norm became the standard, and some frameworks such as Zope are also using mixedCase for methods. The community of developers working with it is quite large. So the choice between mixedCase and lower case with an underscore is definitely driven by the library you are using.

As a Zope developer, it is not easy to stay consistent, because building an application that mixes pure Python modules and modules that import Zope code is difficult. In Zope, some classes mix both conventions because the code base is still evolving into an egg-based framework, where each module is closer to pure Python than before.

A decent practice in this kind of library environment is to use mixedCase only for elements that are exposed in the framework, and to keep the rest of the code in PEP 8 style.
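
A sketch of this boundary, with hypothetical class and method names: the framework-facing method keeps the framework's mixedCase, while the internal helper stays in PEP 8 style.

```python
class PageWidget(object):
    def renderWidget(self):          # exposed to a mixedCase framework
        return self._render_body()

    def _render_body(self):          # internal helper: PEP 8 style
        return '<div>widget</div>'
```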

The Private Controversy

For private methods and functions, a leading underscore is conventionally added. This rule was quite controversial because of the name mangling feature in Python. When a method has two leading underscores, it is renamed on the fly by the interpreter to prevent a name collision with a method from any subclass.

So some people tend to use a double leading underscore for their private attributes to avoid name collision in the subclasses:

>>> class Base(object):
...     def __secret(self):
...         print "don't tell"
...     def public(self):
...         self.__secret()
... 
>>> Base.__secret
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'Base' has no attribute '__secret'
>>> dir(Base)
['_Base__secret', ..., 'public']
>>> class Derived(Base):
...     def __secret(self):
...         print "never ever"
... 
>>> Derived().public()
don't tell

The original motivation for name mangling in Python was not to provide a private gimmick like in C++, but to make sure that some base classes implicitly avoid collisions in subclasses, especially in multiple inheritance contexts. But using it for every attribute obfuscates the code, which is not Pythonic at all.

Therefore, some people argue that explicit name mangling should always be used:

>>> class Base(object):
...     def _Base__secret(self):      # don't do this !!!
...         print "you told it ?"
... 

This duplicates the class name all over the code, so the double leading underscore should be preferred.

But the best practice, as the BDFL (Guido, the Benevolent Dictator For Life, see http://en.wikipedia.org/wiki/BDFL) said, is to avoid using name mangling by looking at the __mro__ (method resolution order) value of a class before writing a method in a subclass. Changing the base class private methods has to be done carefully.
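
Looking at __mro__ is straightforward; a small sketch with made-up class names:

```python
class Plugin(object):
    def activate(self):
        return 'activated'

class CustomPlugin(Plugin):
    pass

# the classes involved in attribute lookup for CustomPlugin, in order
lookup_order = [klass.__name__ for klass in CustomPlugin.__mro__]
# lookup_order is ['CustomPlugin', 'Plugin', 'object']
```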

For more information on this topic, an interesting thread occurred in the "python-dev" list a few years ago, where people argued the utility of name mangling and its fate in the language. It can be found at: http://mail.python.org/pipermail/python-dev/2005-December/058555.html.

Special Methods

Special methods ( http://docs.python.org/ref/specialnames.html) start and end with a double underscore, and no normal method should use this convention. They are used for operator overloading, container definitions, and so on. For the sake of readability, they should be gathered at the beginning of class definitions:

>>> class weirdint(int):
...     def __add__(self, other):
...         return int.__add__(self, other) + 1
...     def __repr__(self):
...         return '<weirdo %d>' % self
...     
...     #
...     # public API
...     #
...     def do_this(self):
...         print 'this'
...     def do_that(self):
...         print 'that'

For a normal method, you should never use these kinds of names. So don't invent a name for a method such as this:

>>> class BadHabits(object):
...     def __my_method__(self):
...         print 'ok'
... 

Arguments

Arguments are in lower case, with underscores if needed. They follow the same naming rules as variables.

Properties

The names of properties are in lower case, or in lower case with underscores. Most of the time, they represent an object's state, which can be a noun or an adjective, or a small phrase when needed:

>>> class Connection(object):
...     _connected = []
...     def connect(self, user):
...         self._connected.append(user)
...     def _connected_people(self):
...         return '\n'.join(self._connected)
...     connected_people = property(_connected_people)
... 
>>> my = Connection()
>>> my.connect('Tarek')
>>> my.connect('Shannon')
>>> print my.connected_people
Tarek
Shannon

Classes

The names of classes are always in CamelCase, and may have a leading underscore when they are private to a module.

The class and instance variables are often noun phrases, and form a usage logic with the method names that are verb phrases:

>>> class Database(object):
...     def open(self):
...         pass
... 
>>> class User(object):
...     pass
...     
>>> user = User()
>>> db = Database()
>>> db.open()

Modules and Packages

Besides the special module __init__, the module names are in lower case with no underscores.

The following are some examples from the standard library:

  • os

  • sys

  • shutil

When the module is private to the package, a leading underscore is added. Compiled C or C++ modules are usually named with a leading underscore and imported in pure Python modules.
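
A common pattern built on this convention (the module and function names here are hypothetical): try the compiled extension first, and fall back to a pure Python implementation with the same signature.

```python
try:
    from _speedups import checksum        # compiled C version, if built
except ImportError:
    def checksum(data):
        # pure Python fallback with the same signature
        return sum(data) % 256
```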

Packages follow the same rules, since they act like modules in the namespace.

Naming Guide

A common set of naming rules can be applied on variables, methods, functions, and properties. The names of classes and modules also play an important role in namespace construction, and in turn in code readability. This mini-guide provides common patterns and anti-patterns for picking their names.

Use "has" or "is" Prefix for Boolean Elements

When an element holds a Boolean value, the "is" and "has" prefixes provide a natural way to make it more readable in its namespace:

>>> class DB(object):
...     is_connected = False
...     has_cache = False
... 
>>> database = DB()
>>> database.has_cache
False
>>> if database.is_connected:
...     print "That's a powerful class"
... else:
...     print "No wonder..."
... 
No wonder...

Use Plural for Elements That Are Sequences

When an element is holding a sequence, it is a good idea to use a plural form. Some mappings can also benefit from this when they are exposed like sequences:

>>> class DB(object):
...     connected_users = ['Tarek']
...     tables = {'Customer': ['id', 'first_name',
...                            'last_name']}
... 

Use Explicit Names for Dictionaries

When a variable holds a mapping, you should use an explicit name when possible. For example, if a dict holds some persons' addresses, it can be named person_address:

>>> person_address = {'Bill': '6565 Monty Road', 
...                   'Pamela': '45 Python street'}
>>> person_address['Pamela']
'45 Python street'

Avoid Generic Names

Using terms such as list, dict, sequence, or elements, even for local variables, is evil if your code is not building a new abstract data type. It makes the code hard to read, understand, and use. Using a built-in name has to be avoided as well, so that it is not shadowed in the current namespace. Generic verbs should also be avoided, unless they have a meaning in the namespace.

Instead, domain-specific terms should be used:

>>> def compute(data):         # too generic
...     for element in data:
...         yield element * 12
... 
>>> def display_numbers(numbers):   # better
...     for number in numbers:
...         yield number * 12   
... 

Avoid Existing Names

It is a bad practice to use names that already exist in the context because it makes reading and, more specifically, debugging very confusing:

>>> def bad_citizen():
...     os = 1
...     import pdb; pdb.set_trace()
...     return os
... 
>>> bad_citizen()
> <stdin>(4)bad_citizen()
(Pdb) os
1
(Pdb) import os
(Pdb) c
<module 'os' from '/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/os.pyc'>

In this example, the os name was shadowed by the code. Both built-ins and module names from the standard library should be avoided.

Try to create original names, even if they are local to the context. For keywords, a trailing underscore is a way to avoid a collision:

>>> def xapian_query(terms, or_=True):
...     """if or_ is true, terms are combined 
...     with the OR clause"""
...     pass 
... 

Note that class is often replaced by klass or cls:

>>> def factory(klass, *args, **kw):
...     return klass(*args, **kw)
... 

Best Practices for Arguments

The signatures of functions and methods are the guardians of code integrity. They drive its usage and build its API. Besides the naming rules that we have seen previously, special care has to be taken for arguments. This can be done through three simple rules:

  • Build arguments by Iterative Design.

  • Trust the arguments and your tests.

  • Use *args and **kw magic arguments carefully.

Build Arguments by Iterative Design

Having a fixed and well-defined list of arguments for each function makes the code more robust. But this can't be done in the first version, so arguments have to be built by iterative design. They should reflect the precise use cases the element was created for, and evolve accordingly.

For instance, when some arguments are appended, they should have default values wherever possible to avoid any regression:

>>> class BD(object): # version 1
...     def _query(self, query, type):
...         print 'done'
...     def execute(self, query): 
...         self._query(query, 'EXECUTE')
... 
>>> BD().execute('my query')
done
>>> import logging
>>> class BD(object): # version 2
...     def _query(self, query, type, logger):
...         logger('done')
...     def execute(self, query, logger=logging.info): 
...         self._query(query, 'EXECUTE', logger)
... 
>>> BD().execute('my query')    # old-style call
>>> BD().execute('my query', logging.warning)
WARNING:root:done

When the argument of a public element has to be changed, a deprecation process is to be used, which is presented later in this section.

Trust the Arguments and Your Tests

Given Python's dynamic typing nature, some developers use assertions at the top of their functions and methods to make sure the arguments have proper content:

>>> def division(dividend, divisor):
...     assert type(dividend) in (long, int, float)
...     assert type(divisor) in (long, int, float)
...     return dividend / divisor
... 
>>> division(2, 4)
0
>>> division(2, 'okok')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in division
AssertionError

This is often done by developers who are used to static typing and feel that something is missing in Python.

This way of checking arguments is a part of the Design by Contract (DbC, see http://en.wikipedia.org/wiki/Design_By_Contract) programming style, where pre-conditions are checked before the code is actually run.

The two main problems in this approach are:

  1. The contract-checking code explains how the function should be used, which makes the function body less readable.

  2. This can make it slower, since the assertions are made on each call.

The latter can be avoided with the "-O" option of the interpreter. In that case, all assertions are removed from the code before the byte code is created, so that the checking is lost.
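
For instance, with the following sketch, the check runs during development but disappears from the byte code under "python -O", the mode in which the interpreter sets the __debug__ flag to False:

```python
def checked_division(dividend, divisor):
    # removed from the byte code when the interpreter runs with -O
    assert divisor != 0, 'divisor must not be zero'
    return dividend / divisor
```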

In any case, assertions have to be done carefully, and should not be used to bend Python to a statically typed language. The only use case for this is to protect the code from being called nonsensically.

A healthy Test-Driven Development style provides a robust base code in most cases. Here, the functional and unit tests validate all the use cases the code is created for.

When code in a library is used by external elements, making assertions can be useful, as the incoming data might break things or even cause damage. This happens for code that deals with databases or the file system.

Another approach towards this is "fuzz testing" ( http://en.wikipedia.org/wiki/Fuzz_testing), where random pieces of data are sent to the program to detect its weaknesses. When a new defect is found, the code can be fixed to take care of that, together with a new test.

Note that a code base which follows the TDD approach evolves in the right direction and gets increasingly robust, since it is tuned every time a new failure occurs. When it is done in the right way, the list of assertions in the tests becomes similar in some way to a list of pre-conditions.

Anyhow, many DbC libraries exist in Python for people who are fond of it. You can have a look at Contracts for Python ( http://www.wayforward.net/pycontract/).

Use *args and **kw Magic Arguments Carefully

*args and **kw arguments can break the robustness of a function or method. They make the signature fuzzy, and the code often starts to build a small argument parser where it should not:

>>> def fuzzy_thing(**kw):
...     if 'do_this' in kw:
...         print 'ok i did'
...     if 'do_that' in kw:
...         print 'that is done'
...     print 'errr... ok'
... 
>>> fuzzy_thing()
errr... ok
>>> fuzzy_thing(do_this=1)
ok i did
errr... ok
>>> fuzzy_thing(do_that=1)
that is done
errr... ok
>>> fuzzy_thing(hahahahaha=1)
errr... ok

If the argument list gets long and complex, it is tempting to add magic arguments. But this is more a sign of a weak function or method that should be broken into pieces or refactored.

When *args is used to deal with a sequence of elements that are treated the same way in the function, asking for a unique container argument such as an iterator is better:

>>> def sum(*args):       # okay
...     total = 0
...     for arg in args:
...         total += arg
...     return total
... 
>>> def sum(sequence):    # better !
...     total = 0
...     for element in sequence:
...         total += element
...     return total
...

For **kw, the same rule applies. It is better to fix the named arguments to make the method's signature meaningful:

>>> def make_sentence(**kw):
...     noun = kw.get('noun', 'Bill')
...     verb = kw.get('verb', 'is')
...     adj = kw.get('adjective', 'happy')
...     return '%s %s %s' % (noun, verb, adj)
... 
>>> def make_sentence(noun='Bill', verb='is', adjective='happy'):
...     return '%s %s %s' % (noun, verb, adjective)
... 

Another interesting approach is to create a container class that groups several related arguments to provide an execution context. This structure differs from *args or **kw because it can provide internals that work over the values and can evolve independently. The code that uses it as an argument will not have to deal with its internals.

For instance, a web request passed on to a function is often represented by an instance of a class. This class is in charge of holding the data passed by the web server:

>>> def log_request(request):     # version 1
...     print request.get('HTTP_REFERER', 'No referer')
... 
>>> def log_request(request):     # version 2
...     print request.get('HTTP_REFERER', 'No referer')
...     print request.get('HTTP_HOST', 'No host')
... 
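
A minimal sketch of such a container class (the names are hypothetical): the function relies only on the get method, so the class can grow new internals without changing the function's signature.

```python
class WebRequest(object):
    """Holds the data passed by the web server."""
    def __init__(self, environ):
        self._environ = environ

    def get(self, key, default=None):
        return self._environ.get(key, default)

def log_request(request):
    return request.get('HTTP_REFERER', 'No referer')
```

log_request(WebRequest({'HTTP_REFERER': 'example.com'})) then keeps working unchanged even if WebRequest later gains parsing or validation logic.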

Magic arguments cannot be avoided sometimes, especially in meta-programming, for example, the decorators that work on functions with any kind of signature. More globally, when working with unknown data that just traverses the function, the magic arguments are great:

>>> import logging
>>> def log(**context):
...     logging.info('Context is:\n%s\n' % str(context))
... 

Class Names

The name of a class has to be concise and precise, so that what the class does can be understood from it alone. A common practice is to use a suffix that informs about its type or nature, for example:

  • SQLEngine

  • MimeTypes

  • StringWidget

  • TestCase

For base classes, a Base or Abstract prefix can be used as follows:

  • BaseCookie

  • AbstractFormatter

The most important thing is to be consistent with the class attributes. For example, try to avoid redundancy between the class and its attributes' names:

>>> SMTP.smtp_send()  # redundant information in the namespace
>>> SMTP.send()       # more readable and mnemonic          

Module and Package Names

Module and package names inform about the purpose of their content. The names are short, in lower case, and without underscores:

  • sqlite

  • postgres

  • sha1

They are often suffixed with lib if they are implementing a protocol:

>>> import smtplib

>>> import urllib

>>> import telnetlib

They also need to be consistent within the namespace, so their usage is easier:

>>> from widgets.stringwidgets import TextWidget # bad
>>> from widgets.strings import TextWidget # better

Again, always avoid using the same name as that of one of the modules from the standard library.

When a module is getting complex and contains a lot of classes, it is a good practice to create a package and split the module's elements into other modules.

The __init__ module can also be used to put back some APIs at the top level, as it will not impact its usage but will help re-organize the code in smaller parts. For example, an __init__ module in a foo package containing

from module1 import feature1, feature2
from module2 import feature3

will allow users to import features directly:

>>> from foo import feature1
>>> from foo import feature2, feature3

But beware that this can increase your chances of getting circular dependencies, and that the code added in the __init__ module will be executed when the package is imported. So use it with care.

Working on APIs

We have seen in the previous section that the packages and modules are first-class citizens to ease the usage of a library or an application. They should be organized carefully, since together they create an API.

This section provides some insights on how to work through this matter:

  • Tracking verbosity

  • Building the namespace tree

  • Splitting the code

  • Using a deprecation process

  • Using eggs

Tracking Verbosity

A common mistake when creating a library is "API verbosity". This happens when a feature is provided through a set of calls instead of a single call to the package.

Let's take an example of a script_engine package that will let you execute some code:

>>> from script_engine import make_context
>>> from script_engine import compile
>>> from script_engine import execute
>>> context = make_context({'a': 1, 'b':3})
>>> byte_code = compile('a + b')
>>> print execute(byte_code)
4

This use case should be provided within the package under a new function:

>>> from script_engine import run
>>> print run('a + b', context={'a': 1, 'b':3})
4

Both levels will then be available: the high-level function for the common case, and the low-level functions for other combinations.

Note

This principle is described in Chapter 14 through the Facade design pattern.

Building the Namespace Tree

A simple technique to organize an application API is to build a namespace tree through the use cases and see how the code can be organized.

Let's take an example. An application called acme provides an engine that knows how to create PDF files. It is based on a list of template files and on a query made on a MySQL database.

The three parts of the acme application are:

  • A PDF generator

  • An SQL engine

  • A template collection

From there, a first draft of the namespace tree that comes to mind could be:

  • acme

    • pdfgen.py

      • class PDFGen

    • sqlengine.py

      • class SQLEngine

    • templates.py

      • class Template

Let's now try the namespace in a code sample and see how a PDF could be created from this application. We will guess how the classes and functions could be named and called in a glue function that exercises the main feature of acme:

>>> from acme.templates import Template
>>> from acme.sqlengine import SQLEngine
>>> from acme.pdfgen import PDFGen
>>> SQL_URI = 'sqlite:///:memory:'
>>> def generate_pdf(query, template_name):
...     data = SQLEngine(SQL_URI).execute(query)
...     template = Template(template_name)
...     return PDFGen().create(data, template)        

This first version gives us feedback on the namespace's usability, and can be refactored to simplify things, using API verbosity tracking and common sense.

For instance, the PDFGen class does not need to be created within the caller, since a module-level function can instantiate it internally. Therefore, it can stay private. The templates usage can also be simplified in the following manner:

>>> from acme import templates
>>> from acme.sqlengine import SQLEngine
>>> from acme.pdf import generate
>>> SQL_URI = 'sqlite:///:memory:'
>>> def generate_pdf(query, template_name):
...     data = SQLEngine(SQL_URI).execute(query)
...     template = templates.generate(template_name)
...     return generate(data, template)        

A second draft of the namespace will then be:

  • acme

    • config.py

      • SQL_URI

    • utils.py

      • function generate_pdf

    • pdf.py

      • function generate

      • class _Generator

    • sqlengine.py

      • class SQLEngine

    • templates.py

      • function generate

      • class _Template

The changes made are as follows:

  • config.py contains the configuration element.

  • utils.py provides the high-level API.

  • pdf.py provides a unique function.

  • templates.py provides a factory.

For each new use case, such structural changes help in designing a usable API. This has to be done before the package is released and used. For released packages, a deprecation process has to be set, which will be explained later in this chapter.

Note

The namespace tree should be carefully designed through real use cases. We will see in Chapter 11 how to build it through tests.

Splitting the Code

Small is beautiful! And this should be applied to the code as well, at all levels. When a function, class, or a module gets too big, it should be split.

A function or a method should not be bigger than a screen, which is around 25 to 30 lines. Otherwise it is hard to follow and understand.

Note

See the related chapter, in the Art of Unix Programming by Eric Raymond ( http://www.faqs.org/docs/artu/ch13s01.html) for more information about code complexity.

A class should have a limited number of methods. When there are more than ten methods, even its creator can have a hard time getting the whole picture. A common practice is to isolate the functionalities and create several classes out of them.
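
For example, a class that both formats and mails a report can be split into two focused classes (names made up for illustration):

```python
class ReportFormatter(object):
    """Knows only how to turn rows into text."""
    def format(self, rows):
        return '\n'.join(', '.join(str(value) for value in row)
                         for row in rows)

class ReportMailer(object):
    """Knows only how to build the message; formatting is delegated."""
    def __init__(self, formatter):
        self._formatter = formatter

    def build_message(self, rows):
        return 'Report:\n' + self._formatter.format(rows)
```

Each class can now evolve and be tested on its own.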

A module should also be limited in its size. When it is more than 500 lines, it should be split into several modules.

This work will impact the API and will imply some extra work at the package level to ensure that the way the code is split and organized won't make it difficult to use.

In other words, the API should always be tested from the user's point of view to make sure it is usable, mnemonic, and concise.

Using Eggs

When an application grows, the number of packages under the main folder can get quite big. For instance, a framework such as Zope has more than 50 packages under the zope namespace, which is the root package.

To avoid having the whole code base within the same folder, and to be able to release each package separately, "Python eggs" ( http://peak.telecommunity.com/DevCenter/PythonEggs) can be used. They provide a simple way to build "namespaced packages", such as JARs provide in Java.

For instance, if you want to distribute acme.templates as a separate package, you can build an egg-based package with setuptools (the library for creating Python Eggs), using a special __init__.py file in the acme folder, containing ( http://peak.telecommunity.com/DevCenter/setuptools#namespace-packages):

try:
    __import__('pkg_resources').declare_namespace(__name__)
except ImportError:
    from pkgutil import extend_path
    __path__ = extend_path(__path__, __name__)

The acme folder will then be able to hold a templates folder and be available under the acme.templates namespace. acme.pdf can even be shipped as a separate distribution, with its own copy of the acme folder.
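As an illustration, a stand-alone acme.templates distribution could be laid out as follows (the folder names here are hypothetical, but the structure matches what setuptools expects for namespace packages):

```
acme.templates/          # hypothetical distribution root
    setup.py             # passes namespace_packages=['acme'] to setup()
    acme/
        __init__.py      # contains only the namespace declaration code
        templates/
            __init__.py  # the real code of the sub-package lives here
```

With this layout, several distributions can share the acme folder once installed, so import acme.templates keeps working whether the package is installed alone or next to acme.pdf.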

Following the same rule, the packages from the same organization can be gathered in the same namespace using eggs, even if they are not related to each other. For example, all packages from Ingeniweb use the iw namespace and can be found on the Cheeseshop using the following prefix: http://pypi.python.org/pypi?%3Aaction=search&term=iw.&submit=search.

Besides providing a namespace, distributing the application in eggs helps the modularization of your work, since each egg can be seen as a separate component.

Note

Chapter 6 will cover how to build, release, and deploy an egg-based application.

Using a Deprecation Process

Changing the API has to be done carefully when the package is already released and used by third-party code. The simplest way to deal with such changes is to follow a deprecation process where an intermediate release contains both versions.

For example, if a class has a run_script method that is replaced by a simplified run method, the DeprecationWarning built-in exception can be used in the intermediate release, along with the warnings module, as follows:

>>> class SomeClass(object):            # version 1
...     def run_script(self, script, context):
...         print 'doing the work'
... 
>>> import warnings
>>> class SomeClass(object):            # version 1.5
...     def run_script(self, script, context):
...         warnings.warn(("'run_script' will be replaced "
...                        "by 'run' in version 2"), 
...                       DeprecationWarning)
...         return self.run(script, context)
...     def run(self, script, context=None):
...         print 'doing the work'
... 
>>> SomeClass().run_script('a script', {})
__main__:4: DeprecationWarning: 'run_script' will be replaced by 'run' in version 2
doing the work
>>> SomeClass().run_script('a script', {})
doing the work
>>> class SomeClass(object):            # version 2
...     def run(self, script, context=None):
...         print 'doing the work'
...

The warnings module will warn the user on the first call and will ignore the subsequent calls. Another nice feature of this module is that filters can be created to manage the warnings that impact the application. For example, warnings can be automatically ignored, or turned into exceptions to make the changes mandatory. See http://docs.python.org/lib/warning-filter.html.
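A minimal sketch of this filtering mechanism follows; the run_script function below is a hypothetical stand-in for the deprecated method of the previous example:

```python
import warnings

def run_script(script, context):
    # Stands in for the deprecated method of the example above
    warnings.warn("'run_script' will be replaced by 'run' in version 2",
                  DeprecationWarning)
    return 'doing the work'

# Turn deprecation warnings into errors to make the change mandatory
warnings.simplefilter('error', DeprecationWarning)
try:
    run_script('a script', {})
    blocked = False
except DeprecationWarning:
    blocked = True
print(blocked)          # the deprecated call is now refused

# Or silence the warnings completely
warnings.resetwarnings()
warnings.simplefilter('ignore', DeprecationWarning)
result = run_script('a script', {})   # runs quietly, no warning displayed
print(result)
```

The 'error' filter is a convenient way to track down every remaining call to a deprecated API in a test run, while 'ignore' keeps production logs quiet during the transition.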

Useful Tools

Part of the previous conventions and practices can be controlled and worked out with the following tools:

  • Pylint, a very flexible source code analyzer

  • CloneDigger, a duplicate code detection tool

Pylint

Besides some quality assurance metrics, Pylint allows you to check whether a given source code follows a naming convention. Its default settings correspond to PEP 8, and the Pylint script provides a shell report output.

To install Pylint, you can use the logilab.pylintinstaller egg, with easy_install:

$ easy_install logilab.pylintinstaller

After this step, the pylint command is available and can be run against a module, or several modules using wildcards:

$ pylint bootstrap.py
No config file found, using default configuration
************* Module bootstrap
C: 25: Invalid name "tmpeggs" (should match (([A-Z_][A-Z1-9_]*)|(__.*__))$)
C: 27: Invalid name "ez" (should match (([A-Z_][A-Z1-9_]*)|(__.*__))$)
W: 28: Use of the exec statement
C: 34: Invalid name "cmd" (should match (([A-Z_][A-Z1-9_]*)|(__.*__))$)
C: 36: Invalid name "cmd" (should match (([A-Z_][A-Z1-9_]*)|(__.*__))$)
C: 38: Invalid name "ws" (should match (([A-Z_][A-Z1-9_]*)|(__.*__))$)
...
Global evaluation
-----------------
Your code has been rated at 6.25/10

Notice that there will always be some cases where Pylint will give you bad rates or complaints. For instance, an import statement that is not used by the code of the module itself is perfectly fine in some cases (to make the name available in the namespace).

Making calls to libraries that use mixedCase for methods can also lower your rating. In any case, the global evaluation is not as important as it is with "lint" in C. Pylint is just a tool that points out possible improvements.

The first thing to do to fine-tune Pylint is to create a .pylintrc configuration file in your home directory, with the --generate-rcfile option:

$ pylint --generate-rcfile > ~/.pylintrc

Under Windows, the "~" folder has to be replaced with the user folder, which is usually in the Documents and Settings folder. (See the HOME environment variable.)

The first thing to change in the configuration file is to set the reports variable to no in the REPORTS section, in order to avoid a verbose report. In our case, we just want to use the tool to detect the names. After that change, the tool will only display the warnings:
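For instance, the relevant line in the generated file sits in the [REPORTS] section (option names may vary slightly between Pylint versions):

```ini
[REPORTS]
# Tells whether to display a full report or only the messages
reports=no
```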

$ pylint bootstrap.py
************* Module bootstrap
C: 25: Invalid name "tmpeggs" (should match (([A-Z_][A-Z1-9_]*)|(__.*__))$)
C: 27: Invalid name "ez" (should match (([A-Z_][A-Z1-9_]*)|(__.*__))$)
W: 28: Use of the exec statement
C: 34: Invalid name "cmd" (should match (([A-Z_][A-Z1-9_]*)|(__.*__))$)
C: 36: Invalid name "cmd" (should match (([A-Z_][A-Z1-9_]*)|(__.*__))$)
C: 38: Invalid name "ws" (should match (([A-Z_][A-Z1-9_]*)|(__.*__))$)

CloneDigger

CloneDigger ( http://clonedigger.sourceforge.net) is a nice tool that tries to detect similarities in the code by visiting the code tree. It is based on a rather complex algorithm explained on the website, and complements Pylint.

To install it, use easy_install:

$ easy_install CloneDigger

You will get a clonedigger command that can be used to detect duplicate code. The options are described here: http://clonedigger.sourceforge.net/documentation.html.

$ clonedigger html_report.py ast_suppliers.py
Parsing html_report.py ... done
Parsing ast_suppliers.py ... done
40 sequences
average sequence length: 3.250000
maximum sequence length: 14
Number of statements: 130
Calculating size for each statement... done
Building statement hash... done
Number of different hash values: 52
Building patterns... 66 patterns were discovered
Choosing pattern for each statement... done
Finding similar sequences of statements... 0 sequences were found
Refining candidates... 0 clones were found
Removing dominated clones... 0 clones were removed

An HTML output is generated in output.html that contains a report on CloneDigger's work.

Summary

This chapter explained the following:

  • PEP 8 is the absolute reference for naming conventions.

  • A few rules should be followed when choosing names:

    • Use "has" or "is" prefix for Boolean elements.

    • Use plural for elements that are sequences.

    • Avoid generic names.

    • Avoid shadowing existing names, especially built-ins.

  • A set of good practices for arguments is:

    • Build arguments by design.

    • Don't try to implement static-type checking using assertions.

    • Don't misuse *args and **kw.

  • Some common practices when working on APIs are:

    • Track verbosity.

    • Build the namespace tree by design.

    • Split the code into small pieces.

    • Use eggs for your libraries, under a common namespace.

    • Use a deprecation process.

  • Use Pylint and CloneDigger to control the code.

The next chapter explains how to write a package.