Commit e8f44d68 authored by Andrew M. Kuchling

Commit the howto source to the main Python repository, with Fred's approval

parent f1b2ba6a
MKHOWTO=../tools/mkhowto
WEBDIR=.
RSTARGS = --input-encoding=utf-8
VPATH=.:dvi:pdf:ps:txt
# List of HOWTOs that aren't to be processed
REMOVE_HOWTO =
# Determine list of files to be built
HOWTO=$(filter-out $(REMOVE_HOWTO),$(wildcard *.tex))
RST_SOURCES = $(shell echo *.rst)
DVI =$(patsubst %.tex,%.dvi,$(HOWTO))
PDF =$(patsubst %.tex,%.pdf,$(HOWTO))
PS =$(patsubst %.tex,%.ps,$(HOWTO))
TXT =$(patsubst %.tex,%.txt,$(HOWTO))
HTML =$(patsubst %.tex,%,$(HOWTO))
# Rules for building various formats
%.dvi : %.tex
	$(MKHOWTO) --dvi $<
	mv $@ dvi

%.pdf : %.tex
	$(MKHOWTO) --pdf $<
	mv $@ pdf

%.ps : %.tex
	$(MKHOWTO) --ps $<
	mv $@ ps

%.txt : %.tex
	$(MKHOWTO) --text $<
	mv $@ txt

% : %.tex
	$(MKHOWTO) --html --iconserver="." $<
	tar -zcvf html/$*.tgz $*
#	zip -r html/$*.zip $*
default:
	@echo "'all' -- build all files"
	@echo "'dvi', 'pdf', 'ps', 'txt', 'html' -- build one format"
all: $(HTML)
.PHONY : dvi pdf ps txt html rst
dvi: $(DVI)
pdf: $(PDF)
ps: $(PS)
txt: $(TXT)
html: $(HTML)
# Rule to build collected tar files
dist: #all
	for i in dvi pdf ps txt ; do \
	    cd $$i ; \
	    tar -zcf All.tgz *.$$i ;\
	    cd .. ;\
	done
# Rule to copy files to the Web tree on AMK's machine
web: dist
	cp dvi/* $(WEBDIR)/dvi
	cp ps/* $(WEBDIR)/ps
	cp pdf/* $(WEBDIR)/pdf
	cp txt/* $(WEBDIR)/txt
	for dir in $(HTML) ; do cp -rp $$dir $(WEBDIR) ; done
	for ltx in $(HOWTO) ; do cp -p $$ltx $(WEBDIR)/latex ; done
rst: unicode.html

%.html: %.rst
	rst2html $(RSTARGS) $< >$@
clean:
	rm -f *~ *.log *.ind *.l2h *.aux *.toc *.how
	rm -f *.dvi *.ps *.pdf *.bkm
	rm -f unicode.html

clobber:
	rm dvi/* ps/* pdf/* txt/* html/*
\documentclass{howto}
\title{Python Advocacy HOWTO}
\release{0.03}
\author{A.M. Kuchling}
\authoraddress{\email{amk@amk.ca}}
\begin{document}
\maketitle
\begin{abstract}
\noindent
It's usually difficult to get your management to accept open source
software, and Python is no exception to this rule. This document
discusses reasons to use Python, strategies for winning acceptance,
facts and arguments you can use, and cases where you \emph{shouldn't}
try to use Python.
This document is available from the Python HOWTO page at
\url{http://www.python.org/doc/howto}.
\end{abstract}
\tableofcontents
\section{Reasons to Use Python}
There are several reasons to incorporate a scripting language into
your development process; this section will discuss them, and explain
why Python has some properties that make it a particularly good choice.
\subsection{Programmability}
Programs are often organized in a modular fashion. Lower-level
operations are grouped together, and called by higher-level functions,
which may in turn be used as basic operations by still higher
levels.
For example, the lowest level might define a very low-level
set of functions for accessing a hash table. The next level might use
hash tables to store the headers of a mail message, mapping a header
name like \samp{Date} to a value such as \samp{Tue, 13 May 1997
20:00:54 -0400}. A yet higher level may operate on message objects,
without knowing or caring that message headers are stored in a hash
table, and so forth.
Often, the lowest levels do very simple things; they implement a data
structure such as a binary tree or hash table, or they perform some
simple computation, such as converting a date string to a number. The
higher levels then contain logic connecting these primitive
operations.  Using this approach, the primitives can be seen as basic
building blocks which are then glued together to produce the complete
product.
Why is this design approach relevant to Python? Because Python is
well suited to functioning as such a glue language. A common approach
is to write a Python module that implements the lower level
operations; for the sake of speed, the implementation might be in C,
Java, or even Fortran. Once the primitives are available to Python
programs, the logic underlying higher level operations is written in
the form of Python code. The high-level logic is then more
understandable, and easier to modify.
John Ousterhout wrote a paper that explains this idea at greater
length, entitled ``Scripting: Higher Level Programming for the 21st
Century''. I recommend that you read this paper; see the references
for the URL. Ousterhout is the inventor of the Tcl language, and
therefore argues that Tcl should be used for this purpose; he only
briefly refers to other languages such as Python, Perl, and
Lisp/Scheme, but in reality, Ousterhout's argument applies to
scripting languages in general, since you could equally write
extensions for any of the languages mentioned above.
\subsection{Prototyping}
In \emph{The Mythical Man-Month}, Frederick Brooks suggests the
following rule when planning software projects: ``Plan to throw one
away; you will anyway.'' Brooks is saying that the first attempt at a
software design often turns out to be wrong; unless the problem is
very simple or you're an extremely good designer, you'll find that new
requirements and features become apparent once development has
actually started. If these new requirements can't be cleanly
incorporated into the program's structure, you're presented with two
unpleasant choices: hammer the new features into the program somehow,
or scrap everything and write a new version of the program, taking the
new features into account from the beginning.
Python provides you with a good environment for quickly developing an
initial prototype. That lets you get the overall program structure
and logic right, and you can fine-tune small details in the fast
development cycle that Python provides. Once you're satisfied with
the GUI interface or program output, you can translate the Python code
into C++, Fortran, Java, or some other compiled language.
Prototyping means you have to be careful not to use too many Python
features that are hard to implement in your other language. Using
\code{eval()}, or regular expressions, or the \module{pickle} module,
means that you're going to need C or Java libraries for formula
evaluation, regular expressions, and serialization, for example. But
it's not hard to avoid such tricky code, and in the end the
translation usually isn't very difficult. The resulting code can be
rapidly debugged, because any serious logical errors will have been
removed from the prototype, leaving only more minor slip-ups in the
translation to track down.
This strategy builds on the earlier discussion of programmability.
Using Python as glue to connect lower-level components has obvious
relevance for constructing prototype systems. In this way Python can
help you with development, even if end users never come in contact
with Python code at all. If the performance of the Python version is
adequate and corporate politics allow it, you may not need to do a
translation into C or Java, but it can still be faster to develop a
prototype and then translate it, instead of attempting to produce the
final version immediately.
One example of this development strategy is Microsoft Merchant Server.
Version 1.0 was written in pure Python, by a company that subsequently
was purchased by Microsoft. Version 2.0 began to translate the code
into \Cpp, shipping with some \Cpp code and some Python code. Version
3.0 didn't contain any Python at all; all the code had been translated
into \Cpp. Even though the product doesn't contain a Python
interpreter, the Python language has still served a useful purpose by
speeding up development.
This is a very common use for Python. Past conference papers have
also described this approach for developing high-level numerical
algorithms; see David M. Beazley and Peter S. Lomdahl's paper
``Feeding a Large-scale Physics Application to Python'' in the
references for a good example. If an algorithm's basic operations are
things like ``Take the inverse of this 4000x4000 matrix'', and are
implemented in some lower-level language, then Python has almost no
additional performance cost; the extra time required for Python to
evaluate an expression like \code{m.invert()} is dwarfed by the cost
of the actual computation. It's particularly good for applications
where seemingly endless tweaking is required to get things right. GUI
interfaces and Web sites are prime examples.
The Python code is also shorter and faster to write (once you're
familiar with Python), so it's easier to throw it away if you decide
your approach was wrong; if you'd spent two weeks working on it
instead of just two hours, you might waste time trying to patch up
what you've got out of a natural reluctance to admit that those two
weeks were wasted. Truthfully, those two weeks haven't been wasted,
since you've learnt something about the problem and the technology
you're using to solve it, but it's human nature to view this as a
failure of some sort.
\subsection{Simplicity and Ease of Understanding}
Python is definitely \emph{not} a toy language that's only usable for
small tasks. The language features are general and powerful enough to
enable it to be used for many different purposes. It's useful at the
small end, for 10- or 20-line scripts, but it also scales up to larger
systems that contain thousands of lines of code.
However, this expressiveness doesn't come at the cost of an obscure or
tricky syntax. While Python has some dark corners that can lead to
obscure code, there are relatively few such corners, and proper design
can isolate their use to only a few classes or modules. It's
certainly possible to write confusing code by using too many features
with too little concern for clarity, but most Python code can look a
lot like a slightly-formalized version of human-understandable
pseudocode.
In \emph{The New Hacker's Dictionary}, Eric S. Raymond gives the following
definition for ``compact'':
\begin{quotation}
Compact \emph{adj.} Of a design, describes the valuable property
that it can all be apprehended at once in one's head. This
generally means the thing created from the design can be used
with greater facility and fewer errors than an equivalent tool
that is not compact. Compactness does not imply triviality or
lack of power; for example, C is compact and FORTRAN is not,
but C is more powerful than FORTRAN. Designs become
non-compact through accreting features and cruft that don't
merge cleanly into the overall design scheme (thus, some fans
of Classic C maintain that ANSI C is no longer compact).
\end{quotation}
(From \url{http://sagan.earthspace.net/jargon/jargon_18.html\#SEC25})
In this sense of the word, Python is quite compact, because the
language has just a few ideas, which are used in lots of places. Take
namespaces, for example. Import a module with \code{import math}, and
you create a new namespace called \samp{math}. Classes are also
namespaces that share many of the properties of modules, and have a
few of their own; for example, you can create instances of a class.
Instances? They're yet another namespace. Namespaces are currently
implemented as Python dictionaries, so they have the same methods as
the standard dictionary data type: \method{keys()} returns all the keys, and
so forth.
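The ``everything is a namespace'' idea is easy to demonstrate at the
interpreter prompt.  A small sketch (the class and attribute names here
are invented purely for illustration):

\begin{verbatim}
import math

# Importing a module creates a namespace; 'pi' lives inside 'math'.
print(math.pi)

class Point:
    label = "a class attribute"    # stored in the class namespace

p = Point()
p.x = 3                            # stored in the instance namespace
p.y = 4

# Namespaces are backed by dictionaries, so dictionary-style
# introspection works on all three kinds:
print(sorted(vars(p).keys()))      # the instance's own attributes
print("label" in vars(Point))      # True: 'label' is in the class dict
print("pi" in vars(math))          # True: 'pi' is in the module dict
\end{verbatim}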
This simplicity arises from Python's development history. The
language syntax derives from different sources; ABC, a relatively
obscure teaching language, is one primary influence, and Modula-3 is
another. (For more information about ABC and Modula-3, consult their
respective Web sites at \url{http://www.cwi.nl/~steven/abc/} and
\url{http://www.m3.org}.) Other features have come from C, Icon,
Algol-68, and even Perl. Python hasn't really innovated very much,
but instead has tried to keep the language small and easy to learn,
building on ideas that have been tried in other languages and found
useful.
Simplicity is a virtue that should not be underestimated. It lets you
learn the language more quickly, and then rapidly write code, code
that often works the first time you run it.
\subsection{Java Integration}
If you're working with Java, Jython
(\url{http://www.jython.org/}) is definitely worth your
attention. Jython is a re-implementation of Python in Java that
compiles Python code into Java bytecodes. The resulting environment
has very tight, almost seamless, integration with Java. It's trivial
to access Java classes from Python, and you can write Python classes
that subclass Java classes. Jython can be used for prototyping Java
applications in much the same way CPython is used, and it can also be
used for test suites for Java code, or embedded in a Java application
to add scripting capabilities.
\section{Arguments and Rebuttals}
Let's say that you've decided upon Python as the best choice for your
application. How can you convince your management, or your fellow
developers, to use Python? This section lists some common arguments
against using Python, and provides some possible rebuttals.
\emph{Python is freely available software that doesn't cost anything.
How good can it be?}
Very good, indeed. These days Linux and Apache, two other pieces of
open source software, are becoming more respected as alternatives to
commercial software, but Python hasn't had all the publicity.
Python has been around for several years, with many users and
developers. Accordingly, the interpreter has been used by many
people, and has gotten most of the bugs shaken out of it. While bugs
are still discovered at intervals, they're usually either quite
obscure (they'd have to be, for no one to have run into them before)
or they involve interfaces to external libraries. The internals of
the language itself are quite stable.
Having the source code should be viewed as making the software
available for peer review; people can examine the code, suggest (and
implement) improvements, and track down bugs. To find out more about
the idea of open source code, along with arguments and case studies
supporting it, go to \url{http://www.opensource.org}.
\emph{Who's going to support it?}
Python has a sizable community of developers, and the number is still
growing. The Internet community surrounding the language is an active
one, and is worth being considered another one of Python's advantages.
Most questions posted to the comp.lang.python newsgroup are quickly
answered by someone.
Should you need to dig into the source code, you'll find it's clear
and well-organized, so it's not very difficult to write extensions and
track down bugs yourself. If you'd prefer to pay for support, there
are companies and individuals who offer commercial support for Python.
\emph{Who uses Python for serious work?}
Lots of people; one interesting thing about Python is the surprising
diversity of applications that it's been used for. People are using
Python to:
\begin{itemize}
\item Run Web sites
\item Write GUI interfaces
\item Control
number-crunching code on supercomputers
\item Make a commercial application scriptable by embedding the Python
interpreter inside it
\item Process large XML data sets
\item Build test suites for C or Java code
\end{itemize}
Whatever your application domain is, there's probably someone who's
used Python for something similar.  Yet, despite being usable for
such high-end applications, Python's still simple enough to use for
little jobs.
See \url{http://www.python.org/psa/Users.html} for a list of some of the
organizations that use Python.
\emph{What are the restrictions on Python's use?}
They're practically nonexistent. Consult the \file{Misc/COPYRIGHT}
file in the source distribution, or
\url{http://www.python.org/doc/Copyright.html} for the full language,
but it boils down to three conditions.
\begin{itemize}
\item You have to leave the copyright notice on the software; if you
don't include the source code in a product, you have to put the
copyright notice in the supporting documentation.
\item Don't claim that the institutions that have developed Python
endorse your product in any way.
\item If something goes wrong, you can't sue for damages. Practically
all software licenses contain this condition.
\end{itemize}
Notice that you don't have to provide source code for anything that
contains Python or is built with it. Also, the Python interpreter and
accompanying documentation can be modified and redistributed in any
way you like, and you don't have to pay anyone any licensing fees at
all.
\emph{Why should we use an obscure language like Python instead of
well-known language X?}
I hope this HOWTO, and the documents listed in the final section, will
help convince you that Python isn't obscure, and has a healthily
growing user base. One word of advice: always present Python's
positive advantages, instead of concentrating on language X's
failings. People want to know why a solution is good, rather than why
all the other solutions are bad. So instead of attacking a competing
solution on various grounds, simply show how Python's virtues can
help.
\section{Useful Resources}
\begin{definitions}
\term{\url{http://www.fsbassociates.com/books/pythonchpt1.htm}}
The first chapter of \emph{Internet Programming with Python} also
examines some of the reasons for using Python. The book is well worth
buying, but the publishers have made the first chapter available on
the Web.
\term{\url{http://home.pacbell.net/ouster/scripting.html}}
John Ousterhout's white paper on scripting is a good argument for the
utility of scripting languages, though naturally enough, he emphasizes
Tcl, the language he developed. Most of the arguments would apply to
any scripting language.
\term{\url{http://www.python.org/workshops/1997-10/proceedings/beazley.html}}
The authors, David M. Beazley and Peter S. Lomdahl,
describe their use of Python at Los Alamos National Laboratory.
It's another good example of how Python can help get real work done.
This quotation from the paper has been echoed by many people:
\begin{quotation}
Originally developed as a large monolithic application for
massively parallel processing systems, we have used Python to
transform our application into a flexible, highly modular, and
extremely powerful system for performing simulation, data
analysis, and visualization. In addition, we describe how Python
has solved a number of important problems related to the
development, debugging, deployment, and maintenance of scientific
software.
\end{quotation}
%\term{\url{http://www.pythonjournal.com/volume1/art-interview/}}
%This interview with Andy Feit, discussing Infoseek's use of Python, can be
%used to show that choosing Python didn't introduce any difficulties
%into a company's development process, and provided some substantial benefits.
\term{\url{http://www.python.org/psa/Commercial.html}}
Robin Friedrich wrote this document on how to support Python's use in
commercial projects.
\term{\url{http://www.python.org/workshops/1997-10/proceedings/stein.ps}}
For the 6th Python conference, Greg Stein presented a paper that
traced Python's adoption and usage at a startup called eShop, and
later at Microsoft.
\term{\url{http://www.opensource.org}}
Management may be doubtful of the reliability and usefulness of
software that wasn't written commercially. This site presents
arguments that show how open source software can have considerable
advantages over closed-source software.
\term{\url{http://sunsite.unc.edu/LDP/HOWTO/mini/Advocacy.html}}
The Linux Advocacy mini-HOWTO was the inspiration for this document,
and is also well worth reading for general suggestions on winning
acceptance for a new technology, such as Linux or Python. In general,
you won't make much progress by simply attacking existing systems and
complaining about their inadequacies; this often ends up looking like
unfocused whining. It's much better to point out some of the many
areas where Python is an improvement over other systems.
\end{definitions}
\end{document}
\documentclass{howto}
\title{Curses Programming with Python}
\release{2.01}
\author{A.M. Kuchling, Eric S. Raymond}
\authoraddress{\email{amk@amk.ca}, \email{esr@thyrsus.com}}
\begin{document}
\maketitle
\begin{abstract}
\noindent
This document describes how to write text-mode programs with Python 2.x,
using the \module{curses} extension module to control the display.
This document is available from the Python HOWTO page at
\url{http://www.python.org/doc/howto}.
\end{abstract}
\tableofcontents
\section{What is curses?}
The curses library supplies a terminal-independent screen-painting and
keyboard-handling facility for text-based terminals; such terminals
include VT100s, the Linux console, and the simulated terminal provided
by X11 programs such as xterm and rxvt. Display terminals support
various control codes to perform common operations such as moving the
cursor, scrolling the screen, and erasing areas. Different terminals
use widely differing codes, and often have their own minor quirks.
In a world of X displays, one might ask ``why bother''? It's true
that character-cell display terminals are an obsolete technology, but
there are niches in which being able to do fancy things with them is
still valuable.  One is on small-footprint or embedded Unixes that
don't carry an X server. Another is for tools like OS installers
and kernel configurators that may have to run before X is available.
The curses library hides all the details of different terminals, and
provides the programmer with an abstraction of a display, containing
multiple non-overlapping windows. The contents of a window can be
changed in various ways--adding text, erasing it, changing its
appearance--and the curses library will automagically figure out what
control codes need to be sent to the terminal to produce the right
output.
The curses library was originally written for BSD Unix; the later System V
versions of Unix from AT\&T added many enhancements and new functions.
BSD curses is no longer maintained, having been replaced by ncurses,
which is an open-source implementation of the AT\&T interface. If you're
using an open-source Unix such as Linux or FreeBSD, your system almost
certainly uses ncurses. Since most current commercial Unix versions
are based on System V code, all the functions described here will
probably be available. The older versions of curses carried by some
proprietary Unixes may not support everything, though.
No one has made a Windows port of the curses module. On a Windows
platform, try the Console module written by Fredrik Lundh. The
Console module provides cursor-addressable text output, plus full
support for mouse and keyboard input, and is available from
\url{http://effbot.org/efflib/console}.
\subsection{The Python curses module}
The Python module is a fairly simple wrapper over the C functions
provided by curses; if you're already familiar with curses programming
in C, it's really easy to transfer that knowledge to Python. The
biggest difference is that the Python interface makes things simpler,
by merging different C functions such as \function{addstr},
\function{mvaddstr}, and \function{mvwaddstr} into a single
\method{addstr()} method.  You'll see this covered in more detail
later.
This HOWTO is simply an introduction to writing text-mode programs
with curses and Python. It doesn't attempt to be a complete guide to
the curses API; for that, see the Python library guide's section on
ncurses, and the C manual pages for ncurses. It will, however, give
you the basic ideas.
\section{Starting and ending a curses application}
Before doing anything, curses must be initialized. This is done by
calling the \function{initscr()} function, which will determine the
terminal type, send any required setup codes to the terminal, and
create various internal data structures. If successful,
\function{initscr()} returns a window object representing the entire
screen; this is usually called \code{stdscr}, after the name of the
corresponding C
variable.
\begin{verbatim}
import curses
stdscr = curses.initscr()
\end{verbatim}
Usually curses applications turn off automatic echoing of keys to the
screen, in order to be able to read keys and only display them under
certain circumstances. This requires calling the \function{noecho()}
function.
\begin{verbatim}
curses.noecho()
\end{verbatim}
Applications will also commonly need to react to keys instantly,
without requiring the Enter key to be pressed; this is called cbreak
mode, as opposed to the usual buffered input mode.
\begin{verbatim}
curses.cbreak()
\end{verbatim}
Terminals usually return special keys, such as the cursor keys or
navigation keys such as Page Up and Home, as a multibyte escape
sequence. While you could write your application to expect such
sequences and process them accordingly, curses can do it for you,
returning a special value such as \constant{curses.KEY_LEFT}. To get
curses to do the job, you'll have to enable keypad mode.
\begin{verbatim}
stdscr.keypad(1)
\end{verbatim}
Terminating a curses application is much easier than starting one.
You'll need to call
\begin{verbatim}
curses.nocbreak(); stdscr.keypad(0); curses.echo()
\end{verbatim}
to reverse the curses-friendly terminal settings. Then call the
\function{endwin()} function to restore the terminal to its original
operating mode.
\begin{verbatim}
curses.endwin()
\end{verbatim}
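Putting the pieces together, a minimal application skeleton might look
like the following.  This is only a sketch: the \code{try}/\code{finally}
arrangement is one common way to guarantee that the teardown calls run,
and the \code{isatty()} guard simply makes the script a no-op when it
isn't attached to a real terminal.

\begin{verbatim}
import curses
import sys

def run():
    stdscr = curses.initscr()
    curses.noecho()
    curses.cbreak()
    stdscr.keypad(1)
    try:
        stdscr.addstr(0, 0, "Hello, curses")
        stdscr.refresh()
        stdscr.getch()
    finally:
        # Always restore the terminal, even if an exception occurred.
        stdscr.keypad(0)
        curses.nocbreak()
        curses.echo()
        curses.endwin()

if sys.stdout.isatty():
    run()
\end{verbatim}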
A common problem when debugging a curses application is to get your
terminal messed up when the application dies without restoring the
terminal to its previous state. In Python this commonly happens when
your code is buggy and raises an uncaught exception. Keys are no
longer echoed to the screen when you type them, for example, which
makes using the shell difficult.
In Python you can avoid these complications and make debugging much
easier by importing the module \module{curses.wrapper}. It supplies a
function \function{wrapper} that takes a hook argument. It does the
initializations described above, and also initializes colors if color
support is present. It then runs your hook, and then finally
deinitializes appropriately.  The hook is called inside a try-except
block that catches exceptions, performs curses deinitialization, and
then passes the exception upwards. Thus, your terminal won't be left
in a funny state on exception.
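A sketch of that pattern follows.  Note one assumption: in current
CPython the helper is reachable directly as \function{curses.wrapper()};
the \module{curses.wrapper} module described above exposed the same
function in older releases.  The \code{isatty()} guard is only there so
the example does nothing when run without a terminal.

\begin{verbatim}
import curses
import sys

def main(stdscr):
    # The wrapper has already called initscr(), noecho(), cbreak(),
    # and stdscr.keypad(1) on our behalf.
    stdscr.addstr(0, 0, "Press any key to exit")
    stdscr.refresh()
    stdscr.getch()
    # On return (or on an exception), the wrapper restores the
    # terminal's previous state.

if sys.stdout.isatty():
    curses.wrapper(main)
\end{verbatim}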
\section{Windows and Pads}
Windows are the basic abstraction in curses. A window object
represents a rectangular area of the screen, and supports various
methods to display text, erase it, allow the user to input strings,
and so forth.
The \code{stdscr} object returned by the \function{initscr()} function
is a window object that covers the entire screen. Many programs may
need only this single window, but you might wish to divide the screen
into smaller windows, in order to redraw or clear them separately.
The \function{newwin()} function creates a new window of a given size,
returning the new window object.
\begin{verbatim}
begin_x = 20 ; begin_y = 7
height = 5 ; width = 40
win = curses.newwin(height, width, begin_y, begin_x)
\end{verbatim}
A word about the coordinate system used in curses: coordinates are
always passed in the order \emph{y,x}, and the top-left corner of a
window is coordinate (0,0). This breaks a common convention for
handling coordinates, where the \emph{x} coordinate usually comes
first. This is an unfortunate difference from most other computer
applications, but it's been part of curses since it was first written,
and it's too late to change things now.
When you call a method to display or erase text, the effect doesn't
immediately show up on the display. This is because curses was
originally written with slow 300-baud terminal connections in mind;
with these terminals, minimizing the time required to redraw the
screen is very important. This lets curses accumulate changes to the
screen, and display them in the most efficient manner. For example,
if your program displays some characters in a window, and then clears
the window, there's no need to send the original characters because
they'd never be visible.
Accordingly, curses requires that you explicitly tell it to redraw
windows, using the \function{refresh()} method of window objects. In
practice, this doesn't really complicate programming with curses much.
Most programs go into a flurry of activity, and then pause waiting for
a keypress or some other action on the part of the user. All you have
to do is to be sure that the screen has been redrawn before pausing to
wait for user input, by simply calling \code{stdscr.refresh()} or the
\function{refresh()} method of some other relevant window.
A pad is a special case of a window; it can be larger than the actual
display screen, and only a portion of it displayed at a time.
Creating a pad simply requires the pad's height and width, while
refreshing a pad requires giving the coordinates of the on-screen
area where a subsection of the pad will be displayed.
\begin{verbatim}
pad = curses.newpad(100, 100)
# These loops fill the pad with letters; this is
# explained in the next section
for y in range(0, 100):
    for x in range(0, 100):
        try: pad.addch(y,x, ord('a') + (x*x+y*y) % 26 )
        except curses.error: pass
# Displays a section of the pad in the middle of the screen
pad.refresh( 0,0, 5,5, 20,75)
\end{verbatim}
The \function{refresh()} call displays a section of the pad in the
rectangle extending from coordinate (5,5) to coordinate (20,75) on the
screen; the upper left corner of the displayed section is coordinate
(0,0) on the pad. Beyond that difference, pads are exactly like
ordinary windows and support the same methods.
If you have multiple windows and pads on screen there is a more
efficient way to go, which will prevent annoying screen flicker at
refresh time.  Use the \method{noutrefresh()} method
of each window to update the data structure
representing the desired state of the screen; then change the physical
screen to match the desired state in one go with the function
\function{doupdate()}. The normal \method{refresh()} method calls
\function{doupdate()} as its last act.
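A sketch of the two-phase update (window geometry and text are invented
for illustration; the \code{isatty()} guard keeps the example inert when
no terminal is attached):

\begin{verbatim}
import curses
import sys

def draw(stdscr):
    # Two non-overlapping windows: newwin(nlines, ncols, begin_y, begin_x).
    top = curses.newwin(5, 40, 0, 0)
    bottom = curses.newwin(5, 40, 6, 0)
    top.addstr(0, 0, "Status window")
    bottom.addstr(0, 0, "Log window")
    # Queue the changes without touching the physical screen...
    top.noutrefresh()
    bottom.noutrefresh()
    # ...then repaint everything in a single burst, avoiding flicker.
    curses.doupdate()
    stdscr.getch()

if sys.stdout.isatty():
    curses.wrapper(draw)
\end{verbatim}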
\section{Displaying Text}
{}From a C programmer's point of view, curses may sometimes look like
a twisty maze of functions, all subtly different. For example,
\function{addstr()} displays a string at the current cursor location
in the \code{stdscr} window, while \function{mvaddstr()} moves to a
given y,x coordinate first before displaying the string.
\function{waddstr()} is just like \function{addstr()}, but allows
specifying a window to use, instead of using \code{stdscr} by default.
\function{mvwaddstr()} follows similarly.
Fortunately the Python interface hides all these details;
\code{stdscr} is a window object like any other, and methods like
\function{addstr()} accept multiple argument forms. Usually there are
four different forms.
\begin{tableii}{|c|l|}{textrm}{Form}{Description}
\lineii{\var{str} or \var{ch}}{Display the string \var{str} or
character \var{ch}}
\lineii{\var{str} or \var{ch}, \var{attr}}{Display the string \var{str} or
character \var{ch}, using attribute \var{attr}}
\lineii{\var{y}, \var{x}, \var{str} or \var{ch}}
{Move to position \var{y,x} within the window, and display \var{str}
or \var{ch}}
\lineii{\var{y}, \var{x}, \var{str} or \var{ch}, \var{attr}}
{Move to position \var{y,x} within the window, and display \var{str}
or \var{ch}, using attribute \var{attr}}
\end{tableii}
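The four forms in the table map onto calls like the following sketch
(the strings are invented for illustration, and the \code{isatty()}
guard makes the example a no-op without a terminal):

\begin{verbatim}
import curses
import sys

def demo(stdscr):
    stdscr.addstr("plain string at the cursor")               # str
    stdscr.addstr(" in bold", curses.A_BOLD)                  # str, attr
    stdscr.addstr(3, 0, "moved to row 3 first")               # y, x, str
    stdscr.addstr(4, 0, "row 4, reversed", curses.A_REVERSE)  # y, x, str, attr
    stdscr.refresh()
    stdscr.getch()

if sys.stdout.isatty():
    curses.wrapper(demo)
\end{verbatim}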
Attributes allow displaying text in highlighted forms, such as in
boldface, underline, reverse code, or in color. They'll be explained
in more detail in the next subsection.
The \function{addstr()} function takes a Python string as the value to
be displayed, while the \function{addch()} functions take a character,
which can be either a Python string of length 1, or an integer. If
it's a string, you're limited to displaying characters between 0 and
255. SVr4 curses provides constants for extension characters; these
constants are integers greater than 255. For example,
\constant{ACS_PLMINUS} is a +/- symbol, and \constant{ACS_ULCORNER} is
the upper left corner of a box (handy for drawing borders).
Windows remember where the cursor was left after the last operation,
so if you leave out the \var{y,x} coordinates, the string or character
will be displayed wherever the last operation left off. You can also
move the cursor with the \function{move(\var{y,x})} method. Because
some terminals always display a flashing cursor, you may want to
ensure that the cursor is positioned in some location where it won't
be distracting; it can be confusing to have the cursor blinking at
some apparently random location.
If your application doesn't need a blinking cursor at all, you can
call \function{curs_set(0)} to make it invisible. Equivalently, and
for compatibility with older curses versions, there's a
\function{leaveok(\var{bool})} function. When \var{bool} is true, the
curses library will attempt to suppress the flashing cursor, and you
won't need to worry about leaving it in odd locations.
\subsection{Attributes and Color}
Characters can be displayed in different ways. Status lines in a
text-based application are commonly shown in reverse video; a text
viewer may need to highlight certain words. curses supports this by
allowing you to specify an attribute for each cell on the screen.
An attribute is an integer, with each bit representing a different
attribute. You can try to display text with multiple attribute bits
set, but curses doesn't guarantee that all the possible combinations
are available, or that they're all visually distinct. That depends on
the ability of the terminal being used, so it's safest to stick to the
most commonly available attributes, listed here.
\begin{tableii}{|c|l|}{constant}{Attribute}{Description}
\lineii{A_BLINK}{Blinking text}
\lineii{A_BOLD}{Extra bright or bold text}
\lineii{A_DIM}{Half bright text}
\lineii{A_REVERSE}{Reverse-video text}
\lineii{A_STANDOUT}{The best highlighting mode available}
\lineii{A_UNDERLINE}{Underlined text}
\end{tableii}
So, to display a reverse-video status line on the top line of the
screen,
you could code:
\begin{verbatim}
stdscr.addstr(0, 0, "Current mode: Typing mode",
              curses.A_REVERSE)
stdscr.refresh()
\end{verbatim}
The curses library also supports color on those terminals that
provide it.  The most common such terminal is probably the Linux
console, followed by color xterms.
To use color, you must call the \function{start_color()} function
soon after calling \function{initscr()}, to initialize the default
color set (the \function{curses.wrapper.wrapper()} function does this
automatically). Once that's done, the \function{has_colors()}
function returns TRUE if the terminal in use can actually display
color. (Note from AMK: curses uses the American spelling
'color', instead of the Canadian/British spelling 'colour'. If you're
like me, you'll have to resign yourself to misspelling it for the sake
of these functions.)
The curses library maintains a finite number of color pairs,
containing a foreground (or text) color and a background color. You
can get the attribute value corresponding to a color pair with the
\function{color_pair()} function; this can be bitwise-OR'ed with other
attributes such as \constant{A_REVERSE}, but again, such combinations
are not guaranteed to work on all terminals.
An example, which displays a line of text using color pair 1:
\begin{verbatim}
stdscr.addstr( "Pretty text", curses.color_pair(1) )
stdscr.refresh()
\end{verbatim}
As I said before, a color pair consists of a foreground and
background color. \function{start_color()} initializes 8 basic
colors when it activates color mode. They are: 0:black, 1:red,
2:green, 3:yellow, 4:blue, 5:magenta, 6:cyan, and 7:white. The curses
module defines named constants for each of these colors:
\constant{curses.COLOR_BLACK}, \constant{curses.COLOR_RED}, and so
forth.
The \function{init_pair(\var{n, f, b})} function changes the
definition of color pair \var{n}, to foreground color \var{f} and
background color \var{b}.  Color pair 0 is hard-wired to white on black,
and cannot be changed.
Let's put all this together.  To change the definition of color pair 1
to red text on a white background, you would call:
\begin{verbatim}
curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)
\end{verbatim}
When you change a color pair, any text already displayed using that
color pair will change to the new colors. You can also display new
text in this color with:
\begin{verbatim}
stdscr.addstr(0,0, "RED ALERT!", curses.color_pair(1) )
\end{verbatim}
Very fancy terminals can change the definitions of the actual colors
to a given RGB value. This lets you change color 1, which is usually
red, to purple or blue or any other color you like. Unfortunately,
the Linux console doesn't support this, so I'm unable to try it out,
and can't provide any examples. You can check if your terminal can do
this by calling \function{can_change_color()}, which returns TRUE if
the capability is there. If you're lucky enough to have such a
talented terminal, consult your system's man pages for more
information.
\section{User Input}
The curses library itself offers only very simple input mechanisms.
Python's support adds a text-input widget that makes up for some of
this lack.
The most common way to get input to a window is to use its
\method{getch()} method; it pauses and waits for the user to hit
a key, displaying it if \function{echo()} has been called earlier.
You can optionally specify a coordinate to which the cursor should be
moved before pausing.
It's possible to change this behavior with the method
\method{nodelay()}. After \method{nodelay(1)}, \method{getch()} for
the window becomes non-blocking and returns ERR (-1) when no input is
ready. There's also a \function{halfdelay()} function, which can be
used to (in effect) set a timer on each \method{getch()}; if no input
becomes available within the number of milliseconds specified as the
argument to \function{halfdelay()}, curses throws an exception.
The \method{getch()} method returns an integer; if it's between 0 and
255, it represents the ASCII code of the key pressed. Values greater
than 255 are special keys such as Page Up, Home, or the cursor keys.
You can compare the value returned to constants such as
\constant{curses.KEY_PPAGE}, \constant{curses.KEY_HOME}, or
\constant{curses.KEY_LEFT}. Usually the main loop of your program
will look something like this:
\begin{verbatim}
while 1:
    c = stdscr.getch()
    if c == ord('p'): PrintDocument()
    elif c == ord('q'): break  # Exit the while()
    elif c == curses.KEY_HOME: x = y = 0
\end{verbatim}
The \module{curses.ascii} module supplies ASCII class membership
functions that take either integer or 1-character-string
arguments; these may be useful in writing more readable tests for
your command interpreters. It also supplies conversion functions
that take either integer or 1-character-string arguments and return
the same type. For example, \function{curses.ascii.ctrl()} returns
the control character corresponding to its argument.
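As a minimal sketch of these \module{curses.ascii} helpers (the results
follow directly from the ASCII definitions):

```python
import curses.ascii

# Membership functions accept an integer or a 1-character string.
assert curses.ascii.isdigit('7')
assert curses.ascii.isctrl(chr(1))        # Ctrl-A is a control character

# Conversion functions return the same type they are given.
assert curses.ascii.ctrl('a') == chr(1)   # the control char for 'a'
assert curses.ascii.ascii(chr(129)) == chr(1)  # strip the high bit
```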
There's also a method to retrieve an entire string,
\method{getstr()}.  It isn't used very often, because its
functionality is quite limited; the only editing keys available are
the backspace key and the Enter key, which terminates the string. It
can optionally be limited to a fixed number of characters.
\begin{verbatim}
curses.echo() # Enable echoing of characters
# Get a 15-character string, with the cursor on the top line
s = stdscr.getstr(0,0, 15)
\end{verbatim}
The Python \module{curses.textpad} module supplies something better.
With it, you can turn a window into a text box that supports an
Emacs-like set of keybindings.  Various methods of the \class{Textbox}
class support editing with input validation, and gathering the edit
results either with or without trailing spaces.  See the library
documentation on \module{curses.textpad} for the details.
\section{For More Information}
This HOWTO didn't cover some advanced topics, such as screen-scraping
or capturing mouse events from an xterm instance. But the Python
library page for the curses modules is now pretty complete. You
should browse it next.
If you're in doubt about the detailed behavior of any of the ncurses
entry points, consult the manual pages for your curses implementation,
whether it's ncurses or a proprietary Unix vendor's. The manual pages
will document any quirks, and provide complete lists of all the
functions, attributes, and \constant{ACS_*} characters available to
you.
Because the curses API is so large, some functions aren't supported in
the Python interface, not because they're difficult to implement, but
because no one has needed them yet. Feel free to add them and then
submit a patch. Also, we don't yet have support for the menus or
panels libraries associated with ncurses; feel free to add that.
If you write an interesting little program, feel free to contribute it
as another demo. We can always use more of them!
The ncurses FAQ: \url{http://dickey.his.com/ncurses/ncurses.faq.html}
\end{document}
\documentclass{howto}
\title{Idioms and Anti-Idioms in Python}
\release{0.00}
\author{Moshe Zadka}
\authoraddress{howto@zadka.site.co.il}
\begin{document}
\maketitle
This document is placed in the public domain.
\begin{abstract}
\noindent
This document can be considered a companion to the tutorial. It
shows how to use Python, and even more importantly, how {\em not}
to use Python.
\end{abstract}
\tableofcontents
\section{Language Constructs You Should Not Use}
While Python has relatively few gotchas compared to other languages, it
still has some constructs which are only useful in corner cases, or are
plain dangerous.
\subsection{from module import *}
\subsubsection{Inside Function Definitions}
\code{from module import *} is {\em invalid} inside function definitions.
While many versions of Python do not check for this invalidity, that does
not make it any more valid, any more than having a smart lawyer makes a
man innocent.  Do not ever use it like that.  Even in versions where it
was accepted, it made function execution slower, because the compiler
could not be certain which names were local and which were global.  In
Python 2.1 this construct causes warnings, and sometimes even errors.
\subsubsection{At Module Level}
While it is valid to use \code{from module import *} at module level it
is usually a bad idea. For one, this loses an important property Python
otherwise has --- you can know where each toplevel name is defined by
a simple "search" function in your favourite editor. You also open yourself
to trouble in the future, if some module grows additional functions or
classes.
One of the most awful questions asked on the newsgroup is why this code:
\begin{verbatim}
f = open("www")
f.read()
\end{verbatim}
does not work. Of course, it works just fine (assuming you have a file
called "www".) But it does not work if somewhere in the module, the
statement \code{from os import *} is present. The \module{os} module
has a function called \function{open()} which returns an integer. While
it is very useful, shadowing builtins is one of its least useful properties.
Remember, you can never know for sure what names a module exports, so either
take what you need --- \code{from module import name1, name2}, or keep them in
the module and access on a per-need basis ---
\code{import module;print module.name}.
\subsubsection{When It Is Just Fine}
There are situations in which \code{from module import *} is just fine:
\begin{itemize}
\item The interactive prompt. For example, \code{from math import *} makes
Python an amazing scientific calculator.
\item When extending a module in C with a module in Python.
\item When the module advertises itself as \code{from import *} safe.
\end{itemize}
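For instance, a harmless sketch of the calculator use:

```python
from math import *     # fine at the interactive prompt

# All of math's names are now usable unqualified.
result = sqrt(2) * cos(pi / 4)
print(result)          # approximately 1.0
```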
\subsection{Unadorned \keyword{exec}, \function{execfile} and friends}
The word ``unadorned'' refers to the use without an explicit dictionary,
in which case those constructs evaluate code in the {\em current} environment.
This is dangerous for the same reasons \code{from import *} is dangerous ---
it might step over variables you are counting on and mess up things for
the rest of your code. Simply do not do that.
Bad examples:
\begin{verbatim}
>>> for name in sys.argv[1:]:
...     exec "%s=1" % name

>>> def func(s, **kw):
...     for var, val in kw.items():
...         exec "s.%s=val" % var  # invalid!

>>> execfile("handler.py")
>>> handle()
\end{verbatim}
Good examples:
\begin{verbatim}
>>> d = {}
>>> for name in sys.argv[1:]:
...     d[name] = 1

>>> def func(s, **kw):
...     for var, val in kw.items():
...         setattr(s, var, val)

>>> d = {}
>>> execfile("handle.py", d, d)
>>> handle = d['handle']
>>> handle()
\end{verbatim}
\subsection{from module import name1, name2}
This is a ``don't'' which is much weaker than the previous ``don't''s,
but is still something you should not do without good reason.
The reason it is usually a bad idea is that you suddenly
have an object which lives in two separate namespaces.  When the binding
in one namespace changes, the binding in the other will not, so there
will be a discrepancy between them.  This happens when, for example,
one module is reloaded, or changes the definition of a function at runtime.
Bad example:
\begin{verbatim}
# foo.py
a = 1

# bar.py
from foo import a
if something():
    a = 2    # danger: foo.a != a
\end{verbatim}
Good example:
\begin{verbatim}
# foo.py
a = 1

# bar.py
import foo
if something():
    foo.a = 2
\end{verbatim}
\subsection{except:}
Python has the \code{except:} clause, which catches all exceptions.
Since {\em every} error in Python raises an exception, this makes many
programming errors look like runtime problems, and hinders
the debugging process.
The following code shows a great example:
\begin{verbatim}
try:
    foo = opne("file")  # misspelled "open"
except:
    sys.exit("could not open file!")
\end{verbatim}
The second line triggers a \exception{NameError} which is caught by the
except clause. The program will exit, and you will have no idea that
this has nothing to do with the readability of \code{"file"}.
The example above is better written
\begin{verbatim}
try:
    foo = opne("file")  # will be changed to "open" as soon as we run it
except IOError:
    sys.exit("could not open file")
\end{verbatim}
There are some situations in which the \code{except:} clause is useful:
for example, in a framework when running callbacks, it is good not to
let any callback disturb the framework.
\section{Exceptions}
Exceptions are a useful feature of Python. You should learn to raise
them whenever something unexpected occurs, and catch them only where
you can do something about them.
The following is a very popular anti-idiom
\begin{verbatim}
def get_status(file):
    if not os.path.exists(file):
        print "file not found"
        sys.exit(1)
    return open(file).readline()
\end{verbatim}
Consider the case where the file gets deleted between the time the call to
\function{os.path.exists} is made and the time \function{open} is called.
In that case the last line will throw an \exception{IOError}.  The same would
happen if \var{file} exists but has no read permission.  Since testing this
on a normal machine on existing and non-existing files makes it seem bugless,
the test results will seem fine, and the code will get
shipped.  Then an unhandled \exception{IOError} escapes to the user, who
has to watch the ugly traceback.
Here is a better way to do it.
\begin{verbatim}
def get_status(file):
    try:
        return open(file).readline()
    except (IOError, OSError):
        print "file not found"
        sys.exit(1)
\end{verbatim}
In this version, {\em either} the file gets opened and the line is read
(so it works even on flaky NFS or SMB connections), or the message
is printed and the application aborted.
Still, \function{get_status} makes too many assumptions --- that it
will only be used in a short running script, and not, say, in a long
running server. Sure, the caller could do something like
\begin{verbatim}
try:
    status = get_status(log)
except SystemExit:
    status = None
\end{verbatim}
So, use as few \code{except} clauses in your code as possible --- those
will usually be a catch-all in \function{main}, or guard calls which
should always succeed.
So, the best version is probably
\begin{verbatim}
def get_status(file):
    return open(file).readline()
\end{verbatim}
The caller can deal with the exception if it wants (for example, if it
tries several files in a loop), or just let the exception filter upwards
to {\em its} caller.
The last version is not very good either --- due to implementation
details, when an exception is raised the file would not be closed until
the handler finishes, and perhaps not at all in non-C implementations
(e.g., Jython).  A better version closes the file explicitly:
\begin{verbatim}
def get_status(file):
    fp = open(file)
    try:
        return fp.readline()
    finally:
        fp.close()
\end{verbatim}
\section{Using the Batteries}
Every so often, people seem to be rewriting things that are already in
the Python library, usually poorly.  While the occasional module has a
poor interface, it is usually much better to use the rich standard
library and data types that come with Python than to invent your own.

A useful module very few people know about is \module{os.path}.  It
always has the correct path arithmetic for your operating system, and
will usually be much better than whatever you come up with yourself.
Compare:
\begin{verbatim}
# ugh!
return dir+"/"+file
# better
return os.path.join(dir, file)
\end{verbatim}
More useful functions in \module{os.path}: \function{basename},
\function{dirname} and \function{splitext}.
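A short sketch of those helpers (the file name here is made up):

```python
import os.path

path = os.path.join("docs", "howto.tex")   # hypothetical path
print(os.path.basename(path))              # 'howto.tex'
print(os.path.splitext("howto.tex"))       # ('howto', '.tex')
```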
There are also many useful builtin functions people seem not to be
aware of for some reason: \function{min()} and \function{max()} can
find the minimum/maximum of any sequence with comparable semantics,
for example, yet many people write their own max/min.  Another highly
useful function is \function{reduce()}.  A classical use of
\function{reduce()} is something like
\begin{verbatim}
import sys, operator
nums = map(float, sys.argv[1:])
print reduce(operator.add, nums)/len(nums)
\end{verbatim}
This cute little script prints the average of all numbers given on the
command line. The \function{reduce()} adds up all the numbers, and
the rest is just some pre- and postprocessing.
In the same vein, note that \function{float()}, \function{int()} and
\function{long()} all accept arguments of type string, and so are
suited to parsing --- assuming you are ready to deal with the
\exception{ValueError} they raise.
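A sketch of this parsing pattern (\function{parse_number} is a made-up
helper, not part of the library):

```python
def parse_number(s):
    # int() and float() accept strings, raising ValueError on bad input.
    try:
        return int(s)
    except ValueError:
        return float(s)   # may itself raise ValueError for junk

print(parse_number("42"))    # 42
print(parse_number("3.5"))   # 3.5
```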
\section{Using Backslash to Continue Statements}
Since Python treats a newline as a statement terminator,
and since statements are often more than is comfortable to put
in one line, many people do:
\begin{verbatim}
if foo.bar()['first'][0] == baz.quux(1, 2)[5:9] and \
   calculate_number(10, 20) != forbulate(500, 360):
    pass
\end{verbatim}
You should realize that this is dangerous: a stray space after the
\code{\\} would make this line wrong, and stray spaces are notoriously
hard to see in editors. In this case, at least it would be a syntax
error, but if the code was:
\begin{verbatim}
value = foo.bar()['first'][0]*baz.quux(1, 2)[5:9] \
        + calculate_number(10, 20)*forbulate(500, 360)
\end{verbatim}
then it would just be subtly wrong.
It is usually much better to use the implicit continuation inside
parentheses; this version is bulletproof:
\begin{verbatim}
value = (foo.bar()['first'][0]*baz.quux(1, 2)[5:9]
         + calculate_number(10, 20)*forbulate(500, 360))
\end{verbatim}
\end{document}
\documentclass{howto}
% TODO:
% Document lookbehind assertions
% Better way of displaying a RE, a string, and what it matches
% Mention optional argument to match.groups()
% Unicode (at least a reference)
\title{Regular Expression HOWTO}
\release{0.05}
\author{A.M. Kuchling}
\authoraddress{\email{amk@amk.ca}}
\begin{document}
\maketitle
\begin{abstract}
\noindent
This document is an introductory tutorial to using regular expressions
in Python with the \module{re} module. It provides a gentler
introduction than the corresponding section in the Library Reference.
This document is available from
\url{http://www.amk.ca/python/howto}.
\end{abstract}
\tableofcontents
\section{Introduction}
The \module{re} module was added in Python 1.5, and provides
Perl-style regular expression patterns. Earlier versions of Python
came with the \module{regex} module, which provides Emacs-style
patterns. Emacs-style patterns are slightly less readable and
don't provide as many features, so there's not much reason to use
the \module{regex} module when writing new code, though you might
encounter old code that uses it.
Regular expressions (or REs) are essentially a tiny, highly
specialized programming language embedded inside Python and made
available through the \module{re} module. Using this little language,
you specify the rules for the set of possible strings that you want to
match; this set might contain English sentences, or e-mail addresses,
or TeX commands, or anything you like. You can then ask questions
such as ``Does this string match the pattern?'', or ``Is there a match
for the pattern anywhere in this string?''. You can also use REs to
modify a string or to split it apart in various ways.
Regular expression patterns are compiled into a series of bytecodes
which are then executed by a matching engine written in C. For
advanced use, it may be necessary to pay careful attention to how the
engine will execute a given RE, and write the RE in a certain way in
order to produce bytecode that runs faster. Optimization isn't
covered in this document, because it requires that you have a good
understanding of the matching engine's internals.
The regular expression language is relatively small and restricted, so
not all possible string processing tasks can be done using regular
expressions. There are also tasks that \emph{can} be done with
regular expressions, but the expressions turn out to be very
complicated. In these cases, you may be better off writing Python
code to do the processing; while Python code will be slower than an
elaborate regular expression, it will also probably be more understandable.
\section{Simple Patterns}
We'll start by learning about the simplest possible regular
expressions. Since regular expressions are used to operate on
strings, we'll begin with the most common task: matching characters.
For a detailed explanation of the computer science underlying regular
expressions (deterministic and non-deterministic finite automata), you
can refer to almost any textbook on writing compilers.
\subsection{Matching Characters}
Most letters and characters will simply match themselves. For
example, the regular expression \regexp{test} will match the string
\samp{test} exactly. (You can enable a case-insensitive mode that
would let this RE match \samp{Test} or \samp{TEST} as well; more
about this later.)
There are exceptions to this rule; some characters are
special, and don't match themselves. Instead, they signal that some
out-of-the-ordinary thing should be matched, or they affect other
portions of the RE by repeating them. Much of this document is
devoted to discussing various metacharacters and what they do.
Here's a complete list of the metacharacters; their meanings will be
discussed in the rest of this HOWTO.
\begin{verbatim}
. ^ $ * + ? { [ ] \ | ( )
\end{verbatim}
% $
The first metacharacters we'll look at are \samp{[} and \samp{]}.
They're used for specifying a character class, which is a set of
characters that you wish to match. Characters can be listed
individually, or a range of characters can be indicated by giving two
characters and separating them by a \character{-}. For example,
\regexp{[abc]} will match any of the characters \samp{a}, \samp{b}, or
\samp{c}; this is the same as
\regexp{[a-c]}, which uses a range to express the same set of
characters. If you wanted to match only lowercase letters, your
RE would be \regexp{[a-z]}.
Metacharacters are not active inside classes. For example,
\regexp{[akm\$]} will match any of the characters \character{a},
\character{k}, \character{m}, or \character{\$}; \character{\$} is
usually a metacharacter, but inside a character class it's stripped of
its special nature.
You can match the characters not within a range by \dfn{complementing}
the set. This is indicated by including a \character{\^} as the first
character of the class; \character{\^} elsewhere will simply match the
\character{\^} character. For example, \verb|[^5]| will match any
character except \character{5}.
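A brief sketch of these classes in action:

```python
import re

# [abc] (or the range form [a-c]) matches one of 'a', 'b', or 'c'.
assert re.match('[abc]', 'apple')
assert re.match('[a-c]', 'banana')
assert not re.match('[abc]', 'grape')

# [^5] is a complemented class: any character except '5'.
assert re.match('[^5]', '6')
assert not re.match('[^5]', '5')
```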
Perhaps the most important metacharacter is the backslash, \samp{\e}.
As in Python string literals, the backslash can be followed by various
characters to signal various special sequences. It's also used to escape
all the metacharacters so you can still match them in patterns; for
example, if you need to match a \samp{[} or
\samp{\e}, you can precede them with a backslash to remove their
special meaning: \regexp{\e[} or \regexp{\e\e}.
Some of the special sequences beginning with \character{\e} represent
predefined sets of characters that are often useful, such as the set
of digits, the set of letters, or the set of anything that isn't
whitespace. The following predefined special sequences are available:
\begin{itemize}
\item[\code{\e d}]Matches any decimal digit; this is
equivalent to the class \regexp{[0-9]}.
\item[\code{\e D}]Matches any non-digit character; this is
equivalent to the class \verb|[^0-9]|.
\item[\code{\e s}]Matches any whitespace character; this is
equivalent to the class \regexp{[ \e t\e n\e r\e f\e v]}.
\item[\code{\e S}]Matches any non-whitespace character; this is
equivalent to the class \verb|[^ \t\n\r\f\v]|.
\item[\code{\e w}]Matches any alphanumeric character; this is equivalent to the class
\regexp{[a-zA-Z0-9_]}.
\item[\code{\e W}]Matches any non-alphanumeric character; this is equivalent to the class
\verb|[^a-zA-Z0-9_]|.
\end{itemize}
These sequences can be included inside a character class. For
example, \regexp{[\e s,.]} is a character class that will match any
whitespace character, or \character{,} or \character{.}.
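A small sketch of these sequences (note the raw strings, explained
later in this document):

```python
import re

assert re.match(r'\d', '7')       # digit
assert re.match(r'\s', '\t')      # whitespace
assert re.match(r'\w', '_')       # alphanumeric or underscore
assert not re.match(r'\W', '_')

# Sequences also work inside a class: [\s,.]
assert re.match(r'[\s,.]', ',')
assert not re.match(r'[\s,.]', 'x')
```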
The final metacharacter in this section is \regexp{.}. It matches
anything except a newline character, and there's an alternate mode
(\code{re.DOTALL}) where it will match even a newline. \character{.}
is often used where you want to match ``any character''.
\subsection{Repeating Things}
Being able to match varying sets of characters is the first thing
regular expressions can do that isn't already possible with the
methods available on strings. However, if that was the only
additional capability of regexes, they wouldn't be much of an advance.
Another capability is that you can specify that portions of the RE
must be repeated a certain number of times.
The first metacharacter for repeating things that we'll look at is
\regexp{*}. \regexp{*} doesn't match the literal character \samp{*};
instead, it specifies that the previous character can be matched zero
or more times, instead of exactly once.
For example, \regexp{ca*t} will match \samp{ct} (0 \samp{a}
characters), \samp{cat} (1 \samp{a}), \samp{caaat} (3 \samp{a}
characters), and so forth. The RE engine has various internal
limitations stemming from the size of C's \code{int} type, that will
prevent it from matching over 2 billion \samp{a} characters; you
probably don't have enough memory to construct a string that large, so
you shouldn't run into that limit.
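These examples can be checked directly at the interpreter; a quick sketch:

```python
import re

pattern = re.compile('ca*t')
for s in ('ct', 'cat', 'caaat'):
    assert pattern.match(s)       # zero, one, or many 'a's

assert not pattern.match('cbt')   # '*' applies only to the preceding 'a'
```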
Repetitions such as \regexp{*} are \dfn{greedy}; when repeating a RE,
the matching engine will try to repeat it as many times as possible.
If later portions of the pattern don't match, the matching engine will
then back up and try again with fewer repetitions.
A step-by-step example will make this more obvious. Let's consider
the expression \regexp{a[bcd]*b}. This matches the letter
\character{a}, zero or more letters from the class \code{[bcd]}, and
finally ends with a \character{b}. Now imagine matching this RE
against the string \samp{abcbd}.
\begin{tableiii}{c|l|l}{}{Step}{Matched}{Explanation}
\lineiii{1}{\code{a}}{The \regexp{a} in the RE matches.}
\lineiii{2}{\code{abcbd}}{The engine matches \regexp{[bcd]*}, going as far as
it can, which is to the end of the string.}
\lineiii{3}{\emph{Failure}}{The engine tries to match \regexp{b}, but the
current position is at the end of the string, so it fails.}
\lineiii{4}{\code{abcb}}{Back up, so that \regexp{[bcd]*} matches
one less character.}
\lineiii{5}{\emph{Failure}}{Try \regexp{b} again, but the
current position is at the last character, which is a \character{d}.}
\lineiii{6}{\code{abc}}{Back up again, so that \regexp{[bcd]*} is
only matching \samp{bc}.}
\lineiii{7}{\code{abcb}}{Try \regexp{b} again.  This time
the character at the current position is \character{b}, so it succeeds.}
\end{tableiii}
The end of the RE has now been reached, and it has matched
\samp{abcb}. This demonstrates how the matching engine goes as far as
it can at first, and if no match is found it will then progressively
back up and retry the rest of the RE again and again. It will back up
until it has tried zero matches for \regexp{[bcd]*}, and if that
subsequently fails, the engine will conclude that the string doesn't
match the RE at all.
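The table's outcome can be verified directly; the greedy match backs up
and ends at the fourth character, not at the end of the string:

```python
import re

m = re.match('a[bcd]*b', 'abcbd')
# [bcd]* first consumes 'bcbd', then the engine backs up until
# the final 'b' of the pattern can match.
assert m.group() == 'abcb'
```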
Another repeating metacharacter is \regexp{+}, which matches one or
more times. Pay careful attention to the difference between
\regexp{*} and \regexp{+}; \regexp{*} matches \emph{zero} or more
times, so whatever's being repeated may not be present at all, while
\regexp{+} requires at least \emph{one} occurrence. To use a similar
example, \regexp{ca+t} will match \samp{cat} (1 \samp{a}),
\samp{caaat} (3 \samp{a}'s), but won't match \samp{ct}.
There are two more repeating qualifiers. The question mark character,
\regexp{?}, matches either once or zero times; you can think of it as
marking something as being optional. For example, \regexp{home-?brew}
matches either \samp{homebrew} or \samp{home-brew}.
The most complicated repeating qualifier is
\regexp{\{\var{m},\var{n}\}}, where \var{m} and \var{n} are decimal
integers. This qualifier means there must be at least \var{m}
repetitions, and at most \var{n}. For example, \regexp{a/\{1,3\}b}
will match \samp{a/b}, \samp{a//b}, and \samp{a///b}. It won't match
\samp{ab}, which has no slashes, or \samp{a////b}, which has four.
You can omit either \var{m} or \var{n}; in that case, a reasonable
value is assumed for the missing value. Omitting \var{m} is
interpreted as a lower limit of 0, while omitting \var{n} results in an
upper bound of infinity --- actually, the 2 billion limit mentioned
earlier, but that might as well be infinity.
Readers of a reductionist bent may notice that the three other qualifiers
can all be expressed using this notation. \regexp{\{0,\}} is the same
as \regexp{*}, \regexp{\{1,\}} is equivalent to \regexp{+}, and
\regexp{\{0,1\}} is the same as \regexp{?}. It's better to use
\regexp{*}, \regexp{+}, or \regexp{?} when you can, simply because
they're shorter and easier to read.
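A sketch confirming the equivalences, reusing the \regexp{ca*t}-style
examples from earlier:

```python
import re

# {0,} behaves like *, {1,} like +, and {0,1} like ?.
for s in ('ct', 'cat', 'caaat'):
    assert bool(re.match('ca{0,}t', s)) == bool(re.match('ca*t', s))
    assert bool(re.match('ca{1,}t', s)) == bool(re.match('ca+t', s))
    assert bool(re.match('ca{0,1}t', s)) == bool(re.match('ca?t', s))
```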
\section{Using Regular Expressions}
Now that we've looked at some simple regular expressions, how do we
actually use them in Python? The \module{re} module provides an
interface to the regular expression engine, allowing you to compile
REs into objects and then perform matches with them.
\subsection{Compiling Regular Expressions}
Regular expressions are compiled into \class{RegexObject} instances,
which have methods for various operations such as searching for
pattern matches or performing string substitutions.
\begin{verbatim}
>>> import re
>>> p = re.compile('ab*')
>>> print p
<re.RegexObject instance at 80b4150>
\end{verbatim}
\function{re.compile()} also accepts an optional \var{flags}
argument, used to enable various special features and syntax
variations. We'll go over the available settings later, but for now a
single example will do:
\begin{verbatim}
>>> p = re.compile('ab*', re.IGNORECASE)
\end{verbatim}
The RE is passed to \function{re.compile()} as a string. REs are
handled as strings because regular expressions aren't part of the core
Python language, and no special syntax was created for expressing
them. (There are applications that don't need REs at all, so there's
no need to bloat the language specification by including them.)
Instead, the \module{re} module is simply a C extension module
included with Python, just like the \module{socket} or \module{zlib}
module.
Putting REs in strings keeps the Python language simpler, but has one
disadvantage, which is the topic of the next section.
\subsection{The Backslash Plague}
As stated earlier, regular expressions use the backslash
character (\character{\e}) to indicate special forms or to allow
special characters to be used without invoking their special meaning.
This conflicts with Python's usage of the same character for the same
purpose in string literals.
Let's say you want to write a RE that matches the string
\samp{{\e}section}, which might be found in a \LaTeX\ file. To figure
out what to write in the program code, start with the desired string
to be matched. Next, you must escape any backslashes and other
metacharacters by preceding them with a backslash, resulting in the
string \samp{\e\e section}. The string passed to
\function{re.compile()} must therefore be \verb|\\section|. However, to
express this as a Python string literal, both backslashes must be
escaped \emph{again}.
\begin{tableii}{c|l}{code}{Characters}{Stage}
\lineii{\e section}{Text string to be matched}
\lineii{\e\e section}{Escaped backslash for \function{re.compile}}
\lineii{"\e\e\e\e section"}{Escaped backslashes for a string literal}
\end{tableii}
In short, to match a literal backslash, one has to write
\code{'\e\e\e\e'} as the RE string, because the regular expression
must be \samp{\e\e}, and each backslash must be expressed as
\samp{\e\e} inside a regular Python string literal. In REs that
feature backslashes repeatedly, this leads to lots of repeated
backslashes and makes the resulting strings difficult to understand.
The solution is to use Python's raw string notation for regular
expressions; backslashes are not handled in any special way in
a string literal prefixed with \character{r}, so \code{r"\e n"} is a
two-character string containing \character{\e} and \character{n},
while \code{"\e n"} is a one-character string containing a newline.
Frequently regular expressions will be expressed in Python
code using this raw string notation.
\begin{tableii}{c|c}{code}{Regular String}{Raw String}
\lineii{"ab*"}{\code{r"ab*"}}
\lineii{"\e\e\e\e section"}{\code{r"\e\e section"}}
\lineii{"\e\e w+\e\e s+\e\e 1"}{\code{r"\e w+\e s+\e 1"}}
\end{tableii}
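A short interactive check confirms that the plain and raw spellings describe the same pattern; both of the compiled objects below match a single literal backslash:

```python
import re

# Four backslashes in a regular string literal, or two in a raw
# string: both reach the RE engine as \\ and match one backslash.
plain = re.compile('\\\\')
raw = re.compile(r'\\')

# r'a\b' is the three-character string a, backslash, b.
assert plain.search(r'a\b').group() == '\\'
assert raw.search(r'a\b').group() == '\\'
```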
\subsection{Performing Matches}
Once you have an object representing a compiled regular expression,
what do you do with it? \class{RegexObject} instances have several
methods and attributes. Only the most significant ones will be
covered here; consult \ulink{the Library
Reference}{http://www.python.org/doc/lib/module-re.html} for a
complete listing.
\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
\lineii{match()}{Determine if the RE matches at the beginning of
the string.}
\lineii{search()}{Scan through a string, looking for any location
where this RE matches.}
\lineii{findall()}{Find all substrings where the RE matches,
and return them as a list.}
\lineii{finditer()}{Find all substrings where the RE matches,
and return them as an iterator.}
\end{tableii}
\method{match()} and \method{search()} return \code{None} if no match
can be found. If they're successful, a \class{MatchObject} instance is
returned, containing information about the match: where it starts and
ends, the substring it matched, and more.
You can learn about this by interactively experimenting with the
\module{re} module. If you have Tkinter available, you may also want
to look at \file{Tools/scripts/redemo.py}, a demonstration program
included with the Python distribution. It allows you to enter REs and
strings, and displays whether the RE matches or fails.
\file{redemo.py} can be quite useful when trying to debug a
complicated RE. Phil Schwartz's
\ulink{Kodos}{http://kodos.sourceforge.net} is also an interactive
tool for developing and testing RE patterns. This HOWTO will use the
standard Python interpreter for its examples.
First, run the Python interpreter, import the \module{re} module, and
compile a RE:
\begin{verbatim}
Python 2.2.2 (#1, Feb 10 2003, 12:57:01)
>>> import re
>>> p = re.compile('[a-z]+')
>>> p
<_sre.SRE_Pattern object at 80c3c28>
\end{verbatim}
Now, you can try matching various strings against the RE
\regexp{[a-z]+}. An empty string shouldn't match at all, since
\regexp{+} means 'one or more repetitions'. \method{match()} should
return \code{None} in this case, which will cause the interpreter to
print no output. You can explicitly print the result of
\method{match()} to make this clear.
\begin{verbatim}
>>> p.match("")
>>> print p.match("")
None
\end{verbatim}
Now, let's try it on a string that it should match, such as
\samp{tempo}. In this case, \method{match()} will return a
\class{MatchObject}, so you should store the result in a variable for
later use.
\begin{verbatim}
>>> m = p.match( 'tempo')
>>> print m
<_sre.SRE_Match object at 80c4f68>
\end{verbatim}
Now you can query the \class{MatchObject} for information about the
matching string. \class{MatchObject} instances also have several
methods and attributes; the most important ones are:
\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
\lineii{group()}{Return the string matched by the RE}
\lineii{start()}{Return the starting position of the match}
\lineii{end()}{Return the ending position of the match}
\lineii{span()}{Return a tuple containing the (start, end) positions
of the match}
\end{tableii}
Trying these methods will soon clarify their meaning:
\begin{verbatim}
>>> m.group()
'tempo'
>>> m.start(), m.end()
(0, 5)
>>> m.span()
(0, 5)
\end{verbatim}
\method{group()} returns the substring that was matched by the
RE. \method{start()} and \method{end()} return the starting and
ending index of the match. \method{span()} returns both start and end
indexes in a single tuple. Since the \method{match} method only
checks if the RE matches at the start of a string,
\method{start()} will always be zero. However, the \method{search}
method of \class{RegexObject} instances scans through the string, so
the match may not start at zero in that case.
\begin{verbatim}
>>> print p.match('::: message')
None
>>> m = p.search('::: message') ; print m
<re.MatchObject instance at 80c9650>
>>> m.group()
'message'
>>> m.span()
(4, 11)
\end{verbatim}
In actual programs, the most common style is to store the
\class{MatchObject} in a variable, and then check if it was
\code{None}. This usually looks like:
\begin{verbatim}
p = re.compile( ... )
m = p.match( 'string goes here' )
if m:
print 'Match found: ', m.group()
else:
print 'No match'
\end{verbatim}
Two \class{RegexObject} methods return all of the matches for a pattern.
\method{findall()} returns a list of matching strings:
\begin{verbatim}
>>> p = re.compile('\d+')
>>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
['12', '11', '10']
\end{verbatim}
\method{findall()} has to create the entire list before it can be
returned as the result. In Python 2.2, the \method{finditer()} method
is also available, returning a sequence of \class{MatchObject} instances
as an iterator.
\begin{verbatim}
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> iterator
<callable-iterator object at 0x401833ac>
>>> for match in iterator:
... print match.span()
...
(0, 2)
(22, 24)
(29, 31)
\end{verbatim}
\subsection{Module-Level Functions}
You don't have to produce a \class{RegexObject} and call its methods;
the \module{re} module also provides top-level functions called
\function{match()}, \function{search()}, \function{sub()}, and so
forth. These functions take the same arguments as the corresponding
\class{RegexObject} method, with the RE string added as the first
argument, and still return either \code{None} or a \class{MatchObject}
instance.
\begin{verbatim}
>>> print re.match(r'From\s+', 'Fromage amk')
None
>>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')
<re.MatchObject instance at 80c5978>
\end{verbatim}
Under the hood, these functions simply produce a \class{RegexObject}
for you and call the appropriate method on it. They also store the
compiled object in a cache, so future calls using the same
RE are faster.
Should you use these module-level functions, or should you get the
\class{RegexObject} and call its methods yourself? That choice
depends on how frequently the RE will be used, and on your personal
coding style. If a RE is being used at only one point in the code,
then the module functions are probably more convenient. If a program
contains a lot of regular expressions, or re-uses the same ones in
several locations, then it might be worthwhile to collect all the
definitions in one place, in a section of code that compiles all the
REs ahead of time. To take an example from the standard library,
here's an extract from \file{xmllib.py}:
\begin{verbatim}
ref = re.compile( ... )
entityref = re.compile( ... )
charref = re.compile( ... )
starttagopen = re.compile( ... )
\end{verbatim}
I generally prefer to work with the compiled object, even for
one-time uses, but few people will be as much of a purist about this
as I am.
\subsection{Compilation Flags}
Compilation flags let you modify some aspects of how regular
expressions work. Flags are available in the \module{re} module under
two names, a long name such as \constant{IGNORECASE}, and a short,
one-letter form such as \constant{I}. (If you're familiar with Perl's
pattern modifiers, the one-letter forms use the same letters; the
short form of \constant{re.VERBOSE} is \constant{re.X}, for example.)
Multiple flags can be specified by bitwise OR-ing them; \code{re.I |
re.M} sets both the \constant{I} and \constant{M} flags, for example.
Here's a table of the available flags, followed by
a more detailed explanation of each one.
\begin{tableii}{c|l}{}{Flag}{Meaning}
\lineii{\constant{DOTALL}, \constant{S}}{Make \regexp{.} match any
character, including newlines}
\lineii{\constant{IGNORECASE}, \constant{I}}{Do case-insensitive matches}
\lineii{\constant{LOCALE}, \constant{L}}{Do a locale-aware match}
\lineii{\constant{MULTILINE}, \constant{M}}{Multi-line matching,
affecting \regexp{\^} and \regexp{\$}}
\lineii{\constant{VERBOSE}, \constant{X}}{Enable verbose REs,
which can be organized more cleanly and understandably.}
\end{tableii}
\begin{datadesc}{I}
\dataline{IGNORECASE}
Perform case-insensitive matching; character classes and literal
strings will match letters by ignoring case. For example,
\regexp{[A-Z]} will match
lowercase letters, too, and \regexp{Spam} will match \samp{Spam},
\samp{spam}, or \samp{spAM}.
This lowercasing doesn't take the current locale into account; it will
if you also set the \constant{LOCALE} flag.
\end{datadesc}
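The effect is easy to see interactively; both the literal pattern and the character class ignore case once the flag is set:

```python
import re

p = re.compile('spam', re.IGNORECASE)

# The literal pattern matches regardless of case.
assert p.match('Spam') is not None
assert p.match('spAM') is not None
assert re.match('spam', 'Spam') is None  # without the flag

# Character classes ignore case too: [A-Z] also matches lowercase.
assert re.match('[A-Z]+', 'hello', re.IGNORECASE).group() == 'hello'
```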
\begin{datadesc}{L}
\dataline{LOCALE}
Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b},
and \regexp{\e B}, dependent on the current locale.
Locales are a feature of the C library intended to help in writing
programs that take account of language differences. For example, if
you're processing French text, you'd want to be able to write
\regexp{\e w+} to match words, but \regexp{\e w} only matches the
character class \regexp{[A-Za-z]}; it won't match \character{\'e} or
\character{\c c}. If your system is configured properly and a French
locale is selected, certain C functions will tell the program that
\character{\'e} should also be considered a letter. Setting the
\constant{LOCALE} flag when compiling a regular expression will cause the
resulting compiled object to use these C functions for \regexp{\e w};
this is slower, but also enables \regexp{\e w+} to match French words as
you'd expect.
\end{datadesc}
\begin{datadesc}{M}
\dataline{MULTILINE}
(\regexp{\^} and \regexp{\$} haven't been explained yet;
they'll be introduced in section~\ref{more-metacharacters}.)
Usually \regexp{\^} matches only at the beginning of the string, and
\regexp{\$} matches only at the end of the string and immediately before the
newline (if any) at the end of the string. When this flag is
specified, \regexp{\^} matches at the beginning of the string and at
the beginning of each line within the string, immediately following
each newline. Similarly, the \regexp{\$} metacharacter matches either at
the end of the string or at the end of each line (immediately
preceding each newline).
\end{datadesc}
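A short demonstration of the flag's effect on \regexp{\^} and \regexp{\$}:

```python
import re

text = 'first line\nsecond line'

# Without MULTILINE, ^ matches only at the very start of the string.
assert re.findall(r'^\w+', text) == ['first']

# With MULTILINE, ^ also matches just after each newline.
assert re.findall(r'^\w+', text, re.MULTILINE) == ['first', 'second']

# Similarly, $ matches before each newline as well as at the end.
assert re.findall(r'\w+$', text, re.MULTILINE) == ['line', 'line']
```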
\begin{datadesc}{S}
\dataline{DOTALL}
Makes the \character{.} special character match any character at all,
including a newline; without this flag, \character{.} will match
anything \emph{except} a newline.
\end{datadesc}
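For example:

```python
import re

# Without DOTALL, . stops at the newline.
assert re.match(r'.+', 'one\ntwo').group() == 'one'

# With DOTALL, . matches the newline too, so .+ consumes everything.
assert re.match(r'.+', 'one\ntwo', re.DOTALL).group() == 'one\ntwo'
```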
\begin{datadesc}{X}
\dataline{VERBOSE} This flag allows you to write regular expressions
that are more readable by granting you more flexibility in how you can
format them. When this flag has been specified, whitespace within the
RE string is ignored, except when the whitespace is in a character
class or preceded by an unescaped backslash; this lets you organize
and indent the RE more clearly. It also enables you to put comments
within a RE that will be ignored by the engine; comments are marked by
a \character{\#} that's neither in a character class nor preceded by an
unescaped backslash.
For example, here's a RE that uses \constant{re.VERBOSE}; see how
much easier it is to read?
\begin{verbatim}
charref = re.compile(r"""
&[#] # Start of a numeric entity reference
(
[0-9]+[^0-9] # Decimal form
| 0[0-7]+[^0-7] # Octal form
| x[0-9a-fA-F]+[^0-9a-fA-F] # Hexadecimal form
)
""", re.VERBOSE)
\end{verbatim}
Without the verbose setting, the RE would look like this:
\begin{verbatim}
charref = re.compile("&#([0-9]+[^0-9]"
"|0[0-7]+[^0-7]"
"|x[0-9a-fA-F]+[^0-9a-fA-F])")
\end{verbatim}
In the above example, Python's automatic concatenation of string
literals has been used to break up the RE into smaller pieces, but
it's still more difficult to understand than the version using
\constant{re.VERBOSE}.
\end{datadesc}
\section{More Pattern Power}
So far we've only covered a part of the features of regular
expressions. In this section, we'll cover some new metacharacters,
and how to use groups to retrieve portions of the text that was matched.
\subsection{More Metacharacters\label{more-metacharacters}}
There are some metacharacters that we haven't covered yet. Most of
them will be covered in this section.
Some of the remaining metacharacters to be discussed are
\dfn{zero-width assertions}. They don't cause the engine to advance
through the string; instead, they consume no characters at all,
and simply succeed or fail. For example, \regexp{\e b} is an
assertion that the current position is located at a word boundary; the
position isn't changed by the \regexp{\e b} at all. This means that
zero-width assertions should never be repeated, because if they match
once at a given location, they can obviously be matched an infinite
number of times.
\begin{list}{}{}
\item[\regexp{|}]
Alternation, or the ``or'' operator.
If A and B are regular expressions,
\regexp{A|B} will match any string that matches either \samp{A} or \samp{B}.
\regexp{|} has very low precedence in order to make it work reasonably when
you're alternating multi-character strings.
\regexp{Crow|Servo} will match either \samp{Crow} or \samp{Servo}, not
\samp{Cro}, a \character{w} or an \character{S}, and \samp{ervo}.
To match a literal \character{|},
use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}.
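The low precedence of \regexp{|} is easy to confirm interactively; the alternation applies to the whole strings on each side, not just the adjacent characters:

```python
import re

p = re.compile('Crow|Servo')

# Whole alternatives are tried, so only the complete words match.
assert p.match('Crow').group() == 'Crow'
assert p.match('Servo').group() == 'Servo'
assert p.match('Crovo') is None

# A literal | can be matched by escaping it or using a character class.
assert re.match(r'a\|b', 'a|b').group() == 'a|b'
assert re.match('a[|]b', 'a|b').group() == 'a|b'
```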
\item[\regexp{\^}] Matches at the beginning of lines. Unless the
\constant{MULTILINE} flag has been set, this will only match at the
beginning of the string. In \constant{MULTILINE} mode, this also
matches immediately after each newline within the string.
For example, if you wish to match the word \samp{From} only at the
beginning of a line, the RE to use is \verb|^From|.
\begin{verbatim}
>>> print re.search('^From', 'From Here to Eternity')
<re.MatchObject instance at 80c1520>
>>> print re.search('^From', 'Reciting From Memory')
None
\end{verbatim}
%To match a literal \character{\^}, use \regexp{\e\^} or enclose it
%inside a character class, as in \regexp{[{\e}\^]}.
\item[\regexp{\$}] Matches at the end of a line, which is defined as
either the end of the string, or any location followed by a newline
character.
\begin{verbatim}
>>> print re.search('}$', '{block}')
<re.MatchObject instance at 80adfa8>
>>> print re.search('}$', '{block} ')
None
>>> print re.search('}$', '{block}\n')
<re.MatchObject instance at 80adfa8>
\end{verbatim}
% $
To match a literal \character{\$}, use \regexp{\e\$} or enclose it
inside a character class, as in \regexp{[\$]}.
\item[\regexp{\e A}] Matches only at the start of the string. When
not in \constant{MULTILINE} mode, \regexp{\e A} and \regexp{\^} are
effectively the same. In \constant{MULTILINE} mode, however, they're
different; \regexp{\e A} still matches only at the beginning of the
string, but \regexp{\^} may match at any location inside the string
that follows a newline character.
\item[\regexp{\e Z}]Matches only at the end of the string.
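The difference between \regexp{\e A} and \regexp{\^} only shows up in \constant{MULTILINE} mode, as this quick check illustrates:

```python
import re

text = 'one\ntwo'

# In MULTILINE mode, ^ matches after each newline, but \A still
# matches only at the very start of the string.
assert re.findall(r'^\w+', text, re.MULTILINE) == ['one', 'two']
assert re.findall(r'\A\w+', text, re.MULTILINE) == ['one']

# \Z anchors only at the very end of the string.
assert re.search(r'\w+\Z', text).group() == 'two'
```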
\item[\regexp{\e b}] Word boundary.
This is a zero-width assertion that matches only at the
beginning or end of a word. A word is defined as a sequence of
alphanumeric characters, so the end of a word is indicated by
whitespace or a non-alphanumeric character.
The following example matches \samp{class} only when it's a complete
word; it won't match when it's contained inside another word.
\begin{verbatim}
>>> p = re.compile(r'\bclass\b')
>>> print p.search('no class at all')
<re.MatchObject instance at 80c8f28>
>>> print p.search('the declassified algorithm')
None
>>> print p.search('one subclass is')
None
\end{verbatim}
There are two subtleties you should remember when using this special
sequence. First, this is the worst collision between Python's string
literals and regular expression sequences. In Python's string
literals, \samp{\e b} is the backspace character, ASCII value 8. If
you're not using raw strings, then Python will convert the \samp{\e b} to
a backspace, and your RE won't match as you expect it to. The
following example looks the same as our previous RE, but omits
the \character{r} in front of the RE string.
\begin{verbatim}
>>> p = re.compile('\bclass\b')
>>> print p.search('no class at all')
None
>>> print p.search('\b' + 'class' + '\b')
<re.MatchObject instance at 80c3ee0>
\end{verbatim}
Second, inside a character class, where there's no use for this
assertion, \regexp{\e b} represents the backspace character, for
compatibility with Python's string literals.
\item[\regexp{\e B}] Another zero-width assertion, this is the
opposite of \regexp{\e b}, only matching when the current
position is not at a word boundary.
\end{list}
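\regexp{\e B} inverts the earlier \regexp{\e bclass\e b} example: it matches \samp{class} only when it's buried inside another word.

```python
import re

p = re.compile(r'\Bclass\B')

# Inside another word, both edges of 'class' are non-boundaries.
assert p.search('declassified').group() == 'class'

# As a standalone word, 'class' sits at word boundaries, so \B fails.
assert p.search('no class at all') is None
```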
\subsection{Grouping}
Frequently you need to obtain more information than just whether the
RE matched or not. Regular expressions are often used to dissect
strings by writing a RE divided into several subgroups which
match different components of interest. For example, an RFC-822
header line is divided into a header name and a value, separated by a
\character{:}. This can be handled by writing a regular expression
which matches an entire header line, and has one group which matches the
header name, and another group which matches the header's value.
Groups are marked by the \character{(}, \character{)} metacharacters.
\character{(} and \character{)} have much the same meaning as they do
in mathematical expressions; they group together the expressions
contained inside them. For example, you can repeat the contents of a
group with a repeating qualifier, such as \regexp{*}, \regexp{+},
\regexp{?}, or \regexp{\{\var{m},\var{n}\}}. For example,
\regexp{(ab)*} will match zero or more repetitions of \samp{ab}.
\begin{verbatim}
>>> p = re.compile('(ab)*')
>>> print p.match('ababababab').span()
(0, 10)
\end{verbatim}
Groups indicated with \character{(}, \character{)} also capture the
starting and ending index of the text that they match; this can be
retrieved by passing an argument to \method{group()},
\method{start()}, \method{end()}, and \method{span()}. Groups are
numbered starting with 0. Group 0 is always present; it's the whole
RE, so \class{MatchObject} methods all have group 0 as their default
argument. Later we'll see how to express groups that don't capture
the span of text that they match.
\begin{verbatim}
>>> p = re.compile('(a)b')
>>> m = p.match('ab')
>>> m.group()
'ab'
>>> m.group(0)
'ab'
\end{verbatim}
Subgroups are numbered from left to right, from 1 upward. Groups can
be nested; to determine the number, just count the opening parenthesis
characters, going from left to right.
\begin{verbatim}
>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'
\end{verbatim}
\method{group()} can be passed multiple group numbers at a time, in
which case it will return a tuple containing the corresponding values
for those groups.
\begin{verbatim}
>>> m.group(2,1,2)
('b', 'abc', 'b')
\end{verbatim}
The \method{groups()} method returns a tuple containing the strings
for all the subgroups, from 1 up to however many there are.
\begin{verbatim}
>>> m.groups()
('abc', 'b')
\end{verbatim}
Backreferences in a pattern allow you to specify that the contents of
an earlier capturing group must also be found at the current location
in the string. For example, \regexp{\e 1} will succeed if the exact
contents of group 1 can be found at the current position, and fails
otherwise. Remember that Python's string literals also use a
backslash followed by numbers to allow including arbitrary characters
in a string, so be sure to use a raw string when incorporating
backreferences in a RE.
For example, the following RE detects doubled words in a string.
\begin{verbatim}
>>> p = re.compile(r'(\b\w+)\s+\1')
>>> p.search('Paris in the the spring').group()
'the the'
\end{verbatim}
Backreferences like this aren't often useful for just searching
through a string --- there are few text formats which repeat data in
this way --- but you'll soon find out that they're \emph{very} useful
when performing string substitutions.
\subsection{Non-capturing and Named Groups}
Elaborate REs may use many groups, both to capture substrings of
interest, and to group and structure the RE itself. In complex REs,
it becomes difficult to keep track of the group numbers. There are
two features which help with this problem. Both of them use a common
syntax for regular expression extensions, so we'll look at that first.
Perl 5 added several additional features to standard regular
expressions, and the Python \module{re} module supports most of them.
It would have been difficult to choose new single-keystroke
metacharacters or new special sequences beginning with \samp{\e} to
represent the new features without making Perl's regular expressions
confusingly different from standard REs. If \samp{\&} had been chosen
as a new metacharacter, for example, old expressions would have assumed
that \samp{\&} was a regular character and wouldn't have escaped it by
writing \regexp{\e \&} or \regexp{[\&]}.
The solution chosen by the Perl developers was to use \regexp{(?...)}
as the extension syntax. \samp{?} immediately after a parenthesis was
a syntax error because the \samp{?} would have nothing to repeat, so
this didn't introduce any compatibility problems. The characters
immediately after the \samp{?} indicate what extension is being used,
so \regexp{(?=foo)} is one thing (a positive lookahead assertion) and
\regexp{(?:foo)} is something else (a non-capturing group containing
the subexpression \regexp{foo}).
Python adds an extension syntax to Perl's extension syntax. If the
first character after the question mark is a \samp{P}, you know that
it's an extension that's specific to Python. Currently there are two
such extensions: \regexp{(?P<\var{name}>...)} defines a named group,
and \regexp{(?P=\var{name})} is a backreference to a named group. If
future versions of Perl 5 add similar features using a different
syntax, the \module{re} module will be changed to support the new
syntax, while preserving the Python-specific syntax for
compatibility's sake.
Now that we've looked at the general extension syntax, we can return
to the features that simplify working with groups in complex REs.
Since groups are numbered from left to right and a complex expression
may use many groups, it can become difficult to keep track of the
correct numbering, and modifying such a complex RE is annoying.
Insert a new group near the beginning, and you change the numbers of
everything that follows it.
First, sometimes you'll want to use a group to collect a part of a
regular expression, but aren't interested in retrieving the group's
contents. You can make this fact explicit by using a non-capturing
group: \regexp{(?:...)}, where you can put any other regular
expression inside the parentheses.
\begin{verbatim}
>>> m = re.match("([abc])+", "abc")
>>> m.groups()
('c',)
>>> m = re.match("(?:[abc])+", "abc")
>>> m.groups()
()
\end{verbatim}
Except for the fact that you can't retrieve the contents of what the
group matched, a non-capturing group behaves exactly the same as a
capturing group; you can put anything inside it, repeat it with a
repetition metacharacter such as \samp{*}, and nest it within other
groups (capturing or non-capturing). \regexp{(?:...)} is particularly
useful when modifying an existing pattern, since you can add new groups
without changing how all the other groups are numbered. It should be
mentioned that there's no performance difference in searching between
capturing and non-capturing groups; neither form is any faster than
the other.
The second, and more significant, feature is named groups; instead of
referring to them by numbers, groups can be referenced by a name.
The syntax for a named group is one of the Python-specific extensions:
\regexp{(?P<\var{name}>...)}. \var{name} is, obviously, the name of
the group. Except for associating a name with a group, named groups
also behave identically to capturing groups. The \class{MatchObject}
methods that deal with capturing groups all accept either integers, to
refer to groups by number, or a string containing the group name.
Named groups are still given numbers, so you can retrieve information
about a group in two ways:
\begin{verbatim}
>>> p = re.compile(r'(?P<word>\b\w+\b)')
>>> m = p.search( '(((( Lots of punctuation )))' )
>>> m.group('word')
'Lots'
>>> m.group(1)
'Lots'
\end{verbatim}
Named groups are handy because they let you use easily remembered
names, instead of having to remember numbers. Here's an example RE
from the \module{imaplib} module:
\begin{verbatim}
InternalDate = re.compile(r'INTERNALDATE "'
r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
r'(?P<year>[0-9][0-9][0-9][0-9])'
r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
r'"')
\end{verbatim}
It's obviously much easier to retrieve \code{m.group('zonem')},
instead of having to remember to retrieve group 9.
Since the syntax for backreferences, in an expression like
\regexp{(...)\e 1}, refers to the number of the group, there's
naturally a variant that uses the group name instead of the number.
This is also a Python extension: \regexp{(?P=\var{name})} indicates
that the contents of the group called \var{name} should again be found
at the current point. The regular expression for finding doubled
words, \regexp{(\e b\e w+)\e s+\e 1} can also be written as
\regexp{(?P<word>\e b\e w+)\e s+(?P=word)}:
\begin{verbatim}
>>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
>>> p.search('Paris in the the spring').group()
'the the'
\end{verbatim}
\subsection{Lookahead Assertions}
Another zero-width assertion is the lookahead assertion. Lookahead
assertions are available in both positive and negative form, and
look like this:
\begin{itemize}
\item[\regexp{(?=...)}] Positive lookahead assertion. This succeeds
if the contained regular expression, represented here by \code{...},
successfully matches at the current location, and fails otherwise.
But, once the contained expression has been tried, the matching engine
doesn't advance at all; the rest of the pattern is tried right where
the assertion started.
\item[\regexp{(?!...)}] Negative lookahead assertion. This is the
opposite of the positive assertion; it succeeds if the contained expression
\emph{doesn't} match at the current position in the string.
\end{itemize}
An example will help make this concrete by demonstrating a case
where a lookahead is useful. Consider a simple pattern to match a
filename and split it apart into a base name and an extension,
separated by a \samp{.}. For example, in \samp{news.rc}, \samp{news}
is the base name, and \samp{rc} is the filename's extension.
The pattern to match this is quite simple:
\regexp{.*[.].*\$}
Notice that the \samp{.} needs to be treated specially because it's a
metacharacter; I've put it inside a character class. Also notice the
trailing \regexp{\$}; this is added to ensure that all the rest of the
string must be included in the extension. This regular expression
matches \samp{foo.bar} and \samp{autoexec.bat} and \samp{sendmail.cf} and
\samp{printers.conf}.
Now, consider complicating the problem a bit; what if you want to
match filenames where the extension is not \samp{bat}?
Some incorrect attempts:
\verb|.*[.][^b].*$|
% $
The first attempt above tries to exclude \samp{bat} by requiring that
the first character of the extension is not a \samp{b}. This is
wrong, because the pattern also doesn't match \samp{foo.bar}.
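You can confirm the flaw interactively (the pattern below is the incorrect attempt, reproduced only to show its failure):

```python
import re

wrong = re.compile(r'.*[.][^b].*$')

# It does reject 'autoexec.bat', as intended...
assert wrong.match('autoexec.bat') is None

# ...but it also wrongly rejects 'foo.bar', whose extension merely
# begins with 'b'.
assert wrong.match('foo.bar') is None
```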
% Messes up the HTML without the curly braces around \^
\regexp{.*[.]([{\^}b]..|.[{\^}a].|..[{\^}t])\$}
The expression gets messier when you try to patch up the first
solution by requiring one of the following cases to match: the first
character of the extension isn't \samp{b}; the second character isn't
\samp{a}; or the third character isn't \samp{t}. This accepts
\samp{foo.bar} and rejects \samp{autoexec.bat}, but it requires a
three-letter extension and won't accept a filename with a two-letter
extension such as \samp{sendmail.cf}. We'll complicate the pattern
again in an effort to fix it.
\regexp{.*[.]([{\^}b].?.?|.[{\^}a]?.?|..?[{\^}t]?)\$}
In the third attempt, the second and third letters are all made
optional in order to allow matching extensions shorter than three
characters, such as \samp{sendmail.cf}.
The pattern's getting really complicated now, which makes it hard to
read and understand. Worse, if the problem changes and you want to
exclude both \samp{bat} and \samp{exe} as extensions, the pattern
would get even more complicated and confusing.
A negative lookahead cuts through all this:
\regexp{.*[.](?!bat\$).*\$}
% $
The lookahead means: if the expression \regexp{bat} doesn't match at
this point, try the rest of the pattern; if \regexp{bat\$} does match,
the whole pattern will fail. The trailing \regexp{\$} is required to
ensure that something like \samp{sample.batch}, where the extension
only starts with \samp{bat}, will be allowed.
Excluding another filename extension is now easy; simply add it as an
alternative inside the assertion. The following pattern excludes
filenames that end in either \samp{bat} or \samp{exe}:
\regexp{.*[.](?!bat\$|exe\$).*\$}
% $
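A quick check confirms that the lookahead version handles all the cases the earlier attempts stumbled over:

```python
import re

p = re.compile(r'.*[.](?!bat$).*$')

# Ordinary extensions, including short ones, are accepted.
assert p.match('foo.bar') is not None
assert p.match('sendmail.cf') is not None

# 'bat' extensions are rejected.
assert p.match('autoexec.bat') is None

# 'sample.batch' is allowed: its extension only starts with 'bat'.
assert p.match('sample.batch') is not None
```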
\section{Modifying Strings}
Up to this point, we've simply performed searches against a static
string. Regular expressions are also commonly used to modify a string
in various ways, using the following \class{RegexObject} methods:
\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
\lineii{split()}{Split the string into a list, splitting it wherever the RE matches}
\lineii{sub()}{Find all substrings where the RE matches, and replace them with a different string}
\lineii{subn()}{Does the same thing as \method{sub()},
but returns the new string and the number of replacements}
\end{tableii}
\subsection{Splitting Strings}
The \method{split()} method of a \class{RegexObject} splits a string
apart wherever the RE matches, returning a list of the pieces.
It's similar to the \method{split()} method of strings, but provides
much more generality in the delimiters that you can split by; the
string \method{split()} method only supports splitting by whitespace
or by a fixed string. As you'd expect, there's a module-level
\function{re.split()} function, too.
\begin{methoddesc}{split}{string \optional{, maxsplit\code{ = 0}}}
Split \var{string} by the matches of the regular expression. If
capturing parentheses are used in the RE, then their contents will
also be returned as part of the resulting list. If \var{maxsplit}
is nonzero, at most \var{maxsplit} splits are performed.
\end{methoddesc}
You can limit the number of splits made, by passing a value for
\var{maxsplit}. When \var{maxsplit} is nonzero, at most
\var{maxsplit} splits will be made, and the remainder of the string is
returned as the final element of the list. In the following example,
the delimiter is any sequence of non-alphanumeric characters.
\begin{verbatim}
>>> p = re.compile(r'\W+')
>>> p.split('This is a test, short and sweet, of split().')
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
>>> p.split('This is a test, short and sweet, of split().', 3)
['This', 'is', 'a', 'test, short and sweet, of split().']
\end{verbatim}
Sometimes you're not only interested in what the text between
delimiters is, but also need to know what the delimiter was. If
capturing parentheses are used in the RE, then their values are also
returned as part of the list. Compare the following calls:
\begin{verbatim}
>>> p = re.compile(r'\W+')
>>> p2 = re.compile(r'(\W+)')
>>> p.split('This... is a test.')
['This', 'is', 'a', 'test', '']
>>> p2.split('This... is a test.')
['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']
\end{verbatim}
The module-level function \function{re.split()} adds the RE to be
used as the first argument, but is otherwise the same.
\begin{verbatim}
>>> re.split('[\W]+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split('([\W]+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split('[\W]+', 'Words, words, words.', 1)
['Words', 'words, words.']
\end{verbatim}
\subsection{Search and Replace}
Another common task is to find all the matches for a pattern, and
replace them with a different string. The \method{sub()} method takes
a replacement value, which can be either a string or a function, and
the string to be processed.
\begin{methoddesc}{sub}{replacement, string\optional{, count\code{ = 0}}}
Returns the string obtained by replacing the leftmost non-overlapping
occurrences of the RE in \var{string} by the replacement
\var{replacement}. If the pattern isn't found, \var{string} is returned
unchanged.
The optional argument \var{count} is the maximum number of pattern
occurrences to be replaced; \var{count} must be a non-negative
integer. The default value of 0 means to replace all occurrences.
\end{methoddesc}
Here's a simple example of using the \method{sub()} method. It
replaces colour names with the word \samp{colour}:
\begin{verbatim}
>>> p = re.compile( '(blue|white|red)')
>>> p.sub( 'colour', 'blue socks and red shoes')
'colour socks and colour shoes'
>>> p.sub( 'colour', 'blue socks and red shoes', count=1)
'colour socks and red shoes'
\end{verbatim}
The \method{subn()} method does the same work, but returns a 2-tuple
containing the new string value and the number of replacements
that were performed:
\begin{verbatim}
>>> p = re.compile( '(blue|white|red)')
>>> p.subn( 'colour', 'blue socks and red shoes')
('colour socks and colour shoes', 2)
>>> p.subn( 'colour', 'no colours at all')
('no colours at all', 0)
\end{verbatim}
Empty matches are replaced only when they're not
adjacent to a previous match.
\begin{verbatim}
>>> p = re.compile('x*')
>>> p.sub('-', 'abxd')
'-a-b-d-'
\end{verbatim}
If \var{replacement} is a string, any backslash escapes in it are
processed. That is, \samp{\e n} is converted to a single newline
character, \samp{\e r} is converted to a carriage return, and so forth.
Unknown escapes such as \samp{\e j} are left alone. Backreferences,
such as \samp{\e 6}, are replaced with the substring matched by the
corresponding group in the RE. This lets you incorporate
portions of the original text in the resulting
replacement string.
This example matches the word \samp{section} followed by a string
enclosed in \samp{\{}, \samp{\}}, and changes \samp{section} to
\samp{subsection}:
\begin{verbatim}
>>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE)
>>> p.sub(r'subsection{\1}','section{First} section{second}')
'subsection{First} subsection{second}'
\end{verbatim}
There's also a syntax for referring to named groups as defined by the
\regexp{(?P<name>...)} syntax. \samp{\e g<name>} will use the
substring matched by the group named \samp{name}, and
\samp{\e g<\var{number}>}
uses the corresponding group number.
\samp{\e g<2>} is therefore equivalent to \samp{\e 2},
but isn't ambiguous in a
replacement string such as \samp{\e g<2>0}. (\samp{\e 20} would be
interpreted as a reference to group 20, not a reference to group 2
followed by the literal character \character{0}.) The following
substitutions are all equivalent, but use all three variations of the
replacement string.
\begin{verbatim}
>>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
>>> p.sub(r'subsection{\1}','section{First}')
'subsection{First}'
>>> p.sub(r'subsection{\g<1>}','section{First}')
'subsection{First}'
>>> p.sub(r'subsection{\g<name>}','section{First}')
'subsection{First}'
\end{verbatim}
\var{replacement} can also be a function, which gives you even more
control. If \var{replacement} is a function, the function is
called for every non-overlapping occurrence of \var{pattern}. On each
call, the function is
passed a \class{MatchObject} argument for the match
and can use this information to compute the desired replacement string and return it.
In the following example, the replacement function translates
decimals into hexadecimal:
\begin{verbatim}
>>> def hexrepl( match ):
... "Return the hex string for a decimal number"
... value = int( match.group() )
... return hex(value)
...
>>> p = re.compile(r'\d+')
>>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
'Call 0xffd2 for printing, 0xc000 for user code.'
\end{verbatim}
When using the module-level \function{re.sub()} function, the pattern
is passed as the first argument. The pattern may be a string or a
\class{RegexObject}; if you need to specify regular expression flags,
you must either use a \class{RegexObject} as the first parameter, or use
embedded modifiers in the pattern, e.g. \code{sub("(?i)b+", "x", "bbbb
BBBB")} returns \code{'x x'}.
\section{Common Problems}
Regular expressions are a powerful tool for some applications, but in
some ways their behaviour isn't intuitive and at times they don't
behave the way you may expect them to. This section will point out
some of the most common pitfalls.
\subsection{Use String Methods}
Sometimes using the \module{re} module is a mistake. If you're
matching a fixed string, or a single character class, and you're not
using any \module{re} features such as the \constant{IGNORECASE} flag,
then the full power of regular expressions may not be required.
Strings have several methods for performing operations with fixed
strings and they're usually much faster, because the implementation is
a single small C loop that's been optimized for the purpose, instead
of the large, more generalized regular expression engine.
One example might be replacing a single fixed string with another
one; for example, you might replace \samp{word}
with \samp{deed}. \code{re.sub()} seems like the function to use for
this, but consider the \method{replace()} method. Note that
\function{replace()} will also replace \samp{word} inside
words, turning \samp{swordfish} into \samp{sdeedfish}, but the
na{\"\i}ve RE \regexp{word} would have done that, too. (To avoid performing
the substitution on parts of words, the pattern would have to be
\regexp{\e bword\e b}, in order to require that \samp{word} have a
word boundary on either side. This takes the job beyond
\method{replace}'s abilities.)
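A minimal sketch of the difference, with a made-up sample string:

```python
import re

s = 'A swordfish heard the word.'

# replace() substitutes inside other words too:
r1 = s.replace('word', 'deed')           # 'A sdeedfish heard the deed.'

# \bword\b requires word boundaries, leaving 'swordfish' alone:
r2 = re.sub(r'\bword\b', 'deed', s)      # 'A swordfish heard the deed.'
```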
Another common task is deleting every occurrence of a single character
from a string or replacing it with another single character. You
might do this with something like \code{re.sub('\e n', ' ', S)}, but
\method{translate()} is capable of doing both tasks
and will be faster than any regular expression operation can be.
In short, before turning to the \module{re} module, consider whether
your problem can be solved with a faster and simpler string method.
\subsection{match() versus search()}
The \function{match()} function only checks if the RE matches at
the beginning of the string while \function{search()} will scan
forward through the string for a match.
It's important to keep this distinction in mind. Remember,
\function{match()} will only report a successful match which
will start at 0; if the match wouldn't start at zero,
\function{match()} will \emph{not} report it.
\begin{verbatim}
>>> print re.match('super', 'superstition').span()
(0, 5)
>>> print re.match('super', 'insuperable')
None
\end{verbatim}
On the other hand, \function{search()} will scan forward through the
string, reporting the first match it finds.
\begin{verbatim}
>>> print re.search('super', 'superstition').span()
(0, 5)
>>> print re.search('super', 'insuperable').span()
(2, 7)
\end{verbatim}
Sometimes you'll be tempted to keep using \function{re.match()}, and
just add \regexp{.*} to the front of your RE. Resist this temptation
and use \function{re.search()} instead. The regular expression
compiler does some analysis of REs in order to speed up the process of
looking for a match. One such analysis figures out what the first
character of a match must be; for example, a pattern starting with
\regexp{Crow} must match starting with a \character{C}. The analysis
lets the engine quickly scan through the string looking for the
starting character, only trying the full match if a \character{C} is found.
Adding \regexp{.*} defeats this optimization, requiring scanning to
the end of the string and then backtracking to find a match for the
rest of the RE. Use \function{re.search()} instead.
\subsection{Greedy versus Non-Greedy}
When repeating a regular expression, as in \regexp{a*}, the resulting
action is to consume as much of the string as possible. This
fact often bites you when you're trying to match a pair of
balanced delimiters, such as the angle brackets surrounding an HTML
tag. The na{\"\i}ve pattern for matching a single HTML tag doesn't
work because of the greedy nature of \regexp{.*}.
\begin{verbatim}
>>> s = '<html><head><title>Title</title>'
>>> len(s)
32
>>> print re.match('<.*>', s).span()
(0, 32)
>>> print re.match('<.*>', s).group()
<html><head><title>Title</title>
\end{verbatim}
The RE matches the \character{<} in \samp{<html>}, and the
\regexp{.*} consumes the rest of the string. There's still more left
in the RE, though, and the \regexp{>} can't match at the end of
the string, so the regular expression engine has to backtrack
character by character until it finds a match for the \regexp{>}.
The final match extends from the \character{<} in \samp{<html>}
to the \character{>} in \samp{</title>}, which isn't what you want.
In this case, the solution is to use the non-greedy qualifiers
\regexp{*?}, \regexp{+?}, \regexp{??}, or
\regexp{\{\var{m},\var{n}\}?}, which match as \emph{little} text as
possible. In the above example, the \character{>} is tried
immediately after the first \character{<} matches, and when it fails,
the engine advances a character at a time, retrying the \character{>}
at every step. This produces just the right result:
\begin{verbatim}
>>> print re.match('<.*?>', s).group()
<html>
\end{verbatim}
(Note that parsing HTML or XML with regular expressions is painful.
Quick-and-dirty patterns will handle common cases, but HTML and XML
have special cases that will break the obvious regular expression; by
the time you've written a regular expression that handles all of the
possible cases, the patterns will be \emph{very} complicated. Use an
HTML or XML parser module for such tasks.)
\subsection{Not Using re.VERBOSE}
By now you've probably noticed that regular expressions are a very
compact notation, but they're not terribly readable. REs of
moderate complexity can become lengthy collections of backslashes,
parentheses, and metacharacters, making them difficult to read and
understand.
For such REs, specifying the \code{re.VERBOSE} flag when
compiling the regular expression can be helpful, because it allows
you to format the regular expression more clearly.
The \code{re.VERBOSE} flag has several effects. Whitespace in the
regular expression that \emph{isn't} inside a character class is
ignored. This means that an expression such as \regexp{dog | cat} is
equivalent to the less readable \regexp{dog|cat}, but \regexp{[a b]}
will still match the characters \character{a}, \character{b}, or a
space. In addition, you can also put comments inside a RE; comments
extend from a \samp{\#} character to the next newline. When used with
triple-quoted strings, this enables REs to be formatted more neatly:
\begin{verbatim}
pat = re.compile(r"""
\s* # Skip leading whitespace
(?P<header>[^:]+) # Header name
\s* : # Whitespace, and a colon
(?P<value>.*?) # The header's value -- *? used to
# lose the following trailing whitespace
\s*$ # Trailing whitespace to end-of-line
""", re.VERBOSE)
\end{verbatim}
% $
This is far more readable than:
\begin{verbatim}
pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")
\end{verbatim}
% $
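Trying the header pattern on a sample line (the input is hypothetical):

```python
import re

pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

m = pat.match('To:amk@amk.ca  ')
header = m.group('header')   # 'To'
value = m.group('value')     # 'amk@amk.ca' -- trailing whitespace dropped
```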
\section{Feedback}
Regular expressions are a complicated topic. Did this document help
you understand them? Were there parts that were unclear, or problems
you encountered that weren't covered here? If so, please send
suggestions for improvements to the author.
The most complete book on regular expressions is almost certainly
Jeffrey Friedl's \citetitle{Mastering Regular Expressions}, published
by O'Reilly. Unfortunately, it exclusively concentrates on Perl and
Java's flavours of regular expressions, and doesn't contain any Python
material at all, so it won't be useful as a reference for programming
in Python. (The first edition covered Python's now-obsolete
\module{regex} module, which won't help you much.) Consider checking
it out from your library.
\end{document}
\documentclass{howto}
\title{Restricted Execution HOWTO}
\release{2.1}
\author{A.M. Kuchling}
\authoraddress{\email{amk@amk.ca}}
\begin{document}
\maketitle
\begin{abstract}
\noindent
Python 2.2.2 and earlier provided a \module{rexec} module for running
untrusted code. However, it's never been exhaustively audited for
security and it hasn't been updated to take into account recent
changes to Python such as new-style classes. Therefore, the
\module{rexec} module should not be trusted. To discourage use of
\module{rexec}, this HOWTO has been withdrawn.
The \module{rexec} and \module{Bastion} modules have been disabled in
the Python CVS tree, both on the trunk (which will eventually become
Python 2.3alpha2 and later 2.3final) and on the release22-maint branch
(which will become Python 2.2.3, if someone ever volunteers to issue
2.2.3).
For discussion of the problems with \module{rexec}, see the python-dev
threads starting at the following URLs:
\url{http://mail.python.org/pipermail/python-dev/2002-December/031160.html},
and
\url{http://mail.python.org/pipermail/python-dev/2003-January/031848.html}.
\end{abstract}
\section{Version History}
Feb. 26, 1998: First version. Suggestions are welcome.
Mar. 16, 1998: Made some revisions suggested by Jeff Rush. Some minor
changes and clarifications, and a sizable section on exceptions added.
Sep. 12, 1998: Minor revisions and added the reference to the Janus
project.
Oct. 4, 2000: Checked with Python 2.0. Minor rewrites and fixes made.
Version number increased to 2.0.
Dec. 17, 2002: Withdrawn.
Jan. 8, 2003: Mention that \module{rexec} will be disabled in Python 2.3,
and added links to relevant python-dev threads.
\end{document}
\documentclass{howto}
\title{Socket Programming HOWTO}
\release{0.00}
\author{Gordon McMillan}
\authoraddress{\email{gmcm@hypernet.com}}
\begin{document}
\maketitle
\begin{abstract}
\noindent
Sockets are used nearly everywhere, but are one of the most severely
misunderstood technologies around. This is a 10,000 foot overview of
sockets. It's not really a tutorial - you'll still have work to do in
getting things operational. It doesn't cover the fine points (and there
are a lot of them), but I hope it will give you enough background to
begin using them decently.
This document is available from the Python HOWTO page at
\url{http://www.python.org/doc/howto}.
\end{abstract}
\tableofcontents
\section{Sockets}
Sockets are used nearly everywhere, but are one of the most severely
misunderstood technologies around. This is a 10,000 foot overview of
sockets. It's not really a tutorial - you'll still have work to do in
getting things working. It doesn't cover the fine points (and there
are a lot of them), but I hope it will give you enough background to
begin using them decently.
I'm only going to talk about INET sockets, but they account for at
least 99\% of the sockets in use. And I'll only talk about STREAM
sockets - unless you really know what you're doing (in which case this
HOWTO isn't for you!), you'll get better behavior and performance from
a STREAM socket than anything else. I will try to clear up the mystery
of what a socket is, as well as some hints on how to work with
blocking and non-blocking sockets. But I'll start by talking about
blocking sockets. You'll need to know how they work before dealing
with non-blocking sockets.
Part of the trouble with understanding these things is that "socket"
can mean a number of subtly different things, depending on context. So
first, let's make a distinction between a "client" socket - an
endpoint of a conversation, and a "server" socket, which is more like
a switchboard operator. The client application (your browser, for
example) uses "client" sockets exclusively; the web server it's
talking to uses both "server" sockets and "client" sockets.
\subsection{History}
Of the various forms of IPC (\emph{Inter Process Communication}),
sockets are by far the most popular. On any given platform, there are
likely to be other forms of IPC that are faster, but for
cross-platform communication, sockets are about the only game in town.
They were invented in Berkeley as part of the BSD flavor of Unix. They
spread like wildfire with the Internet. With good reason --- the
combination of sockets with INET makes talking to arbitrary machines
around the world unbelievably easy (at least compared to other
schemes).
\section{Creating a Socket}
Roughly speaking, when you clicked on the link that brought you to
this page, your browser did something like the following:
\begin{verbatim}
#create an INET, STREAMing socket
s = socket.socket(
socket.AF_INET, socket.SOCK_STREAM)
#now connect to the web server on port 80
# - the normal http port
s.connect(("www.mcmillan-inc.com", 80))
\end{verbatim}
When the \code{connect} completes, the socket \code{s} can
now be used to send in a request for the text of this page. The same
socket will read the reply, and then be destroyed. That's right -
destroyed. Client sockets are normally only used for one exchange (or
a small set of sequential exchanges).
What happens in the web server is a bit more complex. First, the web
server creates a "server socket".
\begin{verbatim}
#create an INET, STREAMing socket
serversocket = socket.socket(
socket.AF_INET, socket.SOCK_STREAM)
#bind the socket to a public host,
# and a well-known port
serversocket.bind((socket.gethostname(), 80))
#become a server socket
serversocket.listen(5)
\end{verbatim}
A couple things to notice: we used \code{socket.gethostname()}
so that the socket would be visible to the outside world. If we had
used \code{s.bind(('', 80))} or \code{s.bind(('localhost',
80))} or \code{s.bind(('127.0.0.1', 80))} we would still
have a "server" socket, but one that was only visible within the same
machine.
A second thing to note: low number ports are usually reserved for
"well known" services (HTTP, SNMP etc). If you're playing around, use
a nice high number (4 digits).
Finally, the argument to \code{listen} tells the socket library that
we want it to queue up as many as 5 connect requests (the normal max)
before refusing outside connections. If the rest of the code is
written properly, that should be plenty.
OK, now we have a "server" socket, listening on port 80. Now we enter
the mainloop of the web server:
\begin{verbatim}
while 1:
#accept connections from outside
(clientsocket, address) = serversocket.accept()
#now do something with the clientsocket
#in this case, we'll pretend this is a threaded server
ct = client_thread(clientsocket)
ct.run()
\end{verbatim}
There are actually three general ways in which this loop could work -
dispatching a thread to handle \code{clientsocket}, creating a new
process to handle \code{clientsocket}, or restructuring this app
to use non-blocking sockets, multiplexing between our "server" socket
and any active \code{clientsocket}s using
\code{select}. More about that later. The important thing to
understand now is this: this is \emph{all} a "server" socket
does. It doesn't send any data. It doesn't receive any data. It just
produces "client" sockets. Each \code{clientsocket} is created
in response to some \emph{other} "client" socket doing a
\code{connect()} to the host and port we're bound to. As soon as
we've created that \code{clientsocket}, we go back to listening
for more connections. The two "clients" are free to chat it up - they
are using some dynamically allocated port which will be recycled when
the conversation ends.
\subsection{IPC} If you need fast IPC between two processes
on one machine, you should look into whatever form of shared memory
the platform offers. A simple protocol based around shared memory and
locks or semaphores is by far the fastest technique.
If you do decide to use sockets, bind the "server" socket to
\code{'localhost'}. On most platforms, this will take a shortcut
around a couple of layers of network code and be quite a bit faster.
\section{Using a Socket}
The first thing to note, is that the web browser's "client" socket and
the web server's "client" socket are identical beasts. That is, this
is a "peer to peer" conversation. Or to put it another way, \emph{as the
designer, you will have to decide what the rules of etiquette are for
a conversation}. Normally, the \code{connect}ing socket
starts the conversation, by sending in a request, or perhaps a
signon. But that's a design decision - it's not a rule of sockets.
Now there are two sets of verbs to use for communication. You can use
\code{send} and \code{recv}, or you can transform your
client socket into a file-like beast and use \code{read} and
\code{write}. The latter is the way Java presents their
sockets. I'm not going to talk about it here, except to warn you that
you need to use \code{flush} on sockets. These are buffered
"files", and a common mistake is to \code{write} something, and
then \code{read} for a reply. Without a \code{flush} in
there, you may wait forever for the reply, because the request may
still be in your output buffer.
Now we come to the major stumbling block of sockets - \code{send}
and \code{recv} operate on the network buffers. They do not
necessarily handle all the bytes you hand them (or expect from them),
because their major focus is handling the network buffers. In general,
they return when the associated network buffers have been filled
(\code{send}) or emptied (\code{recv}). They then tell you
how many bytes they handled. It is \emph{your} responsibility to call
them again until your message has been completely dealt with.
When a \code{recv} returns 0 bytes, it means the other side has
closed (or is in the process of closing) the connection. You will not
receive any more data on this connection. Ever. You may be able to
send data successfully; I'll talk about that some on the next page.
A protocol like HTTP uses a socket for only one transfer. The client
sends a request, then reads a reply. That's it. The socket is
discarded. This means that a client can detect the end of the reply by
receiving 0 bytes.
But if you plan to reuse your socket for further transfers, you need
to realize that \emph{there is no "EOT" (End of Transfer) on a
socket.} I repeat: if a socket \code{send} or
\code{recv} returns after handling 0 bytes, the connection has
been broken. If the connection has \emph{not} been broken, you may
wait on a \code{recv} forever, because the socket will
\emph{not} tell you that there's nothing more to read (for now). Now
if you think about that a bit, you'll come to realize a fundamental
truth of sockets: \emph{messages must either be fixed length} (yuck),
\emph{or be delimited} (shrug), \emph{or indicate how long they are}
(much better), \emph{or end by shutting down the connection}. The
choice is entirely yours, (but some ways are righter than others).
Assuming you don't want to end the connection, the simplest solution
is a fixed length message:
\begin{verbatim}
class mysocket:
'''demonstration class only
- coded for clarity, not efficiency'''
def __init__(self, sock=None):
if sock is None:
self.sock = socket.socket(
socket.AF_INET, socket.SOCK_STREAM)
else:
self.sock = sock
    def connect(self, host, port):
        self.sock.connect((host, port))
    def mysend(self, msg):
        totalsent = 0
        while totalsent < MSGLEN:
            sent = self.sock.send(msg[totalsent:])
            if sent == 0:
                raise RuntimeError, \
                      "socket connection broken"
            totalsent = totalsent + sent
    def myreceive(self):
        msg = ''
        while len(msg) < MSGLEN:
            chunk = self.sock.recv(MSGLEN-len(msg))
            if chunk == '':
                raise RuntimeError, \
                      "socket connection broken"
            msg = msg + chunk
        return msg
\end{verbatim}
The sending code here is usable for almost any messaging scheme - in
Python you send strings, and you can use \code{len()} to
determine its length (even if it has embedded \code{\e 0}
characters). It's mostly the receiving code that gets more
complex. (And in C, it's not much worse, except you can't use
\code{strlen} if the message has embedded \code{\e 0}s.)
The easiest enhancement is to make the first character of the message
an indicator of message type, and have the type determine the
length. Now you have two \code{recv}s - the first to get (at
least) that first character so you can look up the length, and the
second in a loop to get the rest. If you decide to go the delimited
route, you'll be receiving in some arbitrary chunk size, (4096 or 8192
is frequently a good match for network buffer sizes), and scanning
what you've received for a delimiter.
One complication to be aware of: if your conversational protocol
allows multiple messages to be sent back to back (without some kind of
reply), and you pass \code{recv} an arbitrary chunk size, you
may end up reading the start of a following message. You'll need to
put that aside and hold onto it, until it's needed.
Prefixing the message with its length (say, as 5 numeric characters)
gets more complex, because (believe it or not), you may not get all 5
characters in one \code{recv}. In playing around, you'll get
away with it; but in high network loads, your code will very quickly
break unless you use two \code{recv} loops - the first to
determine the length, the second to get the data part of the
message. Nasty. This is also when you'll discover that
\code{send} does not always manage to get rid of everything in
one pass. And despite having read this, you will eventually get bit by
it!
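A sketch of the two-loop approach, written for current Python (bytes
strings; the 5-character decimal length prefix follows the text's
suggestion, and \code{socket.socketpair()} is used purely for
demonstration):

```python
import socket

def recv_exact(sock, n):
    # recv() may return fewer bytes than asked for, so loop.
    data = b''
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise RuntimeError('socket connection broken')
        data += chunk
    return data

def recv_message(sock):
    # First loop: the 5-digit length prefix.  Second loop: the payload.
    length = int(recv_exact(sock, 5))
    return recv_exact(sock, length)

# Demonstration over a local socketpair (a Unix convenience):
a, b = socket.socketpair()
a.sendall(b'00005hello')
msg = recv_message(b)    # b'hello'
```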
In the interests of space, building your character, (and preserving my
competitive position), these enhancements are left as an exercise for
the reader. Let's move on to cleaning up.
\subsection{Binary Data}
It is perfectly possible to send binary data over a socket. The major
problem is that not all machines use the same formats for binary
data. For example, a Motorola chip will represent a 16 bit integer
with the value 1 as the two hex bytes 00 01. Intel and DEC, however,
are byte-reversed - that same 1 is 01 00. Socket libraries have calls
for converting 16 and 32 bit integers - \code{ntohl, htonl, ntohs,
htons} where "n" means \emph{network} and "h" means \emph{host},
"s" means \emph{short} and "l" means \emph{long}. Where network order
is host order, these do nothing, but where the machine is
byte-reversed, these swap the bytes around appropriately.
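The same conversions are visible from Python's \module{socket} and
\module{struct} modules; a small sketch:

```python
import socket
import struct

# A 16 bit integer with value 1, packed in network (big-endian) order:
net = struct.pack('!H', 1)     # b'\x00\x01'

# htons/ntohs are inverses; on a big-endian host they do nothing.
n = socket.htons(258)
roundtrip = socket.ntohs(n)    # 258
```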
In these days of 32 bit machines, the ASCII representation of binary
data is frequently smaller than the binary representation. That's
because a surprising amount of the time, all those longs have the
value 0, or maybe 1. The string "0" would be two bytes, while binary
is four. Of course, this doesn't fit well with fixed-length
messages. Decisions, decisions.
\section{Disconnecting}
Strictly speaking, you're supposed to use \code{shutdown} on a
socket before you \code{close} it. The \code{shutdown} is
an advisory to the socket at the other end. Depending on the argument
you pass it, it can mean "I'm not going to send anymore, but I'll
still listen", or "I'm not listening, good riddance!". Most socket
libraries, however, are so used to programmers neglecting to use this
piece of etiquette that normally a \code{close} is the same as
\code{shutdown(); close()}. So in most situations, an explicit
\code{shutdown} is not needed.
One way to use \code{shutdown} effectively is in an HTTP-like
exchange. The client sends a request and then does a
\code{shutdown(1)}. This tells the server "This client is done
sending, but can still receive." The server can detect "EOF" by a
receive of 0 bytes. It can assume it has the complete request. The
server sends a reply. If the \code{send} completes successfully
then, indeed, the client was still receiving.
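A minimal sketch of that exchange, using \code{socket.socketpair()} (a
Unix convenience, not part of the exchange itself) so both ends live in
one process; \code{shutdown(1)} is spelled \code{SHUT_WR} here:

```python
import socket

client, server = socket.socketpair()
client.sendall(b'request')
client.shutdown(socket.SHUT_WR)   # "done sending, still listening"

# The server reads until recv() returns b'', its "EOF":
request = b''
while True:
    chunk = server.recv(4096)
    if not chunk:
        break
    request += chunk

server.sendall(b'reply')          # the client can still receive
reply = client.recv(4096)
```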
Python takes the automatic shutdown a step further, and says that when a socket is garbage collected, it will automatically do a \code{close} if it's needed. But relying on this is a very bad habit. If your socket just disappears without doing a \code{close}, the socket at the other end may hang indefinitely, thinking you're just being slow. \emph{Please} \code{close} your sockets when you're done.
\subsection{When Sockets Die}
Probably the worst thing about using blocking sockets is what happens
when the other side comes down hard (without doing a
\code{close}). Your socket is likely to hang. TCP is a
reliable protocol, and it will wait a long, long time before giving up
on a connection. If you're using threads, the entire thread is
essentially dead. There's not much you can do about it. As long as you
aren't doing something dumb, like holding a lock while doing a
blocking read, the thread isn't really consuming much in the way of
resources. Do \emph{not} try to kill the thread - part of the reason
that threads are more efficient than processes is that they avoid the
overhead associated with the automatic recycling of resources. In
other words, if you do manage to kill the thread, your whole process
is likely to be screwed up.
\section{Non-blocking Sockets}
If you've understood the preceding, you already know most of what you
need to know about the mechanics of using sockets. You'll still use
the same calls, in much the same ways. It's just that, if you do it
right, your app will be almost inside-out.
In Python, you use \code{socket.setblocking(0)} to make it
non-blocking. In C, it's more complex, (for one thing, you'll need to
choose between the BSD flavor \code{O_NONBLOCK} and the almost
indistinguishable Posix flavor \code{O_NDELAY}, which is
completely different from \code{TCP_NODELAY}), but it's the
exact same idea. You do this after creating the socket, but before
using it. (Actually, if you're nuts, you can switch back and forth.)
The major mechanical difference is that \code{send},
\code{recv}, \code{connect} and \code{accept} can
return without having done anything. You have (of course) a number of
choices. You can check return code and error codes and generally drive
yourself crazy. If you don't believe me, try it sometime. Your app
will grow large, buggy and suck CPU. So let's skip the brain-dead
solutions and do it right.
Use \code{select}.
In C, coding \code{select} is fairly complex. In Python, it's a
piece of cake, but it's close enough to the C version that if you
understand \code{select} in Python, you'll have little trouble
with it in C.
\begin{verbatim}
ready_to_read, ready_to_write, in_error = \
    select.select(potential_readers,
                  potential_writers,
                  potential_errs,
                  timeout)
\end{verbatim}
You pass \code{select} three lists: the first contains all
sockets that you might want to try reading; the second all the sockets
you might want to try writing to, and the last (normally left empty)
those that you want to check for errors. You should note that a
socket can go into more than one list. The \code{select} call is
blocking, but you can give it a timeout. This is generally a sensible
thing to do - give it a nice long timeout (say a minute) unless you
have good reason to do otherwise.
In return, you will get three lists. They have the sockets that are
actually readable, writable and in error. Each of these lists is a
subset (possibly empty) of the corresponding list you passed in. And
if you put a socket in more than one input list, it will only be (at
most) in one output list.
If a socket is in the output readable list, you can be
as-close-to-certain-as-we-ever-get-in-this-business that a
\code{recv} on that socket will return \emph{something}. Same
idea for the writable list. You'll be able to send
\emph{something}. Maybe not all you want to, but \emph{something} is
better than nothing. (Actually, any reasonably healthy socket will
return as writable - it just means outbound network buffer space is
available.)
If you have a "server" socket, put it in the \code{potential_readers}
list. If it comes out in the readable list, your \code{accept}
will (almost certainly) work. If you have created a new socket to
\code{connect} to someone else, put it in the \code{potential_writers}
list. If it shows up in the writable list, you have a decent chance
that it has connected.
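Both cases can be sketched in a few lines. This is an illustrative, self-contained example (the port number is chosen by the OS, the timeouts are arbitrary), not production code:

```python
import select
import socket

# A listening "server" socket goes in the potential-readers list; a
# socket that is mid-connect goes in the potential-writers list.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 0))   # port 0: let the OS pick a free port
server.listen(1)
server.setblocking(0)

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.setblocking(0)
client.connect_ex(server.getsockname())  # usually "in progress", not done

# Readable server socket: accept() will (almost certainly) work.
readable, _, _ = select.select([server], [], [], 5.0)
conn, addr = server.accept()

# Writable client socket: a decent chance the connect has finished.
_, writable, _ = select.select([], [client], [], 5.0)
client.send(b'hello')

# And a readable connected socket means recv() will return something.
select.select([conn], [], [], 5.0)
data = conn.recv(1024)

for s in (conn, client, server):
    s.close()
```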
One very nasty problem with \code{select}: if somewhere in those
input lists of sockets is one which has died a nasty death, the
\code{select} will fail. You then need to loop through every
single damn socket in all those lists and do a
\code{select([sock],[],[],0)} until you find the bad one. That
timeout of 0 means it won't take long, but it's ugly.
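A sketch of that recovery loop (the helper name is made up, and the exact exception raised for a bad socket varies by platform and Python version, hence the broad catch):

```python
import select
import socket

def find_dead_sockets(socks):
    """Probe each socket with a zero-timeout select to find the bad ones.

    (Illustrative helper, not part of any library.)
    """
    dead = []
    for s in socks:
        try:
            select.select([s], [], [], 0)   # timeout of 0: returns at once
        except (ValueError, OSError):       # exact error varies by platform
            dead.append(s)
    return dead

# Demonstration: a closed socket makes select() fail.
a, b = socket.socketpair()
b.close()
dead = find_dead_sockets([a, b])
a.close()
```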
Actually, \code{select} can be handy even with blocking sockets.
It's one way of determining whether you will block - the socket
returns as readable when there's something in the buffers. However,
this still doesn't help with the problem of determining whether the
other end is done, or just busy with something else.
\textbf{Portability alert}: On Unix, \code{select} works with both
sockets and files. Don't try this on Windows. On Windows,
\code{select} works with sockets only. Also note that in C, many
of the more advanced socket options are done differently on
Windows. In fact, on Windows I usually use threads (which work very,
very well) with my sockets. Face it, if you want any kind of
performance, your code will look very different on Windows than on
Unix. (I haven't the foggiest how you do this stuff on a Mac.)
\subsection{Performance}
There's no question that the fastest sockets code uses non-blocking
sockets and select to multiplex them. You can put together something
that will saturate a LAN connection without putting any strain on the
CPU. The trouble is that an app written this way can't do much of
anything else - it needs to be ready to shuffle bytes around at all
times.
Assuming that your app is actually supposed to do something more than
that, threading is the optimal solution (and using non-blocking
sockets will be faster than using blocking sockets). Unfortunately,
threading support in Unixes varies both in API and quality. So the
normal Unix solution is to fork a subprocess to deal with each
connection. The overhead for this is significant (and don't do this on
Windows - the overhead of process creation is enormous there). It also
means that unless each subprocess is completely independent, you'll
need to use another form of IPC, say a pipe, or shared memory and
semaphores, to communicate between the parent and child processes.
Finally, remember that even though blocking sockets are somewhat
slower than non-blocking, in many cases they are the "right"
solution. After all, if your app is driven by the data it receives
over a socket, there's not much sense in complicating the logic just
so your app can wait on \code{select} instead of
\code{recv}.
\end{document}
\documentclass{howto}
\title{Sorting Mini-HOWTO}
% Increment the release number whenever significant changes are made.
% The author and/or editor can define 'significant' however they like.
\release{0.01}
\author{Andrew Dalke}
\authoraddress{\email{dalke@bioreason.com}}
\begin{document}
\maketitle
\begin{abstract}
\noindent
This document is a little tutorial
showing a half dozen ways to sort a list with the built-in
\method{sort()} method.
This document is available from the Python HOWTO page at
\url{http://www.python.org/doc/howto}.
\end{abstract}
\tableofcontents
Python lists have a built-in \method{sort()} method. There are many
ways to use it to sort a list and there doesn't appear to be a single,
central place in the various manuals describing them, so I'll do so
here.
\section{Sorting basic data types}
A simple ascending sort is easy; just call the \method{sort()} method of a list.
\begin{verbatim}
>>> a = [5, 2, 3, 1, 4]
>>> a.sort()
>>> print a
[1, 2, 3, 4, 5]
\end{verbatim}
The \method{sort()} method takes an optional comparison function.
The default sort routine is equivalent to
\begin{verbatim}
>>> a = [5, 2, 3, 1, 4]
>>> a.sort(cmp)
>>> print a
[1, 2, 3, 4, 5]
\end{verbatim}
where \function{cmp} is the built-in function which compares two objects, \code{x} and
\code{y}, and returns -1, 0 or 1 depending on whether $x<y$, $x==y$, or $x>y$. During
the course of the sort the relationships must stay the same for the
final list to make sense.
If you want, you can define your own function for the comparison. For
integers (and numbers in general) we can do:
\begin{verbatim}
>>> def numeric_compare(x, y):
...     return x-y
...
>>> a = [5, 2, 3, 1, 4]
>>> a.sort(numeric_compare)
>>> print a
[1, 2, 3, 4, 5]
\end{verbatim}
By the way, this function won't work if the result of the subtraction
is out of range, as in \code{sys.maxint - (-1)}.
Or, if you don't want to define a new named function you can create an
anonymous one using \keyword{lambda}, as in:
\begin{verbatim}
>>> a = [5, 2, 3, 1, 4]
>>> a.sort(lambda x, y: x-y)
>>> print a
[1, 2, 3, 4, 5]
\end{verbatim}
If you want the numbers sorted in reverse you can do
\begin{verbatim}
>>> a = [5, 2, 3, 1, 4]
>>> def reverse_numeric(x, y):
...     return y-x
...
>>> a.sort(reverse_numeric)
>>> print a
[5, 4, 3, 2, 1]
\end{verbatim}
(a more general implementation could return \code{cmp(y,x)} or \code{-cmp(x,y)}).
However, it's faster if Python doesn't have to call a function for
every comparison, so if you want a reverse-sorted list of basic data
types, do the forward sort first, then use the \method{reverse()} method.
\begin{verbatim}
>>> a = [5, 2, 3, 1, 4]
>>> a.sort()
>>> a.reverse()
>>> print a
[5, 4, 3, 2, 1]
\end{verbatim}
Here's a case-insensitive string comparison using a \keyword{lambda} function:
\begin{verbatim}
>>> import string
>>> a = string.split("This is a test string from Andrew.")
>>> a.sort(lambda x, y: cmp(string.lower(x), string.lower(y)))
>>> print a
['a', 'Andrew.', 'from', 'is', 'string', 'test', 'This']
\end{verbatim}
This goes through the overhead of converting a word to lower case
every time it must be compared. At times it may be faster to compute
these once and use those values, and the following example shows how.
\begin{verbatim}
>>> words = string.split("This is a test string from Andrew.")
>>> offsets = []
>>> for i in range(len(words)):
...     offsets.append( (string.lower(words[i]), i) )
...
>>> offsets.sort()
>>> new_words = []
>>> for dontcare, i in offsets:
...     new_words.append(words[i])
...
>>> print new_words
\end{verbatim}
Each element of the \code{offsets} list is a tuple of the lower-case string
and its position in the \code{words} list. The list is then sorted. Python's
sort method sorts tuples by comparing terms; given \code{x} and \code{y}, compare
\code{x[0]} to \code{y[0]}, then \code{x[1]} to \code{y[1]}, etc. until there is a difference.
The result is that the \code{offsets} list is ordered by its first
term, and the second term can be used to figure out where the original
data was stored. (The \code{for} loop assigns \code{dontcare} and
\code{i} to the two fields of each term in the list, but we only need the
index value.)
Another way to implement this is to store the original data as the
second term in the \code{offsets} list, as in:
\begin{verbatim}
>>> words = string.split("This is a test string from Andrew.")
>>> offsets = []
>>> for word in words:
...     offsets.append( (string.lower(word), word) )
...
>>> offsets.sort()
>>> new_words = []
>>> for word in offsets:
...     new_words.append(word[1])
...
>>> print new_words
\end{verbatim}
This isn't always appropriate because the second terms in the list
(the word, in this example) will be compared when the first terms are
the same. If this happens many times, then there will be the unneeded
performance hit of comparing the two objects. This can be a large
cost if most terms are the same and the objects define their own
\method{__cmp__} method, but there will still be some overhead to determine if
\method{__cmp__} is defined.
Still, for large lists, or for lists where the comparison information
is expensive to calculate, the last two examples are likely to be the
fastest way to sort a list. It will not work on weakly sorted data,
like complex numbers, but if you don't know what that means, you
probably don't need to worry about it.
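In Python 2.4 and later, the \method{sort()} method also accepts a \code{key} function, which performs this decorate-sort-undecorate pattern for you. Here's a sketch of the case-insensitive sort from above, rewritten that way:

```python
# Python 2.4+: the "key" argument decorates, sorts, and undecorates
# internally, matching the offsets-list examples above.
words = "This is a test string from Andrew.".split()
words.sort(key=lambda w: w.lower())
```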
\section{Comparing classes}
The comparison of two basic data types, like int to int or string to
string, is built into Python and makes sense. There is a default way
to compare class instances, but the default manner isn't usually very
useful. You can define your own comparison with the \method{__cmp__} method,
as in:
\begin{verbatim}
>>> class Spam:
...     def __init__(self, spam, eggs):
...         self.spam = spam
...         self.eggs = eggs
...     def __cmp__(self, other):
...         return cmp(self.spam+self.eggs, other.spam+other.eggs)
...     def __str__(self):
...         return str(self.spam + self.eggs)
...
>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
>>> a.sort()
>>> for spam in a:
...     print str(spam)
...
5
10
12
\end{verbatim}
Sometimes you may want to sort by a specific attribute of a class. If
appropriate you should just define the \method{__cmp__} method to compare
those values, but you cannot do this if you want to compare between
different attributes at different times. Instead, you'll need to go
back to passing a comparison function to sort, as in:
\begin{verbatim}
>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
>>> a.sort(lambda x, y: cmp(x.eggs, y.eggs))
>>> for spam in a:
...     print spam.eggs, str(spam)
...
3 12
4 5
6 10
\end{verbatim}
If you want to compare two arbitrary attributes (and aren't overly
concerned about performance) you can even define your own comparison
function object. This uses the ability of a class instance to emulate
a function by defining the \method{__call__} method, as in:
\begin{verbatim}
>>> class CmpAttr:
...     def __init__(self, attr):
...         self.attr = attr
...     def __call__(self, x, y):
...         return cmp(getattr(x, self.attr), getattr(y, self.attr))
...
>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
>>> a.sort(CmpAttr("spam")) # sort by the "spam" attribute
>>> for spam in a:
...     print spam.spam, spam.eggs, str(spam)
...
1 4 5
4 6 10
9 3 12
>>> a.sort(CmpAttr("eggs")) # re-sort by the "eggs" attribute
>>> for spam in a:
...     print spam.spam, spam.eggs, str(spam)
...
9 3 12
1 4 5
4 6 10
\end{verbatim}
Of course, if you want a faster sort you can extract the attributes
into an intermediate list and sort that list.
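A sketch of that extraction, using the same \code{Spam} class (the index in the middle of each tuple breaks ties so equal attribute values never trigger a comparison of the instances themselves):

```python
# Sketch: decorate with the attribute (plus a tie-breaking index),
# sort, then pull the instances back out.
class Spam:
    def __init__(self, spam, eggs):
        self.spam = spam
        self.eggs = eggs

a = [Spam(1, 4), Spam(9, 3), Spam(4, 6)]
decorated = [(s.eggs, i, s) for i, s in enumerate(a)]
decorated.sort()
a = [s for (eggs, i, s) in decorated]
```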
So, there you have it; about a half-dozen different ways to define how
to sort a list:
\begin{itemize}
\item sort using the default method
\item sort using a comparison function
\item reverse sort not using a comparison function
\item sort on an intermediate list (two forms)
\item sort using a class's \method{__cmp__} method
\item sort using a sort function object
\end{itemize}
\end{document}
Unicode HOWTO
================
**Version 1.02**
This HOWTO discusses Python's support for Unicode, and explains various
problems that people commonly encounter when trying to work with Unicode.
Introduction to Unicode
------------------------------
History of Character Codes
''''''''''''''''''''''''''''''
In 1968, the American Standard Code for Information Interchange,
better known by its acronym ASCII, was standardized. ASCII defined
numeric codes for various characters, with the numeric values running from 0 to
127. For example, the lowercase letter 'a' is assigned 97 as its code
value.
ASCII was an American-developed standard, so it only defined
unaccented characters. There was an 'e', but no 'é' or 'Í'. This
meant that languages which required accented characters couldn't be
faithfully represented in ASCII. (Actually the missing accents matter
for English, too, which contains words such as 'naïve' and 'café', and some
publications have house styles which require spellings such as
'coöperate'.)
For a while people just wrote programs that didn't display accents. I
remember looking at Apple ][ BASIC programs, published in French-language
publications in the mid-1980s, that had lines like these::
PRINT "FICHIER EST COMPLETE."
PRINT "CARACTERE NON ACCEPTE."
Those messages should contain accents, and they just look wrong to
someone who can read French.
In the 1980s, almost all personal computers were 8-bit, meaning that
bytes could hold values ranging from 0 to 255. ASCII codes only went
up to 127, so some machines assigned values between 128 and 255 to
accented characters. Different machines had different codes, however,
which led to problems exchanging files. Eventually various commonly
used sets of values for the 128-255 range emerged. Some were true
standards, defined by the International Organization for Standardization, and
some were **de facto** conventions that were invented by one company
or another and managed to catch on.
256 characters aren't very many. For example, you can't fit
both the accented characters used in Western Europe and the Cyrillic
alphabet used for Russian into the 128-255 range because there are more than
128 such characters.
You could write files using different codes (all your Russian
files in a coding system called KOI8, all your French files in
a different coding system called Latin1), but what if you wanted
to write a French document that quotes some Russian text? In the
1980s people began to want to solve this problem, and the Unicode
standardization effort began.
Unicode started out using 16-bit characters instead of 8-bit characters. 16
bits means you have 2^16 = 65,536 distinct values available, making it
possible to represent many different characters from many different
alphabets; an initial goal was to have Unicode contain the alphabets for
every single human language. It turns out that even 16 bits isn't enough to
meet that goal, and the modern Unicode specification uses a wider range of
codes, 0-1,114,111 (0x10ffff in base-16).
There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were
originally separate efforts, but the specifications were merged with
the 1.1 revision of Unicode.
(This discussion of Unicode's history is highly simplified. I don't
think the average Python programmer needs to worry about the
historical details; consult the Unicode consortium site listed in the
References for more information.)
Definitions
''''''''''''''''''''''''
A **character** is the smallest possible component of a text. 'A',
'B', 'C', etc., are all different characters. So are 'È' and
'Í'. Characters are abstractions, and vary depending on the
language or context you're talking about. For example, the symbol for
ohms (Ω) is usually drawn much like the capital letter
omega (Ω) in the Greek alphabet (they may even be the same in
some fonts), but these are two different characters that have
different meanings.
The Unicode standard describes how characters are represented by
**code points**. A code point is an integer value, usually denoted in
base 16. In the standard, a code point is written using the notation
U+12ca to mean the character with value 0x12ca (4810 decimal). The
Unicode standard contains a lot of tables listing characters and their
corresponding code points::
0061 'a'; LATIN SMALL LETTER A
0062 'b'; LATIN SMALL LETTER B
0063 'c'; LATIN SMALL LETTER C
...
007B '{'; LEFT CURLY BRACKET
Strictly, these definitions imply that it's meaningless to say 'this is
character U+12ca'. U+12ca is a code point, which represents some particular
character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'.
In informal contexts, this distinction between code points and characters will
sometimes be forgotten.
A character is represented on a screen or on paper by a set of graphical
elements that's called a **glyph**. The glyph for an uppercase A, for
example, is two diagonal strokes and a horizontal stroke, though the exact
details will depend on the font being used. Most Python code doesn't need
to worry about glyphs; figuring out the correct glyph to display is
generally the job of a GUI toolkit or a terminal's font renderer.
Encodings
'''''''''
To summarize the previous section:
a Unicode string is a sequence of code points, which are
numbers from 0 to 0x10ffff. This sequence needs to be represented as
a set of bytes (meaning, values from 0-255) in memory. The rules for
translating a Unicode string into a sequence of bytes are called an
**encoding**.
The first encoding you might think of is an array of 32-bit integers.
In this representation, the string "Python" would look like this::
   P           y           t           h           o           n
0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
This representation is straightforward but using
it presents a number of problems.
1. It's not portable; different processors order the bytes
differently.
2. It's very wasteful of space. In most texts, the majority of the code
points are less than 127, or less than 255, so a lot of space is occupied
by zero bytes. The above string takes 24 bytes compared to the 6
bytes needed for an ASCII representation. Increased RAM usage doesn't
matter too much (desktop computers have megabytes of RAM, and strings
aren't usually that large), but expanding our usage of disk and
network bandwidth by a factor of 4 is intolerable.
3. It's not compatible with existing C functions such as ``strlen()``,
so a new family of wide string functions would need to be used.
4. Many Internet standards are defined in terms of textual data, and
can't handle content with embedded zero bytes.
Generally people don't use this encoding, choosing other encodings
that are more efficient and convenient.
Encodings don't have to handle every possible Unicode character, and
most encodings don't. For example, Python's default encoding is the
'ascii' encoding. The rules for converting a Unicode string into the
ASCII encoding are simple; for each code point:
1. If the code point is <128, each byte is the same as the value of the
code point.
2. If the code point is 128 or greater, the Unicode string can't
be represented in this encoding. (Python raises a
``UnicodeEncodeError`` exception in this case.)
Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode
code points 0-255 are identical to the Latin-1 values, so converting
to this encoding simply requires converting code points to byte
values; if a code point larger than 255 is encountered, the string
can't be encoded into Latin-1.
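A quick sketch of both behaviours (the example string is arbitrary):

```python
# Latin-1: code points 0-255 map straight to byte values.
s = u'caf\u00e9'                  # ends with U+00E9, 'é'
b = s.encode('latin-1')           # one byte per character
roundtrip = b.decode('latin-1')   # and straight back

# A code point above 255 can't be encoded into Latin-1.
try:
    u'\u20ac'.encode('latin-1')   # U+20AC, the euro sign
    failed = False
except UnicodeEncodeError:
    failed = True
```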
Encodings don't have to be simple one-to-one mappings like Latin-1.
Consider IBM's EBCDIC, which was used on IBM mainframes. Letter
values weren't in one block: 'a' through 'i' had values from 129 to
137, but 'j' through 'r' were 145 through 153. If you wanted to use
EBCDIC as an encoding, you'd probably use some sort of lookup table to
perform the conversion, but this is largely an internal detail.
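A sketch of such a table-driven conversion, built from just the handful of EBCDIC values quoted above (this is not a complete or real codec):

```python
# Tiny, incomplete letter-to-EBCDIC table from the values above:
# 'a'..'i' are 129..137, 'j'..'r' are 145..153.
table = {}
for i, ch in enumerate('abcdefghi'):
    table[ch] = 129 + i
for i, ch in enumerate('jklmnopqr'):
    table[ch] = 145 + i

# Encoding is then just a lookup per character.
encoded = bytes(table[ch] for ch in 'jar')
```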
UTF-8 is one of the most commonly used encodings. UTF stands for
"Unicode Transformation Format", and the '8' means that 8-bit numbers
are used in the encoding. (There's also a UTF-16 encoding, but it's
less frequently used than UTF-8.) UTF-8 uses the following rules:
1. If the code point is <128, it's represented by the corresponding byte value.
2. If the code point is between 128 and 0x7ff, it's turned into two byte values
between 128 and 255.
3. Code points >0x7ff are turned into three- or four-byte sequences, where
each byte of the sequence is between 128 and 255.
UTF-8 has several convenient properties:
1. It can handle any Unicode code point.
2. A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as ``strcpy()`` and sent through protocols that can't handle zero bytes.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters can be represented with one or two bytes, and values less than 128 occupy only a single byte.
5. If bytes are corrupted or lost, it's possible to determine the start of the next UTF-8-encoded code point and resynchronize. It's also unlikely that random 8-bit data will look like valid UTF-8.
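The three rules can be checked directly; this sketch (with arbitrarily chosen characters) also shows why ASCII text is already valid UTF-8:

```python
# Rule 1: code points below 128 encode to the single corresponding byte,
# so a string of ASCII text is unchanged by UTF-8 encoding.
one = u'a'.encode('utf-8')

# Rule 2: code points 128..0x7ff become two bytes in the 128-255 range.
two = u'\u00e9'.encode('utf-8')       # U+00E9, 'é'

# Rule 3: code points above 0x7ff become three or four bytes.
three = u'\u20ac'.encode('utf-8')     # U+20AC, the euro sign
```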
References
''''''''''''''
The Unicode Consortium site at <http://www.unicode.org> has character
charts, a glossary, and PDF versions of the Unicode specification. Be
prepared for some difficult reading.
<http://www.unicode.org/history/> is a chronology of the origin and
development of Unicode.
To help understand the standard, Jukka Korpela has written an
introductory guide to reading the Unicode character tables,
available at <http://www.cs.tut.fi/~jkorpela/unicode/guide.html>.
Roman Czyborra wrote another explanation of Unicode's basic principles;
it's at <http://czyborra.com/unicode/characters.html>.
Czyborra has written a number of other Unicode-related documents,
available from <http://www.czyborra.com>.
Two other good introductory articles were written by Joel Spolsky
<http://www.joelonsoftware.com/articles/Unicode.html> and Jason
Orendorff <http://www.jorendorff.com/articles/unicode/>. If this
introduction didn't make things clear to you, you should try reading
one of these alternate articles before continuing.
Wikipedia entries are often helpful; see the entries for "character
encoding" <http://en.wikipedia.org/wiki/Character_encoding> and UTF-8
<http://en.wikipedia.org/wiki/UTF-8>, for example.
Python's Unicode Support
------------------------
Now that you've learned the rudiments of Unicode, we can look at
Python's Unicode features.
The Unicode Type
'''''''''''''''''''
Unicode strings are expressed as instances of the ``unicode`` type,
one of Python's repertoire of built-in types. It derives from an
abstract type called ``basestring``, which is also an ancestor of the
``str`` type; you can therefore check if a value is a string type with
``isinstance(value, basestring)``. Under the hood, Python represents
Unicode strings as either 16- or 32-bit integers, depending on how the
Python interpreter was compiled.
The ``unicode()`` constructor has the signature ``unicode(string[, encoding, errors])``.
All of its arguments should be 8-bit strings. The first argument is converted
to Unicode using the specified encoding; if you leave off the ``encoding`` argument,
the ASCII encoding is used for the conversion, so characters greater than 127 will
be treated as errors::
>>> unicode('abcdef')
u'abcdef'
>>> s = unicode('abcdef')
>>> type(s)
<type 'unicode'>
>>> unicode('abcdef' + chr(255))
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
ordinal not in range(128)
The ``errors`` argument specifies the response when the input string can't be converted according to the encoding's rules. Legal values for this argument
are 'strict' (raise a ``UnicodeDecodeError`` exception),
'replace' (add U+FFFD, 'REPLACEMENT CHARACTER'),
or 'ignore' (just leave the character out of the Unicode result).
The following examples show the differences::
>>> unicode('\x80abc', errors='strict')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
ordinal not in range(128)
>>> unicode('\x80abc', errors='replace')
u'\ufffdabc'
>>> unicode('\x80abc', errors='ignore')
u'abc'
Encodings are specified as strings containing the encoding's name.
Python 2.4 comes with roughly 100 different encodings; see the Python
Library Reference at
<http://docs.python.org/lib/standard-encodings.html> for a list. Some
encodings have multiple names; for example, 'latin-1', 'iso_8859_1'
and '8859' are all synonyms for the same encoding.
One-character Unicode strings can also be created with the
``unichr()`` built-in function, which takes integers and returns a
Unicode string of length 1 that contains the corresponding code point.
The reverse operation is the built-in ``ord()`` function that takes a
one-character Unicode string and returns the code point value::
>>> unichr(40960)
u'\ua000'
>>> ord(u'\ua000')
40960
Instances of the ``unicode`` type have many of the same methods as
the 8-bit string type for operations such as searching and formatting::
>>> s = u'Was ever feather so lightly blown to and fro as this multitude?'
>>> s.count('e')
5
>>> s.find('feather')
9
>>> s.find('bird')
-1
>>> s.replace('feather', 'sand')
u'Was ever sand so lightly blown to and fro as this multitude?'
>>> s.upper()
u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?'
Note that the arguments to these methods can be Unicode strings or 8-bit strings.
8-bit strings will be converted to Unicode before carrying out the operation;
Python's default ASCII encoding will be used, so characters greater than 127 will cause an exception::
>>> s.find('Was\x9f')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3: ordinal not in range(128)
>>> s.find(u'Was\x9f')
-1
Much Python code that operates on strings will therefore work with
Unicode strings without requiring any changes to the code. (Input and
output code needs more updating for Unicode; more on this later.)
Another important method is ``.encode([encoding], [errors='strict'])``,
which returns an 8-bit string version of the
Unicode string, encoded in the requested encoding. The ``errors``
parameter is the same as the parameter of the ``unicode()``
constructor, with one additional possibility; as well as 'strict',
'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which
uses XML's character references. The following example shows the
different results::
>>> u = unichr(40960) + u'abcd' + unichr(1972)
>>> u.encode('utf-8')
'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
'abcd'
>>> u.encode('ascii', 'replace')
'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
'&#40960;abcd&#1972;'
Python's 8-bit strings have a ``.decode([encoding], [errors])`` method
that interprets the string using the given encoding::
>>> u = unichr(40960) + u'abcd' + unichr(1972) # Assemble a string
>>> utf8_version = u.encode('utf-8') # Encode as UTF-8
>>> type(utf8_version), utf8_version
(<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
>>> u2 = utf8_version.decode('utf-8') # Decode using UTF-8
>>> u == u2 # The two strings match
True
The low-level routines for registering and accessing the available
encodings are found in the ``codecs`` module. However, the encoding
and decoding functions returned by this module are usually more
low-level than is comfortable, so I'm not going to describe the
``codecs`` module here. If you need to implement a completely new
encoding, you'll need to learn about the ``codecs`` module interfaces,
but implementing encodings is a specialized task that also won't be
covered here. Consult the Python documentation to learn more about
this module.
The most commonly used part of the ``codecs`` module is the
``codecs.open()`` function which will be discussed in the section
on input and output.
Unicode Literals in Python Source Code
''''''''''''''''''''''''''''''''''''''''''
In Python source code, Unicode literals are written as strings
prefixed with the 'u' or 'U' character: ``u'abcdefghijk'``. Specific
code points can be written using the ``\u`` escape sequence, which is
followed by four hex digits giving the code point. The ``\U`` escape
sequence is similar, but expects 8 hex digits, not 4.
Unicode literals can also use the same escape sequences as 8-bit
strings, including ``\x``, but ``\x`` only takes two hex digits so it
can't express an arbitrary code point. Octal escapes can go up to
U+01ff, which is octal 777.
::
>>> s = u"a\xac\u1234\u20ac\U00008000"
           ^^^^ two-digit hex escape
               ^^^^^^ four-digit Unicode escape
                           ^^^^^^^^^^ eight-digit Unicode escape
>>> for c in s: print ord(c),
...
97 172 4660 8364 32768
Using escape sequences for code points greater than 127 is fine in
small doses, but becomes an annoyance if you're using many accented
characters, as you would in a program with messages in French or some
other accent-using language. You can also assemble strings using the
``unichr()`` built-in function, but this is even more tedious.
Ideally, you'd want to be able to write literals in your language's
natural encoding. You could then edit Python source code with your
favorite editor which would display the accented characters naturally,
and have the right characters used at runtime.
Python supports writing Unicode literals in any encoding, but you have
to declare the encoding being used. This is done by including a
special comment as either the first or second line of the source
file::
#!/usr/bin/env python
# -*- coding: latin-1 -*-
u = u'abcdé'
print ord(u[-1])
The syntax is inspired by Emacs's notation for specifying variables local to a file.
Emacs supports many different variables, but Python only supports 'coding'.
The ``-*-`` symbols indicate that the comment is special; within them,
you must supply the name ``coding`` and the name of your chosen encoding,
separated by ``':'``.
If you don't include such a comment, the default encoding used will be
ASCII. Versions of Python before 2.4 were Euro-centric and assumed
Latin-1 as a default encoding for string literals; in Python 2.4,
characters greater than 127 still work but result in a warning. For
example, the following program has no encoding declaration::
#!/usr/bin/env python
u = u'abcdé'
print ord(u[-1])
When you run it with Python 2.4, it will output the following warning::
amk:~$ python p263.py
sys:1: DeprecationWarning: Non-ASCII character '\xe9'
in file p263.py on line 2, but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
Unicode Properties
'''''''''''''''''''
The Unicode specification includes a database of information about
code points. For each code point that's defined, the information
includes the character's name, its category, the numeric value if
applicable (Unicode has characters representing the Roman numerals and
fractions such as one-third and four-fifths). There are also
properties related to the code point's use in bidirectional text and
other display-related properties.
The following program displays some information about several
characters, and prints the numeric value of one particular character::
import unicodedata
u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231)
for i, c in enumerate(u):
print i, '%04x' % ord(c), unicodedata.category(c),
print unicodedata.name(c)
# Get numeric value of second character
print unicodedata.numeric(u[1])
When run, this prints::
0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
1 0bf2 No TAMIL NUMBER ONE THOUSAND
2 0f84 Mn TIBETAN MARK HALANTA
3 1770 Lo TAGBANWA LETTER SA
4 33af So SQUARE RAD OVER S SQUARED
1000.0
The category codes are abbreviations describing the nature of the
character. These are grouped into categories such as "Letter",
"Number", "Punctuation", or "Symbol", which in turn are broken up into
subcategories. Taking the codes from the above output as examples, ``'Ll'``
means 'Letter, lowercase', ``'No'`` means "Number, other", ``'Mn'`` is
"Mark, nonspacing", and ``'So'`` is "Symbol, other". See
<http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values>
for a list of category codes.
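Because every category code begins with a letter naming the major class,
you can classify characters coarsely without matching each subcategory.
The following sketch (the sample characters are arbitrary) illustrates this::

```python
import unicodedata

# The first letter of the category code gives the major class:
# 'L' = Letter, 'N' = Number, 'P' = Punctuation, 'S' = Symbol, etc.
for ch in u'A1,\u00e9':
    cat = unicodedata.category(ch)
    print('%04x %s major class: %s' % (ord(ch), cat, cat[0]))
```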
References
''''''''''''''
The Unicode and 8-bit string types are described in the Python library
reference at <http://docs.python.org/lib/typesseq.html>.
The documentation for the ``unicodedata`` module is at
<http://docs.python.org/lib/module-unicodedata.html>.
The documentation for the ``codecs`` module is at
<http://docs.python.org/lib/module-codecs.html>.
Marc-André Lemburg gave a presentation at EuroPython 2002
titled "Python and Unicode". A PDF version of his slides
is available at <http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf>,
and is an excellent overview of the design of Python's Unicode features.
Reading and Writing Unicode Data
----------------------------------------
Once you've written some code that works with Unicode data, the next
problem is input/output. How do you get Unicode strings into your
program, and how do you convert Unicode into a form suitable for
storage or transmission?
It's possible that you may not need to do anything depending on your
input sources and output destinations; you should check whether the
libraries used in your application support Unicode natively. XML
parsers often return Unicode data, for example. Many relational
databases also support Unicode-valued columns and can return Unicode
values from an SQL query.
Unicode data is usually converted to a particular encoding before it
gets written to disk or sent over a socket. It's possible to do all
the work yourself: open a file, read an 8-bit string from it, and
convert the string with ``unicode(str, encoding)``. However, the
manual approach is not recommended.
One problem is the multi-byte nature of encodings; one Unicode
character can be represented by several bytes. If you want to read
the file in arbitrary-sized chunks (say, 1K or 4K), you need to write
error-handling code to catch the case where only part of the bytes
encoding a single Unicode character are read at the end of a chunk.
One solution would be to read the entire file into memory and then
perform the decoding, but that prevents you from working with files
that are extremely large; if you need to read a 2 GB file, you need 2 GB
of RAM. (More, really, since for at least a moment you'd need to have
both the encoded string and its Unicode version in memory.)
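The boundary problem can be demonstrated with the incremental decoder
interface that newer versions of the ``codecs`` module provide; this is a
small sketch of the mechanism, not something you would normally write by
hand::

```python
import codecs

# The UTF-8 encoding of u'\u00e9' takes two bytes, so splitting the
# stream after byte 4 leaves a partial character at the chunk boundary.
data = u'caf\u00e9'.encode('utf-8')
chunk1, chunk2 = data[:4], data[4:]

decoder = codecs.getincrementaldecoder('utf-8')()
text = decoder.decode(chunk1)             # incomplete byte is buffered
text += decoder.decode(chunk2, final=True)
print(repr(text))
```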
The solution is to use the low-level decoding interface to catch
the case of partial byte sequences. The work of implementing this
has already been done for you: the ``codecs`` module includes a
version of the ``open()`` function that returns a file-like object
that assumes the file's contents are in a specified encoding and
accepts Unicode parameters for methods such as ``.read()`` and
``.write()``.
The function's parameters are
``open(filename, mode='rb', encoding=None, errors='strict', buffering=1)``. ``mode`` can be
``'r'``, ``'w'``, or ``'a'``, just like the corresponding parameter to the
regular built-in ``open()`` function; add a ``'+'`` to
update the file. ``buffering`` is similarly
parallel to the standard function's parameter.
``encoding`` is a string giving
the encoding to use; if it's left as ``None``, a regular Python file
object that accepts 8-bit strings is returned. Otherwise, a wrapper
object is returned, and data written to or read from the wrapper
object will be converted as needed. ``errors`` specifies the action
for encoding errors and can be one of the usual values of 'strict',
'ignore', and 'replace'.
Reading Unicode from a file is therefore simple::
import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
print repr(line)
It's also possible to open files in update mode,
allowing both reading and writing::
f = codecs.open('test', encoding='utf-8', mode='w+')
f.write(u'\u4500 blah blah blah\n')
f.seek(0)
print repr(f.readline()[:1])
f.close()
Unicode character U+FEFF is used as a byte-order mark (BOM),
and is often written as the first character of a file in order
to assist with autodetection of the file's byte ordering.
Some encodings, such as UTF-16, expect a BOM to be present at
the start of a file; when such an encoding is used,
the BOM will be automatically written as the first character
and will be silently dropped when the file is read. There are
variants of these encodings, such as 'utf-16-le' and 'utf-16-be'
for little-endian and big-endian encodings, that specify
one particular byte ordering and don't
skip the BOM.
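This behavior can be observed without touching a file by encoding a short
string; note that the exact BOM bytes depend on the platform's native byte
order::

```python
s = u'abc'

# The generic 'utf-16' codec writes a BOM reflecting the native
# byte order, and decoding consumes it again.
encoded = s.encode('utf-16')
print(repr(encoded[:2]))                 # the BOM bytes
print(repr(encoded.decode('utf-16')))    # BOM is dropped on decoding

# The endian-specific variants write no BOM at all.
print(repr(s.encode('utf-16-le')))
```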
Unicode filenames
'''''''''''''''''''''''''
Most of the operating systems in common use today support filenames
that contain arbitrary Unicode characters. Usually this is
implemented by converting the Unicode string into some encoding that
varies depending on the system. For example, Mac OS X uses UTF-8 while
Windows uses a configurable encoding; on Windows, Python uses the name
"mbcs" to refer to whatever the currently configured encoding is. On
Unix systems, there will only be a filesystem encoding if you've set
the ``LANG`` or ``LC_CTYPE`` environment variables; if you haven't,
the default encoding is ASCII.
The ``sys.getfilesystemencoding()`` function returns the encoding to
use on your current system, in case you want to do the encoding
manually, but there's not much reason to bother. When opening a file
for reading or writing, you can usually just provide the Unicode
string as the filename, and it will be automatically converted to the
right encoding for you::
filename = u'filename\u4500abc'
f = open(filename, 'w')
f.write('blah\n')
f.close()
Functions in the ``os`` module such as ``os.stat()`` will also accept
Unicode filenames.
``os.listdir()``, which returns filenames, raises an issue: should it
return the Unicode version of filenames, or should it return 8-bit
strings containing the encoded versions? ``os.listdir()`` will do
both, depending on whether you provided the directory path as an 8-bit
string or a Unicode string. If you pass a Unicode string as the path,
filenames will be decoded using the filesystem's encoding and a list
of Unicode strings will be returned, while passing an 8-bit path will
return the 8-bit versions of the filenames. For example, assuming the
default filesystem encoding is UTF-8, running the following program::
fn = u'filename\u4500abc'
f = open(fn, 'w')
f.close()
import os
print os.listdir('.')
print os.listdir(u'.')
will produce the following output::
amk:~$ python t.py
['.svn', 'filename\xe4\x94\x80abc', ...]
[u'.svn', u'filename\u4500abc', ...]
The first list contains UTF-8-encoded filenames, and the second list
contains the Unicode versions.
Tips for Writing Unicode-aware Programs
''''''''''''''''''''''''''''''''''''''''''''
This section provides some suggestions on writing software that
deals with Unicode.
The most important tip is:
Software should only work with Unicode strings internally,
converting to a particular encoding on output.
If you attempt to write processing functions that accept both
Unicode and 8-bit strings, you will find your program vulnerable to
bugs wherever you combine the two different kinds of strings. Python's
default encoding is ASCII, so whenever a character with a value >127
is in the input data, you'll get a ``UnicodeDecodeError``
because that character can't be handled by the ASCII encoding.
It's easy to miss such problems if you only test your software
with data that doesn't contain any
accents; everything will seem to work, but there's actually a bug in your
program waiting for the first user who attempts to use characters >127.
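The failure is easy to reproduce directly: decoding bytes above 127 with
the ASCII codec raises the error, and in Python 2 the same error appears
implicitly whenever such an 8-bit string is combined with a Unicode
string. (The ``b''`` literal prefix shown here requires Python 2.6 or
later.) ::

```python
data = b'caf\xe9'         # a Latin-1 encoded 8-bit string
try:
    data.decode('ascii')  # the conversion Python 2 attempts implicitly
except UnicodeDecodeError as exc:
    print(exc)
```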
A second tip, therefore, is:
Include characters >127 and, even better, characters >255 in your
test data.
When using data coming from a web browser or some other untrusted source,
a common technique is to check for illegal characters in a string
before using the string in a generated command line or storing it in a
database. If you're doing this, be careful to check
the string once it's in the form that will be used or stored; it's
possible for encodings to be used to disguise characters. This is especially
true if the input data also specifies the encoding;
many encodings leave the commonly checked-for characters alone,
but Python includes some encodings such as ``'base64'``
that modify every single character.
For example, let's say you have a content management system that takes a
Unicode filename, and you want to disallow paths with a '/' character.
You might write this code::
def read_file (filename, encoding):
if '/' in filename:
raise ValueError("'/' not allowed in filenames")
unicode_name = filename.decode(encoding)
f = open(unicode_name, 'r')
# ... return contents of file ...
However, if an attacker could specify the ``'base64'`` encoding,
they could pass ``'L2V0Yy9wYXNzd2Q='``, which is the base-64
encoded form of the string ``'/etc/passwd'``, to read a
system file. The above code looks for ``'/'`` characters
in the encoded form and misses the dangerous character
in the resulting decoded form.
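A safer version of this hypothetical function, sketched below, performs
the check only after decoding, on the string that will actually be passed
to ``open()``::

```python
def read_file(filename, encoding):
    # Decode first; check the form that will actually reach open().
    unicode_name = filename.decode(encoding)
    if u'/' in unicode_name:
        raise ValueError("'/' not allowed in filenames")
    f = open(unicode_name, 'r')
    # ... return contents of file ...
```

Now the dangerous character is caught no matter which encoding the
attacker specifies, because the test runs on the decoded form.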
References
''''''''''''''
The PDF slides for Marc-André Lemburg's presentation "Writing
Unicode-aware Applications in Python" are available at
<http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
and discuss questions of character encodings as well as how to
internationalize and localize an application.
Revision History and Acknowledgements
------------------------------------------
Thanks to the following people who have noted errors or offered
suggestions on this article: Nicholas Bastin,
Marius Gedminas, Kent Johnson, Ken Krugler,
Marc-André Lemburg, Martin von Löwis.
Version 1.0: posted August 5 2005.
Version 1.01: posted August 7 2005. Corrects factual and markup
errors; adds several links.
Version 1.02: posted August 16 2005. Corrects factual errors.
.. comment Additional topic: building Python w/ UCS2 or UCS4 support
.. comment Describe obscure -U switch somewhere?
.. comment
Original outline:
- [ ] Unicode introduction
- [ ] ASCII
- [ ] Terms
- [ ] Character
- [ ] Code point
- [ ] Encodings
- [ ] Common encodings: ASCII, Latin-1, UTF-8
- [ ] Unicode Python type
- [ ] Writing unicode literals
- [ ] Obscurity: -U switch
- [ ] Built-ins
- [ ] unichr()
- [ ] ord()
- [ ] unicode() constructor
- [ ] Unicode type
- [ ] encode(), decode() methods
- [ ] Unicodedata module for character properties
- [ ] I/O
- [ ] Reading/writing Unicode data into files
- [ ] Byte-order marks
- [ ] Unicode filenames
- [ ] Writing Unicode programs
- [ ] Do everything in Unicode
- [ ] Declaring source code encodings (PEP 263)
- [ ] Other issues
- [ ] Building Python (UCS2, UCS4)