\section{\module{re} ---
         Perl-style regular expression operations.}
\declaremodule{standard}{re}
\moduleauthor{Andrew M. Kuchling}{amk1@bigfoot.com}
\sectionauthor{Andrew M. Kuchling}{amk1@bigfoot.com}


\modulesynopsis{Perl-style regular expression search and match
operations.}


This module provides regular expression matching operations similar to
those found in Perl.  It's 8-bit clean: the strings being processed
may contain both null bytes and characters whose high bit is set.  Regular
expression pattern strings may not contain null bytes, but can specify
the null byte using the \code{\e\var{number}} notation.
Characters with the high bit set may be included.  The \module{re}
module is always available.

Regular expressions use the backslash character (\character{\e}) to
indicate special forms or to allow special characters to be used
without invoking their special meaning.  This collides with Python's
usage of the same character for the same purpose in string literals;
for example, to match a literal backslash, one might have to write
\code{'\e\e\e\e'} as the pattern string, because the regular expression
must be \samp{\e\e}, and each backslash must be expressed as
\samp{\e\e} inside a regular Python string literal.

The solution is to use Python's raw string notation for regular
expression patterns; backslashes are not handled in any special way in
a string literal prefixed with \character{r}.  So \code{r"\e n"} is a
two-character string containing \character{\e} and \character{n},
while \code{"\e n"} is a one-character string containing a newline.
Usually patterns will be expressed in Python code using this raw
string notation.
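
As a rough illustration (a sketch that assumes a current Python 3
interpreter; the variable names are made up for this example), the
following compares the two spellings of the same pattern and checks
that it matches a single literal backslash:

```python
import re

# Without a raw string, each backslash meant for the regex engine
# must first be doubled for Python's string-literal parser.
escaped = '\\\\'      # a Python string holding two backslashes
raw = r'\\'           # the same two-character string, written raw
print(escaped == raw)

# The two-backslash pattern matches one literal backslash in the text.
text = 'C:\\temp'     # this text contains a single backslash
found = re.search(raw, text)
print(found.group(0))

# r"\n" is two characters; "\n" is a single newline character.
print(len(r"\n"), len("\n"))
```

Both spellings compile to the same pattern; the raw form is simply
easier to read and to get right.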

\subsection{Regular Expression Syntax \label{re-syntax}}

A regular expression (or RE) specifies a set of strings that matches
it; the functions in this module let you check if a particular string
matches a given regular expression (or if a given regular expression
matches a particular string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular
expressions; if \emph{A} and \emph{B} are both regular expressions,
then \emph{AB} is also a regular expression.  If a string \emph{p}
matches A and another string \emph{q} matches B, the string \emph{pq}
will match AB.  Thus, complex expressions can easily be constructed
from simpler primitive expressions like the ones described here.  For
details of the theory and implementation of regular expressions,
consult the Friedl book referenced below, or almost any textbook about
compiler construction.

A brief explanation of the format of regular expressions follows.  For
further information and a gentler presentation, consult the Regular
Expression HOWTO, accessible from \url{http://www.python.org/doc/howto/}.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like \character{A}, \character{a}, or \character{0},
are the simplest regular expressions; they simply match themselves.
You can concatenate ordinary characters, so \regexp{last} matches the
string \code{'last'}.  (In the rest of this section, we'll write RE's in
\regexp{this special style}, usually without quotes, and strings to be
matched \code{'in single quotes'}.)

Some characters, like \character{|} or \character{(}, are special.  Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted.

The special characters are:
\begin{list}{}{\leftmargin 0.7in \labelwidth 0.65in}

\item[\character{.}] (Dot.)  In the default mode, this matches any
character except a newline.  If the \constant{DOTALL} flag has been
specified, this matches any character including a newline.

\item[\character{\^}] (Caret.)  Matches the start of the string, and in
\constant{MULTILINE} mode also matches immediately after each newline.

\item[\character{\$}] Matches the end of the string, and in
\constant{MULTILINE} mode also matches before a newline.
\regexp{foo} matches both 'foo' and 'foobar', while the regular
expression \regexp{foo\$} matches only 'foo'.

\item[\character{*}] Causes the resulting RE to
match 0 or more repetitions of the preceding RE, as many repetitions
as are possible.  \regexp{ab*} will
match 'a', 'ab', or 'a' followed by any number of 'b's.

\item[\character{+}] Causes the
resulting RE to match 1 or more repetitions of the preceding RE.
\regexp{ab+} will match 'a' followed by any non-zero number of 'b's; it
will not match just 'a'.

\item[\character{?}] Causes the resulting RE to
match 0 or 1 repetitions of the preceding RE.  \regexp{ab?} will
match either 'a' or 'ab'.

\item[\code{*?}, \code{+?}, \code{??}] The \character{*}, \character{+}, and
\character{?} qualifiers are all \dfn{greedy}; they match as much text as
possible.  Sometimes this behaviour isn't desired; if the RE
\regexp{<.*>} is matched against \code{'<H1>title</H1>'}, it will match the
entire string, and not just \code{'<H1>'}.
Adding \character{?} after the qualifier makes it perform the match in
\dfn{non-greedy} or \dfn{minimal} fashion; as \emph{few} characters as
possible will be matched.  Using \regexp{.*?} in the previous
expression will match only \code{'<H1>'}.

\item[\code{\{\var{m},\var{n}\}}] Causes the resulting RE to match from
\var{m} to \var{n} repetitions of the preceding RE, attempting to
match as many repetitions as possible.  For example, \regexp{a\{3,5\}}
will match from 3 to 5 \character{a} characters.  Omitting \var{n}
specifies an infinite upper bound; you can't omit \var{m}.

\item[\code{\{\var{m},\var{n}\}?}] Causes the resulting RE to
match from \var{m} to \var{n} repetitions of the preceding RE,
attempting to match as \emph{few} repetitions as possible.  This is
the non-greedy version of the previous qualifier.  For example, on the
6-character string \code{'aaaaaa'}, \regexp{a\{3,5\}} will match 5
\character{a} characters, while \regexp{a\{3,5\}?} will only match 3
characters.

\item[\character{\e}] Either escapes special characters (permitting
you to match characters like \character{*}, \character{?}, and so
forth), or signals a special sequence; special sequences are discussed
below.

If you're not using a raw string to
express the pattern, remember that Python also uses the
backslash as an escape sequence in string literals; if the escape
sequence isn't recognized by Python's parser, the backslash and
subsequent character are included in the resulting string.  However,
if Python would recognize the resulting sequence, the backslash should
be doubled.  This is complicated and hard to understand, so
it's highly recommended that you use raw strings for all but the
simplest expressions.

\item[\code{[]}] Used to indicate a set of characters.  Characters can
be listed individually, or a range of characters can be indicated by
giving two characters and separating them by a \character{-}.  Special
characters are not active inside sets.  For example, \regexp{[akm\$]}
will match any of the characters \character{a}, \character{k},
\character{m}, or \character{\$}; \regexp{[a-z]}
will match any lowercase letter, and \code{[a-zA-Z0-9]} matches any
letter or digit.  Character classes such as \code{\e w} or \code{\e S}
(defined below) are also acceptable inside a set.  If you want to
include a \character{]} or a \character{-} inside a set, precede it with a
backslash, or place it as the first character.  The
pattern \regexp{[]]} will match \code{']'}, for example.

You can match the characters not within a range by \dfn{complementing}
the set.  This is indicated by including a
\character{\^} as the first character of the set; \character{\^} elsewhere will
simply match the \character{\^} character.  For example, \regexp{[{\^}5]}
will match any character except \character{5}.

\item[\character{|}]\code{A|B}, where A and B can be arbitrary REs,
creates a regular expression that will match either A or B.  This can
be used inside groups (see below) as well.  To match a literal \character{|},
use \regexp{\e|}, or enclose it inside a character class, as in  \regexp{[|]}.

\item[\code{(...)}] Matches whatever regular expression is inside the
parentheses, and indicates the start and end of a group; the contents
of a group can be retrieved after a match has been performed, and can
be matched later in the string with the \regexp{\e \var{number}} special
sequence, described below.  To match the literals \character{(} or
\character{)}, use \regexp{\e(} or \regexp{\e)}, or enclose them
inside a character class: \regexp{[(] [)]}.

\item[\code{(?...)}] This is an extension notation (a \character{?}
following a \character{(} is not meaningful otherwise).  The first
character after the \character{?}
determines what the meaning and further syntax of the construct is.
Extensions usually do not create a new group;
\regexp{(?P<\var{name}>...)} is the only exception to this rule.
Following are the currently supported extensions.

\item[\code{(?iLmsx)}] (One or more letters from the set \character{i},
\character{L}, \character{m}, \character{s}, \character{x}.)  The group matches
the empty string; the letters set the corresponding flags
(\constant{re.I}, \constant{re.L}, \constant{re.M}, \constant{re.S},
\constant{re.X}) for the entire regular expression.  This is useful if
you wish to include the flags as part of the regular expression, instead
of passing a \var{flag} argument to the \function{compile()} function.

\item[\code{(?:...)}] A non-grouping version of regular parentheses.
Matches whatever regular expression is inside the parentheses, but the
substring matched by the
group \emph{cannot} be retrieved after performing a match or
referenced later in the pattern.

\item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but
the substring matched by the group is accessible via the symbolic group
name \var{name}.  Group names must be valid Python identifiers.  A
symbolic group is also a numbered group, just as if the group were not
named.  So the group named 'id' in the example below can also be
referenced as the numbered group 1.

For example, if the pattern is
\regexp{(?P<id>[a-zA-Z_]\e w*)}, the group can be referenced by its
name in arguments to methods of match objects, such as \code{m.group('id')}
or \code{m.end('id')}, and also by name in pattern text
(e.g. \regexp{(?P=id)}) and replacement text (e.g. \code{\e g<id>}).

\item[\code{(?P=\var{name})}] Matches whatever text was matched by the
earlier group named \var{name}.

\item[\code{(?\#...)}] A comment; the contents of the parentheses are
simply ignored.

\item[\code{(?=...)}] Matches if \regexp{...} matches next, but doesn't
consume any of the string.  This is called a lookahead assertion.  For
example, \regexp{Isaac (?=Asimov)} will match \code{'Isaac~'} only if it's
followed by \code{'Asimov'}.

\item[\code{(?!...)}] Matches if \regexp{...} doesn't match next.  This
is a negative lookahead assertion.  For example,
\regexp{Isaac (?!Asimov)} will match \code{'Isaac~'} only if it's \emph{not}
followed by \code{'Asimov'}.

\end{list}
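
Several of the constructs listed above can be tried out directly.  The
following sketch assumes a current Python 3 interpreter; the sample
strings and variable names are illustrative only.  It contrasts greedy
and non-greedy qualifiers, shows that a named group is also a numbered
group, and exercises both lookahead assertions:

```python
import re

html = '<H1>title</H1>'
greedy = re.match(r'<.*>', html).group(0)      # the entire string
minimal = re.match(r'<.*?>', html).group(0)    # only the first tag

# {m,n} takes as many repetitions as it can; {m,n}? as few.
most = re.match(r'a{3,5}', 'aaaaaa').group(0)
fewest = re.match(r'a{3,5}?', 'aaaaaa').group(0)

# A named group can be fetched by name or by its number.
m = re.match(r'(?P<id>[a-zA-Z_]\w*)', 'spam_1 = 2')
by_name, by_number = m.group('id'), m.group(1)

# Lookahead assertions inspect what follows without consuming it.
positive = re.match(r'Isaac (?=Asimov)', 'Isaac Asimov')
negative = re.match(r'Isaac (?!Asimov)', 'Isaac Asimov')

print(greedy, minimal, most, fewest)
print(by_name, by_number)
print(repr(positive.group(0)), negative)
```

Note that the positive lookahead match covers only \code{'Isaac~'};
the text \code{'Asimov'} is checked but not included in the match.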

The special sequences consist of \character{\e} and a character from the
list below.  If the ordinary character is not on the list, then the
resulting RE will match the second character.  For example,
\regexp{\e\$} matches the character \character{\$}.

\begin{list}{}{\leftmargin 0.7in \labelwidth 0.65in}

%
\item[\code{\e \var{number}}] Matches the contents of the group of the
same number.  Groups are numbered starting from 1.  For example,
\regexp{(.+) \e 1} matches \code{'the the'} or \code{'55 55'}, but not
\code{'the end'} (note
the space after the group).  This special sequence can only be used to
match one of the first 99 groups.  If the first digit of \var{number}
is 0, or \var{number} is 3 octal digits long, it will not be interpreted
as a group match, but as the character with octal value \var{number}.
Inside the \character{[} and \character{]} of a character class, all numeric
escapes are treated as characters.
%
\item[\code{\e A}] Matches only at the start of the string.
%
\item[\code{\e b}] Matches the empty string, but only at the
beginning or end of a word.  A word is defined as a sequence of
alphanumeric characters, so the end of a word is indicated by
whitespace or a non-alphanumeric character.  Inside a character range,
\regexp{\e b} represents the backspace character, for compatibility with
Python's string literals.
%
\item[\code{\e B}] Matches the empty string, but only when it is
\emph{not} at the beginning or end of a word.
%
\item[\code{\e d}] Matches any decimal digit; this is
equivalent to the set \regexp{[0-9]}.
%
\item[\code{\e D}] Matches any non-digit character; this is
equivalent to the set \regexp{[{\^}0-9]}.
%
\item[\code{\e s}] Matches any whitespace character; this is
equivalent to the set \regexp{[ \e t\e n\e r\e f\e v]}.
%
\item[\code{\e S}] Matches any non-whitespace character; this is
equivalent to the set \regexp{[\^\ \e t\e n\e r\e f\e v]}.
%
\item[\code{\e w}] When the \constant{LOCALE} flag is not specified,
matches any alphanumeric character or the underscore; this is equivalent
to the set \regexp{[a-zA-Z0-9_]}.  With \constant{LOCALE}, it will match the set
\regexp{[0-9_]} plus whatever characters are defined as letters for the
current locale.
%
\item[\code{\e W}] When the \constant{LOCALE} flag is not specified,
matches any character that is neither alphanumeric nor the underscore;
this is equivalent to the set
\regexp{[{\^}a-zA-Z0-9_]}.   With \constant{LOCALE}, it will match any
character not in the set \regexp{[0-9_]}, and not defined as a letter
for the current locale.

\item[\code{\e Z}]Matches only at the end of the string.
%

\item[\code{\e \e}] Matches a literal backslash.

\end{list}
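
The special sequences can be explored in the same way.  This is only a
sketch under the assumption of a current Python 3 interpreter, with
invented sample text; it uses a backreference, the word-boundary
assertions, and three of the shorthand character sets:

```python
import re

# \1 must repeat exactly the text that group 1 matched.
repeated = re.match(r'(.+) \1', 'the the')
different = re.match(r'(.+) \1', 'the end')

# \b matches only at a word boundary, \B only where there is none.
whole_words = re.findall(r'\bcat\b', 'cat concat cats cat.')
inside_word = re.findall(r'\Bcat', 'cat concat cats')

# \d, \w and \s stand for digits, word characters and whitespace.
digits = re.findall(r'\d+', 'room 101, floor 3')
words = re.findall(r'\w+', 'key_1 = value')
fields = re.split(r'\s+', 'a \t b\nc')

print(repeated.group(0), different)
print(whole_words, inside_word)
print(digits, words, fields)
```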


\subsection{Matching vs. Searching \label{matching-searching}}
\sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org}

Python offers two different primitive operations based on regular
expressions: match and search.  If you are accustomed to Perl's
semantics, the search operation is what you're looking for.  See the
\function{search()} function and corresponding method of compiled
regular expression objects.

Note that match may differ from search using a regular expression
beginning with \character{\^}: \character{\^} matches only at the
start of the string, or in \constant{MULTILINE} mode also immediately
following a newline.  The ``match'' operation succeeds only if the
pattern matches at the start of the string regardless of mode, or at
the starting position given by the optional \var{pos} argument
regardless of whether a newline precedes it.

% Examples from Tim Peters:
\begin{verbatim}
re.compile("a").match("ba", 1)           # succeeds
re.compile("^a").search("ba", 1)         # fails; 'a' not at start
re.compile("^a").search("\na", 1)        # fails; 'a' not at start
re.compile("^a", re.M).search("\na", 1)  # succeeds
re.compile("^a", re.M).search("ba", 1)   # fails; no preceding \n
\end{verbatim}


\subsection{Module Contents}
\nodename{Contents of Module re}

The module defines the following functions and constants, and an exception:


\begin{funcdesc}{compile}{pattern\optional{, flags}}
  Compile a regular expression pattern into a regular expression
  object, which can be used for matching using its \function{match()} and
  \function{search()} methods, described below.

  The expression's behaviour can be modified by specifying a
  \var{flags} value.  Values can be any of the following variables,
  combined using bitwise OR (the \code{|} operator).

The sequence

\begin{verbatim}
prog = re.compile(pat)
result = prog.match(str)
\end{verbatim}

is equivalent to

\begin{verbatim}
result = re.match(pat, str)
\end{verbatim}

but the version using \function{compile()} is more efficient when the
expression will be used several times in a single program.
%(The compiled version of the last pattern passed to
%\function{regex.match()} or \function{regex.search()} is cached, so
%programs that use only a single regular expression at a time needn't
%worry about compiling regular expressions.)
\end{funcdesc}
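
To make the use of \var{flags} concrete, here is a small sketch
(assuming a current Python 3 interpreter; the pattern and sample text
are invented for the example) that combines two flags with bitwise OR
and compares the compiled and module-level forms:

```python
import re

# Flags are combined with the bitwise OR operator.
prog = re.compile(r'^hello\b.*world$', re.IGNORECASE | re.DOTALL)

text = 'Hello,\nWORLD'
compiled_result = prog.match(text)

# The module-level function accepts the same pattern and flags.
direct_result = re.match(r'^hello\b.*world$', text, re.I | re.S)

same = compiled_result.group(0) == direct_result.group(0)
print(same)
```

Without \constant{DOTALL} the \character{.} could not cross the newline,
and without \constant{IGNORECASE} the capital letters would not match.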

\begin{datadesc}{I}
\dataline{IGNORECASE}
Perform case-insensitive matching; expressions like \regexp{[A-Z]} will match
lowercase letters, too.  This is not affected by the current locale.
\end{datadesc}

\begin{datadesc}{L}
\dataline{LOCALE}
Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, and
\regexp{\e B} dependent on the current locale.
\end{datadesc}

\begin{datadesc}{M}
\dataline{MULTILINE}
When specified, the pattern character \character{\^} matches at the
beginning of the string and at the beginning of each line
(immediately following each newline); and the pattern character
\character{\$} matches at the end of the string and at the end of each line
(immediately preceding each newline).
By default, \character{\^} matches only at the beginning of the string, and
\character{\$} only at the end of the string and immediately before the
newline (if any) at the end of the string.
\end{datadesc}
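
The effect of \constant{MULTILINE} is easiest to see side by side.
The sketch below assumes a current Python 3 interpreter and uses an
invented two-line string; the variable names are not part of the module:

```python
import re

text = 'first line\nsecond line\n'

# Default mode: ^ applies only at the start of the string, and $ only
# at the end (or just before a final newline).
starts = re.compile(r'^\w+').findall(text)
ends = re.compile(r'\w+$').findall(text)

# MULTILINE mode: ^ and $ also apply at every line boundary.
starts_m = re.compile(r'^\w+', re.MULTILINE).findall(text)
ends_m = re.compile(r'\w+$', re.M).findall(text)

print(starts, starts_m)
print(ends, ends_m)
```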

\begin{datadesc}{S}
\dataline{DOTALL}
Make the \character{.} special character match any character at all, including a
newline; without this flag, \character{.} will match anything \emph{except}
a newline.
\end{datadesc}

\begin{datadesc}{X}
\dataline{VERBOSE}
This flag allows you to write regular expressions that look nicer.
Whitespace within the pattern is ignored,
except when in a character class or preceded by an unescaped
backslash, and, when a line contains a \character{\#} neither in a character
class nor preceded by an unescaped backslash, all characters from the
leftmost such \character{\#} through the end of the line are ignored.
% XXX should add an example here
\end{datadesc}
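
As one possible illustration of \constant{VERBOSE} (a sketch assuming
a current Python 3 interpreter; the pattern is the decimal-number
example used elsewhere in this section and the names are illustrative),
the same expression is written once compactly and once with whitespace
and comments:

```python
import re

compact = re.compile(r'(\d+)\.(\d*)')

# With VERBOSE, layout whitespace and trailing comments are ignored.
verbose = re.compile(r"""
    (\d+)     # integer part
    \.        # a literal dot
    (\d*)     # optional fractional digits
    """, re.VERBOSE)

sample = '3.14'
compact_groups = compact.match(sample).groups()
verbose_groups = verbose.match(sample).groups()
print(compact_groups == verbose_groups)
```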
390 391


\begin{funcdesc}{search}{pattern, string\optional{, flags}}
  Scan through \var{string} looking for a location where the regular
  expression \var{pattern} produces a match, and return a
  corresponding \class{MatchObject} instance.
  Return \code{None} if no
  position in the string matches the pattern; note that this is
  different from finding a zero-length match at some point in the string.
\end{funcdesc}

\begin{funcdesc}{match}{pattern, string\optional{, flags}}
  If zero or more characters at the beginning of \var{string} match
  the regular expression \var{pattern}, return a corresponding
  \class{MatchObject} instance.  Return \code{None} if the string does not
  match the pattern; note that this is different from a zero-length
  match.

  \strong{Note:}  If you want to locate a match anywhere in
  \var{string}, use \function{search()} instead.
\end{funcdesc}
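
The difference between \function{match()} and \function{search()}, and
between ``no match'' and a zero-length match, can be sketched as
follows (assuming a current Python 3 interpreter; the sample string is
made up):

```python
import re

# match() tries only the start of the string; search() scans ahead.
at_start = re.match(r'\d+', 'abc 123')     # no digits at index 0
anywhere = re.search(r'\d+', 'abc 123')    # finds the digits later on

# A zero-length match is still a match object, not None.
empty = re.match(r'\d*', 'abc 123')

print(at_start, anywhere.group(0), repr(empty.group(0)))
```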

\begin{funcdesc}{split}{pattern, string\optional{, maxsplit\code{ = 0}}}
  Split \var{string} by the occurrences of \var{pattern}.  If
  capturing parentheses are used in \var{pattern}, then the text of all
  groups in the pattern is also returned as part of the resulting list.
  If \var{maxsplit} is nonzero, at most \var{maxsplit} splits
  occur, and the remainder of the string is returned as the final
  element of the list.  (Incompatibility note: in the original Python
  1.5 release, \var{maxsplit} was ignored.  This has been fixed in
  later releases.)

\begin{verbatim}
>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
\end{verbatim}

  This function combines and extends the functionality of
  the old \function{regsub.split()} and \function{regsub.splitx()}.
\end{funcdesc}

\begin{funcdesc}{findall}{pattern, string}
Return a list of all non-overlapping matches of \var{pattern} in
\var{string}.  If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group.  Empty matches are included in the result.
\versionadded{1.5.2}
\end{funcdesc}
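
How the shape of the result depends on the number of groups can be
sketched like this (assuming a current Python 3 interpreter; the input
string is invented for the example):

```python
import re

text = 'x=1, y=22, z=333'

no_groups = re.findall(r'\w=\d+', text)        # whole matches
one_group = re.findall(r'\w=(\d+)', text)      # a list of strings
two_groups = re.findall(r'(\w)=(\d+)', text)   # a list of tuples

print(no_groups)
print(one_group)
print(two_groups)
```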

\begin{funcdesc}{sub}{pattern, repl, string\optional{, count\code{ = 0}}}
Return the string obtained by replacing the leftmost non-overlapping
occurrences of \var{pattern} in \var{string} by the replacement
\var{repl}.  If the pattern isn't found, \var{string} is returned
unchanged.  \var{repl} can be a string or a function; if a function,
it is called for every non-overlapping occurrence of \var{pattern}.
The function takes a single match object argument, and returns the
replacement string.  For example:

\begin{verbatim}
>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
...
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
\end{verbatim}

The pattern may be a string or a
regex object; if you need to specify
regular expression flags, you must use a regex object, or use
embedded modifiers in a pattern; e.g.
\samp{sub("(?i)b+", "x", "bbbb BBBB")} returns \code{'x x'}.

The optional argument \var{count} is the maximum number of pattern
occurrences to be replaced; \var{count} must be a non-negative integer, and
the default value of 0 means to replace all occurrences.

Empty matches for the pattern are replaced only when not adjacent to a
previous match, so \samp{sub('x*', '-', 'abc')} returns \code{'-a-b-c-'}.

If \var{repl} is a string, any backslash escapes in it are processed.
That is, \samp{\e n} is converted to a single newline character,
\samp{\e r} is converted to a carriage return, and so forth.  Unknown escapes
such as \samp{\e j} are left alone.  Backreferences, such as \samp{\e 6}, are
replaced with the substring matched by group 6 in the pattern.

In addition to character escapes and backreferences as described
above, \samp{\e g<name>} will use the substring matched by the group
named \samp{name}, as defined by the \regexp{(?P<name>...)} syntax.
\samp{\e g<number>} uses the corresponding group number; \samp{\e
g<2>} is therefore equivalent to \samp{\e 2}, but isn't ambiguous in a
replacement such as \samp{\e g<2>0}.  \samp{\e 20} would be
interpreted as a reference to group 20, not a reference to group 2
followed by the literal character \character{0}.
\end{funcdesc}
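
The group references available in replacement text can be sketched as
follows.  This assumes a current Python 3 interpreter, where
\var{count} may be passed by keyword; the group names and sample
strings are invented for the example:

```python
import re

# \g<name> refers to a named group in the replacement text.
swapped = re.sub(r'(?P<first>\w+) (?P<second>\w+)',
                 r'\g<second> \g<first>', 'hello world')

# \g<1>0 means group 1 followed by a literal '0'; the spelling \10
# would instead be read as a reference to group 10.
padded = re.sub(r'(\d)', r'\g<1>0', 'a1b2')

# count limits how many occurrences are replaced.
limited = re.sub(r'-', '+', 'a-b-c-d', count=2)

print(swapped, padded, limited)
```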

\begin{funcdesc}{subn}{pattern, repl, string\optional{, count\code{ = 0}}}
Perform the same operation as \function{sub()}, but return a tuple
\code{(\var{new_string}, \var{number_of_subs_made})}.
\end{funcdesc}
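
A brief sketch of the returned tuple (assuming a current Python 3
interpreter; the names on the left-hand side are illustrative):

```python
import re

# Collapse each run of whitespace and report how many runs were found.
collapsed, substitutions = re.subn(r'\s+', ' ', 'too   many    spaces')
print(collapsed, substitutions)
```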

\begin{funcdesc}{escape}{string}
  Return \var{string} with all non-alphanumerics backslashed; this is
  useful if you want to match an arbitrary literal string that may have
  regular expression metacharacters in it.
\end{funcdesc}
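
A typical use is matching user-supplied text literally.  The sketch
below assumes a current Python 3 interpreter, in which the exact set of
characters that gets backslashed may differ from the description above;
the sample text is invented:

```python
import re

literal = 'price: $5.00 (approx.)'
pattern = re.escape(literal)

# The escaped pattern matches the original text character for character;
# unescaped, '$', '.', '(' and ')' would act as metacharacters.
exact = re.search(pattern, 'The price: $5.00 (approx.) today')
unescaped = re.search(literal, 'The price: $5.00 (approx.) today')

print(exact.group(0), unescaped)
```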

\begin{excdesc}{error}
  Exception raised when a string passed to one of the functions here
  is not a valid regular expression (e.g., unmatched parentheses) or
  when some other error occurs during compilation or matching.  It is
  never an error if a string contains no match for a pattern.
\end{excdesc}


\subsection{Regular Expression Objects \label{re-objects}}

Compiled regular expression objects support the following methods and
attributes:

\begin{methoddesc}[RegexObject]{search}{string\optional{, pos\optional{,
                                        endpos}}}
  Scan through \var{string} looking for a location where this regular
  expression produces a match, and return a
  corresponding \class{MatchObject} instance.  Return \code{None} if no
  position in the string matches the pattern; note that this is
  different from finding a zero-length match at some point in the string.
  
  The optional \var{pos} and \var{endpos} parameters have the same
  meaning as for the \method{match()} method.
\end{methoddesc}

\begin{methoddesc}[RegexObject]{match}{string\optional{, pos\optional{,
                                       endpos}}}
  If zero or more characters at the beginning of \var{string} match
  this regular expression, return a corresponding
  \class{MatchObject} instance.  Return \code{None} if the string does not
  match the pattern; note that this is different from a zero-length
  match.

  \strong{Note:}  If you want to locate a match anywhere in
  \var{string}, use \method{search()} instead.

  The optional second parameter \var{pos} gives an index in the string
  where the search is to start; it defaults to \code{0}.  This is not
  completely equivalent to slicing the string; the \code{'\^'} pattern
  character matches at the real beginning of the string and at positions
  just after a newline, but not necessarily at the index where the search
  is to start.

  The optional parameter \var{endpos} limits how far the string will
  be searched; it will be as if the string is \var{endpos} characters
  long, so only the characters from \var{pos} to \var{endpos} will be
  searched for a match.
\end{methoddesc}
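
The difference between passing \var{pos} and slicing the string, and
the effect of \var{endpos}, can be sketched as follows (assuming a
current Python 3 interpreter; the variable names are illustrative):

```python
import re

anchored = re.compile(r'^a')

# Slicing creates a new string whose first character is 'a' ...
sliced = anchored.match('ba'[1:])
# ... whereas pos only moves the starting index; ^ still refers to
# the real beginning of the string.
offset = anchored.match('ba', 1)

# endpos makes the search behave as if the string ended there.
word = re.compile(r'\w+')
limited = word.search('hello world', 0, 3)

print(sliced.group(0), offset, limited.group(0))
```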

\begin{methoddesc}[RegexObject]{split}{string\optional{,
                                       maxsplit\code{ = 0}}}
Identical to the \function{split()} function, using the compiled pattern.
\end{methoddesc}

\begin{methoddesc}[RegexObject]{findall}{string}
Identical to the \function{findall()} function, using the compiled pattern.
\end{methoddesc}

\begin{methoddesc}[RegexObject]{sub}{repl, string\optional{, count\code{ = 0}}}
Identical to the \function{sub()} function, using the compiled pattern.
\end{methoddesc}

\begin{methoddesc}[RegexObject]{subn}{repl, string\optional{,
                                      count\code{ = 0}}}
Identical to the \function{subn()} function, using the compiled pattern.
\end{methoddesc}


\begin{memberdesc}[RegexObject]{flags}
The flags argument used when the regex object was compiled, or
\code{0} if no flags were provided.
\end{memberdesc}

\begin{memberdesc}[RegexObject]{groupindex}
A dictionary mapping any symbolic group names defined by
\regexp{(?P<\var{id}>)} to group numbers.  The dictionary is empty if no
symbolic groups were used in the pattern.
\end{memberdesc}

\begin{memberdesc}[RegexObject]{pattern}
The pattern string from which the regex object was compiled.
\end{memberdesc}
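
The three attributes can be inspected on any compiled object.  The
following is only a sketch, assuming a current Python 3 interpreter,
where \member{flags} may carry additional bits set by the interpreter
and \member{groupindex} may be a read-only mapping rather than a plain
dictionary; the pattern is invented for the example:

```python
import re

prog = re.compile(r'(?P<key>\w+)=(?P<value>\d+)', re.IGNORECASE)

source = prog.pattern                   # the original pattern string
names = dict(prog.groupindex)           # group names mapped to numbers
ignoring_case = bool(prog.flags & re.IGNORECASE)

print(source)
print(names)
print(ignoring_case)
```

Testing individual bits with \code{\&}, as above, is more robust than
comparing \member{flags} against a fixed number.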


\subsection{Match Objects \label{match-objects}}

\class{MatchObject} instances support the following methods and attributes:

\begin{methoddesc}[MatchObject]{group}{\optional{group1, \moreargs}}
Returns one or more subgroups of the match.  If there is a single
argument, the result is a single string; if there are
multiple arguments, the result is a tuple with one item per argument.
Without arguments, \var{group1} defaults to zero (i.e. the whole match
is returned).
If a \var{groupN} argument is zero, the corresponding return value is the
entire matching string; if it is in the inclusive range [1..99], it is
the string matching the corresponding parenthesized group.  If a
group number is negative or larger than the number of groups defined
in the pattern, an \exception{IndexError} exception is raised.
If a group is contained in a part of the pattern that did not match,
the corresponding result is \code{None}.  If a group is contained in a
part of the pattern that matched multiple times, the last match is
returned.

If the regular expression uses the \regexp{(?P<\var{name}>...)} syntax,
the \var{groupN} arguments may also be strings identifying groups by
their group name.  If a string argument is not used as a group name in 
the pattern, an \exception{IndexError} exception is raised.

A moderately complicated example:

\begin{verbatim}
import re
m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')
\end{verbatim}

After performing this match, \code{m.group(1)} is \code{'3'}, as is
\code{m.group('int')}, and \code{m.group(2)} is \code{'14'}.
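
The remaining cases described above can be sketched in the same way.
The patterns and strings below are assumptions made up for this
illustration:

\begin{verbatim}
import re

# Hypothetical pattern: two words plus an optional third group.
m = re.match(r"(\w+) (\w+)(,\s*\w+)?", "Isaac Newton")

whole = m.group()          # 'Isaac Newton' (same as m.group(0))
first = m.group(1)         # 'Isaac'
pair = m.group(1, 2)       # ('Isaac', 'Newton')
unmatched = m.group(3)     # None: group 3 did not participate

# A group that matches several times reports only its last match.
last = re.match(r"(\d)+", "123").group(1)    # '3'

# A group number larger than the number of groups raises IndexError.
try:
    m.group(4)
    out_of_range = False
except IndexError:
    out_of_range = True
\end{verbatim}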
\end{methoddesc}

\begin{methoddesc}[MatchObject]{groups}{\optional{default}}
Return a tuple containing all the subgroups of the match, from 1 up to
however many groups are in the pattern.  The \var{default} argument is
used for groups that did not participate in the match; it defaults to
\code{None}.  (Incompatibility note: in the original Python 1.5
release, if the tuple was one element long, a string would be returned
instead.  In later versions (from 1.5.1 on), a singleton tuple is
returned in such cases.)
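
A short sketch of the \var{default} argument, using an illustrative
pattern in which the second group is optional:

\begin{verbatim}
import re

# Hypothetical pattern: integer part, optional fraction digits.
m = re.match(r"(\d+)\.?(\d+)?", "24")

plain = m.groups()         # ('24', None): group 2 did not participate
filled = m.groups('0')     # ('24', '0'): default replaces None
\end{verbatim}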
\end{methoddesc}

\begin{methoddesc}[MatchObject]{groupdict}{\optional{default}}
Return a dictionary containing all the \emph{named} subgroups of the
match, keyed by the subgroup name.  The \var{default} argument is
used for groups that did not participate in the match; it defaults to
\code{None}.
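
For instance, assuming a hypothetical pattern with two named groups and
one unnamed group, only the named groups appear in the result:

\begin{verbatim}
import re

# The middle group is unnamed, so it is not a key in the dictionary.
m = re.match(r"(?P<int>\d+)(\.(?P<frac>\d+))?", "3")

plain = m.groupdict()       # {'int': '3', 'frac': None}
filled = m.groupdict('0')   # {'int': '3', 'frac': '0'}
\end{verbatim}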
\end{methoddesc}

\begin{methoddesc}[MatchObject]{start}{\optional{group}}
\funcline{end}{\optional{group}}
Return the indices of the start and end of the substring
matched by \var{group}; \var{group} defaults to zero (meaning the whole
matched substring).
Return \code{-1} if \var{group} exists but
did not contribute to the match.  For a match object
\var{m}, and a group \var{g} that did contribute to the match, the
substring matched by group \var{g} (equivalent to
\code{\var{m}.group(\var{g})}) is

\begin{verbatim}
m.string[m.start(g):m.end(g)]
\end{verbatim}

Note that
\code{m.start(\var{group})} will equal \code{m.end(\var{group})} if
\var{group} matched a null string.  For example, after \code{\var{m} =
re.search('b(c?)', 'cba')}, \code{\var{m}.start(0)} is 1,
\code{\var{m}.end(0)} is 2, \code{\var{m}.start(1)} and
\code{\var{m}.end(1)} are both 2, and \code{\var{m}.start(2)} raises
an \exception{IndexError} exception.
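
One possible use of these indices is cutting the matched text out of a
string.  The address and patterns below are assumptions for
illustration; the second half of the sketch shows the \code{-1} result
for a group that exists but did not contribute:

\begin{verbatim}
import re

email = "tony@tiremove_thisger.net"
m = re.search("remove_this", email)

# Keep everything before the match and everything after it.
cleaned = email[:m.start()] + email[m.end():]    # 'tony@tiger.net'

# Group 1 matches the empty string; group 2 does not participate.
m2 = re.search(r"b(c?)(d)?", "cba")
empty = (m2.start(1), m2.end(1))      # (2, 2)
skipped = (m2.start(2), m2.end(2))    # (-1, -1)
\end{verbatim}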
\end{methoddesc}

\begin{methoddesc}[MatchObject]{span}{\optional{group}}
For \class{MatchObject} \var{m}, return the 2-tuple
\code{(\var{m}.start(\var{group}), \var{m}.end(\var{group}))}.
Note that if \var{group} did not contribute to the match, this is
\code{(-1, -1)}.  Again, \var{group} defaults to zero.
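
A brief sketch with an illustrative pattern whose third group is
optional and does not contribute to the match:

\begin{verbatim}
import re

# Hypothetical page-range pattern; the trailing (x)? never matches here.
m = re.search(r"(\d+)-(\d+)(x)?", "pp 12-34")

whole = m.span()       # (3, 8)
first = m.span(1)      # (3, 5)
second = m.span(2)     # (6, 8)
absent = m.span(3)     # (-1, -1)
\end{verbatim}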
\end{methoddesc}

\begin{memberdesc}[MatchObject]{pos}
The value of \var{pos} which was passed to the
\method{search()} or \method{match()} method of the \class{RegexObject}.
This is the index into
the string at which the regex engine started looking for a match.
\end{memberdesc}

\begin{memberdesc}[MatchObject]{endpos}
The value of \var{endpos} which was passed to the
\method{search()} or \method{match()} method of the \class{RegexObject}.
This is the index into
the string beyond which the regex engine will not go.
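
The sketch below shows \member{pos} and \member{endpos} together, using
a compiled pattern and an illustrative string.  The value shown for an
omitted \var{endpos} is what current interpreters report, and is stated
here as an assumption rather than a guarantee of this manual:

\begin{verbatim}
import re

p = re.compile(r"\d+")
text = "ab12cd34"

m = p.search(text, 4)          # start looking at index 4
found = m.group()              # '34'
bounds = (m.pos, m.endpos)     # (4, 8): endpos falls back to len(text)

m2 = p.search(text, 0, 3)      # only "ab1" is visible to the engine
clipped = m2.group()           # '1'
bounds2 = (m2.pos, m2.endpos)  # (0, 3)
\end{verbatim}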
\end{memberdesc}

\begin{memberdesc}[MatchObject]{re}
The regular expression object whose \method{match()} or
\method{search()} method produced this \class{MatchObject} instance.
\end{memberdesc}

\begin{memberdesc}[MatchObject]{string}
The string passed to \function{match()} or \function{search()}.
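
Together with \member{re}, this attribute lets code that receives only
a match object recover both the pattern and the subject string.  A
minimal sketch, with a pattern and string invented for illustration:

\begin{verbatim}
import re

p = re.compile(r"(\w+)@(\w+)")
text = "user@example"
m = p.match(text)

same_pattern = m.re is p           # the object whose method produced m
same_string = m.string == text     # the string that was searched
rebuilt = m.string[m.start(2):m.end(2)]    # 'example', as m.group(2)
\end{verbatim}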
\end{memberdesc}

\begin{seealso}
\seetext{Jeffrey Friedl, \citetitle{Mastering Regular Expressions},
O'Reilly.  The Python material in this book dates from before the
\module{re} module, but it covers writing good regular expression
patterns in great detail.}
\end{seealso}