Commit 062ea2e7 authored by Fred Drake

Made a number of revisions suggested by Fredrik Lundh.

Revised the first paragraph so it doesn't sound like it was written
when 7-bit strings were assumed; note that Unicode strings can be used.
parent e2b7c4de
-\section{\module{re} ---
-Perl-style regular expression operations.}
+\section{\module{re} ---
+Regular expression operations}
 \declaremodule{standard}{re}
 \moduleauthor{Andrew M. Kuchling}{amk1@bigfoot.com}
+\moduleauthor{Fredrik Lundh}{effbot@telia.com}
 \sectionauthor{Andrew M. Kuchling}{amk1@bigfoot.com}
-\modulesynopsis{Perl-style regular expression search and match
-operations.}
+\modulesynopsis{Regular expression search and match operations with a
+Perl-style expression syntax.}
 This module provides regular expression matching operations similar to
-those found in Perl. It's 8-bit clean: the strings being processed
-may contain both null bytes and characters whose high bit is set. Regular
-expression pattern strings may not contain null bytes, but can specify
-the null byte using the \code{\e\var{number}} notation.
-Characters with the high bit set may be included. The \module{re}
-module is always available.
+those found in Perl. Regular expression pattern strings may not
+contain null bytes, but can specify the null byte using the
+\code{\e\var{number}} notation. Both patterns and strings to be
+searched can be Unicode strings as well as 8-bit strings. The
+\module{re} module is always available.
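As a rough illustration of the revised paragraph, the sketch below specifies a null byte in a pattern through the backslash-number (octal) notation and searches a Unicode string. It assumes a current Python 3 interpreter rather than the release this commit targets, and the sample strings are invented for the example.

```python
import re

# The pattern text contains no null byte; r"\000" is the backslash-number
# (octal) notation that stands for one.
null_match = re.search(r"a\000b", "xa\0by")

# Both the pattern and the searched text are Unicode strings here.
unicode_match = re.search("caf\u00e9", "un caf\u00e9 noir")

print(null_match.span())
print(unicode_match.group())
```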
 Regular expressions use the backslash character (\character{\e}) to
 indicate special forms or to allow special characters to be used
@@ -34,6 +34,15 @@ while \code{"\e n"} is a one-character string containing a newline.
 Usually patterns will be expressed in Python code using this raw
 string notation.
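The difference between raw and ordinary string notation can be sketched as follows. This is a minimal example, assuming a current Python 3 interpreter; the variable names are made up for illustration.

```python
import re

plain = "\n"   # one character: a newline
raw = r"\n"    # two characters: a backslash followed by "n"

# To match one literal backslash, the regular expression itself is \\ ,
# which takes four backslashes in an ordinary string literal.
with_raw = re.match(r"\\", "\\")
without_raw = re.match("\\\\", "\\")

print(len(plain), len(raw))
print(with_raw.group() == without_raw.group())
```

Both calls compile the same two-character regular expression; the raw form simply avoids doubling every backslash.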
+\strong{Implementation note:}
+The \module{re}\refstmodindex{pre} module has two distinct
+implementations: \module{sre} is the default implementation and
+includes Unicode support, but may run into stack limitations for some
+patterns. Though this will be fixed for a future release of Python,
+the older implementation (without Unicode support) is still available
+as the \module{pre}\refstmodindex{pre} module.
 \subsection{Regular Expression Syntax \label{re-syntax}}
 A regular expression (or RE) specifies a set of strings that matches
@@ -155,9 +164,16 @@ simply match the \character{\^} character. For example, \regexp{[{\^}5]}
 will match any character except \character{5}.
 \item[\character{|}]\code{A|B}, where A and B can be arbitrary REs,
-creates a regular expression that will match either A or B. This can
-be used inside groups (see below) as well. To match a literal \character{|},
-use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}.
+creates a regular expression that will match either A or B. An
+arbitrary number of REs can be separated by the \character{|} in this
+way. This can be used inside groups (see below) as well. REs
+separated by \character{|} are tried from left to right, and the first
+one that allows the complete pattern to match is considered the
+accepted branch. This means that if \code{A} matches, \code{B} will
+never be tested, even if it would produce a longer overall match. In
+other words, the \character{|} operator is never greedy. To match a
+literal \character{|}, use \regexp{\e|}, or enclose it inside a
+character class, as in \regexp{[|]}.
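The left-to-right behaviour described above can be sketched with a small example. The patterns and the input string are invented for illustration, and the sketch assumes a current Python 3 interpreter.

```python
import re

text = "abcd"

# Branches are tried left to right: "ab" succeeds first, so the longer
# alternative "abcd" is never tested.
first_wins = re.match(r"ab|abcd", text)

# Reordering the branches lets the longer alternative win.
longer_first = re.match(r"abcd|ab", text)

# A branch is accepted only if the complete pattern can match. With a
# trailing "$", the "ab" branch fails and the engine falls back to "abcd".
whole_pattern = re.match(r"(?:ab|abcd)$", text)

print(first_wins.group())
print(longer_first.group())
print(whole_pattern.group())
```

When the longest alternative is wanted, one option is to list the longer branches first.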
 \item[\code{(...)}] Matches whatever regular expression is inside the
 parentheses, and indicates the start and end of a group; the contents
@@ -184,6 +200,11 @@ for the entire regular expression. This is useful if you wish to
 include the flags as part of the regular expression, instead of
 passing a \var{flag} argument to the \function{compile()} function.
+Note that the \regexp{(?x)} flag changes how the expression is parsed.
+It should be used first in the expression string, or after one or more
+whitespace characters. If there are non-whitespace characters before
+the flag, the results are undefined.
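A minimal sketch of the recommended placement puts \regexp{(?x)} at the very start of the expression string, so the rest of the pattern is parsed in verbose mode. The pattern and input below are hypothetical, and the example assumes a current Python 3 interpreter.

```python
import re

# (?x) comes first, so whitespace and "#" comments in the remainder of the
# pattern are ignored by the parser.
number_pattern = r"""(?x)
    (\d+)    # integer part
    \.       # literal decimal point
    (\d*)    # fractional part
"""

parts = re.match(number_pattern, "3.14")
print(parts.groups())
```

Passing the verbose flag as an argument to \function{compile()} is the alternative when the flag should not be embedded in the pattern itself.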
 \item[\code{(?:...)}] A non-grouping version of regular parentheses.
 Matches whatever regular expression is inside the parentheses, but the
 substring matched by the
......