Commit 46aace34, authored Aug 18, 2013 by Andrew Kuchling

Merge from 3.3

Parents: 5b3d9067, 3f4f3ba1

Showing 1 changed file with 62 additions and 70 deletions

Doc/howto/regex.rst (+62, -70)
@@ -104,13 +104,25 @@ you can still match them in patterns; for example, if you need to match a ``[``
 or ``\``, you can precede them with a backslash to remove their special
 meaning: ``\[`` or ``\\``.

-Some of the special sequences beginning with ``'\'`` represent predefined sets
-of characters that are often useful, such as the set of digits, the set of
-letters, or the set of anything that isn't whitespace. The following predefined
-special sequences are a subset of those available. The equivalent classes are
-for bytes patterns. For a complete list of sequences and expanded class
-definitions for Unicode string patterns, see the last part of
-:ref:`Regular Expression Syntax <re-syntax>`.
+Some of the special sequences beginning with ``'\'`` represent
+predefined sets of characters that are often useful, such as the set
+of digits, the set of letters, or the set of anything that isn't
+whitespace.
+
+Let's take an example: ``\w`` matches any alphanumeric character. If
+the regex pattern is expressed in bytes, this is equivalent to the
+class ``[a-zA-Z0-9_]``. If the regex pattern is a string, ``\w`` will
+match all the characters marked as letters in the Unicode database
+provided by the :mod:`unicodedata` module. You can use the more
+restricted definition of ``\w`` in a string pattern by supplying the
+:const:`re.ASCII` flag when compiling the regular expression.
+
+The following list of special sequences isn't complete. For a complete
+list of sequences and expanded class definitions for Unicode string
+patterns, see the last part of :ref:`Regular Expression Syntax
+<re-syntax>` in the Standard Library reference. In general, the
+Unicode versions match any character that's in the appropriate
+category in the Unicode database.

 ``\d``
    Matches any decimal digit; this is equivalent to the class ``[0-9]``.
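
The ``\w`` behaviour described in the added paragraph of this hunk can be checked with a short sketch. The sample text is an assumption chosen for illustration, not part of the commit:

```python
import re

# Hypothetical sample text containing one non-ASCII letter (e with acute accent).
text = "caf\u00e9_1"

# String pattern: \w uses the Unicode definition, so the accented letter matches.
unicode_words = re.findall(r"\w+", text)

# Same string pattern with re.ASCII: \w is restricted to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, re.ASCII)

# Bytes pattern: \w is the ASCII class [a-zA-Z0-9_], so the UTF-8 bytes
# of the accented letter split the word.
bytes_words = re.findall(rb"\w+", text.encode("utf-8"))

print(unicode_words)
print(ascii_words)
print(bytes_words)
```

Without the flag the string pattern keeps the whole word together, while the ``re.ASCII`` and bytes variants both stop at the accented letter.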
@@ -160,9 +172,8 @@ previous character can be matched zero or more times, instead of exactly once.
 For example, ``ca*t`` will match ``ct`` (0 ``a`` characters), ``cat`` (1 ``a``),
 ``caaat`` (3 ``a`` characters), and so forth. The RE engine has various
 internal limitations stemming from the size of C's ``int`` type that will
-prevent it from matching over 2 billion ``a`` characters; you probably don't
-have enough memory to construct a string that large, so you shouldn't run into
-that limit.
+prevent it from matching over 2 billion ``a`` characters; patterns
+are usually not written to match that much data.

 Repetitions such as ``*`` are :dfn:`greedy`; when repeating a RE, the matching
 engine will try to repeat it as many times as possible. If later portions of the
@@ -353,7 +364,7 @@ for a complete listing.
 |                  | returns them as an :term:`iterator`.          |
 +------------------+-----------------------------------------------+

-:meth:`match` and :meth:`search` return ``None`` if no match can be found. If
+:meth:`~re.regex.match` and :meth:`~re.regex.search` return ``None`` if no match can be found. If
 they're successful, a :ref:`match object <match-objects>` instance is returned,
 containing information about the match: where it starts and ends, the substring
 it matched, and more.
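
As a minimal sketch of the return values this paragraph describes, the pattern and input string below are assumptions made up for illustration:

```python
import re

# Hypothetical pattern and input, chosen so match() and search() disagree.
p = re.compile(r"[a-z]+")

# match() only tries at the start of the string, where a digit sits.
no_match = p.match("123abc")

# search() scans forward and finds the letters further along.
found = p.search("123abc")

print(no_match)
if found is not None:
    # A match object records the matched substring and where it lies.
    print(found.group(), found.start(), found.end(), found.span())
```

Testing the result against ``None`` before calling methods on it is the usual way to avoid an ``AttributeError`` when nothing matched.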
@@ -419,8 +430,8 @@ Trying these methods will soon clarify their meaning::
    >>> m.span()
    (0, 5)

-:meth:`group` returns the substring that was matched by the RE. :meth:`start`
-and :meth:`end` return the starting and ending index of the match. :meth:`span`
+:meth:`~re.match.group` returns the substring that was matched by the RE. :meth:`~re.match.start`
+and :meth:`~re.match.end` return the starting and ending index of the match. :meth:`~re.match.span`
 returns both start and end indexes in a single tuple. Since the :meth:`match`
 method only checks if the RE matches at the start of a string, :meth:`start`
 will always be zero. However, the :meth:`search` method of patterns
@@ -448,14 +459,14 @@ In actual programs, the most common style is to store the
        print('No match')

 Two pattern methods return all of the matches for a pattern.
-:meth:`findall` returns a list of matching strings::
+:meth:`~re.regex.findall` returns a list of matching strings::

    >>> p = re.compile('\d+')
    >>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
    ['12', '11', '10']

 :meth:`findall` has to create the entire list before it can be returned as the
-result. The :meth:`finditer` method returns a sequence of
+result. The :meth:`~re.regex.finditer` method returns a sequence of
 :ref:`match object <match-objects>` instances as an :term:`iterator`::

    >>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
@@ -473,9 +484,9 @@ Module-Level Functions
 ----------------------

 You don't have to create a pattern object and call its methods; the
-:mod:`re` module also provides top-level functions called :func:`match`,
-:func:`search`, :func:`findall`, :func:`sub`, and so forth. These functions
-take the same arguments as the corresponding pattern method, with
+:mod:`re` module also provides top-level functions called :func:`~re.match`,
+:func:`~re.search`, :func:`~re.findall`, :func:`~re.sub`, and so forth. These functions
+take the same arguments as the corresponding pattern method with
 the RE string added as the first argument, and still return either ``None`` or a
 :ref:`match object <match-objects>` instance. ::
@@ -485,26 +496,15 @@ the RE string added as the first argument, and still return either ``None`` or a
    <_sre.SRE_Match object at 0x...>

 Under the hood, these functions simply create a pattern object for you
-and call the appropriate method on it. They also store the compiled object in a
-cache, so future calls using the same RE are faster.
+and call the appropriate method on it. They also store the compiled
+object in a cache, so future calls using the same RE won't need to
+parse the pattern again and again.

 Should you use these module-level functions, or should you get the
-pattern and call its methods yourself? That choice depends on how
-frequently the RE will be used, and on your personal coding style. If the RE is
-being used at only one point in the code, then the module functions are probably
-more convenient. If a program contains a lot of regular expressions, or re-uses
-the same ones in several locations, then it might be worthwhile to collect all
-the definitions in one place, in a section of code that compiles all the REs
-ahead of time. To take an example from the standard library, here's an extract
-from the now-defunct Python 2 standard :mod:`xmllib` module::
-
-   ref = re.compile( ... )
-   entityref = re.compile( ... )
-   charref = re.compile( ... )
-   starttagopen = re.compile( ... )
-
-I generally prefer to work with the compiled object, even for one-time uses, but
-few people will be as much of a purist about this as I am.
+pattern and call its methods yourself? If you're accessing a regex
+within a loop, pre-compiling it will save a few function calls.
+Outside of loops, there's not much difference thanks to the internal
+cache.


 Compilation Flags
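
The two calling styles compared in the rewritten "Module-Level Functions" text can be sketched side by side. The sample lines and the name ``from_re`` are assumptions for illustration:

```python
import re

# Hypothetical input lines.
lines = ["From amk Thu May 14", "To: someone", "From fred Fri May 15"]

# Module-level function: the pattern string is passed on every call, and
# the re module looks up the compiled pattern in its internal cache.
via_module = [re.match(r"From\s+", line) is not None for line in lines]

# Pre-compiled pattern: compile once outside the loop, then call the
# pattern's own method inside it.
from_re = re.compile(r"From\s+")
via_pattern = [from_re.match(line) is not None for line in lines]

print(via_module)
print(via_pattern)
```

Both styles give the same results. The pre-compiled form only skips the per-call cache lookup, which is the "few function calls" the new text refers to.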
@@ -524,6 +524,10 @@ of each one.
 +---------------------------------+--------------------------------------------+
 | Flag                            | Meaning                                    |
 +=================================+============================================+
+| :const:`ASCII`, :const:`A`      | Makes several escapes like ``\w``, ``\b``, |
+|                                 | ``\s`` and ``\d`` match only on ASCII      |
+|                                 | characters with the respective property.   |
++---------------------------------+--------------------------------------------+
 | :const:`DOTALL`, :const:`S`     | Make ``.`` match any character, including  |
 |                                 | newlines                                   |
 +---------------------------------+--------------------------------------------+
@@ -535,11 +539,7 @@ of each one.
 |                                 | ``$``                                      |
 +---------------------------------+--------------------------------------------+
 | :const:`VERBOSE`, :const:`X`    | Enable verbose REs, which can be organized |
-|                                 | more cleanly and understandably.           |
-+---------------------------------+--------------------------------------------+
-| :const:`ASCII`, :const:`A`      | Makes several escapes like ``\w``, ``\b``, |
-|                                 | ``\s`` and ``\d`` match only on ASCII      |
-|                                 | characters with the respective property.   |
+| (for 'extended')                | more cleanly and understandably.           |
 +---------------------------------+--------------------------------------------+
@@ -558,7 +558,8 @@ of each one.
           LOCALE
    :noindex:

-   Make ``\w``, ``\W``, ``\b``, and ``\B``, dependent on the current locale.
+   Make ``\w``, ``\W``, ``\b``, and ``\B``, dependent on the current locale
+   instead of the Unicode database.

    Locales are a feature of the C library intended to help in writing programs that
    take account of language differences. For example, if you're processing French
@@ -851,11 +852,10 @@ keep track of the group numbers. There are two features which help with this
 problem. Both of them use a common syntax for regular expression extensions, so
 we'll look at that first.

-Perl 5 added several additional features to standard regular expressions, and
-the Python :mod:`re` module supports most of them. It would have been
-difficult to choose new single-keystroke metacharacters or new special sequences
-beginning with ``\`` to represent the new features without making Perl's regular
-expressions confusingly different from standard REs. If you chose ``&`` as a
+Perl 5 is well-known for its powerful additions to standard regular expressions.
+For these new features the Perl developers couldn't choose new single-keystroke metacharacters
+or new special sequences beginning with ``\`` without making Perl's regular
+expressions confusingly different from standard REs. If they chose ``&`` as a
 new metacharacter, for example, old expressions would be assuming that ``&`` was
 a regular character and wouldn't have escaped it by writing ``\&`` or ``[&]``.
@@ -867,22 +867,15 @@ what extension is being used, so ``(?=foo)`` is one thing (a positive lookahead
 assertion) and ``(?:foo)`` is something else (a non-capturing group containing
 the subexpression ``foo``).

-Python adds an extension syntax to Perl's extension syntax. If the first
-character after the question mark is a ``P``, you know that it's an extension
-that's specific to Python. Currently there are two such extensions:
-``(?P<name>...)`` defines a named group, and ``(?P=name)`` is a backreference to
-a named group. If future versions of Perl 5 add similar features using a
-different syntax, the :mod:`re` module will be changed to support the new
-syntax, while preserving the Python-specific syntax for compatibility's sake.
+Python supports several of Perl's extensions and adds an extension
+syntax to Perl's extension syntax. If the first character after the
+question mark is a ``P``, you know that it's an extension that's
+specific to Python.

-Now that we've looked at the general extension syntax, we can return to the
-features that simplify working with groups in complex REs. Since groups are
-numbered from left to right and a complex expression may use many groups, it can
-become difficult to keep track of the correct numbering. Modifying such a
-complex RE is annoying, too: insert a new group near the beginning and you
-change the numbers of everything that follows it.
+Now that we've looked at the general extension syntax, we can return
+to the features that simplify working with groups in complex REs.

-Sometimes you'll want to use a group to collect a part of a regular expression,
+Sometimes you'll want to use a group to denote a part of a regular expression,
 but aren't interested in retrieving the group's contents. You can make this fact
 explicit by using a non-capturing group: ``(?:...)``, where you can replace the
 ``...`` with any other regular expression. ::
@@ -908,7 +901,7 @@ numbers, groups can be referenced by a name.

 The syntax for a named group is one of the Python-specific extensions:
-``(?P<name>...)``. *name* is, obviously, the name of the group. Named groups
+``(?P<name>...)``. *name* is, obviously, the name of the group. Named groups also
 behave exactly like capturing groups, and additionally associate a name
 with a group. The :ref:`match object <match-objects>` methods that deal with
 capturing groups all accept either integers that refer to the group by number
 or strings that contain the desired group's name. Named groups are still
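
A minimal sketch of the named-group behaviour described in this hunk follows. The group name ``word`` and the input string are assumptions for illustration:

```python
import re

# (?P<word>...) defines a capturing group that also has a name.
p = re.compile(r"(?P<word>\b\w+\b)")
m = p.search("(((( Lots of punctuation )))")

by_name = m.group("word")   # look the group up by its name
by_number = m.group(1)      # the same group is still numbered

print(by_name, by_number)
```

Either lookup returns the same substring, because a named group is an ordinary numbered group with an extra label.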
@@ -975,9 +968,10 @@ The pattern to match this is quite simple:
 ``.*[.].*$``

 Notice that the ``.`` needs to be treated specially because it's a
-metacharacter; I've put it inside a character class. Also notice the trailing
-``$``; this is added to ensure that all the rest of the string must be included
-in the extension. This regular expression matches ``foo.bar`` and
+metacharacter, so it's inside a character class to only match that
+specific character. Also notice the trailing ``$``; this is added to
+ensure that all the rest of the string must be included in the
+extension. This regular expression matches ``foo.bar`` and
 ``autoexec.bat`` and ``sendmail.cf`` and ``printers.conf``.

 Now, consider complicating the problem a bit; what if you want to match
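
The filename pattern quoted in this hunk can be tried against the names the text lists. The extra name ``README`` is an assumption added to show a non-match:

```python
import re

# The dot sits inside a character class, so it matches a literal "."
# rather than acting as the "any character" metacharacter.
pattern = re.compile(r".*[.].*$")

names = ["foo.bar", "autoexec.bat", "sendmail.cf", "printers.conf", "README"]
matched = [name for name in names if pattern.match(name)]

print(matched)
```

Writing the dot as ``\.`` instead of ``[.]`` would match the same names; the character class is simply the spelling the howto uses.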
@@ -1051,7 +1045,7 @@ Splitting Strings
 The :meth:`split` method of a pattern splits a string apart
 wherever the RE matches, returning a list of the pieces. It's similar to the
 :meth:`split` method of strings but provides much more generality in the
-delimiters that you can split by; :meth:`split` only supports splitting by
+delimiters that you can split by; string :meth:`split` only supports splitting by
 whitespace or by a fixed string. As you'd expect, there's a module-level
 :func:`re.split` function, too.
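
The difference between the string method and the pattern-based split described here can be sketched as follows. The sample sentence is an assumption for illustration:

```python
import re

text = "This is a test, short and sweet, of split()."

# String split: only a fixed delimiter (or whitespace) is supported.
by_fixed_string = text.split(", ")

# re.split: the delimiter is any run of non-word characters.
by_pattern = re.split(r"\W+", text)

print(by_fixed_string)
print(by_pattern)
```

The trailing empty string in the ``re.split`` result appears because the text ends with delimiter characters.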
@@ -1106,7 +1100,6 @@ Another common task is to find all the matches for a pattern, and replace them
 with a different string. The :meth:`sub` method takes a replacement value,
 which can be either a string or a function, and the string to be processed.

-
 .. method:: .sub(replacement, string[, count=0])
    :noindex:
@@ -1362,4 +1355,3 @@ and doesn't contain any Python material at all, so it won't be useful as a
 reference for programming in Python. (The first edition covered Python's
 now-removed :mod:`regex` module, which won't help you much.) Consider checking
 it out from your library.
-