Commit 1111983a authored by Mark Summerfield

Added a note in each regarding the fact that unicode strings that look the same
may not compare equal (due to the possibility of multiple representations).
parent 822fd532
@@ -107,7 +107,7 @@ the following functions:
 based on the definition of canonical equivalence and compatibility equivalence.
 In Unicode, several characters can be expressed in various way. For example, the
 character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as
-the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
+the sequence U+0327 (COMBINING CEDILLA) U+0043 (LATIN CAPITAL LETTER C).
 For each character, there are two normal forms: normal form C and normal form D.
 Normal form D (NFD) is also known as canonical decomposition, and translates
@@ -126,6 +126,10 @@ the following functions:
 (NFKC) first applies the compatibility decomposition, followed by the canonical
 composition.
+Even if two unicode strings are normalized and look the same to
+a human reader, if one has combining characters and the other
+doesn't, they may not compare equal.
 .. versionadded:: 2.3
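The added note can be illustrated with a minimal sketch. It assumes Python 3, where every ``str`` is a Unicode string (under the 2.x documentation shown in this diff the literals would carry a ``u`` prefix). The variable names are illustrative only. The decomposed form is written with the base letter first and the combining mark second, which is the order Unicode uses for combining sequences.

```python
import unicodedata

# Two spellings of the same visible character.
composed = "\u00C7"          # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = "\u0043\u0327"  # LATIN CAPITAL LETTER C + COMBINING CEDILLA

# Each string is already normalized, just in a different normal form.
composed_is_nfc = unicodedata.normalize("NFC", composed) == composed
decomposed_is_nfd = unicodedata.normalize("NFD", decomposed) == decomposed

# The code point sequences differ, so the strings do not compare equal.
raw_equal = composed == decomposed

# Normalizing both sides to the same form before comparing removes the mismatch.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", composed)
             == unicodedata.normalize("NFD", decomposed))
```

The practical consequence is that both operands should be passed through ``unicodedata.normalize`` with the same form argument before an equality test whenever the inputs may come from different sources.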
 In addition, the module exposes the following constant:
......
@@ -1040,7 +1040,7 @@ Comparison of objects of the same type depends on the type:
 * Strings are compared lexicographically using the numeric equivalents (the
 result of the built-in function :func:`ord`) of their characters. Unicode and
-8-bit strings are fully interoperable in this behavior.
+8-bit strings are fully interoperable in this behavior. [#]_
 * Tuples and lists are compared lexicographically using comparison of
 corresponding elements. This means that to compare equal, each element must
@@ -1328,6 +1328,12 @@ groups from right to left).
 cases, Python returns the latter result, in order to preserve that
 ``divmod(x,y)[0] * y + x % y`` be very close to ``x``.
+.. [#] While comparisons between unicode strings make sense at the byte
+level, they may be counter-intuitive to users. For example, the
+strings ``u"\u00C7"`` and ``u"\u0327\u0043"`` compare differently,
+even though they both represent the same unicode character (LATIN
+CAPITAL LETTER C WITH CEDILLA).
 .. [#] The implementation computes this efficiently, without constructing lists or
 sorting.
......