Commit 52dbbb90 authored by Guido van Rossum

- Issue #3300: make urllib.parse.[un]quote() default to UTF-8.

  Code contributed by Matt Giuca.  quote() now encodes the input
  before quoting, unquote() decodes after unquoting.  There are
  new arguments to change the encoding and errors settings.
  There are also new APIs to skip the encode/decode steps.
  [un]quote_plus() are also affected.
parent 4171da5c
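The behaviour described in the commit message can be sketched as follows. This is a minimal illustration that assumes a current Python 3 `urllib.parse`; the variable names are chosen here and are not part of the commit.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "/El Niño/"

# quote() encodes to UTF-8 by default, then percent-escapes the bytes.
quoted_utf8 = quote(text)
# A different codec can be requested explicitly.
quoted_latin1 = quote(text, encoding="latin-1")

# unquote() decodes with the matching codec to recover the original text.
roundtrip_utf8 = unquote(quoted_utf8)
roundtrip_latin1 = unquote(quoted_latin1, encoding="latin-1")

# The *_plus variants follow the same defaults.
form_value = quote_plus(text)
form_roundtrip = unquote_plus(form_value)

print(quoted_utf8)    # /El%20Ni%C3%B1o/
print(quoted_latin1)  # /El%20Ni%F1o/
print(form_value)     # %2FEl+Ni%C3%B1o%2F
```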
@@ -182,36 +182,84 @@ The :mod:`urllib.parse` module defines the following functions:
    string. If there is no fragment identifier in *url*, return *url* unmodified
    and an empty string.
-.. function:: quote(string[, safe])
+.. function:: quote(string[, safe[, encoding[, errors]]])
    Replace special characters in *string* using the ``%xx`` escape. Letters,
    digits, and the characters ``'_.-'`` are never quoted. The optional *safe*
-   parameter specifies additional characters that should not be quoted --- its
-   default value is ``'/'``.
-   Example: ``quote('/~connolly/')`` yields ``'/%7econnolly/'``.
+   parameter specifies additional ASCII characters that should not be quoted
+   --- its default value is ``'/'``.
+   *string* may be either a :class:`str` or a :class:`bytes`.
+   The optional *encoding* and *errors* parameters specify how to deal with
+   non-ASCII characters, as accepted by the :meth:`str.encode` method.
+   *encoding* defaults to ``'utf-8'``.
+   *errors* defaults to ``'strict'``, meaning unsupported characters raise a
+   :class:`UnicodeEncodeError`.
+   *encoding* and *errors* must not be supplied if *string* is a
+   :class:`bytes`, or a :class:`TypeError` is raised.
+   Note that ``quote(string, safe, encoding, errors)`` is equivalent to
+   ``quote_from_bytes(string.encode(encoding, errors), safe)``.
+   Example: ``quote('/El Niño/')`` yields ``'/El%20Ni%C3%B1o/'``.
-.. function:: quote_plus(string[, safe])
+.. function:: quote_plus(string[, safe[, encoding[, errors]]])
    Like :func:`quote`, but also replace spaces by plus signs, as required for
    quoting HTML form values. Plus signs in the original string are escaped
    unless they are included in *safe*. It also does not have *safe* default to
    ``'/'``.
+   Example: ``quote_plus('/El Niño/')`` yields ``'%2FEl+Ni%C3%B1o%2F'``.
+.. function:: quote_from_bytes(bytes[, safe])
+   Like :func:`quote`, but accepts a :class:`bytes` object rather than a
+   :class:`str`, and does not perform string-to-bytes encoding.
+   Example: ``quote_from_bytes(b'a&\xef')`` yields
+   ``'a%26%EF'``.
-.. function:: unquote(string)
+.. function:: unquote(string[, encoding[, errors]])
    Replace ``%xx`` escapes by their single-character equivalent.
+   The optional *encoding* and *errors* parameters specify how to decode
+   percent-encoded sequences into Unicode characters, as accepted by the
+   :meth:`bytes.decode` method.
+   *string* must be a :class:`str`.
+   *encoding* defaults to ``'utf-8'``.
+   *errors* defaults to ``'replace'``, meaning invalid sequences are replaced
+   by a placeholder character.
-   Example: ``unquote('/%7Econnolly/')`` yields ``'/~connolly/'``.
+   Example: ``unquote('/El%20Ni%C3%B1o/')`` yields ``'/El Niño/'``.
-.. function:: unquote_plus(string)
+.. function:: unquote_plus(string[, encoding[, errors]])
    Like :func:`unquote`, but also replace plus signs by spaces, as required for
    unquoting HTML form values.
+   *string* must be a :class:`str`.
+   Example: ``unquote_plus('/El+Ni%C3%B1o/')`` yields ``'/El Niño/'``.
+.. function:: unquote_to_bytes(string)
+   Replace ``%xx`` escapes by their single-octet equivalent, and return a
+   :class:`bytes` object.
+   *string* may be either a :class:`str` or a :class:`bytes`.
+   If it is a :class:`str`, unescaped non-ASCII characters in *string*
+   are encoded into UTF-8 bytes.
+   Example: ``unquote_to_bytes('a%26%EF')`` yields
+   ``b'a&\xef'``.
 .. function:: urlencode(query[, doseq])
......
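The bytes-level functions and the *errors* defaults documented in this hunk can be exercised as in the sketch below. It assumes a current Python 3 `urllib.parse`; the variable names are illustrative only.

```python
from urllib.parse import quote, quote_from_bytes, unquote, unquote_to_bytes

raw = b"a&\xef"

# quote_from_bytes() skips the str-to-bytes encoding step entirely.
quoted = quote_from_bytes(raw)
# unquote_to_bytes() skips the bytes-to-str decoding step.
restored = unquote_to_bytes(quoted)

# A lone %EF is not valid UTF-8; with the default errors='replace'
# unquote() substitutes U+FFFD instead of raising.
lossy = unquote(quoted)

# A str argument with unescaped non-ASCII text is encoded as UTF-8 first.
mixed = unquote_to_bytes("ñ%20")

# encoding/errors are rejected when the input is already bytes.
try:
    quote(raw, encoding="utf-8")
    bytes_with_encoding_rejected = False
except TypeError:
    bytes_with_encoding_rejected = True

# errors defaults to 'strict' for quote(), so unencodable text raises.
try:
    quote("ñ", encoding="ascii")
    strict_error_raised = False
except UnicodeEncodeError:
    strict_error_raised = True

print(quoted)    # a%26%EF
print(restored)  # b'a&\xef'
```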
@@ -219,7 +219,7 @@ def encode_rfc2231(s, charset=None, language=None):
     charset is given but not language, the string is encoded using the empty
     string for language.
     """
-    s = urllib.parse.quote(s, safe='')
+    s = urllib.parse.quote(s, safe='', encoding=charset or 'ascii')
     if charset is None and language is None:
         return s
     if language is None:
@@ -271,7 +271,10 @@ def decode_params(params):
         # language specifiers at the beginning of the string.
         for num, s, encoded in continuations:
             if encoded:
-                s = urllib.parse.unquote(s)
+                # Decode as "latin-1", so the characters in s directly
+                # represent the percent-encoded octet values.
+                # collapse_rfc2231_value treats this as an octet sequence.
+                s = urllib.parse.unquote(s, encoding="latin-1")
                 extended = True
             value.append(s)
         value = quote(EMPTYSTRING.join(value))
......
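The comment in `decode_params` relies on latin-1 mapping every octet to the code point of the same value. A small sketch of that property, assuming a current Python 3 `urllib.parse`; the sample string is made up for illustration:

```python
from urllib.parse import unquote

encoded = "na%C3%AFve"

# Decoding as latin-1 keeps one character per percent-encoded octet.
as_latin1 = unquote(encoded, encoding="latin-1")

# Encoding back as latin-1 therefore recovers the exact octets ...
octets = as_latin1.encode("latin-1")

# ... which can later be interpreted under the real charset (UTF-8 here).
recovered = octets.decode("utf-8")

print(octets)     # b'na\xc3\xafve'
print(recovered)  # naïve
```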
@@ -68,6 +68,8 @@ parse_qsl_test_cases = [
     ("&a=b", [('a', 'b')]),
     ("a=a+b&b=b+c", [('a', 'a b'), ('b', 'b c')]),
     ("a=1&a=2", [('a', '1'), ('a', '2')]),
+    ("a=%26&b=%3D", [('a', '&'), ('b', '=')]),
+    ("a=%C3%BC&b=%CA%83", [('a', '\xfc'), ('b', '\u0283')]),
 ]
 parse_strict_test_cases = [
......
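The two added test cases can be checked directly against `parse_qsl`. This sketch assumes a current Python 3 `urllib.parse`, where `parse_qsl` decodes percent-escapes as UTF-8 by default:

```python
from urllib.parse import parse_qsl

# Escaped separators come back as literal characters in the values.
reserved = parse_qsl("a=%26&b=%3D")

# Multi-byte UTF-8 sequences are decoded into single characters.
non_ascii = parse_qsl("a=%C3%BC&b=%CA%83")

print(reserved)   # [('a', '&'), ('b', '=')]
print(non_ascii)  # [('a', 'ü'), ('b', 'ʃ')]
```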
@@ -539,6 +539,8 @@ class CookieTests(TestCase):
             # unquoted unsafe
             ("/foo\031/bar", "/foo%19/bar"),
             ("/\175foo/bar", "/%7Dfoo/bar"),
+            # unicode, latin-1 range
+            ("/foo/bar\u00fc", "/foo/bar%C3%BC"),  # UTF-8 encoded
             # unicode
             ("/foo/bar\uabcd", "/foo/bar%EA%AF%8D"),  # UTF-8 encoded
             ]
@@ -1444,7 +1446,8 @@ class LWPCookieTests(TestCase):
         # Try some URL encodings of the PATHs.
         # (the behaviour here has changed from libwww-perl)
         c = CookieJar(DefaultCookiePolicy(rfc2965=True))
-        interact_2965(c, "http://www.acme.com/foo%2f%25/%3c%3c%0Anew%E5/%E5",
+        interact_2965(c, "http://www.acme.com/foo%2f%25/"
+                      "%3c%3c%0Anew%C3%A5/%C3%A5",
                       "foo = bar; version = 1")
         cookie = interact_2965(
......
@@ -291,6 +291,7 @@ class UtilityTests(TestCase):
     def testAppURIs(self):
         self.checkAppURI("http://127.0.0.1/")
         self.checkAppURI("http://127.0.0.1/spam", SCRIPT_NAME="/spam")
+        self.checkAppURI("http://127.0.0.1/sp%C3%A4m", SCRIPT_NAME="/späm")
         self.checkAppURI("http://spam.example.com:2071/",
             HTTP_HOST="spam.example.com:2071", SERVER_PORT="2071")
         self.checkAppURI("http://spam.example.com/",
@@ -304,6 +305,7 @@ class UtilityTests(TestCase):
     def testReqURIs(self):
         self.checkReqURI("http://127.0.0.1/")
         self.checkReqURI("http://127.0.0.1/spam", SCRIPT_NAME="/spam")
+        self.checkReqURI("http://127.0.0.1/sp%C3%A4m", SCRIPT_NAME="/späm")
         self.checkReqURI("http://127.0.0.1/spammity/spam",
             SCRIPT_NAME="/spammity", PATH_INFO="/spam")
         self.checkReqURI("http://127.0.0.1/spammity/spam?say=ni",
......
@@ -5,9 +5,12 @@ UC Irvine, June 1995.
 """
 import sys
+import collections
 __all__ = ["urlparse", "urlunparse", "urljoin", "urldefrag",
-           "urlsplit", "urlunsplit"]
+           "urlsplit", "urlunsplit",
+           "quote", "quote_plus", "quote_from_bytes",
+           "unquote", "unquote_plus", "unquote_to_bytes"]
 # A classification of schemes ('' means apply by default)
 uses_relative = ['ftp', 'http', 'gopher', 'nntp', 'imap',
@@ -269,50 +272,101 @@ def urldefrag(url):
     else:
         return url, ''
-_hextochr = dict(('%02x' % i, chr(i)) for i in range(256))
-_hextochr.update(('%02X' % i, chr(i)) for i in range(256))
-def unquote(s):
-    """unquote('abc%20def') -> 'abc def'."""
-    res = s.split('%')
+def unquote_to_bytes(string):
+    """unquote_to_bytes('abc%20def') -> b'abc def'."""
+    # Note: strings are encoded as UTF-8. This is only an issue if it contains
+    # unescaped non-ASCII characters, which URIs should not.
+    if isinstance(string, str):
+        string = string.encode('utf-8')
+    res = string.split(b'%')
+    res[0] = res[0]
     for i in range(1, len(res)):
         item = res[i]
         try:
-            res[i] = _hextochr[item[:2]] + item[2:]
-        except KeyError:
-            res[i] = '%' + item
-        except UnicodeDecodeError:
-            res[i] = chr(int(item[:2], 16)) + item[2:]
-    return "".join(res)
-def unquote_plus(s):
-    """unquote('%7e/abc+def') -> '~/abc def'"""
-    s = s.replace('+', ' ')
-    return unquote(s)
-always_safe = ('ABCDEFGHIJKLMNOPQRSTUVWXYZ'
-               'abcdefghijklmnopqrstuvwxyz'
-               '0123456789' '_.-')
+            res[i] = bytes([int(item[:2], 16)]) + item[2:]
+        except ValueError:
+            res[i] = b'%' + item
+    return b''.join(res)
+def unquote(string, encoding='utf-8', errors='replace'):
+    """Replace %xx escapes by their single-character equivalent. The optional
+    encoding and errors parameters specify how to decode percent-encoded
+    sequences into Unicode characters, as accepted by the bytes.decode()
+    method.
+    By default, percent-encoded sequences are decoded with UTF-8, and invalid
+    sequences are replaced by a placeholder character.
+    unquote('abc%20def') -> 'abc def'.
+    """
+    if encoding is None: encoding = 'utf-8'
+    if errors is None: errors = 'replace'
+    # pct_sequence: contiguous sequence of percent-encoded bytes, decoded
+    # (list of single-byte bytes objects)
+    pct_sequence = []
+    res = string.split('%')
+    for i in range(1, len(res)):
+        item = res[i]
+        try:
+            if not item: raise ValueError
+            pct_sequence.append(bytes.fromhex(item[:2]))
+            rest = item[2:]
+        except ValueError:
+            rest = '%' + item
+        if not rest:
+            # This segment was just a single percent-encoded character.
+            # May be part of a sequence of code units, so delay decoding.
+            # (Stored in pct_sequence).
+            res[i] = ''
+        else:
+            # Encountered non-percent-encoded characters. Flush the current
+            # pct_sequence.
+            res[i] = b''.join(pct_sequence).decode(encoding, errors) + rest
+            pct_sequence = []
+    if pct_sequence:
+        # Flush the final pct_sequence
+        # res[-1] will always be empty if pct_sequence != []
+        assert not res[-1], "string=%r, res=%r" % (string, res)
+        res[-1] = b''.join(pct_sequence).decode(encoding, errors)
+    return ''.join(res)
+def unquote_plus(string, encoding='utf-8', errors='replace'):
+    """Like unquote(), but also replace plus signs by spaces, as required for
+    unquoting HTML form values.
+    unquote_plus('%7e/abc+def') -> '~/abc def'
+    """
+    string = string.replace('+', ' ')
+    return unquote(string, encoding, errors)
+_ALWAYS_SAFE = frozenset(b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
+                         b'abcdefghijklmnopqrstuvwxyz'
+                         b'0123456789'
+                         b'_.-')
 _safe_quoters= {}
-class Quoter:
+class Quoter(collections.defaultdict):
+    """A mapping from bytes (in range(0,256)) to strings.
+    String values are percent-encoded byte values, unless the key < 128, and
+    in the "safe" set (either the specified safe set, or default set).
+    """
+    # Keeps a cache internally, using defaultdict, for efficiency (lookups
+    # of cached keys don't call Python code at all).
     def __init__(self, safe):
-        self.cache = {}
-        self.safe = safe + always_safe
-    def __call__(self, c):
-        try:
-            return self.cache[c]
-        except KeyError:
-            if ord(c) < 256:
-                res = (c in self.safe) and c or ('%%%02X' % ord(c))
-                self.cache[c] = res
-                return res
-            else:
-                return "".join(['%%%02X' % i for i in c.encode("utf-8")])
+        """safe: bytes object."""
+        self.safe = _ALWAYS_SAFE.union(c for c in safe if c < 128)
+    def __repr__(self):
+        # Without this, will just display as a defaultdict
+        return "<Quoter %r>" % dict(self)
+    def __missing__(self, b):
+        # Handle a cache miss. Store quoted string in cache and return.
+        res = b in self.safe and chr(b) or ('%%%02X' % b)
+        self[b] = res
+        return res
-def quote(s, safe = '/'):
+def quote(string, safe='/', encoding=None, errors=None):
     """quote('abc def') -> 'abc%20def'
     Each part of a URL, e.g. the path info, the query, etc., has a
@@ -332,22 +386,57 @@ def quote(s, safe = '/'):
     is reserved, but in typical usage the quote function is being
     called on a path where the existing slash characters are used as
     reserved characters.
+    string and safe may be either str or bytes objects. encoding and errors
+    must not be specified if string is a bytes object.
+    The optional encoding and errors parameters specify how to deal with
+    non-ASCII characters, as accepted by the str.encode method.
+    By default, encoding='utf-8' (characters are encoded with UTF-8), and
+    errors='strict' (unsupported characters raise a UnicodeEncodeError).
+    """
+    if isinstance(string, str):
+        if encoding is None:
+            encoding = 'utf-8'
+        if errors is None:
+            errors = 'strict'
+        string = string.encode(encoding, errors)
+    else:
+        if encoding is not None:
+            raise TypeError("quote() doesn't support 'encoding' for bytes")
+        if errors is not None:
+            raise TypeError("quote() doesn't support 'errors' for bytes")
+    return quote_from_bytes(string, safe)
+def quote_plus(string, safe='', encoding=None, errors=None):
+    """Like quote(), but also replace ' ' with '+', as required for quoting
+    HTML form values. Plus signs in the original string are escaped unless
+    they are included in safe. It also does not have safe default to '/'.
     """
-    cachekey = (safe, always_safe)
+    # Check if ' ' in string, where string may either be a str or bytes
+    if ' ' in string if isinstance(string, str) else b' ' in string:
+        string = quote(string,
+                       safe + ' ' if isinstance(safe, str) else safe + b' ',
+                       encoding, errors)
+        return string.replace(' ', '+')
+    return quote(string, safe, encoding, errors)
+def quote_from_bytes(bs, safe='/'):
+    """Like quote(), but accepts a bytes object rather than a str, and does
+    not perform string-to-bytes encoding. It always returns an ASCII string.
+    quote_from_bytes(b'abc def\xab') -> 'abc%20def%AB'
+    """
+    if isinstance(safe, str):
+        # Normalize 'safe' by converting to bytes and removing non-ASCII chars
+        safe = safe.encode('ascii', 'ignore')
+    cachekey = bytes(safe) # In case it was a bytearray
+    if not (isinstance(bs, bytes) or isinstance(bs, bytearray)):
+        raise TypeError("quote_from_bytes() expected a bytes")
     try:
         quoter = _safe_quoters[cachekey]
     except KeyError:
         quoter = Quoter(safe)
         _safe_quoters[cachekey] = quoter
-    res = map(quoter, s)
-    return ''.join(res)
-def quote_plus(s, safe = ''):
-    """Quote the query fragment of a URL; replacing ' ' with '+'"""
-    if ' ' in s:
-        s = quote(s, safe + ' ')
-        return s.replace(' ', '+')
-    return quote(s, safe)
+    return ''.join(map(quoter.__getitem__, bs))
 def urlencode(query,doseq=0):
     """Encode a sequence of two-element tuples or dictionary into a URL query string.
......
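The caching idea behind the new `Quoter` class (a `defaultdict` subclass whose `__missing__` computes and stores each byte's quoted form once) can be sketched in isolation. `SketchQuoter` and `quote_bytes_sketch` are hypothetical names used only for this illustration; they are simplified stand-ins, not the committed code.

```python
import collections

# Bytes that are never percent-escaped (letters, digits and '_.-').
ALWAYS_SAFE = frozenset(b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                        b"abcdefghijklmnopqrstuvwxyz"
                        b"0123456789"
                        b"_.-")


class SketchQuoter(collections.defaultdict):
    """Hypothetical stand-in: maps a byte value (int) to its quoted string."""

    def __init__(self, safe=b"/"):
        super().__init__()
        # Only ASCII bytes may be marked safe.
        self.safe = ALWAYS_SAFE.union(c for c in safe if c < 128)

    def __missing__(self, b):
        # Called only on a cache miss; later lookups hit the dict directly.
        res = chr(b) if b in self.safe else "%{:02X}".format(b)
        self[b] = res
        return res


def quote_bytes_sketch(bs, quoter):
    # Iterating over bytes yields ints, which are the mapping's keys.
    return "".join(map(quoter.__getitem__, bs))


quoter = SketchQuoter(b"/")
print(quote_bytes_sketch(b"abc def\xab", quoter))  # abc%20def%AB
```

After the first call, every byte value seen so far is stored in the mapping, so repeated quoting with the same safe set avoids re-running `__missing__`.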
@@ -30,6 +30,13 @@ Core and Builtins
 Library
 -------
+- Issue #3300: make urllib.parse.[un]quote() default to UTF-8.
+  Code contributed by Matt Giuca. quote() now encodes the input
+  before quoting, unquote() decodes after unquoting. There are
+  new arguments to change the encoding and errors settings.
+  There are also new APIs to skip the encode/decode steps.
+  [un]quote_plus() are also affected.
 - Issue #2235: numbers.Number now blocks inheritance of the default id()
   based hash because that hash mechanism is not correct for numeric types.
   All concrete numeric types that inherit from Number (rather than just
......