Commit 7ed7aead authored by Victor Stinner, committed by GitHub

bpo-29240: Fix locale encodings in UTF-8 Mode (#5170)

Modify locale.localeconv(), time.tzname, os.strerror() and other
functions to ignore the UTF-8 Mode: always use the current locale
encoding.

Changes:

* Add _Py_DecodeLocaleEx() and _Py_EncodeLocaleEx(). On decoding or
  encoding error, they return the position of the error and an error
  message which are used to raise Unicode errors in
  PyUnicode_DecodeLocale() and PyUnicode_EncodeLocale().
* Replace _Py_DecodeCurrentLocale() with _Py_DecodeLocaleEx().
* PyUnicode_DecodeLocale() now uses _Py_DecodeLocaleEx() for all
  cases, especially for the strict error handler.
* Add _Py_DecodeUTF8Ex(): it returns more information on decoding
  errors and supports the strict error handler.
* Rename _Py_EncodeUTF8_surrogateescape() to _Py_EncodeUTF8Ex().
* Replace _Py_EncodeCurrentLocale() with _Py_EncodeLocaleEx().
* Ignore the UTF-8 mode to encode/decode localeconv(), strerror()
  and time zone name.
* PyUnicode_DecodeLocale(), PyUnicode_DecodeLocaleAndSize()
  and PyUnicode_EncodeLocale() now ignore the UTF-8 mode: always use
  the "current" locale.
* Remove _PyUnicode_DecodeCurrentLocale(),
  _PyUnicode_DecodeCurrentLocaleAndSize() and
  _PyUnicode_EncodeCurrentLocale().
parent ee3b8354
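The commit message describes "Ex" functions that report where an error occurred and why, instead of only returning NULL. The following is a minimal, self-contained sketch of that calling pattern. It is not CPython code: `demo_decode_ascii_ex` is a hypothetical helper that only understands ASCII, and the return codes (0 for success, -1 for an allocation failure, -2 for a decoding error) are an assumption chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-in for an "Ex"-style decoder (ASCII only).
   Returns 0 on success, -1 on allocation failure, -2 on a decoding error.
   On -2, *error_pos receives the index of the offending byte and *reason
   points to a static message, so the caller can build a precise error. */
static int
demo_decode_ascii_ex(const char *arg, wchar_t **wstr, size_t *error_pos,
                     const char **reason, int surrogateescape)
{
    assert(arg != NULL && wstr != NULL);
    size_t len = strlen(arg);
    wchar_t *out = malloc((len + 1) * sizeof(wchar_t));
    if (out == NULL) {
        return -1;
    }
    for (size_t i = 0; i < len; i++) {
        unsigned char ch = (unsigned char)arg[i];
        if (ch < 0x80) {
            out[i] = (wchar_t)ch;
        }
        else if (surrogateescape) {
            /* Escape the undecodable byte as U+DC80..U+DCFF. */
            out[i] = (wchar_t)(0xDC00 + ch);
        }
        else {
            /* Strict mode: report the position and the reason. */
            free(out);
            if (error_pos != NULL) {
                *error_pos = i;
            }
            if (reason != NULL) {
                *reason = "byte is not ASCII";
            }
            return -2;
        }
    }
    out[len] = L'\0';
    *wstr = out;
    return 0;
}
```

A caller that needs the strict error handler passes `surrogateescape = 0` and turns the returned position and reason into an exception. A caller that wants lossless round-tripping passes 1.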
@@ -106,6 +106,16 @@ Operating System Utilities
 surrogate character, escape the bytes using the surrogateescape error
 handler instead of decoding them.
+Encoding, highest priority to lowest priority:
+* ``UTF-8`` on macOS and Android;
+* ``UTF-8`` if the Python UTF-8 mode is enabled;
+* ``ASCII`` if the ``LC_CTYPE`` locale is ``"C"``,
+  ``nl_langinfo(CODESET)`` returns the ``ASCII`` encoding (or an alias),
+  and :c:func:`mbstowcs` and :c:func:`wcstombs` functions uses the
+  ``ISO-8859-1`` encoding.
+* the current locale encoding.
 Return a pointer to a newly allocated wide character string, use
 :c:func:`PyMem_RawFree` to free the memory. If size is not ``NULL``, write
 the number of wide characters excluding the null character into ``*size``
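The priority list added in this hunk can be read as a simple cascade. The sketch below only restates that order in C. It is not the CPython implementation: `demo_pick_encoding` and its flag parameters are hypothetical, and the `force_ascii` flag stands in for the whole "``LC_CTYPE`` is ``"C"`` and the C library actually behaves like ISO-8859-1" check.

```c
#include <assert.h>
#include <string.h>

/* Illustrative only: returns the encoding name selected by the documented
   priority order. All parameters are hypothetical inputs supplied by the
   caller; real code would detect them from the platform and the locale. */
static const char *
demo_pick_encoding(int macos_or_android, int utf8_mode, int force_ascii,
                   const char *locale_encoding)
{
    assert(locale_encoding != NULL);
    if (macos_or_android) {
        return "UTF-8";          /* highest priority */
    }
    if (utf8_mode) {
        return "UTF-8";          /* Python UTF-8 mode enabled */
    }
    if (force_ascii) {
        return "ASCII";          /* "C" locale with a misbehaving libc */
    }
    return locale_encoding;      /* lowest priority: current locale */
}
```

For example, with the UTF-8 mode enabled and a Latin-1 locale, the sketch returns ``UTF-8``; with every flag off it falls through to the locale encoding.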
@@ -137,6 +147,18 @@ Operating System Utilities
 :ref:`surrogateescape error handler <surrogateescape>`: surrogate characters
 in the range U+DC80..U+DCFF are converted to bytes 0x80..0xFF.
+Encoding, highest priority to lowest priority:
+* ``UTF-8`` on macOS and Android;
+* ``UTF-8`` if the Python UTF-8 mode is enabled;
+* ``ASCII`` if the ``LC_CTYPE`` locale is ``"C"``,
+  ``nl_langinfo(CODESET)`` returns the ``ASCII`` encoding (or an alias),
+  and :c:func:`mbstowcs` and :c:func:`wcstombs` functions uses the
+  ``ISO-8859-1`` encoding.
+* the current locale encoding.
+The function uses the UTF-8 encoding in the Python UTF-8 mode.
 Return a pointer to a newly allocated byte string, use :c:func:`PyMem_Free`
 to free the memory. Return ``NULL`` on encoding error or memory allocation
 error
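The context lines of this hunk state that surrogate characters in the range U+DC80..U+DCFF are converted back to bytes 0x80..0xFF. A minimal sketch of that mapping in plain C follows. The two helper names are hypothetical and the code is independent of CPython; it only shows the arithmetic of the surrogateescape error handler.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Decoding direction: an undecodable byte 0x80..0xFF becomes the lone
   surrogate U+DC80..U+DCFF. */
static wchar_t
demo_escape_byte(unsigned char byte)
{
    assert(byte >= 0x80);
    return (wchar_t)(0xDC00 + byte);
}

/* Encoding direction: if ch is an escaped byte, store the original byte
   in *byte and return 1; otherwise leave *byte alone and return 0. */
static int
demo_unescape_surrogate(wchar_t ch, unsigned char *byte)
{
    assert(byte != NULL);
    if (ch >= 0xDC80 && ch <= 0xDCFF) {
        *byte = (unsigned char)(ch - 0xDC00);
        return 1;
    }
    return 0;
}
```

Because the two functions are inverses on 0x80..0xFF, bytes that cannot be decoded survive a decode and encode round trip unchanged.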
......
@@ -770,12 +770,20 @@ system.
 :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
 Python startup).
+This function ignores the Python UTF-8 mode.
 .. seealso::
    The :c:func:`Py_DecodeLocale` function.
 .. versionadded:: 3.3
+.. versionchanged:: 3.7
+   The function now also uses the current locale encoding for the
+   ``surrogateescape`` error handler. Previously, :c:func:`Py_DecodeLocale`
+   was used for the ``surrogateescape``, and the current locale encoding was
+   used for ``strict``.
 .. c:function:: PyObject* PyUnicode_DecodeLocale(const char *str, const char *errors)
@@ -797,12 +805,20 @@ system.
 :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
 Python startup).
+This function ignores the Python UTF-8 mode.
 .. seealso::
    The :c:func:`Py_EncodeLocale` function.
 .. versionadded:: 3.3
+.. versionchanged:: 3.7
+   The function now also uses the current locale encoding for the
+   ``surrogateescape`` error handler. Previously, :c:func:`Py_EncodeLocale`
+   was used for the ``surrogateescape``, and the current locale encoding was
+   used for ``strict``.
 File System Encoding
 """"""""""""""""""""
......
@@ -20,18 +20,41 @@ PyAPI_FUNC(char*) _Py_EncodeLocaleRaw(
 #endif
 #ifdef Py_BUILD_CORE
+PyAPI_FUNC(int) _Py_DecodeUTF8Ex(
+    const char *arg,
+    Py_ssize_t arglen,
+    wchar_t **wstr,
+    size_t *wlen,
+    const char **reason,
+    int surrogateescape);
+PyAPI_FUNC(int) _Py_EncodeUTF8Ex(
+    const wchar_t *text,
+    char **str,
+    size_t *error_pos,
+    const char **reason,
+    int raw_malloc,
+    int surrogateescape);
 PyAPI_FUNC(wchar_t*) _Py_DecodeUTF8_surrogateescape(
-    const char *s,
-    Py_ssize_t size,
-    size_t *p_wlen);
+    const char *arg,
+    Py_ssize_t arglen);
-PyAPI_FUNC(wchar_t *) _Py_DecodeCurrentLocale(
+PyAPI_FUNC(int) _Py_DecodeLocaleEx(
     const char *arg,
-    size_t *size);
+    wchar_t **wstr,
+    size_t *wlen,
+    const char **reason,
+    int current_locale,
+    int surrogateescape);
-PyAPI_FUNC(char*) _Py_EncodeCurrentLocale(
+PyAPI_FUNC(int) _Py_EncodeLocaleEx(
     const wchar_t *text,
-    size_t *error_pos);
+    char **str,
+    size_t *error_pos,
+    const char **reason,
+    int current_locale,
+    int surrogateescape);
 #endif
 #ifndef Py_LIMITED_API
......
@@ -1810,20 +1810,6 @@ PyAPI_FUNC(PyObject*) PyUnicode_EncodeLocale(
     PyObject *unicode,
     const char *errors
     );
-PyAPI_FUNC(PyObject*) _PyUnicode_DecodeCurrentLocale(
-    const char *str,
-    const char *errors);
-PyAPI_FUNC(PyObject*) _PyUnicode_DecodeCurrentLocaleAndSize(
-    const char *str,
-    Py_ssize_t len,
-    const char *errors);
-PyAPI_FUNC(PyObject*) _PyUnicode_EncodeCurrentLocale(
-    PyObject *unicode,
-    const char *errors
-    );
 #endif
 /* --- File system encoding ---------------------------------------------- */
......
...@@ -696,7 +696,7 @@ static int parse_isoformat_date(const char *dtstr, ...@@ -696,7 +696,7 @@ static int parse_isoformat_date(const char *dtstr,
if (NULL == p) { if (NULL == p) {
return -1; return -1;
} }
if (*(p++) != '-') { if (*(p++) != '-') {
return -2; return -2;
} }
......
@@ -572,8 +572,9 @@ PyIntl_bind_textdomain_codeset(PyObject* self,PyObject*args)
     if (!PyArg_ParseTuple(args, "sz", &domain, &codeset))
         return NULL;
     codeset = bind_textdomain_codeset(domain, codeset);
-    if (codeset)
+    if (codeset) {
         return PyUnicode_DecodeLocale(codeset, NULL);
+    }
     Py_RETURN_NONE;
 }
 #endif
......
@@ -449,8 +449,8 @@ search_for_exec_prefix(const _PyCoreConfig *core_config,
     n = fread(buf, 1, MAXPATHLEN, f);
     buf[n] = '\0';
     fclose(f);
-    rel_builddir_path = _Py_DecodeUTF8_surrogateescape(buf, n, NULL);
-    if (rel_builddir_path != NULL) {
+    rel_builddir_path = _Py_DecodeUTF8_surrogateescape(buf, n);
+    if (rel_builddir_path) {
         wcsncpy(exec_prefix, calculate->argv0_path, MAXPATHLEN);
         exec_prefix[MAXPATHLEN] = L'\0';
         joinpath(exec_prefix, rel_builddir_path);
......
@@ -132,13 +132,13 @@ static PyModuleDef readlinemodule;
 static PyObject *
 encode(PyObject *b)
 {
-    return _PyUnicode_EncodeCurrentLocale(b, "surrogateescape");
+    return PyUnicode_EncodeLocale(b, "surrogateescape");
 }
 static PyObject *
 decode(const char *s)
 {
-    return _PyUnicode_DecodeCurrentLocale(s, "surrogateescape");
+    return PyUnicode_DecodeLocale(s, "surrogateescape");
 }
......
@@ -418,11 +418,11 @@ tmtotuple(struct tm *p
     SET(8, p->tm_isdst);
 #ifdef HAVE_STRUCT_TM_TM_ZONE
     PyStructSequence_SET_ITEM(v, 9,
-        _PyUnicode_DecodeCurrentLocale(p->tm_zone, "surrogateescape"));
+        PyUnicode_DecodeLocale(p->tm_zone, "surrogateescape"));
     SET(10, p->tm_gmtoff);
 #else
     PyStructSequence_SET_ITEM(v, 9,
-        _PyUnicode_DecodeCurrentLocale(zone, "surrogateescape"));
+        PyUnicode_DecodeLocale(zone, "surrogateescape"));
     PyStructSequence_SET_ITEM(v, 10, _PyLong_FromTime_t(gmtoff));
 #endif /* HAVE_STRUCT_TM_TM_ZONE */
 #undef SET
@@ -809,8 +809,7 @@ time_strftime(PyObject *self, PyObject *args)
 #ifdef HAVE_WCSFTIME
     ret = PyUnicode_FromWideChar(outbuf, buflen);
 #else
-    ret = _PyUnicode_DecodeCurrentLocaleAndSize(outbuf, buflen,
-                                                "surrogateescape");
+    ret = PyUnicode_DecodeLocaleAndSize(outbuf, buflen, "surrogateescape");
 #endif
     PyMem_Free(outbuf);
     break;
@@ -1541,8 +1540,8 @@ PyInit_timezone(PyObject *m) {
     PyModule_AddIntConstant(m, "altzone", timezone-3600);
 #endif
     PyModule_AddIntConstant(m, "daylight", daylight);
-    otz0 = _PyUnicode_DecodeCurrentLocale(tzname[0], "surrogateescape");
-    otz1 = _PyUnicode_DecodeCurrentLocale(tzname[1], "surrogateescape");
+    otz0 = PyUnicode_DecodeLocale(tzname[0], "surrogateescape");
+    otz1 = PyUnicode_DecodeLocale(tzname[1], "surrogateescape");
     PyModule_AddObject(m, "tzname", Py_BuildValue("(NN)", otz0, otz1));
 #else /* !HAVE_TZNAME || __GLIBC__ || __CYGWIN__*/
     {
......
This diff is collapsed.
This diff is collapsed.
@@ -382,8 +382,8 @@ _Py_FindEnvConfigValue(FILE *env_file, const wchar_t *key,
         /* Comment - skip */
         continue;
     }
-    tmpbuffer = _Py_DecodeUTF8_surrogateescape(buffer, n, NULL);
-    if (tmpbuffer != NULL) {
+    tmpbuffer = _Py_DecodeUTF8_surrogateescape(buffer, n);
+    if (tmpbuffer) {
         wchar_t * state;
         wchar_t * tok = wcstok(tmpbuffer, L" \t\r\n", &state);
         if ((tok != NULL) && !wcscmp(tok, key)) {
......