bpo-29240: Fix locale encodings in UTF-8 Mode (#5170)

Modify locale.localeconv(), time.tzname, os.strerror() and other functions to ignore the UTF-8 Mode: always use the current locale encoding.

Changes:

* Add _Py_DecodeLocaleEx() and _Py_EncodeLocaleEx(). On a decoding or encoding error, they return the position of the error and an error message, which are used to raise Unicode errors in PyUnicode_DecodeLocale() and PyUnicode_EncodeLocale().
* Replace _Py_DecodeCurrentLocale() with _Py_DecodeLocaleEx().
* PyUnicode_DecodeLocale() now uses _Py_DecodeLocaleEx() for all cases, especially for the strict error handler.
* Add _Py_DecodeUTF8Ex(): it returns more information on decoding errors and supports the strict error handler.
* Rename _Py_EncodeUTF8_surrogateescape() to _Py_EncodeUTF8Ex().
* Replace _Py_EncodeCurrentLocale() with _Py_EncodeLocaleEx().
* Ignore the UTF-8 mode to encode/decode localeconv(), strerror() and time zone name.
* PyUnicode_DecodeLocale(), PyUnicode_DecodeLocaleAndSize() and PyUnicode_EncodeLocale() now ignore the UTF-8 mode: always use the "current" locale.
* Remove _PyUnicode_DecodeCurrentLocale(), _PyUnicode_DecodeCurrentLocaleAndSize() and _PyUnicode_EncodeCurrentLocale().
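
The first item in the change list describes a calling convention in which a function returns a status code and reports the error position and an error message through output parameters. A minimal sketch of that pattern, using a hypothetical ASCII-only decoder: the name `decode_ascii_ex`, the status codes and the message text are assumptions made for illustration and are not taken from CPython's sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical status codes, chosen for this sketch only. */
enum { DECODE_OK = 0, DECODE_NOMEM = -1, DECODE_ERROR = -2 };

/* Decode a NUL-terminated byte string as ASCII into a newly allocated
 * wide string stored in *wstr (release it with free()).
 *
 * Bytes >= 0x80 are escaped as U+DC80..U+DCFF when surrogateescape is
 * non-zero; otherwise the function fails with DECODE_ERROR, stores the
 * offset of the offending byte in *error_pos and a static message in
 * *reason (both are optional and may be NULL). */
int
decode_ascii_ex(const char *arg, wchar_t **wstr, size_t *error_pos,
                const char **reason, int surrogateescape)
{
    assert(arg != NULL && wstr != NULL);

    size_t len = strlen(arg);
    wchar_t *out = malloc((len + 1) * sizeof(wchar_t));
    if (out == NULL) {
        return DECODE_NOMEM;
    }

    for (size_t i = 0; i < len; i++) {
        unsigned char ch = (unsigned char)arg[i];
        if (ch < 0x80) {
            out[i] = (wchar_t)ch;
        }
        else if (surrogateescape) {
            /* 0x80..0xFF maps to U+DC80..U+DCFF */
            out[i] = (wchar_t)(0xDC00 + ch);
        }
        else {
            free(out);
            if (error_pos != NULL) {
                *error_pos = i;
            }
            if (reason != NULL) {
                *reason = "byte not in range(128)";
            }
            return DECODE_ERROR;
        }
    }
    out[len] = L'\0';
    *wstr = out;
    return DECODE_OK;
}
```

A caller can turn the position and message into a precise error report, which a bare NULL return would not allow.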

Commit 7ed7aead authored Jan 15, 2018 by Victor Stinner Committed by GitHub Jan 15, 2018
12 changed files
--- a/Doc/c-api/sys.rst
+++ b/Doc/c-api/sys.rst
@@ -106,6 +106,16 @@ Operating System Utilities
   surrogate character, escape the bytes using the surrogateescape error
   handler instead of decoding them.
+   Encoding, highest priority to lowest priority:
+   * ``UTF-8`` on macOS and Android;
+   * ``UTF-8`` if the Python UTF-8 mode is enabled;
+   * ``ASCII`` if the ``LC_CTYPE`` locale is ``"C"``,
+     ``nl_langinfo(CODESET)`` returns the ``ASCII`` encoding (or an alias),
+     and :c:func:`mbstowcs` and :c:func:`wcstombs` functions uses the
+     ``ISO-8859-1`` encoding.
+   * the current locale encoding.
   Return a pointer to a newly allocated wide character string, use
   :c:func:`PyMem_RawFree` to free the memory. If size is not ``NULL``, write
   the number of wide characters excluding the null character into ``*size``
@@ -137,6 +147,18 @@ Operating System Utilities
   :ref:`surrogateescape error handler <surrogateescape>`: surrogate characters
   in the range U+DC80..U+DCFF are converted to bytes 0x80..0xFF.
+   Encoding, highest priority to lowest priority:
+   * ``UTF-8`` on macOS and Android;
+   * ``UTF-8`` if the Python UTF-8 mode is enabled;
+   * ``ASCII`` if the ``LC_CTYPE`` locale is ``"C"``,
+     ``nl_langinfo(CODESET)`` returns the ``ASCII`` encoding (or an alias),
+     and :c:func:`mbstowcs` and :c:func:`wcstombs` functions uses the
+     ``ISO-8859-1`` encoding.
+   * the current locale encoding.
+   The function uses the UTF-8 encoding in the Python UTF-8 mode.
   Return a pointer to a newly allocated byte string, use :c:func:`PyMem_Free`
   to free the memory. Return ``NULL`` on encoding error or memory allocation
   error
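
The priority list documented in the two hunks above can be read as a simple decision chain. A rough sketch, assuming the three conditions have already been evaluated: the struct, the enum and the function name are hypothetical and exist only for this illustration, and the real checks live inside CPython.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch. */
enum encoding_choice { ENC_UTF8, ENC_ASCII, ENC_CURRENT_LOCALE };

/* Hypothetical inputs: each flag stands for one documented condition. */
struct encoding_env {
    int is_macos_or_android;  /* platform always uses UTF-8 */
    int utf8_mode;            /* Python UTF-8 mode is enabled */
    int force_ascii;          /* LC_CTYPE is "C", nl_langinfo(CODESET)
                                 reports ASCII (or an alias), and
                                 mbstowcs()/wcstombs() use ISO-8859-1 */
};

/* Apply the documented order, highest priority first. */
enum encoding_choice
select_encoding(const struct encoding_env *env)
{
    assert(env != NULL);
    if (env->is_macos_or_android) {
        return ENC_UTF8;
    }
    if (env->utf8_mode) {
        return ENC_UTF8;
    }
    if (env->force_ascii) {
        return ENC_ASCII;
    }
    return ENC_CURRENT_LOCALE;
}
```

The order matters: the UTF-8 mode wins over the ASCII special case, and the current locale encoding is only the fallback.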

--- a/Doc/c-api/unicode.rst
+++ b/Doc/c-api/unicode.rst
@@ -770,12 +770,20 @@ system.
   :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
   Python startup).
+   This function ignores the Python UTF-8 mode.
   .. seealso::
      The :c:func:`Py_DecodeLocale` function.
   .. versionadded:: 3.3
+   .. versionchanged:: 3.7
+      The function now also uses the current locale encoding for the
+      ``surrogateescape`` error handler. Previously, :c:func:`Py_DecodeLocale`
+      was used for the ``surrogateescape``, and the current locale encoding was
+      used for ``strict``.
 .. c:function:: PyObject* PyUnicode_DecodeLocale(const char *str, const char *errors)
@@ -797,12 +805,20 @@ system.
   :c:data:`Py_FileSystemDefaultEncoding` (the locale encoding read at
   Python startup).
+   This function ignores the Python UTF-8 mode.
   .. seealso::
      The :c:func:`Py_EncodeLocale` function.
   .. versionadded:: 3.3
+   .. versionchanged:: 3.7
+      The function now also uses the current locale encoding for the
+      ``surrogateescape`` error handler. Previously, :c:func:`Py_EncodeLocale`
+      was used for the ``surrogateescape``, and the current locale encoding was
+      used for ``strict``.
 File System Encoding
 """"""""""""""""""""

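The documentation notes above say that PyUnicode_DecodeLocale() and PyUnicode_EncodeLocale() use the current locale encoding. As a rough illustration of what decoding with the current locale means at the C level, here is a strict decoder built on the POSIX behaviour of mbstowcs(). The helper name is hypothetical, the sketch has no surrogateescape support, and it is not CPython's implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Decode a NUL-terminated byte string with the encoding of the current
 * LC_CTYPE locale. Returns a newly allocated wide string (release it
 * with free()), or NULL on an invalid byte sequence or allocation
 * failure. */
wchar_t *
decode_current_locale_strict(const char *arg)
{
    assert(arg != NULL);

    /* POSIX: with a NULL destination, mbstowcs() returns the number of
     * wide characters required, excluding the terminating null. */
    size_t count = mbstowcs(NULL, arg, 0);
    if (count == (size_t)-1) {
        return NULL;  /* invalid sequence for the current locale */
    }

    wchar_t *wstr = malloc((count + 1) * sizeof(wchar_t));
    if (wstr == NULL) {
        return NULL;
    }
    if (mbstowcs(wstr, arg, count + 1) == (size_t)-1) {
        free(wstr);
        return NULL;
    }
    return wstr;
}
```

Because the result depends on LC_CTYPE at the time of the call, a program that changes the locale with setlocale() after startup gets the new encoding, which is the behaviour the commit restores for values such as time zone names.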
--- a/Include/fileutils.h
+++ b/Include/fileutils.h
@@ -20,18 +20,41 @@ PyAPI_FUNC(char*) _Py_EncodeLocaleRaw(
 #endif
 #ifdef Py_BUILD_CORE
+PyAPI_FUNC(int) _Py_DecodeUTF8Ex(
+    const char *arg,
+    Py_ssize_t arglen,
+    wchar_t **wstr,
+    size_t *wlen,
+    const char **reason,
+    int surrogateescape);
+PyAPI_FUNC(int) _Py_EncodeUTF8Ex(
+    const wchar_t *text,
+    char **str,
+    size_t *error_pos,
+    const char **reason,
+    int raw_malloc,
+    int surrogateescape);
 PyAPI_FUNC(wchar_t*) _Py_DecodeUTF8_surrogateescape(
-    const char *s,
-    Py_ssize_t size,
-    size_t *p_wlen);
+    const char *arg,
+    Py_ssize_t arglen);
-PyAPI_FUNC(wchar_t *) _Py_DecodeCurrentLocale(
+PyAPI_FUNC(int) _Py_DecodeLocaleEx(
    const char *arg,
-    size_t *size);
+    wchar_t **wstr,
+    size_t *wlen,
+    const char **reason,
+    int current_locale,
+    int surrogateescape);
-PyAPI_FUNC(char*) _Py_EncodeCurrentLocale(
+PyAPI_FUNC(int) _Py_EncodeLocaleEx(
    const wchar_t *text,
-    size_t *error_pos);
+    char **str,
+    size_t *error_pos,
+    const char **reason,
+    int current_locale,
+    int surrogateescape);
 #endif
 #ifndef Py_LIMITED_API

--- a/Include/unicodeobject.h
+++ b/Include/unicodeobject.h
@@ -1810,20 +1810,6 @@ PyAPI_FUNC(PyObject*) PyUnicode_EncodeLocale(
    PyObject *unicode,
    const char *errors
    );
-PyAPI_FUNC(PyObject*) _PyUnicode_DecodeCurrentLocale(
-    const char *str,
-    const char *errors);
-PyAPI_FUNC(PyObject*) _PyUnicode_DecodeCurrentLocaleAndSize(
-    const char *str,
-    Py_ssize_t len,
-    const char *errors);
-PyAPI_FUNC(PyObject*) _PyUnicode_EncodeCurrentLocale(
-    PyObject *unicode,
-    const char *errors
-    );
 #endif
 /* --- File system encoding ---------------------------------------------- */

--- a/Modules/_datetimemodule.c
+++ b/Modules/_datetimemodule.c
@@ -696,7 +696,7 @@ static int parse_isoformat_date(const char *dtstr,
    if (NULL == p) {
        return -1;
    }
    if (*(p++) != '-') {
        return -2;
    }

--- a/Modules/_localemodule.c
+++ b/Modules/_localemodule.c
@@ -572,8 +572,9 @@ PyIntl_bind_textdomain_codeset(PyObject* self,PyObject*args)
    if (!PyArg_ParseTuple(args, "sz", &domain, &codeset))
        return NULL;
    codeset = bind_textdomain_codeset(domain, codeset);
-    if (codeset)
+    if (codeset) {
        return PyUnicode_DecodeLocale(codeset, NULL);
+    }
    Py_RETURN_NONE;
 }
 #endif

--- a/Modules/getpath.c
+++ b/Modules/getpath.c
@@ -449,8 +449,8 @@ search_for_exec_prefix(const _PyCoreConfig *core_config,
            n = fread(buf, 1, MAXPATHLEN, f);
            buf[n] = '\0';
            fclose(f);
-            rel_builddir_path = _Py_DecodeUTF8_surrogateescape(buf, n, NULL);
+            rel_builddir_path = _Py_DecodeUTF8_surrogateescape(buf, n);
-            if (rel_builddir_path != NULL) {
+            if (rel_builddir_path) {
                wcsncpy(exec_prefix, calculate->argv0_path, MAXPATHLEN);
                exec_prefix[MAXPATHLEN] = L'\0';
                joinpath(exec_prefix, rel_builddir_path);

--- a/Modules/readline.c
+++ b/Modules/readline.c
@@ -132,13 +132,13 @@ static PyModuleDef readlinemodule;
 static PyObject *
 encode(PyObject *b)
 {
-    return _PyUnicode_EncodeCurrentLocale(b, "surrogateescape");
+    return PyUnicode_EncodeLocale(b, "surrogateescape");
 }
 static PyObject *
 decode(const char *s)
 {
-    return _PyUnicode_DecodeCurrentLocale(s, "surrogateescape");
+    return PyUnicode_DecodeLocale(s, "surrogateescape");
 }

--- a/Modules/timemodule.c
+++ b/Modules/timemodule.c
@@ -418,11 +418,11 @@ tmtotuple(struct tm *p
    SET(8, p->tm_isdst);
 #ifdef HAVE_STRUCT_TM_TM_ZONE
    PyStructSequence_SET_ITEM(v, 9,
-        _PyUnicode_DecodeCurrentLocale(p->tm_zone, "surrogateescape"));
+        PyUnicode_DecodeLocale(p->tm_zone, "surrogateescape"));
    SET(10, p->tm_gmtoff);
 #else
    PyStructSequence_SET_ITEM(v, 9,
-        _PyUnicode_DecodeCurrentLocale(zone, "surrogateescape"));
+        PyUnicode_DecodeLocale(zone, "surrogateescape"));
    PyStructSequence_SET_ITEM(v, 10, _PyLong_FromTime_t(gmtoff));
 #endif /* HAVE_STRUCT_TM_TM_ZONE */
 #undef SET
@@ -809,8 +809,7 @@ time_strftime(PyObject *self, PyObject *args)
 #ifdef HAVE_WCSFTIME
            ret = PyUnicode_FromWideChar(outbuf, buflen);
 #else
-            ret = _PyUnicode_DecodeCurrentLocaleAndSize(outbuf, buflen,
-                                                        "surrogateescape");
+            ret = PyUnicode_DecodeLocaleAndSize(outbuf, buflen, "surrogateescape");
 #endif
            PyMem_Free(outbuf);
            break;
@@ -1541,8 +1540,8 @@ PyInit_timezone(PyObject *m) {
    PyModule_AddIntConstant(m, "altzone", timezone-3600);
 #endif
    PyModule_AddIntConstant(m, "daylight", daylight);
-    otz0 = _PyUnicode_DecodeCurrentLocale(tzname[0], "surrogateescape");
+    otz0 = PyUnicode_DecodeLocale(tzname[0], "surrogateescape");
-    otz1 = _PyUnicode_DecodeCurrentLocale(tzname[1], "surrogateescape");
+    otz1 = PyUnicode_DecodeLocale(tzname[1], "surrogateescape");
    PyModule_AddObject(m, "tzname", Py_BuildValue("(NN)", otz0, otz1));
 #else /* !HAVE_TZNAME || __GLIBC__ || __CYGWIN__*/
    {

--- a/Objects/unicodeobject.c
+++ b/Objects/unicodeobject.c
--- a/Python/fileutils.c
+++ b/Python/fileutils.c
--- a/Python/pathconfig.c
+++ b/Python/pathconfig.c
@@ -382,8 +382,8 @@ _Py_FindEnvConfigValue(FILE *env_file, const wchar_t *key,
            /* Comment - skip */
            continue;
        }
-        tmpbuffer = _Py_DecodeUTF8_surrogateescape(buffer, n, NULL);
+        tmpbuffer = _Py_DecodeUTF8_surrogateescape(buffer, n);
-        if (tmpbuffer != NULL) {
+        if (tmpbuffer) {
            wchar_t * state;
            wchar_t * tok = wcstok(tmpbuffer, L" \t\r\n", &state);
            if ((tok != NULL) && !wcscmp(tok, key)) {