Commit 6ea4186d authored by Nick Coghlan, committed by GitHub

bpo-28180: Implementation for PEP 538 (#659)

- new PYTHONCOERCECLOCALE config setting
- coerces legacy C locale to C.UTF-8, C.utf8 or UTF-8 by default
- always uses C.UTF-8 on Android
- uses `surrogateescape` on stdin and stdout in the coercion
  target locales
- configure option to disable locale coercion at build time
- configure option to disable C locale warning at build time
parent 0afbabe2
......@@ -713,6 +713,42 @@ conflict.
.. versionadded:: 3.6
.. envvar:: PYTHONCOERCECLOCALE
If set to the value ``0``, causes the main Python command line application
to skip coercing the legacy ASCII-based C locale to a more capable UTF-8
based alternative. Note that this setting is checked even when the
:option:`-E` or :option:`-I` options are used, as it is handled prior to
the processing of command line options.
If this variable is *not* set, or is set to a value other than ``0``, and
the current locale reported for the ``LC_CTYPE`` category is the default
``C`` locale, then the Python CLI will attempt to configure the following
locales for the ``LC_CTYPE`` category in the order listed before loading the
interpreter runtime:
* ``C.UTF-8``
* ``C.utf8``
* ``UTF-8``
If setting one of these locales succeeds, then the ``LC_CTYPE``
environment variable will also be set accordingly in the current process
environment before the Python runtime is initialized. This ensures the
updated setting is seen in subprocesses, as well as in operations that
query the environment rather than the current C locale (such as Python's
own :func:`locale.getdefaultlocale`).
Configuring one of these locales (either explicitly or via the above
implicit locale coercion) will automatically set the error handler for
:data:`sys.stdin` and :data:`sys.stdout` to ``surrogateescape``. This
behavior can be overridden using :envvar:`PYTHONIOENCODING` as usual.
Availability: \*nix
.. versionadded:: 3.7
See :pep:`538` for more details.
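The selection rules described in this entry can be modelled in a few lines of Python. This is only an illustrative sketch, not the implementation, which lives in C and calls ``setlocale()`` directly. The function name ``pick_coercion_target`` and the ``is_available`` callback are hypothetical, and the ``LC_ALL`` check mirrors the C helper added by this commit, which skips coercion when ``LC_ALL`` is set to a non-empty value.

```python
TARGET_LOCALES = ("C.UTF-8", "C.utf8", "UTF-8")


def pick_coercion_target(environ, current_ctype, is_available):
    """Return the locale that coercion would select, or None.

    environ       -- mapping of environment variables
    current_ctype -- locale name currently reported for LC_CTYPE
    is_available  -- callable standing in for a successful setlocale() call
    """
    # Only the exact value "0" disables coercion
    if environ.get("PYTHONCOERCECLOCALE") == "0":
        return None
    # Coercion only applies to the legacy C locale
    if current_ctype != "C":
        return None
    # A non-empty LC_ALL overrides LC_CTYPE, so coercion is skipped
    if environ.get("LC_ALL"):
        return None
    # Try the targets in the documented order
    for name in TARGET_LOCALES:
        if is_available(name):
            return name
    return None


if __name__ == "__main__":
    # Pretend only the "UTF-8" locale exists on this (hypothetical) system
    only_utf8 = {"UTF-8"}.__contains__
    print(pick_coercion_target({}, "C", only_utf8))
    print(pick_coercion_target({"PYTHONCOERCECLOCALE": "0"}, "C", only_utf8))
```

Passing the availability check in as a callable keeps the sketch independent of the locales installed on the machine running it.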
Debug-mode variables
~~~~~~~~~~~~~~~~~~~~
......
......@@ -70,6 +70,51 @@ Summary -- Release highlights
New Features
============
.. _whatsnew37-pep538:
PEP 538: Legacy C Locale Coercion
---------------------------------
An ongoing challenge within the Python 3 series has been determining a sensible
default strategy for handling the "7-bit ASCII" text encoding assumption
currently implied by the use of the default C locale on non-Windows platforms.
:pep:`538` updates the default interpreter command line interface to
automatically coerce that locale to an available UTF-8 based locale as
described in the documentation of the new :envvar:`PYTHONCOERCECLOCALE`
environment variable. Automatically setting ``LC_CTYPE`` this way means that
both the core interpreter and locale-aware C extensions (such as
:mod:`readline`) will assume the use of UTF-8 as the default text encoding,
rather than ASCII.
The platform support definition in :pep:`11` has also been updated to limit
full text handling support to suitably configured non-ASCII based locales.
As part of this change, the default error handler for ``stdin`` and ``stdout``
is now ``surrogateescape`` (rather than ``strict``) when using any of the
defined coercion target locales (currently ``C.UTF-8``, ``C.utf8``, and
``UTF-8``). The default error handler for ``stderr`` continues to be
``backslashreplace``, regardless of locale.
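As a rough illustration of what the error handler change means in practice, the sketch below models the default handler choice and shows how ``surrogateescape`` lets undecodable bytes survive a decode/encode round trip. The helper names ``default_stream_errors`` and ``roundtrip`` are hypothetical, and the model is a simplification: it assumes a build with locale coercion enabled and treats the legacy ``C`` locale the same way as the coercion targets, as the C helper in this commit does.

```python
COERCION_TARGETS = ("C.UTF-8", "C.utf8", "UTF-8")


def default_stream_errors(stream_name, ctype_locale):
    """Model the default error handler for a standard stream."""
    # stderr always uses backslashreplace, regardless of locale
    if stream_name == "stderr":
        return "backslashreplace"
    # stdin/stdout use surrogateescape in the C locale and coercion targets
    if ctype_locale == "C" or ctype_locale in COERCION_TARGETS:
        return "surrogateescape"
    # Any other locale keeps the usual strict default
    return "strict"


def roundtrip(raw):
    """Decode bytes as UTF-8 with surrogateescape, then encode them back."""
    text = raw.decode("utf-8", "surrogateescape")
    return text, text.encode("utf-8", "surrogateescape")


if __name__ == "__main__":
    print(default_stream_errors("stdout", "C.UTF-8"))
    text, raw = roundtrip(b"caf\xff")
    print(ascii(text), raw)
```

With ``strict`` the same decode would raise ``UnicodeDecodeError``, which is why the more forgiving handler is preferred for streams whose real encoding may not match the locale.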
.. note::
In the current implementation, a warning message is printed directly to
``stderr`` even for successful implicit locale coercion. This gives
redistributors and system integrators the opportunity to determine if they
should be making an environmental change to avoid the need for implicit
coercion at the Python interpreter level.
However, it's not clear that this is going to be the best approach for
the final 3.7.0 release, and we may end up deciding to disable the warning
by default and provide some way of opting into it at runtime or build time.
Concrete examples of use cases where it would be preferable to disable the
warning by default can be noted on :issue:`30565`.
.. seealso::
:pep:`538` -- Coercing the legacy C locale to a UTF-8 based locale
PEP written and implemented by Nick Coghlan.
Other Language Changes
......
......@@ -48,8 +48,35 @@ def interpreter_requires_environment():
return __cached_interp_requires_environment
_PythonRunResult = collections.namedtuple("_PythonRunResult",
("rc", "out", "err"))
class _PythonRunResult(collections.namedtuple("_PythonRunResult",
                                          ("rc", "out", "err"))):
    """Helper for reporting Python subprocess run results"""
    def fail(self, cmd_line):
        """Provide helpful details about failed subcommand runs"""
        # Limit each stream to its last 80 * 100 bytes, decoded as ASCII
        maxlen = 80 * 100
        out, err = self.out, self.err
        if len(out) > maxlen:
            out = b'(... truncated stdout ...)' + out[-maxlen:]
        if len(err) > maxlen:
            err = b'(... truncated stderr ...)' + err[-maxlen:]
        out = out.decode('ascii', 'replace').rstrip()
        err = err.decode('ascii', 'replace').rstrip()
        raise AssertionError("Process return code is %d\n"
                             "command line: %r\n"
                             "\n"
                             "stdout:\n"
                             "---\n"
                             "%s\n"
                             "---\n"
                             "\n"
                             "stderr:\n"
                             "---\n"
                             "%s\n"
                             "---"
                             % (self.rc, cmd_line,
                                out,
                                err))
# Executing the interpreter in a subprocess
......@@ -107,30 +134,7 @@ def run_python_until_end(*args, **env_vars):
def _assert_python(expected_success, *args, **env_vars):
res, cmd_line = run_python_until_end(*args, **env_vars)
if (res.rc and expected_success) or (not res.rc and not expected_success):
# Limit to 80 lines to ASCII characters
maxlen = 80 * 100
out, err = res.out, res.err
if len(out) > maxlen:
out = b'(... truncated stdout ...)' + out[-maxlen:]
if len(err) > maxlen:
err = b'(... truncated stderr ...)' + err[-maxlen:]
out = out.decode('ascii', 'replace').rstrip()
err = err.decode('ascii', 'replace').rstrip()
raise AssertionError("Process return code is %d\n"
"command line: %r\n"
"\n"
"stdout:\n"
"---\n"
"%s\n"
"---\n"
"\n"
"stderr:\n"
"---\n"
"%s\n"
"---"
% (res.rc, cmd_line,
out,
err))
res.fail(cmd_line)
return res
def assert_python_ok(*args, **env_vars):
......
......@@ -371,14 +371,21 @@ class EmbeddingTests(unittest.TestCase):
def tearDown(self):
os.chdir(self.oldcwd)
def run_embedded_interpreter(self, *args):
def run_embedded_interpreter(self, *args, env=None):
"""Runs a test in the embedded interpreter"""
cmd = [self.test_exe]
cmd.extend(args)
if env is not None and sys.platform == 'win32':
# Windows requires at least the SYSTEMROOT environment variable to
# start Python.
env = env.copy()
env['SYSTEMROOT'] = os.environ['SYSTEMROOT']
p = subprocess.Popen(cmd,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
universal_newlines=True)
universal_newlines=True,
env=env)
(out, err) = p.communicate()
self.assertEqual(p.returncode, 0,
"bad returncode %d, stderr is %r" %
......@@ -471,26 +478,16 @@ class EmbeddingTests(unittest.TestCase):
self.assertNotEqual(sub.tstate, main.tstate)
self.assertNotEqual(sub.modules, main.modules)
@staticmethod
def _get_default_pipe_encoding():
rp, wp = os.pipe()
try:
with os.fdopen(wp, 'w') as w:
default_pipe_encoding = w.encoding
finally:
os.close(rp)
return default_pipe_encoding
def test_forced_io_encoding(self):
# Checks forced configuration of embedded interpreter IO streams
out, err = self.run_embedded_interpreter("forced_io_encoding")
if support.verbose:
env = {"PYTHONIOENCODING": "utf-8:surrogateescape"}
out, err = self.run_embedded_interpreter("forced_io_encoding", env=env)
if support.verbose > 1:
print()
print(out)
print(err)
expected_errors = sys.__stdout__.errors
expected_stdin_encoding = sys.__stdin__.encoding
expected_pipe_encoding = self._get_default_pipe_encoding()
expected_stream_encoding = "utf-8"
expected_errors = "surrogateescape"
expected_output = '\n'.join([
"--- Use defaults ---",
"Expected encoding: default",
......@@ -517,8 +514,8 @@ class EmbeddingTests(unittest.TestCase):
"stdout: latin-1:replace",
"stderr: latin-1:backslashreplace"])
expected_output = expected_output.format(
in_encoding=expected_stdin_encoding,
out_encoding=expected_pipe_encoding,
in_encoding=expected_stream_encoding,
out_encoding=expected_stream_encoding,
errors=expected_errors)
# This is useful if we ever trip over odd platform behaviour
self.maxDiff = None
......
......@@ -8,8 +8,9 @@ import sys
import subprocess
import tempfile
from test.support import script_helper, is_android
from test.support.script_helper import (spawn_python, kill_python, assert_python_ok,
assert_python_failure)
from test.support.script_helper import (
spawn_python, kill_python, assert_python_ok, assert_python_failure
)
# XXX (ncoghlan): Move to script_helper and make consistent with run_python
......@@ -150,6 +151,7 @@ class CmdLineTest(unittest.TestCase):
env = os.environ.copy()
# Use C locale to get ascii for the locale encoding
env['LC_ALL'] = 'C'
env['PYTHONCOERCECLOCALE'] = '0'
code = (
b'import locale; '
b'print(ascii("' + undecodable + b'"), '
......
......@@ -642,7 +642,8 @@ class ProcessTestCase(BaseTestCase):
# on adding even when the environment in exec is empty.
# Gentoo sandboxes also force LD_PRELOAD and SANDBOX_* to exist.
return ('VERSIONER' in n or '__CF' in n or # MacOS
n == 'LD_PRELOAD' or n.startswith('SANDBOX')) # Gentoo
n == 'LD_PRELOAD' or n.startswith('SANDBOX') or # Gentoo
n == 'LC_CTYPE') # Locale coercion triggered
with subprocess.Popen([sys.executable, "-c",
'import os; print(list(os.environ.keys()))'],
......
......@@ -682,6 +682,7 @@ class SysModuleTest(unittest.TestCase):
# Force the POSIX locale
env = os.environ.copy()
env["LC_ALL"] = "C"
env["PYTHONCOERCECLOCALE"] = "0"
code = '\n'.join((
'import sys',
'def dump(name):',
......
......@@ -10,6 +10,11 @@ What's New in Python 3.7.0 alpha 1?
Core and Builtins
-----------------
- bpo-28180: Implement PEP 538 (legacy C locale coercion). This means that when
a suitable coercion target locale is available, both the core interpreter and
locale-aware C extensions will assume the use of UTF-8 as the default text
encoding, rather than ASCII.
- bpo-30486: Allows setting cell values for __closure__. Patch by Lisa Roach.
- bpo-30537: itertools.islice now accepts integer-like objects (having
......
......@@ -15,6 +15,21 @@ wmain(int argc, wchar_t **argv)
}
#else
/* Access private pylifecycle helper API to better handle the legacy C locale
*
* The legacy C locale assumes ASCII as the default text encoding, which
* causes problems not only for the CPython runtime, but also other
* components like GNU readline.
*
* Accordingly, when the CLI detects it, it attempts to coerce it to a
* more capable UTF-8 based alternative.
*
* See the documentation of the PYTHONCOERCECLOCALE setting for more details.
*
*/
extern int _Py_LegacyLocaleDetected(void);
extern void _Py_CoerceLegacyLocale(void);
int
main(int argc, char **argv)
{
......@@ -25,7 +40,11 @@ main(int argc, char **argv)
char *oldloc;
/* Force malloc() allocator to bootstrap Python */
#ifdef Py_DEBUG
(void)_PyMem_SetupAllocators("malloc_debug");
# else
(void)_PyMem_SetupAllocators("malloc");
# endif
argv_copy = (wchar_t **)PyMem_RawMalloc(sizeof(wchar_t*) * (argc+1));
argv_copy2 = (wchar_t **)PyMem_RawMalloc(sizeof(wchar_t*) * (argc+1));
......@@ -49,7 +68,21 @@ main(int argc, char **argv)
return 1;
}
#ifdef __ANDROID__
/* Passing "" to setlocale() on Android requests the C locale rather
* than checking environment variables, so request C.UTF-8 explicitly
*/
setlocale(LC_ALL, "C.UTF-8");
#else
/* Reconfigure the locale to the default for this process */
setlocale(LC_ALL, "");
#endif
if (_Py_LegacyLocaleDetected()) {
_Py_CoerceLegacyLocale();
}
/* Convert from char to wchar_t based on the locale settings */
for (i = 0; i < argc; i++) {
argv_copy[i] = Py_DecodeLocale(argv[i], NULL);
if (!argv_copy[i]) {
......@@ -70,7 +103,11 @@ main(int argc, char **argv)
/* Force again malloc() allocator to release memory blocks allocated
before Py_Main() */
#ifdef Py_DEBUG
(void)_PyMem_SetupAllocators("malloc_debug");
# else
(void)_PyMem_SetupAllocators("malloc");
# endif
for (i = 0; i < argc; i++) {
PyMem_RawFree(argv_copy2[i]);
......
......@@ -178,6 +178,7 @@ Py_SetStandardStreamEncoding(const char *encoding, const char *errors)
return 0;
}
/* Global initializations. Can be undone by Py_FinalizeEx(). Don't
call this twice without an intervening Py_FinalizeEx() call. When
initializations fail, a fatal error is issued and the function does
......@@ -330,6 +331,159 @@ initexternalimport(PyInterpreterState *interp)
Py_DECREF(value);
}
/* Helper functions to better handle the legacy C locale
*
* The legacy C locale assumes ASCII as the default text encoding, which
* causes problems not only for the CPython runtime, but also other
* components like GNU readline.
*
* Accordingly, when the CLI detects it, it attempts to coerce it to a
* more capable UTF-8 based alternative as follows:
*
* if (_Py_LegacyLocaleDetected()) {
* _Py_CoerceLegacyLocale();
* }
*
* See the documentation of the PYTHONCOERCECLOCALE setting for more details.
*
* Locale coercion also impacts the default error handler for the standard
* streams: while the usual default is "strict", the default for the legacy
* C locale and for any of the coercion target locales is "surrogateescape".
*/
int
_Py_LegacyLocaleDetected(void)
{
#ifndef MS_WINDOWS
/* On non-Windows systems, the C locale is considered a legacy locale */
const char *ctype_loc = setlocale(LC_CTYPE, NULL);
return ctype_loc != NULL && strcmp(ctype_loc, "C") == 0;
#else
/* Windows uses code pages instead of locales, so no locale is legacy */
return 0;
#endif
}
typedef struct _CandidateLocale {
const char *locale_name; /* The locale to try as a coercion target */
} _LocaleCoercionTarget;
static _LocaleCoercionTarget _TARGET_LOCALES[] = {
{"C.UTF-8"},
{"C.utf8"},
{"UTF-8"},
{NULL}
};
static char *
get_default_standard_stream_error_handler(void)
{
const char *ctype_loc = setlocale(LC_CTYPE, NULL);
if (ctype_loc != NULL) {
/* "surrogateescape" is the default in the legacy C locale */
if (strcmp(ctype_loc, "C") == 0) {
return "surrogateescape";
}
#ifdef PY_COERCE_C_LOCALE
/* "surrogateescape" is the default in locale coercion target locales */
const _LocaleCoercionTarget *target = NULL;
for (target = _TARGET_LOCALES; target->locale_name; target++) {
if (strcmp(ctype_loc, target->locale_name) == 0) {
return "surrogateescape";
}
}
#endif
}
/* Otherwise return NULL to request the typical default error handler */
return NULL;
}
#ifdef PY_COERCE_C_LOCALE
static const char *_C_LOCALE_COERCION_WARNING =
"Python detected LC_CTYPE=C: LC_CTYPE coerced to %.20s (set another locale "
"or PYTHONCOERCECLOCALE=0 to disable this locale coercion behavior).\n";
static void
_coerce_default_locale_settings(const _LocaleCoercionTarget *target)
{
const char *newloc = target->locale_name;
/* Reset locale back to currently configured defaults */
setlocale(LC_ALL, "");
/* Set the relevant locale environment variable */
if (setenv("LC_CTYPE", newloc, 1)) {
fprintf(stderr,
"Error setting LC_CTYPE, skipping C locale coercion\n");
return;
}
fprintf(stderr, _C_LOCALE_COERCION_WARNING, newloc);
/* Reconfigure with the overridden environment variables */
setlocale(LC_ALL, "");
}
#endif
void
_Py_CoerceLegacyLocale(void)
{
#ifdef PY_COERCE_C_LOCALE
/* We ignore the Python -E and -I flags here, as the CLI needs to sort out
* the locale settings *before* we try to do anything with the command
* line arguments. For cross-platform debugging purposes, we also need
* to give end users a way to force even scripts that are otherwise
* isolated from their environment to use the legacy ASCII-centric C
* locale.
*
* Ignoring -E and -I is safe from a security perspective, as we only use
* the setting to turn *off* the implicit locale coercion, and anyone with
* access to the process environment already has the ability to set
* `LC_ALL=C` to override the C level locale settings anyway.
*/
const char *coerce_c_locale = getenv("PYTHONCOERCECLOCALE");
if (coerce_c_locale == NULL || strncmp(coerce_c_locale, "0", 2) != 0) {
/* PYTHONCOERCECLOCALE is not set, or is set to something other than "0" */
const char *locale_override = getenv("LC_ALL");
if (locale_override == NULL || *locale_override == '\0') {
/* LC_ALL is also not set (or is set to an empty string) */
const _LocaleCoercionTarget *target = NULL;
for (target = _TARGET_LOCALES; target->locale_name; target++) {
const char *new_locale = setlocale(LC_CTYPE,
target->locale_name);
if (new_locale != NULL) {
/* Successfully configured locale, so make it the default */
_coerce_default_locale_settings(target);
return;
}
}
}
}
/* No C locale warning here, as Py_Initialize will emit one later */
#endif
}
#ifdef PY_WARN_ON_C_LOCALE
static const char *_C_LOCALE_WARNING =
"Python runtime initialized with LC_CTYPE=C (a locale with default ASCII "
"encoding), which may cause Unicode compatibility problems. Using C.UTF-8, "
"C.utf8, or UTF-8 (if available) as alternative Unicode-compatible "
"locales is recommended.\n";
static void
_emit_stderr_warning_for_c_locale(void)
{
const char *coerce_c_locale = getenv("PYTHONCOERCECLOCALE");
if (coerce_c_locale == NULL || strncmp(coerce_c_locale, "0", 2) != 0) {
if (_Py_LegacyLocaleDetected()) {
fprintf(stderr, "%s", _C_LOCALE_WARNING);
}
}
}
#endif
/* Global initializations. Can be undone by Py_Finalize(). Don't
call this twice without an intervening Py_Finalize() call.
......@@ -396,11 +550,21 @@ void _Py_InitializeCore(const _PyCoreConfig *config)
*/
_Py_Finalizing = NULL;
#ifdef HAVE_SETLOCALE
#ifdef __ANDROID__
/* Passing "" to setlocale() on Android requests the C locale rather
* than checking environment variables, so request C.UTF-8 explicitly
*/
setlocale(LC_CTYPE, "C.UTF-8");
#else
#ifndef MS_WINDOWS
/* Set up the LC_CTYPE locale, so we can obtain
the locale's charset without having to switch
locales. */
setlocale(LC_CTYPE, "");
#ifdef PY_WARN_ON_C_LOCALE
_emit_stderr_warning_for_c_locale();
#endif
#endif
#endif
if ((p = Py_GETENV("PYTHONDEBUG")) && *p != '\0')
......@@ -1457,12 +1621,8 @@ initstdio(void)
}
}
if (!errors && !(pythonioencoding && *pythonioencoding)) {
/* When the LC_CTYPE locale is the POSIX locale ("C locale"),
stdin and stdout use the surrogateescape error handler by
default, instead of the strict error handler. */
char *loc = setlocale(LC_CTYPE, NULL);
if (loc != NULL && strcmp(loc, "C") == 0)
errors = "surrogateescape";
/* Choose the default error handler based on the current locale */
errors = get_default_standard_stream_error_handler();
}
}
......
......@@ -834,6 +834,8 @@ with_thread
enable_ipv6
with_doc_strings
with_pymalloc
with_c_locale_coercion
with_c_locale_warning
with_valgrind
with_dtrace
with_fpectl
......@@ -1528,6 +1530,12 @@ Optional Packages:
deprecated; use --with(out)-threads
--with(out)-doc-strings disable/enable documentation strings
--with(out)-pymalloc disable/enable specialized mallocs
--with(out)-c-locale-coercion
disable/enable C locale coercion to a UTF-8 based
locale
--with(out)-c-locale-warning
disable/enable locale compatibility warning in the C
locale
--with-valgrind Enable Valgrind support
--with(out)-dtrace disable/enable DTrace support
--with-fpectl enable SIGFPE catching
......@@ -11047,6 +11055,52 @@ fi
{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_pymalloc" >&5
$as_echo "$with_pymalloc" >&6; }
# Check for --with-c-locale-coercion
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for --with-c-locale-coercion" >&5
$as_echo_n "checking for --with-c-locale-coercion... " >&6; }
# Check whether --with-c-locale-coercion was given.
if test "${with_c_locale_coercion+set}" = set; then :
withval=$with_c_locale_coercion;
fi
if test -z "$with_c_locale_coercion"
then
with_c_locale_coercion="yes"
fi
if test "$with_c_locale_coercion" != "no"
then
$as_echo "#define PY_COERCE_C_LOCALE 1" >>confdefs.h
fi
{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_c_locale_coercion" >&5
$as_echo "$with_c_locale_coercion" >&6; }
# Check for --with-c-locale-warning
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for --with-c-locale-warning" >&5
$as_echo_n "checking for --with-c-locale-warning... " >&6; }
# Check whether --with-c-locale-warning was given.
if test "${with_c_locale_warning+set}" = set; then :
withval=$with_c_locale_warning;
fi
if test -z "$with_c_locale_warning"
then
with_c_locale_warning="yes"
fi
if test "$with_c_locale_warning" != "no"
then
$as_echo "#define PY_WARN_ON_C_LOCALE 1" >>confdefs.h
fi
{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_c_locale_warning" >&5
$as_echo "$with_c_locale_warning" >&6; }
# Check for Valgrind support
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for --with-valgrind" >&5
$as_echo_n "checking for --with-valgrind... " >&6; }
......
......@@ -3325,6 +3325,40 @@ then
fi
AC_MSG_RESULT($with_pymalloc)
# Check for --with-c-locale-coercion
AC_MSG_CHECKING(for --with-c-locale-coercion)
AC_ARG_WITH(c-locale-coercion,
AS_HELP_STRING([--with(out)-c-locale-coercion],
[disable/enable C locale coercion to a UTF-8 based locale]))
if test -z "$with_c_locale_coercion"
then
with_c_locale_coercion="yes"
fi
if test "$with_c_locale_coercion" != "no"
then
AC_DEFINE(PY_COERCE_C_LOCALE, 1,
[Define if you want to coerce the C locale to a UTF-8 based locale])
fi
AC_MSG_RESULT($with_c_locale_coercion)
# Check for --with-c-locale-warning
AC_MSG_CHECKING(for --with-c-locale-warning)
AC_ARG_WITH(c-locale-warning,
AS_HELP_STRING([--with(out)-c-locale-warning],
[disable/enable locale compatibility warning in the C locale]))
if test -z "$with_c_locale_warning"
then
with_c_locale_warning="yes"
fi
if test "$with_c_locale_warning" != "no"
then
AC_DEFINE(PY_WARN_ON_C_LOCALE, 1,
[Define to emit a locale compatibility warning in the C locale])
fi
AC_MSG_RESULT($with_c_locale_warning)
# Check for Valgrind support
AC_MSG_CHECKING([for --with-valgrind])
AC_ARG_WITH([valgrind],
......
......@@ -1247,9 +1247,15 @@
/* Define as the preferred size in bits of long digits */
#undef PYLONG_BITS_IN_DIGIT
/* Define if you want to coerce the C locale to a UTF-8 based locale */
#undef PY_COERCE_C_LOCALE
/* Define to printf format modifier for Py_ssize_t */
#undef PY_FORMAT_SIZE_T
/* Define to emit a locale compatibility warning in the C locale */
#undef PY_WARN_ON_C_LOCALE
/* Define if you want to build an interpreter with many run-time checks. */
#undef Py_DEBUG
......