bpo-33416: Add end positions to Python AST (GH-11605)

The majority of this PR is tediously passing `end_lineno` and `end_col_offset` everywhere. Here are non-trivial points:

* It is not possible to reconstruct end positions in AST "on the fly", some information is lost after an AST node is constructed, so we need two more attributes for every AST node: `end_lineno` and `end_col_offset`.
* I add end position information to both CST and AST. Although it may be technically possible to avoid adding end positions to CST, the code becomes more cumbersome and less efficient.
* Since the end position is not known for non-leaf CST nodes while the next token is added, this requires a bit of extra care (see `_PyNode_FinalizeEndPos`). Unless I made some mistake, the algorithm should be linear.
* For statements, I "trim" the end position of suites to not include the terminal newlines and dedent (this seems to be what people would expect), for example in
  ```python
  class C:
      pass

  pass
  ```
  the end line and end column for the class definition is (2, 8).
* For `end_col_offset` I use the common Python convention for indexing, for example for `pass` the `end_col_offset` is 4 (not 3), so that `[0:4]` gives one the source code that corresponds to the node.
* I added a helper function `ast.get_source_segment()`, to get source text segment corresponding to a given AST node. It is also useful for testing.

An (inevitable) downside of this PR is that AST now takes almost 25% more memory. I think however it is probably justified by the benefits.
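
The trimming and slicing conventions listed above can be checked with a short sketch. It assumes an interpreter that already includes this change (Python 3.8 or later); the variable names are illustrative and not part of the PR.

```python
import ast

# The class example from the commit message: a class body followed by
# a blank line and a dedented statement.
source = "class C:\n    pass\n\npass\n"
tree = ast.parse(source)
cls, trailing_pass = tree.body

# The end position of the class is trimmed to the end of its last
# statement; it does not include the blank line or the dedent.
print((cls.end_lineno, cls.end_col_offset))  # (2, 8)

# end_col_offset points one past the last character, so ordinary
# slicing recovers the text of a one-line node.
line = source.splitlines()[trailing_pass.lineno - 1]
print(line[trailing_pass.col_offset:trailing_pass.end_col_offset])  # pass

# The new helper returns the full (possibly multi-line) segment.
print(repr(ast.get_source_segment(source, cls)))  # 'class C:\n    pass'
```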

Commit 9932a228 authored Jan 22, 2019 by Ivan Levkivskyi Committed by GitHub Jan 22, 2019
19 changed files
--- a/Doc/library/ast.rst
+++ b/Doc/library/ast.rst
@@ -61,13 +61,21 @@ Node classes

   .. attribute:: lineno
                  col_offset
+                  end_lineno
+                  end_col_offset

      Instances of :class:`ast.expr` and :class:`ast.stmt` subclasses have
-      :attr:`lineno` and :attr:`col_offset` attributes.  The :attr:`lineno` is
-      the line number of source text (1-indexed so the first line is line 1) and
-      the :attr:`col_offset` is the UTF-8 byte offset of the first token that
-      generated the node.  The UTF-8 offset is recorded because the parser uses
-      UTF-8 internally.
+      :attr:`lineno`, :attr:`col_offset`, :attr:`end_lineno`, and :attr:`end_col_offset`
+      attributes.  The :attr:`lineno` and :attr:`end_lineno` are the first and
+      last line numbers of source text span (1-indexed so the first line is line 1)
+      and the :attr:`col_offset` and :attr:`end_col_offset` are the corresponding
+      UTF-8 byte offsets of the first and last tokens that generated the node.
+      The UTF-8 offset is recorded because the parser uses UTF-8 internally.
+
+      Note that the end positions are not required by the compiler and are
+      therefore optional. The end offset is *after* the last symbol, for example
+      one can get the source segment of a one-line expression node using
+      ``source_line[node.col_offset : node.end_col_offset]``.

   The constructor of a class :class:`ast.T` parses its arguments as follows:

@@ -162,6 +170,18 @@ and classes for traversing abstract syntax trees:
      :class:`AsyncFunctionDef` is now supported.


+.. function:: get_source_segment(source, node, *, padded=False)
+
+   Get source code segment of the *source* that generated *node*.
+   If some location information (:attr:`lineno`, :attr:`end_lineno`,
+   :attr:`col_offset`, or :attr:`end_col_offset`) is missing, return ``None``.
+
+   If *padded* is ``True``, the first line of a multi-line statement will
+   be padded with spaces to match its original position.
+
+   .. versionadded:: 3.8
+
+
 .. function:: fix_missing_locations(node)

   When you compile a node tree with :func:`compile`, the compiler expects
@@ -173,14 +193,16 @@ and classes for traversing abstract syntax trees:

 .. function:: increment_lineno(node, n=1)

-   Increment the line number of each node in the tree starting at *node* by *n*.
-   This is useful to "move code" to a different location in a file.
+   Increment the line number and end line number of each node in the tree
+   starting at *node* by *n*. This is useful to "move code" to a different
+   location in a file.


 .. function:: copy_location(new_node, old_node)

-   Copy source location (:attr:`lineno` and :attr:`col_offset`) from *old_node*
-   to *new_node* if possible, and return *new_node*.
+   Copy source location (:attr:`lineno`, :attr:`col_offset`, :attr:`end_lineno`,
+   and :attr:`end_col_offset`) from *old_node* to *new_node* if possible,
+   and return *new_node*.


 .. function:: iter_fields(node)
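
Because the documented offsets are UTF-8 byte offsets, slicing a `str` directly is only safe for ASCII-only lines. The following sketch (assuming Python 3.8 or later; the sample line is made up for illustration) shows the difference between slicing characters and slicing encoded bytes.

```python
import ast

# A line with one non-ASCII character before the node of interest.
line = 'x = "héllo" + y'
name = ast.parse(line).body[0].value.right  # the Name node for `y`

# "é" takes two bytes in UTF-8, so the byte offset of `y` is one more
# than its character index.
print(name.col_offset, name.end_col_offset)  # 15 16
print(line.index("y"))                       # 14

# Slicing the str with byte offsets misses the node ...
print(repr(line[name.col_offset:name.end_col_offset]))  # ''

# ... while slicing the encoded bytes recovers it.
segment = line.encode()[name.col_offset:name.end_col_offset].decode()
print(segment)  # y
```

This is the same encode, slice, decode pattern that `get_source_segment()` in `Lib/ast.py` uses.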

--- a/Include/Python-ast.h
+++ b/Include/Python-ast.h
--- a/Include/node.h
+++ b/Include/node.h
@@ -14,11 +14,14 @@ typedef struct _node {
    int                 n_col_offset;
    int                 n_nchildren;
    struct _node        *n_child;
+    int                 n_end_lineno;
+    int                 n_end_col_offset;
 } node;

 PyAPI_FUNC(node *) PyNode_New(int type);
 PyAPI_FUNC(int) PyNode_AddChild(node *n, int type,
-                                      char *str, int lineno, int col_offset);
+                                char *str, int lineno, int col_offset,
+                                int end_lineno, int end_col_offset);
 PyAPI_FUNC(void) PyNode_Free(node *n);
 #ifndef Py_LIMITED_API
 PyAPI_FUNC(Py_ssize_t) _PyNode_SizeOf(node *n);
@@ -37,6 +40,7 @@ PyAPI_FUNC(Py_ssize_t) _PyNode_SizeOf(node *n);
 #define REQ(n, type) assert(TYPE(n) == (type))

 PyAPI_FUNC(void) PyNode_ListTree(node *);
+void _PyNode_FinalizeEndPos(node *n);  // helper also used in parsetok.c

 #ifdef __cplusplus
 }

--- a/Lib/ast.py
+++ b/Lib/ast.py
@@ -115,10 +115,10 @@ def dump(node, annotate_fields=True, include_attributes=False):

 def copy_location(new_node, old_node):
    """
-    Copy source location (`lineno` and `col_offset` attributes) from
-    *old_node* to *new_node* if possible, and return *new_node*.
+    Copy source location (`lineno`, `col_offset`, `end_lineno`, and `end_col_offset`
+    attributes) from *old_node* to *new_node* if possible, and return *new_node*.
    """
-    for attr in 'lineno', 'col_offset':
+    for attr in 'lineno', 'col_offset', 'end_lineno', 'end_col_offset':
        if attr in old_node._attributes and attr in new_node._attributes \
           and hasattr(old_node, attr):
            setattr(new_node, attr, getattr(old_node, attr))
@@ -133,31 +133,44 @@ def fix_missing_locations(node):
    recursively where not already set, by setting them to the values of the
    parent node.  It works recursively starting at *node*.
    """
-    def _fix(node, lineno, col_offset):
+    def _fix(node, lineno, col_offset, end_lineno, end_col_offset):
        if 'lineno' in node._attributes:
            if not hasattr(node, 'lineno'):
                node.lineno = lineno
            else:
                lineno = node.lineno
+        if 'end_lineno' in node._attributes:
+            if not hasattr(node, 'end_lineno'):
+                node.end_lineno = end_lineno
+            else:
+                end_lineno = node.end_lineno
        if 'col_offset' in node._attributes:
            if not hasattr(node, 'col_offset'):
                node.col_offset = col_offset
            else:
                col_offset = node.col_offset
+        if 'end_col_offset' in node._attributes:
+            if not hasattr(node, 'end_col_offset'):
+                node.end_col_offset = end_col_offset
+            else:
+                end_col_offset = node.end_col_offset
        for child in iter_child_nodes(node):
-            _fix(child, lineno, col_offset)
-    _fix(node, 1, 0)
+            _fix(child, lineno, col_offset, end_lineno, end_col_offset)
+    _fix(node, 1, 0, 1, 0)
    return node


 def increment_lineno(node, n=1):
    """
-    Increment the line number of each node in the tree starting at *node* by *n*.
-    This is useful to "move code" to a different location in a file.
+    Increment the line number and end line number of each node in the tree
+    starting at *node* by *n*. This is useful to "move code" to a different
+    location in a file.
    """
    for child in walk(node):
        if 'lineno' in child._attributes:
            child.lineno = getattr(child, 'lineno', 0) + n
+        if 'end_lineno' in child._attributes:
+            child.end_lineno = getattr(child, 'end_lineno', 0) + n
    return node
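
A small usage sketch of the two updated helpers, assuming Python 3.8 or later. A hand-built node without location information inherits all four attributes from its parent, and `increment_lineno()` then shifts both the start and end line numbers.

```python
import ast

tree = ast.parse("x = 1\n")
assign = tree.body[0]

# Replace the value with a hand-built node that carries no location.
new_value = ast.Constant(value=2)
assign.value = new_value

# Missing positions (start and end) are copied from the parent node.
ast.fix_missing_locations(tree)
print(new_value.lineno, new_value.col_offset)          # 1 0
print(new_value.end_lineno, new_value.end_col_offset)  # 1 5

# Moving the code down by 10 lines updates lineno and end_lineno alike.
ast.increment_lineno(tree, n=10)
print(assign.lineno, assign.end_lineno)  # 11 11
```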


@@ -213,6 +226,77 @@ def get_docstring(node, clean=True):
    return text


+def _splitlines_no_ff(source):
+    """Split a string into lines ignoring form feed and other chars.
+
+    This mimics how the Python parser splits source code.
+    """
+    idx = 0
+    lines = []
+    next_line = ''
+    while idx < len(source):
+        c = source[idx]
+        next_line += c
+        idx += 1
+        # Keep \r\n together
+        if c == '\r' and idx < len(source) and source[idx] == '\n':
+            next_line += '\n'
+            idx += 1
+        if c in '\r\n':
+            lines.append(next_line)
+            next_line = ''
+
+    if next_line:
+        lines.append(next_line)
+    return lines
+
+
+def _pad_whitespace(source):
+    """Replace all chars except '\f\t' in a line with spaces."""
+    result = ''
+    for c in source:
+        if c in '\f\t':
+            result += c
+        else:
+            result += ' '
+    return result
+
+
+def get_source_segment(source, node, *, padded=False):
+    """Get source code segment of the *source* that generated *node*.
+
+    If some location information (`lineno`, `end_lineno`, `col_offset`,
+    or `end_col_offset`) is missing, return None.
+
+    If *padded* is `True`, the first line of a multi-line statement will
+    be padded with spaces to match its original position.
+    """
+    try:
+        lineno = node.lineno - 1
+        end_lineno = node.end_lineno - 1
+        col_offset = node.col_offset
+        end_col_offset = node.end_col_offset
+    except AttributeError:
+        return None
+
+    lines = _splitlines_no_ff(source)
+    if end_lineno == lineno:
+        return lines[lineno].encode()[col_offset:end_col_offset].decode()
+
+    if padded:
+        padding = _pad_whitespace(lines[lineno].encode()[:col_offset].decode())
+    else:
+        padding = ''
+
+    first = padding + lines[lineno].encode()[col_offset:].decode()
+    last = lines[end_lineno].encode()[:end_col_offset].decode()
+    lines = lines[lineno+1:end_lineno]
+
+    lines.insert(0, first)
+    lines.append(last)
+    return ''.join(lines)
+
+
 def walk(node):
    """
    Recursively yield all descendant nodes in the tree starting at *node*
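
To illustrate the `padded` flag of `get_source_segment()`, here is a brief sketch assuming Python 3.8 or later; the nested function is an arbitrary example. Without padding, the first line of a multi-line node starts at the node itself, so its indentation no longer lines up with the following lines.

```python
import ast

source = "if True:\n    def f():\n        return 1\n"
func = ast.parse(source).body[0].body[0]  # the nested FunctionDef

plain = ast.get_source_segment(source, func)
padded = ast.get_source_segment(source, func, padded=True)

print(repr(plain))   # 'def f():\n        return 1'
print(repr(padded))  # '    def f():\n        return 1'
```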

--- a/Lib/test/test_asdl_parser.py
+++ b/Lib/test/test_asdl_parser.py
@@ -62,14 +62,16 @@ class TestAsdlParser(unittest.TestCase):

    def test_attributes(self):
        stmt = self.types['stmt']
-        self.assertEqual(len(stmt.attributes), 2)
+        self.assertEqual(len(stmt.attributes), 4)
        self.assertEqual(str(stmt.attributes[0]), 'Field(int, lineno)')
        self.assertEqual(str(stmt.attributes[1]), 'Field(int, col_offset)')
+        self.assertEqual(str(stmt.attributes[2]), 'Field(int, end_lineno, opt=True)')
+        self.assertEqual(str(stmt.attributes[3]), 'Field(int, end_col_offset, opt=True)')

    def test_constructor_fields(self):
        ehandler = self.types['excepthandler']
        self.assertEqual(len(ehandler.types), 1)
-        self.assertEqual(len(ehandler.attributes), 2)
+        self.assertEqual(len(ehandler.attributes), 4)

        cons = ehandler.types[0]
        self.assertIsInstance(cons, self.asdl.Constructor)

--- a/Lib/test/test_ast.py
+++ b/Lib/test/test_ast.py
--- a/Lib/test/test_parser.py
+++ b/Lib/test/test_parser.py
@@ -880,7 +880,7 @@ class STObjectTestCase(unittest.TestCase):
            return 1 << (n - 1).bit_length()

        basesize = support.calcobjsize('Pii')
-        nodesize = struct.calcsize('hP3iP0h')
+        nodesize = struct.calcsize('hP3iP0h2i')
        def sizeofchildren(node):
            if node is None:
                return 0

--- a/Misc/NEWS.d/next/Core and Builtins/2019-01-19-19-41-53.bpo-33416.VDeOU5.rst
+++ b/Misc/NEWS.d/next/Core and Builtins/2019-01-19-19-41-53.bpo-33416.VDeOU5.rst
+Add end line and end column position information to the Python AST nodes.
+This is a C-level backwards incompatible change.
\ No newline at end of file
--- a/Modules/parsermodule.c
+++ b/Modules/parsermodule.c
@@ -920,7 +920,7 @@ build_node_children(PyObject *tuple, node *root, int *line_num)
            Py_DECREF(elem);
            return NULL;
        }
-        err = PyNode_AddChild(root, type, strn, *line_num, 0);
+        err = PyNode_AddChild(root, type, strn, *line_num, 0, *line_num, 0);
        if (err == E_NOMEM) {
            Py_DECREF(elem);
            PyObject_FREE(strn);

--- a/Parser/Python.asdl
+++ b/Parser/Python.asdl
@@ -50,7 +50,7 @@ module Python

          -- XXX Jython will be different
          -- col_offset is the byte offset in the utf8 string the parser uses
-          attributes (int lineno, int col_offset)
+          attributes (int lineno, int col_offset, int? end_lineno, int? end_col_offset)

          -- BoolOp() can use left & right?
    expr = BoolOp(boolop op, expr* values)
@@ -85,7 +85,7 @@ module Python
         | Tuple(expr* elts, expr_context ctx)

          -- col_offset is the byte offset in the utf8 string the parser uses
-          attributes (int lineno, int col_offset)
+          attributes (int lineno, int col_offset, int? end_lineno, int? end_col_offset)

    expr_context = Load | Store | Del | AugLoad | AugStore | Param

@@ -105,13 +105,13 @@ module Python
    comprehension = (expr target, expr iter, expr* ifs, int is_async)

    excepthandler = ExceptHandler(expr? type, identifier? name, stmt* body)
-                    attributes (int lineno, int col_offset)
+                    attributes (int lineno, int col_offset, int? end_lineno, int? end_col_offset)

    arguments = (arg* args, arg? vararg, arg* kwonlyargs, expr* kw_defaults,
                 arg? kwarg, expr* defaults)

    arg = (identifier arg, expr? annotation)
-           attributes (int lineno, int col_offset)
+           attributes (int lineno, int col_offset, int? end_lineno, int? end_col_offset)

    -- keyword arguments supplied to call (NULL identifier for **kwargs)
    keyword = (identifier? arg, expr value)
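
The `int?` marker makes the end positions optional, which matches the documentation note that the compiler does not require them. A minimal sketch, assuming Python 3.8 or later: a node built by hand with only `lineno` and `col_offset` still compiles, and `get_source_segment()` reports that no segment is available.

```python
import ast

# Only the required location attributes are supplied.
const = ast.Constant(value=42, lineno=1, col_offset=0)
tree = ast.Expression(body=const)

# The compiler accepts the tree without end positions.
result = eval(compile(tree, "<ast>", "eval"))
print(result)  # 42

# With no end position, there is no source segment to return.
segment = ast.get_source_segment("42", const)
print(segment)  # None
```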

--- a/Parser/asdl_c.py
+++ b/Parser/asdl_c.py
@@ -1250,10 +1250,12 @@ def main(srcfile, dump_module=False):
            f.write('#undef Yield   /* undefine macro conflicting with <winbase.h> */\n')
            f.write('\n')
            c = ChainOfVisitors(TypeDefVisitor(f),
-                                StructVisitor(f),
-                                PrototypeVisitor(f),
-                                )
+                                StructVisitor(f))
+
            c.visit(mod)
+            f.write("// Note: these macros affect function definitions, not only call sites.\n")
+            PrototypeVisitor(f).visit(mod)
+            f.write("\n")
            f.write("PyObject* PyAST_mod2obj(mod_ty t);\n")
            f.write("mod_ty PyAST_obj2mod(PyObject* ast, PyArena* arena, int mode);\n")
            f.write("int PyAST_Check(PyObject* obj);\n")

--- a/Parser/node.c
+++ b/Parser/node.c
@@ -13,6 +13,8 @@ PyNode_New(int type)
    n->n_type = type;
    n->n_str = NULL;
    n->n_lineno = 0;
+    n->n_end_lineno = 0;
+    n->n_end_col_offset = -1;
    n->n_nchildren = 0;
    n->n_child = NULL;
    return n;
@@ -75,14 +77,34 @@ fancy_roundup(int n)
               fancy_roundup(n))


+void
+_PyNode_FinalizeEndPos(node *n)
+{
+    int nch = NCH(n);
+    node *last;
+    if (nch == 0) {
+        return;
+    }
+    last = CHILD(n, nch - 1);
+    _PyNode_FinalizeEndPos(last);
+    n->n_end_lineno = last->n_end_lineno;
+    n->n_end_col_offset = last->n_end_col_offset;
+}
+
 int
-PyNode_AddChild(node *n1, int type, char *str, int lineno, int col_offset)
+PyNode_AddChild(node *n1, int type, char *str, int lineno, int col_offset,
+                int end_lineno, int end_col_offset)
 {
    const int nch = n1->n_nchildren;
    int current_capacity;
    int required_capacity;
    node *n;

+    // finalize end position of previous node (if any)
+    if (nch > 0) {
+        _PyNode_FinalizeEndPos(CHILD(n1, nch - 1));
+    }
+
    if (nch == INT_MAX || nch < 0)
        return E_OVERFLOW;

@@ -107,6 +129,8 @@ PyNode_AddChild(node *n1, int type, char *str, int lineno, int col_offset)
    n->n_str = str;
    n->n_lineno = lineno;
    n->n_col_offset = col_offset;
+    n->n_end_lineno = end_lineno;  // this and below will be updated after all children are added.
+    n->n_end_col_offset = end_col_offset;
    n->n_nchildren = 0;
    n->n_child = NULL;
    return 0;
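
The finalization step can be modeled in a few lines of Python. This is only a simplified sketch of the idea in `_PyNode_FinalizeEndPos` and `PyNode_AddChild`: the `Node` class and the function names below are hypothetical stand-ins, not CPython APIs, and capacity handling and error codes are left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for the C `node` struct."""
    name: str
    end_lineno: int = 0
    end_col_offset: int = -1
    children: List["Node"] = field(default_factory=list)


def finalize_end_pos(n: Node) -> None:
    """A non-leaf node ends where its last child ends."""
    if not n.children:
        return  # leaf: the tokenizer already supplied the end position
    last = n.children[-1]
    finalize_end_pos(last)
    n.end_lineno = last.end_lineno
    n.end_col_offset = last.end_col_offset


def add_child(parent: Node, child: Node) -> None:
    """Finalize the previous sibling, then append the new child."""
    if parent.children:
        finalize_end_pos(parent.children[-1])
    parent.children.append(child)


# stmt -> NAME, expr -> NUMBER
root = Node("stmt")
add_child(root, Node("NAME", end_lineno=1, end_col_offset=1))
expr = Node("expr")
add_child(root, expr)
add_child(expr, Node("NUMBER", end_lineno=1, end_col_offset=5))

finalize_end_pos(root)  # mirrors the final call added in parsetok.c
print(root.end_lineno, root.end_col_offset)  # 1 5
```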

--- a/Parser/parser.c
+++ b/Parser/parser.c
@@ -105,11 +105,13 @@ PyParser_Delete(parser_state *ps)
 /* PARSER STACK OPERATIONS */

 static int
-shift(stack *s, int type, char *str, int newstate, int lineno, int col_offset)
+shift(stack *s, int type, char *str, int newstate, int lineno, int col_offset,
+      int end_lineno, int end_col_offset)
 {
    int err;
    assert(!s_empty(s));
-    err = PyNode_AddChild(s->s_top->s_parent, type, str, lineno, col_offset);
+    err = PyNode_AddChild(s->s_top->s_parent, type, str, lineno, col_offset,
+                          end_lineno, end_col_offset);
    if (err)
        return err;
    s->s_top->s_state = newstate;
@@ -117,13 +119,15 @@ shift(stack *s, int type, char *str, int newstate, int lineno, int col_offset)
 }

 static int
-push(stack *s, int type, dfa *d, int newstate, int lineno, int col_offset)
+push(stack *s, int type, dfa *d, int newstate, int lineno, int col_offset,
+     int end_lineno, int end_col_offset)
 {
    int err;
    node *n;
    n = s->s_top->s_parent;
    assert(!s_empty(s));
-    err = PyNode_AddChild(n, type, (char *)NULL, lineno, col_offset);
+    err = PyNode_AddChild(n, type, (char *)NULL, lineno, col_offset,
+                          end_lineno, end_col_offset);
    if (err)
        return err;
    s->s_top->s_state = newstate;
@@ -225,7 +229,9 @@ future_hack(parser_state *ps)

 int
 PyParser_AddToken(parser_state *ps, int type, char *str,
-                  int lineno, int col_offset, int *expected_ret)
+                  int lineno, int col_offset,
+                  int end_lineno, int end_col_offset,
+                  int *expected_ret)
 {
    int ilabel;
    int err;
@@ -257,7 +263,8 @@ PyParser_AddToken(parser_state *ps, int type, char *str,
                    dfa *d1 = PyGrammar_FindDFA(
                        ps->p_grammar, nt);
                    if ((err = push(&ps->p_stack, nt, d1,
-                        arrow, lineno, col_offset)) > 0) {
+                        arrow, lineno, col_offset,
+                        end_lineno, end_col_offset)) > 0) {
                        D(printf(" MemError: push\n"));
                        return err;
                    }
@@ -267,7 +274,8 @@ PyParser_AddToken(parser_state *ps, int type, char *str,

                /* Shift the token */
                if ((err = shift(&ps->p_stack, type, str,
-                                x, lineno, col_offset)) > 0) {
+                                x, lineno, col_offset,
+                                end_lineno, end_col_offset)) > 0) {
                    D(printf(" MemError: shift.\n"));
                    return err;
                }

--- a/Parser/parser.h
+++ b/Parser/parser.h
@@ -32,7 +32,9 @@ typedef struct {

 parser_state *PyParser_New(grammar *g, int start);
 void PyParser_Delete(parser_state *ps);
-int PyParser_AddToken(parser_state *ps, int type, char *str, int lineno, int col_offset,
+int PyParser_AddToken(parser_state *ps, int type, char *str,
+                      int lineno, int col_offset,
+                      int end_lineno, int end_col_offset,
                      int *expected_ret);
 void PyGrammar_AddAccelerators(grammar *g);


--- a/Parser/parsetok.c
+++ b/Parser/parsetok.c
@@ -187,7 +187,7 @@ parsetok(struct tok_state *tok, grammar *g, int start, perrdetail *err_ret,
    parser_state *ps;
    node *n;
    int started = 0;
-    int col_offset;
+    int col_offset, end_col_offset;

    if ((ps = PyParser_New(g, start)) == NULL) {
        err_ret->error = E_NOMEM;
@@ -270,9 +270,16 @@ parsetok(struct tok_state *tok, grammar *g, int start, perrdetail *err_ret,
            col_offset = -1;
        }

+        if (b != NULL && b >= tok->line_start) {
+            end_col_offset = Py_SAFE_DOWNCAST(b - tok->line_start,
+                                              intptr_t, int);
+        }
+        else {
+            end_col_offset = -1;
+        }
        if ((err_ret->error =
             PyParser_AddToken(ps, (int)type, str,
-                               lineno, col_offset,
+                               lineno, col_offset, tok->lineno, end_col_offset,
                               &(err_ret->expected))) != E_OK) {
            if (err_ret->error != E_DONE) {
                PyObject_FREE(str);
@@ -368,6 +375,9 @@ parsetok(struct tok_state *tok, grammar *g, int start, perrdetail *err_ret,
 done:
    PyTokenizer_Free(tok);

+    if (n != NULL) {
+        _PyNode_FinalizeEndPos(n);
+    }
    return n;
 }


--- a/Python/Python-ast.c
+++ b/Python/Python-ast.c
--- a/Python/ast.c
+++ b/Python/ast.c
--- a/Python/ast_opt.c
+++ b/Python/ast_opt.c
@@ -439,7 +439,8 @@ astfold_body(asdl_seq *stmts, PyArena *ctx_, int optimize_)
            return 0;
        }
        asdl_seq_SET(values, 0, st->v.Expr.value);
-        expr_ty expr = JoinedStr(values, st->lineno, st->col_offset, ctx_);
+        expr_ty expr = JoinedStr(values, st->lineno, st->col_offset,
+                                 st->end_lineno, st->end_col_offset, ctx_);
        if (!expr) {
            return 0;
        }

--- a/Python/compile.c
+++ b/Python/compile.c
@@ -4757,7 +4757,8 @@ compiler_augassign(struct compiler *c, stmt_ty s)
    switch (e->kind) {
    case Attribute_kind:
        auge = Attribute(e->v.Attribute.value, e->v.Attribute.attr,
-                         AugLoad, e->lineno, e->col_offset, c->c_arena);
+                         AugLoad, e->lineno, e->col_offset,
+                         e->end_lineno, e->end_col_offset, c->c_arena);
        if (auge == NULL)
            return 0;
        VISIT(c, expr, auge);
@@ -4768,7 +4769,8 @@ compiler_augassign(struct compiler *c, stmt_ty s)
        break;
    case Subscript_kind:
        auge = Subscript(e->v.Subscript.value, e->v.Subscript.slice,
-                         AugLoad, e->lineno, e->col_offset, c->c_arena);
+                         AugLoad, e->lineno, e->col_offset,
+                         e->end_lineno, e->end_col_offset, c->c_arena);
        if (auge == NULL)
            return 0;
        VISIT(c, expr, auge);