Commit 2578ac54 authored by Josh Bleecher Snyder

cmd/compile: move argument stack construction to SSA generation

The goal of this change is to move work from walk to SSA,
and simplify things along the way.

This is hard to accomplish cleanly with small incremental changes,
so this large commit message aims to provide a roadmap to the diff.

High level description:

Prior to this change, walk was responsible for constructing (most of) the stack for function calls.
ascompatte gathered variadic arguments into a slice.
It also rewrote n.List from a list of arguments to a list of assignments to stack slots.
ascompatte was called multiple times to handle the receiver in a method call.
reorder1 then introduced temporaries into n.List as needed to avoid smashing the stack.
adjustargs then made extra stack space for go/defer args as needed.
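
As a concrete sketch (illustrative pseudocode; the offsets are invented, not actual compiler output), for a call f(x, g()) walk would leave n.List looking roughly like:

	tmp := g()        // temporary introduced by reorder1
	*(SP+8) = x       // assignment to an OINDREGSP stack slot
	*(SP+16) = tmp    // assignment to an OINDREGSP stack slot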

Node to SSA construction evaluated all the statements in n.List,
and issued the function call, assuming that the stack was correctly constructed.
Intrinsic calls had to dig around inside n.List to extract the arguments,
since intrinsics don't use the stack to make function calls.

This change moves stack construction to the SSA construction phase.
ascompatte, now called walkCall, does all the work that ascompatte and reorder1 did.
It handles variadic arguments, inserts the method receiver if needed, and allocates temporaries.
It does not, however, make any assignments to stack slots.
Instead, it moves the function arguments to n.Rlist, leaving assignments to temporaries in n.List.
(It would be better to use Ninit instead of List; future work.)
During SSA construction, after doing all the temporary assignments in n.List,
the function arguments are assigned to stack slots by
constructing the appropriate SSA Value, using (*state).storeArg.
SSA construction also now handles adjustments for go/defer args.
This change also simplifies intrinsic calls, since we no longer need to undo walk's work.
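
Continuing the sketch above, after this change walk leaves the same call roughly as:

	n.List:  tmp := g()   // temporary assignments only
	n.Rlist: x, tmp       // arguments; stored to stack slots during SSA construction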

Along the way, we simplify nodarg by pushing the fp==1 case to its callers, where it fits nicely.

Generated code differences:

There were a few optimizations applied along the way in the old implementation.
f(g()) was rewritten to do a block copy of function results to function arguments.
And reorder1 avoided introducing the final "save the stack" temporary in n.List.

The f(g()) block copy optimization never actually triggered; the order pass rewrote away g(), so the optimization has been removed.

SSA optimizations mostly obviated the need for reorder1's optimization of avoiding the final temporary.
The exception was when the temporary's type was not SSA-able;
in that case, we got a Move into an autotmp and then an immediate Move onto the stack,
with the autotmp never read or used again.
This change introduces a new rewrite rule to detect such pointless double Moves
and collapse them into a single Move.
This is actually more powerful than the original optimization,
since the original optimization relied on the imprecise Node.HasCall calculation.
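
As a hedged illustration (names invented; a struct with more than four fields is not SSA-able), the double Move arises for code like:

	type big struct{ a, b, c, d, e int64 }
	var sink big
	func g() big  { return big{} }
	func f(b big) { sink = b }
	func h()      { f(g()) } // g's result was Moved to an autotmp, then Moved to f's arg slot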

The other significant difference in the generated code is that the stack is now constructed
completely in SP-offset order. Prior to this change, the stack was constructed somewhat
haphazardly: first the final argument that Node.HasCall deemed to require a temporary,
then other arguments, then the method receiver, then the defer/go args.
SP-offset is probably a good default order. See future work.
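
For example, for go f(x) on a platform with a zero fixed frame size and 8-byte pointers (a sketch; the exact offsets are illustrative), the stack is now written in this order:

	0(SP)   argsize   // argument to newproc/deferproc
	8(SP)   closure   // f
	16(SP)  x         // f's own arguments follow, in increasing SP offset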

There are a few minor object file size changes as a result of this change.
I investigated some regressions in early versions of this change.

One regression (in archive/tar) was the addition of a single CMPQ instruction,
which would be eliminated were this TODO from flagalloc to be done:
	// TODO: Remove original instructions if they are never used.

One regression (in text/template) was an ADDQconstmodify that is now
a regular MOVQload+ADDQconst+MOVQstore, due to an unlucky change
in the order in which arguments are written. The argument write
order can also now be luckier, so this appears to be a wash.
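
For reference, both forms implement a read-modify-write; a hedged Go example (function name invented) that can compile to either on amd64, depending on the surrounding code:

	func bump(p *int64) { *p += 8 } // ADDQconstmodify, or MOVQload+ADDQconst+MOVQstore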

All in all, though there will be minor winners and losers,
this change appears to be performance neutral.

Future work:

Move loading the result of function calls to SSA construction; eliminate OINDREGSP.

Consider pushing stack construction deeper into SSA world, perhaps in an arch-specific pass.
Among other benefits, this would make it easier to transition to a new calling convention.
This would require rethinking the handling of stack conflicts and is non-trivial.

Figure out some clean way to indicate that stack construction Stores/Moves
do not alias each other, so that subsequent passes may do things like
CSE+tighten shared stack setup, do DSE using non-first Stores, etc.
This would allow us to eliminate the minor text/template regression.

Possibly stop treating assignments to stack slots as statements in DWARF debug info.

Compiler benchmarks:

name        old time/op       new time/op       delta
Template          182ms ± 2%        179ms ± 2%  -1.69%  (p=0.000 n=47+48)
Unicode          86.3ms ± 5%       85.1ms ± 4%  -1.36%  (p=0.001 n=50+50)
GoTypes           646ms ± 1%        642ms ± 1%  -0.63%  (p=0.000 n=49+48)
Compiler          2.89s ± 1%        2.86s ± 2%  -1.36%  (p=0.000 n=48+50)
SSA               8.47s ± 1%        8.37s ± 2%  -1.22%  (p=0.000 n=47+50)
Flate             122ms ± 2%        121ms ± 2%  -0.66%  (p=0.000 n=47+45)
GoParser          147ms ± 2%        146ms ± 2%  -0.53%  (p=0.006 n=46+49)
Reflect           406ms ± 2%        403ms ± 2%  -0.76%  (p=0.000 n=48+43)
Tar               162ms ± 3%        162ms ± 4%    ~     (p=0.191 n=46+50)
XML               223ms ± 2%        222ms ± 2%  -0.37%  (p=0.031 n=45+49)
[Geo mean]        382ms             378ms       -0.89%

name        old user-time/op  new user-time/op  delta
Template          219ms ± 3%        216ms ± 3%  -1.56%  (p=0.000 n=50+48)
Unicode           109ms ± 6%        109ms ± 5%    ~     (p=0.190 n=50+49)
GoTypes           836ms ± 2%        828ms ± 2%  -0.96%  (p=0.000 n=49+48)
Compiler          3.87s ± 2%        3.80s ± 1%  -1.81%  (p=0.000 n=49+46)
SSA               12.0s ± 1%        11.8s ± 1%  -2.01%  (p=0.000 n=48+50)
Flate             142ms ± 3%        141ms ± 3%  -0.85%  (p=0.003 n=50+48)
GoParser          178ms ± 4%        175ms ± 4%  -1.66%  (p=0.000 n=48+46)
Reflect           520ms ± 2%        512ms ± 2%  -1.44%  (p=0.000 n=45+48)
Tar               200ms ± 3%        198ms ± 4%  -0.61%  (p=0.037 n=47+50)
XML               277ms ± 3%        275ms ± 3%  -0.85%  (p=0.000 n=49+48)
[Geo mean]        482ms             476ms       -1.23%

name        old alloc/op      new alloc/op      delta
Template         36.1MB ± 0%       35.3MB ± 0%  -2.18%  (p=0.008 n=5+5)
Unicode          29.8MB ± 0%       29.3MB ± 0%  -1.58%  (p=0.008 n=5+5)
GoTypes           125MB ± 0%        123MB ± 0%  -2.13%  (p=0.008 n=5+5)
Compiler          531MB ± 0%        513MB ± 0%  -3.40%  (p=0.008 n=5+5)
SSA              2.00GB ± 0%       1.93GB ± 0%  -3.34%  (p=0.008 n=5+5)
Flate            24.5MB ± 0%       24.3MB ± 0%  -1.18%  (p=0.008 n=5+5)
GoParser         29.4MB ± 0%       28.7MB ± 0%  -2.34%  (p=0.008 n=5+5)
Reflect          87.1MB ± 0%       86.0MB ± 0%  -1.33%  (p=0.008 n=5+5)
Tar              35.3MB ± 0%       34.8MB ± 0%  -1.44%  (p=0.008 n=5+5)
XML              47.9MB ± 0%       47.1MB ± 0%  -1.86%  (p=0.008 n=5+5)
[Geo mean]       82.8MB            81.1MB       -2.08%

name        old allocs/op     new allocs/op     delta
Template           352k ± 0%         347k ± 0%  -1.32%  (p=0.008 n=5+5)
Unicode            342k ± 0%         339k ± 0%  -0.66%  (p=0.008 n=5+5)
GoTypes           1.29M ± 0%        1.27M ± 0%  -1.30%  (p=0.008 n=5+5)
Compiler          4.98M ± 0%        4.87M ± 0%  -2.14%  (p=0.008 n=5+5)
SSA               15.7M ± 0%        15.2M ± 0%  -2.86%  (p=0.008 n=5+5)
Flate              233k ± 0%         231k ± 0%  -0.83%  (p=0.008 n=5+5)
GoParser           296k ± 0%         291k ± 0%  -1.54%  (p=0.016 n=5+4)
Reflect           1.05M ± 0%        1.04M ± 0%  -0.65%  (p=0.008 n=5+5)
Tar                343k ± 0%         339k ± 0%  -0.97%  (p=0.008 n=5+5)
XML                432k ± 0%         426k ± 0%  -1.19%  (p=0.008 n=5+5)
[Geo mean]         815k              804k       -1.35%

name        old object-bytes  new object-bytes  delta
Template          505kB ± 0%        505kB ± 0%  -0.01%  (p=0.008 n=5+5)
Unicode           224kB ± 0%        224kB ± 0%    ~     (all equal)
GoTypes          1.82MB ± 0%       1.83MB ± 0%  +0.06%  (p=0.008 n=5+5)
Flate             324kB ± 0%        324kB ± 0%  +0.00%  (p=0.008 n=5+5)
GoParser          402kB ± 0%        402kB ± 0%  +0.04%  (p=0.008 n=5+5)
Reflect          1.39MB ± 0%       1.39MB ± 0%  -0.01%  (p=0.008 n=5+5)
Tar               449kB ± 0%        449kB ± 0%  -0.02%  (p=0.008 n=5+5)
XML               598kB ± 0%        597kB ± 0%  -0.05%  (p=0.008 n=5+5)

Change-Id: Ifc9d5c1bd01f90171414b8fb18ffe2290d271143
Reviewed-on: https://go-review.googlesource.com/c/114797
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
parent 6c631ae2
@@ -3558,59 +3558,34 @@ func (s *state) intrinsicCall(n *Node) *ssa.Value {
return v
}
type callArg struct {
offset int64
v *ssa.Value
}
type byOffset []callArg
func (x byOffset) Len() int { return len(x) }
func (x byOffset) Swap(i, j int) { x[i], x[j] = x[j], x[i] }
func (x byOffset) Less(i, j int) bool {
return x[i].offset < x[j].offset
}
// intrinsicArgs extracts args from n, evaluates them to SSA values, and returns them.
func (s *state) intrinsicArgs(n *Node) []*ssa.Value {
// This code is complicated because of how walk transforms calls. For a call node,
// each entry in n.List is either an assignment to OINDREGSP which actually
// stores an arg, or an assignment to a temporary which computes an arg
// which is later assigned.
// The args can also be out of order.
// TODO: when walk goes away someday, this code can go away also.
var args []callArg
// Construct map of temps; see comments in s.call about the structure of n.
temps := map[*Node]*ssa.Value{}
for _, a := range n.List.Slice() {
if a.Op != OAS {
s.Fatalf("non-assignment as a function argument %v", a.Op)
s.Fatalf("non-assignment as a temp function argument %v", a.Op)
}
l, r := a.Left, a.Right
switch l.Op {
case ONAME:
if l.Op != ONAME {
s.Fatalf("non-ONAME temp function argument %v", a.Op)
}
// Evaluate and store to "temporary".
// Walk ensures these temporaries are dead outside of n.
temps[l] = s.expr(r)
case OINDREGSP:
}
args := make([]*ssa.Value, n.Rlist.Len())
for i, n := range n.Rlist.Slice() {
// Store a value to an argument slot.
var v *ssa.Value
if x, ok := temps[r]; ok {
if x, ok := temps[n]; ok {
// This is a previously computed temporary.
v = x
} else {
// This is an explicit value; evaluate it.
v = s.expr(r)
}
args = append(args, callArg{l.Xoffset, v})
default:
s.Fatalf("function argument assignment target not allowed: %v", l.Op)
}
args[i] = x
continue
}
sort.Sort(byOffset(args))
res := make([]*ssa.Value, len(args))
for i, a := range args {
res[i] = a.v
// This is an explicit value; evaluate it.
args[i] = s.expr(n)
}
return res
return args
}
// Calls the function n using the specified call type.
@@ -3651,7 +3626,7 @@ func (s *state) call(n *Node, k callKind) *ssa.Value {
n2.Pos = fn.Pos
n2.Type = types.Types[TUINT8] // dummy type for a static closure. Could use runtime.funcval if we had it.
closure = s.expr(n2)
// Note: receiver is already assigned in n.List, so we don't
// Note: receiver is already present in n.Rlist, so we don't
// want to set it here.
case OCALLINTER:
if fn.Op != ODOTINTER {
@@ -3672,32 +3647,43 @@ func (s *state) call(n *Node, k callKind) *ssa.Value {
dowidth(fn.Type)
stksize := fn.Type.ArgWidth() // includes receiver
// Run all argument assignments. The arg slots have already
// been offset by the appropriate amount (+2*widthptr for go/defer,
// +widthptr for interface calls).
// For OCALLMETH, the receiver is set in these statements.
// Run all assignments of temps.
// The temps are introduced to avoid overwriting argument
// slots when arguments themselves require function calls.
s.stmtList(n.List)
// Set receiver (for interface calls)
if rcvr != nil {
// Store arguments to stack, including defer/go arguments and receiver for method calls.
// These are written in SP-offset order.
argStart := Ctxt.FixedFrameSize()
if k != callNormal {
argStart += int64(2 * Widthptr)
}
addr := s.constOffPtrSP(s.f.Config.Types.UintptrPtr, argStart)
s.store(types.Types[TUINTPTR], addr, rcvr)
}
// Defer/go args
// Defer/go args.
if k != callNormal {
// Write argsize and closure (args to newproc/deferproc).
argStart := Ctxt.FixedFrameSize()
argsize := s.constInt32(types.Types[TUINT32], int32(stksize))
addr := s.constOffPtrSP(s.f.Config.Types.UInt32Ptr, argStart)
s.store(types.Types[TUINT32], addr, argsize)
addr = s.constOffPtrSP(s.f.Config.Types.UintptrPtr, argStart+int64(Widthptr))
s.store(types.Types[TUINTPTR], addr, closure)
stksize += 2 * int64(Widthptr)
argStart += 2 * int64(Widthptr)
}
// Set receiver (for interface calls).
if rcvr != nil {
addr := s.constOffPtrSP(s.f.Config.Types.UintptrPtr, argStart)
s.store(types.Types[TUINTPTR], addr, rcvr)
}
// Write args.
t := n.Left.Type
args := n.Rlist.Slice()
if n.Op == OCALLMETH {
f := t.Recv()
s.storeArg(args[0], f.Type, argStart+f.Offset)
args = args[1:]
}
for i, n := range args {
f := t.Params().Field(i)
s.storeArg(n, f.Type, argStart+f.Offset)
}
// call target
@@ -4182,6 +4168,20 @@ func (s *state) storeTypePtrs(t *types.Type, left, right *ssa.Value) {
}
}
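// storeArg stores the argument n of type t into its stack slot at offset off,
// using a Move for non-SSA-able types and a regular typed store otherwise.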
func (s *state) storeArg(n *Node, t *types.Type, off int64) {
pt := types.NewPtr(t)
sp := s.constOffPtrSP(pt, off)
if !canSSAType(t) {
a := s.addr(n, false)
s.move(t, sp, a)
return
}
a := s.expr(n)
s.storeType(t, sp, a, 0, false)
}
// slice computes the slice v[i:j:k] and returns ptr, len, and cap of result.
// i,j,k may be nil, in which case they are set to their default value.
// t is a slice, ptr to array, or string type.
......
@@ -603,9 +603,17 @@ const (
OAS2DOTTYPE // List = Rlist (x, ok = I.(int))
OASOP // Left Etype= Right (x += y)
OCALL // Left(List) (function call, method call or type conversion)
OCALLFUNC // Left(List) (function call f(args))
OCALLMETH // Left(List) (direct method call x.Method(args))
OCALLINTER // Left(List) (interface method call x.Method(args))
// OCALLFUNC, OCALLMETH, and OCALLINTER have the same structure.
// Prior to walk, they are: Left(List), where List is all regular arguments.
// If present, Right is an ODDDARG that holds the
// generated slice used in a call to a variadic function.
// After walk, List is a series of assignments to temporaries,
// and Rlist is an updated set of arguments, including any ODDDARG slice.
// TODO(josharian/khr): Use Ninit instead of List for the assignments to temporaries. See CL 114797.
OCALLFUNC // Left(List/Rlist) (function call f(args))
OCALLMETH // Left(List/Rlist) (direct method call x.Method(args))
OCALLINTER // Left(List/Rlist) (interface method call x.Method(args))
OCALLPART // Left.Right (method expression x.Method, not called)
OCAP // cap(Left)
OCLOSE // close(Left)
......
@@ -109,32 +109,6 @@ func paramoutheap(fn *Node) bool {
return false
}
// adds "adjust" to all the argument locations for the call n.
// n must be a defer or go node that has already been walked.
func adjustargs(n *Node, adjust int) {
callfunc := n.Left
for _, arg := range callfunc.List.Slice() {
if arg.Op != OAS {
Fatalf("call arg not assignment")
}
lhs := arg.Left
if lhs.Op == ONAME {
// This is a temporary introduced by reorder1.
// The real store to the stack appears later in the arg list.
continue
}
if lhs.Op != OINDREGSP {
Fatalf("call argument store does not use OINDREGSP")
}
// can't really check this in machine-indep code.
//if(lhs->val.u.reg != D_SP)
// Fatalf("call arg assign not indreg(SP)")
lhs.Xoffset += int64(adjust)
}
}
// The result of walkstmt MUST be assigned back to n, e.g.
// n.Left = walkstmt(n.Left)
func walkstmt(n *Node) *Node {
@@ -264,9 +238,6 @@ func walkstmt(n *Node) *Node {
n.Left = walkexpr(n.Left, &n.Ninit)
}
// make room for size & fn arguments.
adjustargs(n, 2*Widthptr)
case OFOR, OFORUNTIL:
if n.Left != nil {
walkstmtlist(n.Left.Ninit.Slice())
@@ -334,8 +305,19 @@
}
walkexprlist(n.List.Slice(), &n.Ninit)
ll := ascompatte(nil, false, Curfn.Type.Results(), n.List.Slice(), 1, &n.Ninit)
n.List.Set(ll)
// For each return parameter (lhs), assign the corresponding result (rhs).
lhs := Curfn.Type.Results()
rhs := n.List.Slice()
res := make([]*Node, lhs.NumFields())
for i, nl := range lhs.FieldSlice() {
nname := asNode(nl.Nname)
if nname.isParamHeapCopy() {
nname = nname.Name.Param.Stackcopy
}
a := nod(OAS, nname, rhs[i])
res[i] = convas(a, &n.Ninit)
}
n.List.Set(res)
case ORETJMP:
break
@@ -612,19 +594,12 @@ opswitch:
case OCLOSUREVAR, OCFUNC:
n.SetAddable(true)
case OCALLINTER:
case OCALLINTER, OCALLFUNC, OCALLMETH:
if n.Op == OCALLINTER {
usemethod(n)
t := n.Left.Type
if n.List.Len() != 0 && n.List.First().Op == OAS {
break
}
n.Left = walkexpr(n.Left, init)
walkexprlist(n.List.Slice(), init)
ll := ascompatte(n, n.Isddd(), t.Params(), n.List.Slice(), 0, init)
n.List.Set(reorder1(ll))
case OCALLFUNC:
if n.Left.Op == OCLOSURE {
if n.Op == OCALLFUNC && n.Left.Op == OCLOSURE {
// Transform direct call of a closure to call of a normal function.
// transformclosure already did all preparation work.
@@ -645,30 +620,7 @@
}
}
t := n.Left.Type
if n.List.Len() != 0 && n.List.First().Op == OAS {
break
}
n.Left = walkexpr(n.Left, init)
walkexprlist(n.List.Slice(), init)
ll := ascompatte(n, n.Isddd(), t.Params(), n.List.Slice(), 0, init)
n.List.Set(reorder1(ll))
case OCALLMETH:
t := n.Left.Type
if n.List.Len() != 0 && n.List.First().Op == OAS {
break
}
n.Left = walkexpr(n.Left, init)
walkexprlist(n.List.Slice(), init)
ll := ascompatte(n, false, t.Recvs(), []*Node{n.Left.Left}, 0, init)
lr := ascompatte(n, n.Isddd(), t.Params(), n.List.Slice(), 0, init)
ll = append(ll, lr...)
n.Left.Left = nil
updateHasCall(n.Left)
n.List.Set(reorder1(ll))
walkCall(n, init)
case OAS, OASOP:
init.AppendNodes(&n.Ninit)
@@ -1714,7 +1666,7 @@ func ascompatet(nl Nodes, nr *types.Type) []*Node {
l = tmp
}
a := nod(OAS, l, nodarg(r, 0))
a := nod(OAS, l, nodarg(r))
a = convas(a, &nn)
updateHasCall(a)
if a.HasCall() {
@@ -1727,99 +1679,23 @@ func ascompatet(nl Nodes, nr *types.Type) []*Node {
return append(nn.Slice(), mm.Slice()...)
}
// nodarg returns a Node for the function argument denoted by t,
// which is either the entire function argument or result struct (t is a struct *types.Type)
// or a specific argument (t is a *types.Field within a struct *types.Type).
// nodarg returns a Node for the function argument f.
// f is a *types.Field within a struct *types.Type.
//
// If fp is 0, the node is for use by a caller invoking the given
// The node is for use by a caller invoking the given
// function, preparing the arguments before the call
// or retrieving the results after the call.
// In this case, the node will correspond to an outgoing argument
// slot like 8(SP).
//
// If fp is 1, the node is for use by the function itself
// (the callee), to retrieve its arguments or write its results.
// In this case the node will be an ONAME with an appropriate
// type and offset.
func nodarg(t interface{}, fp int) *Node {
var n *Node
switch t := t.(type) {
default:
Fatalf("bad nodarg %T(%v)", t, t)
case *types.Type:
// Entire argument struct, not just one arg
if !t.IsFuncArgStruct() {
Fatalf("nodarg: bad type %v", t)
}
// Build fake variable name for whole arg struct.
n = newname(lookup(".args"))
n.Type = t
first := t.Field(0)
if first == nil {
Fatalf("nodarg: bad struct")
}
if first.Offset == BADWIDTH {
Fatalf("nodarg: offset not computed for %v", t)
}
n.Xoffset = first.Offset
case *types.Field:
if fp == 1 {
// NOTE(rsc): This should be using t.Nname directly,
// except in the case where t.Nname.Sym is the blank symbol and
// so the assignment would be discarded during code generation.
// In that case we need to make a new node, and there is no harm
// in optimization passes to doing so. But otherwise we should
// definitely be using the actual declaration and not a newly built node.
// The extra Fatalf checks here are verifying that this is the case,
// without changing the actual logic (at time of writing, it's getting
// toward time for the Go 1.7 beta).
// At some quieter time (assuming we've never seen these Fatalfs happen)
// we could change this code to use "expect" directly.
expect := asNode(t.Nname)
if expect.isParamHeapCopy() {
expect = expect.Name.Param.Stackcopy
}
for _, n := range Curfn.Func.Dcl {
if (n.Class() == PPARAM || n.Class() == PPARAMOUT) && !t.Sym.IsBlank() && n.Sym == t.Sym {
if n != expect {
Fatalf("nodarg: unexpected node: %v (%p %v) vs %v (%p %v)", n, n, n.Op, asNode(t.Nname), asNode(t.Nname), asNode(t.Nname).Op)
}
return n
}
}
if !expect.Sym.IsBlank() {
Fatalf("nodarg: did not find node in dcl list: %v", expect)
}
}
func nodarg(f *types.Field) *Node {
// Build fake name for individual variable.
// This is safe because if there was a real declared name
// we'd have used it above.
n = newname(lookup("__"))
n.Type = t.Type
if t.Offset == BADWIDTH {
Fatalf("nodarg: offset not computed for %v", t)
}
n.Xoffset = t.Offset
n.Orig = asNode(t.Nname)
}
// Rewrite argument named _ to __,
// or else the assignment to _ will be
// discarded during code generation.
if n.isBlank() {
n.Sym = lookup("__")
}
if fp != 0 {
Fatalf("bad fp: %v", fp)
n := newname(lookup("__"))
n.Type = f.Type
if f.Offset == BADWIDTH {
Fatalf("nodarg: offset not computed for %v", f)
}
n.Xoffset = f.Offset
n.Orig = asNode(f.Nname)
// preparing arguments for call
n.Op = OINDREGSP
@@ -1856,59 +1732,58 @@ func mkdotargslice(typ *types.Type, args []*Node, init *Nodes, ddd *Node) *Node
return n
}
// check assign expression list to
// a type list. called in
// return expr-list
// func(expr-list)
func ascompatte(call *Node, isddd bool, lhs *types.Type, rhs []*Node, fp int, init *Nodes) []*Node {
// f(g()) where g has multiple return values
if len(rhs) == 1 && rhs[0].Type.IsFuncArgStruct() {
// optimization - can do block copy
if eqtypenoname(rhs[0].Type, lhs) {
nl := nodarg(lhs, fp)
nr := convnop(rhs[0], nl.Type)
n := convas(nod(OAS, nl, nr), init)
n.SetTypecheck(1)
return []*Node{n}
}
// conversions involved.
// copy into temporaries.
var tmps []*Node
for _, nr := range rhs[0].Type.FieldSlice() {
tmps = append(tmps, temp(nr.Type))
}
a := nod(OAS2, nil, nil)
a.List.Set(tmps)
a.Rlist.Set(rhs)
a = typecheck(a, Etop)
a = walkstmt(a)
init.Append(a)
rhs = tmps
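// walkCall prepares the arguments of call node n: assignments to temporaries
// are left in n.List, and the final arguments (including any receiver and
// ODDDARG slice) are left in n.Rlist.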
func walkCall(n *Node, init *Nodes) {
if n.Rlist.Len() != 0 {
return // already walked
}
n.Left = walkexpr(n.Left, init)
walkexprlist(n.List.Slice(), init)
// For each parameter (LHS), assign its corresponding argument (RHS).
params := n.Left.Type.Params()
args := n.List.Slice()
// If there's a ... parameter (which is only valid as the final
// parameter) and this is not a ... call expression,
// then assign the remaining arguments as a slice.
var nn []*Node
for i, nl := range lhs.FieldSlice() {
var nr *Node
if nl.Isddd() && !isddd {
nr = mkdotargslice(nl.Type, rhs[i:], init, call.Right)
} else {
nr = rhs[i]
if nf := params.NumFields(); nf > 0 {
if last := params.Field(nf - 1); last.Isddd() && !n.Isddd() {
tail := args[nf-1:]
slice := mkdotargslice(last.Type, tail, init, n.Right)
// Allow immediate GC.
for i := range tail {
tail[i] = nil
}
args = append(args[:nf-1], slice)
}
}
a := nod(OAS, nodarg(nl, fp), nr)
a = convas(a, init)
a.SetTypecheck(1)
nn = append(nn, a)
// If this is a method call, add the receiver at the beginning of the args.
if n.Op == OCALLMETH {
withRecv := make([]*Node, len(args)+1)
withRecv[0] = n.Left.Left
n.Left.Left = nil
copy(withRecv[1:], args)
args = withRecv
}
// For any argument whose evaluation might require a function call,
// store that argument into a temporary variable,
// to prevent those calls from clobbering arguments already on the stack.
// When instrumenting, all arguments might require function calls.
var tempAssigns []*Node
for i, arg := range args {
updateHasCall(arg)
if instrumenting || arg.HasCall() {
// make assignment of fncall to tempname
tmp := temp(arg.Type)
a := nod(OAS, tmp, arg)
tempAssigns = append(tempAssigns, a)
// replace arg with temp
args[i] = tmp
}
}
return nn
n.List.Set(tempAssigns)
n.Rlist.Set(args)
}
// generate code for print
@@ -2111,71 +1986,6 @@ func convas(n *Node, init *Nodes) *Node {
return n
}
// from ascompat[te]
// evaluating actual function arguments.
// f(a,b)
// if there is exactly one function expr,
// then it is done first. otherwise must
// make temp variables
func reorder1(all []*Node) []*Node {
// When instrumenting, force all arguments into temporary
// variables to prevent instrumentation calls from clobbering
// arguments already on the stack.
funcCalls := 0
if !instrumenting {
if len(all) == 1 {
return all
}
for _, n := range all {
updateHasCall(n)
if n.HasCall() {
funcCalls++
}
}
if funcCalls == 0 {
return all
}
}
var g []*Node // fncalls assigned to tempnames
var f *Node // last fncall assigned to stack
var r []*Node // non fncalls and tempnames assigned to stack
d := 0
for _, n := range all {
if !instrumenting {
if !n.HasCall() {
r = append(r, n)
continue
}
d++
if d == funcCalls {
f = n
continue
}
}
// make assignment of fncall to tempname
a := temp(n.Right.Type)
a = nod(OAS, a, n.Right)
g = append(g, a)
// put normal arg assignment on list
// with fncall replaced by tempname
n.Right = a.Left
r = append(r, n)
}
if f != nil {
g = append(g, f)
}
return append(g, r...)
}
// from ascompat[ee]
// a,b = c,d
// simultaneous assignment. there cannot
@@ -2501,14 +2311,24 @@ func paramstoheap(params *types.Type) []*Node {
// The generated code is added to Curfn's Enter list.
func zeroResults() {
for _, f := range Curfn.Type.Results().Fields().Slice() {
if v := asNode(f.Nname); v != nil && v.Name.Param.Heapaddr != nil {
v := asNode(f.Nname)
if v != nil && v.Name.Param.Heapaddr != nil {
// The local which points to the return value is the
// thing that needs zeroing. This is already handled
// by a Needzero annotation in plive.go:livenessepilogue.
continue
}
if v.isParamHeapCopy() {
// TODO(josharian/khr): Investigate whether we can switch to "continue" here,
// and document more in either case.
// In the review of CL 114797, Keith wrote (roughly):
// I don't think the zeroing below matters.
// The stack return value will never be marked as live anywhere in the function.
// It is not written to until deferreturn returns.
v = v.Name.Param.Stackcopy
}
// Zero the stack location containing f.
Curfn.Func.Enter.Append(nodl(Curfn.Pos, OAS, nodarg(f, 1), nil))
Curfn.Func.Enter.Append(nodl(Curfn.Pos, OAS, v, nil))
}
}
......
@@ -178,6 +178,7 @@ type GCNode interface {
Typ() *types.Type
String() string
IsSynthetic() bool
IsAutoTmp() bool
StorageClass() StorageClass
}
......
@@ -86,6 +86,10 @@ func (d *DummyAuto) IsSynthetic() bool {
return false
}
func (d *DummyAuto) IsAutoTmp() bool {
return true
}
func (DummyFrontend) StringData(s string) interface{} {
return nil
}
......
@@ -1799,3 +1799,17 @@
(Zero {t1} [n] dst mem)))))
(StaticCall {sym} x) && needRaceCleanup(sym,v) -> x
// Collapse moving A -> B -> C into just A -> C.
// Later passes (deadstore, elim unread auto) will remove the A -> B move, if possible.
// This happens most commonly when B is an autotmp inserted earlier
// during compilation to ensure correctness.
(Move {t1} [s1] dst tmp1 midmem:(Move {t2} [s2] tmp2 src _))
&& s1 == s2
&& t1.(*types.Type).Compare(t2.(*types.Type)) == types.CMPeq
&& isSamePtr(tmp1, tmp2)
-> (Move {t1} [s1] dst src midmem)
// Elide self-moves. This only happens rarely (e.g test/fixedbugs/bug277.go).
// However, this rule is needed to prevent the previous rule from looping forever in such cases.
(Move dst src mem) && isSamePtr(dst, src) -> mem
@@ -17353,6 +17353,51 @@ func rewriteValuegeneric_OpMove_20(v *Value) bool {
v.AddArg(v1)
return true
}
// match: (Move {t1} [s1] dst tmp1 midmem:(Move {t2} [s2] tmp2 src _))
// cond: s1 == s2 && t1.(*types.Type).Compare(t2.(*types.Type)) == types.CMPeq && isSamePtr(tmp1, tmp2)
// result: (Move {t1} [s1] dst src midmem)
for {
s1 := v.AuxInt
t1 := v.Aux
_ = v.Args[2]
dst := v.Args[0]
tmp1 := v.Args[1]
midmem := v.Args[2]
if midmem.Op != OpMove {
break
}
s2 := midmem.AuxInt
t2 := midmem.Aux
_ = midmem.Args[2]
tmp2 := midmem.Args[0]
src := midmem.Args[1]
if !(s1 == s2 && t1.(*types.Type).Compare(t2.(*types.Type)) == types.CMPeq && isSamePtr(tmp1, tmp2)) {
break
}
v.reset(OpMove)
v.AuxInt = s1
v.Aux = t1
v.AddArg(dst)
v.AddArg(src)
v.AddArg(midmem)
return true
}
// match: (Move dst src mem)
// cond: isSamePtr(dst, src)
// result: mem
for {
_ = v.Args[2]
dst := v.Args[0]
src := v.Args[1]
mem := v.Args[2]
if !(isSamePtr(dst, src)) {
break
}
v.reset(OpCopy)
v.Type = mem.Type
v.AddArg(mem)
return true
}
return false
}
func rewriteValuegeneric_OpMul16_0(v *Value) bool {
......