• Russ Cox's avatar
    net/http: fix Transport.MaxConnsPerHost limits & idle pool races · fbaf881c
    Russ Cox authored
    There were at least three races in the implementation of the pool of
    idle HTTP connections before this CL.
    
    The first race is that HTTP/2 connections can be shared for many
    requests, but each requesting goroutine would take the connection out
    of the pool and then immediately return it before using it; this
    created unnecessary, tiny little race windows during which another
    goroutine might dial a second connection instead of reusing the first.
    This CL changes the idle pool to just leave the HTTP/2 connection in
    the pool permanently (until there is reason to close it), instead of
    doing the take-it-out-put-it-back dance race.
    
    The second race is that “is there an idle connection?” and
    “register to wait for an idle connection” were implemented as two
    separate steps, in different critical sections. So a client could end
    up registered to wait for an idle connection and be waiting or perhaps
    dialing, not having noticed the idle connection sitting in the pool
    that arrived between the two steps.
    
    The third race is that t.getIdleConnCh assumes that the inability to
    send on the channel means the client doesn't need the result, when it
    could mean that the client has not yet entered the select.
    That is, the main dial does:
    
    	idleConnCh := t.getIdleConnCh(cm)
    	select {
    	case v := <-dialc:
    		...
    	case pc := <-idleConnCh
    		...
    	...
    	}
    
    But then tryPutIdleConn does:
    
    	waitingDialer := t.idleConnCh[key] // what getIdleConnCh(cm) returned
    	select {
    	case waitingDialer <- pconn:
    		// We're done ...
    		return nil
    	default:
    		if waitingDialer != nil {
    			// They had populated this, but their dial won
    			// first, so we can clean up this map entry.
    			delete(t.idleConnCh, key)
    		}
    	}
    
    If the client has returned from getIdleConnCh but not yet reached the
    select, tryPutIdleConn will be unable to do the send, incorrectly
    conclude that the client does not care anymore, and put the connection
    in the idle pool instead, again leaving the client dialing unnecessarily
    while a connection sits in the idle pool.
    
    (It's also odd that the success case does not clean up the map entry,
    and also that the map has room for only a single waiting goroutine for
    a given host.)
    
    None of these races mattered too much before Go 1.11: at most they
    meant that connections were not reused quite as promptly as possible,
    or a few more than necessary would be created. But Go 1.11 added
    Transport.MaxConnsPerHost, which limited the number of connections
    created for a given host. The default is 0 (unlimited), but if a user
    did explicitly impose a low limit (2 is common), all these misplaced
    conns could easily add up to the entire limit, causing a deadlock.
    This was causing intermittent timeouts in TestTransportMaxConnsPerHost.
    
    The addition of the MaxConnsPerHost support added its own races.
    
    For example, here t.incHostConnCount could increment the count
    and return a channel ready for receiving, and then the client would
    not receive from it nor ever issue the decrement, because the select
    need not evaluate these two cases in order:
    
    	select {
    	case <-t.incHostConnCount(cmKey):
    		// count below conn per host limit; proceed
    	case pc := <-t.getIdleConnCh(cm):
    		if trace != nil && trace.GotConn != nil {
    			trace.GotConn(httptrace.GotConnInfo{Conn: pc.conn, Reused: pc.isReused()})
    		}
    		return pc, nil
    	...
    	}
    
    Obviously, unmatched increments are another way to get to a deadlock.
    TestTransportMaxConnsPerHost deadlocked approximately 100% of
    the time with a small random sleep added between incHostConnCount
    and the select:
    
    	ch := t.incHostConnCount(cmKey):
    	time.Sleep(time.Duration(rand.Intn(10))*time.Millisecond)
    	select {
    	case <-ch
    		// count below conn per host limit; proceed
    	case pc := <-t.getIdleConnCh(cm):
    		...
    	}
    
    The limit also did not properly apply to HTTP/2, because of the
    decrement being attached to the underlying net.Conn.Close
    and net/http not having access to the underlying HTTP/2 conn.
    The alternate decrements for HTTP/2 may also have introduced
    spurious decrements (discussion in #29889). Perhaps those
    spurious decrements or other races caused the other intermittent
    non-deadlock failures in TestTransportMaxConnsPerHost,
    in which the HTTP/2 phase created too many connections (#31982).
    
    This CL replaces the buggy, racy code with new code that is hopefully
    neither buggy nor racy.
    
    Fixes #29889.
    Fixes #31982.
    Fixes #32336.
    
    Change-Id: I0dfac3a6fe8a6cdf5f0853722781fe2ec071ac97
    Reviewed-on: https://go-review.googlesource.com/c/go/+/184262
    Run-TryBot: Russ Cox <rsc@golang.org>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    Reviewed-by: default avatarBryan C. Mills <bcmills@google.com>
    fbaf881c
serve_test.go 163 KB