• Matthew Holt's avatar
    tls: Fix background certificate renewals that use TLS-SNI challenge · 0e34c7c9
    Matthew Holt authored
    The loop which performs renewals in the background obtains a read lock
    on the certificate cache map, so that it can be safely iterated. Before
    this fix, it would obtain the renewals in the read lock. This has been
    fine, except that the TLS-SNI challenge, when invoked after Caddy has
    already started, requires adding a certificate to the cache. Doing this
    requires an exclusive write lock. But it cannot obtain a write lock
    because a read lock is obtained higher in the stack, while the loop
    iterates. In other words, it's a deadlock.
    
    I was able to reproduce this issue consistently locally, after jumping
    through many hoops to force a renewal in a short time that bypasses
    Let's Encrypt's authz caching. I was also able to verify that by queuing
    renewals (like we do deletions and OCSP updates), lock contention is
    relieved and the deadlock is avoided.
    
    This only affects background renewals where the TLS-SNI(-01) challenge
    are used. Users report seeing strange errors in the logs after this
    happens ("tls: client offered an unsupported, maximum protocol version
    of 301"), but I was not able to reproduce these locally. I was also not
    able to reproduce the leak of sockets which are left in CLOSE_WAIT.
    I am not sure if those are symptoms of running in production on Linux
    and are related to this bug, or not.
    
    Either way, this is an important fix. I do not yet know the ripple
    effects this will have on other symptoms we've been chasing. But it
    definitely resolves a deadlock during renewals.
    0e34c7c9
maintain.go 9.36 KB