golang/runtime/libgolang.cpp · 90143b64eff79fe49aa3fafc921630f1e0a290c2 · Kirill Smelkov / pygolang

Kirill Smelkov authored Apr 15, 2024
Background: in 2019 in 9c260fde (time: New package that mirrors Go's
time) and b073f6df (time: Move/Port timers to C++/Pyx nogil) I've added
basic timers - with proper API but with a very dumb implementation that
spawned one thread per timer. Users had just a few timers, and this was
working, surprisingly, relatively ok...

... until 2023, when I was working on XLTE, which needs to organize 100Hz
polling of Amarisoft eNodeB service to retrieve information about flows
on Data Radio Bearers:

    xlte@2a016d48
    https://lab.nexedi.com/kirr/xlte/-/blob/8e606c64/amari/drb.py

There each request comes with its own deadline - to catch "no reply",
and the deadlines are implemented via timers. So there are 100 threads
created every second, which adds visible overhead, consumes a lot of
virtual address space and RSS for thread stacks, and should all be unnecessary.

We were tolerating even that for some time, but recently Joanne approached me
with reports that the xamari program, which does the polling, is leaking memory.

With that, and because it was hard to find what is actually leaking,
I've started to remove uncertainties, and there is a lot of uncertainty
in what is going on when lots of threads are being created over and over.

In the end the leak turned out to be likely a different thing (see
nexedi/pygolang!24, which was still discovered while working on this
patch), but all of the above was enough motivation to finally start
redoing the timers properly.

--------

So when it comes to doing the timers more or less properly, there is
usually a queue of armed timers, and a loop that picks entries from that
queue to fire them. I was initially trying to do the simple thing and
use std::priority_queue for that, because priority_queue is internally a
heap, and heaps can provide O(log(n)) insertion and removal of an
arbitrary element (given its position in the heap), plus O(1) "pick top
element to process". Exactly what would suit. However I quickly found
that even in 2024, std::priority_queue does not provide a removal
operation at all, and there is no such thing as e.g. std::sift_heap that
would help to implement that manually. Which is surprising, because e.g.
libevent implements all that just ok via sifting up/down upon removal in
logarithmic complexity:

https://github.com/libevent/libevent/blob/80e25c02/minheap-internal.h#L96-L115

The lack of an efficient removal operation turned out to be a blocker for
using std::priority_queue, because most of the timers that are armed for
timeouts never expire: upon successful completion of the covered
operation, the timer is stopped. In other words the timer is removed
from the timer queue, and removal is one of the most frequent
operations.
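
To illustrate what "sifting up/down upon removal" means, here is a minimal
Python sketch of a binary min-heap that tracks the position of every entry
and so can remove an arbitrary timer in O(log(n)). It is not the libevent
code and not what this patch uses; the class name RemovableMinHeap and the
(deadline, key) entry layout are made up for illustration only.

```python
class RemovableMinHeap:
    """Min-heap of (deadline, key) entries with O(log n) removal by key.

    Hypothetical illustration: keys must be unique and comparable.
    """

    def __init__(self):
        self._a = []     # heap array of (deadline, key)
        self._pos = {}   # key -> current index of its entry in self._a

    def __len__(self):
        return len(self._a)

    def _swap(self, i, j):
        a = self._a
        a[i], a[j] = a[j], a[i]
        self._pos[a[i][1]] = i
        self._pos[a[j][1]] = j

    def _sift_up(self, i):
        # move entry i towards the root while it is smaller than its parent
        while i > 0:
            parent = (i - 1) // 2
            if self._a[i] >= self._a[parent]:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        # move entry i towards the leaves while a child is smaller
        n = len(self._a)
        while True:
            smallest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self._a[child] < self._a[smallest]:
                    smallest = child
            if smallest == i:
                break
            self._swap(i, smallest)
            i = smallest

    def push(self, deadline, key):
        self._a.append((deadline, key))
        self._pos[key] = len(self._a) - 1
        self._sift_up(len(self._a) - 1)

    def top(self):
        return self._a[0]

    def remove(self, key):
        i = self._pos.pop(key)
        last = self._a.pop()
        if i == len(self._a):
            return               # the removed entry was the last one
        # put the former last entry into the hole and restore heap order
        self._a[i] = last
        self._pos[last[1]] = i
        self._sift_up(i)
        self._sift_down(self._pos[last[1]])
```

The extra key -> index map is what std::priority_queue does not expose:
without knowing where an entry sits in the array, removal degrades to a
linear search.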

So, if std::priority_queue cannot work, we would need to either bring in
another implementation of a heap, or, if we are to bring something,
bring and use something else that is more suitable for implementing
timers.

That reminded me that in 2005, for my Navy project, I had already
implemented a custom timer wheel to handle timeouts after reading
https://lwn.net/Articles/152436/ .
Contrary to heaps, such timer wheels provide O(1) insertion and removal
of timers and generally work faster. But this time I did not want to
delve into implementing all that myself again and looked around for
what is available out there.
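
As a rough illustration of why a wheel gives O(1) insertion and removal,
below is a minimal single-level (hashed) timer wheel in Python. This is a
simplified sketch, not the kernel or Ratas design: real implementations
are hierarchical so that far-away timers are not rescanned on every lap.
The names TimerWheel, arm, cancel and advance are hypothetical.

```python
class TimerWheel:
    """Single-level hashed timer wheel working in integer ticks (sketch)."""

    def __init__(self, nslots=256):
        self.nslots = nslots
        # each slot maps timer_id -> (expire_tick, callback)
        self.slots = [dict() for _ in range(nslots)]
        self.where = {}   # timer_id -> slot index, to make cancel O(1)
        self.now = 0      # current tick

    def arm(self, timer_id, delay_ticks, callback):
        # O(1): the slot is computed directly from the expiration tick
        assert delay_ticks >= 1
        expire = self.now + delay_ticks
        slot = expire % self.nslots
        self.slots[slot][timer_id] = (expire, callback)
        self.where[timer_id] = slot

    def cancel(self, timer_id):
        # O(1): we know which slot the timer lives in
        slot = self.where.pop(timer_id, None)
        if slot is None:
            return False   # already fired or never armed
        del self.slots[slot][timer_id]
        return True

    def advance(self, nticks=1):
        for _ in range(nticks):
            self.now += 1
            bucket = self.slots[self.now % self.nslots]
            # timers that hash to this slot but expire on a later lap stay
            due = [tid for tid, (expire, _) in bucket.items()
                   if expire <= self.now]
            for tid in due:
                _, callback = bucket.pop(tid)
                del self.where[tid]
                callback()
```

In this sketch arming and cancelling never touch other timers, which is
the property that matters when most timers are stopped before they fire.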

There was an update to the kernel timer-wheel implementation described at
https://lwn.net/Articles/646950/ and from that a project called
Timeout.c was also found, which provides an implementation of such a
wheel for user space: https://25thandclement.com/~william/projects/timeout.c.html .

However when we are to pick third-party code, we should be ready to
understand it and fix bugs there on our own. And the audit of timeout.c
did not go very smoothly - there are many platform-dependent places,
and the issue tracker shows signs that sometimes not everything is ok
with the implementation. With that I've looked around a bit more and
found the more compact and more portable Ratas library, with good
structure and description, whose audit went better:

    https://www.snellman.net/blog/archive/2016-07-27-ratas-hierarchical-timer-wheel
    https://github.com/jsnell/ratas

Here, after going through the code, I feel capable of understanding
issues and fixing bugs myself if that becomes needed.

And the benchmark comparison of Timeout.c and Ratas shows that they
should be of the same order regarding performance:

https://lab.nexedi.com/kirr/misc/-/blob/4f51fd6/bench/time-wheel/ratas-vs-timeout.pdf
ratas@382321d2
timeout@d6f15744

which makes Ratas the winner for me.

Having a timer-wheel implementation, the rest is just technique to glue
it all together. One implementation aspect deserves to be mentioned though:

The timer loop uses Semaphore.acquire, recently modernized to also
accept a timeout, to sleep until the next timer expiration while still
being able to be woken up earlier if a new timer is armed with an
earlier expiration time.
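
The idea of the paragraph above can be sketched in plain Python as
follows. This is only an illustration under assumptions: it uses the
standard threading.Semaphore as a stand-in for the libgolang semaphore,
and heapq instead of a timer wheel to keep the example short. The
TimerLoop class and its method names are hypothetical and are not the
API added by the patch.

```python
import heapq
import threading
import time


class TimerLoop:
    """One background thread that fires callbacks at their deadlines."""

    def __init__(self):
        self._mu = threading.Lock()
        self._queue = []     # heap of (deadline, seq, callback)
        self._seq = 0        # tie-breaker so callbacks are never compared
        self._wakeup = threading.Semaphore(0)
        self._stopped = False
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def arm(self, delay, callback):
        deadline = time.monotonic() + delay
        with self._mu:
            self._seq += 1
            earlier = (not self._queue) or deadline < self._queue[0][0]
            heapq.heappush(self._queue, (deadline, self._seq, callback))
        if earlier:
            # the loop may be sleeping for too long: wake it up
            self._wakeup.release()

    def stop(self):
        with self._mu:
            self._stopped = True
        self._wakeup.release()
        self._thread.join()

    def _run(self):
        while True:
            due = []
            with self._mu:
                if self._stopped:
                    return
                now = time.monotonic()
                while self._queue and self._queue[0][0] <= now:
                    due.append(heapq.heappop(self._queue)[2])
                # None means "sleep until somebody wakes us up"
                timeout = (self._queue[0][0] - now) if self._queue else None
            if due:
                for callback in due:
                    callback()
                continue   # recompute the timeout after running callbacks
            # sleep until the next deadline, or until arm()/stop() releases
            self._wakeup.acquire(timeout=timeout)
```

Extra releases of the semaphore only cause a spurious wakeup: the loop
rechecks the queue and goes back to sleep with a freshly computed timeout.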

Other than that the changes are mostly straightforward. Please see the
patch itself for details.

Regarding how much more efficient the new implementation is compared to
what we had before, there are added benchmarks to measure arming timers
that do not fire, and, for symmetry, arming timers that do fire. We are
most interested in the first benchmark, because it shows how cheap or
expensive it is to use timers to implement timeouts, but the second one
is also useful to have to see the overhead of the whole timers machinery.

On my machine under py3.11 they look as follows after this patch:

    name              time/op
    timer_arm_cancel   805ns ± 0%
    timer_arm_fire    9.63µs ± 0%

and before the patch the benchmarks simply do not run to the end
because they run out of memory due to the huge number of threads being
created.

Still, with the following test program we can measure the effect the new
timers implementation has:

    ---- 8< ----
    from golang import time

    def main():
        δt_rate = 1*time.millisecond

        tprev = time.now()
        tnext = tprev + δt_rate
        while 1:
            timer = time.Timer(5*time.second)
            _ = timer.stop()
            assert _ is True

            t = time.now()
            δtsleep = tnext - t
            #print('sleep %.3f ms' % (δtsleep/time.millisecond))
            time.sleep(δtsleep)
            tprev = tnext
            tnext += δt_rate

    main()
    ---- 8< ----

This program creates/arms and cancels a timer 1000 times per second.

Before this patch this program consumes ~ 30% of CPU, while after
this patch it consumes ~ 7-8% of CPU.

For reference, just the sleep part of that program, with all code
related to timers removed, consumes ~ 5% of CPU, while the consumption of
plain sleep(1ms) in C, directly using system calls

    ---- 8< ----
    #include <unistd.h>

    int main() {
        while (1) {
            usleep(1000);
        }
        return 0;
    }
    ---- 8< ----

is ~ 3-4% of CPU on my machine.

/cc @jerome
/cc ORS team (@jhuge, @lu.xu, @tomo, @xavier_thompson, @Daetalus)