• Linus Torvalds's avatar
    Linux 2.2.0pre7 · c68677ac
    Linus Torvalds authored
    Ok, I think I now know why pre-6 looks so unbalanced. It's two issues.
    Basically, trying to swap out a large number of pages from one process
    context is just doomed. It bascially sucks, because
    
     - it has bad latency. This is further excerberated by the per-process
       "thrashing_memory" flag, which means that if we were unlucky enough to
       be selected to be the process that frees up memory, we'll probably be
       stuck with it for a long time. That can make it extremely unfair under
       some circumstances - other processes may allocate the pages we free'd
       up, so that we keep on being counted as a memory trasher even if we
       really aren't.
       Note that this shows most under "moderate" load - the problem doesn't
       tend to show itself if you have some process that is _really_
       allocating a lot of pages, because then that process will be correctly
       found by the trashing logic. But if you have lots of "normal load"
       processes, some of those can get really badly hurt by this.
       In particular, the worst case you have a number of processes that all
       allocate memory, but not very quickly - certainly not more quickly than
       we can page things out. What happens is that under these circumstances
       one of them gets marked as a "scapegoat", and once that happens all the
       others will just live off the pages that the scapegoat frees up, while
       the scapegoat itself doesn't make much progress at all because it is
       always just freeing memory for others.
       The really bad behaviour tends to go away reasonably quickly, but while
       it happens it's _really_ unfair.
     - try_to_free_pages() just goes overboard, and starts paging stuff out
       without getting back to the nice balanced behaviour. This is what
       Andrea noticed.
       Essentially, once it starts failing the shrink_mmap() tests, it will
       just page things out crazily. Normally this is avoided by just always
       starting from shrink_mmap(), but if you ask try_to_free_pages() to try
       to free up a ton of pages, the balancing that it does is basically
       bypassed.
    
    So basically pre-6 works _really_ well for the kind of stress-me stuff
    that it was designed for: a few processes that are extremely memory
    hungry. It gets close to perfect swap-out behaviour, simply because it is
    optimized for getting into a paging rut.
    
    That makes for nice benchmarks, but it also explains why (a) sometimes
    it's just not very nice for interactive behaviour and (b) why it under
    normal load can easily swap much too eagerly.
    
    Anyway, the first problem is fixed by making "trashing" be a global flag
    rather than a per-process flag. Being per-process is really nice when it
    finds the right process, but it's really unfair under a lot of other
    circumstances. I'd rather be fair than get the best possible page-out
    speed.
    
    Note that even a global flag helps: it still clusters the write-outs, and
    means that processes that allocate more pages tend to be more likely to be
    hit by it, so it still does a large part of what the per-process flag did
    - without the unfairness (but admittedly being unfair sometimes gets you
    better performance - you just have to be _very_ careful whom you target
    with the unfairness, and that's the hard part).
    
    The second problem actually goes away by simply just not asking
    try_to_free_pages() to free too many pages - and having the global
    trashing flag makes it unnecessary to do so anyway because the flag will
    essentially cluster the page-outs even without asking for them to be all
    done in one large chunk (and now it's not just one process that gets hit
    any more).
    
    There's a "pre-7.gz" on ftp.kernel.org in testing, anybody interested?
    It's not the real thing, as I haven't done the write semaphore deadlock
    thing yet, but that one will not affect normal users anyway so for
    performance testing this should be equivalent.
    
                            Linus
    c68677ac
setup.c 19.5 KB