1. 10 Sep, 2002 16 commits
    • [PATCH] rmap pte_chain speedup and space saving · 9dc8af80
      Andrew Morton authored
      The pte_chains presently consist of a pte pointer and a `next' link.
      So there's a 50% memory wastage here as well as potential for a lot of
      cache misses during walks of the singly-linked per-page list.
      
      This patch increases the pte_chain structure to occupy a full
      cacheline.  There are 7, 15 or 31 pte pointers per structure rather
      than just one.  So the wastage falls to a few percent and the number of
      misses during the walk is reduced.
      
      The patch doesn't make much difference in simple testing, because in
      those tests the pte_chain list from the previous page has good cache
      locality with the next page's list.
      
      The patch sped up Anton's "10,000 concurrently exiting shells" test by
      3x or 4x.  It gives a 10% reduction in system time for a kernel build
      on 16p NUMAQ.
      
      It saves memory and reduces the amount of work performed in the slab
      allocator.
      
      Pages which are mapped by only a single process continue to not have a
      pte_chain.  The pointer in struct page points directly at the mapping
      pte (a "PageDirect" pte pointer).  Once the page is shared a pte_chain
      is allocated and both the new and old pte pointers are moved into it.
      
      We used to collapse the pte_chain back to a PageDirect representation
      in page_remove_rmap().  That has been changed.  That collapse is now
      performed inside page reclaim, via page_referenced().  The thinking
      here is that if a page was previously shared then it may become shared
      again, so leave the pte_chain structure in place.  But if the system is
      under memory pressure then start reaping them anyway.
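
      The 7, 15 or 31 figures follow from reserving one pointer-sized slot
      of the cacheline for the `next' link.  A minimal sketch of that
      arithmetic, assuming 4-byte pointers (ia32) and cachelines of 32, 64
      or 128 bytes; the helper names are hypothetical and are not taken
      from the patch:

      ```c
      #include <assert.h>
      #include <stddef.h>

      /*
       * Hypothetical helper, not kernel code: number of pte pointers that fit
       * in one cacheline-sized pte_chain once one slot is reserved for the
       * `next' link.  Assumes the link and each pte pointer are slot_bytes wide.
       */
      static size_t pte_slots_per_chain(size_t cacheline_bytes, size_t slot_bytes)
      {
          assert(slot_bytes > 0 && cacheline_bytes >= 2 * slot_bytes);
          return cacheline_bytes / slot_bytes - 1;
      }

      /* Share of the structure spent on the link, in percent, rounded down. */
      static unsigned link_overhead_percent(size_t cacheline_bytes, size_t slot_bytes)
      {
          assert(cacheline_bytes > 0);
          return (unsigned)(100 * slot_bytes / cacheline_bytes);
      }
      ```

      With these assumptions the old two-word structure (8 bytes) spends
      half its space on the link, while a 128-byte chain spends one slot
      in 32.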
    • [PATCH] buffer_head takedown for bighighmem machines · e182d612
      Andrew Morton authored
      This patch addresses the excessive consumption of ZONE_NORMAL by
      buffer_heads on highmem machines.  The algorithms which decide which
      buffers to shoot down are fairly dumb, but they only cut in on machines
      with large highmem:lowmem ratios and the code footprint is tiny.
      
      The buffer.c change implements the buffer_head accounting - it sets the
      upper limit on buffer_head memory occupancy to 10% of ZONE_NORMAL.
      
      A possible side-effect of this change is that the kernel will perform
      more calls to get_block() to map pages to disk.  This will only be
      observed when a file is being repeatedly overwritten - this is the only
      case in which the "cached get_block result" in the buffers is useful.
      
      I did quite some testing of this back in the delalloc ext2 days, and
      was not able to come up with a test in which the cached get_block
      result was measurably useful.  That's for ext2, which has a fast
      get_block().
      
      A desirable side effect of this patch is that the kernel will be able
      to cache much more blockdev pagecache in ZONE_NORMAL, so there are more
      ext2/3 indirect blocks in cache, so with some workloads, less I/O will
      be performed.
      
      In mpage_writepage(): if the number of buffer_heads is excessive then
      buffers are stripped from pages as they are submitted for writeback.
      This change is only useful for filesystems which are using the mpage
      code.  That's ext2 and ext3-writeback and JFS.  An mpage patch for
      reiserfs was floating about but seems to have got lost.
      
      There is no need to strip buffers for reads because the mpage code does
      not attach buffers for reads.
      
      These are perhaps not the most appropriate buffer_heads to toss away.
      Perhaps something smarter should be done to detect file overwriting, or
      to toss the 'oldest' buffer_heads first.
      
      In refill_inactive(): if the number of buffer_heads is excessive then
      strip buffers from pages as they move onto the inactive list.  This
      change is useful for all filesystems.  This approach is good because
      pages which are being repeatedly overwritten will remain on the active
      list and will retain their buffers, whereas pages which are not being
      overwritten will be stripped.
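
      The accounting and the stripping decision described above can be
      sketched roughly as follows.  This is an illustration only: the
      structure and function names are hypothetical, and the only figure
      taken from the text is the 10% of ZONE_NORMAL ceiling.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>

      /* Hypothetical accounting state; the names are not the kernel's. */
      struct bh_accounting {
          unsigned long zone_normal_bytes;  /* size of ZONE_NORMAL */
          unsigned long bh_bytes;           /* memory held by buffer_heads */
      };

      /* Upper limit on buffer_head occupancy: 10% of ZONE_NORMAL. */
      static unsigned long bh_limit_bytes(const struct bh_accounting *a)
      {
          assert(a != NULL);
          return a->zone_normal_bytes / 10;
      }

      static bool bh_over_limit(const struct bh_accounting *a)
      {
          return a->bh_bytes > bh_limit_bytes(a);
      }

      /*
       * Decision taken when a page is submitted for writeback or moves onto
       * the inactive list: strip its buffers only while over the limit.
       */
      static bool should_strip_buffers(const struct bh_accounting *a,
                                       bool page_has_buffers)
      {
          return page_has_buffers && bh_over_limit(a);
      }
      ```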
    • [PATCH] reduce the default dirty memory thresholds · ce92adf3
      Andrew Morton authored
      Writeback parameter tuning.  Somewhat experimental, but heading in the
      right direction, I hope.
      
      - Allowing 40% of physical memory to be dirtied on massive ia32 boxes
        is unreasonable.  It pins too many buffer_heads and contributes to
        page reclaim latency.
      
        The patch changes the initial value of
        /proc/sys/vm/dirty_background_ratio, dirty_async_ratio and (the
        presently non-functional) dirty_sync_ratio so that they are reduced
        when the highmem:lowmem ratio exceeds 4:1.
      
        These ratios are scaled so that as the highmem:lowmem ratio goes
        beyond 4:1, the maximum amount of allowed dirty memory ceases to
        increase.  It is clamped at the amount of memory which a 4:1 machine
        is allowed to use.
      
      - Aggressive reduction in the dirty memory threshold at which
        background writeback cuts in.  2.4 uses 30% of ZONE_NORMAL.  2.5 uses
        40% of total memory.  This patch changes it to 10% of total memory
        (if total memory <= 4G.  Even less otherwise - see above).
      
      This means that:
      
      - Much more writeback is performed by pdflush.
      
      - When the application is generating dirty data at a moderate
        rate, background writeback cuts in much earlier, so memory is
        cleaned more promptly.
      
      - Reduces the risk of user applications getting stalled by writeback.
      
      - Will damage dbench numbers.  It turns out that the damage is
        fairly small, and dbench isn't a worthwhile workload for
        optimisation.
      
      - Moderate reduction in the dirty level at which the write(2) caller
        is forced to perform writeback (throttling).  Was 40% of total
        memory.  Is now 30% of total memory (if total memory <= 4G, less
        otherwise).
      
      This is to reduce page reclaim latency, and generally because
      allowing processes to flood the machine with dirty data is a bad
      thing in mixed workloads.
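
      One way to read the 4:1 clamp described above: once total memory
      exceeds five times lowmem, each ratio is scaled down so that the
      absolute amount of dirty memory stays at what a 4:1 machine would
      be allowed.  A minimal sketch under that reading; the function name
      and the exact arithmetic are assumptions, not the code in the
      patch:

      ```c
      #include <assert.h>

      /*
       * Hypothetical helper: scale a dirty ratio (in percent of total memory)
       * so that the permitted dirty memory stops growing once the
       * highmem:lowmem ratio goes beyond 4:1.
       */
      static unsigned scaled_dirty_ratio(unsigned ratio_percent,
                                         unsigned long lowmem_pages,
                                         unsigned long highmem_pages)
      {
          unsigned long long total, cap;

          assert(lowmem_pages > 0);
          total = (unsigned long long)lowmem_pages + highmem_pages;
          cap = 5ULL * lowmem_pages;  /* total memory of a 4:1 machine */

          if (total <= cap)
              return ratio_percent;   /* at or below 4:1: unchanged */
          return (unsigned)(ratio_percent * cap / total);
      }
      ```

      For example, with a 9:1 machine the 10% background threshold
      becomes 5% of total memory, which is the same number of pages as
      10% of a 4:1 machine with the same lowmem.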
    • [PATCH] discontigmem code cleanup #2 · e2f5e334
      Andrew Morton authored
      Patch from Martin Bligh
      
      "This mainly just rips out some magic extra structures in the boot time
       code to determine node sizes, and counts in pages instead of bytes.
       Oh, and I put the code that allocates pgdat into allocate_pgdat,
       instead of find_max_pfn_node, which seems like an incongruous home for
       it.
      
       No functionality changes, nothing touched outside i386 discontigmem ...
       just makes code cleaner and more readable.  Tested on 16-way NUMA-Q."
    • [PATCH] discontigmem code cleanup #1 · 79a96230
      Andrew Morton authored
      Patch from Martin Bligh.
      
      "This mainly changes the PLAT_MY_MACRO_IS_ALL_CAPS() stuff to be
       normal_macro(), and takes out some unnecessary redirection of function
       names.  No functionality changes, nothing touched outside i386
       discontigmem ...  just makes code readable.  Rumour has it that the
       PLAT_* stuff came from IRIX - I don't see that as a good reason to make
       the Linux code unreadable.  Tested on 16-way NUMA-Q."
    • [PATCH] exact dirty state accounting · 1f90eedd
      Andrew Morton authored
      Some adjustments to global dirty page accounting.
      
      Previously, dirty page accounting counted all dirty pages.  Even dirty
      anonymous pages.  This has potential to upset the throttling logic in
      balance_dirty_pages().  Particularly as I suspect we should decrease
      the dirty memory writeback thresholds by a lot.
      
      So this patch changes it so that we only account for dirty pagecache
      pages which have backing store.  Not anonymous pages, not swapcache,
      not in-memory filesystem pages.
      
      To support this, the `memory_backed' boolean has been added to struct
      backing_dev_info.  When an address space's backing device is marked as
      memory-backed, the core kernel knows to not include that mapping's
      pages in the dirty memory accounting.
      
      For memory-backed mappings, dirtiness is a way of pinning the page, and
      there's nothing the kernel can do to clean the page to make it freeable.
      
      driverfs, tmpfs, and ramfs have been converted to mark their mappings as
      memory-backed.
      
      The ramdisk driver hasn't been converted.  I have a separate patch for
      ramdisk, which fails to fix the longstanding problems in there :(
      
      With this patch, /bin/sync now sends /proc/meminfo:Dirty to zero, which
      is rather comforting.
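
      The effect of the `memory_backed' flag on accounting can be
      sketched as below.  Only the flag and struct backing_dev_info come
      from the description above; the cut-down structures, the counter
      argument and the function name are hypothetical stand-ins.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>

      /* Cut-down stand-ins for the kernel structures; illustration only. */
      struct backing_dev_info {
          bool memory_backed;  /* no backing store: pages cannot be cleaned */
      };

      struct address_space {
          struct backing_dev_info *backing_dev_info;
      };

      /*
       * Hypothetical accounting hook: count a newly dirtied page only if its
       * mapping has real backing store.
       */
      static void account_page_dirtied(const struct address_space *mapping,
                                       long *nr_dirty)
      {
          assert(mapping != NULL && mapping->backing_dev_info != NULL);
          if (!mapping->backing_dev_info->memory_backed)
              (*nr_dirty)++;
      }
      ```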
    • [PATCH] pass the correct flags to aops->releasepage() · 6a0fb424
      Andrew Morton authored
      Restore the gfp_mask in the VM's call to a_ops->releasepage().  We can
      block in there again, and XFS (at least) can use that.
    • [PATCH] writer throttling fix · 95b88300
      Andrew Morton authored
      The patch fixes a few problems in the writer throttling code.  Mainly
      in the situation where a single large file is being written out.
      
      That file could be parked on sb->locked_inodes due to pdflush
      writeback, and the writer throttling path coming out of
      balance_dirty_pages() forgot to look for inodes on ->locked_inodes.
      
      The net effect was that the amount of dirty memory was exceeding the
      limit set in /proc/sys/vm/dirty_async_ratio, possibly to the point
      where the system gets seriously choked.
      
      The patch removes sb->locked_inodes altogether and teaches the
      throttling code to look for inodes on sb->s_io as well as sb->s_dirty.
      
      Also, just leave unwritten dirty pages on mapping->io_pages, and
      unwritten dirty inodes on sb->s_io.  Putting them back onto
      ->dirty_pages and ->dirty_inodes was fairly pointless, given that both
      lists need to be looked at.
    • [PATCH] Re: do_syslog/__down_trylock lockup in current BK · 0d8b3b44
      Ingo Molnar authored
      This fixes the lockup.
      
      The bug happened because reparenting in the CLONE_THREAD case was done in
      a fundamentally non-atomic way, which was asking for various races to
      happen: eg. the target parent gets reparented to the currently exiting
      thread ...
      
      (The non-CLONE_THREAD case is safe because nothing reparents init.)
      
      The solution is to make all of reparenting atomic (including the
      forget_original_parent() bit) - this is possible with some reorganization
      done in signal.c and exit.c. This also made some of the loops simpler.
    • [PATCH] Missing IDE partition 3 of 3 on 2.5.34 · 8fb345bd
      Alexander Viro authored
      devfs side fixed thus:
    • [PATCH] hdreg command updates etc · f1c84a2e
      Jens Axboe authored
      Update hdreg to match 2.4 levels.
      
      o Use consistent SRV_STAT instead of SERVICE_STAT
      o Add sector count status bits for tcq
      o Add various missing commands
      o hd_driveid update
    • [PATCH] IDE pci ids · 8930eafc
      Jens Axboe authored
      Update IDE pci ids to match 2.4.20-pre5-ac4 levels.
    • [PATCH] blk_fs_request() · 4372b607
      Jens Axboe authored
      Add blk_fs_request(rq) to avoid testing rq->flags & REQ_CMD directly.
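
      A rough sketch of what such a helper looks like, written here as an
      inline function.  The flag values and the cut-down struct request
      are hypothetical; only the blk_fs_request(), rq->flags and REQ_CMD
      names come from the description above.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>

      /* Hypothetical flag bits; the real values live in the block layer headers. */
      #define REQ_CMD   (1UL << 0)
      #define REQ_OTHER (1UL << 1)  /* stand-in for any non-filesystem request type */

      /* Cut-down stand-in for the block layer's request structure. */
      struct request {
          unsigned long flags;
      };

      /* True if the request is a regular filesystem read/write request. */
      static bool blk_fs_request(const struct request *rq)
      {
          assert(rq != NULL);
          return (rq->flags & REQ_CMD) != 0;
      }
      ```

      Callers then write `if (blk_fs_request(rq))` rather than open-coding
      the flag test, so the representation can change in one place.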
    • [PATCH] PCI individual resource handling · e47901f9
      Jens Axboe authored
      This merges the changes from 2.4-ac that allow drivers to enable (and
      mark as used) only a subset of PCI resources, for those drivers that
      need it (at this point apparently only the i845 IDE controller).
    • [PATCH] undo 2.5.34 ftape damage · ac9c060c
      Mikael Pettersson authored
      In the 2.5.33->2.5.34 step someone removed "export-objs" from
      drivers/char/ftape/lowlevel/Makefile, which makes it impossible to build
      ftape as a module since it _does_ have a number of EXPORT_SYMBOL's.
      
      This reverts that change.
    • [PATCH] 2.5.34 floppy driver init/exit fixes · 9d1f9419
      Mikael Pettersson authored
      The 2.5 floppy driver has for a long time had two init/exit bugs:
      1. It calls register_sys_device() on init, but fails to call
         unregister_sys_device() in exit. This leads to data structure
         corruption if floppy is a module and it gets unloaded.
      2. It calls register_sys_device() early on init, but fails to call
         unregister_sys_device() if init fails. Again, this leads to
         data structure corruption.
      
      The patch below fixes both these problems.
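
      The diff itself is not reproduced in this log.  As an illustration
      of the register/unregister pairing that the two fixes restore, here
      is a minimal sketch using hypothetical stubs in place of
      register_sys_device() and unregister_sys_device(); it is not the
      floppy driver's code.

      ```c
      #include <assert.h>
      #include <stdbool.h>

      /* Hypothetical stand-in for the system device being registered. */
      struct sys_device_stub {
          int registered;  /* number of outstanding registrations */
      };

      static int stub_register(struct sys_device_stub *d)
      {
          d->registered++;
          return 0;
      }

      static void stub_unregister(struct sys_device_stub *d)
      {
          assert(d->registered > 0);
          d->registered--;
      }

      /* Init: undo the registration if any later step fails (bug 2). */
      static int driver_init(struct sys_device_stub *d, bool later_step_fails)
      {
          int err = stub_register(d);

          if (err)
              return err;
          if (later_step_fails) {
              stub_unregister(d);
              return -1;
          }
          return 0;
      }

      /* Exit: always drop the registration made in init (bug 1). */
      static void driver_exit(struct sys_device_stub *d)
      {
          stub_unregister(d);
      }
      ```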
  2. 09 Sep, 2002 24 commits