1. 05 Sep, 2005 40 commits
    • Zachary Amsden's avatar
      [PATCH] x86: ptep_clear optimization · a600388d
      Zachary Amsden authored
      Add a new accessor for PTEs, which passes the full hint from the mmu_gather
      struct; this allows architectures with hardware pagetables to optimize away
      atomic PTE operations when destroying an address space.  Removing the
      locked operation should allow better pipelining of memory access in this
      loop.  I measured an average savings of 30-35 cycles per zap_pte_range on
      the first 500 destructions on Pentium-M, but I believe the optimization
      would win more on older processors which still assert the bus lock on xchg
      for an exclusive cacheline.
      
      Update: I made some new measurements, and this saves exactly 26 cycles over
      ptep_get_and_clear on Pentium M.  On P4, with a PAE kernel, this saves 180
      cycles per ptep_get_and_clear, for a whopping 92160 cycles savings for a
      full address space destruction.
      
      pte_clear_full is not yet used, but is provided for future optimizations
      (in particular, when running inside of a hypervisor that queues page table
      updates, the full hint allows us to avoid queueing unnecessary page table
      update for an address space in the process of being destroyed.
      
      This is not a huge win, but it does help a bit, and sets the stage for
      further hypervisor optimization of the mm layer on all architectures.
      Signed-off-by: default avatarZachary Amsden <zach@vmware.com>
      Cc: Christoph Lameter <christoph@lameter.com>
      Cc: <linux-mm@kvack.org>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      a600388d
    • Kyle Moffett's avatar
      [PATCH] sab: consolidate kmem_bufctl_t · fa5b08d5
      Kyle Moffett authored
      This is used only in slab.c and each architecture gets to define whcih
      underlying type is to be used.
      
      Seems a bit silly - move it to slab.c and use the same type for all
      architectures: unsigned int.
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      fa5b08d5
    • Chen, Kenneth W's avatar
      [PATCH] remove hugetlb_clean_stale_pgtable() and fix huge_pte_alloc() · 0e5c9f39
      Chen, Kenneth W authored
      I don't think we need to call hugetlb_clean_stale_pgtable() anymore
      in 2.6.13 because of the rework with free_pgtables().  It now collect
      all the pte page at the time of munmap.  It used to only collect page
      table pages when entire one pgd can be freed and left with staled pte
      pages.  Not anymore with 2.6.13.  This function will never be called
      and We should turn it into a BUG_ON.
      
      I also spotted two problems here, not Adam's fault :-)
      (1) in huge_pte_alloc(), it looks like a bug to me that pud is not
          checked before calling pmd_alloc()
      (2) in hugetlb_clean_stale_pgtable(), it also missed a call to
          pmd_free_tlb.  I think a tlb flush is required to flush the mapping
          for the page table itself when we clear out the pmd pointing to a
          pte page.  However, since hugetlb_clean_stale_pgtable() is never
          called, so it won't trigger the bug.
      Signed-off-by: default avatarKen Chen <kenneth.w.chen@intel.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      0e5c9f39
    • Adam Litke's avatar
      [PATCH] hugetlb: check p?d_present in huge_pte_offset() · 02b0ccef
      Adam Litke authored
      For demand faulting, we cannot assume that the page tables will be
      populated.  Do what the rest of the architectures do and test p?d_present()
      while walking down the page table.
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: <linux-mm@kvack.org>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      02b0ccef
    • Adam Litke's avatar
      [PATCH] hugetlb: move stale pte check into huge_pte_alloc() · 7bf07f3d
      Adam Litke authored
      Initial Post (Wed, 17 Aug 2005)
      
      This patch moves the
      	if (! pte_none(*pte))
      		hugetlb_clean_stale_pgtable(pte);
      logic into huge_pte_alloc() so all of its callers can be immune to the bug
      described by Kenneth Chen at http://lkml.org/lkml/2004/6/16/246
      
      > It turns out there is a bug in hugetlb_prefault(): with 3 level page table,
      > huge_pte_alloc() might return a pmd that points to a PTE page. It happens
      > if the virtual address for hugetlb mmap is recycled from previously used
      > normal page mmap. free_pgtables() might not scrub the pmd entry on
      > munmap and hugetlb_prefault skips on any pmd presence regardless what type
      > it is.
      
      Unless I am missing something, it seems more correct to place the check inside
      huge_pte_alloc() to prevent a the same bug wherever a huge pte is allocated.
      It also allows checking for this condition when lazily faulting huge pages
      later in the series.
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: <linux-mm@kvack.org>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      7bf07f3d
    • Adam Litke's avatar
      [PATCH] hugetlb: add pte_huge() macro · 32e51a8c
      Adam Litke authored
      This patch adds a macro pte_huge(pte) for i386/x86_64 which is needed by a
      patch later in the series.  Instead of repeating (_PAGE_PRESENT |
      _PAGE_PSE), I've added __LARGE_PTE to i386 to match x86_64.
      Signed-off-by: default avatarAdam Litke <agl@us.ibm.com>
      Cc: <linux-mm@kvack.org>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      32e51a8c
    • Deepak Saxena's avatar
      [PATCH] arm: allow for arch-specific IOREMAP_MAX_ORDER · fd195c49
      Deepak Saxena authored
      Version 6 of the ARM architecture introduces the concept of 16MB pages
      (supersections) and 36-bit (40-bit actually, but nobody uses this) physical
      addresses.  36-bit addressed memory and I/O and ARMv6 can only be mapped
      using supersections and the requirement on these is that both virtual and
      physical addresses be 16MB aligned.  In trying to add support for ioremap()
      of 36-bit I/O, we run into the issue that get_vm_area() allows for a
      maximum of 512K alignment via the IOREMAP_MAX_ORDER constant.  To work
      around this, we can:
      
      - Allocate a larger VM area than needed (size + (1ul << IOREMAP_MAX_ORDER))
        and then align the pointer ourselves, but this ends up with 512K of
        wasted VM per ioremap().
      
      - Provide a new __get_vm_area_aligned() API and make __get_vm_area() sit
        on top of this. I did this and it works but I don't like the idea
        adding another VM API just for this one case.
      
      - My preferred solution which is to allow the architecture to override
        the IOREMAP_MAX_ORDER constant with it's own version.
      Signed-off-by: default avatarDeepak Saxena <dsaxena@plexity.net>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      fd195c49
    • Paolo 'Blaisorblade' Giarrusso's avatar
      [PATCH] mm: correct _PAGE_FILE comment · 9b4ee40e
      Paolo 'Blaisorblade' Giarrusso authored
      _PAGE_FILE does not indicate whether a file is in page / swap cache, it is
      set just for non-linear PTE's.  Correct the comment for i386, x86_64, UML.
      Also clearify _PAGE_NONE.
      Signed-off-by: default avatarPaolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      9b4ee40e
    • Paolo 'Blaisorblade' Giarrusso's avatar
      [PATCH] mm: remove implied vm_ops check · 4944e76d
      Paolo 'Blaisorblade' Giarrusso authored
      If !vma->vm-ops we already BUG above, so retesting it is useless.  The
      compiler cannot optimize this because BUG is a macro and is not thus marked
      noreturn; that should possibly be fixed.
      Signed-off-by: default avatarPaolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      4944e76d
    • Paolo 'Blaisorblade' Giarrusso's avatar
      [PATCH] shmem_populate: avoid an useless check, and some comments · d44ed4f8
      Paolo 'Blaisorblade' Giarrusso authored
      Either shmem_getpage returns a failure, or it found a page, or it was told
      it couldn't do any I/O.  So it's useless to check nonblock in the else
      branch.  We could add a BUG() there but I preferred to comment the
      offending function.
      
      This was taken out from one Ingo Molnar's old patch I'm resurrecting.
      Signed-off-by: default avatarPaolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      d44ed4f8
    • Martin Hicks's avatar
      [PATCH] vm: slab.c spelling correction · 0abf40c1
      Martin Hicks authored
      Fix a small spelling mistake.  subtile->subtle
      Signed-off-by: default avatarMartin Hicks <mort@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      0abf40c1
    • Paolo 'Blaisorblade' Giarrusso's avatar
      [PATCH] comment typo fix · e83a9596
      Paolo 'Blaisorblade' Giarrusso authored
      smp_entry_t -> swap_entry_t
      Signed-off-by: default avatarPaolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      e83a9596
    • Hugh Dickins's avatar
      [PATCH] mm: fix madvise vma merging · 836d5ffd
      Hugh Dickins authored
      Better late than never, I've at last reviewed the madvise vma merging
      going into 2.6.13.  Remove a pointless check and fix two little bugs -
      a simple test (with /proc/<pid>/maps hacked to show ReadHints) showed
      both mismerges in practice: though being madvise, neither was disastrous.
      
      1. Correct placement of the success label in madvise_behavior: as in
         mprotect_fixup and mlock_fixup, it is necessary to update vm_flags
         when vma_merge succeeds (to handle the exceptional Case 8 noted in
         the comments above vma_merge itself).
      
      2. Correct initial value of prev when starting part way into a vma: as
         in sys_mprotect and do_mlock, it needs to be set to vma in this case
         (vma_merge handles only that minimum of cases shown in its comments).
      
      3. If find_vma_prev sets prev, then the vma it returns is prev->vm_next,
         so it's pointless to make that same assignment again in sys_madvise.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      836d5ffd
    • Martin Hicks's avatar
      [PATCH] VM: zone reclaim atomic ops cleanup · 53e9a615
      Martin Hicks authored
      Christoph Lameter and Marcelo Tosatti asked to get rid of the
      atomic_inc_and_test() to cleanup the atomic ops in the zone reclaim code.
      Signed-off-by: default avatarMartin Hicks <mort@sgi.com>
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      53e9a615
    • Martin Hicks's avatar
      [PATCH] VM: add capabilites check to set_zone_reclaim · bce5f6ba
      Martin Hicks authored
      Add a capability check to sys_set_zone_reclaim().  This syscall is not
      something that should be available to a user.
      Signed-off-by: default avatarMartin Hicks <mort@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      bce5f6ba
    • Nick Piggin's avatar
      [PATCH] mm: remove atomic · 242e5468
      Nick Piggin authored
      This bitop does not need to be atomic because it is performed when there will
      be no references to the page (ie.  the page is being freed).
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      242e5468
    • Nick Piggin's avatar
      [PATCH] mm: remap ZERO_PAGE mappings · 9a61c349
      Nick Piggin authored
      filemap_xip's nopage routine maps the ZERO_PAGE into readonly mappings, if it
      has no data page to map there: then if the hole in the file is later filled,
      __xip_unmap uses an rmap technique to replace the ZERO_PAGEs mapped for that
      offset by the newly allocated file page, so that established mappings will see
      the newly written data.
      
      However, on MIPS (alone) there's not one but as many as eight ZERO_PAGEs,
      chosen for coloring by user virtual address; and if mremap has meanwhile been
      used to move a mapping containing a ZERO_PAGE, it will generally not match the
      ZERO_PAGE(address) __xip_unmap is looking for.
      
      To maintain XIP's established mappings correctly on MIPS, we need Nick's fix
      to mremap's move_one_page (originally presented as an optimization), to
      replace the ZERO_PAGE appropriate to the old address by the ZERO_PAGE
      appropriate to the new address.
      
      (But when I first saw this, I was thinking the ZERO_PAGEs themselves would get
      corrupted, very bad.  Now I think it's the other way round, that the
      established mappings will fail to see the newly written data: incorrect, but
      not corrupting everything else.  Whether filemap_xip's technique is generally
      safe, I'd hesitate to say in a hurry: it's interesting, but we've never tried
      to do that in tmpfs.)
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      9a61c349
    • Nick Piggin's avatar
      [PATCH] mm: cleanup rmap · 4d7670e0
      Nick Piggin authored
      Thanks to Bill Irwin for pointing this out.
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      4d7670e0
    • Nick Piggin's avatar
      [PATCH] mm: micro-optimise rmap · 2822c1aa
      Nick Piggin authored
      Microoptimise page_add_anon_rmap.  Although these expressions are used only in
      the taken branch of the if() statement, the compiler can't reorder them inside
      because atomic_inc_and_test is a barrier.
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      2822c1aa
    • Nick Piggin's avatar
      [PATCH] mm: comment rmap · c3dce2d8
      Nick Piggin authored
      Just be clear that VM_RESERVED pages here are a bug, and the test is not there
      because they are expected.
      Signed-off-by: default avatarNick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      c3dce2d8
    • Christoph Lameter's avatar
      [PATCH] /proc/<pid>/numa_maps to show on which nodes pages reside · 6e21c8f1
      Christoph Lameter authored
      This patch was recently discussed on linux-mm:
      http://marc.theaimsgroup.com/?t=112085728500002&r=1&w=2
      
      I inherited a large code base from Ray for page migration.  There was a
      small patch in there that I find to be very useful since it allows the
      display of the locality of the pages in use by a process.  I reworked that
      patch and came up with a /proc/<pid>/numa_maps that gives more information
      about the vma's of a process.  numa_maps is indexes by the start address
      found in /proc/<pid>/maps.  F.e.  with this patch you can see the page use
      of the "getty" process:
      
      margin:/proc/12008 # cat maps
      00000000-00004000 r--p 00000000 00:00 0
      2000000000000000-200000000002c000 r-xp 00000000 08:04 516                /lib/ld-2.3.3.so
      2000000000038000-2000000000040000 rw-p 00028000 08:04 516                /lib/ld-2.3.3.so
      2000000000040000-2000000000044000 rw-p 2000000000040000 00:00 0
      2000000000058000-2000000000260000 r-xp 00000000 08:04 54707842           /lib/tls/libc.so.6.1
      2000000000260000-2000000000268000 ---p 00208000 08:04 54707842           /lib/tls/libc.so.6.1
      2000000000268000-2000000000274000 rw-p 00200000 08:04 54707842           /lib/tls/libc.so.6.1
      2000000000274000-2000000000280000 rw-p 2000000000274000 00:00 0
      2000000000280000-20000000002b4000 r--p 00000000 08:04 9126923            /usr/lib/locale/en_US.utf8/LC_CTYPE
      2000000000300000-2000000000308000 r--s 00000000 08:04 60071467           /usr/lib/gconv/gconv-modules.cache
      2000000000318000-2000000000328000 rw-p 2000000000318000 00:00 0
      4000000000000000-4000000000008000 r-xp 00000000 08:04 29576399           /sbin/mingetty
      6000000000004000-6000000000008000 rw-p 00004000 08:04 29576399           /sbin/mingetty
      6000000000008000-600000000002c000 rw-p 6000000000008000 00:00 0          [heap]
      60000fff7fffc000-60000fff80000000 rw-p 60000fff7fffc000 00:00 0
      60000ffffff44000-60000ffffff98000 rw-p 60000ffffff44000 00:00 0          [stack]
      a000000000000000-a000000000020000 ---p 00000000 00:00 0                  [vdso]
      
      cat numa_maps
      2000000000000000 default MaxRef=43 Pages=11 Mapped=11 N0=4 N1=3 N2=2 N3=2
      2000000000038000 default MaxRef=1 Pages=2 Mapped=2 Anon=2 N0=2
      2000000000040000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
      2000000000058000 default MaxRef=43 Pages=61 Mapped=61 N0=14 N1=15 N2=16 N3=16
      2000000000268000 default MaxRef=1 Pages=2 Mapped=2 Anon=2 N0=2
      2000000000274000 default MaxRef=1 Pages=3 Mapped=3 Anon=3 N0=3
      2000000000280000 default MaxRef=8 Pages=3 Mapped=3 N0=3
      2000000000300000 default MaxRef=8 Pages=2 Mapped=2 N0=2
      2000000000318000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N2=1
      4000000000000000 default MaxRef=6 Pages=2 Mapped=2 N1=2
      6000000000004000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
      6000000000008000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
      60000fff7fffc000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
      60000ffffff44000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
      
      getty uses ld.so.  The first vma is the code segment which is used by 43
      other processes and the pages are evenly distributed over the 4 nodes.
      
      The second vma is the process specific data portion for ld.so.  This is
      only one page.
      
      The display format is:
      
      <startaddress>	 Links to information in /proc/<pid>/map
      <memory policy>  This can be "default" "interleave={}", "prefer=<node>" or "bind={<zones>}"
      MaxRef=		<maximum reference to a page in this vma>
      Pages=		<Nr of pages in use>
      Mapped=		<Nr of pages with mapcount >
      Anon=		<nr of anonymous pages>
      Nx=		<Nr of pages on Node x>
      
      The content of the proc-file is self-evident.  If this would be tied into
      the sparsemem system then the contents of this file would not be too
      useful.
      Signed-off-by: default avatarChristoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      6e21c8f1
    • Hugh Dickins's avatar
      [PATCH] rmap: don't test rss · 839b9685
      Hugh Dickins authored
      Remove the three get_mm_counter(mm, rss) tests from rmap.c: there was a
      time when testing rss was important to avoid a particular race between
      dup_mmap and the anonmm rmap; but now it's just a rather silly pseudo-
      optimization, made even more obscure by the get_mm_counter macro.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      839b9685
    • Hugh Dickins's avatar
      [PATCH] delete from_swap_cache BUG_ONs · 3279ffd9
      Hugh Dickins authored
      Three of the four BUG_ONs in delete_from_swap_cache are immediately
      repeated in __delete_from_swap_cache: delete those and add the one.  But
      perhaps mm/ is altogether overprovisioned with historic BUGs?
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      3279ffd9
    • Hugh Dickins's avatar
      [PATCH] swap: update swsusp use of swap_info · dae06ac4
      Hugh Dickins authored
      Aha, swsusp dips into swap_info[], better update it to swap_lock.  It's
      bitflipping flags with 0xFF, so get_swap_page will allocate from only the one
      chosen device: let's change that to flip SWP_WRITEOK.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      dae06ac4
    • Hugh Dickins's avatar
      [PATCH] swap: swap_lock replace list+device · 5d337b91
      Hugh Dickins authored
      The idea of a swap_device_lock per device, and a swap_list_lock over them all,
      is appealing; but in practice almost every holder of swap_device_lock must
      already hold swap_list_lock, which defeats the purpose of the split.
      
      The only exceptions have been swap_duplicate, valid_swaphandles and an
      untrodden path in try_to_unuse (plus a few places added in this series).
      valid_swaphandles doesn't show up high in profiles, but swap_duplicate does
      demand attention.  However, with the hold time in get_swap_pages so much
      reduced, I've not yet found a load and set of swap device priorities to show
      even swap_duplicate benefitting from the split.  Certainly the split is mere
      overhead in the common case of a single swap device.
      
      So, replace swap_list_lock and swap_device_lock by spinlock_t swap_lock
      (generally we seem to prefer an _ in the name, and not hide in a macro).
      
      If someone can show a regression in swap_duplicate, then probably we should
      add a hashlock for the swap_map entries alone (shorts being anatomic), so as
      to help the case of the single swap device too.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      5d337b91
    • Hugh Dickins's avatar
      [PATCH] swap: scan_swap_map latency breaks · 048c27fd
      Hugh Dickins authored
      The get_swap_page/scan_swap_map latency can be so bad that even those without
      preemption configured deserve relief: periodically cond_resched.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      048c27fd
    • Hugh Dickins's avatar
      [PATCH] swap: scan_swap_map drop swap_device_lock · 52b7efdb
      Hugh Dickins authored
      get_swap_page has often shown up on latency traces, doing lengthy scans while
      holding two spinlocks.  swap_list_lock is already dropped, now scan_swap_map
      drop swap_device_lock before scanning the swap_map.
      
      While scanning for an empty cluster, don't worry that racing tasks may
      allocate what was free and free what was allocated; but when allocating an
      entry, check it's still free after retaking the lock.  Avoid dropping the lock
      in the expected common path.  No barriers beyond the locks, just let the
      cookie crumble; highest_bit limit is volatile, but benign.
      
      Guard against swapoff: must check SWP_WRITEOK before allocating, must raise
      SWP_SCANNING reference count while in scan_swap_map, swapoff wait for that to
      fall - just use schedule_timeout, we don't want to burden scan_swap_map
      itself, and it's very unlikely that anyone can really still be in
      scan_swap_map once swapoff gets this far.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      52b7efdb
    • Hugh Dickins's avatar
      [PATCH] swap: scan_swap_map restyled · 7dfad418
      Hugh Dickins authored
      Rewrite scan_swap_map to allocate in just the same way as before (taking the
      next free entry SWAPFILE_CLUSTER-1 times, then restarting at the lowest wholly
      empty cluster, falling back to lowest entry if none), but with a view towards
      dropping the lock in the next patch.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      7dfad418
    • Hugh Dickins's avatar
      [PATCH] swap: get_swap_page drop swap_list_lock · fb4f88dc
      Hugh Dickins authored
      Rewrite get_swap_page to allocate in just the same sequence as before, but
      without holding swap_list_lock across its scan_swap_map.  Decrement
      nr_swap_pages and update swap_list.next in advance, while still holding
      swap_list_lock.  Skip full devices by testing highest_bit.  Swapoff hold
      swap_device_lock as well as swap_list_lock to clear SWP_WRITEOK.  Reduces lock
      contention when there are parallel swap devices of the same priority.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      fb4f88dc
    • Hugh Dickins's avatar
      [PATCH] swap: freeing update swap_list.next · 89d09a2c
      Hugh Dickins authored
      This makes negligible difference in practice: but swap_list.next should not be
      updated to a higher prio in the general helper swap_info_get, but rather in
      swap_entry_free; and then only in the case when entry is actually freed.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      89d09a2c
    • Hugh Dickins's avatar
      [PATCH] swap: swap unsigned int consistency · 6eb396dc
      Hugh Dickins authored
      The swap header's unsigned int last_page determines the range of swap pages,
      but swap_info has been using int or unsigned long in some cases: use unsigned
      int throughout (except, in several places a local unsigned long is useful to
      avoid overflows when adding).
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarJens Axboe <axboe@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      6eb396dc
    • Hugh Dickins's avatar
      [PATCH] swap: show span of swap extents · 53092a74
      Hugh Dickins authored
      The "Adding %dk swap" message shows the number of swap extents, as a guide to
      how fragmented the swapfile may be.  But a useful further guide is what total
      extent they span across (sometimes scarily large).
      
      And there's no need to keep nr_extents in swap_info: it's unused after the
      initial message, so save a little space by keeping it on stack.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      53092a74
    • Hugh Dickins's avatar
      [PATCH] swap: swap extent list is ordered · 11d31886
      Hugh Dickins authored
      There are several comments that swap's extent_list.prev points to the lowest
      extent: that's not so, it's extent_list.next which points to it, as you'd
      expect.  And a couple of loops in add_swap_extent which go all the way through
      the list, when they should just add to the other end.
      
      Fix those up, and let map_swap_page search the list forwards: profiles shows
      it to be twice as quick that way - because prefetch works better on how the
      structs are typically kmalloc'ed?  or because usually more is written to than
      read from swap, and swap is allocated ascendingly?
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      11d31886
    • Hugh Dickins's avatar
      [PATCH] swap: move destroy_swap_extents calls · 4cd3bb10
      Hugh Dickins authored
      sys_swapon's call to destroy_swap_extents on failure is made after the final
      swap_list_unlock, which is faintly unsafe: another sys_swapon might already be
      setting up that swap_info_struct.  Calling it earlier, before taking
      swap_list_lock, is safe.  sys_swapoff's call to destroy_swap_extents was safe,
      but likewise move it earlier, before taking the locks (once try_to_unuse has
      completed, nothing can be needing the swap extents).
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      4cd3bb10
    • Hugh Dickins's avatar
      [PATCH] swap: correct swapfile nr_good_pages · e2244ec2
      Hugh Dickins authored
      If a regular swapfile lies on a filesystem whose blocksize is less than
      PAGE_SIZE, then setup_swap_extents may have to cut the number of usable swap
      pages; but sys_swapon's nr_good_pages was not expecting that.  Also,
      setup_swap_extents takes no account of badpages listed in the swap header: not
      worth doing so, but ensure nr_badpages is 0 for a regular swapfile.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      e2244ec2
    • Hugh Dickins's avatar
      [PATCH] swap: update swapfile i_sem comment · b0d9bcd4
      Hugh Dickins authored
      Update swap extents comment: nowadays we guard with S_SWAPFILE not i_sem.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      b0d9bcd4
    • Stephen Rothwell's avatar
      [PATCH] mm: consolidate get_order · fd4fd5aa
      Stephen Rothwell authored
      Someone mentioned that almost all the architectures used basically the same
      implementation of get_order.  This patch consolidates them into
      asm-generic/page.h and includes that in the appropriate places.  The
      exceptions are ia64 and ppc which have their own (presumably optimised)
      versions.
      Signed-off-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      fd4fd5aa
    • Dave Hansen's avatar
      [PATCH] sparsemem extreme: hotplug preparation · 28ae55c9
      Dave Hansen authored
      This splits up sparse_index_alloc() into two pieces.  This is needed
      because we'll allocate the memory for the second level in a different place
      from where we actually consume it to keep the allocation from happening
      underneath a lock
      Signed-off-by: default avatarDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarBob Picco <bob.picco@hp.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      28ae55c9
    • Bob Picco's avatar
      [PATCH] sparsemem extreme implementation · 3e347261
      Bob Picco authored
      With cleanups from Dave Hansen <haveblue@us.ibm.com>
      
      SPARSEMEM_EXTREME makes mem_section a one dimensional array of pointers to
      mem_sections.  This two level layout scheme is able to achieve smaller
      memory requirements for SPARSEMEM with the tradeoff of an additional shift
      and load when fetching the memory section.  The current SPARSEMEM
      implementation is a one dimensional array of mem_sections which is the
      default SPARSEMEM configuration.  The patch attempts isolates the
      implementation details of the physical layout of the sparsemem section
      array.
      
      SPARSEMEM_EXTREME requires bootmem to be functioning at the time of
      memory_present() calls.  This is not always feasible, so architectures
      which do not need it may allocate everything statically by using
      SPARSEMEM_STATIC.
      Signed-off-by: default avatarAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarBob Picco <bob.picco@hp.com>
      Signed-off-by: default avatarDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      3e347261
    • Bob Picco's avatar
      [PATCH] SPARSEMEM EXTREME · 802f192e
      Bob Picco authored
      A new option for SPARSEMEM is ARCH_SPARSEMEM_EXTREME.  Architecture
      platforms with a very sparse physical address space would likely want to
      select this option.  For those architecture platforms that don't select the
      option, the code generated is equivalent to SPARSEMEM currently in -mm.
      I'll be posting a patch on ia64 ml which uses this new SPARSEMEM feature.
      
      ARCH_SPARSEMEM_EXTREME makes mem_section a one dimensional array of
      pointers to mem_sections.  This two level layout scheme is able to achieve
      smaller memory requirements for SPARSEMEM with the tradeoff of an
      additional shift and load when fetching the memory section.  The current
      SPARSEMEM -mm implementation is a one dimensional array of mem_sections
      which is the default SPARSEMEM configuration.  The patch attempts isolates
      the implementation details of the physical layout of the sparsemem section
      array.
      
      ARCH_SPARSEMEM_EXTREME depends on 64BIT and is by default boolean false.
      
      I've boot tested under aim load ia64 configured for ARCH_SPARSEMEM_EXTREME.
       I've also boot tested a 4 way Opteron machine with !ARCH_SPARSEMEM_EXTREME
      and tested with aim.
      Signed-off-by: default avatarAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: default avatarBob Picco <bob.picco@hp.com>
      Signed-off-by: default avatarDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      802f192e