1. 09 Jan, 2006 40 commits
    • [PATCH] relayfs: remove unused alloc/destroy_inode() · aaea25d7
      Tom Zanussi authored
      Since we're no longer using relayfs_inode_info, remove relayfs_alloc_inode()
      and relayfs_destroy_inode() along with the relayfs inode cache.
      Signed-off-by: Tom Zanussi <zanussi@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] relayfs: use generic_ip for private data · 51008f9f
      Tom Zanussi authored
      Use inode->u.generic_ip instead of relayfs_inode_info to store pointer to user
      data.  Clients using relayfs_file_create() to create their own files would
      probably more expect their data to be stored in generic_ip; we also intend in
      the next set of patches to get rid of relayfs-specific stuff in the file
      operations, so we might as well do it here.
      Signed-off-by: Tom Zanussi <zanussi@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] relayfs: add relayfs_remove_file() · 74317337
      Tom Zanussi authored
      This patch adds and exports relayfs_remove_file(), for API symmetry (with
      relayfs_create_file()).
      Signed-off-by: Tom Zanussi <zanussi@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] relayfs: export relayfs_create_file() with fileops param · 907f2c77
      Tom Zanussi authored
      This patch adds a mandatory fileops param to relayfs_create_file() and exports
      that function so that clients can use it to create files defined by their own
      set of file operations, in relayfs.  The purpose is to allow relayfs
      applications to create their own set of 'control' files alongside their relay
      files in relayfs rather than having to create them in /proc or debugfs for
      instance.  relayfs_create_file() is also used by relay_open_buf() to create
      the relay files for a channel.  In this case, a pointer to
      relayfs_file_operations is passed in, along with a pointer to the buffer
      associated with the file.
      Signed-off-by: Tom Zanussi <zanussi@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
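The idea of creating a file from a caller-supplied set of file operations plus a private data pointer can be sketched in user space. This is only a model under stated assumptions: every `model_*` name below is hypothetical and is not the relayfs API, and the "file" is a plain struct rather than an inode.

```c
#include <stddef.h>
#include <string.h>

struct model_file;

/* Hypothetical operations table supplied by the client. */
struct model_file_ops {
    long (*read)(struct model_file *f, char *buf, size_t len);
};

/* Hypothetical file object: ops plus an opaque per-file data pointer. */
struct model_file {
    const char *name;
    const struct model_file_ops *ops;
    void *private_data;
};

/* Create a file; the ops argument is mandatory, as in the commit text. */
static int model_create_file(struct model_file *out, const char *name,
                             const struct model_file_ops *ops, void *data)
{
    if (!ops)
        return -1;
    out->name = name;
    out->ops = ops;
    out->private_data = data;
    return 0;
}

/* Example 'control file' handler: reports a client-owned counter. */
static long model_counter_read(struct model_file *f, char *buf, size_t len)
{
    int *counter = f->private_data;

    if (len < sizeof(*counter))
        return -1;
    memcpy(buf, counter, sizeof(*counter));
    return (long)sizeof(*counter);
}

static const struct model_file_ops model_counter_ops = { model_counter_read };
```

A client would pass `&model_counter_ops` and a pointer to its own counter when creating the file, and the read handler then recovers that pointer from `private_data`.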
    • [PATCH] relayfs: decouple buffer creation from inode creation · 6625b861
      Tom Zanussi authored
      The patch series implements or fixes 3 things that were specifically requested
      or suggested by relayfs users:
      
      - support for non-relay files (patches 1-6)
      
      Currently, the relayfs API only supports the creation of directories
      (relayfs_create_dir()) and relay files (relay_open()).  These patches add
      support for non-relay files (relayfs_create_file()).  This is so relayfs
      applications can create 'control files' in relayfs itself rather than in /proc
      or via a netlink channel, as is currently done in the relay-app examples.
      Basically what this amounts to is exporting relayfs_create_file() with an
      additional file_ops param that clients can use to supply file operations for
      their own special-purpose files in relayfs.
      
      - make exported relay file ops useful (patches 7-8)
      
      The relayfs relay_file_operations have always been exported, the intent being
      to make it possible to create relay files in other filesystems such as
      debugfs.  The problem, though, is that currently the file operations are too
      tightly coupled to relayfs to actually be used for this purpose.  This patch
      fixes that by adding a couple of callback functions that allow a client to
      hook into relay_open()/close() and supply the files that will be used to
      represent the channel buffers; the default implementation if no callbacks are
      defined is to create the files in relayfs.
      
      - add an option to create global relay buffer (patches 9-10)
      
      The file creation callback also supplies an optional param, is_global, that
      can be used by clients to create a single global relayfs buffer instead of
      the default per-cpu buffers.  This was suggested as being useful for certain
      debugging applications where it's more convenient to be able to get all the
      data from a single channel without having to go to the bother of dealing
      with per-cpu files.
      
      - cleanup, some renaming and Documentation updates (patches 11-12)
      
      There were several comments that the use of netlink in the example code was
      non-intuitive and in fact the whole relay-app business was needlessly
      confusing.  Based on that feedback, the example code has been completely
      converted over to relayfs control files as supported by this patch, and has
      also been made completely self-contained.
      
      The converted examples along with a couple of new examples that demonstrate
      using exported relay files can be found in the relay-apps tarball:
      http://prdownloads.sourceforge.net/relayfs/relay-apps-0.9.tar.gz?download
      
      This patch:
      
      Separate buffer create/destroy from inode create/destroy.  We want to be able
      to associate other data and not just relay buffers with inodes.  Buffer
      create/destroy is moved out of inode.c and into relayfs core code.
      Signed-off-by: Tom Zanussi <zanussi@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
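The file-creation callback with its optional is_global out-parameter, as described in the series overview, might behave roughly like the following user-space model. The names, the fixed CPU count and the callback signature are all assumptions made for illustration; they are not the relayfs interface.

```c
#define MODEL_NR_CPUS 4   /* assumed CPU count for the model */

/* Hypothetical callback: create the file backing one buffer.
 * Setting *is_global to nonzero asks for a single shared buffer. */
typedef int (*model_create_buf_file_t)(const char *name, int cpu,
                                       int *is_global);

struct model_chan {
    int nbufs;   /* number of buffers actually created */
};

/* Open a channel: one buffer per CPU unless the callback says 'global'.
 * A NULL callback stands for the default behaviour (per-cpu files). */
static int model_relay_open(struct model_chan *chan, const char *base,
                            model_create_buf_file_t create)
{
    int cpu;

    chan->nbufs = 0;
    for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
        int is_global = 0;

        if (create && create(base, cpu, &is_global) != 0)
            return -1;
        chan->nbufs++;
        if (is_global)
            break;   /* one buffer serves every CPU */
    }
    return 0;
}

/* Client callback requesting a single global buffer. */
static int model_global_cb(const char *name, int cpu, int *is_global)
{
    (void)name;
    (void)cpu;
    *is_global = 1;
    return 0;
}

/* Client callback keeping the default per-cpu layout. */
static int model_percpu_cb(const char *name, int cpu, int *is_global)
{
    (void)name;
    (void)cpu;
    (void)is_global;
    return 0;
}
```

With `model_global_cb` the channel ends up with one buffer; with `model_percpu_cb` or no callback it ends up with one per modelled CPU.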
    • [PATCH] ipc: expand shm_flags · b33291c0
      Andrew Morton authored
      Unobfuscate this struct member
      
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] ELF: symbol table type additions · b3f3d614
      Jan Beulich authored
      Needed for the Novell kernel debugger and perhaps some per-cpu data on x86_64
      in the future.
      
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] rcu file: use atomic primitives · 095975da
      Nick Piggin authored
      Use atomic_inc_not_zero for rcu files instead of special case rcuref.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
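An "increment unless zero" operation of the kind named above can be approximated in user space with C11 atomics. This is a sketch of the general technique, not the kernel's atomic_inc_not_zero(); the function name is made up for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the object is still live (count != 0).
 * Returns true when the count was incremented. */
static bool model_inc_not_zero(atomic_int *count)
{
    int old = atomic_load(count);

    /* Retry until we either observe zero or win the compare-and-swap. */
    while (old != 0) {
        /* On failure, 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_weak(count, &old, old + 1))
            return true;
    }
    return false;
}
```

A lookup that races with the final release of an object can use this to refuse objects whose count has already reached zero.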
    • [PATCH] atomic: dec_and_lock use atomic primitives · a57004e1
      Nick Piggin authored
      Convert atomic_dec_and_lock to use new atomic primitives.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
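One way to build a "decrement, and take the lock only if the count hits zero" helper on top of an add-unless primitive is sketched below with C11 atomics. The lock is a toy spinlock and all names are hypothetical; this illustrates the pattern and is not the kernel source.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy spinlock standing in for a real lock. */
struct model_lock {
    atomic_flag held;
};

static void model_lock_acquire(struct model_lock *l)
{
    while (atomic_flag_test_and_set(&l->held)) {
        /* spin */
    }
}

static void model_lock_release(struct model_lock *l)
{
    atomic_flag_clear(&l->held);
}

/* Add 'delta' unless the counter equals 'unless'; true if it was added. */
static bool model_add_unless(atomic_int *v, int delta, int unless)
{
    int old = atomic_load(v);

    while (old != unless) {
        if (atomic_compare_exchange_weak(v, &old, old + delta))
            return true;
    }
    return false;
}

/* Returns true, with the lock held, only when the count dropped to zero. */
static bool model_dec_and_lock(atomic_int *v, struct model_lock *l)
{
    /* Fast path: the count was not 1, so it cannot reach zero here. */
    if (model_add_unless(v, -1, 1))
        return false;

    /* Slow path: the final decrement must happen under the lock. */
    model_lock_acquire(l);
    if (atomic_fetch_sub(v, 1) == 1)
        return true;
    model_lock_release(l);
    return false;
}
```

The fast path avoids touching the lock at all in the common case where other references remain.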
    • [PATCH] pktcdvd: Use bd_claim to get exclusive access · 8382bf2e
      Peter Osterlund authored
      Use bd_claim() when opening the cdrom device to prevent user space programs
      such as cdrecord, hald and kded from interfering with the burning process.
      Signed-off-by: Peter Osterlund <petero2@telia.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] kernel/: small cleanups · 97a41e26
      Adrian Bunk authored
      This patch contains the following cleanups:
      - make needlessly global functions static
      - every file should include the headers containing the prototypes for
        its global functions
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] drivers/isdn/: "extern inline" -> "static inline" · b7b4d7a4
      Adrian Bunk authored
      "extern inline" -> "static inline"
      
      Since there's no pullphone() function this patch removes the dead
      prototype.
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Acked-by: Karsten Keil <kkeil@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] move rtc_interrupt() prototype to rtc.h · 2a10e0b2
      Adrian Bunk authored
      This patch moves the rtc_interrupt() prototype to rtc.h and removes the
      prototypes from C files.
      
      It also renames static rtc_interrupt() functions in
      arch/arm/mach-integrator/time.c and arch/sh64/kernel/time.c to avoid compile
      problems.
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Signed-off-by: Paul Gortmaker <p_gortmaker@yahoo.com>
      Acked-by: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Fix and add EXPORT_SYMBOL(filemap_write_and_wait) · 28fd1298
      OGAWA Hirofumi authored
      This patch adds EXPORT_SYMBOL(filemap_write_and_wait) and uses it.
      
      See mm/filemap.c:
      
      It also changes filemap_write_and_wait() and filemap_write_and_wait_range().
      
      The current filemap_write_and_wait() doesn't wait if filemap_fdatawrite()
      returns an error.  However, even if filemap_fdatawrite() returned an
      error, it may have submitted part of the data pages to the device
      (e.g. in the case of -ENOSPC).
      
      <quotation>
      Andrew Morton writes,
      
      If filemap_fdatawrite() returns an error, this might be due to some
      I/O problem: dead disk, unplugged cable, etc.  Given the generally
      crappy quality of the kernel's handling of such exceptions, there's a
      good chance that the filemap_fdatawait() will get stuck in D state
      forever.
      </quotation>
      
      So, this patch skips the wait only if filemap_fdatawrite() returns -EIO.
      
      Trond, could you please review the nfs part?  In particular, I'm not sure
      whether nfs must use "filemap_fdatawrite(inode->i_mapping) == 0" or not.
      Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
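The error handling described for filemap_write_and_wait() (wait after a failed write unless the failure was -EIO) can be modelled in user space as follows. The struct and the `model_*` helpers are hypothetical stand-ins that return canned results; only the control flow is the point.

```c
#include <errno.h>

/* Hypothetical stand-in for a mapping: canned results for each step. */
struct model_mapping {
    int nrpages;        /* nonzero if there is anything to write */
    int write_result;   /* what the 'start writeback' step returns */
    int wait_result;    /* what the 'wait for writeback' step returns */
    int waited;         /* set to 1 when the wait step ran */
};

static int model_fdatawrite(struct model_mapping *m)
{
    return m->write_result;
}

static int model_fdatawait(struct model_mapping *m)
{
    m->waited = 1;
    return m->wait_result;
}

static int model_write_and_wait(struct model_mapping *m)
{
    int err = 0;

    if (m->nrpages) {
        err = model_fdatawrite(m);
        /*
         * Even after an error such as -ENOSPC, some pages may already
         * have been submitted, so still wait for them.  Skip the wait
         * only on -EIO, where waiting could hang on a dead device.
         */
        if (err != -EIO) {
            int err2 = model_fdatawait(m);

            if (!err)
                err = err2;   /* report the first error seen */
        }
    }
    return err;
}
```

The first error wins: a write error is reported even if the subsequent wait succeeds.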
    • [PATCH] fat: support a truncate() for expanding size (generic_cont_expand) · 05eb0b51
      OGAWA Hirofumi authored
      This patch changes generic_cont_expand(), in order to share the code
      with fatfs.
      
        - Use vmtruncate() if ->prepare_write() returns an error.
      
      Even if ->prepare_write() returns an error, it may already have added some
      blocks.  So, this truncates blocks outside of ->i_size by vmtruncate().
      
        - Add generic_cont_expand_simple().
      
      The generic_cont_expand_simple() assumes that ->prepare_write() can handle
      the block boundary.  With this, we don't need to care about the extra byte.
      
      And for expanding a file size by truncate(), fatfs uses the
      added generic_cont_expand_simple().
      Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] export/change sync_page_range/_nolock() · 268fc16e
      OGAWA Hirofumi authored
      This exports sync_page_range() and sync_page_range_nolock(), which fatfs
      needs for expanding truncate, and changes their "size_t count" parameter
      to "loff_t count".
      Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] fat: support ->direct_IO() · e5174baa
      OGAWA Hirofumi authored
      This patch adds support for ->direct_IO(), mostly for reads.
      
      The users of this seem to want it for streaming reads.  So the current
      direct I/O has a limitation: writes can only overwrite existing data.  (For
      other write operations, we mainly need to handle holes etc.)
      Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] fat: s/EXPORT_SYMBOL/EXPORT_SYMBOL_GPL/ · 7c709d00
      OGAWA Hirofumi authored
      All EXPORT_SYMBOLs in fatfs are only for vfat/msdos, so _GPL would be proper.
      Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • OGAWA Hirofumi's avatar
      a5425d29
    • [PATCH] fat: use sb_find_get_block() instead of sb_getblk() · 83b7c996
      OGAWA Hirofumi authored
      We don't need to allocate a buffer just to check whether the buffer is
      uptodate.  This uses sb_find_get_block() instead; if it returns NULL, the
      buffer is not uptodate.
      Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] fat: move fat_clusters_flush() to write_super() · a6bf6b21
      OGAWA Hirofumi authored
      It is overkill to update the FS_INFO whenever modifying
      prev_free/free_clusters, because those are just a hint.
      
      So, this patch uses ->write_super() for updating FS_INFO instead.
      Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
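Deferring a hint update to a periodic write-back hook, rather than flushing on every change, follows a simple dirty-flag pattern. The sketch below models that pattern in user space; the struct, field names and flush counter are assumptions for illustration and do not mirror the fat driver's data structures.

```c
/* Hypothetical superblock model holding the free-space hints. */
struct model_sb {
    unsigned free_clusters;
    unsigned prev_free;
    int dirty;              /* hints changed since the last flush */
    unsigned flush_count;   /* how many times the hints hit 'disk' */
};

/* Update the in-memory hints and just mark them dirty. */
static void model_set_hints(struct model_sb *sb, unsigned free_clusters,
                            unsigned prev_free)
{
    sb->free_clusters = free_clusters;
    sb->prev_free = prev_free;
    sb->dirty = 1;
}

/* Periodic write-back hook: flush once, however many updates occurred. */
static void model_write_super(struct model_sb *sb)
{
    if (!sb->dirty)
        return;
    sb->flush_count++;   /* stands in for writing the info sector */
    sb->dirty = 0;
}
```

Because the values are only hints, losing the latest update on a crash is acceptable, which is what makes batching safe here.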
    • [PATCH] IRQ type flags · 9ded96f2
      Russell King authored
      Some ARM platforms have the ability to program the interrupt controller to
      detect various interrupt edges and/or levels.  For some platforms, this is
      critical to set up correctly, particularly those where the setting depends
      on the device.
      
      Currently, ARM drivers do (eg) the following:
      
      	err = request_irq(irq, ...);
      
      	set_irq_type(irq, IRQT_RISING);
      
      However, if the interrupt has previously been programmed to be level sensitive
      (for whatever reason) then this will cause an interrupt storm.
      
      Hence, if we combine set_irq_type() with request_irq(), we can then safely set
      the type prior to unmasking the interrupt.  The unfortunate problem is that in
      order to support this, these flags need to be visible outside of the ARM
      architecture - drivers such as smc91x need these flags and they're
      cross-architecture.
      
      Finally, the SA_TRIGGER_* flag passed to request_irq() should reflect the
      property that the device would like.  The IRQ controller code should do its
      best to select the most appropriate supported mode.
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
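The ordering problem described above (unmasking before the trigger type is programmed) can be shown with a small user-space model of an interrupt line. Everything here is hypothetical: the enum values, the struct and the `storm` flag are illustrative and are not the kernel's flags or interrupt controller code.

```c
/* Hypothetical trigger types for the model. */
enum model_trigger {
    MODEL_LEVEL_LOW,
    MODEL_EDGE_RISING
};

struct model_irq {
    enum model_trigger type;   /* currently programmed trigger */
    int unmasked;
    int storm;                 /* set if unmasking caused a storm */
    int line_low;              /* electrical state of the line */
};

static void model_set_type(struct model_irq *irq, enum model_trigger t)
{
    irq->type = t;
}

/* Unmasking a level-low interrupt while the line is low fires at once. */
static void model_unmask(struct model_irq *irq)
{
    irq->unmasked = 1;
    if (irq->type == MODEL_LEVEL_LOW && irq->line_low)
        irq->storm = 1;
}

/* Old sequence: request (unmask) first, program the type afterwards. */
static void model_request_then_set_type(struct model_irq *irq,
                                        enum model_trigger t)
{
    model_unmask(irq);
    model_set_type(irq, t);
}

/* Combined sequence: the type is programmed before the unmask. */
static void model_request_with_type(struct model_irq *irq,
                                    enum model_trigger t)
{
    model_set_type(irq, t);
    model_unmask(irq);
}
```

Starting from a line left in level-low mode, only the combined sequence avoids the spurious storm in this model.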
    • [PATCH] new driver synclink_gt · 705b6c7b
      Paul Fulghum authored
      New character device driver for the SyncLink GT and SyncLink AC families of
      synchronous and asynchronous serial adapters
      Signed-off-by: Paul Fulghum <paulkf@microgate.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] fix more missing includes · de25968c
      Tim Schmielau authored
      Include fixes for 2.6.14-git11.  Should allow removing sched.h from
      module.h on i386, x86_64, arm, ia64, ppc, ppc64, and s390.  Probably more
      to come since I haven't yet checked the other archs.
      Signed-off-by: Tim Schmielau <tim@physik3.uni-rostock.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpuset: skip rcu check if task is in root cpuset · 03a285f5
      Paul Jackson authored
      For systems that aren't using cpusets, but have CONFIG_CPUSET enabled in
      their kernel (eventually this may be most distribution kernels), this patch
      removes even the minimal rcu_read_lock() from the memory page allocation path.
      
      Actually, it removes that rcu call for any task that is in the root cpuset
      (top_cpuset), which on systems not actively using cpusets, is all tasks.
      
      We don't need the rcu check for tasks in the top_cpuset, because the
      top_cpuset is statically allocated, so it is at no risk of being freed out from
      underneath us.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpuset: mark number_of_cpusets read_mostly · 7edc5962
      Paul Jackson authored
      Mark cpuset global 'number_of_cpusets' as __read_mostly.
      
      This global is accessed every time a zone is considered in the zonelist loops
      beneath __alloc_pages, looking for a free memory page.  If number_of_cpusets
      is just one, then we can short circuit the mems_allowed check.
      
      Since this global is read a lot on a hot path, and written rarely, it is an
      excellent candidate for __read_mostly.
      
      Thanks to Christoph Lameter for the suggestion.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpuset: use rcu directly optimization · 6b9c2603
      Paul Jackson authored
      Optimize the cpuset impact on page allocation, the most performance critical
      cpuset hook in the kernel.
      
      On each page allocation, the cpuset hook needs to check for a possible change
      in the current task's cpuset.  It can now handle the common case, of no change,
      without taking any spinlock or semaphore, thanks to RCU.
      
      Convert a spinlock on the current task to an rcu_read_lock(), saving
      approximately a memory barrier and an atomic op, depending on architecture.
      
      This is done by adding rcu_assign_pointer() and synchronize_rcu() calls to the
      write side of the task->cpuset pointer, in cpuset.c:attach_task(), to delay
      freeing up a detached cpuset until after any critical sections referencing
      that pointer.
      
      Thanks to Andi Kleen, Nick Piggin and Eric Dumazet for ideas.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpuset: remove test for null cpuset from alloc code path · c417f024
      Paul Jackson authored
      Remove a couple more lines of code from the cpuset hooks in the page
      allocation code path.
      
      There was a check for a NULL cpuset pointer in the routine
      cpuset_update_task_memory_state() that was only needed during system boot,
      after the memory subsystem was initialized, before the cpuset subsystem was
      initialized, to catch a NULL task->cpuset pointer.
      
      Add a cpuset_init_early() routine, just before the mem_init() call in
      init/main.c, that sets up just enough of the init task's cpuset structure to
      render cpuset_update_task_memory_state() calls harmless.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpuset: migrate all tasks in cpuset at once · 04c19fa6
      Paul Jackson authored
      Given the mechanism in the previous patch to handle rebinding the per-vma
      mempolicies of all tasks in a cpuset that changes its memory placement, it is
      now easier to handle the page migration requirements of such tasks at the same
      time.
      
      The previous code didn't actually attempt to migrate the pages of the tasks in
      a cpuset whose memory placement changed until the next time each such task
      tried to allocate memory.  This was undesirable, as users invoking memory page
      migration expected it to happen when the placement changed, not some unspecified
      time later when the task needed more memory.
      
      It is now trivial to handle the page migration at the same time as the per-vma
      rebinding is done.
      
      The routine cpuset.c:update_nodemask(), which handles changing a cpuset's
      memory placement ('mems') now checks for the special case of being asked to
      write a placement that is the same as before.  It was harmless enough before
      to just recompute everything again, even though nothing had changed.  But page
      migration is a heavyweight operation - moving pages about.  So now it is
      worth avoiding that if asked to move a cpuset to its current location.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpuset: rebind vma mempolicies fix · 4225399a
      Paul Jackson authored
      Fix more of a longstanding bug in cpuset/mempolicy interaction.
      
      NUMA mempolicies (mm/mempolicy.c) are constrained by the current task's cpuset
      to just the Memory Nodes allowed by that cpuset.  The kernel maintains
      internal state for each mempolicy, tracking what nodes are used for the
      MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies.
      
      When a task's cpuset memory placement changes, whether because the cpuset
      changed, or because the task was attached to a different cpuset, then the
      task's mempolicies have to be rebound to the new cpuset placement, so as to
      preserve the cpuset-relative numbering of the nodes in that policy.
      
      An earlier fix handled such mempolicy rebinding for mempolicies attached to a
      task.
      
      This fix rebinds mempolicies attached to vma's (address ranges in a task's
      address space).  Due to the need to hold the task->mm->mmap_sem semaphore while
      updating vma's, the rebinding of vma mempolicies has to be done when the
      cpuset memory placement is changed, at which time mmap_sem can be safely
      acquired.  The task's mempolicy is rebound later, when the task next attempts
      to allocate memory and notices that its task->cpuset_mems_generation is
      out-of-date with its cpuset's mems_generation.
      
      Because walking the tasklist to find all tasks attached to a changing cpuset
      requires holding tasklist_lock, a spinlock, one cannot update the vma's of the
      affected tasks while doing the tasklist scan.  In general, one cannot acquire
      a semaphore (which can sleep) while already holding a spinlock (such as
      tasklist_lock).  So a list of mm references has to be built up during the
      tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem
      acquired, and the vma's in that mm rebound.
      
      Once the tasklist lock is dropped, affected tasks may fork new tasks, before
      their mm's are rebound.  A kernel global 'cpuset_being_rebound' is set to
      point to the cpuset being rebound (there can only be one; cpuset modifications
      are done under a global 'manage_sem' semaphore), and the mpol_copy code that
      is used to copy a task's mempolicies during fork catches such forking tasks,
      and ensures their children are also rebound.
      
      When a task is moved to a different cpuset, it is easier, as there is only one
      task involved.  Its mm->vma's are scanned, using the same
      mpol_rebind_policy() as used above.
      
      It may happen that both the mpol_copy hook and the update done via the
      tasklist scan update the same mm twice.  This is ok, as the mempolicies of
      each vma in an mm keep track of what mems_allowed they are relative to, and
      safely no-op a second request to rebind to the same nodes.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpuset: number_of_cpusets optimization · 202f72d5
      Paul Jackson authored
      Easy little optimization hack to avoid actually having to call
      cpuset_zone_allowed() and check mems_allowed, in the main page allocation
      routine, __alloc_pages().  This saves several CPU cycles per page allocation
      on systems not using cpusets.
      
      A counter is updated each time a cpuset is created or removed, and whenever
      there is only one cpuset in the system, it must be the root cpuset, which
      contains all CPUs and all Memory Nodes.  In that case, when the counter is
      one, all allocations are allowed.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
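The counter-based short circuit described in this commit can be sketched in user space. The names, the 32-bit node mask and the slow-path counter are assumptions for the model; the real check operates on zones and kernel node masks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of the global counter; the root cpuset always exists. */
static int model_number_of_cpusets = 1;

/* Counts how often the expensive check actually ran. */
static unsigned model_slow_checks;

/* Slow path: consult the task's allowed-nodes mask (node must be < 32). */
static bool model_zone_allowed_slow(uint32_t mems_allowed, int node)
{
    model_slow_checks++;
    return ((mems_allowed >> node) & 1u) != 0;
}

/* Fast path: with only the root cpuset, every allocation is allowed. */
static bool model_zone_allowed(uint32_t mems_allowed, int node)
{
    if (model_number_of_cpusets <= 1)
        return true;
    return model_zone_allowed_slow(mems_allowed, node);
}
```

The counter would be incremented and decremented wherever cpusets are created and removed, so the hot path pays only for one integer comparison when cpusets are unused.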
    • [PATCH] cpuset: numa_policy_rebind cleanup · 74cb2155
      Paul Jackson authored
      Cleanup, reorganize and make more robust the mempolicy.c code to rebind
      mempolicies relative to the containing cpuset after a task's memory placement
      changes.
      
      The real motivator for this cleanup patch is to lay more groundwork for the
      upcoming patch to correctly rebind NUMA mempolicies that are attached to vma's
      after the containing cpuset memory placement changes.
      
      NUMA mempolicies are constrained by the cpuset their task is a member of.
      When either (1) a task is moved to a different cpuset, or (2) the 'mems'
      mems_allowed of a cpuset is changed, then the NUMA mempolicies have embedded
      node numbers (for MPOL_BIND, MPOL_INTERLEAVE and MPOL_PREFERRED) that need to
      be recalculated, relative to their new cpuset placement.
      
      The old code used an unreliable method of determining what was the old
      mems_allowed constraining the mempolicy.  It just looked at the task's
      mems_allowed value.  This sort of worked with the present code, that just
      rebinds the -task- mempolicy, and leaves any -vma- mempolicies broken,
      referring to the old nodes.  But in an upcoming patch, the vma mempolicies
      will be rebound as well.  Then the order in which the various task and vma
      mempolicies are updated will no longer be deterministic, and one can no longer
      count on the task->mems_allowed holding the old value for as long as needed.
      It's not even clear if the current code was guaranteed to work reliably for
      task mempolicies.
      
      So I added a mems_allowed field to each mempolicy, stating exactly what
      mems_allowed the policy is relative to, and updated synchronously and reliably
      anytime that the mempolicy is rebound.
      
      Also removed a useless wrapper routine, numa_policy_rebind(), and had its
      caller, cpuset_update_task_memory_state(), call directly to the rewritten
      policy_rebind() routine, and made that rebind routine extern instead of
      static, and added a "mpol_" prefix to its name, making it
      mpol_rebind_policy().
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
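Recording in each policy the mems_allowed mask it is relative to makes a rebind self-describing, and makes a repeated rebind to the same mask a no-op. The following user-space sketch models that with 32-bit masks. It assumes the policy's nodes are a subset of the recorded mask and maps the n-th allowed node of the old mask to the n-th allowed node of the new one; the kernel's actual remapping rules may differ, and all names are hypothetical.

```c
#include <stdint.h>

struct model_policy {
    uint32_t nodes;          /* nodes the policy refers to */
    uint32_t mems_allowed;   /* mask those nodes are relative to */
};

/* Number of set bits in 'mask'. */
static int model_weight(uint32_t mask)
{
    int n = 0;

    for (; mask; mask >>= 1)
        n += (int)(mask & 1u);
    return n;
}

/* Number of set bits in 'mask' strictly below bit 'node'. */
static int model_ordinal(uint32_t mask, int node)
{
    return model_weight(mask & ((1u << node) - 1u));
}

/* Index of the n-th (0-based) set bit in 'mask', or -1 if none. */
static int model_nth_bit(uint32_t mask, int n)
{
    for (int i = 0; i < 32; i++) {
        if ((mask >> i) & 1u) {
            if (n == 0)
                return i;
            n--;
        }
    }
    return -1;
}

/* Rebind the policy to 'new_allowed', preserving relative numbering. */
static void model_rebind(struct model_policy *p, uint32_t new_allowed)
{
    uint32_t out = 0;
    int w = model_weight(new_allowed);

    /* Already relative to this mask (or nothing to map to): no-op. */
    if (p->mems_allowed == new_allowed || w == 0)
        return;

    for (int node = 0; node < 32; node++) {
        if (!((p->nodes >> node) & 1u))
            continue;
        int ord = model_ordinal(p->mems_allowed, node);
        out |= 1u << model_nth_bit(new_allowed, ord % w);
    }
    p->nodes = out;
    p->mems_allowed = new_allowed;
}
```

Because the old mask travels with the policy, the result does not depend on the order in which task and vma policies happen to be updated.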
    • [PATCH] cpuset: implement cpuset_mems_allowed · 909d75a3
      Paul Jackson authored
      Provide a cpuset_mems_allowed() method, which the sys_migrate_pages() code
      needs to obtain the mems_allowed vector of a cpuset, and replace the
      workaround in sys_migrate_pages() with a call to this new method.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] cpuset: combine refresh_mems and update_mems · cf2a473c
      Paul Jackson authored
      The important code paths through alloc_pages_current() and alloc_page_vma(),
      by which most kernel page allocations go, both called
      cpuset_update_current_mems_allowed(), which in turn called refresh_mems().
      -Both- of these latter two routines did a tasklock, got the task's cpuset
      pointer, and checked for out of date cpuset->mems_generation.
      
      That was a silly duplication of code and waste of CPU cycles on an important
      code path.
      
      Consolidated those two routines into a single routine, called
      cpuset_update_task_memory_state(), since it updates more than just
      mems_allowed.
      
      Changed all callers of either routine to call the new consolidated routine.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
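A single consolidated "is my cached state out of date?" routine based on a generation counter might look like the user-space model below. Locking is left out entirely, and the struct layouts and `model_*` names are assumptions for illustration only.

```c
#include <stdint.h>

struct model_cpuset {
    uint32_t mems_allowed;
    int mems_generation;      /* bumped on every placement change */
};

struct model_task {
    struct model_cpuset *cpuset;
    uint32_t mems_allowed;        /* task's cached copy */
    int cpuset_mems_generation;   /* generation the copy came from */
    unsigned refreshes;           /* how many times the copy was redone */
};

/* Writer side: change the placement and advance the generation. */
static void model_change_mems(struct model_cpuset *cs, uint32_t mems)
{
    cs->mems_allowed = mems;
    cs->mems_generation++;
}

/* One routine for every caller: refresh only when generations differ. */
static void model_update_task_memory_state(struct model_task *t)
{
    if (t->cpuset_mems_generation == t->cpuset->mems_generation)
        return;   /* common case: nothing changed */

    t->mems_allowed = t->cpuset->mems_allowed;
    t->cpuset_mems_generation = t->cpuset->mems_generation;
    t->refreshes++;
}
```

Calling the routine repeatedly is cheap: only the first call after a placement change does any copying.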
    • [PATCH] cpuset: fork hook fix · b4b26418
      Paul Jackson authored
      Fix obscure, never seen in real life, cpuset fork race.  The cpuset_fork()
      call in fork.c was setting up the correct task->cpuset pointer after the
      tasklist_lock was dropped, which briefly exposed the newly forked process with
      an unsafe (copied from parent without locks or usage counter increment) cpuset
      pointer.
      
      In theory, that exposed cpuset pointer could have been pointing at a cpuset
      that was already freed and removed, and in theory another task that had been
      sitting on the tasklist_lock waiting to scan the task list could have raced
      down the entire tasklist, found our new child at the far end, and dereferenced
      that bogus cpuset pointer.
      
      To fix, set up the correct cpuset pointer in the new child by calling
      cpuset_fork() before the new task is linked into the tasklist, and with that,
      add a fork failure case that releases the reference on that cpuset if the
      fork fails along the way after cpuset_fork() was called.
      
      Had to remove a BUG_ON() from cpuset_exit(), because it was no longer valid -
      the call to cpuset_exit() from a failed fork would not have PF_EXITING set.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b4b26418
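The ordering fix in the commit above can be sketched with a toy reference-counting model. Everything here is hypothetical (the struct names, the on_tasklist flag, and the *_model functions); it is not the code in fork.c. It only shows the shape of the fix: take the cpuset reference before the child becomes visible, and give it back if the fork fails afterwards.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model types; NOT the kernel's definitions. */
struct fork_cpuset { int count; };   /* usage counter */
struct fork_task {
    struct fork_cpuset *cpuset;
    int on_tasklist;                 /* 1 once other tasks can find us */
};

/* Give the child a counted reference to the parent's cpuset. */
void cpuset_fork_model(struct fork_task *child, const struct fork_task *parent)
{
    assert(parent->cpuset != NULL);
    child->cpuset = parent->cpuset;
    child->cpuset->count++;
}

/* Drop the reference; also used on the failed-fork path. */
void cpuset_exit_model(struct fork_task *tsk)
{
    tsk->cpuset->count--;
    tsk->cpuset = NULL;
}

/* fail_late simulates a fork failure after cpuset_fork_model() ran. */
int fork_model(const struct fork_task *parent, struct fork_task *child,
               int fail_late)
{
    child->on_tasklist = 0;
    cpuset_fork_model(child, parent);  /* before the child is visible */
    if (fail_late) {
        cpuset_exit_model(child);      /* failure path releases the ref */
        return -1;
    }
    child->on_tasklist = 1;            /* only now link into the tasklist */
    return 0;
}
```

Because the failure path calls the exit routine for a task that is not exiting, a check that the task is exiting would no longer hold there, which matches the BUG_ON() removal the commit mentions.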
    • [PATCH] cpuset: update_nodemask code reformat · 59dac16f
      Paul Jackson authored
      Restructure code layout of the kernel/cpuset.c update_nodemask() routine,
      removing embedded returns and nested if's in favor of goto completion labels.
      This is being done in anticipation of adding more logic to this routine, which
      will favor the goto style structure.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      59dac16f
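For readers unfamiliar with the "goto completion label" style the commit above refers to, here is a generic sketch. It is not the update_nodemask() code; update_mask_model(), its hex-string input, and the specific error codes are assumptions chosen for illustration. Each failure sets a return value and jumps to a single exit label where cleanup happens once.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: parse a hex node mask from buf into *mask.
 * All exits funnel through the "done" label, so cleanup lives in one place.
 */
int update_mask_model(unsigned long *mask, const char *buf)
{
    char *copy = NULL;
    char *end = NULL;
    unsigned long parsed;
    int retval;

    if (mask == NULL || buf == NULL) {
        retval = -EINVAL;
        goto done;
    }

    copy = malloc(strlen(buf) + 1);   /* working copy that needs cleanup */
    if (copy == NULL) {
        retval = -ENOMEM;
        goto done;
    }
    strcpy(copy, buf);

    parsed = strtoul(copy, &end, 16);
    assert(end != NULL);              /* strtoul always sets end here */
    if (end == copy || *end != '\0') {
        retval = -EINVAL;             /* not a clean hex number */
        goto done;
    }
    if (parsed == 0) {
        retval = -ENOSPC;             /* empty mask rejected in this sketch */
        goto done;
    }

    *mask = parsed;                   /* commit only after all checks pass */
    retval = 0;
done:
    free(copy);                       /* free(NULL) is a no-op */
    return retval;
}
```

Adding another validation step to a function shaped like this means adding one more check-and-goto, without touching the cleanup code.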
    • [PATCH] cpuset: minor spacing initializer fixes · c5b2aff8
      Paul Jackson authored
      Four trivial cpuset fixes: remove extra spaces, remove useless initializers,
      mark one __read_mostly.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c5b2aff8
    • [PATCH] cpuset: remove marker_pid documentation · 90c9cc40
      Paul Jackson authored
      Remove documentation for the cpuset 'marker_pid' feature, which was in the
      patch "cpuset: change marker for relative numbering".  That patch was
      previously pulled from *-mm at my (pj) request.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      90c9cc40
    • [PATCH] cpuset: document additional features · bd5e09cf
      Paul Jackson authored
      Document the additional cpuset features:
      	notify_on_release
      	marker_pid
      	memory_pressure
      	memory_pressure_enabled
      
      Rearrange and improve formatting of existing documentation for
      cpu_exclusive and mem_exclusive features.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      bd5e09cf
    • [PATCH] cpuset: memory pressure meter · 3e0d98b9
      Paul Jackson authored
      Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
      that the tasks in a cpuset call try_to_free_pages(), the synchronous
      (direct) memory reclaim code.
      
      This enables batch managers monitoring jobs running in dedicated cpusets to
      efficiently detect what level of memory pressure each job is causing.
      
      This is useful both on tightly managed systems running a wide mix of
      submitted jobs, which may choose to terminate or reprioritize jobs that are
      trying to use more memory than allowed on the nodes assigned to them, and with
      tightly coupled, long running, massively parallel scientific computing jobs
      that will dramatically fail to meet required performance goals if they
      start to use more memory than they are allowed.
      
      This patch just provides a very economical way for the batch manager to
      monitor a cpuset for signs of memory pressure.  It's up to the batch
      manager or other user code to decide what to do about it and take action.
      
      ==> Unless this feature is enabled by writing "1" to the special file
          /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
          code of __alloc_pages() for this metric reduces to simply noticing
          that the cpuset_memory_pressure_enabled flag is zero.  So only
          systems that enable this feature will compute the metric.
      
      Why a per-cpuset, running average:
      
          Because this meter is per-cpuset, rather than per-task or mm, the
          system load imposed by a batch scheduler monitoring this metric is
          sharply reduced on large systems, because a scan of the tasklist can be
          avoided on each set of queries.
      
          Because this meter is a running average, instead of an accumulating
          counter, a batch scheduler can detect memory pressure with a single
          read, instead of having to read and accumulate results for a period of
          time.
      
          Because this meter is per-cpuset rather than per-task or mm, the
          batch scheduler can obtain the key information, memory pressure in a
          cpuset, with a single read, rather than having to query and accumulate
          results over all the (dynamically changing) set of tasks in the cpuset.
      
      A per-cpuset simple digital filter (requires a spinlock and 3 words of data
      per-cpuset) is kept, and updated by any task attached to that cpuset, if it
      enters the synchronous (direct) page reclaim code.
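A filter of the kind described above can be sketched in userspace as follows. This is an illustrative model under stated assumptions, not the patch's code: the names, the fixed-point constants, and the caps are assumptions (a per-second decay of 933/1000 is used because 0.933^10 is roughly 0.5, giving about a 10 second half-life), and the spinlock is omitted. The three words of state are the pending event count, the filtered value, and the time of the last update. The global flag models the memory_pressure_enabled switch: when it is zero, marking an event does nothing beyond the flag test.

```c
#include <assert.h>
#include <stddef.h>

#define FM_COEF     933      /* assumed per-second decay, scaled by FM_SCALE */
#define FM_SCALE    1000     /* fixed-point scale factor */
#define FM_MAXTICKS 99       /* assumed cap on decay iterations per update */
#define FM_MAXCNT   1000000  /* assumed cap on pending scaled events */

int memory_pressure_enabled_model = 0;   /* hypothetical global enable flag */

struct fmeter_model {
    int cnt;     /* unprocessed events, each counted as FM_SCALE */
    int val;     /* filtered rate: events per second times FM_SCALE */
    long time;   /* second at which val was last brought up to date */
};

/* Decay val once per elapsed second, then fold in the pending events. */
void fmeter_update(struct fmeter_model *fm, long now)
{
    long ticks;

    assert(fm != NULL);
    ticks = now - fm->time;
    if (ticks <= 0)
        return;
    if (ticks > FM_MAXTICKS)
        ticks = FM_MAXTICKS;
    while (ticks-- > 0)
        fm->val = (FM_COEF * fm->val) / FM_SCALE;
    fm->time = now;
    fm->val += ((FM_SCALE - FM_COEF) * fm->cnt) / FM_SCALE;
    fm->cnt = 0;
}

/* Called when a task in the cpuset enters direct reclaim. */
void fmeter_markevent(struct fmeter_model *fm, long now)
{
    if (!memory_pressure_enabled_model)
        return;                          /* disabled: only the flag test */
    fmeter_update(fm, now);
    fm->cnt += FM_SCALE;
    if (fm->cnt > FM_MAXCNT)
        fm->cnt = FM_MAXCNT;
}

/* Value a reader of the per-cpuset file would see in this model. */
int fmeter_getrate(struct fmeter_model *fm, long now)
{
    fmeter_update(fm, now);
    return fm->val;
}
```

Passing the current time in as a parameter keeps the sketch deterministic; real code would read a clock and would hold the per-cpuset lock around each of these operations.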
      
      A per-cpuset file provides an integer number representing the recent
      (half-life of 10 seconds) rate of direct page reclaims caused by the tasks
      in the cpuset, in units of reclaims attempted per second, times 1000.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      3e0d98b9
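From the batch manager's side, the commit above implies that one read of a per-cpuset file is enough to get the current rate. A minimal sketch of such a reader follows. The function names are hypothetical, and the file location is an assumption: the commit names /dev/cpuset/memory_pressure_enabled and a per-cpuset file, so a path such as /dev/cpuset/<cpuset>/memory_pressure is plausible but should be checked against the cpuset documentation. The file content is assumed to be a single decimal integer.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read the raw integer from a memory pressure file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
int read_memory_pressure(const char *path, long *raw)
{
    FILE *f;
    int ok;

    assert(path != NULL && raw != NULL);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%ld", raw) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* The raw value is reclaims attempted per second, times 1000. */
double reclaims_per_second(long raw)
{
    return raw / 1000.0;
}
```

Because the value is already a running average, the caller does not need to sample repeatedly or walk the task list; what threshold counts as "too much pressure" is left to the batch manager's policy.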