1. 23 Mar, 2011 6 commits
    • cgroups: if you list_empty() a head then don't list_del() it · 8d258797
      Phil Carmody authored
      list_del() leaves poison in the prev and next pointers.  The next
      list_empty() will compare those poisons, and say the list isn't empty.
      Any list operations that assume the node is on a list because of such a
      check will be fooled into dereferencing poison.  One needs to INIT the
      node after the del, and fortunately there's already a wrapper for that -
      list_del_init().
      
      Some of the dels are followed by deallocations, so can be ignored, and one
      can be merged with an add to make a move.  Apart from that, I erred on the
      side of caution in making nodes list_empty()-queriable.
      Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
      Reviewed-by: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
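The difference between list_del() and list_del_init() described in this commit can be sketched in user space. This is a minimal model, not the kernel's implementation: the poison values and the `model_unlink()` helper are stand-ins invented for illustration, and only the pointer behaviour the message relies on is reproduced.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space model of the kernel's circular doubly linked list. */
struct list_head {
    struct list_head *next, *prev;
};

/* Illustrative stand-ins for the kernel's poison values (not the real constants). */
#define MODEL_POISON_NEXT ((struct list_head *)0x100)
#define MODEL_POISON_PREV ((struct list_head *)0x200)

static void INIT_LIST_HEAD(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* A node counts as "empty" only when it points back at itself. */
static bool list_empty(const struct list_head *h)
{
    return h->next == h;
}

static void list_add(struct list_head *n, struct list_head *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Hypothetical helper: unlink the entry without touching its own pointers. */
static void model_unlink(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
}

/* Leaves poison behind, so a later list_empty() on the node is false. */
static void list_del(struct list_head *e)
{
    model_unlink(e);
    e->next = MODEL_POISON_NEXT;
    e->prev = MODEL_POISON_PREV;
}

/* Re-initialises the node, so a later list_empty() on it is true. */
static void list_del_init(struct list_head *e)
{
    model_unlink(e);
    INIT_LIST_HEAD(e);
}

/* A delete followed by an add collapses into a single move. */
static void list_move(struct list_head *e, struct list_head *head)
{
    model_unlink(e);
    list_add(e, head);
}

static void demo(void)
{
    struct list_head src, dst, a, b, c;

    INIT_LIST_HEAD(&src);
    INIT_LIST_HEAD(&dst);
    list_add(&a, &src);
    list_add(&b, &src);
    list_add(&c, &src);

    list_del(&a);
    assert(!list_empty(&a));   /* poisoned: wrongly looks "on a list" */

    list_del_init(&b);
    assert(list_empty(&b));    /* safe to query afterwards */

    list_move(&c, &dst);
    assert(list_empty(&src) && !list_empty(&dst));
}
```

In this model, any code that guards a list operation with list_empty() on node `a` would go on to dereference the poison pointers, which is the failure the commit message warns about.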
    • oom: avoid deferring oom killer if exiting task is being traced · edd45544
      David Rientjes authored
      The oom killer naturally defers killing anything if it finds an eligible
      task that is already exiting and has yet to detach its ->mm.  This avoids
      unnecessarily killing tasks when one is already in the exit path and may
      free enough memory that the oom killer is no longer needed.  This is
      detected by PF_EXITING since threads that have already detached their ->mm
      are no longer considered at all.
      
      The problem with always deferring when a thread is PF_EXITING, however, is
      that it may never actually exit when being traced, specifically if another
      task is tracing it with PTRACE_O_TRACEEXIT.  The oom killer does not want
      to defer in this case since there is no guarantee that thread will ever
      exit without intervention.
      
      This patch will now only defer the oom killer when a thread is PF_EXITING
      and no ptracer has stopped its progress in the exit path.  It also ensures
      that a child is sacrificed for the chosen parent only if it has a
      different ->mm as the comment implies: this ensures that the thread group
      leader is always targeted appropriately.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: <stable@kernel.org>		[2.6.38.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
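The deferral rule described in this commit can be modelled as a small predicate. This is a sketch under stated assumptions, not the kernel code: `task_model`, `mm_model`, the `MODEL_*` flag values, and both function names are hypothetical, and the tracer state is flattened into a single flag on the task.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values; the kernel's real constants differ. */
#define MODEL_PF_EXITING  0x1u   /* task is in the exit path */
#define MODEL_TRACE_EXIT  0x2u   /* a tracer stops the task on exit */

struct mm_model { int id; };

struct task_model {
    unsigned int flags;
    unsigned int ptrace;
    const struct mm_model *mm;   /* NULL once the task has detached its mm */
};

/* Defer only when the task is exiting, still holds its mm,
 * and no tracer can stall it in the exit path. */
static bool oom_should_defer(const struct task_model *p)
{
    if (p->mm == NULL)
        return false;
    if (!(p->flags & MODEL_PF_EXITING))
        return false;
    return !(p->ptrace & MODEL_TRACE_EXIT);
}

/* A child is sacrificed for the chosen parent only if it has a different mm. */
static bool may_sacrifice_child(const struct task_model *parent,
                                const struct task_model *child)
{
    return child->mm != NULL && child->mm != parent->mm;
}

static void demo(void)
{
    struct mm_model mm_a = { 1 }, mm_b = { 2 };
    struct task_model exiting  = { MODEL_PF_EXITING, 0, &mm_a };
    struct task_model traced   = { MODEL_PF_EXITING, MODEL_TRACE_EXIT, &mm_a };
    struct task_model detached = { MODEL_PF_EXITING, 0, NULL };
    struct task_model sibling  = { 0, 0, &mm_a };
    struct task_model child    = { 0, 0, &mm_b };

    assert(oom_should_defer(&exiting));
    assert(!oom_should_defer(&traced));     /* may never exit on its own */
    assert(!oom_should_defer(&detached));   /* no memory left to wait for */
    assert(!may_sacrifice_child(&exiting, &sibling));  /* same mm */
    assert(may_sacrifice_child(&exiting, &child));
}
```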
    • oom: skip zombies when iterating tasklist · 30e2b41f
      Andrey Vagin authored
      We shouldn't defer oom killing if a thread has already detached its ->mm
      and still has TIF_MEMDIE set.  Memory needs to be freed, so kill other
      threads that pin the same ->mm or find another task to kill.
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@kernel.org>		[2.6.38.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
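The effect of skipping zombies before the TIF_MEMDIE check can be illustrated with a toy scan loop. This is an assumption-laden sketch: `zscan_thread`, `zscan_result`, and `scan_threads()` are hypothetical names, and the real scan does considerably more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum zscan_result { ZSCAN_DEFER, ZSCAN_CONTINUE };

struct zscan_thread {
    bool has_mm;   /* false once the thread has detached its mm (zombie) */
    bool memdie;   /* models TIF_MEMDIE being set */
};

/* Skip threads without an mm first, so a zombie that still carries the
 * memdie flag cannot make the scan defer forever. */
static enum zscan_result scan_threads(const struct zscan_thread *t, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!t[i].has_mm)
            continue;            /* nothing left to free by waiting */
        if (t[i].memdie)
            return ZSCAN_DEFER;  /* a victim is still releasing memory */
    }
    return ZSCAN_CONTINUE;       /* go on and pick a task to kill */
}

static void demo(void)
{
    struct zscan_thread zombie_only[] = { { false, true }, { true, false } };
    struct zscan_thread live_victim[] = { { true, true } };

    assert(scan_threads(zombie_only, 2) == ZSCAN_CONTINUE);
    assert(scan_threads(live_victim, 1) == ZSCAN_DEFER);
}
```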
    • oom: prevent unnecessary oom kills or kernel panics · 3a5dda7a
      David Rientjes authored
      This patch prevents unnecessary oom kills or kernel panics by reverting
      two commits:
      
      	495789a5 (oom: make oom_score to per-process value)
      	cef1d352 (oom: multi threaded process coredump don't make deadlock)
      
      First, 495789a5 (oom: make oom_score to per-process value) ignores the
      fact that all threads in a thread group do not necessarily exit at the
      same time.
      
      It is imperative that select_bad_process() detect threads that are in the
      exit path, specifically those with PF_EXITING set, to prevent needlessly
      killing additional tasks.  If a process is oom killed and the thread group
      leader exits, select_bad_process() cannot detect the other threads that
      are PF_EXITING by iterating over only processes.  Thus, it currently
      chooses another task unnecessarily for oom kill or panics the machine when
      nothing else is eligible.
      
      By iterating over threads instead, it is possible to detect threads that
      are exiting and nominate them for oom kill so they get access to memory
      reserves.
      
      Second, cef1d352 (oom: multi threaded process coredump don't make
      deadlock) erroneously avoids making the oom killer a no-op when an
      eligible thread other than current is found to be exiting.  We want to
      detect this situation so that we may allow that exiting thread time to
      exit and free its memory; if it is able to exit on its own, that should
      free memory so current is no longer oom.  If it is not able to exit on its
      own, the oom killer will nominate it for oom kill which, in this case,
      only means it will get access to memory reserves.
      
      Without this change, it is easy for the oom killer to unnecessarily target
      tasks when all threads of a victim don't exit before the thread group
      leader or, in the worst case, panic the machine.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: <stable@kernel.org>		[2.6.38.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
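Why iterating over processes misses exiting threads can be shown with a simplified model. The assumptions here are loud ones: `walk_thread` and `exiting_thread_found()` are hypothetical, a "process walk" is modelled as visiting group leaders only, and an exited leader is modelled as an entry with no mm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct walk_thread {
    bool is_leader;  /* thread group leader */
    bool has_mm;     /* false once the thread has detached its mm */
    bool exiting;    /* models PF_EXITING */
};

/* Returns true if the walk sees a thread that is exiting and still owns
 * memory. With leaders_only set, non-leader threads are never visited. */
static bool exiting_thread_found(const struct walk_thread *t, size_t n,
                                 bool leaders_only)
{
    for (size_t i = 0; i < n; i++) {
        if (leaders_only && !t[i].is_leader)
            continue;
        if (!t[i].has_mm)
            continue;
        if (t[i].exiting)
            return true;
    }
    return false;
}

static void demo(void)
{
    /* The leader has already exited; one other thread is still on its way out. */
    struct walk_thread group[] = {
        { true,  false, true },
        { false, true,  true },
    };

    /* Process walk: nothing found, so another task would be chosen. */
    assert(!exiting_thread_found(group, 2, true));
    /* Thread walk: the exiting thread is detected. */
    assert(exiting_thread_found(group, 2, false));
}
```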
    • mm: swap: unlock swapfile inode mutex before closing file on bad swapfiles · 52c50567
      Mel Gorman authored
      If an administrator tries to swapon a file backed by NFS, the inode mutex is
      taken (as it is for any swapfile), but the file is later identified as a bad
      swapfile due to the lack of bmap, and swapon tries to clean up. During cleanup,
      an attempt is made to close the file with inode->i_mutex still held. Closing
      an NFS file syncs it, which tries to acquire the inode mutex, leading to
      deadlock. If lockdep is enabled, the following appears on the console:
      
          =============================================
          [ INFO: possible recursive locking detected ]
          2.6.38-rc8-autobuild #1
          ---------------------------------------------
          swapon/2192 is trying to acquire lock:
           (&sb->s_type->i_mutex_key#13){+.+.+.}, at: vfs_fsync_range+0x47/0x7c
      
          but task is already holding lock:
           (&sb->s_type->i_mutex_key#13){+.+.+.}, at: sys_swapon+0x28d/0xae7
      
          other info that might help us debug this:
          1 lock held by swapon/2192:
           #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: sys_swapon+0x28d/0xae7
      
          stack backtrace:
          Pid: 2192, comm: swapon Not tainted 2.6.38-rc8-autobuild #1
          Call Trace:
              __lock_acquire+0x2eb/0x1623
              find_get_pages_tag+0x14a/0x174
              pagevec_lookup_tag+0x25/0x2e
              vfs_fsync_range+0x47/0x7c
              lock_acquire+0xd3/0x100
              vfs_fsync_range+0x47/0x7c
              nfs_flush_one+0x0/0xdf [nfs]
              mutex_lock_nested+0x40/0x2b1
              vfs_fsync_range+0x47/0x7c
              vfs_fsync_range+0x47/0x7c
              vfs_fsync+0x1c/0x1e
              nfs_file_flush+0x64/0x69 [nfs]
              filp_close+0x43/0x72
              sys_swapon+0xa39/0xae7
              sysret_check+0x2e/0x69
              system_call_fastpath+0x16/0x1b
      
      This patch releases the mutex, if it is held, before calling filp_close()
      so swapon fails as expected without deadlock when the swapfile is backed
      by NFS.  If accepted for 2.6.39, it should also be considered a -stable
      candidate for 2.6.38 and 2.6.37.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: <stable@kernel.org>		[2.6.37+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
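The unlock-before-close ordering can be sketched with a toy lock. Everything here is a model built on assumptions: `inode_model`, the `model_*` helpers, and `cleanup_bad_swapfile()` are hypothetical, and a would-be deadlock is reported as a `false` return instead of actually hanging.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy non-recursive mutex attached to an inode. */
struct inode_model {
    bool mutex_held;
};

static void model_lock(struct inode_model *inode)
{
    assert(!inode->mutex_held);
    inode->mutex_held = true;
}

static void model_unlock(struct inode_model *inode)
{
    assert(inode->mutex_held);
    inode->mutex_held = false;
}

/* Closing syncs the file, which needs the same mutex. Returns false in
 * the case where real code would deadlock on a lock it already holds. */
static bool model_close(struct inode_model *inode)
{
    if (inode->mutex_held)
        return false;
    model_lock(inode);
    model_unlock(inode);
    return true;
}

/* Error-path cleanup; unlock_first selects the patched ordering. */
static bool cleanup_bad_swapfile(struct inode_model *inode, bool *did_lock,
                                 bool unlock_first)
{
    if (unlock_first && *did_lock) {
        model_unlock(inode);
        *did_lock = false;
    }
    return model_close(inode);
}

static void demo(void)
{
    struct inode_model inode = { false };
    bool did_lock = false;

    model_lock(&inode);
    did_lock = true;

    /* Old ordering: close with the mutex held, which would deadlock. */
    assert(!cleanup_bad_swapfile(&inode, &did_lock, false));
    /* Patched ordering: release first, then close succeeds. */
    assert(cleanup_bad_swapfile(&inode, &did_lock, true));
    assert(!did_lock && !inode.mutex_held);
}
```

Clearing `did_lock` after the early unlock matters in this model for the same reason it would in any cleanup path: a later common exit must not try to release the mutex a second time.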
    • include/asm-generic/unistd.h: fix syncfs syscall number · c7a1fcd8
      Andrew Morton authored
      syncfs() was given the same syscall number as name_to_handle_at() due to
      a merging mistake.
      
      Cc: Sage Weil <sage@newdream.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
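A duplicated syscall number of this kind is easy to catch mechanically. The sketch below is illustrative only: `syscall_entry` and `first_duplicate()` are hypothetical, and the numbers in the tables are placeholders, not the values from include/asm-generic/unistd.h.

```c
#include <assert.h>
#include <stddef.h>

struct syscall_entry {
    const char *name;
    int nr;
};

/* Returns the index of the first entry whose number repeats an earlier
 * entry, or -1 if every number is unique. */
static int first_duplicate(const struct syscall_entry *t, size_t n)
{
    for (size_t i = 1; i < n; i++)
        for (size_t j = 0; j < i; j++)
            if (t[i].nr == t[j].nr)
                return (int)i;
    return -1;
}

static void demo(void)
{
    /* Placeholder numbers, chosen only to reproduce the clash. */
    struct syscall_entry merged[] = {
        { "name_to_handle_at", 100 },
        { "syncfs",            100 },
    };
    struct syscall_entry fixed[] = {
        { "name_to_handle_at", 100 },
        { "syncfs",            101 },
    };

    assert(first_duplicate(merged, 2) == 1);
    assert(first_duplicate(fixed, 2) == -1);
}
```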
  2. 22 Mar, 2011 34 commits