1. 07 Apr, 2014 40 commits
    • idr: remove dead code · 90ae3ae5
      Stephen Hemminger authored
      Remove no longer used deprecated code, and make local functions
      static.
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Acked-by: Jean Delvare <jdelvare@suse.de>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: George Spelvin <linux@horizon.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmcore: continue vmcore initialization if PT_NOTE is found empty · c4082f36
      WANG Chao authored
      Currently, vmcore initialization fails when an empty PT_NOTE is
      detected.  That is too harsh, because a PT_NOTE can legitimately be
      empty: for example, someone offlines a CPU but never restarts the kdump
      service, so after a crash the PT_NOTE program header is present but
      contains no data.  It is better to warn about the empty PT_NOTE and
      continue initialising vmcore.
      
      Ultimately the multiple PT_NOTE segments are merged into a single one,
      and all empty PT_NOTE segments are discarded naturally during the merge.
      So an empty PT_NOTE is not visible to user space and vmcore is as good
      as expected.
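
The merge behaviour described above can be illustrated with a small user-space sketch. It is not the kernel code: `struct toy_note_phdr` and `toy_merged_note_size()` are hypothetical names, and the structure is a simplified stand-in for an ELF program header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for an ELF PT_NOTE program header. */
struct toy_note_phdr {
	uint64_t offset;	/* where the note data lives */
	uint64_t size;		/* note data size in bytes; 0 means empty */
};

/*
 * Return the size of the single merged note segment. Empty headers are
 * counted in *nr_empty and warned about, but they do not abort the merge.
 */
static uint64_t toy_merged_note_size(const struct toy_note_phdr *ph, size_t n,
				     size_t *nr_empty)
{
	uint64_t total = 0;
	size_t i;

	assert(nr_empty != NULL && (ph != NULL || n == 0));
	*nr_empty = 0;
	for (i = 0; i < n; i++) {
		if (ph[i].size == 0) {
			fprintf(stderr, "warning: PT_NOTE %zu is empty\n", i);
			(*nr_empty)++;
			continue;	/* skip it, keep going */
		}
		total += ph[i].size;
	}
	return total;
}
```

Because empty entries contribute nothing to the total, the merged segment looks the same to a reader whether or not they were present.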
      Signed-off-by: WANG Chao <chaowang@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Greg Pearson <greg.pearson@hp.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/crash_dump.h: add vmcore_cleanup() prototype · 82e0703b
      Rashika Kheria authored
      Eliminate the following warning in fs/proc/vmcore.c:
      
        fs/proc/vmcore.c:1088:6: warning: no previous prototype for `vmcore_cleanup' [-Wmissing-prototypes]
      
      [akpm@linux-foundation.org: clean up powerpc, remove unneeded EXPORT_SYMBOL]
      Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • wait: WSTOPPED|WCONTINUED doesn't work if a zombie leader is traced by another process · 7c733eb3
      Oleg Nesterov authored
      Even if the main thread is dead the process still can stop/continue.
      However, if the leader is ptraced wait_consider_task(ptrace => false)
      always skips wait_task_stopped/wait_task_continued, so WSTOPPED or
      WCONTINUED can never work for the natural parent in this case.
      
      Move the "A zombie ptracee is only visible to its ptracer" check into the
      "if (!delay_group_leader(p))" block.  ->notask_error is cleared by the
      "fall through" code below.
      
      This depends on the previous change, wait_task_stopped/continued must be
      avoided if !delay_group_leader() and the tracer is ->real_parent.
      Otherwise WSTOPPED|WEXITED could wrongly report "stopped" when the child
      is already dead (single-threaded or not).  If it is traced by another task
      then the "stopped" state is fine until the debugger detaches and reveals a
      zombie state.
      
      Stupid test-case:
      
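      	#include <assert.h>
      	#include <pthread.h>
      	#include <signal.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/ptrace.h>
      	#include <sys/wait.h>

      	void *tfunc(void *arg)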
      	{
      		sleep(1);	// wait for zombie leader
      		raise(SIGSTOP);
      		exit(0x13);
      		return NULL;
      	}
      
      	int run_child(void)
      	{
      		pthread_t thread;
      
      		if (!fork()) {
      			int tracee = getppid();
      
      			assert(ptrace(PTRACE_ATTACH, tracee, 0,0) == 0);
      			do
      				ptrace(PTRACE_CONT, tracee, 0,0);
      			while (wait(NULL) > 0);
      
      			return 0;
      		}
      
      		sleep(1);	// wait for PTRACE_ATTACH
      		assert(pthread_create(&thread, NULL, tfunc, NULL) == 0);
      		pthread_exit(NULL);
      	}
      
      	int main(void)
      	{
      		int child, stat;
      
      		child = fork();
      		if (!child)
      			return run_child();
      
      		assert(child == waitpid(-1, &stat, WSTOPPED));
      		assert(stat == 0x137f);
      
      		kill(child, SIGCONT);
      
      		assert(child == waitpid(-1, &stat, WCONTINUED));
      		assert(stat == 0xffff);
      
      		assert(child == waitpid(-1, &stat, 0));
      		assert(stat == 0x1300);
      
      		return 0;
      	}
      
      Without this patch it hangs in waitpid(WSTOPPED), wait_task_stopped() is
      never called.
      
      Note: this doesn't fix all problems with a zombie delay_group_leader(),
      WCONTINUED | WEXITED check is not exactly right.  debugger can't assume it
      will be notified if another thread reaps the whole thread group.
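
The magic numbers asserted in the test case (0x137f, 0xffff, 0x1300) are raw wait statuses. The sketch below decodes them with the standard `<sys/wait.h>` macros, assuming the Linux/glibc status encoding; `toy_decode_status()` and the `toy_wait_kind` enum are hypothetical helpers, not part of the patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/wait.h>

enum toy_wait_kind {
	TOY_EXITED, TOY_SIGNALED, TOY_STOPPED, TOY_CONTINUED, TOY_UNKNOWN
};

/*
 * Classify a wait status and store the exit code or signal number in
 * *value (0 for the "continued" and "unknown" cases).
 */
static enum toy_wait_kind toy_decode_status(int status, int *value)
{
	assert(value != NULL);
	*value = 0;
	if (WIFEXITED(status)) {
		*value = WEXITSTATUS(status);
		return TOY_EXITED;
	}
	if (WIFSIGNALED(status)) {
		*value = WTERMSIG(status);
		return TOY_SIGNALED;
	}
	if (WIFSTOPPED(status)) {
		*value = WSTOPSIG(status);
		return TOY_STOPPED;
	}
	if (WIFCONTINUED(status))
		return TOY_CONTINUED;
	return TOY_UNKNOWN;
}
```

Under that encoding, 0x137f decodes as "stopped by signal 0x13", 0xffff as "continued", and 0x1300 as "exited with code 0x13", which matches the raise(SIGSTOP), SIGCONT and exit(0x13) steps of the test.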
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • wait: WSTOPPED|WCONTINUED hangs if a zombie child is traced by real_parent · 377d75da
      Oleg Nesterov authored
      "A zombie is only visible to its ptracer" logic in wait_consider_task()
      is very wrong. Trivial test-case:
      
      	#include <unistd.h>
      	#include <sys/ptrace.h>
      	#include <sys/wait.h>
      	#include <assert.h>
      
      	int main(void)
      	{
      		int child = fork();
      
      		if (!child) {
      			assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
      			return 0x23;
      		}
      
      		assert(waitid(P_ALL, child, NULL, WEXITED | WNOWAIT) == 0);
      		assert(waitid(P_ALL, 0, NULL, WSTOPPED) == -1);
      		return 0;
      	}
      
      it hangs in waitid(WSTOPPED) despite the fact that it has a single zombie
      child.  This is because wait_consider_task(ptrace => 0) sees p->ptrace and
      clears ->notask_error, assuming that the debugger should detach and notify
      us.
      
      Change wait_consider_task(ptrace => 0) to pretend that ptrace == T if the
      child is traced by us.  This really simplifies the logic and allows us to
      do more fixes, see the next changes.  This also hides the unwanted group
      stop state automatically, we can remove another ptrace_reparented() check.
      
      Unfortunately, this adds the following behavioural changes:
      
      	1. Before this patch wait(WEXITED | __WNOTHREAD) does not reap
      	   a natural child if it is traced by the caller's sub-thread.
      
      	   Hopefully nobody will ever notice this change, and I think
      	   that nobody should rely on this behaviour anyway.
      
      	2. SIGNAL_STOP_CONTINUED is no longer hidden from debugger if
      	   it is real parent.
      
      	   While this change comes as a side effect, I think it is good
      	   by itself. The group continued state can not be consumed by
      	   another process in this case, it doesn't depend on ptrace,
      	   it doesn't make sense to hide it from real parent.
      
      	   Perhaps we should add the thread_group_leader() check before
      	   wait_task_continued()? Maybe, but this shouldn't depend on
      	   ptrace_reparented().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • wait: swap EXIT_ZOMBIE and EXIT_DEAD to hide EXIT_TRACE from user-space · ad86622b
      Oleg Nesterov authored
      get_task_state() uses the most significant bit to report the state to
      user-space, this means that EXIT_ZOMBIE->EXIT_TRACE->EXIT_DEAD transition
      can be noticed via /proc as Z -> X -> Z change.  Note that this was
      possible even before EXIT_TRACE was introduced.
      
      This is not really bad, but IMHO it makes sense to hide EXIT_TRACE from
      user-space completely.  So the patch simply swaps EXIT_ZOMBIE and
      EXIT_DEAD; this way EXIT_TRACE will be seen as EXIT_ZOMBIE by user-space.
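
Why swapping two bit values changes what user-space sees can be shown with a toy model of "report the most significant bit". This is only a sketch: `toy_state_letter()` is a hypothetical helper, and the bit values 16 and 32 used in the usage note are illustrative assumptions, not taken from the kernel headers.

```c
#include <assert.h>

/*
 * Report a state letter based on the most significant set bit of
 * exit_state, given which bit stands for "zombie" and which for "dead".
 */
static char toy_state_letter(unsigned int exit_state,
			     unsigned int zombie_bit, unsigned int dead_bit)
{
	unsigned int msb = 0;
	unsigned int bit;

	assert(zombie_bit != dead_bit);
	/* Walk all bits upwards; the last match is the highest one. */
	for (bit = 1; bit != 0; bit <<= 1)
		if (exit_state & bit)
			msb = bit;

	if (msb == zombie_bit)
		return 'Z';
	if (msb == dead_bit)
		return 'X';
	return '?';
}
```

With a combined state of `16 | 32`, calling `toy_state_letter(16 | 32, 16, 32)` (dead is the higher bit) yields 'X', while `toy_state_letter(16 | 32, 32, 16)` (zombie is the higher bit) yields 'Z'. That is the effect the swap relies on: a combined zombie-plus-dead state is reported as a plain zombie.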
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • wait: completely ignore the EXIT_DEAD tasks · b3ab0316
      Oleg Nesterov authored
      Now that EXIT_DEAD is the terminal state it doesn't make sense to call
      eligible_child() or security_task_wait() if the task is really dead.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Tested-by: Michal Schmidt <mschmidt@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • wait: use EXIT_TRACE only if thread_group_leader(zombie) · b4360690
      Oleg Nesterov authored
      wait_task_zombie() always uses EXIT_TRACE/ptrace_unlink() if
      ptrace_reparented().  This is suboptimal and a bit confusing: we do not
      need do_notify_parent(p) if !thread_group_leader(p) and in this case we
      also do not need ptrace_unlink(), we can rely on ptrace_release_task().
      
      Change wait_task_zombie() to check thread_group_leader() along with
      ptrace_reparented() and simplify the final p->exit_state transition.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Tested-by: Michal Schmidt <mschmidt@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • wait: introduce EXIT_TRACE to avoid the racy EXIT_DEAD->EXIT_ZOMBIE transition · abd50b39
      Oleg Nesterov authored
      wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and
      drops tasklist_lock.  If this task is not the natural child and it is
      traced, we change its state back to EXIT_ZOMBIE for ->real_parent.
      
      The last transition is racy, this is even documented in 50b8d257
      "ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE
      race".  wait_consider_task() tries to detect this transition and clear
      ->notask_error but we can't rely on ptrace_reparented(), debugger can
      exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE.
      
      And there is another problem which was missed before: this transition
      can also race with reparent_leader(), which doesn't reset ->exit_signal if
      EXIT_DEAD, assuming that this task must be reaped by someone else.  So
      the tracee can be re-parented with ->exit_signal != SIGCHLD, and if
      /sbin/init doesn't use __WALL it becomes unreapable.  This was fixed by
      the previous commit, but that was a temporary hack.
      
      1. Add the new exit_state, EXIT_TRACE. It means that the task is the
         traced zombie, debugger is going to detach and notify its natural
         parent.
      
         This new state is actually EXIT_ZOMBIE | EXIT_DEAD. This way we
         can avoid the changes in proc/kgdb code, get_task_state() still
         reports "X (dead)" in this case.
      
         Note: with or without this change userspace can see Z -> X -> Z
         transition. Not really bad, but probably makes sense to fix.
      
      2. Change wait_task_zombie() to use EXIT_TRACE instead of EXIT_DEAD
         if we need to notify the ->real_parent.
      
      3. Revert the previous hack in reparent_leader(), now that EXIT_DEAD
         is always the final state we can safely ignore such a task.
      
      4. Change wait_consider_task() to check EXIT_TRACE separately and kill
         the racy and no longer needed ptrace_reparented() case.
      
         If ptrace == T an EXIT_TRACE thread should be simply ignored, the
         owner of this state is going to ptrace_unlink() this task. We can
         pretend that it was already removed from ->ptraced list.
      
         Otherwise we should skip this thread too but clear ->notask_error,
         we must be the natural parent and debugger is going to untrace and
         notify us. IOW, this doesn't differ from "EXIT_ZOMBIE && p->ptrace"
         even if the task was already untraced.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com>
      Reported-by: Michal Schmidt <mschmidt@redhat.com>
      Tested-by: Michal Schmidt <mschmidt@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • wait: fix reparent_leader() vs EXIT_DEAD->EXIT_ZOMBIE race · dfccbb5e
      Oleg Nesterov authored
      wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and
      drops tasklist_lock.  If this task is not the natural child and it is
      traced, we change its state back to EXIT_ZOMBIE for ->real_parent.
      
      The last transition is racy, this is even documented in 50b8d257
      "ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE
      race".  wait_consider_task() tries to detect this transition and clear
      ->notask_error but we can't rely on ptrace_reparented(), debugger can
      exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE.
      
      And there is another problem which was missed before: this transition
      can also race with reparent_leader(), which doesn't reset ->exit_signal if
      EXIT_DEAD, assuming that this task must be reaped by someone else.  So
      the tracee can be re-parented with ->exit_signal != SIGCHLD, and if
      /sbin/init doesn't use __WALL it becomes unreapable.
      
      Change reparent_leader() to update ->exit_signal even if EXIT_DEAD.
      Note: this is a simple temporary hack for -stable, it doesn't try to
      solve all problems, and it will be reverted by the next changes.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com>
      Reported-by: Michal Schmidt <mschmidt@redhat.com>
      Tested-by: Michal Schmidt <mschmidt@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exec: kill bprm->tcomm[], simplify the "basename" logic · 23aebe16
      Oleg Nesterov authored
      Starting from commit c4ad8f98 ("execve: use 'struct filename *' for
      executable name passing") bprm->filename can not go away after
      flush_old_exec(), so we do not need to save the binary name in
      bprm->tcomm[] added by 96e02d15 ("exec: fix use-after-free bug in
      setup_new_exec()").
      
      And there was never a need for filename_to_taskname-like code, we can
      simply do set_task_comm(kbasename(filename)).
      
      This patch has to change set_task_comm() and trace_task_rename() to
      accept "const char *", but I think this change is also good.
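
The "basename" logic amounts to taking everything after the last '/' and copying it into a fixed-size name buffer. A user-space sketch of that idea follows; `toy_basename()`, `toy_set_comm()` and the 16-byte `TOY_COMM_LEN` are assumptions made for illustration, not the kernel's own definitions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_COMM_LEN 16	/* assumed buffer size, including the trailing NUL */

/* Return a pointer to the last path component (no copy is made). */
static const char *toy_basename(const char *path)
{
	const char *tail = strrchr(path, '/');

	return tail ? tail + 1 : path;
}

/* Copy the basename of filename into comm, truncating if it is too long. */
static void toy_set_comm(char comm[TOY_COMM_LEN], const char *filename)
{
	assert(comm != NULL && filename != NULL);
	snprintf(comm, TOY_COMM_LEN, "%s", toy_basename(filename));
}
```

Since the basename is just a pointer into the original string, this only works while the filename stays valid, which is the property the commit message says bprm->filename now has.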
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • procfs: make /proc/*/pagemap 0400 · 32ed74a4
      Djalal Harouni authored
      The /proc/*/pagemap files contain sensitive information and currently
      their mode is 0444.  Change this to 0400, so the VFS will prevent
      unprivileged processes from getting file descriptors on arbitrary
      privileged /proc/*/pagemap files.
      
      This reduces the scope of address space leaking and bypasses by protecting
      already running processes.
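
The difference between the two modes is only the group and other read bits. A small sketch using stat(2) shows how a tool might check for them; `toy_mode_lets_others_read()` and `toy_path_lets_others_read()` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Return 1 if the mode grants read access to group or others, else 0. */
static int toy_mode_lets_others_read(mode_t mode)
{
	return (mode & (S_IRGRP | S_IROTH)) != 0;
}

/* Same check for a path; returns -1 if the path cannot be stat()ed. */
static int toy_path_lets_others_read(const char *path)
{
	struct stat st;

	assert(path != NULL);
	if (stat(path, &st) != 0)
		return -1;
	return toy_mode_lets_others_read(st.st_mode);
}
```

For example, `toy_mode_lets_others_read(0444)` is 1 and `toy_mode_lets_others_read(0400)` is 0. What a given kernel reports for a path such as /proc/self/pagemap depends on whether it carries this change.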
      Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • procfs: make /proc/*/{stack,syscall,personality} 0400 · 35a35046
      Djalal Harouni authored
      These procfs files contain sensitive information and currently their
      mode is 0444.  Change this to 0400, so the VFS will be able to block
      unprivileged processes from getting file descriptors on arbitrary
      privileged /proc/*/{stack,syscall,personality} files.
      
      This reduces the scope of ASLR leaking and bypasses by protecting already
      running processes.
      Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/proc/inode.c: use RCU_INIT_POINTER(x, NULL) · 1c44dbc8
      Monam Agarwal authored
      Replace rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL).
      
      rcu_assign_pointer() ensures that the initialization of a structure
      is carried out before storing a pointer to that structure.  In the
      case of the NULL pointer, there is no structure to initialize, so
      rcu_assign_pointer(p, NULL) can be safely converted to
      RCU_INIT_POINTER(p, NULL).
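
The same reasoning can be sketched in user space with C11 atomics. This is an analogy, not the kernel implementation: a release store stands in for the ordered publish, a relaxed store stands in for the unordered NULL assignment, and an acquire load is used as a (stronger than necessary) stand-in for the reader side. All `toy_*` names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_cfg {
	int value;
};

static _Atomic(struct toy_cfg *) toy_current;

/* Publish: the initialisation of *cfg must be visible before the pointer. */
static void toy_publish(struct toy_cfg *cfg, int value)
{
	assert(cfg != NULL);
	cfg->value = value;
	atomic_store_explicit(&toy_current, cfg, memory_order_release);
}

/* Clear: NULL points at nothing, so there is nothing to order against. */
static void toy_clear(void)
{
	atomic_store_explicit(&toy_current, NULL, memory_order_relaxed);
}

/* Reader side: observe either NULL or a fully initialised structure. */
static struct toy_cfg *toy_read(void)
{
	return atomic_load_explicit(&toy_current, memory_order_acquire);
}
```

A reader that sees a non-NULL pointer from `toy_read()` can rely on `value` being set, while a reader that sees NULL dereferences nothing, so the cheaper store is sufficient for clearing.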
      Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc: show mnt_id in /proc/pid/fdinfo · 49d063cb
      Andrey Vagin authored
      Currently we have no way to determine from which mount point a file
      has been opened.  This information is required for proper dumping and
      restoring of file descriptors in the presence of mount namespaces.  It
      is possible that two file descriptors are opened using the same path,
      but one fd references a mount point from one namespace while the other
      fd references a mount point from another namespace.
      
      $ ls -l /proc/1/fd/1
      lrwx------ 1 root root 64 Mar 19 23:54 /proc/1/fd/1 -> /dev/null
      
      $ cat /proc/1/fdinfo/1
      pos:	0
      flags:	0100002
      mnt_id:	16
      
      $ cat /proc/1/mountinfo | grep ^16
      16 32 0:4 / /dev rw,nosuid shared:2 - devtmpfs devtmpfs rw,size=1013356k,nr_inodes=253339,mode=755
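
A checkpoint tool would read the new field back out of the fdinfo text. The sketch below parses a buffer in the format shown above; `toy_parse_mnt_id()` is a hypothetical helper and it assumes the "mnt_id:" label appears at the start of a line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 0 and stores the ID on success, -1 if no "mnt_id:" line exists. */
static int toy_parse_mnt_id(const char *fdinfo, int *mnt_id)
{
	const char *line = fdinfo;

	assert(fdinfo != NULL && mnt_id != NULL);
	while (line && *line) {
		/* The space in the format also matches the tab separator. */
		if (sscanf(line, "mnt_id: %d", mnt_id) == 1)
			return 0;
		line = strchr(line, '\n');
		if (line)
			line++;	/* step past the newline to the next line */
	}
	return -1;
}
```

The resulting ID can then be matched against the first column of /proc/pid/mountinfo, as the grep in the transcript does.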
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Rob Landley <rob@landley.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/proc/meminfo: meminfo_proc_show(): fix typo in comment · f0b5664b
      Luiz Capitulino authored
      It should read "reclaimable slab" and not "reclaimable swap".
      Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/exit.c: call proc_exit_connector() after exit_state is set · ef982393
      Guillaume Morin authored
      The process events connector delivers a notification when a process
      exits.  This is really convenient for a process that spawns children and
      wants to monitor them through an epoll()-able interface.
      
      Unfortunately, there is a small window between when the event is
      delivered and when the child becomes wait()-able.
      
      This creates a race if the parent wants to make sure that it knows
      about the exit, e.g.:
      
      pid_t pid = fork();
      if (pid > 0) {
      	register_interest_for_pid(pid);
      	if (waitpid(pid, NULL, WNOHANG) > 0)
      	{
      	  /* We might have raced with exit() */
      	}
      	return;
      }
      
      /* Child */
      execve(...)
      
      register_interest_for_pid() would tell the connector socket reader to
      pay attention to events related to pid.
      
      Though this is not a bug, I think it would make the connector a bit more
      usable if this race were closed by simply moving the call to
      proc_exit_connector() from just before exit_notify() to right after.
      
      Oleg said:
      
      : Even with this patch the code above is still "racy" if the child is
      : multi-threaded.  Plus it should obviously filter-out subthreads.  And
      : afaics there is no way to make it reliable, even if you change the code
      : above so that waitpid() is called only after the last thread exits WNOHANG
      : still can fail.
      Signed-off-by: Guillaume Morin <guillaume@morinfr.org>
      Cc: Matt Helsley <matt.helsley@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exit: move check_stack_usage() to the end of do_exit() · 4bcb8232
      Oleg Nesterov authored
      It is not clear why check_stack_usage() is called so early and thus it
      never checks the stack usage in, say, exit_notify() or
      flush_ptrace_hw_breakpoint() or other functions which are only called by
      do_exit().
      
      Move the callsite down to the last preempt_disable/schedule.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exit: call disassociate_ctty() before exit_task_namespaces() · c39df5fa
      Oleg Nesterov authored
      Commit 8aac6270 ("move exit_task_namespaces() outside of
      exit_notify()") breaks pppd and the exiting service crashes the kernel:
      
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
          IP: ppp_register_channel+0x13/0x20 [ppp_generic]
          Call Trace:
            ppp_asynctty_open+0x12b/0x170 [ppp_async]
            tty_ldisc_open.isra.2+0x27/0x60
            tty_ldisc_hangup+0x1e3/0x220
            __tty_hangup+0x2c4/0x440
            disassociate_ctty+0x61/0x270
            do_exit+0x7f2/0xa50
      
      ppp_register_channel() needs ->net_ns, but current->nsproxy is already NULL.
      
      Move disassociate_ctty() before exit_task_namespaces(), it doesn't make
      sense to delay it after perf_event_exit_task() or cgroup_exit().
      
      This also allows task_work_add() to be used inside the (nontrivial) code
      paths in disassociate_ctty().
      
      Investigated by Peter Hurley.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Sree Harsha Totakura <sreeharsha@totakura.in>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Sree Harsha Totakura <sreeharsha@totakura.in>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>	[v3.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zswap.c: remove unnecessary parentheses · 5d2d42de
      SeongJae Park authored
      Fix the following trivial checkpatch error:
      
        ERROR: return is not a function, parentheses are not required
      Signed-off-by: SeongJae Park <sj38.park@gmail.com>
      Acked-by: Seth Jennings <sjennings@variantweb.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zswap: support multiple swap devices · 60105e12
      Minchan Kim authored
      Cai Liu reported that zbud pool page counting has a problem when
      multiple swap devices are used: it counts only one swap device instead
      of all of them, so zswap cannot control writeback properly.  The result
      is unnecessary writeback, or no writeback when we really should write
      back.
      
      IOW, it made zswap crazy.
      
      Another problem in zswap is:
      
      For example, assume we use two swap devices, A and B, with different
      priorities.  A charged 19% a long time ago and is now full, so the VM
      starts to use B, and B has recently charged 1%.  That means the zswap
      charge (19% + 1%) has reached the default limit.  Then, if the VM wants
      to swap out more pages into B, zbud_reclaim_page would evict one of the
      pages in B's pool, and this would repeat continuously.  It is a complete
      LRU inversion, and swap thrashing in B would happen.
      
      This patch makes zswap consider multiple swap devices by creating *a*
      zbud pool which is shared by all of them, so all zswap pages across the
      swap devices stay in LRU order, which prevents the above two problems.
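
The accounting half of the problem can be shown with a few lines of arithmetic. This sketch assumes a percentage-of-RAM limit and treats "at the limit" as full, which follows the 19% + 1% example in the message; `toy_total_pages()` and `toy_pool_is_full()` are hypothetical helpers and do not mirror the zswap source.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the compressed-pool pages charged by every swap device. */
static unsigned long toy_total_pages(const unsigned long *per_swap, size_t n)
{
	unsigned long total = 0;
	size_t i;

	assert(per_swap != NULL || n == 0);
	for (i = 0; i < n; i++)
		total += per_swap[i];
	return total;
}

/* Return 1 once the pool holds at least max_percent of RAM, else 0. */
static int toy_pool_is_full(unsigned long pool_pages, unsigned long ram_pages,
			    unsigned int max_percent)
{
	return pool_pages * 100 >= (unsigned long)max_percent * ram_pages;
}
```

With 1000 pages of RAM, a 20% limit, and devices charging 190 and 10 pages, neither device looks full on its own, but the summed total does. Counting a single device therefore gives the wrong writeback decision, and a single shared pool makes the total (and the LRU order) the natural unit.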
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Cai Liu <cai.liu@samsung.com>
      Suggested-by: Weijie Yang <weijie.yang.kh@gmail.com>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zswap.c: update zsmalloc in comment to zbud · 6335b193
      SeongJae Park authored
      zswap used zsmalloc before and now uses zbud, but some comments still
      say it uses zsmalloc.  Fix these trivial problems.
      Signed-off-by: SeongJae Park <sj38.park@gmail.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zswap.c: fix trivial typo and arrange indentation · 6b452516
      SeongJae Park authored
      Signed-off-by: SeongJae Park <sj38.park@gmail.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: support REQ_DISCARD · f4659d8e
      Joonsoo Kim authored
      zram is a RAM-based block device and can be used as the backend of a
      filesystem.  When a filesystem deletes a file, it normally does nothing
      to the file's data blocks; it just marks the file's metadata.  This
      behavior is no problem on a disk-based block device, but it is a problem
      on a RAM-based block device, since the memory used for the data blocks
      cannot be freed.  To overcome this disadvantage, there is the REQ_DISCARD
      functionality.  If the block device supports REQ_DISCARD and the
      filesystem is mounted with the discard option, the filesystem sends
      REQ_DISCARD to the block device whenever some data blocks are discarded.
      All we have to do is handle this request.
      
      This patch sets QUEUE_FLAG_DISCARD and handles the REQ_DISCARD request.
      With it, we can free memory used by zram when it is no longer in use.
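
The idea of handling a discard on a RAM-backed device can be sketched with a toy page array in user space. This is not the zram code: `struct toy_ramdev`, `toy_fill()` and `toy_discard()` are hypothetical, and the sketch simply frees the pages that lie entirely inside the discarded byte range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE	4096u
#define TOY_NR_PAGES	8u

struct toy_ramdev {
	void *page[TOY_NR_PAGES];	/* NULL means no memory backs this page */
};

/* Back every page with memory; returns 0 on success, -1 on failure. */
static int toy_fill(struct toy_ramdev *dev)
{
	size_t i;

	assert(dev != NULL);
	for (i = 0; i < TOY_NR_PAGES; i++) {
		dev->page[i] = calloc(1, TOY_PAGE_SIZE);
		if (!dev->page[i])
			return -1;
	}
	return 0;
}

/*
 * Free every page lying entirely inside [offset, offset + len).
 * Partially covered pages still hold live data and are left alone.
 * Returns the number of pages freed.
 */
static size_t toy_discard(struct toy_ramdev *dev, size_t offset, size_t len)
{
	size_t first = (offset + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE; /* round up */
	size_t last = (offset + len) / TOY_PAGE_SIZE;	/* round down, exclusive */
	size_t i, freed = 0;

	assert(dev != NULL);
	if (last > TOY_NR_PAGES)
		last = TOY_NR_PAGES;
	for (i = first; i < last; i++) {
		if (dev->page[i]) {
			free(dev->page[i]);
			dev->page[i] = NULL;
			freed++;
		}
	}
	return freed;
}
```

Without such a handler the device has no way to learn that the filesystem no longer needs a block, so the backing memory stays allocated after the file is deleted.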
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: use scnprintf() in attrs show() methods · 56b4e8cb
      Sergey Senozhatsky authored
      sysfs.txt documentation lists the following requirements:
      
       - The buffer will always be PAGE_SIZE bytes in length. On i386, this
         is 4096.
      
       - show() methods should return the number of bytes printed into the
         buffer. This is the return value of scnprintf().
      
       - show() should always use scnprintf().
      
      Use scnprintf() in show() functions.
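      
      The difference from snprintf() can be sketched in user space as
      follows.  sketch_scnprintf() is an approximation written for this
      example, and sketch_disksize_show() is a hypothetical show()-style
      helper, not a function from the zram driver.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * User-space approximation of the kernel's scnprintf(): format into buf and
 * return the number of characters actually stored (excluding the trailing
 * NUL), never the "would have been written" length that snprintf() reports.
 */
static int sketch_scnprintf(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int n;

	if (size == 0)
		return 0;

	va_start(args, fmt);
	n = vsnprintf(buf, size, fmt, args);
	va_end(args);

	if (n < 0)
		return 0;               /* formatting error: report nothing */
	if ((size_t)n >= size)
		return (int)(size - 1); /* output was truncated */
	return n;
}

/* Hypothetical show() method in the style of a sysfs attribute. */
static int sketch_disksize_show(char *buf, size_t buf_size,
				unsigned long long disksize)
{
	return sketch_scnprintf(buf, buf_size, "%llu\n", disksize);
}
```

      Because the return value never exceeds the buffer size, it can be
      handed back to the caller as the byte count without extra clamping.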
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56b4e8cb
    • Minchan Kim's avatar
      zram: propagate error to user · 60a726e3
      Minchan Kim authored
      When zcomp is initialized as single-stream, max_comp_streams can't be
      changed without a zram reset.  The current interface, however, reports
      no error to the user and even updates the max_comp_streams value, even
      though the change has no effect, which is very confusing.
      
      This patch prevents max_comp_streams from being changed when zcomp was
      initialized as a single-stream zcomp and returns an error to the user
      (e.g. to echo).
      
      [akpm@linux-foundation.org: don't return with the lock held, per Sergey]
      [fengguang.wu@intel.com: fix coccinelle warnings]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      60a726e3
    • Sergey Senozhatsky's avatar
      zram: return error-valued pointer from zcomp_create() · fcfa8d95
      Sergey Senozhatsky authored
      Instead of returning just NULL, return ERR_PTR from zcomp_create() if
      compressing backend creation has failed.  ERR_PTR(-EINVAL) for unsupported
      compression algorithm request, ERR_PTR(-ENOMEM) for allocation (zcomp or
      compression stream) error.
      
      Perform an IS_ERR() check on the value returned from zcomp_create() in
      disksize_store() and set the return code to PTR_ERR().
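      
      The error-pointer convention can be illustrated with a user-space
      sketch.  The sketch_* helpers below re-create the idea behind
      ERR_PTR()/IS_ERR()/PTR_ERR() for this example only; sketch_comp_create()
      and sketch_store() are hypothetical stand-ins, not the driver's code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_MAX_ERRNO 4095

/* User-space re-creations of the kernel helpers, for illustration only. */
static inline void *sketch_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static inline long sketch_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int sketch_is_err(const void *ptr)
{
	/* the top SKETCH_MAX_ERRNO addresses are reserved for error codes */
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

struct sketch_comp {
	const char *name;
};

/* Hypothetical constructor: an error-valued pointer instead of bare NULL. */
static struct sketch_comp *sketch_comp_create(const char *name)
{
	struct sketch_comp *comp;

	if (strcmp(name, "lzo") != 0)
		return sketch_err_ptr(-EINVAL);   /* unsupported algorithm */

	comp = malloc(sizeof(*comp));
	if (!comp)
		return sketch_err_ptr(-ENOMEM);   /* allocation failure */

	comp->name = "lzo";
	return comp;
}

/* Caller in the style of disksize_store(): turn the pointer into a code. */
static long sketch_store(const char *algo)
{
	struct sketch_comp *comp = sketch_comp_create(algo);

	if (sketch_is_err(comp))
		return sketch_ptr_err(comp);

	free(comp);
	return 0;
}
```

      The caller can now tell an unsupported algorithm from an allocation
      failure, which a bare NULL return cannot express.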
      
      Change suggested by Jerome Marchand.
      
      [akpm@linux-foundation.org: clean up error recovery flow]
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reported-by: Jerome Marchand <jmarchan@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fcfa8d95
    • Sergey Senozhatsky's avatar
      zram: move comp allocation out of init_lock · d61f98c7
      Sergey Senozhatsky authored
      While fixing the lockdep spew of ->init_lock reported by Sasha Levin [1],
      Minchan Kim noted [2] that it's better to move the compression backend
      allocation (using GFP_KERNEL) out of the ->init_lock lock, the same way
      as with zram_meta_alloc(), in order to prevent the same lockdep spew.
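      
      The general pattern (allocate first, take the lock afterwards) can be
      sketched in user space with a pthread mutex standing in for
      ->init_lock.  struct sketch_dev and sketch_disksize_store() are
      hypothetical, and the -EBUSY return for an already initialised device
      is an assumption of this example.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical device: a lock plus the backend it protects. */
struct sketch_dev {
	pthread_mutex_t init_lock;
	void *comp;                 /* NULL until the device is initialised */
};

static int sketch_disksize_store(struct sketch_dev *dev)
{
	/* Allocate first: the allocator may block, so no lock is held here. */
	void *comp = malloc(64);

	if (!comp)
		return -ENOMEM;

	pthread_mutex_lock(&dev->init_lock);
	if (dev->comp) {
		/* Already initialised: drop the lock, then discard our copy. */
		pthread_mutex_unlock(&dev->init_lock);
		free(comp);
		return -EBUSY;
	}
	dev->comp = comp;
	pthread_mutex_unlock(&dev->init_lock);
	return 0;
}
```

      The cost of this ordering is an occasional wasted allocation when the
      device turns out to be initialised already.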
      
      [1] https://lkml.org/lkml/2014/2/27/337
      [2] https://lkml.org/lkml/2014/3/3/32
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reported-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Acked-by: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d61f98c7
    • Sergey Senozhatsky's avatar
      zram: add lz4 algorithm backend · 6e76668e
      Sergey Senozhatsky authored
      Introduce the LZ4 compression backend and make it available for
      selection.  LZ4 support is optional and requires the user to set the
      ZRAM_LZ4_COMPRESS config option.  The default compression backend is LZO.
      
      TEST
      
      (x86_64, core i5, 2 cores + 2 hyperthreading, zram disk size 1G,
      ext4 file system, 3 compression streams)
      
      iozone -t 3 -R -r 16K -s 60M -I +Z
      
             Test           LZO           LZ4
      ----------------------------------------------
        Initial write   1642744.62    1317005.09
              Rewrite   2498980.88    1800645.16
                 Read   3957026.38    5877043.75
              Re-read   3950997.38    5861847.00
         Reverse Read   2937114.56    5047384.00
          Stride read   2948163.19    4929587.38
          Random read   3292692.69    4880793.62
       Mixed workload   1545602.62    3502940.38
         Random write   2448039.75    1758786.25
               Pwrite   1670051.03    1338329.69
                Pread   2530682.00    5097177.62
               Fwrite   3232085.62    3275942.56
                Fread   6306880.25    6645271.12
      
      So on my system LZ4 is slower in write-only tests, while it performs
      better in read-only and mixed (reads + writes) tests.
      
      Official LZ4 benchmarks available here http://code.google.com/p/lz4/
      (linux kernel uses revision r90).
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6e76668e
    • Sergey Senozhatsky's avatar
      zram: make compression algorithm selection possible · e46b8a03
      Sergey Senozhatsky authored
      Add and document the `comp_algorithm' device attribute.  This attribute
      allows the user to show the supported compression algorithms together
      with the currently selected one:
      
      	cat /sys/block/zram0/comp_algorithm
      	[lzo] lz4
      
      and to change the selected compression algorithm:
      
      	echo lzo > /sys/block/zram0/comp_algorithm
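      
      A small sketch of how output in the "[lzo] lz4" style could be
      produced.  The backend list and sketch_available_show() are
      assumptions made for this example; the real set of algorithms depends
      on the kernel configuration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed list of compiled-in backends; the real set depends on the config. */
static const char *const sketch_backends[] = { "lzo", "lz4" };

/*
 * Write all backend names into buf, wrapping the selected one in brackets,
 * e.g. "[lzo] lz4\n". Returns the number of bytes stored (excluding NUL).
 */
static size_t sketch_available_show(const char *selected, char *buf, size_t size)
{
	size_t used = 0;
	size_t n = sizeof(sketch_backends) / sizeof(sketch_backends[0]);

	assert(size > 0);
	buf[0] = '\0';

	for (size_t i = 0; i < n && used < size - 1; i++) {
		const char *name = sketch_backends[i];
		const char *fmt = strcmp(name, selected) == 0 ? "[%s] " : "%s ";
		int w = snprintf(buf + used, size - used, fmt, name);

		if (w < 0)
			break;
		/* snprintf reports the untruncated length; clamp to what fits */
		used += (size_t)w < size - used ? (size_t)w : size - used - 1;
	}

	/* replace the trailing space with a newline */
	if (used > 0)
		buf[used - 1] = '\n';
	return used;
}
```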
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e46b8a03
    • Sergey Senozhatsky's avatar
      zram: add set_max_streams knob · fe8eb122
      Sergey Senozhatsky authored
      This patch allows max_comp_streams to be changed on an initialised zcomp.
      
      Introduce the zcomp set_max_streams() knob, with
      zcomp_strm_multi_set_max_streams() and zcomp_strm_single_set_max_streams()
      callbacks to change the stream limit for zcomp_strm_multi and
      zcomp_strm_single, respectively.  set_max_streams for a single-stream
      zcomp does nothing.
      
      If the user has lowered the limit, zcomp_strm_multi_set_max_streams()
      attempts to immediately free extra streams (as many as it can, depending
      on idle stream availability).
      
      Note that this patch does not allow the stream 'policy' to be changed
      from single to multi stream (or vice versa) on an already initialised
      compression backend.
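      
      The "free as many idle streams as possible" behaviour can be sketched
      as below.  struct sketch_pool, its fields and
      sketch_set_max_streams() are hypothetical names for this example, and
      locking is left out to keep it short.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_pool_strm {
	struct sketch_pool_strm *next;
};

/* Hypothetical multi-stream state; locking is omitted to keep this short. */
struct sketch_pool {
	struct sketch_pool_strm *idle;   /* streams not currently in use */
	int avail_strm;                  /* streams allocated so far (idle + busy) */
	int max_strm;                    /* current limit */
};

/*
 * Apply a new limit. If it was lowered, free idle streams until the total
 * fits or the idle list runs dry; busy streams cannot be freed here.
 */
static void sketch_set_max_streams(struct sketch_pool *zs, int num_strm)
{
	assert(num_strm >= 1);
	zs->max_strm = num_strm;

	while (zs->avail_strm > zs->max_strm && zs->idle) {
		struct sketch_pool_strm *s = zs->idle;

		zs->idle = s->next;
		free(s);
		zs->avail_strm--;
	}
}
```

      If every stream is busy when the limit drops, nothing is freed at
      that moment and the pool stays above the new limit for a while.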
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fe8eb122
    • Sergey Senozhatsky's avatar
      zram: add multi stream functionality · beca3ec7
      Sergey Senozhatsky authored
      The existing zram (zcomp) implementation has only one compression stream
      (buffer and algorithm private part), so in order to prevent data
      corruption only one write (compress operation) can use this compression
      stream, forcing all concurrent write operations to wait for the stream
      lock to be released.  This patch changes zcomp to keep a list of
      compression streams of user-defined size (via sysfs device attr).  Each
      write operation still holds a compression stream exclusively; the
      difference is that we can have N write operations (depending on the size
      of the streams list) executing in parallel.  See the TEST section later
      in the commit message for performance data.
      
      Introduce struct zcomp_strm_multi and a set of functions to manage
      zcomp_strm stream access.  zcomp_strm_multi has a list of idle
      zcomp_strm structs, a spinlock to protect the idle list, and a wait
      queue, making it possible to perform parallel compressions.
      
      The following set of functions is added:
      - zcomp_strm_multi_find()/zcomp_strm_multi_release()
        find and release a compression stream, implement required locking
      - zcomp_strm_multi_create()/zcomp_strm_multi_destroy()
        create and destroy zcomp_strm_multi
      
      zcomp ->strm_find() and ->strm_release() callbacks are set during
      initialisation to zcomp_strm_multi_find()/zcomp_strm_multi_release()
      correspondingly.
      
      Each time zcomp issues a zcomp_strm_multi_find() call, the following set
      of operations is performed:
      
      - spin lock strm_lock
      - if the idle list is not empty, remove a zcomp_strm from the idle list,
        spin unlock and return the zcomp stream pointer to the caller
      - if the idle list is empty, the current task adds itself to the wait
        queue; it will be woken by the zcomp_strm_multi_release() caller.
      
      zcomp_strm_multi_release():
      - spin lock strm_lock
      - add zcomp stream to idle list
      - spin unlock, wake up sleeper
      
      Minchan Kim reported that the spinlock-based locking scheme demonstrated
      a severe performance regression for the single compression stream case,
      compared to the mutex-based one (see https://lkml.org/lkml/2014/2/18/16)
      
      base                      spinlock                    mutex
      
      ==Initial write           ==Initial write             ==Initial  write
      records:  5               records:  5                 records:   5
      avg:      1642424.35      avg:      699610.40         avg:       1655583.71
      std:      39890.95(2.43%) std:      232014.19(33.16%) std:       52293.96
      max:      1690170.94      max:      1163473.45        max:       1697164.75
      min:      1568669.52      min:      573429.88         min:       1553410.23
      ==Rewrite                 ==Rewrite                   ==Rewrite
      records:  5               records:  5                 records:   5
      avg:      1611775.39      avg:      501406.64         avg:       1684419.11
      std:      17144.58(1.06%) std:      15354.41(3.06%)   std:       18367.42
      max:      1641800.95      max:      531356.78         max:       1706445.84
      min:      1593515.27      min:      488817.78         min:       1655335.73
      
      When only one compression stream is available, a mutex with spin on owner
      tends to perform much better than frequent wait_event()/wake_up().  This
      is why the single stream is implemented as a special case with mutex
      locking.
      
      Introduce and document zram device attribute max_comp_streams.  This
      attr shows and stores current zcomp's max number of zcomp streams
      (max_strm).  Extend zcomp's zcomp_create() with `max_strm' parameter.
      `max_strm' limits the number of zcomp_strm structs in compression
      backend's idle list (max_comp_streams).
      
      max_comp_streams is used during initialisation as follows:
      -- passing max_strm equal to 1 to zcomp_create() will initialise zcomp
      using the single compression stream zcomp_strm_single (mutex-based
      locking).
      -- passing max_strm greater than 1 to zcomp_create() will initialise
      zcomp using the multi compression stream zcomp_strm_multi
      (spinlock-based locking).
      
      The default max_comp_streams value is 1, meaning that zram will be
      initialised with a single stream.
      
      A later patch will introduce a configuration knob to change
      max_comp_streams on an already initialised and used zcomp.
      
      TEST
      iozone -t 3 -R -r 16K -s 60M -I +Z
      
             test           base       1 strm (mutex)     3 strm (spinlock)
      -----------------------------------------------------------------------
       Initial write      589286.78       583518.39          718011.05
             Rewrite      604837.97       596776.38         1515125.72
        Random write      584120.11       595714.58         1388850.25
              Pwrite      535731.17       541117.38          739295.27
              Fwrite     1418083.88      1478612.72         1484927.06
      
      Usage example:
      set max_comp_streams to 4
              echo 4 > /sys/block/zram0/max_comp_streams
      
      show current max_comp_streams (default value is 1).
              cat /sys/block/zram0/max_comp_streams
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      beca3ec7
    • Sergey Senozhatsky's avatar
      zram: factor out single stream compression · 9cc97529
      Sergey Senozhatsky authored
      This is a preparation patch for adding multi stream support to zcomp.
      
      Introduce struct zcomp_strm_single and a set of functions to manage
      zcomp_strm stream access.  zcomp_strm_single implements a single
      compression stream, the same way as the current zcomp implementation.
      This moves zcomp_strm stream control and locking out of zcomp, so the
      compressing backend zcomp is not aware of the required locking.
      
      Single and multi streams require different locking schemes.  Minchan Kim
      reported that the spinlock-based locking scheme (which is used in the
      multi stream implementation) demonstrated a severe performance regression
      for the single compression stream case, compared to the mutex-based one.
      See https://lkml.org/lkml/2014/2/18/16
      
      The following set of functions is added:
      - zcomp_strm_single_find()/zcomp_strm_single_release()
        find and release a compression stream, implement required locking
      - zcomp_strm_single_create()/zcomp_strm_single_destroy()
        create and destroy zcomp_strm_single
      
      New ->strm_find() and ->strm_release() callbacks added to zcomp, which are
      set to zcomp_strm_single_find() and zcomp_strm_single_release() during
      initialisation.  Instead of direct locking and zcomp_strm access from
      zcomp_strm_find() and zcomp_strm_release(), zcomp now calls ->strm_find()
      and ->strm_release() correspondingly.
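      
      A user-space sketch of the callback indirection, with a pthread mutex
      standing in for the kernel mutex.  The struct layouts and sketch_*
      names are assumptions for illustration; only the ->strm_find() and
      ->strm_release() hook names come from the text above.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct sketch_strm {
	int id;
};

/* Backend front end: knows nothing about locking, only calls the hooks. */
struct sketch_comp {
	void *stream;   /* policy-private data */
	struct sketch_strm *(*strm_find)(struct sketch_comp *comp);
	void (*strm_release)(struct sketch_comp *comp, struct sketch_strm *s);
};

/* Single-stream policy: one stream guarded by one mutex. */
struct sketch_single {
	pthread_mutex_t lock;
	struct sketch_strm strm;
};

static struct sketch_strm *sketch_single_find(struct sketch_comp *comp)
{
	struct sketch_single *zs = comp->stream;

	pthread_mutex_lock(&zs->lock);   /* held until release */
	return &zs->strm;
}

static void sketch_single_release(struct sketch_comp *comp,
				  struct sketch_strm *s)
{
	struct sketch_single *zs = comp->stream;

	assert(s == &zs->strm);
	pthread_mutex_unlock(&zs->lock);
}

/* Install the single-stream policy into the front end. */
static void sketch_single_create(struct sketch_comp *comp,
				 struct sketch_single *zs)
{
	pthread_mutex_init(&zs->lock, NULL);
	zs->strm.id = 0;
	comp->stream = zs;
	comp->strm_find = sketch_single_find;
	comp->strm_release = sketch_single_release;
}

/* Generic entry points simply forward to whichever policy was installed. */
static struct sketch_strm *sketch_strm_find(struct sketch_comp *comp)
{
	return comp->strm_find(comp);
}

static void sketch_strm_release(struct sketch_comp *comp, struct sketch_strm *s)
{
	comp->strm_release(comp, s);
}
```

      A multi-stream policy would install a different pair of hooks, and
      the generic entry points would stay unchanged.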
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9cc97529
    • Sergey Senozhatsky's avatar
      zram: use zcomp compressing backends · b7ca232e
      Sergey Senozhatsky authored
      Do not perform direct LZO compress/decompress calls, initialise
      and use zcomp LZO backend (single compression stream) instead.
      
      [akpm@linux-foundation.org: resolve conflicts with zram-delete-zram_init_device-fix.patch]
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b7ca232e
    • Sergey Senozhatsky's avatar
      zram: introduce compressing backend abstraction · e7e1ef43
      Sergey Senozhatsky authored
      ZRAM performs direct LZO compression algorithm calls, making it the one
      and only option.  While LZO generally performs well, the LZ4 algorithm
      tends to decompress faster (see http://code.google.com/p/lz4/ for the
      full report)
      
      	Name            Ratio  C.speed D.speed
      	                        MB/s    MB/s
      	LZ4 (r101)      2.084    422    1820
      	LZO 2.06        2.106    414     600
      
      Thus, users who have mostly-read (decompress) usage scenarios or mixed
      workloads (writes with a relatively high number of read ops) will
      benefit from using the LZ4 compression backend.
      
      Introduce compressing backend abstraction zcomp in order to support
      multiple compression algorithms with the following set of operations:
      
              .create
              .destroy
              .compress
              .decompress
      
      Schematically zram write() usually contains the following steps:
      0) preparation (decompression of partial IO, etc.)
      1) lock buffer_lock mutex (protects meta compress buffers)
      2) compress (using meta compress buffers)
      3) alloc and map zs_pool object
      4) copy compressed data (from meta compress buffers) to object allocated by 3)
      5) free previous pool page, assign a new one
      6) unlock buffer_lock mutex
      
      As we can see, the compressing buffers must remain untouched from 1) to
      4), because otherwise a concurrent write() can overwrite the data.  At
      the same time, zram_meta must be aware of a) the specific compression
      algorithm's memory requirements and b) the locking necessary to protect
      the compression buffers.  To remove requirement a), a new struct
      zcomp_strm is introduced, which contains a compress/decompress `buffer'
      and a compression algorithm `private' part.  Struct zcomp, in turn,
      implements zcomp_strm stream handling and locking, which removes
      requirement b) from zram meta.  zcomp ->create() and ->destroy(),
      respectively, allocate and deallocate the algorithm-specific zcomp_strm
      `private' part.
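      
      A backend operations table of this kind might look roughly like the
      sketch below.  The function signatures, the lookup helper and the toy
      "store" backend (which copies bytes unchanged and stands in for a
      real algorithm such as LZO) are all assumptions of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Operations every backend provides, mirroring the list in the text. */
struct sketch_backend {
	const char *name;
	void *(*create)(void);            /* allocate algorithm-private data */
	void (*destroy)(void *private);
	int (*compress)(const unsigned char *src, size_t src_len,
			unsigned char *dst, size_t *dst_len, void *private);
	int (*decompress)(const unsigned char *src, size_t src_len,
			  unsigned char *dst, size_t *dst_len);
};

/* Toy "store" backend: copies bytes unchanged; a stand-in, not LZO or LZ4. */
static void *store_create(void)
{
	return calloc(1, 1);
}

static void store_destroy(void *private)
{
	free(private);
}

static int store_copy(const unsigned char *src, size_t src_len,
		      unsigned char *dst, size_t *dst_len)
{
	if (*dst_len < src_len)
		return -1;              /* destination too small */
	memcpy(dst, src, src_len);
	*dst_len = src_len;
	return 0;
}

static int store_compress(const unsigned char *src, size_t src_len,
			  unsigned char *dst, size_t *dst_len, void *private)
{
	(void)private;              /* the toy backend needs no working memory */
	return store_copy(src, src_len, dst, dst_len);
}

/* Only the toy backend is registered in this sketch. */
static const struct sketch_backend sketch_backends[] = {
	{ "store", store_create, store_destroy, store_compress, store_copy },
};

/* Look a backend up by name; NULL means the algorithm is not supported. */
static const struct sketch_backend *sketch_find_backend(const char *name)
{
	size_t n = sizeof(sketch_backends) / sizeof(sketch_backends[0]);

	for (size_t i = 0; i < n; i++)
		if (strcmp(sketch_backends[i].name, name) == 0)
			return &sketch_backends[i];
	return NULL;
}
```

      Adding another algorithm then means adding one more table entry,
      while callers keep going through the same four operations.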
      
      Every zcomp has a zcomp stream and a mutex to protect that compression
      stream.  Stream usage semantics remain the same -- only one write can
      hold the stream lock and use its buffers.  zcomp_strm_find() turns the
      caller into the exclusive user of a stream (holding the stream mutex
      until zram releases the stream), and zcomp_strm_release() makes the zcomp
      stream available again (unlocks the stream mutex).  Hence no concurrent
      write (compression) operations are possible at the moment.
      
      iozone -t 3 -R -r 16K -s 60M -I +Z
      
             test            base           patched
      --------------------------------------------------
        Initial write      597992.91       591660.58
              Rewrite      609674.34       616054.97
                 Read     2404771.75      2452909.12
              Re-read     2459216.81      2470074.44
         Reverse Read     1652769.66      1589128.66
          Stride read     2202441.81      2202173.31
          Random read     2236311.47      2276565.31
       Mixed workload     1423760.41      1709760.06
         Random write      579584.08       615933.86
               Pwrite      597550.02       594933.70
                Pread     1703672.53      1718126.72
               Fwrite     1330497.06      1461054.00
                Fread     3922851.00      3957242.62
      
      Usage examples:
      
      	comp = zcomp_create(NAME) /* NAME e.g. "lzo" */
      
      which initialises compressing backend if requested algorithm is supported.
      
      Compress:
      	zstrm = zcomp_strm_find(comp)
      	zcomp_compress(comp, zstrm, src, &dst_len)
      	[..] /* copy compressed data */
      	zcomp_strm_release(comp, zstrm)
      
      Decompress:
      	zcomp_decompress(comp, src, src_len, dst);
      
      Free compressing backend and its zcomp stream:
      	zcomp_destroy(comp)
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e7e1ef43
    • Sergey Senozhatsky's avatar
      zram: delete zram_init_device() · b67d1ec1
      Sergey Senozhatsky authored
      Allocate a new `zram_meta' in disksize_store() only for an uninitialised
      zram device, saving a number of allocations and deallocations if
      disksize_store() was called on a device that is currently in use.  At
      the same time, the zram_meta stack variable is not necessary, because we
      can set ->meta directly.  There is also no need to set the
      QUEUE_FLAG_NONROT queue flag on every disksize_store(); set it once
      during device creation.
      
      [minchan@kernel.org: handle zram->meta alloc fail case]
      [minchan@kernel.org: prevent lockdep spew of init_lock]
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Jerome Marchand <jmarchan@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b67d1ec1
    • Sergey Senozhatsky's avatar
      zram: document failed_reads, failed_writes stats · 8dd1d324
      Sergey Senozhatsky authored
      Document `failed_reads' and `failed_writes' device attributes.
      Remove info about `discard' - there is no such zram attr.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8dd1d324
    • Sergey Senozhatsky's avatar
      zram: move zram size warning to documentation · e64cd51d
      Sergey Senozhatsky authored
      Move the zram warning about the correlation between disksize and memory
      size to the zram documentation.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e64cd51d
    • Sergey Senozhatsky's avatar
      zram: drop not used table `count' member · 59fc86a4
      Sergey Senozhatsky authored
      struct table `count' member is not used.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      59fc86a4
    • Sergey Senozhatsky's avatar
      zram: report failed read and write stats · 64447249
      Sergey Senozhatsky authored
      zram accounted for, but did not report, the numbers of failed read and
      write queries.  Make these stats available as the failed_reads and
      failed_writes attrs.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64447249