1. 30 Dec, 2015 22 commits
  2. 27 Nov, 2015 18 commits
    • Ben Hutchings's avatar
      Linux 3.2.74 · 95afdc9f
      Ben Hutchings authored
      95afdc9f
    • Christophe Leroy's avatar
      splice: sendfile() at once fails for big files · fcb27817
      Christophe Leroy authored
      commit 0ff28d9f upstream.
      
      Using sendfile with below small program to get MD5 sums of some files,
      it appear that big files (over 64kbytes with 4k pages system) get a
      wrong MD5 sum while small files get the correct sum.
      This program uses sendfile() to send a file to an AF_ALG socket
      for hashing.
      
      /* md5sum2.c */
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <string.h>
      #include <fcntl.h>
      #include <sys/socket.h>
      #include <sys/stat.h>
      #include <sys/types.h>
      #include <linux/if_alg.h>
      
      int main(int argc, char **argv)
      {
      	int sk = socket(AF_ALG, SOCK_SEQPACKET, 0);
      	struct stat st;
      	struct sockaddr_alg sa = {
      		.salg_family = AF_ALG,
      		.salg_type = "hash",
      		.salg_name = "md5",
      	};
      	int n;
      
      	bind(sk, (struct sockaddr*)&sa, sizeof(sa));
      
      	for (n = 1; n < argc; n++) {
      		int size;
      		int offset = 0;
      		char buf[4096];
      		int fd;
      		int sko;
      		int i;
      
      		fd = open(argv[n], O_RDONLY);
      		sko = accept(sk, NULL, 0);
      		fstat(fd, &st);
      		size = st.st_size;
      		sendfile(sko, fd, &offset, size);
      		size = read(sko, buf, sizeof(buf));
      		for (i = 0; i < size; i++)
      			printf("%2.2x", buf[i]);
      		printf("  %s\n", argv[n]);
      		close(fd);
      		close(sko);
      	}
      	exit(0);
      }
      
      Test below is done using official linux patch files. First result is
      with a software based md5sum. Second result is with the program above.
      
      root@vgoip:~# ls -l patch-3.6.*
      -rw-r--r--    1 root     root         64011 Aug 24 12:01 patch-3.6.2.gz
      -rw-r--r--    1 root     root         94131 Aug 24 12:01 patch-3.6.3.gz
      
      root@vgoip:~# md5sum patch-3.6.*
      b3ffb9848196846f31b2ff133d2d6443  patch-3.6.2.gz
      c5e8f687878457db77cb7158c38a7e43  patch-3.6.3.gz
      
      root@vgoip:~# ./md5sum2 patch-3.6.*
      b3ffb9848196846f31b2ff133d2d6443  patch-3.6.2.gz
      5fd77b24e68bb24dcc72d6e57c64790e  patch-3.6.3.gz
      
      After investivation, it appears that sendfile() sends the files by blocks
      of 64kbytes (16 times PAGE_SIZE). The problem is that at the end of each
      block, the SPLICE_F_MORE flag is missing, therefore the hashing operation
      is reset as if it was the end of the file.
      
      This patch adds SPLICE_F_MORE to the flags when more data is pending.
      
      With the patch applied, we get the correct sums:
      
      root@vgoip:~# md5sum patch-3.6.*
      b3ffb9848196846f31b2ff133d2d6443  patch-3.6.2.gz
      c5e8f687878457db77cb7158c38a7e43  patch-3.6.3.gz
      
      root@vgoip:~# ./md5sum2 patch-3.6.*
      b3ffb9848196846f31b2ff133d2d6443  patch-3.6.2.gz
      c5e8f687878457db77cb7158c38a7e43  patch-3.6.3.gz
      Signed-off-by: default avatarChristophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      fcb27817
    • Eric Dumazet's avatar
      net: avoid NULL deref in inet_ctl_sock_destroy() · f79c83d6
      Eric Dumazet authored
      [ Upstream commit 8fa677d2 ]
      
      Under low memory conditions, tcp_sk_init() and icmp_sk_init()
      can both iterate on all possible cpus and call inet_ctl_sock_destroy(),
      with eventual NULL pointer.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      f79c83d6
    • Ani Sinha's avatar
      ipmr: fix possible race resulting from improper usage of IP_INC_STATS_BH() in preemptible context. · 33cf84ba
      Ani Sinha authored
      [ Upstream commit 44f49dd8 ]
      
      Fixes the following kernel BUG :
      
      BUG: using __this_cpu_add() in preemptible [00000000] code: bash/2758
      caller is __this_cpu_preempt_check+0x13/0x15
      CPU: 0 PID: 2758 Comm: bash Tainted: P           O   3.18.19 #2
       ffffffff8170eaca ffff880110d1b788 ffffffff81482b2a 0000000000000000
       0000000000000000 ffff880110d1b7b8 ffffffff812010ae ffff880007cab800
       ffff88001a060800 ffff88013a899108 ffff880108b84240 ffff880110d1b7c8
      Call Trace:
      [<ffffffff81482b2a>] dump_stack+0x52/0x80
      [<ffffffff812010ae>] check_preemption_disabled+0xce/0xe1
      [<ffffffff812010d4>] __this_cpu_preempt_check+0x13/0x15
      [<ffffffff81419d60>] ipmr_queue_xmit+0x647/0x70c
      [<ffffffff8141a154>] ip_mr_forward+0x32f/0x34e
      [<ffffffff8141af76>] ip_mroute_setsockopt+0xe03/0x108c
      [<ffffffff810553fc>] ? get_parent_ip+0x11/0x42
      [<ffffffff810e6974>] ? pollwake+0x4d/0x51
      [<ffffffff81058ac0>] ? default_wake_function+0x0/0xf
      [<ffffffff810553fc>] ? get_parent_ip+0x11/0x42
      [<ffffffff810613d9>] ? __wake_up_common+0x45/0x77
      [<ffffffff81486ea9>] ? _raw_spin_unlock_irqrestore+0x1d/0x32
      [<ffffffff810618bc>] ? __wake_up_sync_key+0x4a/0x53
      [<ffffffff8139a519>] ? sock_def_readable+0x71/0x75
      [<ffffffff813dd226>] do_ip_setsockopt+0x9d/0xb55
      [<ffffffff81429818>] ? unix_seqpacket_sendmsg+0x3f/0x41
      [<ffffffff813963fe>] ? sock_sendmsg+0x6d/0x86
      [<ffffffff813959d4>] ? sockfd_lookup_light+0x12/0x5d
      [<ffffffff8139650a>] ? SyS_sendto+0xf3/0x11b
      [<ffffffff810d5738>] ? new_sync_read+0x82/0xaa
      [<ffffffff813ddd19>] compat_ip_setsockopt+0x3b/0x99
      [<ffffffff813fb24a>] compat_raw_setsockopt+0x11/0x32
      [<ffffffff81399052>] compat_sock_common_setsockopt+0x18/0x1f
      [<ffffffff813c4d05>] compat_SyS_setsockopt+0x1a9/0x1cf
      [<ffffffff813c4149>] compat_SyS_socketcall+0x180/0x1e3
      [<ffffffff81488ea1>] cstar_dispatch+0x7/0x1e
      Signed-off-by: default avatarAni Sinha <ani@arista.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [bwh: Backported to 3.2: ipmr doesn't implement IPSTATS_MIB_OUTOCTETS]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      33cf84ba
    • Sowmini Varadhan's avatar
      RDS-TCP: Recover correctly from pskb_pull()/pksb_trim() failure in rds_tcp_data_recv · f114d937
      Sowmini Varadhan authored
      [ Upstream commit 8ce675ff ]
      
      Either of pskb_pull() or pskb_trim() may fail under low memory conditions.
      If rds_tcp_data_recv() ignores such failures, the application will
      receive corrupted data because the skb has not been correctly
      carved to the RDS datagram size.
      
      Avoid this by handling pskb_pull/pskb_trim failure in the same
      manner as the skb_clone failure: bail out of rds_tcp_data_recv(), and
      retry via the deferred call to rds_send_worker() that gets set up on
      ENOMEM from rds_tcp_read_sock()
      Signed-off-by: default avatarSowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: default avatarSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      f114d937
    • Dan Carpenter's avatar
      irda: precedence bug in irlmp_seq_hb_idx() · a8ab3e63
      Dan Carpenter authored
      [ Upstream commit 50010c20 ]
      
      This is decrementing the pointer, instead of the value stored in the
      pointer.  KASan detects it as an out of bounds reference.
      Reported-by: default avatar"Berry Cheng 程君(成淼)" <chengmiao.cj@alibaba-inc.com>
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      a8ab3e63
    • Jann Horn's avatar
      fs: if a coredump already exists, unlink and recreate with O_EXCL · b208d27d
      Jann Horn authored
      commit fbb18169 upstream.
      
      It was possible for an attacking user to trick root (or another user) into
      writing his coredumps into an attacker-readable, pre-existing file using
      rename() or link(), causing the disclosure of secret data from the victim
      process' virtual memory.  Depending on the configuration, it was also
      possible to trick root into overwriting system files with coredumps.  Fix
      that issue by never writing coredumps into existing files.
      
      Requirements for the attack:
       - The attack only applies if the victim's process has a nonzero
         RLIMIT_CORE and is dumpable.
       - The attacker can trick the victim into coredumping into an
         attacker-writable directory D, either because the core_pattern is
         relative and the victim's cwd is attacker-writable or because an
         absolute core_pattern pointing to a world-writable directory is used.
       - The attacker has one of these:
        A: on a system with protected_hardlinks=0:
           execute access to a folder containing a victim-owned,
           attacker-readable file on the same partition as D, and the
           victim-owned file will be deleted before the main part of the attack
           takes place. (In practice, there are lots of files that fulfill
           this condition, e.g. entries in Debian's /var/lib/dpkg/info/.)
           This does not apply to most Linux systems because most distros set
           protected_hardlinks=1.
        B: on a system with protected_hardlinks=1:
           execute access to a folder containing a victim-owned,
           attacker-readable and attacker-writable file on the same partition
           as D, and the victim-owned file will be deleted before the main part
           of the attack takes place.
           (This seems to be uncommon.)
        C: on any system, independent of protected_hardlinks:
           write access to a non-sticky folder containing a victim-owned,
           attacker-readable file on the same partition as D
           (This seems to be uncommon.)
      
      The basic idea is that the attacker moves the victim-owned file to where
      he expects the victim process to dump its core.  The victim process dumps
      its core into the existing file, and the attacker reads the coredump from
      it.
      
      If the attacker can't move the file because he does not have write access
      to the containing directory, he can instead link the file to a directory
      he controls, then wait for the original link to the file to be deleted
      (because the kernel checks that the link count of the corefile is 1).
      
      A less reliable variant that requires D to be non-sticky works with link()
      and does not require deletion of the original link: link() the file into
      D, but then unlink() it directly before the kernel performs the link count
      check.
      
      On systems with protected_hardlinks=0, this variant allows an attacker to
      not only gain information from coredumps, but also clobber existing,
      victim-writable files with coredumps.  (This could theoretically lead to a
      privilege escalation.)
      Signed-off-by: default avatarJann Horn <jann@thejh.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [bwh: Backported to 3.2: adjust filename, context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      b208d27d
    • Kees Cook's avatar
      fs: make dumpable=2 require fully qualified path · 0677d4e0
      Kees Cook authored
      commit 9520628e upstream.
      
      When the suid_dumpable sysctl is set to "2", and there is no core dump
      pipe defined in the core_pattern sysctl, a local user can cause core files
      to be written to root-writable directories, potentially with
      user-controlled content.
      
      This means an admin can unknowningly reintroduce a variation of
      CVE-2006-2451, allowing local users to gain root privileges.
      
        $ cat /proc/sys/fs/suid_dumpable
        2
        $ cat /proc/sys/kernel/core_pattern
        core
        $ ulimit -c unlimited
        $ cd /
        $ ls -l core
        ls: cannot access core: No such file or directory
        $ touch core
        touch: cannot touch `core': Permission denied
        $ OHAI="evil-string-here" ping localhost >/dev/null 2>&1 &
        $ pid=$!
        $ sleep 1
        $ kill -SEGV $pid
        $ ls -l core
        -rw------- 1 root kees 458752 Jun 21 11:35 core
        $ sudo strings core | grep evil
        OHAI=evil-string-here
      
      While cron has been fixed to abort reading a file when there is any
      parse error, there are still other sensitive directories that will read
      any file present and skip unparsable lines.
      
      Instead of introducing a suid_dumpable=3 mode and breaking all users of
      mode 2, this only disables the unsafe portion of mode 2 (writing to disk
      via relative path).  Most users of mode 2 (e.g.  Chrome OS) already use
      a core dump pipe handler, so this change will not break them.  For the
      situations where a pipe handler is not defined but mode 2 is still
      active, crash dumps will only be written to fully qualified paths.  If a
      relative path is defined (e.g.  the default "core" pattern), dump
      attempts will trigger a printk yelling about the lack of a fully
      qualified path.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      Reviewed-by: default avatarJames Morris <james.l.morris@oracle.com>
      0677d4e0
    • Maciej W. Rozycki's avatar
      binfmt_elf: Don't clobber passed executable's file header · beebd9fa
      Maciej W. Rozycki authored
      commit b582ef5c upstream.
      
      Do not clobber the buffer space passed from `search_binary_handler' and
      originally preloaded by `prepare_binprm' with the executable's file
      header by overwriting it with its interpreter's file header.  Instead
      keep the buffer space intact and directly use the data structure locally
      allocated for the interpreter's file header, fixing a bug introduced in
      2.1.14 with loadable module support (linux-mips.org commit beb11695
      [Import of Linux/MIPS 2.1.14], predating kernel.org repo's history).
      Adjust the amount of data read from the interpreter's file accordingly.
      
      This was not an issue before loadable module support, because back then
      `load_elf_binary' was executed only once for a given ELF executable,
      whether the function succeeded or failed.
      
      With loadable module support supported and enabled, upon a failure of
      `load_elf_binary' -- which may for example be caused by architecture
      code rejecting an executable due to a missing hardware feature requested
      in the file header -- a module load is attempted and then the function
      reexecuted by `search_binary_handler'.  With the executable's file
      header replaced with its interpreter's file header the executable can
      then be erroneously accepted in this subsequent attempt.
      Signed-off-by: default avatarMaciej W. Rozycki <macro@imgtec.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      beebd9fa
    • David Howells's avatar
      FS-Cache: Handle a write to the page immediately beyond the EOF marker · cf185723
      David Howells authored
      commit 102f4d90 upstream.
      
      Handle a write being requested to the page immediately beyond the EOF
      marker on a cache object.  Currently this gets an assertion failure in
      CacheFiles because the EOF marker is used there to encode information about
      a partial page at the EOF - which could lead to an unknown blank spot in
      the file if we extend the file over it.
      
      The problem is actually in fscache where we check the index of the page
      being written against store_limit.  store_limit is set to the number of
      pages that we're allowed to store by fscache_set_store_limit() - which
      means it's one more than the index of the last page we're allowed to store.
      The problem is that we permit writing to a page with an index _equal_ to
      the store limit - when we should reject that case.
      
      Whilst we're at it, change the triggered assertion in CacheFiles to just
      return -ENOBUFS instead.
      
      The assertion failure looks something like this:
      
      CacheFiles: Assertion failed
      1000 < 7b1 is false
      ------------[ cut here ]------------
      kernel BUG at fs/cachefiles/rdwr.c:962!
      ...
      RIP: 0010:[<ffffffffa02c9e83>]  [<ffffffffa02c9e83>] cachefiles_write_page+0x273/0x2d0 [cachefiles]
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      [bwh: Backported to 3.2: we don't have __kernel_write() so keep using the
       open-coded equivalent]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      cf185723
    • Kinglong Mee's avatar
      FS-Cache: Don't override netfs's primary_index if registering failed · 8f2746bb
      Kinglong Mee authored
      commit b130ed59 upstream.
      
      Only override netfs->primary_index when registering success.
      Signed-off-by: default avatarKinglong Mee <kinglongmee@gmail.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      [bwh: Backported to 3.2: no n_active or flags fields in fscache_cookie]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      8f2746bb
    • Kinglong Mee's avatar
      FS-Cache: Increase reference of parent after registering, netfs success · bdb28a40
      Kinglong Mee authored
      commit 86108c2e upstream.
      
      If netfs exist, fscache should not increase the reference of parent's
      usage and n_children, otherwise, never be decreased.
      
      v2: thanks David's suggest,
       move increasing reference of parent if success
       use kmem_cache_free() freeing primary_index directly
      
      v3: don't move "netfs->primary_index->parent = &fscache_fsdef_index;"
      Signed-off-by: default avatarKinglong Mee <kinglongmee@gmail.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      bdb28a40
    • Paolo Bonzini's avatar
      KVM: svm: unconditionally intercept #DB · b42506c6
      Paolo Bonzini authored
      commit cbdb967a upstream.
      
      This is needed to avoid the possibility that the guest triggers
      an infinite stream of #DB exceptions (CVE-2015-8104).
      
      VMX is not affected: because it does not save DR6 in the VMCS,
      it already intercepts #DB unconditionally.
      Reported-by: default avatarJan Beulich <jbeulich@suse.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      [bwh: Backported to 3.2, with thanks to Paolo:
       - update_db_bp_intercept() was called update_db_intercept()
       - The remaining call is in svm_guest_debug() rather than through svm_x86_ops]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      b42506c6
    • Eric Dumazet's avatar
      net: fix a race in dst_release() · 1a513170
      Eric Dumazet authored
      commit d69bbf88 upstream.
      
      Only cpu seeing dst refcount going to 0 can safely
      dereference dst->flags.
      
      Otherwise an other cpu might already have freed the dst.
      
      Fixes: 27b75c95 ("net: avoid RCU for NOCACHE dst")
      Reported-by: default avatarGreg Thelen <gthelen@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      [bwh: Backported to 3.2: adjust context]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      1a513170
    • Filipe Manana's avatar
      Btrfs: fix race when listing an inode's xattrs · 7abbc81b
      Filipe Manana authored
      commit f1cd1f0b upstream.
      
      When listing a inode's xattrs we have a time window where we race against
      a concurrent operation for adding a new hard link for our inode that makes
      us not return any xattr to user space. In order for this to happen, the
      first xattr of our inode needs to be at slot 0 of a leaf and the previous
      leaf must still have room for an inode ref (or extref) item, and this can
      happen because an inode's listxattrs callback does not lock the inode's
      i_mutex (nor does the VFS does it for us), but adding a hard link to an
      inode makes the VFS lock the inode's i_mutex before calling the inode's
      link callback.
      
      If we have the following leafs:
      
                     Leaf X (has N items)                    Leaf Y
      
       [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 XATTR_ITEM 12345), ... ]
                 slot N - 2         slot N - 1              slot 0
      
      The race illustrated by the following sequence diagram is possible:
      
             CPU 1                                               CPU 2
      
        btrfs_listxattr()
      
          searches for key (257 XATTR_ITEM 0)
      
          gets path with path->nodes[0] == leaf X
          and path->slots[0] == N
      
          because path->slots[0] is >=
          btrfs_header_nritems(leaf X), it calls
          btrfs_next_leaf()
      
          btrfs_next_leaf()
            releases the path
      
                                                         adds key (257 INODE_REF 666)
                                                         to the end of leaf X (slot N),
                                                         and leaf X now has N + 1 items
      
            searches for the key (257 INODE_REF 256),
            with path->keep_locks == 1, because that
            is the last key it saw in leaf X before
            releasing the path
      
            ends up at leaf X again and it verifies
            that the key (257 INODE_REF 256) is no
            longer the last key in leaf X, so it
            returns with path->nodes[0] == leaf X
            and path->slots[0] == N, pointing to
            the new item with key (257 INODE_REF 666)
      
          btrfs_listxattr's loop iteration sees that
          the type of the key pointed by the path is
          different from the type BTRFS_XATTR_ITEM_KEY
          and so it breaks the loop and stops looking
          for more xattr items
            --> the application doesn't get any xattr
                listed for our inode
      
      So fix this by breaking the loop only if the key's type is greater than
      BTRFS_XATTR_ITEM_KEY and skip the current key if its type is smaller.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      [bwh: Backported to 3.2: old code used the trivial accessor btrfs_key_type()]
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      7abbc81b
    • Peter Oberparleiter's avatar
      scsi_sysfs: Fix queue_ramp_up_period return code · 74d26be3
      Peter Oberparleiter authored
      commit 863e02d0 upstream.
      
      Writing a number to /sys/bus/scsi/devices/<sdev>/queue_ramp_up_period
      returns the value of that number instead of the number of bytes written.
      This behavior can confuse programs expecting POSIX write() semantics.
      Fix this by returning the number of bytes written instead.
      Signed-off-by: default avatarPeter Oberparleiter <oberpar@linux.vnet.ibm.com>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.de>
      Reviewed-by: default avatarMatthew R. Ochs <mrochs@linux.vnet.ibm.com>
      Reviewed-by: default avatarEwan D. Milne <emilne@redhat.com>
      Signed-off-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      74d26be3
    • Peter Zijlstra's avatar
      perf: Fix inherited events vs. tracepoint filters · 4edb9551
      Peter Zijlstra authored
      commit b71b437e upstream.
      
      Arnaldo reported that tracepoint filters seem to misbehave (ie. not
      apply) on inherited events.
      
      The fix is obvious; filters are only set on the actual (parent)
      event, use the normal pattern of using this parent event for filters.
      This is safe because each child event has a reference to it.
      Reported-by: default avatarArnaldo Carvalho de Melo <acme@kernel.org>
      Tested-by: default avatarArnaldo Carvalho de Melo <acme@kernel.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Frédéric Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: http://lkml.kernel.org/r/20151102095051.GN17308@twins.programming.kicks-ass.netSigned-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      4edb9551
    • Filipe Manana's avatar
      Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow · f5daa8cd
      Filipe Manana authored
      commit 1d512cb7 upstream.
      
      If we are using the NO_HOLES feature, we have a tiny time window when
      running delalloc for a nodatacow inode where we can race with a concurrent
      link or xattr add operation leading to a BUG_ON.
      
      This happens because at run_delalloc_nocow() we end up casting a leaf item
      of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a
      file extent item (struct btrfs_file_extent_item) and then analyse its
      extent type field, which won't match any of the expected extent types
      (values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an
      explicit BUG_ON(1).
      
      The following sequence diagram shows how the race happens when running a
      no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following
      neighbour leafs:
      
                   Leaf X (has N items)                    Leaf Y
      
       [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 EXTENT_DATA 8192), ... ]
                    slot N - 2         slot N - 1              slot 0
      
       (Note the implicit hole for inode 257 regarding the [0, 8K[ range)
      
             CPU 1                                         CPU 2
      
       run_dealloc_nocow()
         btrfs_lookup_file_extent()
           --> searches for a key with value
               (257 EXTENT_DATA 4096) in the
               fs/subvol tree
           --> returns us a path with
               path->nodes[0] == leaf X and
               path->slots[0] == N
      
         because path->slots[0] is >=
         btrfs_header_nritems(leaf X), it
         calls btrfs_next_leaf()
      
         btrfs_next_leaf()
           --> releases the path
      
                                                    hard link added to our inode,
                                                    with key (257 INODE_REF 500)
                                                    added to the end of leaf X,
                                                    so leaf X now has N + 1 keys
      
           --> searches for the key
               (257 INODE_REF 256), because
               it was the last key in leaf X
               before it released the path,
               with path->keep_locks set to 1
      
           --> ends up at leaf X again and
               it verifies that the key
               (257 INODE_REF 256) is no longer
               the last key in the leaf, so it
               returns with path->nodes[0] ==
               leaf X and path->slots[0] == N,
               pointing to the new item with
               key (257 INODE_REF 500)
      
         the loop iteration of run_dealloc_nocow()
         does not break out the loop and continues
         because the key referenced in the path
         at path->nodes[0] and path->slots[0] is
         for inode 257, its type is < BTRFS_EXTENT_DATA_KEY
         and its offset (500) is less then our delalloc
         range's end (8192)
      
         the item pointed by the path, an inode reference item,
         is (incorrectly) interpreted as a file extent item and
         we get an invalid extent type, leading to the BUG_ON(1):
      
         if (extent_type == BTRFS_FILE_EXTENT_REG ||
            extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
             (...)
         } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
             (...)
         } else {
             BUG_ON(1)
         }
      
      The same can happen if a xattr is added concurrently and ends up having
      a key with an offset smaller then the delalloc's range end.
      
      So fix this by skipping keys with a type smaller than
      BTRFS_EXTENT_DATA_KEY.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarBen Hutchings <ben@decadent.org.uk>
      f5daa8cd