1. 23 Jan, 2016 20 commits
    • Lukas Czerner's avatar
      ext4: fix potential use after free in __ext4_journal_stop · 3c1c70cf
      Lukas Czerner authored
      commit 6934da92 upstream.
      
      There is a use-after-free possibility in __ext4_journal_stop() in the
      case that we free the handle in the first jbd2_journal_stop() because
      we're referencing handle->h_err afterwards. This was introduced in
      9705acd6 and it is wrong. Fix it by
      storing the handle->h_err value beforehand and avoid referencing
      potentially freed handle.
      
      Fixes: 9705acd6Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Reviewed-by: default avatarAndreas Dilger <adilger@dilger.ca>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3c1c70cf
    • Filipe Manana's avatar
      Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow · 389e0b9b
      Filipe Manana authored
      commit 1d512cb7 upstream.
      
      If we are using the NO_HOLES feature, we have a tiny time window when
      running delalloc for a nodatacow inode where we can race with a concurrent
      link or xattr add operation leading to a BUG_ON.
      
      This happens because at run_delalloc_nocow() we end up casting a leaf item
      of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a
      file extent item (struct btrfs_file_extent_item) and then analyse its
      extent type field, which won't match any of the expected extent types
      (values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an
      explicit BUG_ON(1).
      
      The following sequence diagram shows how the race happens when running a
      no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following
      neighbour leafs:
      
                   Leaf X (has N items)                    Leaf Y
      
       [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 EXTENT_DATA 8192), ... ]
                    slot N - 2         slot N - 1              slot 0
      
       (Note the implicit hole for inode 257 regarding the [0, 8K[ range)
      
             CPU 1                                         CPU 2
      
       run_dealloc_nocow()
         btrfs_lookup_file_extent()
           --> searches for a key with value
               (257 EXTENT_DATA 4096) in the
               fs/subvol tree
           --> returns us a path with
               path->nodes[0] == leaf X and
               path->slots[0] == N
      
         because path->slots[0] is >=
         btrfs_header_nritems(leaf X), it
         calls btrfs_next_leaf()
      
         btrfs_next_leaf()
           --> releases the path
      
                                                    hard link added to our inode,
                                                    with key (257 INODE_REF 500)
                                                    added to the end of leaf X,
                                                    so leaf X now has N + 1 keys
      
           --> searches for the key
               (257 INODE_REF 256), because
               it was the last key in leaf X
               before it released the path,
               with path->keep_locks set to 1
      
           --> ends up at leaf X again and
               it verifies that the key
               (257 INODE_REF 256) is no longer
               the last key in the leaf, so it
               returns with path->nodes[0] ==
               leaf X and path->slots[0] == N,
               pointing to the new item with
               key (257 INODE_REF 500)
      
         the loop iteration of run_dealloc_nocow()
         does not break out the loop and continues
         because the key referenced in the path
         at path->nodes[0] and path->slots[0] is
         for inode 257, its type is < BTRFS_EXTENT_DATA_KEY
         and its offset (500) is less then our delalloc
         range's end (8192)
      
         the item pointed by the path, an inode reference item,
         is (incorrectly) interpreted as a file extent item and
         we get an invalid extent type, leading to the BUG_ON(1):
      
         if (extent_type == BTRFS_FILE_EXTENT_REG ||
            extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
             (...)
         } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
             (...)
         } else {
             BUG_ON(1)
         }
      
      The same can happen if a xattr is added concurrently and ends up having
      a key with an offset smaller then the delalloc's range end.
      
      So fix this by skipping keys with a type smaller than
      BTRFS_EXTENT_DATA_KEY.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      389e0b9b
    • Filipe Manana's avatar
      Btrfs: fix race leading to incorrect item deletion when dropping extents · de4306a7
      Filipe Manana authored
      commit aeafbf84 upstream.
      
      While running a stress test I got the following warning triggered:
      
        [191627.672810] ------------[ cut here ]------------
        [191627.673949] WARNING: CPU: 8 PID: 8447 at fs/btrfs/file.c:779 __btrfs_drop_extents+0x391/0xa50 [btrfs]()
        (...)
        [191627.701485] Call Trace:
        [191627.702037]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
        [191627.702992]  [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
        [191627.704091]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
        [191627.705380]  [<ffffffffa0664499>] ? __btrfs_drop_extents+0x391/0xa50 [btrfs]
        [191627.706637]  [<ffffffff8104b46d>] warn_slowpath_null+0x1a/0x1c
        [191627.707789]  [<ffffffffa0664499>] __btrfs_drop_extents+0x391/0xa50 [btrfs]
        [191627.709155]  [<ffffffff8115663c>] ? cache_alloc_debugcheck_after.isra.32+0x171/0x1d0
        [191627.712444]  [<ffffffff81155007>] ? kmemleak_alloc_recursive.constprop.40+0x16/0x18
        [191627.714162]  [<ffffffffa06570c9>] insert_reserved_file_extent.constprop.40+0x83/0x24e [btrfs]
        [191627.715887]  [<ffffffffa065422b>] ? start_transaction+0x3bb/0x610 [btrfs]
        [191627.717287]  [<ffffffffa065b604>] btrfs_finish_ordered_io+0x273/0x4e2 [btrfs]
        [191627.728865]  [<ffffffffa065b888>] finish_ordered_fn+0x15/0x17 [btrfs]
        [191627.730045]  [<ffffffffa067d688>] normal_work_helper+0x14c/0x32c [btrfs]
        [191627.731256]  [<ffffffffa067d96a>] btrfs_endio_write_helper+0x12/0x14 [btrfs]
        [191627.732661]  [<ffffffff81061119>] process_one_work+0x24c/0x4ae
        [191627.733822]  [<ffffffff810615b0>] worker_thread+0x206/0x2c2
        [191627.734857]  [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f
        [191627.736052]  [<ffffffff810613aa>] ? process_scheduled_works+0x2f/0x2f
        [191627.737349]  [<ffffffff810669a6>] kthread+0xef/0xf7
        [191627.738267]  [<ffffffff810f3b3a>] ? time_hardirqs_on+0x15/0x28
        [191627.739330]  [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad
        [191627.741976]  [<ffffffff81465592>] ret_from_fork+0x42/0x70
        [191627.743080]  [<ffffffff810668b7>] ? __kthread_parkme+0xad/0xad
        [191627.744206] ---[ end trace bbfddacb7aaada8d ]---
      
        $ cat -n fs/btrfs/file.c
        691  int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
        (...)
        758                  btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
        759                  if (key.objectid > ino ||
        760                      key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end)
        761                          break;
        762
        763                  fi = btrfs_item_ptr(leaf, path->slots[0],
        764                                      struct btrfs_file_extent_item);
        765                  extent_type = btrfs_file_extent_type(leaf, fi);
        766
        767                  if (extent_type == BTRFS_FILE_EXTENT_REG ||
        768                      extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
        (...)
        774                  } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
        (...)
        778                  } else {
        779                          WARN_ON(1);
        780                          extent_end = search_start;
        781                  }
        (...)
      
      This happened because the item we were processing did not match a file
      extent item (its key type != BTRFS_EXTENT_DATA_KEY), and even on this
      case we cast the item to a struct btrfs_file_extent_item pointer and
      then find a type field value that does not match any of the expected
      values (BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]). This scenario happens
      due to a tiny time window where a race can happen as exemplified below.
      For example, consider the following scenario where we're using the
      NO_HOLES feature and we have the following two neighbour leafs:
      
                     Leaf X (has N items)                    Leaf Y
      
      [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 EXTENT_DATA 8192), ... ]
                slot N - 2         slot N - 1              slot 0
      
      Our inode 257 has an implicit hole in the range [0, 8K[ (implicit rather
      than explicit because NO_HOLES is enabled). Now if our inode has an
      ordered extent for the range [4K, 8K[ that is finishing, the following
      can happen:
      
                CPU 1                                       CPU 2
      
        btrfs_finish_ordered_io()
          insert_reserved_file_extent()
            __btrfs_drop_extents()
               Searches for the key
                (257 EXTENT_DATA 4096) through
                btrfs_lookup_file_extent()
      
               Key not found and we get a path where
               path->nodes[0] == leaf X and
               path->slots[0] == N
      
               Because path->slots[0] is >=
               btrfs_header_nritems(leaf X), we call
               btrfs_next_leaf()
      
               btrfs_next_leaf() releases the path
      
                                                        inserts key
                                                        (257 INODE_REF 4096)
                                                        at the end of leaf X,
                                                        leaf X now has N + 1 keys,
                                                        and the new key is at
                                                        slot N
      
               btrfs_next_leaf() searches for
               key (257 INODE_REF 256), with
               path->keep_locks set to 1,
               because it was the last key it
               saw in leaf X
      
                 finds it in leaf X again and
                 notices it's no longer the last
                 key of the leaf, so it returns 0
                 with path->nodes[0] == leaf X and
                 path->slots[0] == N (which is now
                 < btrfs_header_nritems(leaf X)),
                 pointing to the new key
                 (257 INODE_REF 4096)
      
               __btrfs_drop_extents() casts the
               item at path->nodes[0], slot
               path->slots[0], to a struct
               btrfs_file_extent_item - it does
               not skip keys for the target
               inode with a type less than
               BTRFS_EXTENT_DATA_KEY
               (BTRFS_INODE_REF_KEY < BTRFS_EXTENT_DATA_KEY)
      
               sees a bogus value for the type
               field triggering the WARN_ON in
               the trace shown above, and sets
               extent_end = search_start (4096)
      
               does the if-then-else logic to
               fixup 0 length extent items created
               by a past bug from hole punching:
      
                 if (extent_end == key.offset &&
                     extent_end >= search_start)
                     goto delete_extent_item;
      
               that evaluates to true and it ends
               up deleting the key pointed to by
               path->slots[0], (257 INODE_REF 4096),
               from leaf X
      
      The same could happen for example for a xattr that ends up having a key
      with an offset value that matches search_start (very unlikely but not
      impossible).
      
      So fix this by ensuring that keys smaller than BTRFS_EXTENT_DATA_KEY are
      skipped, never casted to struct btrfs_file_extent_item and never deleted
      by accident. Also protect against the unexpected case of getting a key
      for a lower inode number by skipping that key and issuing a warning.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      de4306a7
    • Eric Dumazet's avatar
      ipv6: sctp: implement sctp_v6_destroy_sock() · a8fa15f0
      Eric Dumazet authored
      [ Upstream commit 602dd62d ]
      
      Dmitry Vyukov reported a memory leak using IPV6 SCTP sockets.
      
      We need to call inet6_destroy_sock() to properly release
      inet6 specific fields.
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a8fa15f0
    • Michal Kubeček's avatar
      ipv6: distinguish frag queues by device for multicast and link-local packets · f4ce575c
      Michal Kubeček authored
      [ Upstream commit 264640fc ]
      
      If a fragmented multicast packet is received on an ethernet device which
      has an active macvlan on top of it, each fragment is duplicated and
      received both on the underlying device and the macvlan. If some
      fragments for macvlan are processed before the whole packet for the
      underlying device is reassembled, the "overlapping fragments" test in
      ip6_frag_queue() discards the whole fragment queue.
      
      To resolve this, add device ifindex to the search key and require it to
      match reassembling multicast packets and packets to link-local
      addresses.
      
      Note: similar patch has been already submitted by Yoshifuji Hideaki in
      
        http://patchwork.ozlabs.org/patch/220979/
      
      but got lost and forgotten for some reason.
      Signed-off-by: default avatarMichal Kubecek <mkubecek@suse.cz>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f4ce575c
    • Aaro Koskinen's avatar
      broadcom: fix PHY_ID_BCM5481 entry in the id table · ad7a4964
      Aaro Koskinen authored
      [ Upstream commit 3c25a860 ]
      
      Commit fcb26ec5 ("broadcom: move all PHY_ID's to header")
      updated broadcom_tbl to use PHY_IDs, but incorrectly replaced 0x0143bca0
      with PHY_ID_BCM5482 (making a duplicate entry, and completely omitting
      the original). Fix that.
      
      Fixes: fcb26ec5 ("broadcom: move all PHY_ID's to header")
      Signed-off-by: default avatarAaro Koskinen <aaro.koskinen@iki.fi>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ad7a4964
    • Nikolay Aleksandrov's avatar
      net: ip6mr: fix static mfc/dev leaks on table destruction · 10a72b76
      Nikolay Aleksandrov authored
      [ Upstream commit 4c698046 ]
      
      Similar to ipv4, when destroying an mrt table the static mfc entries and
      the static devices are kept, which leads to devices that can never be
      destroyed (because of refcnt taken) and leaked memory. Make sure that
      everything is cleaned up on netns destruction.
      
      Fixes: 8229efda ("netns: ip6mr: enable namespace support in ipv6 multicast forwarding code")
      CC: Benjamin Thery <benjamin.thery@bull.net>
      Signed-off-by: default avatarNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: default avatarCong Wang <cwang@twopensource.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      10a72b76
    • Nikolay Aleksandrov's avatar
      net: ipmr: fix static mfc/dev leaks on table destruction · 207c03d6
      Nikolay Aleksandrov authored
      [ Upstream commit 0e615e96 ]
      
      When destroying an mrt table the static mfc entries and the static
      devices are kept, which leads to devices that can never be destroyed
      (because of refcnt taken) and leaked memory, for example:
      unreferenced object 0xffff880034c144c0 (size 192):
        comm "mfc-broken", pid 4777, jiffies 4320349055 (age 46001.964s)
        hex dump (first 32 bytes):
          98 53 f0 34 00 88 ff ff 98 53 f0 34 00 88 ff ff  .S.4.....S.4....
          ef 0a 0a 14 01 02 03 04 00 00 00 00 01 00 00 00  ................
        backtrace:
          [<ffffffff815c1b9e>] kmemleak_alloc+0x4e/0xb0
          [<ffffffff811ea6e0>] kmem_cache_alloc+0x190/0x300
          [<ffffffff815931cb>] ip_mroute_setsockopt+0x5cb/0x910
          [<ffffffff8153d575>] do_ip_setsockopt.isra.11+0x105/0xff0
          [<ffffffff8153e490>] ip_setsockopt+0x30/0xa0
          [<ffffffff81564e13>] raw_setsockopt+0x33/0x90
          [<ffffffff814d1e14>] sock_common_setsockopt+0x14/0x20
          [<ffffffff814d0b51>] SyS_setsockopt+0x71/0xc0
          [<ffffffff815cdbf6>] entry_SYSCALL_64_fastpath+0x16/0x7a
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      Make sure that everything is cleaned on netns destruction.
      Signed-off-by: default avatarNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: default avatarCong Wang <cwang@twopensource.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      207c03d6
    • Daniel Borkmann's avatar
      net, scm: fix PaX detected msg_controllen overflow in scm_detach_fds · c35d6700
      Daniel Borkmann authored
      [ Upstream commit 6900317f ]
      
      David and HacKurx reported a following/similar size overflow triggered
      in a grsecurity kernel, thanks to PaX's gcc size overflow plugin:
      
      (Already fixed in later grsecurity versions by Brad and PaX Team.)
      
      [ 1002.296137] PAX: size overflow detected in function scm_detach_fds net/core/scm.c:314
                     cicus.202_127 min, count: 4, decl: msg_controllen; num: 0; context: msghdr;
      [ 1002.296145] CPU: 0 PID: 3685 Comm: scm_rights_recv Not tainted 4.2.3-grsec+ #7
      [ 1002.296149] Hardware name: Apple Inc. MacBookAir5,1/Mac-66F35F19FE2A0D05, [...]
      [ 1002.296153]  ffffffff81c27366 0000000000000000 ffffffff81c27375 ffffc90007843aa8
      [ 1002.296162]  ffffffff818129ba 0000000000000000 ffffffff81c27366 ffffc90007843ad8
      [ 1002.296169]  ffffffff8121f838 fffffffffffffffc fffffffffffffffc ffffc90007843e60
      [ 1002.296176] Call Trace:
      [ 1002.296190]  [<ffffffff818129ba>] dump_stack+0x45/0x57
      [ 1002.296200]  [<ffffffff8121f838>] report_size_overflow+0x38/0x60
      [ 1002.296209]  [<ffffffff816a979e>] scm_detach_fds+0x2ce/0x300
      [ 1002.296220]  [<ffffffff81791899>] unix_stream_read_generic+0x609/0x930
      [ 1002.296228]  [<ffffffff81791c9f>] unix_stream_recvmsg+0x4f/0x60
      [ 1002.296236]  [<ffffffff8178dc00>] ? unix_set_peek_off+0x50/0x50
      [ 1002.296243]  [<ffffffff8168fac7>] sock_recvmsg+0x47/0x60
      [ 1002.296248]  [<ffffffff81691522>] ___sys_recvmsg+0xe2/0x1e0
      [ 1002.296257]  [<ffffffff81693496>] __sys_recvmsg+0x46/0x80
      [ 1002.296263]  [<ffffffff816934fc>] SyS_recvmsg+0x2c/0x40
      [ 1002.296271]  [<ffffffff8181a3ab>] entry_SYSCALL_64_fastpath+0x12/0x85
      
      Further investigation showed that this can happen when an *odd* number of
      fds are being passed over AF_UNIX sockets.
      
      In these cases CMSG_LEN(i * sizeof(int)) and CMSG_SPACE(i * sizeof(int)),
      where i is the number of successfully passed fds, differ by 4 bytes due
      to the extra CMSG_ALIGN() padding in CMSG_SPACE() to an 8 byte boundary
      on 64 bit. The padding is used to align subsequent cmsg headers in the
      control buffer.
      
      When the control buffer passed in from the receiver side *lacks* these 4
      bytes (e.g. due to buggy/wrong API usage), then msg->msg_controllen will
      overflow in scm_detach_fds():
      
        int cmlen = CMSG_LEN(i * sizeof(int));  <--- cmlen w/o tail-padding
        err = put_user(SOL_SOCKET, &cm->cmsg_level);
        if (!err)
          err = put_user(SCM_RIGHTS, &cm->cmsg_type);
        if (!err)
          err = put_user(cmlen, &cm->cmsg_len);
        if (!err) {
          cmlen = CMSG_SPACE(i * sizeof(int));  <--- cmlen w/ 4 byte extra tail-padding
          msg->msg_control += cmlen;
          msg->msg_controllen -= cmlen;         <--- iff no tail-padding space here ...
        }                                            ... wrap-around
      
      F.e. it will wrap to a length of 18446744073709551612 bytes in case the
      receiver passed in msg->msg_controllen of 20 bytes, and the sender
      properly transferred 1 fd to the receiver, so that its CMSG_LEN results
      in 20 bytes and CMSG_SPACE in 24 bytes.
      
      In case of MSG_CMSG_COMPAT (scm_detach_fds_compat()), I haven't seen an
      issue in my tests as alignment seems always on 4 byte boundary. Same
      should be in case of native 32 bit, where we end up with 4 byte boundaries
      as well.
      
      In practice, passing msg->msg_controllen of 20 to recvmsg() while receiving
      a single fd would mean that on successful return, msg->msg_controllen is
      being set by the kernel to 24 bytes instead, thus more than the input
      buffer advertised. It could f.e. become an issue if such application later
      on zeroes or copies the control buffer based on the returned msg->msg_controllen
      elsewhere.
      
      Maximum number of fds we can send is a hard upper limit SCM_MAX_FD (253).
      
      Going over the code, it seems like msg->msg_controllen is not being read
      after scm_detach_fds() in scm_recv() anymore by the kernel, good!
      
      Relevant recvmsg() handler are unix_dgram_recvmsg() (unix_seqpacket_recvmsg())
      and unix_stream_recvmsg(). Both return back to their recvmsg() caller,
      and ___sys_recvmsg() places the updated length, that is, new msg_control -
      old msg_control pointer into msg->msg_controllen (hence the 24 bytes seen
      in the example).
      
      Long time ago, Wei Yongjun fixed something related in commit 1ac70e7a
      ("[NET]: Fix function put_cmsg() which may cause usr application memory
      overflow").
      
      RFC3542, section 20.2. says:
      
        The fields shown as "XX" are possible padding, between the cmsghdr
        structure and the data, and between the data and the next cmsghdr
        structure, if required by the implementation. While sending an
        application may or may not include padding at the end of last
        ancillary data in msg_controllen and implementations must accept both
        as valid. On receiving a portable application must provide space for
        padding at the end of the last ancillary data as implementations may
        copy out the padding at the end of the control message buffer and
        include it in the received msg_controllen. When recvmsg() is called
        if msg_controllen is too small for all the ancillary data items
        including any trailing padding after the last item an implementation
        may set MSG_CTRUNC.
      
      Since we didn't place MSG_CTRUNC for already quite a long time, just do
      the same as in 1ac70e7a to avoid an overflow.
      
      Btw, even man-page author got this wrong :/ See db939c9b26e9 ("cmsg.3: Fix
      error in SCM_RIGHTS code sample"). Some people must have copied this (?),
      thus it got triggered in the wild (reported several times during boot by
      David and HacKurx).
      
      No Fixes tag this time as pre 2002 (that is, pre history tree).
      Reported-by: default avatarDavid Sterba <dave@jikos.cz>
      Reported-by: default avatarHacKurx <hackurx@gmail.com>
      Cc: PaX Team <pageexec@freemail.hu>
      Cc: Emese Revfy <re.emese@gmail.com>
      Cc: Brad Spengler <spender@grsecurity.net>
      Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Cc: Eric Dumazet <edumazet@google.com>
      Reviewed-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c35d6700
    • Eric Dumazet's avatar
      tcp: initialize tp->copied_seq in case of cross SYN connection · d24edcc3
      Eric Dumazet authored
      [ Upstream commit 142a2e7e ]
      
      Dmitry provided a syzkaller (http://github.com/google/syzkaller)
      generated program that triggers the WARNING at
      net/ipv4/tcp.c:1729 in tcp_recvmsg() :
      
      WARN_ON(tp->copied_seq != tp->rcv_nxt &&
              !(flags & (MSG_PEEK | MSG_TRUNC)));
      
      His program is specifically attempting a Cross SYN TCP exchange,
      that we support (for the pleasure of hackers ?), but it looks we
      lack proper tcp->copied_seq initialization.
      
      Thanks again Dmitry for your report and testings.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Tested-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d24edcc3
    • Eric Dumazet's avatar
      tcp: md5: fix lockdep annotation · ab3689c8
      Eric Dumazet authored
      [ Upstream commit 1b8e6a01 ]
      
      When a passive TCP is created, we eventually call tcp_md5_do_add()
      with sk pointing to the child. It is not owner by the user yet (we
      will add this socket into listener accept queue a bit later anyway)
      
      But we do own the spinlock, so amend the lockdep annotation to avoid
      following splat :
      
      [ 8451.090932] net/ipv4/tcp_ipv4.c:923 suspicious rcu_dereference_protected() usage!
      [ 8451.090932]
      [ 8451.090932] other info that might help us debug this:
      [ 8451.090932]
      [ 8451.090934]
      [ 8451.090934] rcu_scheduler_active = 1, debug_locks = 1
      [ 8451.090936] 3 locks held by socket_sockopt_/214795:
      [ 8451.090936]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff855c6ac1>] __netif_receive_skb_core+0x151/0xe90
      [ 8451.090947]  #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff85618143>] ip_local_deliver_finish+0x43/0x2b0
      [ 8451.090952]  #2:  (slock-AF_INET){+.-...}, at: [<ffffffff855acda5>] sk_clone_lock+0x1c5/0x500
      [ 8451.090958]
      [ 8451.090958] stack backtrace:
      [ 8451.090960] CPU: 7 PID: 214795 Comm: socket_sockopt_
      
      [ 8451.091215] Call Trace:
      [ 8451.091216]  <IRQ>  [<ffffffff856fb29c>] dump_stack+0x55/0x76
      [ 8451.091229]  [<ffffffff85123b5b>] lockdep_rcu_suspicious+0xeb/0x110
      [ 8451.091235]  [<ffffffff8564544f>] tcp_md5_do_add+0x1bf/0x1e0
      [ 8451.091239]  [<ffffffff85645751>] tcp_v4_syn_recv_sock+0x1f1/0x4c0
      [ 8451.091242]  [<ffffffff85642b27>] ? tcp_v4_md5_hash_skb+0x167/0x190
      [ 8451.091246]  [<ffffffff85647c78>] tcp_check_req+0x3c8/0x500
      [ 8451.091249]  [<ffffffff856451ae>] ? tcp_v4_inbound_md5_hash+0x11e/0x190
      [ 8451.091253]  [<ffffffff85647170>] tcp_v4_rcv+0x3c0/0x9f0
      [ 8451.091256]  [<ffffffff85618143>] ? ip_local_deliver_finish+0x43/0x2b0
      [ 8451.091260]  [<ffffffff856181b6>] ip_local_deliver_finish+0xb6/0x2b0
      [ 8451.091263]  [<ffffffff85618143>] ? ip_local_deliver_finish+0x43/0x2b0
      [ 8451.091267]  [<ffffffff85618d38>] ip_local_deliver+0x48/0x80
      [ 8451.091270]  [<ffffffff85618510>] ip_rcv_finish+0x160/0x700
      [ 8451.091273]  [<ffffffff8561900e>] ip_rcv+0x29e/0x3d0
      [ 8451.091277]  [<ffffffff855c74b7>] __netif_receive_skb_core+0xb47/0xe90
      
      Fixes: a8afca03 ("tcp: md5: protects md5sig_info with RCU")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ab3689c8
    • Bjørn Mork's avatar
      net: qmi_wwan: add XS Stick W100-2 from 4G Systems · e9ec6742
      Bjørn Mork authored
      [ Upstream commit 68242a5a ]
      
      Thomas reports
      "
      4gsystems sells two total different LTE-surfsticks under the same name.
      ..
      The newer version of XS Stick W100 is from "omega"
      ..
      Under windows the driver switches to the same ID, and uses MI03\6 for
      network and MI01\6 for modem.
      ..
      echo "1c9e 9b01" > /sys/bus/usb/drivers/qmi_wwan/new_id
      echo "1c9e 9b01" > /sys/bus/usb-serial/drivers/option1/new_id
      
      T:  Bus=01 Lev=01 Prnt=01 Port=03 Cnt=01 Dev#=  4 Spd=480 MxCh= 0
      D:  Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
      P:  Vendor=1c9e ProdID=9b01 Rev=02.32
      S:  Manufacturer=USB Modem
      S:  Product=USB Modem
      S:  SerialNumber=
      C:  #Ifs= 5 Cfg#= 1 Atr=80 MxPwr=500mA
      I:  If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=option
      I:  If#= 1 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=ff Driver=option
      I:  If#= 2 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=ff Driver=option
      I:  If#= 3 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=ff Driver=qmi_wwan
      I:  If#= 4 Alt= 0 #EPs= 2 Cls=08(stor.) Sub=06 Prot=50 Driver=usb-storage
      
      Now all important things are there:
      
      wwp0s29f7u2i3 (net), ttyUSB2 (at), cdc-wdm0 (qmi), ttyUSB1 (at)
      
      There is also ttyUSB0, but it is not usable, at least not for at.
      
      The device works well with qmi and ModemManager-NetworkManager.
      "
      Reported-by: default avatarThomas Schäfer <tschaefer@t-online.de>
      Signed-off-by: default avatarBjørn Mork <bjorn@mork.no>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e9ec6742
    • Neil Horman's avatar
      snmp: Remove duplicate OUTMCAST stat increment · c32a5063
      Neil Horman authored
      [ Upstream commit 41033f02 ]
      
      the OUTMCAST stat is double incremented, getting bumped once in the mcast code
      itself, and again in the common ip output path.  Remove the mcast bump, as its
      not needed
      
      Validated by the reporter, with good results
      Signed-off-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Reported-by: default avatarClaus Jensen <claus.jensen@microsemi.com>
      CC: Claus Jensen <claus.jensen@microsemi.com>
      CC: David Miller <davem@davemloft.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c32a5063
    • Jason A. Donenfeld's avatar
      ip_tunnel: disable preemption when updating per-cpu tstats · f4f4b092
      Jason A. Donenfeld authored
      [ Upstream commit b4fe85f9 ]
      
      Drivers like vxlan use the recently introduced
      udp_tunnel_xmit_skb/udp_tunnel6_xmit_skb APIs. udp_tunnel6_xmit_skb
      makes use of ip6tunnel_xmit, and ip6tunnel_xmit, after sending the
      packet, updates the struct stats using the usual
      u64_stats_update_begin/end calls on this_cpu_ptr(dev->tstats).
      udp_tunnel_xmit_skb makes use of iptunnel_xmit, which doesn't touch
      tstats, so drivers like vxlan, immediately after, call
      iptunnel_xmit_stats, which does the same thing - calls
      u64_stats_update_begin/end on this_cpu_ptr(dev->tstats).
      
      While vxlan is probably fine (I don't know?), calling a similar function
      from, say, an unbound workqueue, on a fully preemptable kernel causes
      real issues:
      
      [  188.434537] BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u8:0/6
      [  188.435579] caller is debug_smp_processor_id+0x17/0x20
      [  188.435583] CPU: 0 PID: 6 Comm: kworker/u8:0 Not tainted 4.2.6 #2
      [  188.435607] Call Trace:
      [  188.435611]  [<ffffffff8234e936>] dump_stack+0x4f/0x7b
      [  188.435615]  [<ffffffff81915f3d>] check_preemption_disabled+0x19d/0x1c0
      [  188.435619]  [<ffffffff81915f77>] debug_smp_processor_id+0x17/0x20
      
      The solution would be to protect the whole
      this_cpu_ptr(dev->tstats)/u64_stats_update_begin/end blocks with
      disabling preemption and then reenabling it.
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f4f4b092
    • lucien's avatar
      sctp: translate host order to network order when setting a hmacid · a286e9c0
      lucien authored
      [ Upstream commit ed5a377d ]
      
      now sctp auth cannot work well when setting a hmacid manually, which
      is caused by that we didn't use the network order for hmacid, so fix
      it by adding the transformation in sctp_auth_ep_set_hmacs.
      
      even we set hmacid with the network order in userspace, it still
      can't work, because of this condition in sctp_auth_ep_set_hmacs():
      
      		if (id > SCTP_AUTH_HMAC_ID_MAX)
      			return -EOPNOTSUPP;
      
      so this wasn't working before and thus it won't break compatibility.
      
      Fixes: 65b07e5d ("[SCTP]: API updates to suport SCTP-AUTH extensions.")
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Acked-by: default avatarVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a286e9c0
    • Daniel Borkmann's avatar
      packet: infer protocol from ethernet header if unset · de23f6a7
      Daniel Borkmann authored
      [ Upstream commit c72219b7 ]
      
      In case no struct sockaddr_ll has been passed to packet
      socket's sendmsg() when doing a TX_RING flush run, then
      skb->protocol is set to po->num instead, which is the protocol
      passed via socket(2)/bind(2).
      
      Applications only xmitting can go the path of allocating the
      socket as socket(PF_PACKET, <mode>, 0) and do a bind(2) on the
      TX_RING with sll_protocol of 0. That way, register_prot_hook()
      is neither called on creation nor on bind time, which saves
      cycles when there's no interest in capturing anyway.
      
      That leaves us however with po->num 0 instead and therefore
      the TX_RING flush run sets skb->protocol to 0 as well. Eric
      reported that this leads to problems when using tools like
      trafgen over bonding device. I.e. the bonding's hash function
      could invoke the kernel's flow dissector, which depends on
      skb->protocol being properly set. In the current situation, all
      the traffic is then directed to a single slave.
      
      Fix it up by inferring skb->protocol from the Ethernet header
      when not set and we have ARPHRD_ETHER device type. This is only
      done in case of SOCK_RAW and where we have a dev->hard_header_len
      length. In case of ARPHRD_ETHER devices, this is guaranteed to
      cover ETH_HLEN, and therefore being accessed on the skb after
      the skb_store_bits().
      Reported-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      de23f6a7
    • Daniel Borkmann's avatar
      packet: always probe for transport header · c8f0bbc0
      Daniel Borkmann authored
      [ Upstream commit 8fd6c80d ]
      
      We concluded that the skb_probe_transport_header() should better be
      called unconditionally. Avoiding the call into the flow dissector has
      also not really much to do with the direct xmit mode.
      
      While it seems that only virtio_net code makes use of GSO from non
      RX/TX ring packet socket paths, we should probe for a transport header
      nevertheless before they hit devices.
      
      Reference: http://thread.gmane.org/gmane.linux.network/386173/Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarJason Wang <jasowang@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c8f0bbc0
    • Daniel Borkmann's avatar
      packet: do skb_probe_transport_header when we actually have data · 0fd58003
      Daniel Borkmann authored
      [ Upstream commit efdfa2f7 ]
      
      In tpacket_fill_skb() commit c1aad275 ("packet: set transport
      header before doing xmit") and later on 40893fd0 ("net: switch
      to use skb_probe_transport_header()") was probing for a transport
      header on the skb from a ring buffer slot, but at a time, where
      the skb has _not even_ been filled with data yet. So that call into
      the flow dissector is pretty useless. Lets do it after we've set
      up the skb frags.
      
      Fixes: c1aad275 ("packet: set transport header before doing xmit")
      Reported-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarJason Wang <jasowang@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0fd58003
    • Kamal Mostafa's avatar
      tools/net: Use include/uapi with __EXPORTED_HEADERS__ · 4db32ea3
      Kamal Mostafa authored
      [ Upstream commit d7475de5 ]
      
      Use the local uapi headers to keep in sync with "recently" added #define's
      (e.g. SKF_AD_VLAN_TPID).  Refactored CFLAGS, and bpf_asm doesn't need -I.
      
      Fixes: 3f356385 ("filter: bpf_asm: add minimal bpf asm tool")
      Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4db32ea3
    • Rainer Weikusat's avatar
      unix: avoid use-after-free in ep_remove_wait_queue · 9d054f57
      Rainer Weikusat authored
      [ Upstream commit 7d267278 ]
      
      Rainer Weikusat <rweikusat@mobileactivedefense.com> writes:
      An AF_UNIX datagram socket being the client in an n:1 association with
      some server socket is only allowed to send messages to the server if the
      receive queue of this socket contains at most sk_max_ack_backlog
      datagrams. This implies that prospective writers might be forced to go
      to sleep despite none of the message presently enqueued on the server
      receive queue were sent by them. In order to ensure that these will be
      woken up once space becomes again available, the present unix_dgram_poll
      routine does a second sock_poll_wait call with the peer_wait wait queue
      of the server socket as queue argument (unix_dgram_recvmsg does a wake
      up on this queue after a datagram was received). This is inherently
      problematic because the server socket is only guaranteed to remain alive
      for as long as the client still holds a reference to it. In case the
      connection is dissolved via connect or by the dead peer detection logic
      in unix_dgram_sendmsg, the server socket may be freed despite "the
      polling mechanism" (in particular, epoll) still has a pointer to the
      corresponding peer_wait queue. There's no way to forcibly deregister a
      wait queue with epoll.
      
      Based on an idea by Jason Baron, the patch below changes the code such
      that a wait_queue_t belonging to the client socket is enqueued on the
      peer_wait queue of the server whenever the peer receive queue full
      condition is detected by either a sendmsg or a poll. A wake up on the
      peer queue is then relayed to the ordinary wait queue of the client
      socket via wake function. The connection to the peer wait queue is again
      dissolved if either a wake up is about to be relayed or the client
      socket reconnects or a dead peer is detected or the client socket is
      itself closed. This enables removing the second sock_poll_wait from
      unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
      that no blocked writer sleeps forever.
      Signed-off-by: default avatarRainer Weikusat <rweikusat@mobileactivedefense.com>
      Fixes: ec0d215f ("af_unix: fix 'poll for write'/connected DGRAM sockets")
      Reviewed-by: default avatarJason Baron <jbaron@akamai.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9d054f57
  2. 09 Dec, 2015 20 commits