1. 01 Oct, 2015 40 commits
    • Nikolay Aleksandrov's avatar
      bridge: mdb: fix double add notification · cf4a8f24
      Nikolay Aleksandrov authored
      [ Upstream commit 5ebc7846 ]
      
      Since the mdb add/del code was introduced there have been 2 br_mdb_notify
      calls when doing br_mdb_add() resulting in 2 notifications on each add.
      
      Example:
       Command: bridge mdb add dev br0 port eth1 grp 239.0.0.1 permanent
       Before patch:
       root@debian:~# bridge monitor all
       [MDB]dev br0 port eth1 grp 239.0.0.1 permanent
       [MDB]dev br0 port eth1 grp 239.0.0.1 permanent
      
       After patch:
       root@debian:~# bridge monitor all
       [MDB]dev br0 port eth1 grp 239.0.0.1 permanent
      Signed-off-by: default avatarNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Fixes: cfd56754 ("bridge: add support of adding and deleting mdb entries")
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cf4a8f24
    • Herbert Xu's avatar
      net: Fix skb_set_peeked use-after-free bug · 414913c2
      Herbert Xu authored
      [ Upstream commit a0a2a660 ]
      
      The commit 738ac1eb ("net: Clone
      skb before setting peeked flag") introduced a use-after-free bug
      in skb_recv_datagram.  This is because skb_set_peeked may create
      a new skb and free the existing one.  As it stands the caller will
      continue to use the old freed skb.
      
      This patch fixes it by making skb_set_peeked return the new skb
      (or the old one if unchanged).
      
      Fixes: 738ac1eb ("net: Clone skb before setting peeked flag")
      Reported-by: default avatarBrenden Blanco <bblanco@plumgrid.com>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Tested-by: default avatarBrenden Blanco <bblanco@plumgrid.com>
      Reviewed-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      414913c2
    • Herbert Xu's avatar
      net: Fix skb csum races when peeking · d8f7f61b
      Herbert Xu authored
      [ Upstream commit 89c22d8c ]
      
      When we calculate the checksum on the recv path, we store the
      result in the skb as an optimisation in case we need the checksum
      again down the line.
      
      This is in fact bogus for the MSG_PEEK case as this is done without
      any locking.  So multiple threads can peek and then store the result
      to the same skb, potentially resulting in bogus skb states.
      
      This patch fixes this by only storing the result if the skb is not
      shared.  This preserves the optimisations for the few cases where
      it can be done safely due to locking or other reasons, e.g., SIOCINQ.
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d8f7f61b
    • Herbert Xu's avatar
      net: Clone skb before setting peeked flag · 2482787e
      Herbert Xu authored
      [ Upstream commit 738ac1eb ]
      
      Shared skbs must not be modified and this is crucial for broadcast
      and/or multicast paths where we use it as an optimisation to avoid
      unnecessary cloning.
      
      The function skb_recv_datagram breaks this rule by setting peeked
      without cloning the skb first.  This causes funky races which leads
      to double-free.
      
      This patch fixes this by cloning the skb and replacing the skb
      in the list when setting skb->peeked.
      
      Fixes: a59322be ("[UDP]: Only increment counter on first peek/recv")
      Reported-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: default avatarHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2482787e
    • Julian Anastasov's avatar
      net: call rcu_read_lock early in process_backlog · 69f02461
      Julian Anastasov authored
      [ Upstream commit 2c17d27c ]
      
      Incoming packet should be either in backlog queue or
      in RCU read-side section. Otherwise, the final sequence of
      flush_backlog() and synchronize_net() may miss packets
      that can run without device reference:
      
      CPU 1                  CPU 2
                             skb->dev: no reference
                             process_backlog:__skb_dequeue
                             process_backlog:local_irq_enable
      
      on_each_cpu for
      flush_backlog =>       IPI(hardirq): flush_backlog
                             - packet not found in backlog
      
                             CPU delayed ...
      synchronize_net
      - no ongoing RCU
      read-side sections
      
      netdev_run_todo,
      rcu_barrier: no
      ongoing callbacks
                             __netif_receive_skb_core:rcu_read_lock
                             - too late
      free dev
                             process packet for freed dev
      
      Fixes: 6e583ce5 ("net: eliminate refcounting in backlog queue")
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: default avatarJulian Anastasov <ja@ssi.bg>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      69f02461
    • Julian Anastasov's avatar
      net: do not process device backlog during unregistration · 2095ab24
      Julian Anastasov authored
      [ Upstream commit e9e4dd32 ]
      
      commit 381c759d ("ipv4: Avoid crashing in ip_error")
      fixes a problem where processed packet comes from device
      with destroyed inetdev (dev->ip_ptr). This is not expected
      because inetdev_destroy is called in NETDEV_UNREGISTER
      phase and packets should not be processed after
      dev_close_many() and synchronize_net(). Above fix is still
      required because inetdev_destroy can be called for other
      reasons. But it shows the real problem: backlog can keep
      packets for long time and they do not hold reference to
      device. Such packets are then delivered to upper levels
      at the same time when device is unregistered.
      Calling flush_backlog after NETDEV_UNREGISTER_FINAL still
      accounts all packets from backlog but before that some packets
      continue to be delivered to upper levels long after the
      synchronize_net call which is supposed to wait the last
      ones. Also, as Eric pointed out, processed packets, mostly
      from other devices, can continue to add new packets to backlog.
      
      Fix the problem by moving flush_backlog early, after the
      device driver is stopped and before the synchronize_net() call.
      Then use netif_running check to make sure we do not add more
      packets to backlog. We have to do it in enqueue_to_backlog
      context when the local IRQ is disabled. As result, after the
      flush_backlog and synchronize_net sequence all packets
      should be accounted.
      
      Thanks to Eric W. Biederman for the test script and his
      valuable feedback!
      Reported-by: default avatarVittorio Gambaletta <linuxbugs@vittgam.net>
      Fixes: 6e583ce5 ("net: eliminate refcounting in backlog queue")
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: default avatarJulian Anastasov <ja@ssi.bg>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2095ab24
    • Oleg Nesterov's avatar
      net: pktgen: fix race between pktgen_thread_worker() and kthread_stop() · 45b5e68e
      Oleg Nesterov authored
      [ Upstream commit fecdf8be ]
      
      pktgen_thread_worker() is obviously racy, kthread_stop() can come
      between the kthread_should_stop() check and set_current_state().
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Reported-by: default avatarJan Stancek <jstancek@redhat.com>
      Reported-by: default avatarMarcelo Leitner <mleitner@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      45b5e68e
    • Nikolay Aleksandrov's avatar
      bridge: mdb: zero out the local br_ip variable before use · d46c3f82
      Nikolay Aleksandrov authored
      [ Upstream commit f1158b74 ]
      
      Since commit b0e9a30d ("bridge: Add vlan id to multicast groups")
      there's a check in br_ip_equal() for a matching vlan id, but the mdb
      functions were not modified to use (or at least zero it) so when an
      entry was added it would have a garbage vlan id (from the local br_ip
      variable in __br_mdb_add/del) and this would prevent it from being
      matched and also deleted. So zero out the whole local ip var to protect
      ourselves from future changes and also to fix the current bug, since
      there's no vlan id support in the mdb uapi - use always vlan id 0.
      Example before patch:
      root@debian:~# bridge mdb add dev br0 port eth1 grp 239.0.0.1 permanent
      root@debian:~# bridge mdb
      dev br0 port eth1 grp 239.0.0.1 permanent
      root@debian:~# bridge mdb del dev br0 port eth1 grp 239.0.0.1 permanent
      RTNETLINK answers: Invalid argument
      
      After patch:
      root@debian:~# bridge mdb add dev br0 port eth1 grp 239.0.0.1 permanent
      root@debian:~# bridge mdb
      dev br0 port eth1 grp 239.0.0.1 permanent
      root@debian:~# bridge mdb del dev br0 port eth1 grp 239.0.0.1 permanent
      root@debian:~# bridge mdb
      Signed-off-by: default avatarNikolay Aleksandrov <razor@blackwall.org>
      Fixes: b0e9a30d ("bridge: Add vlan id to multicast groups")
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d46c3f82
    • Stephen Smalley's avatar
      net/tipc: initialize security state for new connection socket · 4db9a972
      Stephen Smalley authored
      [ Upstream commit fdd75ea8 ]
      
      Calling connect() with an AF_TIPC socket would trigger a series
      of error messages from SELinux along the lines of:
      SELinux: Invalid class 0
      type=AVC msg=audit(1434126658.487:34500): avc:  denied  { <unprintable> }
        for pid=292 comm="kworker/u16:5" scontext=system_u:system_r:kernel_t:s0
        tcontext=system_u:object_r:unlabeled_t:s0 tclass=<unprintable>
        permissive=0
      
      This was due to a failure to initialize the security state of the new
      connection sock by the tipc code, leaving it with junk in the security
      class field and an unlabeled secid.  Add a call to security_sk_clone()
      to inherit the security state from the parent socket.
      Reported-by: default avatarTim Shearer <tim.shearer@overturenetworks.com>
      Signed-off-by: default avatarStephen Smalley <sds@tycho.nsa.gov>
      Acked-by: default avatarPaul Moore <paul@paul-moore.com>
      Acked-by: default avatarYing Xue <ying.xue@windriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4db9a972
    • Timo Teräs's avatar
      ip_tunnel: fix ipv4 pmtu check to honor inner ip header df · 5a2c3656
      Timo Teräs authored
      [ Upstream commit fc24f2b2 ]
      
      Frag needed should be sent only if the inner header asked
      to not fragment. Currently fragmentation is broken if the
      tunnel has df set, but df was not asked in the original
      packet. The tunnel's df needs to be still checked to update
      internally the pmtu cache.
      
      Commit 23a3647b broke it, and this commit fixes
      the ipv4 df check back to the way it was.
      
      Fixes: 23a3647b ("ip_tunnels: Use skb-len to PMTU check.")
      Cc: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: default avatarTimo Teräs <timo.teras@iki.fi>
      Acked-by: default avatarPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5a2c3656
    • Daniel Borkmann's avatar
      rtnetlink: verify IFLA_VF_INFO attributes before passing them to driver · c8fa57bb
      Daniel Borkmann authored
      [ Upstream commit 4f7d2cdf ]
      
      Jason Gunthorpe reported that since commit c02db8c6 ("rtnetlink: make
      SR-IOV VF interface symmetric"), we don't verify IFLA_VF_INFO attributes
      anymore with respect to their policy, that is, ifla_vfinfo_policy[].
      
      Before, they were part of ifla_policy[], but they have been nested since
      placed under IFLA_VFINFO_LIST, that contains the attribute IFLA_VF_INFO,
      which is another nested attribute for the actual VF attributes such as
      IFLA_VF_MAC, IFLA_VF_VLAN, etc.
      
      Despite the policy being split out from ifla_policy[] in this commit,
      it's never applied anywhere. nla_for_each_nested() only does basic nla_ok()
      testing for struct nlattr, but it doesn't know about the data context and
      their requirements.
      
      Fix, on top of Jason's initial work, does 1) parsing of the attributes
      with the right policy, and 2) using the resulting parsed attribute table
      from 1) instead of the nla_for_each_nested() loop (just like we used to
      do when still part of ifla_policy[]).
      
      Reference: http://thread.gmane.org/gmane.linux.network/368913
      Fixes: c02db8c6 ("rtnetlink: make SR-IOV VF interface symmetric")
      Reported-by: default avatarJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com>
      Cc: Greg Rose <gregory.v.rose@intel.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Rony Efraim <ronye@mellanox.com>
      Cc: Vlad Zolotarov <vladz@cloudius-systems.com>
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: default avatarJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarVlad Zolotarov <vladz@cloudius-systems.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c8fa57bb
    • Eric Dumazet's avatar
      net: graceful exit from netif_alloc_netdev_queues() · 6133a5c1
      Eric Dumazet authored
      [ Upstream commit d339727c ]
      
      User space can crash kernel with
      
      ip link add ifb10 numtxqueues 100000 type ifb
      
      We must replace a BUG_ON() by proper test and return -EINVAL for
      crazy values.
      
      Fixes: 60877a32 ("net: allow large number of tx queues")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6133a5c1
    • Angga's avatar
      ipv6: Make MLD packets to only be processed locally · 1742bfe8
      Angga authored
      [ Upstream commit 4c938d22 ]
      
      Before commit daad1512 ("ipv6: Make ipv6_is_mld() inline and use it
      from ip6_mc_input().") MLD packets were only processed locally. After the
      change, a copy of MLD packet goes through ip6_mr_input, causing
      MRT6MSG_NOCACHE message to be generated to user space.
      
      Make MLD packet only processed locally.
      
      Fixes: daad1512 ("ipv6: Make ipv6_is_mld() inline and use it from ip6_mc_input().")
      Signed-off-by: default avatarHermin Anggawijaya <hermin.anggawijaya@alliedtelesis.co.nz>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1742bfe8
    • Hin-Tak Leung's avatar
      hfs,hfsplus: cache pages correctly between bnode_create and bnode_free · c824dd9a
      Hin-Tak Leung authored
      commit 7cb74be6 upstream.
      
      Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and
      hfs_bnode_find() for finding or creating pages corresponding to an inode)
      are immediately kmap()'ed and used (both read and write) and kunmap()'ed,
      and should not be page_cache_release()'ed until hfs_bnode_free().
      
      This patch fixes a problem I first saw in July 2012: merely running "du"
      on a large hfsplus-mounted directory a few times on a reasonably loaded
      system would get the hfsplus driver all confused and complaining about
      B-tree inconsistencies, and generates a "BUG: Bad page state".  Most
      recently, I can generate this problem on up-to-date Fedora 22 with shipped
      kernel 4.0.5, by running "du /" (="/" + "/home" + "/mnt" + other smaller
      mounts) and "du /mnt" simultaneously on two windows, where /mnt is a
      lightly-used QEMU VM image of the full Mac OS X 10.9:
      
      $ df -i / /home /mnt
      Filesystem                  Inodes   IUsed      IFree IUse% Mounted on
      /dev/mapper/fedora-root    3276800  551665    2725135   17% /
      /dev/mapper/fedora-home   52879360  716221   52163139    2% /home
      /dev/nbd0p2             4294967295 1387818 4293579477    1% /mnt
      
      After applying the patch, I was able to run "du /" (60+ times) and "du
      /mnt" (150+ times) continuously and simultaneously for 6+ hours.
      
      There are many reports of the hfsplus driver getting confused under load
      and generating "BUG: Bad page state" or other similar issues over the
      years.  [1]
      
      The unpatched code [2] has always been wrong since it entered the kernel
      tree.  The only reason why it gets away with it is that the
      kmap/memcpy/kunmap follow very quickly after the page_cache_release() so
      the kernel has not had a chance to reuse the memory for something else,
      most of the time.
      
      The current RW driver appears to have followed the design and development
      of the earlier read-only hfsplus driver [3], where-by version 0.1 (Dec
      2001) had a B-tree node-centric approach to
      read_cache_page()/page_cache_release() per bnode_get()/bnode_put(),
      migrating towards version 0.2 (June 2002) of caching and releasing pages
      per inode extents.  When the current RW code first entered the kernel [2]
      in 2005, there was an REF_PAGES conditional (and "//" commented out code)
      to switch between B-node centric paging to inode-centric paging.  There
      was a mistake with the direction of one of the REF_PAGES conditionals in
      __hfs_bnode_create().  In a subsequent "remove debug code" commit [4], the
      read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were
      removed, but a page_cache_release() was mistakenly left in (propagating
      the "REF_PAGES <-> !REF_PAGE" mistake), and the commented-out
      page_cache_release() in bnode_release() (which should be spanned by
      !REF_PAGES) was never enabled.
      
      References:
      [1]:
      Michael Fox, Apr 2013
      http://www.spinics.net/lists/linux-fsdevel/msg63807.html
      ("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'")
      
      Sasha Levin, Feb 2015
      http://lkml.org/lkml/2015/2/20/85 ("use after free")
      
      https://bugs.launchpad.net/ubuntu/+source/linux/+bug/740814
      https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1027887
      https://bugzilla.kernel.org/show_bug.cgi?id=42342
      https://bugzilla.kernel.org/show_bug.cgi?id=63841
      https://bugzilla.kernel.org/show_bug.cgi?id=78761
      
      [2]:
      http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
      fs/hfs/bnode.c?id=d1081202
      commit d1081202
      Author: Andrew Morton <akpm@osdl.org>
      Date:   Wed Feb 25 16:17:36 2004 -0800
      
          [PATCH] HFS rewrite
      
      http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
      fs/hfsplus/bnode.c?id=91556682
      
      commit 91556682
      Author: Andrew Morton <akpm@osdl.org>
      Date:   Wed Feb 25 16:17:48 2004 -0800
      
          [PATCH] HFS+ support
      
      [3]:
      http://sourceforge.net/projects/linux-hfsplus/
      
      http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.1/
      http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.2/
      
      http://linux-hfsplus.cvs.sourceforge.net/viewvc/linux-hfsplus/linux/\
      fs/hfsplus/bnode.c?r1=1.4&r2=1.5
      
      Date:   Thu Jun 6 09:45:14 2002 +0000
      Use buffer cache instead of page cache in bnode.c. Cache inode extents.
      
      [4]:
      http://git.kernel.org/cgit/linux/kernel/git/\
      stable/linux-stable.git/commit/?id=a5e3985f
      
      commit a5e3985f
      Author: Roman Zippel <zippel@linux-m68k.org>
      Date:   Tue Sep 6 15:18:47 2005 -0700
      
      [PATCH] hfs: remove debug code
      Signed-off-by: default avatarHin-Tak Leung <htl10@users.sourceforge.net>
      Signed-off-by: default avatarSergei Antonov <saproj@gmail.com>
      Reviewed-by: default avatarAnton Altaparmakov <anton@tuxera.com>
      Reported-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
      Cc: Sougata Santra <sougata@tuxera.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c824dd9a
    • Alexey Brodkin's avatar
      stmmac: troubleshoot unexpected bits in des0 & des1 · bf6391e2
      Alexey Brodkin authored
      commit f1590670 upstream.
      
      Current implementation of descriptor init procedure only takes
      care about setting/clearing ownership flag in "des0"/"des1"
      fields while it is perfectly possible to get unexpected bits
      set because of the following factors:
      
       [1] On driver probe underlying memory allocated with
           dma_alloc_coherent() might not be zeroed and so
           it will be filled with garbage.
      
       [2] During driver operation some bits could be set by SD/MMC
           controller (for example error flags etc).
      
      And unexpected and/or randomly set flags in "des0"/"des1"
      fields may lead to unpredictable behavior of GMAC DMA block.
      
      This change addresses both items above with:
      
       [1] Use of dma_zalloc_coherent() instead of simple
           dma_alloc_coherent() to make sure allocated memory is
           zeroed. That shouldn't affect performance because
           this allocation only happens once on driver probe.
      
       [2] Do explicit zeroing of both "des0" and "des1" fields
           of all buffer descriptors during initialization of
           DMA transfer.
      
      And while at it fixed identation of dma_free_coherent()
      counterpart as well.
      Signed-off-by: default avatarAlexey Brodkin <abrodkin@synopsys.com>
      Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Cc: arc-linux-dev@synopsys.com
      Cc: linux-kernel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bf6391e2
    • Noa Osherovich's avatar
      IB/mlx4: Use correct SL on AH query under RoCE · 86f1e069
      Noa Osherovich authored
      commit 5e99b139 upstream.
      
      The mlx4 IB driver implementation for ib_query_ah used a wrong offset
      (28 instead of 29) when link type is Ethernet. Fixed to use the correct one.
      
      Fixes: fa417f7b ('IB/mlx4: Add support for IBoE')
      Signed-off-by: default avatarShani Michaeli <shanim@mellanox.com>
      Signed-off-by: default avatarNoa Osherovich <noaos@mellanox.com>
      Signed-off-by: default avatarOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      86f1e069
    • Jack Morgenstein's avatar
      IB/mlx4: Forbid using sysfs to change RoCE pkeys · 3fb968bf
      Jack Morgenstein authored
      commit 2b135db3 upstream.
      
      The pkey mapping for RoCE must remain the default mapping:
      VFs:
        virtual index 0 = mapped to real index 0 (0xFFFF)
        All others indices: mapped to a real pkey index containing an
                            invalid pkey.
      PF:
        virtual index i = real index i.
      
      Don't allow users to change these mappings using files found in
      sysfs.
      
      Fixes: c1e7e466 ('IB/mlx4: Add iov directory in sysfs under the ib device')
      Signed-off-by: default avatarJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: default avatarOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3fb968bf
    • Yishai Hadas's avatar
      IB/uverbs: Fix race between ib_uverbs_open and remove_one · 20dbae92
      Yishai Hadas authored
      commit 35d4a0b6 upstream.
      
      Fixes: 2a72f212 ("IB/uverbs: Remove dev_table")
      
      Before this commit there was a device look-up table that was protected
      by a spin_lock used by ib_uverbs_open and by ib_uverbs_remove_one. When
      it was dropped and container_of was used instead, it enabled the race
      with remove_one as dev might be freed just after:
      dev = container_of(inode->i_cdev, struct ib_uverbs_device, cdev) but
      before the kref_get.
      
      In addition, this buggy patch added some dead code as
      container_of(x,y,z) can never be NULL and so dev can never be NULL.
      As a result the comment above ib_uverbs_open saying "the open method
      will either immediately run -ENXIO" is wrong as it can never happen.
      
      The solution follows Jason Gunthorpe suggestion from below URL:
      https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg25692.html
      
      cdev will hold a kref on the parent (the containing structure,
      ib_uverbs_device) and only when that kref is released it is
      guaranteed that open will never be called again.
      
      In addition, fixes the active count scheme to use an atomic
      not a kref to prevent WARN_ON as pointed by above comment
      from Jason.
      Signed-off-by: default avatarYishai Hadas <yishaih@mellanox.com>
      Signed-off-by: default avatarShachar Raindel <raindel@mellanox.com>
      Reviewed-by: default avatarJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      20dbae92
    • Christoph Hellwig's avatar
      IB/uverbs: reject invalid or unknown opcodes · 58922dda
      Christoph Hellwig authored
      commit b632ffa7 upstream.
      
      We have many WR opcodes that are only supported in kernel space
      and/or require optional information to be copied into the WR
      structure.  Reject all those not explicitly handled so that we
      can't pass invalid information to drivers.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Reviewed-by: default avatarSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      58922dda
    • Mike Marciniszyn's avatar
      IB/qib: Change lkey table allocation to support more MRs · 25a58b07
      Mike Marciniszyn authored
      commit d6f1c17e upstream.
      
      The lkey table is allocated with with a get_user_pages() with an
      order based on a number of index bits from a module parameter.
      
      The underlying kernel code cannot allocate that many contiguous pages.
      
      There is no reason the underlying memory needs to be physically
      contiguous.
      
      This patch:
      - switches the allocation/deallocation to vmalloc/vfree
      - caps the number of bits to 23 to insure at least 1 generation bit
        o this matches the module parameter description
      Reviewed-by: default avatarVinit Agnihotri <vinit.abhay.agnihotri@intel.com>
      Signed-off-by: default avatarMike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      25a58b07
    • Hin-Tak Leung's avatar
      hfs: fix B-tree corruption after insertion at position 0 · 18fd1e93
      Hin-Tak Leung authored
      commit b4cc0efe upstream.
      
      Fix B-tree corruption when a new record is inserted at position 0 in the
      node in hfs_brec_insert().
      
      This is an identical change to the corresponding hfs b-tree code to Sergei
      Antonov's "hfsplus: fix B-tree corruption after insertion at position 0",
      to keep similar code paths in the hfs and hfsplus drivers in sync, where
      appropriate.
      Signed-off-by: default avatarHin-Tak Leung <htl10@users.sourceforge.net>
      Cc: Sergei Antonov <saproj@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Reviewed-by: default avatarVyacheslav Dubeyko <slava@dubeyko.com>
      Cc: Anton Altaparmakov <anton@tuxera.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      18fd1e93
    • David Vrabel's avatar
      xen/gntdev: convert priv->lock to a mutex · e254330e
      David Vrabel authored
      commit 1401c00e upstream.
      
      Unmapping may require sleeping and we unmap while holding priv->lock, so
      convert it to a mutex.
      Signed-off-by: default avatarDavid Vrabel <david.vrabel@citrix.com>
      Reviewed-by: default avatarStefano Stabellini <stefano.stabellini@eu.citrix.com>
      Cc: Ian Campbell <ian.campbell@citrix.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e254330e
    • NeilBrown's avatar
      md/raid10: always set reshape_safe when initializing reshape_position. · a78b8633
      NeilBrown authored
      commit 299b0685 upstream.
      
      'reshape_position' tracks where in the reshape we have reached.
      'reshape_safe' tracks where in the reshape we have safely recorded
      in the metadata.
      
      These are compared to determine when to update the metadata.
      So it is important that reshape_safe is initialised properly.
      Currently it isn't.  When starting a reshape from the beginning
      it usually has the correct value by luck.  But when reducing the
      number of devices in a RAID10, it has the wrong value and this leads
      to the metadata not being updated correctly.
      This can lead to corruption if the reshape is not allowed to complete.
      
      This patch is suitable for any -stable kernel which supports RAID10
      reshape, which is 3.5 and later.
      
      Fixes: 3ea7daa5 ("md/raid10: add reshape support")
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a78b8633
    • Jialing Fu's avatar
      mmc: core: fix race condition in mmc_wait_data_done · a30c50d2
      Jialing Fu authored
      commit 71f8a4b8 upstream.
      
      The following panic is captured in ker3.14, but the issue still exists
      in latest kernel.
      ---------------------------------------------------------------------
      [   20.738217] c0 3136 (Compiler) Unable to handle kernel NULL pointer dereference
      at virtual address 00000578
      ......
      [   20.738499] c0 3136 (Compiler) PC is at _raw_spin_lock_irqsave+0x24/0x60
      [   20.738527] c0 3136 (Compiler) LR is at _raw_spin_lock_irqsave+0x20/0x60
      [   20.740134] c0 3136 (Compiler) Call trace:
      [   20.740165] c0 3136 (Compiler) [<ffffffc0008ee900>] _raw_spin_lock_irqsave+0x24/0x60
      [   20.740200] c0 3136 (Compiler) [<ffffffc0000dd024>] __wake_up+0x1c/0x54
      [   20.740230] c0 3136 (Compiler) [<ffffffc000639414>] mmc_wait_data_done+0x28/0x34
      [   20.740262] c0 3136 (Compiler) [<ffffffc0006391a0>] mmc_request_done+0xa4/0x220
      [   20.740314] c0 3136 (Compiler) [<ffffffc000656894>] sdhci_tasklet_finish+0xac/0x264
      [   20.740352] c0 3136 (Compiler) [<ffffffc0000a2b58>] tasklet_action+0xa0/0x158
      [   20.740382] c0 3136 (Compiler) [<ffffffc0000a2078>] __do_softirq+0x10c/0x2e4
      [   20.740411] c0 3136 (Compiler) [<ffffffc0000a24bc>] irq_exit+0x8c/0xc0
      [   20.740439] c0 3136 (Compiler) [<ffffffc00008489c>] handle_IRQ+0x48/0xac
      [   20.740469] c0 3136 (Compiler) [<ffffffc000081428>] gic_handle_irq+0x38/0x7c
      ----------------------------------------------------------------------
      Because in SMP, "mrq" has race condition between below two paths:
      path1: CPU0: <tasklet context>
        static void mmc_wait_data_done(struct mmc_request *mrq)
        {
           mrq->host->context_info.is_done_rcv = true;
           //
           // If CPU0 has just finished "is_done_rcv = true" in path1, and at
           // this moment, IRQ or ICache line missing happens in CPU0.
           // What happens in CPU1 (path2)?
           //
           // If the mmcqd thread in CPU1(path2) hasn't entered to sleep mode:
           // path2 would have chance to break from wait_event_interruptible
           // in mmc_wait_for_data_req_done and continue to run for next
           // mmc_request (mmc_blk_rw_rq_prep).
           //
           // Within mmc_blk_rq_prep, mrq is cleared to 0.
           // If below line still gets host from "mrq" as the result of
           // compiler, the panic happens as we traced.
           wake_up_interruptible(&mrq->host->context_info.wait);
        }
      
      path2: CPU1: <The mmcqd thread runs mmc_queue_thread>
        static int mmc_wait_for_data_req_done(...
        {
           ...
           while (1) {
                 wait_event_interruptible(context_info->wait,
                         (context_info->is_done_rcv ||
                          context_info->is_new_req));
           	   static void mmc_blk_rw_rq_prep(...
                 {
                 ...
                 memset(brq, 0, sizeof(struct mmc_blk_request));
      
      This issue happens very coincidentally; however adding mdelay(1) in
      mmc_wait_data_done as below could duplicate it easily.
      
         static void mmc_wait_data_done(struct mmc_request *mrq)
         {
           mrq->host->context_info.is_done_rcv = true;
      +    mdelay(1);
           wake_up_interruptible(&mrq->host->context_info.wait);
          }
      
      At runtime, IRQ or ICache line missing may just happen at the same place
      of the mdelay(1).
      
      This patch gets the mmc_context_info at the beginning of function, it can
      avoid this race condition.
      Signed-off-by: default avatarJialing Fu <jlfu@marvell.com>
      Tested-by: default avatarShawn Lin <shawn.lin@rock-chips.com>
      Fixes: 2220eedf ("mmc: fix async request mechanism ....")
      Signed-off-by: default avatarShawn Lin <shawn.lin@rock-chips.com>
      Signed-off-by: default avatarUlf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a30c50d2
    • Jann Horn's avatar
      fs: if a coredump already exists, unlink and recreate with O_EXCL · 9dbc5c78
      Jann Horn authored
      commit fbb18169 upstream.
      
      It was possible for an attacking user to trick root (or another user) into
      writing his coredumps into an attacker-readable, pre-existing file using
      rename() or link(), causing the disclosure of secret data from the victim
      process' virtual memory.  Depending on the configuration, it was also
      possible to trick root into overwriting system files with coredumps.  Fix
      that issue by never writing coredumps into existing files.
      
      Requirements for the attack:
       - The attack only applies if the victim's process has a nonzero
         RLIMIT_CORE and is dumpable.
       - The attacker can trick the victim into coredumping into an
         attacker-writable directory D, either because the core_pattern is
         relative and the victim's cwd is attacker-writable or because an
         absolute core_pattern pointing to a world-writable directory is used.
       - The attacker has one of these:
        A: on a system with protected_hardlinks=0:
           execute access to a folder containing a victim-owned,
           attacker-readable file on the same partition as D, and the
           victim-owned file will be deleted before the main part of the attack
           takes place. (In practice, there are lots of files that fulfill
           this condition, e.g. entries in Debian's /var/lib/dpkg/info/.)
           This does not apply to most Linux systems because most distros set
           protected_hardlinks=1.
        B: on a system with protected_hardlinks=1:
           execute access to a folder containing a victim-owned,
           attacker-readable and attacker-writable file on the same partition
           as D, and the victim-owned file will be deleted before the main part
           of the attack takes place.
           (This seems to be uncommon.)
        C: on any system, independent of protected_hardlinks:
           write access to a non-sticky folder containing a victim-owned,
           attacker-readable file on the same partition as D
           (This seems to be uncommon.)
      
      The basic idea is that the attacker moves the victim-owned file to where
      he expects the victim process to dump its core.  The victim process dumps
      its core into the existing file, and the attacker reads the coredump from
      it.
      
      If the attacker can't move the file because he does not have write access
      to the containing directory, he can instead link the file to a directory
      he controls, then wait for the original link to the file to be deleted
      (because the kernel checks that the link count of the corefile is 1).
      
      A less reliable variant that requires D to be non-sticky works with link()
      and does not require deletion of the original link: link() the file into
      D, but then unlink() it directly before the kernel performs the link count
      check.
      
      On systems with protected_hardlinks=0, this variant allows an attacker to
      not only gain information from coredumps, but also clobber existing,
      victim-writable files with coredumps.  (This could theoretically lead to a
      privilege escalation.)
      Signed-off-by: default avatarJann Horn <jann@thejh.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9dbc5c78
    • Jaewon Kim's avatar
      vmscan: fix increasing nr_isolated incurred by putback unevictable pages · 8803492d
      Jaewon Kim authored
      commit c54839a7 upstream.
      
      reclaim_clean_pages_from_list() assumes that shrink_page_list() returns
      number of pages removed from the candidate list.  But shrink_page_list()
      puts back mlocked pages without passing it to caller and without
      counting as nr_reclaimed.  This increases nr_isolated.
      
      To fix this, this patch changes shrink_page_list() to pass unevictable
      pages back to caller.  Caller will take care those pages.
      
      Minchan said:
      
      It fixes two issues.
      
      1. With unevictable page, cma_alloc will be successful.
      
      Exactly speaking, cma_alloc of current kernel will fail due to
      unevictable pages.
      
      2. fix leaking of NR_ISOLATED counter of vmstat
      
      With it, too_many_isolated works.  Otherwise, it could make hang until
      the process get SIGKILL.
      Signed-off-by: default avatarJaewon Kim <jaewon31.kim@samsung.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8803492d
    • Helge Deller's avatar
      parisc: Filter out spurious interrupts in PA-RISC irq handler · 136af9b8
      Helge Deller authored
      commit b1b4e435 upstream.
      
      When detecting a serial port on newer PA-RISC machines (with iosapic) we have a
      long way to go to find the right IRQ line, registering it, then registering the
      serial port and the irq handler for the serial port. During this phase spurious
      interrupts for the serial port may happen which then crashes the kernel because
      the action handler might not have been set up yet.
      
      So, basically it's a race condition between the serial port hardware and the
      CPU which sets up the necessary fields in the irq sructs. The main reason for
      this race is, that we unmask the serial port irqs too early without having set
      up everything properly before (which isn't easily possible because we need the
      IRQ number to register the serial ports).
      
      This patch is a work-around for this problem. It adds checks to the CPU irq
      handler to verify if the IRQ action field has been initialized already. If not,
      we just skip this interrupt (which isn't critical for a serial port at bootup).
      The real fix would probably involve rewriting all PA-RISC specific IRQ code
      (for CPU, IOSAPIC, GSC and EISA) to use IRQ domains with proper parenting of
      the irq chips and proper irq enabling along this line.
      
      This bug has been in the PA-RISC port since the beginning, but the crashes
      happened very rarely with currently used hardware.  But on the latest machine
      which I bought (a C8000 workstation), which uses the fastest CPUs (4 x PA8900,
      1GHz) and which has the largest possible L1 cache size (64MB each), the kernel
      crashed at every boot because of this race. So, without this patch the machine
      would currently be unuseable.
      
      For the record, here is the flow logic:
      1. serial_init_chip() in 8250_gsc.c calls iosapic_serial_irq().
      2. iosapic_serial_irq() calls txn_alloc_irq() to find the irq.
      3. iosapic_serial_irq() calls cpu_claim_irq() to register the CPU irq
      4. cpu_claim_irq() unmasks the CPU irq (which it shouldn't!)
      5. serial_init_chip() then registers the 8250 port.
      Problems:
      - In step 4 the CPU irq shouldn't have been registered yet, but after step 5
      - If serial irq happens between 4 and 5 have finished, the kernel will crash
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      136af9b8
    • John David Anglin's avatar
      parisc: Use double word condition in 64bit CAS operation · 2d608a34
      John David Anglin authored
      commit 1b59ddfc upstream.
      
      The attached change fixes the condition used in the "sub" instruction.
      A double word comparison is needed.  This fixes the 64-bit LWS CAS
      operation on 64-bit kernels.
      
      I can now enable 64-bit atomic support in GCC.
      
      Signed-off-by: John David Anglin <dave.anglin>
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2d608a34
    • Trond Myklebust's avatar
      NFS: nfs_set_pgio_error sometimes misses errors · e6f5a8e4
      Trond Myklebust authored
      commit e9ae58ae upstream.
      
      We should ensure that we always set the pgio_header's error field
      if a READ or WRITE RPC call returns an error. The current code depends
      on 'hdr->good_bytes' always being initialised to a large value, which
      is not always done correctly by callers.
      When this happens, applications may end up missing important errors.
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e6f5a8e4
    • Kinglong Mee's avatar
      NFS: Fix a NULL pointer dereference of migration recovery ops for v4.2 client · 05c5d5c7
      Kinglong Mee authored
      commit 18e3b739 upstream.
      
      ---Steps to Reproduce--
      <nfs-server>
      # cat /etc/exports
      /nfs/referal  *(rw,insecure,no_subtree_check,no_root_squash,crossmnt)
      /nfs/old      *(ro,insecure,subtree_check,root_squash,crossmnt)
      
      <nfs-client>
      # mount -t nfs nfs-server:/nfs/ /mnt/
      # ll /mnt/*/
      
      <nfs-server>
      # cat /etc/exports
      /nfs/referal   *(rw,insecure,no_subtree_check,no_root_squash,crossmnt,refer=/nfs/old/@nfs-server)
      /nfs/old       *(ro,insecure,subtree_check,root_squash,crossmnt)
      # service nfs restart
      
      <nfs-client>
      # ll /mnt/*/    --->>>>> oops here
      
      [ 5123.102925] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [ 5123.103363] IP: [<ffffffffa03ed38b>] nfs4_proc_get_locations+0x9b/0x120 [nfsv4]
      [ 5123.103752] PGD 587b9067 PUD 3cbf5067 PMD 0
      [ 5123.104131] Oops: 0000 [#1]
      [ 5123.104529] Modules linked in: nfsv4(OE) nfs(OE) fscache(E) nfsd(OE) xfs libcrc32c iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi coretemp crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ppdev vmw_balloon parport_pc parport i2c_piix4 shpchp auth_rpcgss nfs_acl vmw_vmci lockd grace sunrpc vmwgfx drm_kms_helper ttm drm mptspi serio_raw scsi_transport_spi e1000 mptscsih mptbase ata_generic pata_acpi [last unloaded: nfsd]
      [ 5123.105887] CPU: 0 PID: 15853 Comm: ::1-manager Tainted: G           OE   4.2.0-rc6+ #214
      [ 5123.106358] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
      [ 5123.106860] task: ffff88007620f300 ti: ffff88005877c000 task.ti: ffff88005877c000
      [ 5123.107363] RIP: 0010:[<ffffffffa03ed38b>]  [<ffffffffa03ed38b>] nfs4_proc_get_locations+0x9b/0x120 [nfsv4]
      [ 5123.107909] RSP: 0018:ffff88005877fdb8  EFLAGS: 00010246
      [ 5123.108435] RAX: ffff880053f3bc00 RBX: ffff88006ce6c908 RCX: ffff880053a0d240
      [ 5123.108968] RDX: ffffea0000e6d940 RSI: ffff8800399a0000 RDI: ffff88006ce6c908
      [ 5123.109503] RBP: ffff88005877fe28 R08: ffffffff81c708a0 R09: 0000000000000000
      [ 5123.110045] R10: 00000000000001a2 R11: ffff88003ba7f5c8 R12: ffff880054c55800
      [ 5123.110618] R13: 0000000000000000 R14: ffff880053a0d240 R15: ffff880053a0d240
      [ 5123.111169] FS:  0000000000000000(0000) GS:ffffffff81c27000(0000) knlGS:0000000000000000
      [ 5123.111726] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 5123.112286] CR2: 0000000000000000 CR3: 0000000054cac000 CR4: 00000000001406f0
      [ 5123.112888] Stack:
      [ 5123.113458]  ffffea0000e6d940 ffff8800399a0000 00000000000167d0 0000000000000000
      [ 5123.114049]  0000000000000000 0000000000000000 0000000000000000 00000000a7ec82c6
      [ 5123.114662]  ffff88005877fe18 ffffea0000e6d940 ffff8800399a0000 ffff880054c55800
      [ 5123.115264] Call Trace:
      [ 5123.115868]  [<ffffffffa03fb44b>] nfs4_try_migration+0xbb/0x220 [nfsv4]
      [ 5123.116487]  [<ffffffffa03fcb3b>] nfs4_run_state_manager+0x4ab/0x7b0 [nfsv4]
      [ 5123.117104]  [<ffffffffa03fc690>] ? nfs4_do_reclaim+0x510/0x510 [nfsv4]
      [ 5123.117813]  [<ffffffff810a4527>] kthread+0xd7/0xf0
      [ 5123.118456]  [<ffffffff810a4450>] ? kthread_worker_fn+0x160/0x160
      [ 5123.119108]  [<ffffffff816d9cdf>] ret_from_fork+0x3f/0x70
      [ 5123.119723]  [<ffffffff810a4450>] ? kthread_worker_fn+0x160/0x160
      [ 5123.120329] Code: 4c 8b 6a 58 74 17 eb 52 48 8d 55 a8 89 c6 4c 89 e7 e8 4a b5 ff ff 8b 45 b0 85 c0 74 1c 4c 89 f9 48 8b 55 90 48 8b 75 98 48 89 df <41> ff 55 00 3d e8 d8 ff ff 41 89 c6 74 cf 48 8b 4d c8 65 48 33
      [ 5123.121643] RIP  [<ffffffffa03ed38b>] nfs4_proc_get_locations+0x9b/0x120 [nfsv4]
      [ 5123.122308]  RSP <ffff88005877fdb8>
      [ 5123.122942] CR2: 0000000000000000
      
      Fixes: ec011fe8 ("NFS: Introduce a vector of migration recovery ops")
      Signed-off-by: default avatarKinglong Mee <kinglongmee@gmail.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      05c5d5c7
    • NeilBrown's avatar
      NFSv4: don't set SETATTR for O_RDONLY|O_EXCL · 74e1449c
      NeilBrown authored
      commit efcbc04e upstream.
      
      It is unusual to combine the open flags O_RDONLY and O_EXCL, but
      it appears that libre-office does just that.
      
      [pid  3250] stat("/home/USER/.config", {st_mode=S_IFDIR|0700, st_size=8192, ...}) = 0
      [pid  3250] open("/home/USER/.config/libreoffice/4-suse/user/extensions/buildid", O_RDONLY|O_EXCL <unfinished ...>
      
      NFSv4 takes O_EXCL as a sign that a setattr command should be sent,
      probably to reset the timestamps.
      
      When it was an O_RDONLY open, the SETATTR command does not
      identify any actual attributes to change.
      If no delegation was provided to the open, the SETATTR uses the
      all-zeros stateid and the request is accepted (at least by the
      Linux NFS server - no harm, no foul).
      
      If a read-delegation was provided, this is used in the SETATTR
      request, and a Netapp filer will justifiably claim
      NFS4ERR_BAD_STATEID, which the Linux client takes as a sign
      to retry - indefinitely.
      
      So only treat O_EXCL specially if O_CREAT was also given.
      Signed-off-by: default avatarNeilBrown <neilb@suse.com>
      Signed-off-by: default avatarTrond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      74e1449c
    • Filipe Manana's avatar
      Btrfs: check if previous transaction aborted to avoid fs corruption · 20ae73f4
      Filipe Manana authored
      commit 1f9b8c8f upstream.
      
      While we are committing a transaction, it's possible the previous one is
      still finishing its commit and therefore we wait for it to finish first.
      However we were not checking if that previous transaction ended up getting
      aborted after we waited for it to commit, so we ended up committing the
      current transaction which can lead to fs corruption because the new
      superblock can point to trees that have had one or more nodes/leafs that
      were never durably persisted.
      The following sequence diagram exemplifies how this is possible:
      
                CPU 0                                                        CPU 1
      
        transaction N starts
      
        (...)
      
        btrfs_commit_transaction(N)
      
          cur_trans->state = TRANS_STATE_COMMIT_START;
          (...)
          cur_trans->state = TRANS_STATE_COMMIT_DOING;
          (...)
      
          cur_trans->state = TRANS_STATE_UNBLOCKED;
          root->fs_info->running_transaction = NULL;
      
                                                                    btrfs_start_transaction()
                                                                       --> starts transaction N + 1
      
          btrfs_write_and_wait_transaction(trans, root);
            --> starts writing all new or COWed ebs created
                at transaction N
      
                                                                    creates some new ebs, COWs some
                                                                    existing ebs but doesn't COW or
                                                                    deletes eb X
      
                                                                    btrfs_commit_transaction(N + 1)
                                                                      (...)
                                                                      cur_trans->state = TRANS_STATE_COMMIT_START;
                                                                      (...)
                                                                      wait_for_commit(root, prev_trans);
                                                                        --> prev_trans == transaction N
      
          btrfs_write_and_wait_transaction() continues
          writing ebs
             --> fails writing eb X, we abort transaction N
                 and set bit BTRFS_FS_STATE_ERROR on
                 fs_info->fs_state, so no new transactions
                 can start after setting that bit
      
             cleanup_transaction()
               btrfs_cleanup_one_transaction()
                 wakes up task at CPU 1
      
                                                                      continues, doesn't abort because
                                                                      cur_trans->aborted (transaction N + 1)
                                                                      is zero, and no checks for bit
                                                                      BTRFS_FS_STATE_ERROR in fs_info->fs_state
                                                                      are made
      
                                                                      btrfs_write_and_wait_transaction(trans, root);
                                                                        --> succeeds, no errors during writeback
      
                                                                      write_ctree_super(trans, root, 0);
                                                                        --> succeeds
                                                                        --> we have now a superblock that points us
                                                                            to some root that uses eb X, which was
                                                                            never written to disk
      
      In this scenario future attempts to read eb X from disk results in an
      error message like "parent transid verify failed on X wanted Y found Z".
      
      So fix this by aborting the current transaction if after waiting for the
      previous transaction we verify that it was aborted.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      20ae73f4
    • Sakari Ailus's avatar
      v4l: omap3isp: Fix sub-device power management code · a601c5b4
      Sakari Ailus authored
      commit 9d39f054 upstream.
      
      Commit 813f5c0a ("media: Change media device link_notify behaviour")
      modified the media controller link setup notification API and updated the
      OMAP3 ISP driver accordingly. As a side effect it introduced a bug by
      turning power on after setting the link instead of before. This results in
      sub-devices not being powered down in some cases when they should be. Fix
      it.
      
      Fixes: 813f5c0a [media] media: Change media device link_notify behaviour
      Signed-off-by: default avatarSakari Ailus <sakari.ailus@iki.fi>
      Signed-off-by: default avatarLaurent Pinchart <laurent.pinchart@ideasonboard.com>
      Signed-off-by: default avatarMauro Carvalho Chehab <mchehab@osg.samsung.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a601c5b4
    • David Härdeman's avatar
      rc-core: fix remove uevent generation · 565e7a31
      David Härdeman authored
      commit a66b0c41 upstream.
      
      The input_dev is already gone when the rc device is being unregistered
      so checking for its presence only means that no remove uevent will be
      generated.
      Signed-off-by: default avatarDavid Härdeman <david@hardeman.nu>
      Signed-off-by: default avatarMauro Carvalho Chehab <mchehab@osg.samsung.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      565e7a31
    • Minfei Huang's avatar
      x86/mm: Initialize pmd_idx in page_table_range_init_count() · 0fce8532
      Minfei Huang authored
      commit 9962eea9 upstream.
      
      The variable pmd_idx is not initialized for the first iteration of the
      for loop.
      
      Assign the proper value which indexes the start address.
      
      Fixes: 719272c4 'x86, mm: only call early_ioremap_page_table_range_init() once'
      Signed-off-by: default avatarMinfei Huang <mnfhuang@gmail.com>
      Cc: tony.luck@intel.com
      Cc: wangnan0@huawei.com
      Cc: david.vrabel@citrix.com
      Reviewed-by: yinghai@kernel.org
      Link: http://lkml.kernel.org/r/1436703522-29552-1-git-send-email-mhuang@redhat.comSigned-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0fce8532
    • Yinghai Lu's avatar
      mm: check if section present during memory block registering · 191276cd
      Yinghai Lu authored
      commit 04697858 upstream.
      
      Tony Luck found on his setup, if memory block size 512M will cause crash
      during booting.
      
        BUG: unable to handle kernel paging request at ffffea0074000020
        IP: get_nid_for_pfn+0x17/0x40
        PGD 128ffcb067 PUD 128ffc9067 PMD 0
        Oops: 0000 [#1] SMP
        Modules linked in:
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.2.0-rc8 #1
        ...
        Call Trace:
           ? register_mem_sect_under_node+0x66/0xe0
           register_one_node+0x17b/0x240
           ? pci_iommu_alloc+0x6e/0x6e
           topology_init+0x3c/0x95
           do_one_initcall+0xcd/0x1f0
      
      The system has non continuous RAM address:
       BIOS-e820: [mem 0x0000001300000000-0x0000001cffffffff] usable
       BIOS-e820: [mem 0x0000001d70000000-0x0000001ec7ffefff] usable
       BIOS-e820: [mem 0x0000001f00000000-0x0000002bffffffff] usable
       BIOS-e820: [mem 0x0000002c18000000-0x0000002d6fffefff] usable
       BIOS-e820: [mem 0x0000002e00000000-0x00000039ffffffff] usable
      
      So there are start sections in memory block not present.  For example:
      
          memory block : [0x2c18000000, 0x2c20000000) 512M
      
      first three sections are not present.
      
      The current register_mem_sect_under_node() assume first section is
      present, but memory block section number range [start_section_nr,
      end_section_nr] would include not present section.
      
      For arch that support vmemmap, we don't setup memmap for struct page
      area within not present sections area.
      
      So skip the pfn range that belong to absent section.
      
      [akpm@linux-foundation.org: simplification]
      [rientjes@google.com: more simplification]
      Fixes: bdee237c ("x86: mm: Use 2GB memory block size on large memory x86-64 systems")
      Fixes: 982792c7 ("x86, mm: probe memory block size for generic x86 64bit")
      Signed-off-by: default avatarYinghai Lu <yinghai@kernel.org>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Reported-by: default avatarTony Luck <tony.luck@intel.com>
      Tested-by: default avatarTony Luck <tony.luck@intel.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Tested-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      191276cd
    • Jeffery Miller's avatar
      Add radeon suspend/resume quirk for HP Compaq dc5750. · b8144881
      Jeffery Miller authored
      commit 09bfda10 upstream.
      
      With the radeon driver loaded the HP Compaq dc5750
      Small Form Factor machine fails to resume from suspend.
      Adding a quirk similar to other devices avoids
      the problem and the system resumes properly.
      Signed-off-by: default avatarJeffery Miller <jmiller@neverware.com>
      Signed-off-by: default avatarAlex Deucher <alexander.deucher@amd.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b8144881
    • Jann Horn's avatar
      CIFS: fix type confusion in copy offload ioctl · cc45b90c
      Jann Horn authored
      commit 4c17a6d5 upstream.
      
      This might lead to local privilege escalation (code execution as
      kernel) for systems where the following conditions are met:
      
       - CONFIG_CIFS_SMB2 and CONFIG_CIFS_POSIX are enabled
       - a cifs filesystem is mounted where:
        - the mount option "vers" was used and set to a value >=2.0
        - the attacker has write access to at least one file on the filesystem
      
      To attack this, an attacker would have to guess the target_tcon
      pointer (but guessing wrong doesn't cause a crash, it just returns an
      error code) and win a narrow race.
      Signed-off-by: default avatarJann Horn <jann@thejh.net>
      Signed-off-by: default avatarSteve French <smfrench@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cc45b90c
    • Aneesh Kumar K.V's avatar
      powerpc/mm: Recompute hash value after a failed update · 5d395c5d
      Aneesh Kumar K.V authored
      commit 36b35d5d upstream.
      
      If we had secondary hash flag set, we ended up modifying hash value in
      the updatepp code path. Hence with a failed updatepp we will be using
      a wrong hash value for the following hash insert. Fix this by
      recomputing hash before insert.
      
      Without this patch we can end up with using wrong slot number in linux
      pte. That can result in us missing an hash pte update or invalidate
      which can cause memory corruption or even machine check.
      
      Fixes: 6d492ecc ("powerpc/THP: Add code to handle HPTE faults for hugepages")
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5d395c5d
    • Thomas Huth's avatar
      powerpc/rtas: Introduce rtas_get_sensor_fast() for IRQ handlers · 4d14a9e6
      Thomas Huth authored
      commit 1c2cb594 upstream.
      
      The EPOW interrupt handler uses rtas_get_sensor(), which in turn
      uses rtas_busy_delay() to wait for RTAS becoming ready in case it
      is necessary. But rtas_busy_delay() is annotated with might_sleep()
      and thus may not be used by interrupts handlers like the EPOW handler!
      This leads to the following BUG when CONFIG_DEBUG_ATOMIC_SLEEP is
      enabled:
      
       BUG: sleeping function called from invalid context at arch/powerpc/kernel/rtas.c:496
       in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1
       CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.2.0-rc2-thuth #6
       Call Trace:
       [c00000007ffe7b90] [c000000000807670] dump_stack+0xa0/0xdc (unreliable)
       [c00000007ffe7bc0] [c0000000000e1f14] ___might_sleep+0x134/0x180
       [c00000007ffe7c20] [c00000000002aec0] rtas_busy_delay+0x30/0xd0
       [c00000007ffe7c50] [c00000000002bde4] rtas_get_sensor+0x74/0xe0
       [c00000007ffe7ce0] [c000000000083264] ras_epow_interrupt+0x44/0x450
       [c00000007ffe7d90] [c000000000120260] handle_irq_event_percpu+0xa0/0x300
       [c00000007ffe7e70] [c000000000120524] handle_irq_event+0x64/0xc0
       [c00000007ffe7eb0] [c000000000124dbc] handle_fasteoi_irq+0xec/0x260
       [c00000007ffe7ef0] [c00000000011f4f0] generic_handle_irq+0x50/0x80
       [c00000007ffe7f20] [c000000000010f3c] __do_irq+0x8c/0x200
       [c00000007ffe7f90] [c0000000000236cc] call_do_irq+0x14/0x24
       [c00000007e6f39e0] [c000000000011144] do_IRQ+0x94/0x110
       [c00000007e6f3a30] [c000000000002594] hardware_interrupt_common+0x114/0x180
      
      Fix this issue by introducing a new rtas_get_sensor_fast() function
      that does not use rtas_busy_delay() - and thus can only be used for
      sensors that do not cause a BUSY condition - known as "fast" sensors.
      
      The EPOW sensor is defined to be "fast" in sPAPR - mpe.
      
      Fixes: 587f83e8 ("powerpc/pseries: Use rtas_get_sensor in RAS code")
      Signed-off-by: default avatarThomas Huth <thuth@redhat.com>
      Reviewed-by: default avatarNathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4d14a9e6