1. 28 Oct, 2015 7 commits
    • kvm: fix double free for fast mmio eventfd · 0b5ee818
      Jason Wang authored
      [ Upstream commit eefd6b06 ]
      
      We register wildcard mmio eventfd on two buses, once on KVM_MMIO_BUS
      and once on KVM_FAST_MMIO_BUS, but with a single iodev instance.
      This is a problem because kvm_io_bus_destroy() does not know that
      devices on the two buses point to the same dev, which leads to a
      double free during exit (see the oops below). Fix this by allocating
      two iodev instances, then registering one on KVM_MMIO_BUS and the
      other on KVM_FAST_MMIO_BUS.
      
      CPU: 1 PID: 2894 Comm: qemu-system-x86 Not tainted 3.19.0-26-generic #28-Ubuntu
      Hardware name: LENOVO 2356BG6/2356BG6, BIOS G7ET96WW (2.56 ) 09/12/2013
      task: ffff88009ae0c4b0 ti: ffff88020e7f0000 task.ti: ffff88020e7f0000
      RIP: 0010:[<ffffffffc07e25d8>]  [<ffffffffc07e25d8>] ioeventfd_release+0x28/0x60 [kvm]
      RSP: 0018:ffff88020e7f3bc8  EFLAGS: 00010292
      RAX: dead000000200200 RBX: ffff8801ec19c900 RCX: 000000018200016d
      RDX: ffff8801ec19cf80 RSI: ffffea0008bf1d40 RDI: ffff8801ec19c900
      RBP: ffff88020e7f3bd8 R08: 000000002fc75a01 R09: 000000018200016d
      R10: ffffffffc07df6ae R11: ffff88022fc75a98 R12: ffff88021e7cc000
      R13: ffff88021e7cca48 R14: ffff88021e7cca50 R15: ffff8801ec19c880
      FS:  00007fc1ee3e6700(0000) GS:ffff88023e240000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f8f389d8000 CR3: 000000023dc13000 CR4: 00000000001427e0
      Stack:
      ffff88021e7cc000 0000000000000000 ffff88020e7f3be8 ffffffffc07e2622
      ffff88020e7f3c38 ffffffffc07df69a ffff880232524160 ffff88020e792d80
       0000000000000000 ffff880219b78c00 0000000000000008 ffff8802321686a8
      Call Trace:
      [<ffffffffc07e2622>] ioeventfd_destructor+0x12/0x20 [kvm]
      [<ffffffffc07df69a>] kvm_put_kvm+0xca/0x210 [kvm]
      [<ffffffffc07df818>] kvm_vcpu_release+0x18/0x20 [kvm]
      [<ffffffff811f69f7>] __fput+0xe7/0x250
      [<ffffffff811f6bae>] ____fput+0xe/0x10
      [<ffffffff81093f04>] task_work_run+0xd4/0xf0
      [<ffffffff81079358>] do_exit+0x368/0xa50
      [<ffffffff81082c8f>] ? recalc_sigpending+0x1f/0x60
      [<ffffffff81079ad5>] do_group_exit+0x45/0xb0
      [<ffffffff81085c71>] get_signal+0x291/0x750
      [<ffffffff810144d8>] do_signal+0x28/0xab0
      [<ffffffff810f3a3b>] ? do_futex+0xdb/0x5d0
      [<ffffffff810b7028>] ? __wake_up_locked_key+0x18/0x20
      [<ffffffff810f3fa6>] ? SyS_futex+0x76/0x170
      [<ffffffff81014fc9>] do_notify_resume+0x69/0xb0
      [<ffffffff817cb9af>] int_signal+0x12/0x17
      Code: 5d c3 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 7f 20 e8 06 d6 a5 c0 48 8b 43 08 48 8b 13 48 89 df 48 89 42 08 <48> 89 10 48 b8 00 01 10 00 00
       RIP  [<ffffffffc07e25d8>] ioeventfd_release+0x28/0x60 [kvm]
       RSP <ffff88020e7f3bc8>
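
The ownership problem described above can be modelled in a few lines of userspace C. This is only a sketch: `io_dev`, `io_bus`, `bus_register()` and `bus_destroy()` are hypothetical stand-ins, not the kernel's types or functions, and a counter replaces the real free so the double release can be observed safely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the KVM structures; not the kernel's real types. */
#define MAX_DEVS 4

struct io_dev {
	int free_count;            /* how many times a bus "freed" this device */
};

struct io_bus {
	struct io_dev *devs[MAX_DEVS];
	size_t count;
};

void bus_register(struct io_bus *bus, struct io_dev *dev)
{
	assert(bus->count < MAX_DEVS);
	bus->devs[bus->count++] = dev;
}

/* Each bus releases every device it holds and has no idea whether another
 * bus holds the same pointer. */
void bus_destroy(struct io_bus *bus)
{
	for (size_t i = 0; i < bus->count; i++)
		bus->devs[i]->free_count++;   /* a real teardown would free here */
	bus->count = 0;
}

/* Buggy pattern: one instance registered on both buses. Returns the number
 * of times that instance was released. */
int release_count_shared(void)
{
	struct io_bus mmio = {0}, fast_mmio = {0};
	struct io_dev dev = {0};

	bus_register(&mmio, &dev);
	bus_register(&fast_mmio, &dev);
	bus_destroy(&mmio);
	bus_destroy(&fast_mmio);
	return dev.free_count;
}

/* Fixed pattern: one instance per bus. Returns the highest release count. */
int release_count_separate(void)
{
	struct io_bus mmio = {0}, fast_mmio = {0};
	struct io_dev a = {0}, b = {0};

	bus_register(&mmio, &a);
	bus_register(&fast_mmio, &b);
	bus_destroy(&mmio);
	bus_destroy(&fast_mmio);
	return a.free_count > b.free_count ? a.free_count : b.free_count;
}
```

With a shared instance the model releases the same object twice; with one instance per bus every object is released exactly once.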
      
      Cc: stable@vger.kernel.org
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • kvm: factor out core eventfd assign/deassign logic · 7642b3f1
      Jason Wang authored
      [ Upstream commit 85da11ca ]
      
      This patch factors out core eventfd assign/deassign logic and leaves
      the argument checking and bus index selection to callers.
      
      Cc: stable@vger.kernel.org
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • kvm: fix zero length mmio searching · 7d765ce0
      Jason Wang authored
      [ Upstream commit 8f4216c7 ]
      
      Currently, a zero length mmio eventfd assigned on KVM_MMIO_BUS will
      never be found by kvm_io_bus_cmp(), since it always compares the
      kvm_io_range() using the length that the guest wrote. As a result,
      for vhost for example, a kick is trapped by qemu userspace instead
      of by vhost. Fix this by comparing with zero length when the
      iodevice is zero length.
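
A minimal sketch of the comparison change, assuming a simplified range type. `io_range`, `range_cmp_old()` and `range_cmp_fixed()` are illustrative names, not the kernel's `kvm_io_bus_cmp()`; the point is only that a zero-length device should match on address alone, whatever length the guest wrote.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an address range on an I/O bus. */
struct io_range {
	uint64_t addr;
	int len;
};

/* Old behaviour as described: always compare using the length the guest wrote. */
int range_cmp_old(const struct io_range *key, const struct io_range *dev)
{
	if (key->addr < dev->addr)
		return -1;
	if (key->addr + key->len > dev->addr + dev->len)
		return 1;
	return 0;
}

/* Fixed behaviour: a zero-length device matches on address alone. */
int range_cmp_fixed(const struct io_range *key, const struct io_range *dev)
{
	uint64_t key_end = key->addr;
	uint64_t dev_end = dev->addr;

	assert(key->len >= 0 && dev->len >= 0);
	if (key->addr < dev->addr)
		return -1;
	if (dev->len) {
		key_end += key->len;
		dev_end += dev->len;
	}
	if (key_end > dev_end)
		return 1;
	return 0;
}
```

For a 4-byte write at the address of a zero-length device, the old comparison reports "greater" (no match), while the fixed one reports a match.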
      
      Cc: stable@vger.kernel.org
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • kvm: don't try to register to KVM_FAST_MMIO_BUS for non mmio eventfd · d758df24
      Jason Wang authored
      [ Upstream commit 8453fecb ]
      
      We only want zero length mmio eventfd to be registered on
      KVM_FAST_MMIO_BUS. So when arg->len is zero, check explicitly that
      the eventfd is an mmio one.
      
      Cc: stable@vger.kernel.org
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • arm: KVM: Fix incorrect device to IPA mapping · 45258bdd
      Marek Majtyka authored
      [ Upstream commit ca09f02f ]
      
      A critical bug has been found in device memory stage1 translation for
      VMs with more than 4GB of address space. Because vm_pgoff is narrower
      than pa (u32 and u64 respectively in the LPAE case), the more
      significant bits of pa may be lost, as the shift operation is performed
      on a u32 and only the result is cast to u64.
      
      Example: vm_pgoff(u32)=0x00210030, PAGE_SHIFT=12
              expected pa(u64):   0x0000000210030000
              produced pa(u64):   0x0000000010030000
      
      The fix is to change the order of operations (casting first onto phys_addr_t
      and then shifting).
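
The order-of-operations issue can be reproduced in plain C with fixed-width types. This is a sketch, not the patched kernel code: the function names are made up, and only the `vm_pgoff` value and `PAGE_SHIFT` come from the example.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* value taken from the example */

/* Buggy order: the shift happens in 32 bits, so the top bits are lost
 * before the result is widened. */
uint64_t pa_shift_then_cast(uint32_t vm_pgoff)
{
	return (uint64_t)(vm_pgoff << PAGE_SHIFT);
}

/* Fixed order: widen first, then shift in 64 bits. */
uint64_t pa_cast_then_shift(uint32_t vm_pgoff)
{
	return (uint64_t)vm_pgoff << PAGE_SHIFT;
}

/* Worked check using the vm_pgoff value from the example. */
void check_example(void)
{
	assert(pa_shift_then_cast(0x00210030u) == 0x0000000010030000ull);
	assert(pa_cast_then_shift(0x00210030u) == 0x0000000210030000ull);
}
```
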
      Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
      [maz: fixed changelog and patch formatting]
      Cc: stable@vger.kernel.org
      Signed-off-by: Marek Majtyka <marek.majtyka@tieto.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • hp-wmi: limit hotkey enable · 3cd079e5
      Kyle Evans authored
      [ Upstream commit 8a1513b4 ]
      
      Do not write initialize magic on systems that do not have
      feature query 0xb. Fixes Bug #82451.
      
      Redefine FEATURE_QUERY to align with 0xb and FEATURE2 with 0xd
      for code clarity.
      
      Add a new test function, hp_wmi_bios_2008_later() & simplify
      hp_wmi_bios_2009_later(), which fixes a bug in cases where
      an improper value is returned. Probably also fixes Bug #69131.
      
      Add missing __init tag.
      Signed-off-by: Kyle Evans <kvans32@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Darren Hart <dvhart@linux.intel.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • zram: fix possible use after free in zcomp_create() · 2889a072
      Luis Henriques authored
      [ Upstream commit 3aaf14da ]
      
      zcomp_create() verifies the success of zcomp_strm_{multi,single}_create()
      through comp->stream, which can potentially be pointing to memory that
      was freed if these functions returned an error.
      
      While at it, replace an 'ERR_PTR(-ENOMEM)' with a more generic
      'ERR_PTR(error)', as in the future zcomp_strm_{multi,single}_create()
      could return other error codes.  Function documentation updated
      accordingly.
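
The difference between checking a possibly stale pointer and checking the returned error code can be sketched as follows. All names here (`strm_create()`, `create_check_pointer()`, `create_check_error()`) are hypothetical, and a static object stands in for the heap so that the "freed" state can be inspected without undefined behaviour.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical miniature of the pattern; not the zram code itself. */
struct stream { int live; };
struct comp { struct stream *stream; };

static struct stream pool;   /* stands in for a heap allocation */

/* On failure the stream is released but comp->stream is left pointing at it. */
int strm_create(struct comp *comp, int fail)
{
	assert(comp != NULL);
	pool.live = 1;
	comp->stream = &pool;
	if (fail) {
		pool.live = 0;       /* "freed"; comp->stream now dangles */
		return -ENOMEM;
	}
	return 0;
}

/* Buggy check: infers success from the pointer, which may be stale. */
int create_check_pointer(struct comp *comp, int fail)
{
	strm_create(comp, fail);
	return comp->stream ? 0 : -ENOMEM;
}

/* Fixed check: trusts only the returned error code and propagates it. */
int create_check_error(struct comp *comp, int fail)
{
	int error = strm_create(comp, fail);

	if (error)
		comp->stream = NULL;
	return error;
}
```

In this model the pointer-based check reports success even though the stream was already released, whereas the error-based check propagates whatever code the helper returned.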
      
      Fixes: beca3ec7 ("zram: add multi stream functionality")
      Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
  2. 27 Oct, 2015 25 commits
    • of_mdio: add new DT property 'managed' to specify the PHY management type · 71a386c7
      Stas Sergeev authored
      [ Upstream commit 4cba5c21 ]
      
      Currently the PHY management type is selected arbitrarily by the MAC driver.
      The decision is based on the presence of the "fixed-link" node and on the
      preferences of the driver's authors.
      This caused a regression recently, when mvneta driver suddenly started
      to use the in-band status for auto-negotiation on fixed links.
      It appears the auto-negotiation may not work when expected by the MAC driver.
      Sebastien Rannou explains:
      << Yes, I confirm that my HW does not generate an in-band status. AFAIK, it's
      a PHY that aggregates 4xSGMIIs to 1xQSGMII ; the MAC side of the PHY (with
      inband status) is connected to the switch through QSGMII, and in this context
      we are on the media side of the PHY. >>
      https://lkml.org/lkml/2015/7/10/206
      
      This patch introduces the new string property 'managed' that allows
      the user to set the management type explicitly.
      The supported values are:
      "auto" - default. Uses either MDIO or nothing, depending on the presence
      of the fixed-link node
      "in-band-status" - use in-band status
      Signed-off-by: Stas Sergeev <stsp@users.sourceforge.net>
      
      CC: Rob Herring <robh+dt@kernel.org>
      CC: Pawel Moll <pawel.moll@arm.com>
      CC: Mark Rutland <mark.rutland@arm.com>
      CC: Ian Campbell <ijc+devicetree@hellion.org.uk>
      CC: Kumar Gala <galak@codeaurora.org>
      CC: Florian Fainelli <f.fainelli@gmail.com>
      CC: Grant Likely <grant.likely@linaro.org>
      CC: devicetree@vger.kernel.org
      CC: linux-kernel@vger.kernel.org
      CC: netdev@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • net: dsa: bcm_sf2: Do not override speed settings · c0fb0993
      Florian Fainelli authored
      [ Upstream commit d2eac98f ]
      
      The SF2 driver currently overrides speed settings for its port
      configured using a fixed PHY. This is both unnecessary and incorrect,
      because we keep feeding back to the hardware the parameters that we
      read from the PHY device, which in the case of a fixed PHY cannot
      possibly change speed.
      
      This change is required so that the fixed PHY code can register a PHY
      with a link configured as DOWN by default, and to avoid a circular
      dependency where we require the link_update callback to run to program
      the hardware, and we then use the fixed PHY parameters to program the
      hardware with the same settings.
      
      Fixes: 246d7f77 ("net: dsa: add Broadcom SF2 switch driver")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • tcp: add proper TS val into RST packets · 9a2c1f52
      Eric Dumazet authored
      [ Upstream commit 675ee231 ]
      
      RST packets sent on behalf of TCP connections with TS option (RFC 7323
      TCP timestamps) have incorrect TS val (set to 0), but correct TS ecr.
      
      A > B: Flags [S], seq 0, win 65535, options [mss 1000,nop,nop,TS val 100
      ecr 0], length 0
      B > A: Flags [S.], seq 2444755794, ack 1, win 28960, options [mss
      1460,nop,nop,TS val 7264344 ecr 100], length 0
      A > B: Flags [.], ack 1, win 65535, options [nop,nop,TS val 110 ecr
      7264344], length 0
      
      B > A: Flags [R.], seq 1, ack 1, win 28960, options [nop,nop,TS val 0
      ecr 110], length 0
      
      We need to call skb_mstamp_get() to get proper TS val,
      derived from skb->skb_mstamp
      
      Note that RFC 1323 was advocating to not send TS option in RST segment,
      but RFC 7323 recommends the opposite :
      
        Once TSopt has been successfully negotiated, that is both <SYN> and
        <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
        segment for the duration of the connection, and SHOULD be sent in an
        <RST> segment (see Section 5.2 for details)
      
      Note this RFC recommends to send TS val = 0, but we believe it is
      premature : We do not know if all TCP stacks are properly
      handling the receive side :
      
         When an <RST> segment is
         received, it MUST NOT be subjected to the PAWS check by verifying an
         acceptable value in SEG.TSval, and information from the Timestamps
         option MUST NOT be used to update connection state information.
         SEG.TSecr MAY be used to provide stricter <RST> acceptance checks.
      
      In 5 years, if/when all TCP stacks are RFC 7323 ready, we might consider
      sending TS val = 0, if it buys something.
      
      Fixes: 7faee5c0 ("tcp: remove TCP_SKB_CB(skb)->when")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • net: dsa: bcm_sf2: Fix 64-bits register writes · 646cd5ed
      Florian Fainelli authored
      [ Upstream commit 03679a14 ]
      
      The macro to write 64-bit quantities to the 32-bit registers swapped
      the value and offset arguments. We want to preserve the ordering of the
      arguments with respect to how writel() is implemented, for instance:
      value first, offset/base second.
      
      Fixes: 246d7f77 ("net: dsa: add Broadcom SF2 switch driver")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • net: eth: altera: fix napi poll_list corruption · ca41797a
      Atsushi Nemoto authored
      [ Upstream commit 4548a697 ]
      
      tse_poll() calls __napi_complete() with irq enabled.  This leads to napi
      poll_list corruption and may stop all napi drivers from working.
      Use napi_complete() instead of __napi_complete().
      Signed-off-by: Atsushi Nemoto <nemoto@toshiba-tops.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • ext4: don't manipulate recovery flag when freezing no-journal fs · 826d518a
      Eric Sandeen authored
      [ Upstream commit c642dc9e ]
      
      At some point along this sequence of changes:
      
      f6e63f90 ext4: fold ext4_nojournal_sops into ext4_sops
      bb044576 ext4: support freezing ext2 (nojournal) file systems
      9ca92389 ext4: Use separate super_operations structure for no_journal filesystems
      
      ext4 started setting needs_recovery on filesystems without journals
      when they are unfrozen.  This makes no sense, and in fact confuses
      blkid to the point where it doesn't recognize the filesystem at all.
      
      (freeze ext2; unfreeze ext2; run blkid; see no output; run dumpe2fs,
      see needs_recovery set on fs w/ no journal).
      
      To fix this, don't manipulate the INCOMPAT_RECOVER feature on
      filesystems without journals.
      Reported-by: Stu Mark <smark@datto.com>
      Reviewed-by: Jan Kara <jack@suse.com>
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • cxl: Fix unbalanced pci_dev_get in cxl_probe · 5324b253
      Daniel Axtens authored
      [ Upstream commit 2925c2fd ]
      
      Currently the first thing we do in cxl_probe is to grab a reference
      on the pci device. Later on, we call device_register on our adapter.
      In our remove path, we call device_unregister, but we never call
      pci_dev_put. We therefore leak the device every time we do a
      reflash.
      
      device_register/unregister is sufficient to hold the reference.
      Therefore, drop the call to pci_dev_get.
      
      Here's why this is safe.
      The proposed cxl_probe(pdev) calls cxl_adapter_init:
          a) init calls cxl_adapter_alloc, which creates a struct cxl,
             conventionally called adapter. This struct contains a
             device entry, adapter->dev.
      
          b) init calls cxl_configure_adapter, where we set
             adapter->dev.parent = &dev->dev (here dev is the pci dev)
      
      So at this point, the cxl adapter's device's parent is the PCI
      device that I want to be refcounted properly.
      
          c) init calls cxl_register_adapter
             *) cxl_register_adapter calls device_register(&adapter->dev)
      
      So now we're in device_register, where dev is the adapter device, and
      we want to know if the PCI device is safe after we return.
      
      device_register(&adapter->dev) calls device_initialize() and then
      device_add().
      
      device_add() does a get_device(). device_add() also explicitly grabs
      the device's parent, and calls get_device() on it:
      
               parent = get_device(dev->parent);
      
      So device_register() takes a reference on the parent PCI dev,
      which is what pci_dev_get() was guarding. pci_dev_get() can therefore
      be safely removed.
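
The reference accounting argued above can be checked with a toy model. This is a sketch under stated assumptions: `struct device`, `get_dev()`, `put_dev()`, `register_child()` and `unregister_child()` are simplified stand-ins that model only the counting, not the driver core or the cxl driver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted device; models only the counting. */
struct device {
	int refcount;
	struct device *parent;
};

struct device *get_dev(struct device *dev)
{
	if (dev)
		dev->refcount++;
	return dev;
}

void put_dev(struct device *dev)
{
	if (dev) {
		assert(dev->refcount > 0);
		dev->refcount--;
	}
}

/* Registering a child pins its parent; unregistering drops that pin. */
void register_child(struct device *child)   { get_dev(child->parent); }
void unregister_child(struct device *child) { put_dev(child->parent); }

/* Returns the references leaked on the PCI device after one probe/remove
 * cycle. extra_get = 1 models the old probe path that took its own
 * reference and never dropped it. */
int leaked_refs(int extra_get)
{
	struct device pci = { 1, NULL };
	struct device adapter = { 1, &pci };

	if (extra_get)
		get_dev(&pci);          /* never matched by a put on remove */
	register_child(&adapter);
	unregister_child(&adapter);
	return pci.refcount - 1;
}
```

In the model, the extra get leaks one reference per probe/remove cycle, while relying on registration alone leaves the count balanced.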
      
      Fixes: f204e0b8 ("cxl: Driver code for powernv PCIe based cards for userspace access")
      Cc: stable@vger.kernel.org
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Acked-by: Ian Munsie <imunsie@au1.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • igb: Fix oops caused by missing queue pairing · 5be042b1
      Shota Suzuki authored
      [ Upstream commit 72ddef05 ]
      
      When initializing igb driver (e.g. 82576, I350), IGB_FLAG_QUEUE_PAIRS is
      set if adapter->rss_queues exceeds half of max_rss_queues in
      igb_init_queue_configuration().
      On the other hand, IGB_FLAG_QUEUE_PAIRS is not set even if the number of
      queues exceeds half of max_combined in igb_set_channels() when changing
      the number of queues by "ethtool -L".
      In this case, if numvecs is larger than MAX_MSIX_ENTRIES (10), the size
      of adapter->msix_entries[], an overflow can occur in
      igb_set_interrupt_capability(), which in turn leads to an oops.
      
      Fix this problem as follows:
       - When changing the number of queues by "ethtool -L", set
         IGB_FLAG_QUEUE_PAIRS in the same way as initializing igb driver.
       - When increasing the size of q_vector, reallocate it appropriately.
         (With IGB_FLAG_QUEUE_PAIRS set, the size of q_vector gets larger.)
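
A rough model of why the missing pairing flag matters. The pairing rule and the table size of 10 come from the description above; the vector-count formula (one per Rx queue, one per Tx queue when unpaired, plus one extra) is an assumption about a simplified driver, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_MSIX_ENTRIES 10   /* size of msix_entries[] quoted in the text */

/* Pairing rule described for driver initialisation. */
bool should_pair_queues(int rss_queues, int max_rss_queues)
{
	return rss_queues > max_rss_queues / 2;
}

/* Simplified vector count (an assumption): one per Rx queue, one more per
 * Tx queue when Tx and Rx are not paired, plus one for link/other events. */
int vectors_needed(int rx_queues, int tx_queues, bool paired)
{
	int numvecs = rx_queues;

	assert(rx_queues >= 0 && tx_queues >= 0);
	if (!paired)
		numvecs += tx_queues;
	return numvecs + 1;
}

bool fits_msix_table(int numvecs)
{
	return numvecs <= MAX_MSIX_ENTRIES;
}
```

Under these assumptions, 8 unpaired queues would need 17 vectors and overflow a 10-entry table, while 8 paired queues need 9 and fit.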
      
      Another possible way to fix this problem is to cap the queues at its
      initial number, which is the number of the initial online cpus. But this
      is not the optimal way because we cannot increase queues when another
      cpu becomes online.
      
      Note that before commit cd14ef54 ("igb: Change to use statically
      allocated array for MSIx entries"), this problem did not cause oops
      but just made the number of queues become 1 because of entering msi_only
      mode in igb_set_interrupt_capability().
      
      Fixes: 907b7835 ("igb: Add ethtool support to configure number of channels")
      CC: stable <stable@vger.kernel.org>
      Signed-off-by: Shota Suzuki <suzuki_shota_t3@lab.ntt.co.jp>
      Tested-by: Aaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • rtlwifi: rtl8821ae: Fix an expression that is always false · e936a4c6
      Larry Finger authored
      [ Upstream commit 251086f5 ]
      
      In routine _rtl8821ae_set_media_status(), an incorrect mask results in a test
      for AP status to always be false. Similar bugs were fixed in rtl8192cu and
      rtl8192de, but this instance was missed at that time.
      Reported-by: David Binderman <dcb314@hotmail.com>
      Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
      Cc: Stable <stable@vger.kernel.org> [3.18+]
      Cc: David Binderman <dcb314@hotmail.com>
      Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • x86/nmi/64: Use DF to avoid userspace RSP confusing nested NMI detection · 4bc532d8
      Andy Lutomirski authored
      [ Upstream commit 810bc075 ]
      
      We have a tricky bug in the nested NMI code: if we see RSP
      pointing to the NMI stack on NMI entry from kernel mode, we
      assume that we are executing a nested NMI.
      
      This isn't quite true.  A malicious userspace program can point
      RSP at the NMI stack, issue SYSCALL, and arrange for an NMI to
      happen while RSP is still pointing at the NMI stack.
      
      Fix it with a sneaky trick.  Set DF in the region of code that
      the RSP check is intended to detect.  IRET will clear DF
      atomically.
      
      ( Note: other than paravirt, there's little need for all this
        complexity. We could check RIP instead of RSP. )
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • x86/nmi/64: Reorder nested NMI checks · eb0bad52
      Andy Lutomirski authored
      [ Upstream commit a27507ca ]
      
      Check the repeat_nmi .. end_repeat_nmi special case first.  The
      next patch will rework the RSP check and, as a side effect, the
      RSP check will no longer detect repeat_nmi .. end_repeat_nmi, so
      we'll need this ordering of the checks.
      
      Note: this is more subtle than it appears.  The check for
      repeat_nmi .. end_repeat_nmi jumps straight out of the NMI code
      instead of adjusting the "iret" frame to force a repeat.  This
      is necessary, because the code between repeat_nmi and
      end_repeat_nmi sets "NMI executing" and then writes to the
      "iret" frame itself.  If a nested NMI comes in and modifies the
      "iret" frame while repeat_nmi is also modifying it, we'll end up
      with garbage.  The old code got this right, as does the new
      code, but the new code is a bit more explicit.
      
      If we were to move the check right after the "NMI executing"
      check, then we'd get it wrong and have random crashes.
      
      ( Because the "NMI executing" check would jump to the code that would
        modify the "iret" frame without checking if the interrupted NMI was
        currently modifying it. )
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • x86/nmi/64: Improve nested NMI comments · 092f7a2a
      Andy Lutomirski authored
      [ Upstream commit 0b22930e ]
      
      I found the nested NMI documentation to be difficult to follow.
      Improve the comments.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • bna: fix interrupts storm caused by erroneous packets · 5e4c0ae9
      Ivan Vecera authored
      [ Upstream commit ade4dc3e ]
      
      Commit e29aa339 ("bna: Enable Multi Buffer RX") moved the packet counter
      increment from the beginning of the NAPI processing loop to after the
      check for erroneous packets, so those packets are never accounted. This
      counter is used to inform the firmware about the number of processed
      completions (packets). As these packets are never acked, the firmware
      fires IRQs for them again and again.
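
The effect of where the counter is incremented can be sketched with a toy polling loop. `struct completion` and both functions are hypothetical; they model only the counting, not the bna driver's NAPI handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical completion entry; 'error' marks an erroneous packet. */
struct completion { bool error; };

/* Buggy loop: erroneous entries are skipped before being counted, so the
 * count reported to the firmware is too low. */
int poll_count_after_check(const struct completion *cq, size_t n)
{
	int packets = 0;

	assert(cq != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (cq[i].error)
			continue;
		packets++;
	}
	return packets;
}

/* Fixed loop: every processed completion is counted, good or bad. */
int poll_count_before_check(const struct completion *cq, size_t n)
{
	int packets = 0;

	assert(cq != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		packets++;
		if (cq[i].error)
			continue;
		/* a real driver would hand the packet to the stack here */
	}
	return packets;
}
```

With one erroneous entry among three completions, the first loop acknowledges only two, which in the scenario described leaves the firmware re-raising the interrupt for the third.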
      
      Fixes: e29aa339 ("bna: Enable Multi Buffer RX")
      Signed-off-by: Ivan Vecera <ivecera@redhat.com>
      Acked-by: Rasesh Mody <rasesh.mody@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • udp: fix dst races with multicast early demux · 2ab4f113
      Eric Dumazet authored
      [ Upstream commit 10e2eb87 ]
      
      Multicast dst are not cached. They carry DST_NOCACHE.
      
      As mentioned in commit f8864972 ("ipv4: fix dst race in
      sk_dst_get()"), these dst need special care before caching them
      into a socket.
      
      Caching them is allowed only if their refcnt was not 0, i.e. we
      must use atomic_inc_not_zero().
      
      Also, we must use READ_ONCE() to fetch sk->sk_rx_dst, as mentioned
      in commit d0c294c5 ("tcp: prevent fetching dst twice in early demux
      code")
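
For readers unfamiliar with the "increment unless zero" idiom, here is a userspace analogue written with C11 atomics. It is a sketch only: `ref_inc_not_zero()` is a hypothetical helper, not the kernel's atomic_inc_not_zero(), though it aims at the same contract of refusing to revive an object whose count has already dropped to zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Takes a reference only if the count is currently non-zero.
 * Returns true if a reference was taken. */
bool ref_inc_not_zero(atomic_int *refcnt)
{
	int old;

	assert(refcnt != NULL);
	old = atomic_load(refcnt);
	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value and we retry. */
		if (atomic_compare_exchange_weak(refcnt, &old, old + 1))
			return true;
	}
	return false;   /* object is already on its way to being freed */
}
```

A caller would cache the object only when the helper returns true, and otherwise treat it as gone.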
      
      Fixes: 421b3885 ("udp: ipv4: Add udp early demux")
      Tested-by: Gregory Hoggarth <Gregory.Hoggarth@alliedtelesis.co.nz>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Gregory Hoggarth <Gregory.Hoggarth@alliedtelesis.co.nz>
      Reported-by: Alex Gartrell <agartrell@fb.com>
      Cc: Michal Kubeček <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • packet: missing dev_put() in packet_do_bind() · f0efe010
      Lars Westerhoff authored
      [ Upstream commit 158cd4af ]
      
      When binding a PF_PACKET socket, the use count of the bound interface is
      always increased with dev_hold in dev_get_by_{index,name}.  However,
      when rebound with the same protocol and device as in the previous bind
      the use count of the interface was not decreased.  Ultimately, this
      caused the deletion of the interface to fail with the following message:
      
      unregister_netdevice: waiting for dummy0 to become free. Usage count = 1
      
      This patch moves the dev_put out of the conditional part that was only
      executed when either the protocol or device changed on a bind.
      
      Fixes: 902fefb8 ('packet: improve socket create/bind latency in some cases')
      Signed-off-by: Lars Westerhoff <lars.westerhoff@newtec.eu>
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • fib_rules: fix fib rule dumps across multiple skbs · 71960d66
      Wilson Kok authored
      [ Upstream commit 41fc0143 ]
      
      dump_rules returns skb length and not error.
      But when family == AF_UNSPEC, the caller of dump_rules
      assumes that it returns an error. Hence, when family == AF_UNSPEC,
      we continue trying to dump on -EMSGSIZE errors resulting in
      incorrect dump idx carried between skbs belonging to the same dump.
      This results in fib rule dump always only dumping rules that fit
      into the first skb.
      
      This patch fixes dump_rules to return error so that we exit correctly
      and idx is correctly maintained between skbs that are part of the
      same dump.
      Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • openvswitch: Zero flows on allocation. · ae688bc6
      Jesse Gross authored
      [ Upstream commit ae5f2fb1 ]
      
      When support for megaflows was introduced, OVS needed to start
      installing flows with a mask applied to them. Since masking is an
      expensive operation, OVS also had an optimization that would only
      take the parts of the flow keys that were covered by a non-zero
      mask. The values stored in the remaining pieces should not matter
      because they are masked out.
      
      While this works fine for the purposes of matching (which must always
      look at the mask), serialization to netlink can be problematic. Since
      the flow and the mask are serialized separately, the uninitialized
      portions of the flow can be encoded with whatever values happen to be
      present.
      
      In terms of functionality, this has little effect since these fields
      will be masked out by definition. However, it leaks kernel memory to
      userspace, which is a potential security vulnerability. It is also
      possible that other code paths could look at the masked key and get
      uninitialized data, although this does not currently appear to be an
      issue in practice.
      
      This removes the mask optimization for flows that are being installed.
      This was always intended to be the case as the mask optimizations were
      really targeting per-packet flow operations.
      
      Fixes: 03f0d916 ("openvswitch: Mega flow implementation")
      Signed-off-by: Jesse Gross <jesse@nicira.com>
      Acked-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
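      The difference between the range-limited masking and the full masking
      described in this commit message can be sketched in plain C as follows.
      This is an illustration under stated assumptions, not the OVS
      implementation: KEY_LEN, mask_sketch and mask_key_sketch() are
      hypothetical, and the key is treated as a flat byte array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KEY_LEN 16 /* hypothetical key size for the sketch */

/* Hypothetical mask: bytes[] is non-zero only inside [start, end). */
struct mask_sketch {
    uint8_t bytes[KEY_LEN];
    size_t start;
    size_t end;
};

/*
 * Apply the mask to src and store the result in dst.
 * full == 0: touch only the range covered by the mask (per-packet path);
 *            dst bytes outside the range keep whatever they held before.
 * full != 0: process the whole key, so every byte outside the range is 0.
 */
static void mask_key_sketch(uint8_t *dst, const uint8_t *src,
                            const struct mask_sketch *m, int full)
{
    size_t start = full ? 0 : m->start;
    size_t end = full ? KEY_LEN : m->end;

    assert(start <= end && end <= KEY_LEN);
    for (size_t i = start; i < end; i++)
        dst[i] = src[i] & m->bytes[i];
}
```

      With full set, a destination key that started out with stale contents
      ends up fully defined, which is what matters when the key is later
      serialized on its own.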
    • sctp: fix race on protocol/netns initialization · 779c19e0
      Marcelo Ricardo Leitner authored
      [ Upstream commit 8e2d61e0 ]
      
      Consider the case where the sctp module is unloaded and is being
      requested because a user is creating an SCTP socket.
      
      During initialization, sctp will add the new protocol type and then
      initialize pernet subsys:
      
              status = sctp_v4_protosw_init();
              if (status)
                      goto err_protosw_init;
      
              status = sctp_v6_protosw_init();
              if (status)
                      goto err_v6_protosw_init;
      
              status = register_pernet_subsys(&sctp_net_ops);
      
      The problem is that after those calls to sctp_v{4,6}_protosw_init(), it
      is possible for userspace to create SCTP sockets as if the module were
      already fully loaded. If that happens, one possible effect is that
      net->sctp.local_addr_list gets readers earlier than expected.
      sctp_net_init() takes no precautions while dealing with that list, which
      can lead to a panic. The damage is not limited to that, because
      sctp_sock_init() will copy a bunch of blank or partially initialized
      values from net->sctp.
      
      The race happens like this:
      
           CPU 0                           |  CPU 1
        socket()                           |
         __sock_create                     | socket()
          inet_create                      |  __sock_create
           list_for_each_entry_rcu(        |
              answer, &inetsw[sock->type], |
              list) {                      |   inet_create
            /* no hits */                  |
           if (unlikely(err)) {            |
            ...                            |
            request_module()               |
            /* socket creation is blocked  |
             * until the module is loaded  |
             */                            |
             sctp_init                     |
              sctp_v4_protosw_init         |
               inet_register_protosw       |
                list_add_rcu(&p->list,     |
                             last_perm);   |
                                           |  list_for_each_entry_rcu(
                                           |     answer, &inetsw[sock->type],
              sctp_v6_protosw_init         |     list) {
                                           |     /* hit, so assumes protocol
                                           |      * is already loaded
                                           |      */
                                           |  /* socket creation continues
                                           |   * before netns is initialized
                                           |   */
              register_pernet_subsys       |
      
      Simply inverting the initialization order between
      register_pernet_subsys() and sctp_v4_protosw_init() is not possible
      because register_pernet_subsys() will create a control sctp socket, so
      the protocol must already be visible by then. Deferring the socket
      creation to a work-queue is not a good option, especially because we
      would lose the ability to handle its errors.
      
      So, as suggested by Vlad, the fix is to split netns initialization into
      two stages: defaults and control socket. The defaults are then already
      loaded by the time we register the protocol, while control socket
      initialization stays at the same point where it happens today.
      
      Fixes: 4db67e80 ("sctp: Make the address lists per network namespace")
      Signed-off-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
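      The two-stage ordering described in this commit message can be sketched
      as a toy model. Nothing here is the real SCTP code: net_sketch and all
      function names are hypothetical, and the per-netns state is reduced to
      three flags. The sketch only shows why defaults must be in place before
      the protocol becomes visible, while the control socket still comes
      after registration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-namespace state, reduced to three flags. */
struct net_sketch {
    bool defaults_ready;
    bool protocol_visible;
    bool ctl_sock_ready;
};

static int init_defaults(struct net_sketch *n)
{
    n->defaults_ready = true;
    return 0;
}

static int register_protocol(struct net_sketch *n)
{
    /* From here on, other CPUs may create sockets. */
    assert(n->defaults_ready);
    n->protocol_visible = true;
    return 0;
}

static int init_ctl_sock(struct net_sketch *n)
{
    if (!n->protocol_visible)
        return -1; /* the control socket needs a visible protocol */
    n->ctl_sock_ready = true;
    return 0;
}

/* 0: ok, -1: protocol unknown (module load), -2: would read blank state. */
static int create_socket(const struct net_sketch *n)
{
    if (!n->protocol_visible)
        return -1;
    return n->defaults_ready ? 0 : -2;
}

static int module_init_sketch(struct net_sketch *n)
{
    int err = init_defaults(n);      /* stage 1: before registration */

    if (err)
        return err;
    err = register_protocol(n);
    if (err)
        return err;
    return init_ctl_sock(n);         /* stage 2: after registration */
}
```

      In this model a racing create_socket() call can never return -2,
      because the protocol only becomes visible after the defaults are set.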
    • netlink, mmap: transform mmap skb into full skb on taps · d3820009
      Daniel Borkmann authored
      [ Upstream commit 1853c949 ]
      
      Ken-ichirou reported that running netlink in mmap mode for receive in
      combination with nlmon will throw a NULL pointer dereference in
      __kfree_skb() on nlmon_xmit(). In my case, I can also trigger an "unable
      to handle kernel paging request". The problem is the skb_clone() in
      __netlink_deliver_tap_skb() for skbs that are mmaped.
      
      I.e. the cloned skb doesn't have a destructor, whereas the mmap netlink
      skb has it pointed to netlink_skb_destructor(), set in the handler
      netlink_ring_setup_skb(). There, skb->head is being set to NULL, so
      that in such cases, __kfree_skb() doesn't perform a skb_release_data()
      via skb_release_all(), where skb->head is possibly being freed through
      kfree(head) into slab allocator, although netlink mmap skb->head points
      to the mmap buffer. Similarly, the same has to be done also for large
      netlink skbs where the data area is vmalloced. Therefore, as discussed,
      make a copy for these rather rare cases for now. This fixes the issue
      on my and Ken-ichirou's test-cases.
      
      Reference: http://thread.gmane.org/gmane.linux.network/371129
      Fixes: bcbde0d4 ("net: netlink: virtual tap device management")
      Reported-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • net/ipv6: Correct PIM6 mrt_lock handling · 3ad45f92
      Richard Laing authored
      [ Upstream commit 25b4a44c ]
      
      In the IPv6 multicast routing code, the mrt_lock was not being released
      correctly in the MFC iterator. As a result, adding or deleting a MIF would
      cause a hang because the mrt_lock could not be acquired.
      
      This fix is a copy of the code for the IPv4 case and ensures that the lock
      is released correctly.
      
      Signed-off-by: Richard Laing <richard.laing@alliedtelesis.co.nz>
      Acked-by: Cong Wang <cwang@twopensource.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • ipv6: fix exthdrs offload registration in out_rt path · 833db3b8
      Daniel Borkmann authored
      [ Upstream commit e41b0bed ]
      
      We previously registered the IPPROTO_ROUTING offload with
      inet6_add_offload(), but in the error path we try to unregister it with
      inet_del_offload(). This doesn't seem correct; it should actually be
      inet6_del_offload(). ipv6_exthdrs_offload_exit() from that commit also
      seems rather incorrect (it uses rthdr_offload twice), but it got removed
      entirely later on.
      
      Fixes: 3336288a ("ipv6: Switch to using new offload infrastructure.")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
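      The kind of mismatch this commit fixes can be modelled in userspace with
      two separate registration tables. This is only a sketch: the tables,
      add_entry(), del_entry() and exthdrs_init_sketch() are hypothetical
      stand-ins, and only the protocol numbers 43 (IPPROTO_ROUTING) and 60
      (IPPROTO_DSTOPTS) are real values. The error path must undo the
      registration in the same table that the add used.

```c
#include <assert.h>
#include <stddef.h>

#define NPROTO_SKETCH 256

/* Hypothetical stand-ins for the separate IPv4 and IPv6 offload tables. */
const void *v4_table_sketch[NPROTO_SKETCH];
const void *v6_table_sketch[NPROTO_SKETCH];

static const int rthdr_ops_sketch = 1;
static const int dstopt_ops_sketch = 2;

static int add_entry(const void **table, const void *ops, unsigned int proto)
{
    assert(table != NULL && ops != NULL);
    if (proto >= NPROTO_SKETCH || table[proto] != NULL)
        return -1;
    table[proto] = ops;
    return 0;
}

/* Fails unless ops is the entry currently registered in this table. */
static int del_entry(const void **table, const void *ops, unsigned int proto)
{
    if (proto >= NPROTO_SKETCH || table[proto] != ops)
        return -1;
    table[proto] = NULL;
    return 0;
}

/* fail_second simulates a failure while registering the second offload. */
static int exthdrs_init_sketch(int fail_second)
{
    int ret;

    ret = add_entry(v6_table_sketch, &rthdr_ops_sketch, 43);
    if (ret)
        goto out;

    ret = fail_second ? -1
                      : add_entry(v6_table_sketch, &dstopt_ops_sketch, 60);
    if (ret)
        goto out_rt;
out:
    return ret;
out_rt:
    /* Unwind through the same (IPv6) table the entry was added to. */
    del_entry(v6_table_sketch, &rthdr_ops_sketch, 43);
    goto out;
}
```

      Calling del_entry() on the IPv4 table instead would fail and leave the
      IPv6 entry registered after a failed init.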
    • usbnet: Get EVENT_NO_RUNTIME_PM bit before it is cleared · 965360de
      Eugene Shatokhin authored
      [ Upstream commit f50791ac ]
      
      The EVENT_NO_RUNTIME_PM bit of dev->flags needs to be checked in
      usbnet_stop(), but its value should be read before it is cleared
      when dev->flags is set to 0.
      
      The problem was spotted and the fix was provided by
      Oliver Neukum <oneukum@suse.de>.
      
      Signed-off-by: Eugene Shatokhin <eugene.shatokhin@rosalab.ru>
      Acked-by: Oliver Neukum <oneukum@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
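      The ordering issue in this commit message comes down to sampling a flag
      bit before the whole flags word is reset. A minimal userspace sketch,
      assuming a hypothetical dev_sketch structure and an arbitrary bit
      number (the real driver code differs):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EVENT_NO_RUNTIME_PM_SKETCH 21 /* hypothetical bit number */

/* Hypothetical device state: a flags word and a counter of PM "put" calls. */
struct dev_sketch {
    unsigned long flags;
    int pm_puts;
};

static void stop_sketch(struct dev_sketch *dev)
{
    bool no_pm;

    assert(dev != NULL);

    /* Sample the bit first; after the reset below it would always read 0. */
    no_pm = (dev->flags >> EVENT_NO_RUNTIME_PM_SKETCH) & 1UL;

    dev->flags = 0;

    /* Act on the saved value, not on the already cleared flags word. */
    if (!no_pm)
        dev->pm_puts++;
}
```

      If the bit were tested after the reset, the "put" would happen
      unconditionally, regardless of how the flag had been set.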
    • ip6_gre: release cached dst on tunnel removal · adda5e35
      huaibin Wang authored
      [ Upstream commit d4257295 ]
      
      When a tunnel is deleted, the cached dst entry should be released.
      
      This problem may prevent the removal of a netns (seen with a x-netns IPv6
      gre tunnel):
        unregister_netdevice: waiting for lo to become free. Usage count = 3
      
      CC: Dmitry Kozlov <xeb@mail.ru>
      Fixes: c12b395a ("gre: Support GRE over IPv6")
      Signed-off-by: huaibin Wang <huaibin.wang@6wind.com>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
    • rtnetlink: verify IFLA_VF_INFO attributes before passing them to driver · 2fb9a494
      Daniel Borkmann authored
      [ Upstream commit 4f7d2cdf ]
      
      Jason Gunthorpe reported that since commit c02db8c6 ("rtnetlink: make
      SR-IOV VF interface symmetric"), we don't verify IFLA_VF_INFO attributes
      anymore with respect to their policy, that is, ifla_vfinfo_policy[].
      
      Before, they were part of ifla_policy[], but they have been nested since
      being placed under IFLA_VFINFO_LIST, which contains the attribute
      IFLA_VF_INFO, itself another nested attribute for the actual VF attributes
      such as IFLA_VF_MAC, IFLA_VF_VLAN, etc.
      
      Despite the policy being split out from ifla_policy[] in this commit,
      it's never applied anywhere. nla_for_each_nested() only does basic nla_ok()
      testing for struct nlattr, but it doesn't know about the data context and
      their requirements.
      
      The fix, on top of Jason's initial work, 1) parses the attributes
      with the right policy, and 2) uses the resulting parsed attribute table
      from 1) instead of the nla_for_each_nested() loop (just like we used to
      do when they were still part of ifla_policy[]).
      
      Reference: http://thread.gmane.org/gmane.linux.network/368913
      Fixes: c02db8c6 ("rtnetlink: make SR-IOV VF interface symmetric")
      Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com>
      Cc: Greg Rose <gregory.v.rose@intel.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Rony Efraim <ronye@mellanox.com>
      Cc: Vlad Zolotarov <vladz@cloudius-systems.com>
      Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Vlad Zolotarov <vladz@cloudius-systems.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
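      The idea of parsing nested attributes against a policy, rather than
      merely walking them, can be sketched in userspace as follows. This is
      not the netlink API: attr_sketch, policy_min_len[] and
      parse_nested_sketch() are hypothetical, and the minimum lengths are
      illustrative values chosen for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum {
    ATTR_UNSPEC_SKETCH,
    ATTR_MAC_SKETCH,
    ATTR_VLAN_SKETCH,
    ATTR_MAX_SKETCH = ATTR_VLAN_SKETCH
};

/* Hypothetical decoded attribute: a type, a payload length, the payload. */
struct attr_sketch {
    int type;
    size_t len;
    const void *data;
};

/* Hypothetical policy: minimum payload length per attribute type. */
static const size_t policy_min_len[ATTR_MAX_SKETCH + 1] = {
    [ATTR_MAC_SKETCH] = 36,  /* illustrative value */
    [ATTR_VLAN_SKETCH] = 12, /* illustrative value */
};

/*
 * Fill tb[], indexed by attribute type, and reject any known attribute
 * whose payload is shorter than the policy demands. Unknown types are
 * skipped. The caller then reads attributes from tb[] only.
 */
static int parse_nested_sketch(const struct attr_sketch *attrs, size_t n,
                               const struct attr_sketch **tb)
{
    assert(tb != NULL);
    for (int t = 0; t <= ATTR_MAX_SKETCH; t++)
        tb[t] = NULL;

    for (size_t i = 0; i < n; i++) {
        int t = attrs[i].type;

        if (t <= 0 || t > ATTR_MAX_SKETCH)
            continue;
        if (attrs[i].len < policy_min_len[t])
            return -EINVAL;
        tb[t] = &attrs[i];
    }
    return 0;
}
```

      A consumer that only looks at tb[] after a successful parse never sees
      a payload shorter than the structure it is about to read.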
    • if_link: Add an additional parameter to ifla_vf_info for RSS querying · e8d18053
      Vlad Zolotarov authored
      [ Upstream commit 01a3d796 ]
      
      Add a configuration setting for drivers to allow or block querying of
      the RSS Redirection Table and Hash Key for discrete VFs.
      
      On some devices, VFs share the above-mentioned information with the PF,
      and querying it may introduce a theoretical security risk. We want to let
      the system administrator decide whether to take this risk or not.
      
      Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
      Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
  3. 07 Oct, 2015 8 commits