1. 28 Jul, 2018 27 commits
    • Anssi Hannula's avatar
      can: xilinx_can: fix recovery from error states not being propagated · c98f5772
      Anssi Hannula authored
      commit 877e0b75 upstream.
      
      The xilinx_can driver contains no mechanism for propagating recovery
      from CAN_STATE_ERROR_WARNING and CAN_STATE_ERROR_PASSIVE.
      
      Add such a mechanism by factoring the handling of
      XCAN_STATE_ERROR_PASSIVE and XCAN_STATE_ERROR_WARNING out of
      xcan_err_interrupt and checking for recovery after RX and TX if the
      interface is in one of those states.
      
      Tested with the integrated CAN on Zynq-7000 SoC.
      
      Fixes: b1201e44 ("can: xilinx CAN controller support")
      Signed-off-by: default avatarAnssi Hannula <anssi.hannula@bitwise.fi>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarMarc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c98f5772
    • Anssi Hannula's avatar
      can: xilinx_can: fix power management handling · 1fadfbd9
      Anssi Hannula authored
      commit 8ebd83bd upstream.
      
      There are several issues with the suspend/resume handling code of the
      driver:
      
      - The device is attached and detached in the runtime_suspend() and
        runtime_resume() callbacks if the interface is running. However,
        during xcan_chip_start() the interface is considered running,
        causing the resume handler to incorrectly call netif_start_queue()
        at the beginning of xcan_chip_start(), and on xcan_chip_start() error
        return the suspend handler detaches the device leaving the user
        unable to bring-up the device anymore.
      
      - The device is not brought properly up on system resume. A reset is
        done and the code tries to determine the bus state after that.
        However, after reset the device is always in Configuration mode
        (down), so the state checking code does not make sense and
        communication will also not work.
      
      - The suspend callback tries to set the device to sleep mode (low-power
        mode which monitors the bus and brings the device back to normal mode
        on activity), but then immediately disables the clocks (possibly
        before the device reaches the sleep mode), which does not make sense
        to me. If a clean shutdown is wanted before disabling clocks, we can
        just bring it down completely instead of only sleep mode.
      
      Reorganize the PM code so that only the clock logic remains in the
      runtime PM callbacks and the system PM callbacks contain the device
      bring-up/down logic. This makes calling the runtime PM callbacks during
      e.g. xcan_chip_start() safe.
      
      The system PM callbacks now simply call common code to start/stop the
      HW if the interface was running, replacing the broken code from before.
      
      xcan_chip_stop() is updated to use the common reset code so that it will
      wait for the reset to complete. Reset also disables all interrupts so do
      not do that separately.
      
      Also, the device_may_wakeup() checks are removed as the driver does not
      have wakeup support.
      
      Tested on Zynq-7000 integrated CAN.
      Signed-off-by: default avatarAnssi Hannula <anssi.hannula@bitwise.fi>
      Cc: Michal Simek <michal.simek@xilinx.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarMarc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1fadfbd9
    • Anssi Hannula's avatar
      can: xilinx_can: fix RX loop if RXNEMP is asserted without RXOK · de2219a8
      Anssi Hannula authored
      commit 32852c56 upstream.
      
      If the device gets into a state where RXNEMP (RX FIFO not empty)
      interrupt is asserted without RXOK (new frame received successfully)
      interrupt being asserted, xcan_rx_poll() will continue to try to clear
      RXNEMP without actually reading frames from RX FIFO. If the RX FIFO is
      not empty, the interrupt will not be cleared and napi_schedule() will
      just be called again.
      
      This situation can occur when:
      
      (a) xcan_rx() returns without reading RX FIFO due to an error condition.
      The code tries to clear both RXOK and RXNEMP but RXNEMP will not clear
      due to a frame still being in the FIFO. The frame will never be read
      from the FIFO as RXOK is no longer set.
      
      (b) A frame is received between xcan_rx_poll() reading interrupt status
      and clearing RXOK. RXOK will be cleared, but RXNEMP will again remain
      set as the new message is still in the FIFO.
      
      I'm able to trigger case (b) by flooding the bus with frames under load.
      
      There does not seem to be any benefit in using both RXNEMP and RXOK in
      the way the driver does, and the polling example in the reference manual
      (UG585 v1.10 18.3.7 Read Messages from RxFIFO) also says that either
      RXOK or RXNEMP can be used for detecting incoming messages.
      
      Fix the issue and simplify the RX processing by only using RXNEMP
      without RXOK.
      
      Tested with the integrated CAN on Zynq-7000 SoC.
      
      Fixes: b1201e44 ("can: xilinx CAN controller support")
      Signed-off-by: default avatarAnssi Hannula <anssi.hannula@bitwise.fi>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarMarc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      de2219a8
    • Rafael J. Wysocki's avatar
      driver core: Partially revert "driver core: correct device's shutdown order" · bf0070e2
      Rafael J. Wysocki authored
      commit 722e5f2b upstream.
      
      Commit 52cdbdd4 (driver core: correct device's shutdown order)
      introduced a regression by breaking device shutdown on some systems.
      
      Namely, the devices_kset_move_last() call in really_probe() added by
      that commit is a mistake as it may cause parents to follow children
      in the devices_kset list which then causes shutdown to fail.  For
      example, if a device has children before really_probe() is called
      for it (which is not uncommon), that call will cause it to be
      reordered after the children in the devices_kset list and the
      ordering of that list will not reflect the correct device shutdown
      order any more.
      
      Also it causes the devices_kset list to be constantly reordered
      until all drivers have been probed which is totally pointless
      overhead in the majority of cases and it only covered an issue
      with system shutdown, while system-wide suspend/resume potentially
      had the same issue on the affected platforms (which was not covered).
      
      Moreover, the shutdown issue originally addressed by the change in
      really_probe() made by commit 52cdbdd4 is not present in 4.18-rc
      any more, since dra7 started to use the sdhci-omap driver which
      doesn't disable any regulators during shutdown, so the really_probe()
      part of commit 52cdbdd4 can be safely reverted.  [The original
      issue was related to the omap_hsmmc driver used by dra7 previously.]
      
      For the above reasons, revert the really_probe() modifications made
      by commit 52cdbdd4.
      
      The other code changes made by commit 52cdbdd4 are useful and
      they need not be reverted.
      
      Fixes: 52cdbdd4 (driver core: correct device's shutdown order)
      Link: https://lore.kernel.org/lkml/CAFgQCTt7VfqM=UyCnvNFxrSw8Z6cUtAi3HUwR4_xPAc03SgHjQ@mail.gmail.com/Reported-by: default avatarPingfan Liu <kernelfans@gmail.com>
      Tested-by: default avatarPingfan Liu <kernelfans@gmail.com>
      Reviewed-by: default avatarKishon Vijay Abraham I <kishon@ti.com>
      Signed-off-by: default avatarRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bf0070e2
    • Jerry Zhang's avatar
      usb: gadget: f_fs: Only return delayed status when len is 0 · 9e10043b
      Jerry Zhang authored
      commit 4d644abf upstream.
      
      Commit 1b9ba000 ("Allow function drivers to pause control
      transfers") states that USB_GADGET_DELAYED_STATUS is only
      supported if data phase is 0 bytes.
      
      It seems that when the length is not 0 bytes, there is no
      need to explicitly delay the data stage since the transfer
      is not completed until the user responds. However, when the
      length is 0, there is no data stage and the transfer is
      finished once setup() returns, hence there is a need to
      explicitly delay completion.
      
      This manifests as the following bugs:
      
      Prior to 946ef68a ('Let setup() return
      USB_GADGET_DELAYED_STATUS'), when setup is 0 bytes, ffs
      would require user to queue a 0 byte request in order to
      clear setup state. However, that 0 byte request was actually
      not needed and would hang and cause errors in other setup
      requests.
      
      After the above commit, 0 byte setups work since the gadget
      now accepts empty queues to ep0 to clear the delay, but all
      other setups hang.
      
      Fixes: 946ef68a ("Let setup() return USB_GADGET_DELAYED_STATUS")
      Signed-off-by: default avatarJerry Zhang <zhangjerry@google.com>
      Cc: stable <stable@vger.kernel.org>
      Acked-by: default avatarFelipe Balbi <felipe.balbi@linux.intel.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9e10043b
    • Bin Liu's avatar
      usb: core: handle hub C_PORT_OVER_CURRENT condition · e2996cf5
      Bin Liu authored
      commit 249a32b7 upstream.
      
      Based on USB2.0 Spec Section 11.12.5,
      
        "If a hub has per-port power switching and per-port current limiting,
        an over-current on one port may still cause the power on another port
        to fall below specific minimums. In this case, the affected port is
        placed in the Power-Off state and C_PORT_OVER_CURRENT is set for the
        port, but PORT_OVER_CURRENT is not set."
      
      so let's check C_PORT_OVER_CURRENT too for over current condition.
      
      Fixes: 08d1dec6 ("usb:hub set hub->change_bits when over-current happens")
      Cc: <stable@vger.kernel.org>
      Tested-by: default avatarAlessandro Antenucci <antenucci@korg.it>
      Signed-off-by: default avatarBin Liu <b-liu@ti.com>
      Acked-by: default avatarAlan Stern <stern@rowland.harvard.edu>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e2996cf5
    • Lubomir Rintel's avatar
      usb: cdc_acm: Add quirk for Castles VEGA3000 · b0bd06a4
      Lubomir Rintel authored
      commit 1445cbe4 upstream.
      
      The device (a POS terminal) implements CDC ACM, but has not union
      descriptor.
      Signed-off-by: default avatarLubomir Rintel <lkundrak@v3.sk>
      Acked-by: default avatarOliver Neukum <oneukum@suse.com>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b0bd06a4
    • Eric Dumazet's avatar
      tcp: call tcp_drop() from tcp_data_queue_ofo() · 94623c74
      Eric Dumazet authored
      [ Upstream commit 8541b21e ]
      
      In order to be able to give better diagnostics and detect
      malicious traffic, we need to have better sk->sk_drops tracking.
      
      Fixes: 9f5afeae ("tcp: use an RB tree for ooo receive queue")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      94623c74
    • Eric Dumazet's avatar
      tcp: detect malicious patterns in tcp_collapse_ofo_queue() · a8786814
      Eric Dumazet authored
      [ Upstream commit 3d4bf93a ]
      
      In case an attacker feeds tiny packets completely out of order,
      tcp_collapse_ofo_queue() might scan the whole rb-tree, performing
      expensive copies, but not changing socket memory usage at all.
      
      1) Do not attempt to collapse tiny skbs.
      2) Add logic to exit early when too many tiny skbs are detected.
      
      We prefer not doing aggressive collapsing (which copies packets)
      for pathological flows, and revert to tcp_prune_ofo_queue() which
      will be less expensive.
      
      In the future, we might add the possibility of terminating flows
      that are proven to be malicious.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a8786814
    • Eric Dumazet's avatar
      tcp: avoid collapses in tcp_prune_queue() if possible · fdf258ed
      Eric Dumazet authored
      [ Upstream commit f4a3313d ]
      
      Right after a TCP flow is created, receiving tiny out of order
      packets allways hit the condition :
      
      if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
      	tcp_clamp_window(sk);
      
      tcp_clamp_window() increases sk_rcvbuf to match sk_rmem_alloc
      (guarded by tcp_rmem[2])
      
      Calling tcp_collapse_ofo_queue() in this case is not useful,
      and offers a O(N^2) surface attack to malicious peers.
      
      Better not attempt anything before full queue capacity is reached,
      forcing attacker to spend lots of resource and allow us to more
      easily detect the abuse.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fdf258ed
    • Eric Dumazet's avatar
      tcp: free batches of packets in tcp_prune_ofo_queue() · 2d08921c
      Eric Dumazet authored
      [ Upstream commit 72cd43ba ]
      
      Juha-Matti Tilli reported that malicious peers could inject tiny
      packets in out_of_order_queue, forcing very expensive calls
      to tcp_collapse_ofo_queue() and tcp_prune_ofo_queue() for
      every incoming packet. out_of_order_queue rb-tree can contain
      thousands of nodes, iterating over all of them is not nice.
      
      Before linux-4.9, we would have pruned all packets in ofo_queue
      in one go, every XXXX packets. XXXX depends on sk_rcvbuf and skbs
      truesize, but is about 7000 packets with tcp_rmem[2] default of 6 MB.
      
      Since we plan to increase tcp_rmem[2] in the future to cope with
      modern BDP, can not revert to the old behavior, without great pain.
      
      Strategy taken in this patch is to purge ~12.5 % of the queue capacity.
      
      Fixes: 36a6503f ("tcp: refine tcp_prune_ofo_queue() to not drop all packets")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarJuha-Matti Tilli <juha-matti.tilli@iki.fi>
      Acked-by: default avatarYuchung Cheng <ycheng@google.com>
      Acked-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2d08921c
    • Yuchung Cheng's avatar
      tcp: do not delay ACK in DCTCP upon CE status change · 8736711f
      Yuchung Cheng authored
      [ Upstream commit a0496ef2 ]
      
      Per DCTCP RFC8257 (Section 3.2) the ACK reflecting the CE status change
      has to be sent immediately so the sender can respond quickly:
      
      """ When receiving packets, the CE codepoint MUST be processed as follows:
      
         1.  If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to
             true and send an immediate ACK.
      
         2.  If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE
             to false and send an immediate ACK.
      """
      
      Previously DCTCP implementation may continue to delay the ACK. This
      patch fixes that to implement the RFC by forcing an immediate ACK.
      
      Tested with this packetdrill script provided by Larry Brakmo
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      0.110 < [ect0] . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
         +0 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0
      
      0.200 < [ect0] . 1:1001(1000) ack 1 win 257
      0.200 > [ect01] . 1:1(0) ack 1001
      
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 1:2(1) ack 1001
      
      0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
      +0.005 < [ce] . 2001:3001(1000) ack 2 win 257
      
      +0.000 > [ect01] . 2:2(0) ack 2001
      // Previously the ACK below would be delayed by 40ms
      +0.000 > [ect01] E. 2:2(0) ack 3001
      
      +0.500 < F. 9501:9501(0) ack 4 win 257
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8736711f
    • Yuchung Cheng's avatar
      tcp: do not cancel delay-AcK on DCTCP special ACK · 57ec8824
      Yuchung Cheng authored
      [ Upstream commit 27cde44a ]
      
      Currently when a DCTCP receiver delays an ACK and receive a
      data packet with a different CE mark from the previous one's, it
      sends two immediate ACKs acking previous and latest sequences
      respectly (for ECN accounting).
      
      Previously sending the first ACK may mark off the delayed ACK timer
      (tcp_event_ack_sent). This may subsequently prevent sending the
      second ACK to acknowledge the latest sequence (tcp_ack_snd_check).
      The culprit is that tcp_send_ack() assumes it always acknowleges
      the latest sequence, which is not true for the first special ACK.
      
      The fix is to not make the assumption in tcp_send_ack and check the
      actual ack sequence before cancelling the delayed ACK. Further it's
      safer to pass the ack sequence number as a local variable into
      tcp_send_ack routine, instead of intercepting tp->rcv_nxt to avoid
      future bugs like this.
      Reported-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      57ec8824
    • Yuchung Cheng's avatar
      tcp: helpers to send special DCTCP ack · 1fcccc57
      Yuchung Cheng authored
      [ Upstream commit 2987babb ]
      
      Refactor and create helpers to send the special ACK in DCTCP.
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1fcccc57
    • Yuchung Cheng's avatar
      tcp: fix dctcp delayed ACK schedule · 84177801
      Yuchung Cheng authored
      [ Upstream commit b0c05d0e ]
      
      Previously, when a data segment was sent an ACK was piggybacked
      on the data segment without generating a CA_EVENT_NON_DELAYED_ACK
      event to notify congestion control modules. So the DCTCP
      ca->delayed_ack_reserved flag could incorrectly stay set when
      in fact there were no delayed ACKs being reserved. This could result
      in sending a special ECN notification ACK that carries an older
      ACK sequence, when in fact there was no need for such an ACK.
      DCTCP keeps track of the delayed ACK status with its own separate
      state ca->delayed_ack_reserved. Previously it may accidentally cancel
      the delayed ACK without updating this field upon sending a special
      ACK that carries a older ACK sequence. This inconsistency would
      lead to DCTCP receiver never acknowledging the latest data until the
      sender times out and retry in some cases.
      
      Packetdrill script (provided by Larry Brakmo)
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      0.110 < [ect0] . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      
      0.200 < [ect0] . 1:1001(1000) ack 1 win 257
      0.200 > [ect01] . 1:1(0) ack 1001
      
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 1:2(1) ack 1001
      
      0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 2:3(1) ack 2001
      
      0.200 < [ect0] . 2001:3001(1000) ack 3 win 257
      0.200 < [ect0] . 3001:4001(1000) ack 3 win 257
      0.200 > [ect01] . 3:3(0) ack 4001
      
      0.210 < [ce] P. 4001:4501(500) ack 3 win 257
      
      +0.001 read(4, ..., 4500) = 4500
      +0 write(4, ..., 1) = 1
      +0 > [ect01] PE. 3:4(1) ack 4501
      
      +0.010 < [ect0] W. 4501:5501(1000) ack 4 win 257
      // Previously the ACK sequence below would be 4501, causing a long RTO
      +0.040~+0.045 > [ect01] . 4:4(0) ack 5501   // delayed ack
      
      +0.311 < [ect0] . 5501:6501(1000) ack 4 win 257  // More data
      +0 > [ect01] . 4:4(0) ack 6501     // now acks everything
      
      +0.500 < F. 9501:9501(0) ack 4 win 257
      Reported-by: default avatarLarry Brakmo <brakmo@fb.com>
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Acked-by: default avatarLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      84177801
    • Roopa Prabhu's avatar
      rtnetlink: add rtnl_link_state check in rtnl_configure_link · 19b74799
      Roopa Prabhu authored
      [ Upstream commit 5025f7f7 ]
      
      rtnl_configure_link sets dev->rtnl_link_state to
      RTNL_LINK_INITIALIZED and unconditionally calls
      __dev_notify_flags to notify user-space of dev flags.
      
      current call sequence for rtnl_configure_link
      rtnetlink_newlink
          rtnl_link_ops->newlink
          rtnl_configure_link (unconditionally notifies userspace of
                               default and new dev flags)
      
      If a newlink handler wants to call rtnl_configure_link
      early, we will end up with duplicate notifications to
      user-space.
      
      This patch fixes rtnl_configure_link to check rtnl_link_state
      and call __dev_notify_flags with gchanges = 0 if already
      RTNL_LINK_INITIALIZED.
      
      Later in the series, this patch will help the following sequence
      where a driver implementing newlink can call rtnl_configure_link
      to initialize the link early.
      
      makes the following call sequence work:
      rtnetlink_newlink
          rtnl_link_ops->newlink (vxlan) -> rtnl_configure_link (initializes
                                                      link and notifies
                                                      user-space of default
                                                      dev flags)
          rtnl_configure_link (updates dev flags if requested by user ifm
                               and notifies user-space of new dev flags)
      Signed-off-by: default avatarRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      19b74799
    • Heiner Kallweit's avatar
      net: phy: consider PHY_IGNORE_INTERRUPT in phy_start_aneg_priv · c6ac36be
      Heiner Kallweit authored
      [ Upstream commit 215d08a8 ]
      
      The situation described in the comment can occur also with
      PHY_IGNORE_INTERRUPT, therefore change the condition to include it.
      
      Fixes: f555f34f ("net: phy: fix auto-negotiation stall due to unavailable interrupt")
      Signed-off-by: default avatarHeiner Kallweit <hkallweit1@gmail.com>
      Reviewed-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c6ac36be
    • Hangbin Liu's avatar
      multicast: do not restore deleted record source filter mode to new one · cc403d5d
      Hangbin Liu authored
      There are two scenarios that we will restore deleted records. The first is
      when device down and up(or unmap/remap). In this scenario the new filter
      mode is same with previous one. Because we get it from in_dev->mc_list and
      we do not touch it during device down and up.
      
      The other scenario is when a new socket join a group which was just delete
      and not finish sending status reports. In this scenario, we should use the
      current filter mode instead of restore old one. Here are 4 cases in total.
      
      old_socket        new_socket       before_fix       after_fix
        IN(A)             IN(A)           ALLOW(A)         ALLOW(A)
        IN(A)             EX( )           TO_IN( )         TO_EX( )
        EX( )             IN(A)           TO_EX( )         ALLOW(A)
        EX( )             EX( )           TO_EX( )         TO_EX( )
      
      Fixes: 24803f38 (igmp: do not remove igmp souce list info when set link down)
      Fixes: 1666d49e (mld: do not remove mld souce list info when set link down)
      Signed-off-by: default avatarHangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cc403d5d
    • Eran Ben Elisha's avatar
      net/mlx5e: Fix quota counting in aRFS expire flow · b7e37add
      Eran Ben Elisha authored
      [ Upstream commit 2630bae8 ]
      
      Quota should follow the amount of rules which do expire, and not the
      number of rules that were examined, fixed that.
      
      Fixes: 18c908e4 ("net/mlx5e: Add accelerated RFS support")
      Signed-off-by: default avatarEran Ben Elisha <eranbe@mellanox.com>
      Reviewed-by: default avatarMaor Gottlieb <maorg@mellanox.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b7e37add
    • Eran Ben Elisha's avatar
      net/mlx5e: Don't allow aRFS for encapsulated packets · d9d58012
      Eran Ben Elisha authored
      [ Upstream commit d2e1c57b ]
      
      Driver is yet to support aRFS for encapsulated packets, return early
      error in such case.
      
      Fixes: 18c908e4 ("net/mlx5e: Add accelerated RFS support")
      Signed-off-by: default avatarEran Ben Elisha <eranbe@mellanox.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d9d58012
    • Ariel Levkovich's avatar
      net/mlx5: Adjust clock overflow work period · adcecd4a
      Ariel Levkovich authored
      [ Upstream commit 33180bee ]
      
      When driver converts HW timestamp to wall clock time it subtracts
      the last saved cycle counter from the HW timestamp and converts the
      difference to nanoseconds.
      The conversion is done by multiplying the cycles difference with the
      clock multiplier value as a first step and therefore the cycles
      difference should be small enough so that the multiplication product
      doesn't exceed 64bit.
      
      The overflow handling routine is in charge of updating the last saved
      cycle counter in driver and it is called periodically using kernel
      delayed workqueue.
      
      The delay period for this work is calculated using the max HW cycle
      counter value (a 41 bit mask) as a base which doesn't take the 64bit
      limit into account so the delay period may be incorrect and too
      long to prevent a large difference between the HW counter and the last
      saved counter in SW.
      
      This change adjusts the work period for the HW clock overflow work by
      taking the minimum between the previous value and the quotient of max
      u64 value and the clock multiplier value.
      
      Fixes: ef9814de ("net/mlx5e: Add HW timestamping (TS) support")
      Signed-off-by: default avatarAriel Levkovich <lariel@mellanox.com>
      Reviewed-by: default avatarEran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      adcecd4a
    • Eric Dumazet's avatar
      net: skb_segment() should not return NULL · e2ffdd64
      Eric Dumazet authored
      [ Upstream commit ff907a11 ]
      
      syzbot caught a NULL deref [1], caused by skb_segment()
      
      skb_segment() has many "goto err;" that assume the @err variable
      contains -ENOMEM.
      
      A successful call to __skb_linearize() should not clear @err,
      otherwise a subsequent memory allocation error could return NULL.
      
      While we are at it, we might use -EINVAL instead of -ENOMEM when
      MAX_SKB_FRAGS limit is reached.
      
      [1]
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      CPU: 0 PID: 13285 Comm: syz-executor3 Not tainted 4.18.0-rc4+ #146
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:tcp_gso_segment+0x3dc/0x1780 net/ipv4/tcp_offload.c:106
      Code: f0 ff ff 0f 87 1c fd ff ff e8 00 88 0b fb 48 8b 75 d0 48 b9 00 00 00 00 00 fc ff df 48 8d be 90 00 00 00 48 89 f8 48 c1 e8 03 <0f> b6 14 08 48 8d 86 94 00 00 00 48 89 c6 83 e0 07 48 c1 ee 03 0f
      RSP: 0018:ffff88019b7fd060 EFLAGS: 00010206
      RAX: 0000000000000012 RBX: 0000000000000020 RCX: dffffc0000000000
      RDX: 0000000000040000 RSI: 0000000000000000 RDI: 0000000000000090
      RBP: ffff88019b7fd0f0 R08: ffff88019510e0c0 R09: ffffed003b5c46d6
      R10: ffffed003b5c46d6 R11: ffff8801dae236b3 R12: 0000000000000001
      R13: ffff8801d6c581f4 R14: 0000000000000000 R15: ffff8801d6c58128
      FS:  00007fcae64d6700(0000) GS:ffff8801dae00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000004e8664 CR3: 00000001b669b000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       tcp4_gso_segment+0x1c3/0x440 net/ipv4/tcp_offload.c:54
       inet_gso_segment+0x64e/0x12d0 net/ipv4/af_inet.c:1342
       inet_gso_segment+0x64e/0x12d0 net/ipv4/af_inet.c:1342
       skb_mac_gso_segment+0x3b5/0x740 net/core/dev.c:2792
       __skb_gso_segment+0x3c3/0x880 net/core/dev.c:2865
       skb_gso_segment include/linux/netdevice.h:4099 [inline]
       validate_xmit_skb+0x640/0xf30 net/core/dev.c:3104
       __dev_queue_xmit+0xc14/0x3910 net/core/dev.c:3561
       dev_queue_xmit+0x17/0x20 net/core/dev.c:3602
       neigh_hh_output include/net/neighbour.h:473 [inline]
       neigh_output include/net/neighbour.h:481 [inline]
       ip_finish_output2+0x1063/0x1860 net/ipv4/ip_output.c:229
       ip_finish_output+0x841/0xfa0 net/ipv4/ip_output.c:317
       NF_HOOK_COND include/linux/netfilter.h:276 [inline]
       ip_output+0x223/0x880 net/ipv4/ip_output.c:405
       dst_output include/net/dst.h:444 [inline]
       ip_local_out+0xc5/0x1b0 net/ipv4/ip_output.c:124
       iptunnel_xmit+0x567/0x850 net/ipv4/ip_tunnel_core.c:91
       ip_tunnel_xmit+0x1598/0x3af1 net/ipv4/ip_tunnel.c:778
       ipip_tunnel_xmit+0x264/0x2c0 net/ipv4/ipip.c:308
       __netdev_start_xmit include/linux/netdevice.h:4148 [inline]
       netdev_start_xmit include/linux/netdevice.h:4157 [inline]
       xmit_one net/core/dev.c:3034 [inline]
       dev_hard_start_xmit+0x26c/0xc30 net/core/dev.c:3050
       __dev_queue_xmit+0x29ef/0x3910 net/core/dev.c:3569
       dev_queue_xmit+0x17/0x20 net/core/dev.c:3602
       neigh_direct_output+0x15/0x20 net/core/neighbour.c:1403
       neigh_output include/net/neighbour.h:483 [inline]
       ip_finish_output2+0xa67/0x1860 net/ipv4/ip_output.c:229
       ip_finish_output+0x841/0xfa0 net/ipv4/ip_output.c:317
       NF_HOOK_COND include/linux/netfilter.h:276 [inline]
       ip_output+0x223/0x880 net/ipv4/ip_output.c:405
       dst_output include/net/dst.h:444 [inline]
       ip_local_out+0xc5/0x1b0 net/ipv4/ip_output.c:124
       ip_queue_xmit+0x9df/0x1f80 net/ipv4/ip_output.c:504
       tcp_transmit_skb+0x1bf9/0x3f10 net/ipv4/tcp_output.c:1168
       tcp_write_xmit+0x1641/0x5c20 net/ipv4/tcp_output.c:2363
       __tcp_push_pending_frames+0xb2/0x290 net/ipv4/tcp_output.c:2536
       tcp_push+0x638/0x8c0 net/ipv4/tcp.c:735
       tcp_sendmsg_locked+0x2ec5/0x3f00 net/ipv4/tcp.c:1410
       tcp_sendmsg+0x2f/0x50 net/ipv4/tcp.c:1447
       inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
       sock_sendmsg_nosec net/socket.c:641 [inline]
       sock_sendmsg+0xd5/0x120 net/socket.c:651
       __sys_sendto+0x3d7/0x670 net/socket.c:1797
       __do_sys_sendto net/socket.c:1809 [inline]
       __se_sys_sendto net/socket.c:1805 [inline]
       __x64_sys_sendto+0xe1/0x1a0 net/socket.c:1805
       do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x455ab9
      Code: 1d ba fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb b9 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fcae64d5c68 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 00007fcae64d66d4 RCX: 0000000000455ab9
      RDX: 0000000000000001 RSI: 0000000020000200 RDI: 0000000000000013
      RBP: 000000000072bea0 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000014
      R13: 00000000004c1145 R14: 00000000004d1818 R15: 0000000000000006
      Modules linked in:
      Dumping ftrace buffer:
         (ftrace buffer empty)
      
      Fixes: ddff00d4 ("net: Move skb_has_shared_frag check out of GRE code and into segmentation")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Acked-by: default avatarAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e2ffdd64
    • Jack Morgenstein's avatar
      net/mlx4_core: Save the qpn from the input modifier in RST2INIT wrapper · 444987d5
      Jack Morgenstein authored
      [ Upstream commit 958c696f ]
      
      Function mlx4_RST2INIT_QP_wrapper saved the qp number passed in the qp
      context, rather than the one passed in the input modifier.
      
      However, the qp number in the qp context is not defined as a
      required parameter by the FW. Therefore, drivers may choose to not
      specify the qp number in the qp context for the reset-to-init transition.
      
      Thus, we must save the qp number passed in the command input modifier --
      which is always present. (This saved qp number is used as the input
      modifier for command 2RST_QP when a slave's qp's are destroyed).
      
      Fixes: c82e9aa0 ("mlx4_core: resource tracking for HCA resources used by guests")
      Signed-off-by: default avatarJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: default avatarTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      444987d5
    • Willem de Bruijn's avatar
      ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull · 03fbf2b8
      Willem de Bruijn authored
      [ Upstream commit 2efd4fca ]
      
      Syzbot reported a read beyond the end of the skb head when returning
      IPV6_ORIGDSTADDR:
      
        BUG: KMSAN: kernel-infoleak in put_cmsg+0x5ef/0x860 net/core/scm.c:242
        CPU: 0 PID: 4501 Comm: syz-executor128 Not tainted 4.17.0+ #9
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
        Google 01/01/2011
        Call Trace:
          __dump_stack lib/dump_stack.c:77 [inline]
          dump_stack+0x185/0x1d0 lib/dump_stack.c:113
          kmsan_report+0x188/0x2a0 mm/kmsan/kmsan.c:1125
          kmsan_internal_check_memory+0x138/0x1f0 mm/kmsan/kmsan.c:1219
          kmsan_copy_to_user+0x7a/0x160 mm/kmsan/kmsan.c:1261
          copy_to_user include/linux/uaccess.h:184 [inline]
          put_cmsg+0x5ef/0x860 net/core/scm.c:242
          ip6_datagram_recv_specific_ctl+0x1cf3/0x1eb0 net/ipv6/datagram.c:719
          ip6_datagram_recv_ctl+0x41c/0x450 net/ipv6/datagram.c:733
          rawv6_recvmsg+0x10fb/0x1460 net/ipv6/raw.c:521
          [..]
      
      This logic and its ipv4 counterpart read the destination port from
      the packet at skb_transport_offset(skb) + 4.
      
      With MSG_MORE and a local SOCK_RAW sender, syzbot was able to cook a
      packet that stores headers exactly up to skb_transport_offset(skb) in
      the head and the remainder in a frag.
      
      Call pskb_may_pull before accessing the pointer to ensure that it lies
      in skb head.
      
      Link: http://lkml.kernel.org/r/CAF=yD-LEJwZj5a1-bAAj2Oy_hKmGygV6rsJ_WOrAYnv-fnayiQ@mail.gmail.com
      Reported-by: syzbot+9adb4b567003cac781f0@syzkaller.appspotmail.com
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      03fbf2b8
    • Paolo Abeni's avatar
      ip: hash fragments consistently · 93d94fec
      Paolo Abeni authored
      [ Upstream commit 3dd1c9a1 ]
      
      The skb hash for locally generated ip[v6] fragments belonging
      to the same datagram can vary in several circumstances:
      * for connected UDP[v6] sockets, the first fragment get its hash
        via set_owner_w()/skb_set_hash_from_sk()
      * for unconnected IPv6 UDPv6 sockets, the first fragment can get
        its hash via ip6_make_flowlabel()/skb_get_hash_flowi6(), if
        auto_flowlabel is enabled
      
      For the following frags the hash is usually computed via
      skb_get_hash().
      The above can cause OoO for unconnected IPv6 UDPv6 socket: in that
      scenario the egress tx queue can be selected on a per packet basis
      via the skb hash.
      It may also fool flow-oriented schedulers to place fragments belonging
      to the same datagram in different flows.
      
      Fix the issue by copying the skb hash from the head frag into
      the others at fragmentation time.
      
      Before this commit:
      perf probe -a "dev_queue_xmit skb skb->hash skb->l4_hash:b1@0/8 skb->sw_hash:b1@1/8"
      netperf -H $IPV4 -t UDP_STREAM -l 5 -- -m 2000 -n &
      perf record -e probe:dev_queue_xmit -e probe:skb_set_owner_w -a sleep 0.1
      perf script
      probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=3713014309 l4_hash=1 sw_hash=0
      probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=0 l4_hash=0 sw_hash=0
      
      After this commit:
      probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=2171763177 l4_hash=1 sw_hash=0
      probe:dev_queue_xmit: (ffffffff8c6b1b20) hash=2171763177 l4_hash=1 sw_hash=0
      
      Fixes: b73c3d0e ("net: Save TX flow hash in sock and set in skbuf on xmit")
      Fixes: 67800f9b ("ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel")
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      93d94fec
    • Paul Burton's avatar
      MIPS: Fix off-by-one in pci_resource_to_user() · 650321fe
      Paul Burton authored
      commit 38c0a74f upstream.
      
      The MIPS implementation of pci_resource_to_user() introduced in v3.12 by
      commit 4c2924b7 ("MIPS: PCI: Use pci_resource_to_user to map pci
      memory space properly") incorrectly sets *end to the address of the
      byte after the resource, rather than the last byte of the resource.
      
      This results in userland seeing resources as a byte larger than they
      actually are, for example a 32 byte BAR will be reported by a tool such
      as lspci as being 33 bytes in size:
      
          Region 2: I/O ports at 1000 [disabled] [size=33]
      
      Correct this by subtracting one from the calculated end address,
      reporting the correct address to userland.
      Signed-off-by: default avatarPaul Burton <paul.burton@mips.com>
      Reported-by: default avatarRui Wang <rui.wang@windriver.com>
      Fixes: 4c2924b7 ("MIPS: PCI: Use pci_resource_to_user to map pci memory space properly")
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Wolfgang Grandegger <wg@grandegger.com>
      Cc: linux-mips@linux-mips.org
      Cc: stable@vger.kernel.org # v3.12+
      Patchwork: https://patchwork.linux-mips.org/patch/19829/Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      650321fe
    • Felix Fietkau's avatar
      MIPS: ath79: fix register address in ath79_ddr_wb_flush() · 92f72413
      Felix Fietkau authored
      commit bc88ad2e upstream.
      
      ath79_ddr_wb_flush_base has the type void __iomem *, so register offsets
      need to be a multiple of 4 in order to access the intended register.
      Signed-off-by: default avatarFelix Fietkau <nbd@nbd.name>
      Signed-off-by: default avatarJohn Crispin <john@phrozen.org>
      Signed-off-by: default avatarPaul Burton <paul.burton@mips.com>
      Fixes: 24b0e3e8 ("MIPS: ath79: Improve the DDR controller interface")
      Patchwork: https://patchwork.linux-mips.org/patch/19912/
      Cc: Alban Bedel <albeu@free.fr>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: linux-mips@linux-mips.org
      Cc: stable@vger.kernel.org # 4.2+
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      92f72413
  2. 25 Jul, 2018 13 commits