1. 04 May, 2019 5 commits
  2. 02 May, 2019 35 commits
    • Greg Kroah-Hartman's avatar
      Linux 4.19.38 · a03957ab
      Greg Kroah-Hartman authored
      a03957ab
    • Diana Craciun's avatar
    • Jakub Kicinski's avatar
      net/tls: don't leak IV and record seq when offload fails · 53db6523
      Jakub Kicinski authored
      [ Upstream commit 12c76861 ]
      
      When device refuses the offload in tls_set_device_offload_rx()
      it calls tls_sw_free_resources_rx() to clean up software context
      state.
      
      Unfortunately, tls_sw_free_resources_rx() does not free all
      the state tls_set_sw_offload() allocated - it leaks IV and
      sequence number buffers.  All other code paths which lead to
      tls_sw_release_resources_rx() (which tls_sw_free_resources_rx()
      calls) free those right before the call.
      
      Avoid the leak by moving freeing of iv and rec_seq into
      tls_sw_release_resources_rx().
      
      Fixes: 4799ac81 ("tls: Add rx inline crypto offload")
      Signed-off-by: default avatarJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: default avatarDirk van der Merwe <dirk.vandermerwe@netronome.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      53db6523
    • Jakub Kicinski's avatar
      net/tls: avoid potential deadlock in tls_set_device_offload_rx() · d3bdd359
      Jakub Kicinski authored
      [ Upstream commit 62ef81d5 ]
      
      If device supports offload, but offload fails tls_set_device_offload_rx()
      will call tls_sw_free_resources_rx() which (unhelpfully) releases
      and reacquires the socket lock.
      
      For a small fix release and reacquire the device_offload_lock.
      
      Fixes: 4799ac81 ("tls: Add rx inline crypto offload")
      Signed-off-by: default avatarJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: default avatarDirk van der Merwe <dirk.vandermerwe@netronome.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d3bdd359
    • Maxim Mikityanskiy's avatar
      net/mlx5e: Fix use-after-free after xdp_return_frame · 041b3224
      Maxim Mikityanskiy authored
      [ Upstream commit 12fc512f ]
      
      xdp_return_frame releases the frame. It leads to releasing the page, so
      it's not allowed to access xdpi.xdpf->len after that, because xdpi.xdpf
      is at xdp->data_hard_start after convert_to_xdp_frame. This patch moves
      the memory access to precede the return of the frame.
      
      Fixes: 58b99ee3 ("net/mlx5e: Add support for XDP_REDIRECT in device-out side")
      Signed-off-by: default avatarMaxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      041b3224
    • Maxim Mikityanskiy's avatar
      net/mlx5e: Fix the max MTU check in case of XDP · ae6b0710
      Maxim Mikityanskiy authored
      [ Upstream commit d460c271 ]
      
      MLX5E_XDP_MAX_MTU was calculated incorrectly. It didn't account for
      NET_IP_ALIGN and MLX5E_HW2SW_MTU, and it also misused MLX5_SKB_FRAG_SZ.
      This commit fixes the calculations and adds a brief explanation for the
      formula used.
      
      Fixes: a26a5bdf ("net/mlx5e: Restrict the combination of large MTU and XDP")
      Signed-off-by: default avatarMaxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ae6b0710
    • Petr Machata's avatar
      mlxsw: spectrum: Put MC TCs into DWRR mode · b08774d3
      Petr Machata authored
      [ Upstream commit f476b3f8 ]
      
      Both Spectrum-1 and Spectrum-2 chips are currently configured such that
      pairs of TC n (which is used for UC traffic) and TC n+8 (which is used
      for MC traffic) are feeding into the same subgroup. Strict
      prioritization is configured between the two TCs, and by enabling
      MC-aware mode on the switch, the lower-numbered (UC) TCs are favored
      over the higher-numbered (MC) TCs.
      
      On Spectrum-2 however, there is an issue in configuration of the
      MC-aware mode. As a result, MC traffic is prioritized over UC traffic.
      To work around the issue, configure the MC TCs with DWRR mode (while
      keeping the UC TCs in strict mode).
      
      With this patch, the multicast-unicast arbitration results in the same
      behavior on both Spectrum-1 and Spectrum-2 chips.
      
      Fixes: 7b819530 ("mlxsw: spectrum: Configure MC-aware mode on mlxsw ports")
      Signed-off-by: default avatarPetr Machata <petrm@mellanox.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b08774d3
    • Ido Schimmel's avatar
      mlxsw: pci: Reincrease PCI reset timeout · 21e47998
      Ido Schimmel authored
      [ Upstream commit 1ab30301 ]
      
      During driver initialization the driver sends a reset to the device and
      waits for the firmware to signal that it is ready to continue.
      
      Commit d2f372ba ("mlxsw: pci: Increase PCI SW reset timeout")
      increased the timeout to 13 seconds due to longer PHY calibration in
      Spectrum-2 compared to Spectrum-1.
      
      Recently it became apparent that this timeout is too short and therefore
      this patch increases it again to a safer limit that will be reduced in
      the future.
      
      Fixes: c3ab4354 ("mlxsw: spectrum: Extend to support Spectrum-2 ASIC")
      Fixes: d2f372ba ("mlxsw: pci: Increase PCI SW reset timeout")
      Signed-off-by: default avatarIdo Schimmel <idosch@mellanox.com>
      Acked-by: default avatarJiri Pirko <jiri@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      21e47998
    • Jun Xiao's avatar
      net: hns: Fix WARNING when hns modules installed · e875a409
      Jun Xiao authored
      Commit dfdf26babc98 upstream
      
      this patch need merge to 4.19.y stable kernel
      
      Fix Conflict:already fixed the confilct dfdf26babc98 with Yonglong Liu
      
      stable candidate:user cannot connect to the internet via hns dev
      by default setting without this patch
      
      we have already verified this patch on kunpeng916 platform,
      and it works well.
      Signed-off-by: default avatarYonglong Liu <liuyonglong@huawei.com>
      Signed-off-by: default avatarJun Xiao <xiaojun2@hisilicon.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e875a409
    • Hangbin Liu's avatar
      team: fix possible recursive locking when add slaves · 7ce836e8
      Hangbin Liu authored
      [ Upstream commit 925b0c84 ]
      
      If we add a bond device which is already the master of the team interface,
      we will hold the team->lock in team_add_slave() first and then request the
      lock in team_set_mac_address() again. The functions are called like:
      
      - team_add_slave()
       - team_port_add()
         - team_port_enter()
           - team_modeop_port_enter()
             - __set_port_dev_addr()
               - dev_set_mac_address()
                 - bond_set_mac_address()
                   - dev_set_mac_address()
        	       - team_set_mac_address
      
      Although team_upper_dev_link() would check the upper devices but it is
      called too late. Fix it by adding a checking before processing the slave.
      
      v2: Do not split the string in netdev_err()
      
      Fixes: 3d249d4c ("net: introduce ethernet teaming device")
      Acked-by: default avatarJiri Pirko <jiri@mellanox.com>
      Signed-off-by: default avatarHangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7ce836e8
    • Su Bao Cheng's avatar
      stmmac: pci: Adjust IOT2000 matching · 1f78e75e
      Su Bao Cheng authored
      [ Upstream commit e0c1d14a ]
      
      Since there are more IOT2040 variants with identical hardware but
      different asset tags, the asset tag matching should be adjusted to
      support them.
      
      For the board name "SIMATIC IOT2000", currently there are 2 types of
      hardware, IOT2020 and IOT2040. The IOT2020 is identified by its unique
      asset tag. Match on it first. If we then match on the board name only,
      we will catch all IOT2040 variants. In the future there will be no other
      devices with the "SIMATIC IOT2000" DMI board name but different
      hardware.
      Signed-off-by: default avatarSu Bao Cheng <baocheng.su@siemens.com>
      Reviewed-by: default avatarJan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1f78e75e
    • Jakub Kicinski's avatar
      net/tls: fix refcount adjustment in fallback · e97f0bc7
      Jakub Kicinski authored
      [ Upstream commit 9188d5ca ]
      
      Unlike atomic_add(), refcount_add() does not deal well
      with a negative argument.  TLS fallback code reallocates
      the skb and is very likely to shrink the truesize, leading to:
      
      [  189.513254] WARNING: CPU: 5 PID: 0 at lib/refcount.c:81 refcount_add_not_zero_checked+0x15c/0x180
       Call Trace:
        refcount_add_checked+0x6/0x40
        tls_enc_skb+0xb93/0x13e0 [tls]
      
      Once wmem_allocated count saturates the application can no longer
      send data on the socket.  This is similar to Eric's fixes for GSO,
      TCP:
      commit 7ec318fe ("tcp: gso: avoid refcount_t warning from tcp_gso_segment()")
      and UDP:
      commit 575b65bc ("udp: avoid refcount_t saturation in __udp_gso_segment()").
      
      Unlike the GSO case, for TLS fallback it's likely that the skb has
      shrunk, so the "likely" annotation is the other way around (likely
      branch being "sub").
      
      Fixes: e8f69799 ("net/tls: Add generic NIC offload infrastructure")
      Signed-off-by: default avatarJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: default avatarJohn Hurley <john.hurley@netronome.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e97f0bc7
    • Vinod Koul's avatar
      net: stmmac: move stmmac_check_ether_addr() to driver probe · b02f8aa8
      Vinod Koul authored
      [ Upstream commit b561af36 ]
      
      stmmac_check_ether_addr() checks the MAC address and assigns one in
      driver open(). In many cases when we create slave netdevice, the dev
      addr is inherited from master but the master dev addr maybe NULL at
      that time, so move this call to driver probe so that address is
      always valid.
      Signed-off-by: default avatarXiaofei Shen <xiaofeis@codeaurora.org>
      Tested-by: default avatarXiaofei Shen <xiaofeis@codeaurora.org>
      Signed-off-by: default avatarSneh Shah <snehshah@codeaurora.org>
      Signed-off-by: default avatarVinod Koul <vkoul@kernel.org>
      Reviewed-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b02f8aa8
    • Eric Dumazet's avatar
      net/rose: fix unbound loop in rose_loopback_timer() · d7b10dfe
      Eric Dumazet authored
      [ Upstream commit 0453c682 ]
      
      This patch adds a limit on the number of skbs that fuzzers can queue
      into loopback_queue. 1000 packets for rose loopback seems more than enough.
      
      Then, since we now have multiple cpus in most linux hosts,
      we also need to limit the number of skbs rose_loopback_timer()
      can dequeue at each round.
      
      rose_loopback_queue() can be drop-monitor friendly, calling
      consume_skb() or kfree_skb() appropriately.
      
      Finally, use mod_timer() instead of del_timer() + add_timer()
      
      syzbot report was :
      
      rcu: INFO: rcu_preempt self-detected stall on CPU
      rcu:    0-...!: (10499 ticks this GP) idle=536/1/0x4000000000000002 softirq=103291/103291 fqs=34
      rcu:     (t=10500 jiffies g=140321 q=323)
      rcu: rcu_preempt kthread starved for 10426 jiffies! g140321 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
      rcu: RCU grace-period kthread stack dump:
      rcu_preempt     I29168    10      2 0x80000000
      Call Trace:
       context_switch kernel/sched/core.c:2877 [inline]
       __schedule+0x813/0x1cc0 kernel/sched/core.c:3518
       schedule+0x92/0x180 kernel/sched/core.c:3562
       schedule_timeout+0x4db/0xfd0 kernel/time/timer.c:1803
       rcu_gp_fqs_loop kernel/rcu/tree.c:1971 [inline]
       rcu_gp_kthread+0x962/0x17b0 kernel/rcu/tree.c:2128
       kthread+0x357/0x430 kernel/kthread.c:253
       ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
      NMI backtrace for cpu 0
      CPU: 0 PID: 7632 Comm: kworker/0:4 Not tainted 5.1.0-rc5+ #172
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: events iterate_cleanup_work
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       nmi_cpu_backtrace.cold+0x63/0xa4 lib/nmi_backtrace.c:101
       nmi_trigger_cpumask_backtrace+0x1be/0x236 lib/nmi_backtrace.c:62
       arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
       trigger_single_cpu_backtrace include/linux/nmi.h:164 [inline]
       rcu_dump_cpu_stacks+0x183/0x1cf kernel/rcu/tree.c:1223
       print_cpu_stall kernel/rcu/tree.c:1360 [inline]
       check_cpu_stall kernel/rcu/tree.c:1434 [inline]
       rcu_pending kernel/rcu/tree.c:3103 [inline]
       rcu_sched_clock_irq.cold+0x500/0xa4a kernel/rcu/tree.c:2544
       update_process_times+0x32/0x80 kernel/time/timer.c:1635
       tick_sched_handle+0xa2/0x190 kernel/time/tick-sched.c:161
       tick_sched_timer+0x47/0x130 kernel/time/tick-sched.c:1271
       __run_hrtimer kernel/time/hrtimer.c:1389 [inline]
       __hrtimer_run_queues+0x33e/0xde0 kernel/time/hrtimer.c:1451
       hrtimer_interrupt+0x314/0x770 kernel/time/hrtimer.c:1509
       local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1035 [inline]
       smp_apic_timer_interrupt+0x120/0x570 arch/x86/kernel/apic/apic.c:1060
       apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:807
      RIP: 0010:__sanitizer_cov_trace_pc+0x0/0x50 kernel/kcov.c:95
      Code: 89 25 b4 6e ec 08 41 bc f4 ff ff ff e8 cd 5d ea ff 48 c7 05 9e 6e ec 08 00 00 00 00 e9 a4 e9 ff ff 90 90 90 90 90 90 90 90 90 <55> 48 89 e5 48 8b 75 08 65 48 8b 04 25 00 ee 01 00 65 8b 15 c8 60
      RSP: 0018:ffff8880ae807ce0 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
      RAX: ffff88806fd40640 RBX: dffffc0000000000 RCX: ffffffff863fbc56
      RDX: 0000000000000100 RSI: ffffffff863fbc1d RDI: ffff88808cf94228
      RBP: ffff8880ae807d10 R08: ffff88806fd40640 R09: ffffed1015d00f8b
      R10: ffffed1015d00f8a R11: 0000000000000003 R12: ffff88808cf941c0
      R13: 00000000fffff034 R14: ffff8882166cd840 R15: 0000000000000000
       rose_loopback_timer+0x30d/0x3f0 net/rose/rose_loopback.c:91
       call_timer_fn+0x190/0x720 kernel/time/timer.c:1325
       expire_timers kernel/time/timer.c:1362 [inline]
       __run_timers kernel/time/timer.c:1681 [inline]
       __run_timers kernel/time/timer.c:1649 [inline]
       run_timer_softirq+0x652/0x1700 kernel/time/timer.c:1694
       __do_softirq+0x266/0x95a kernel/softirq.c:293
       do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1027
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d7b10dfe
    • Zhu Yanjun's avatar
      net: rds: exchange of 8K and 1M pool · ed1866aa
      Zhu Yanjun authored
      [ Upstream commit 4b9fc714 ]
      
      Before the commit 490ea596 ("RDS: IB: move FMR code to its own file"),
      when the dirty_count is greater than 9/10 of max_items of 8K pool,
      1M pool is used, Vice versa. After the commit 490ea596 ("RDS: IB: move
      FMR code to its own file"), the above is removed. When we make the
      following tests.
      
      Server:
        rds-stress -r 1.1.1.16 -D 1M
      
      Client:
        rds-stress -r 1.1.1.14 -s 1.1.1.16 -D 1M
      
      The following will appear.
      "
      connecting to 1.1.1.16:4000
      negotiated options, tasks will start in 2 seconds
      Starting up..header from 1.1.1.166:4001 to id 4001 bogus
      ..
      tsks  tx/s  rx/s tx+rx K/s  mbi K/s  mbo K/s tx us/c  rtt us
      cpu %
         1    0    0     0.00     0.00     0.00    0.00 0.00 -1.00
         1    0    0     0.00     0.00     0.00    0.00 0.00 -1.00
         1    0    0     0.00     0.00     0.00    0.00 0.00 -1.00
         1    0    0     0.00     0.00     0.00    0.00 0.00 -1.00
         1    0    0     0.00     0.00     0.00    0.00 0.00 -1.00
      ...
      "
      So this exchange between 8K and 1M pool is added back.
      
      Fixes: commit 490ea596 ("RDS: IB: move FMR code to its own file")
      Signed-off-by: default avatarZhu Yanjun <yanjun.zhu@oracle.com>
      Acked-by: default avatarSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ed1866aa
    • Erez Alfasi's avatar
      net/mlx5e: ethtool, Remove unsupported SFP EEPROM high pages query · 7da11d6a
      Erez Alfasi authored
      [ Upstream commit ace329f4 ]
      
      Querying EEPROM high pages data for SFP module is currently
      not supported by our driver and yet queried, resulting in
      invalid FW queries.
      
      Set the EEPROM ethtool data length to 256 for SFP module will
      limit the reading for page 0 only and prevent invalid FW queries.
      
      Fixes: bb64143e ("net/mlx5e: Add ethtool support for dump module EEPROM")
      Signed-off-by: default avatarErez Alfasi <ereza@mellanox.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7da11d6a
    • Amit Cohen's avatar
      mlxsw: spectrum: Fix autoneg status in ethtool · 829fd984
      Amit Cohen authored
      [ Upstream commit 151f0ddd ]
      
      If link is down and autoneg is set to on/off, the status in ethtool does
      not change.
      
      The reason is when the link is down the function returns with zero
      before changing autoneg value.
      
      Move the checking of link state (up/down) to be performed after setting
      autoneg value, in order to be sure that autoneg will change in any case.
      
      Fixes: 56ade8fe ("mlxsw: spectrum: Add initial support for Spectrum ASIC")
      Signed-off-by: default avatarAmit Cohen <amitc@mellanox.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@mellanox.com>
      Acked-by: default avatarJiri Pirko <jiri@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      829fd984
    • ZhangXiaoxu's avatar
      ipv4: set the tcp_min_rtt_wlen range from 0 to one day · 250e51f8
      ZhangXiaoxu authored
      [ Upstream commit 19fad20d ]
      
      There is a UBSAN report as below:
      UBSAN: Undefined behaviour in net/ipv4/tcp_input.c:2877:56
      signed integer overflow:
      2147483647 * 1000 cannot be represented in type 'int'
      CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.1.0-rc4-00058-g582549e3 #1
      Call Trace:
       <IRQ>
       dump_stack+0x8c/0xba
       ubsan_epilogue+0x11/0x60
       handle_overflow+0x12d/0x170
       ? ttwu_do_wakeup+0x21/0x320
       __ubsan_handle_mul_overflow+0x12/0x20
       tcp_ack_update_rtt+0x76c/0x780
       tcp_clean_rtx_queue+0x499/0x14d0
       tcp_ack+0x69e/0x1240
       ? __wake_up_sync_key+0x2c/0x50
       ? update_group_capacity+0x50/0x680
       tcp_rcv_established+0x4e2/0xe10
       tcp_v4_do_rcv+0x22b/0x420
       tcp_v4_rcv+0xfe8/0x1190
       ip_protocol_deliver_rcu+0x36/0x180
       ip_local_deliver+0x15b/0x1a0
       ip_rcv+0xac/0xd0
       __netif_receive_skb_one_core+0x7f/0xb0
       __netif_receive_skb+0x33/0xc0
       netif_receive_skb_internal+0x84/0x1c0
       napi_gro_receive+0x2a0/0x300
       receive_buf+0x3d4/0x2350
       ? detach_buf_split+0x159/0x390
       virtnet_poll+0x198/0x840
       ? reweight_entity+0x243/0x4b0
       net_rx_action+0x25c/0x770
       __do_softirq+0x19b/0x66d
       irq_exit+0x1eb/0x230
       do_IRQ+0x7a/0x150
       common_interrupt+0xf/0xf
       </IRQ>
      
      It can be reproduced by:
        echo 2147483647 > /proc/sys/net/ipv4/tcp_min_rtt_wlen
      
      Fixes: f6722583 ("tcp: track min RTT using windowed min-filter")
      Signed-off-by: default avatarZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      250e51f8
    • Eric Dumazet's avatar
      ipv4: add sanity checks in ipv4_link_failure() · 07445fea
      Eric Dumazet authored
      [ Upstream commit 20ff83f1 ]
      
      Before calling __ip_options_compile(), we need to ensure the network
      header is a an IPv4 one, and that it is already pulled in skb->head.
      
      RAW sockets going through a tunnel can end up calling ipv4_link_failure()
      with total garbage in the skb, or arbitrary lengthes.
      
      syzbot report :
      
      BUG: KASAN: stack-out-of-bounds in memcpy include/linux/string.h:355 [inline]
      BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0x294/0x1120 net/ipv4/ip_options.c:123
      Write of size 69 at addr ffff888096abf068 by task syz-executor.4/9204
      
      CPU: 0 PID: 9204 Comm: syz-executor.4 Not tainted 5.1.0-rc5+ #77
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
       kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       check_memory_region_inline mm/kasan/generic.c:185 [inline]
       check_memory_region+0x123/0x190 mm/kasan/generic.c:191
       memcpy+0x38/0x50 mm/kasan/common.c:133
       memcpy include/linux/string.h:355 [inline]
       __ip_options_echo+0x294/0x1120 net/ipv4/ip_options.c:123
       __icmp_send+0x725/0x1400 net/ipv4/icmp.c:695
       ipv4_link_failure+0x29f/0x550 net/ipv4/route.c:1204
       dst_link_failure include/net/dst.h:427 [inline]
       vti6_xmit net/ipv6/ip6_vti.c:514 [inline]
       vti6_tnl_xmit+0x10d4/0x1c0c net/ipv6/ip6_vti.c:553
       __netdev_start_xmit include/linux/netdevice.h:4414 [inline]
       netdev_start_xmit include/linux/netdevice.h:4423 [inline]
       xmit_one net/core/dev.c:3292 [inline]
       dev_hard_start_xmit+0x1b2/0x980 net/core/dev.c:3308
       __dev_queue_xmit+0x271d/0x3060 net/core/dev.c:3878
       dev_queue_xmit+0x18/0x20 net/core/dev.c:3911
       neigh_direct_output+0x16/0x20 net/core/neighbour.c:1527
       neigh_output include/net/neighbour.h:508 [inline]
       ip_finish_output2+0x949/0x1740 net/ipv4/ip_output.c:229
       ip_finish_output+0x73c/0xd50 net/ipv4/ip_output.c:317
       NF_HOOK_COND include/linux/netfilter.h:278 [inline]
       ip_output+0x21f/0x670 net/ipv4/ip_output.c:405
       dst_output include/net/dst.h:444 [inline]
       NF_HOOK include/linux/netfilter.h:289 [inline]
       raw_send_hdrinc net/ipv4/raw.c:432 [inline]
       raw_sendmsg+0x1d2b/0x2f20 net/ipv4/raw.c:663
       inet_sendmsg+0x147/0x5d0 net/ipv4/af_inet.c:798
       sock_sendmsg_nosec net/socket.c:651 [inline]
       sock_sendmsg+0xdd/0x130 net/socket.c:661
       sock_write_iter+0x27c/0x3e0 net/socket.c:988
       call_write_iter include/linux/fs.h:1866 [inline]
       new_sync_write+0x4c7/0x760 fs/read_write.c:474
       __vfs_write+0xe4/0x110 fs/read_write.c:487
       vfs_write+0x20c/0x580 fs/read_write.c:549
       ksys_write+0x14f/0x2d0 fs/read_write.c:599
       __do_sys_write fs/read_write.c:611 [inline]
       __se_sys_write fs/read_write.c:608 [inline]
       __x64_sys_write+0x73/0xb0 fs/read_write.c:608
       do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x458c29
      Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f293b44bc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000458c29
      RDX: 0000000000000014 RSI: 00000000200002c0 RDI: 0000000000000003
      RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f293b44c6d4
      R13: 00000000004c8623 R14: 00000000004ded68 R15: 00000000ffffffff
      
      The buggy address belongs to the page:
      page:ffffea00025aafc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
      flags: 0x1fffc0000000000()
      raw: 01fffc0000000000 0000000000000000 ffffffff025a0101 0000000000000000
      raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff888096abef80: 00 00 00 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 f2
       ffff888096abf000: f2 f2 f2 f2 00 00 00 00 00 00 00 00 00 00 00 00
      >ffff888096abf080: 00 00 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00
                               ^
       ffff888096abf100: 00 00 00 00 f1 f1 f1 f1 00 00 f3 f3 00 00 00 00
       ffff888096abf180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      
      Fixes: ed0de45a ("ipv4: recompile ip options in ipv4_link_failure")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Stephen Suryaputra <ssuryaextr@gmail.com>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      07445fea
    • Sebastian Andrzej Siewior's avatar
      x86/fpu: Don't export __kernel_fpu_{begin,end}() · d4ff57d0
      Sebastian Andrzej Siewior authored
      commit 12209993 upstream.
      
      There is one user of __kernel_fpu_begin() and before invoking it,
      it invokes preempt_disable(). So it could invoke kernel_fpu_begin()
      right away. The 32bit version of arch_efi_call_virt_setup() and
      arch_efi_call_virt_teardown() does this already.
      
      The comment above *kernel_fpu*() claims that before invoking
      __kernel_fpu_begin() preemption should be disabled and that KVM is a
      good example of doing it. Well, KVM doesn't do that since commit
      
        f775b13e ("x86,kvm: move qemu/guest FPU switching out to vcpu_run")
      
      so it is not an example anymore.
      
      With EFI gone as the last user of __kernel_fpu_{begin|end}(), both can
      be made static and not exported anymore.
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: default avatarBorislav Petkov <bp@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@surriel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Nicolai Stange <nstange@suse.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kvm ML <kvm@vger.kernel.org>
      Cc: linux-efi <linux-efi@vger.kernel.org>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20181129150210.2k4mawt37ow6c2vq@linutronix.deSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d4ff57d0
    • Jan Kara's avatar
      mm: Fix warning in insert_pfn() · 423497a9
      Jan Kara authored
      commit f2c57d91 upstream.
      
      In DAX mode a write pagefault can race with write(2) in the following
      way:
      
      CPU0                            CPU1
                                      write fault for mapped zero page (hole)
      dax_iomap_rw()
        iomap_apply()
          xfs_file_iomap_begin()
            - allocates blocks
          dax_iomap_actor()
            invalidate_inode_pages2_range()
              - invalidates radix tree entries in given range
                                      dax_iomap_pte_fault()
                                        grab_mapping_entry()
                                          - no entry found, creates empty
                                        ...
                                        xfs_file_iomap_begin()
                                          - finds already allocated block
                                        ...
                                        vmf_insert_mixed_mkwrite()
                                          - WARNs and does nothing because there
                                            is still zero page mapped in PTE
              unmap_mapping_pages()
      
      This race results in WARN_ON from insert_pfn() and is occasionally
      triggered by fstest generic/344. Note that the race is otherwise
      harmless as before write(2) on CPU0 is finished, we will invalidate page
      tables properly and thus user of mmap will see modified data from
      write(2) from that point on. So just restrict the warning only to the
      case when the PFN in PTE is not zero page.
      
      Link: http://lkml.kernel.org/r/20180824154542.26872-1-jack@suse.czSigned-off-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      423497a9
    • Daniel Borkmann's avatar
      x86/retpolines: Disable switch jump tables when retpolines are enabled · e923c6b7
      Daniel Borkmann authored
      commit a9d57ef1 upstream.
      
      Commit ce02ef06 ("x86, retpolines: Raise limit for generating indirect
      calls from switch-case") raised the limit under retpolines to 20 switch
      cases where gcc would only then start to emit jump tables, and therefore
      effectively disabling the emission of slow indirect calls in this area.
      
      After this has been brought to attention to gcc folks [0], Martin Liska
      has then fixed gcc to align with clang by avoiding to generate switch jump
      tables entirely under retpolines. This is taking effect in gcc starting
      from stable version 8.4.0. Given kernel supports compilation with older
      versions of gcc where the fix is not being available or backported anymore,
      we need to keep the extra KBUILD_CFLAGS around for some time and generally
      set the -fno-jump-tables to align with what more recent gcc is doing
      automatically today.
      
      More than 20 switch cases are not expected to be fast-path critical, but
      it would still be good to align with gcc behavior for versions < 8.4.0 in
      order to have consistency across supported gcc versions. vmlinux size is
      slightly growing by 0.27% for older gcc. This flag is only set to work
      around affected gcc, no change for clang.
      
        [0] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86952Suggested-by: default avatarMartin Liska <mliska@suse.cz>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Björn Töpel<bjorn.topel@intel.com>
      Cc: Magnus Karlsson <magnus.karlsson@intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: H.J. Lu <hjl.tools@gmail.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: David S. Miller <davem@davemloft.net>
      Link: https://lkml.kernel.org/r/20190325135620.14882-1-daniel@iogearbox.netSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e923c6b7
    • Daniel Borkmann's avatar
      x86, retpolines: Raise limit for generating indirect calls from switch-case · 6cfcff3c
      Daniel Borkmann authored
      commit ce02ef06 upstream.
      
      From networking side, there are numerous attempts to get rid of indirect
      calls in fast-path wherever feasible in order to avoid the cost of
      retpolines, for example, just to name a few:
      
        * 283c16a2 ("indirect call wrappers: helpers to speed-up indirect calls of builtin")
        * aaa5d90b ("net: use indirect call wrappers at GRO network layer")
        * 028e0a47 ("net: use indirect call wrappers at GRO transport layer")
        * 356da6d0 ("dma-mapping: bypass indirect calls for dma-direct")
        * 09772d92 ("bpf: avoid retpoline for lookup/update/delete calls on maps")
        * 10870dd8 ("netfilter: nf_tables: add direct calls for all builtin expressions")
        [...]
      
      Recent work on XDP from Björn and Magnus additionally found that manually
      transforming the XDP return code switch statement with more than 5 cases
      into if-else combination would result in a considerable speedup in XDP
      layer due to avoidance of indirect calls in CONFIG_RETPOLINE enabled
      builds. On i40e driver with XDP prog attached, a 20-26% speedup has been
      observed [0]. Aside from XDP, there are many other places later in the
      networking stack's critical path with similar switch-case
      processing. Rather than fixing every XDP-enabled driver and locations in
      stack by hand, it would be good to instead raise the limit where gcc would
      emit expensive indirect calls from the switch under retpolines and stick
      with the default as-is in case of !retpoline configured kernels. This would
      also have the advantage that for archs where this is not necessary, we let
      compiler select the underlying target optimization for these constructs and
      avoid potential slow-downs by if-else hand-rewrite.
      
      In case of gcc, this setting is controlled by case-values-threshold which
      has an architecture global default that selects 4 or 5 (latter if target
      does not have a case insn that compares the bounds) where some arch back
      ends like arm64 or s390 override it with their own target hooks, for
      example, in gcc commit db7a90aa0de5 ("S/390: Disable prediction of indirect
      branches") the threshold pretty much disables jump tables by limit of 20
      under retpoline builds.  Comparing gcc's and clang's default code
      generation on x86-64 under O2 level with retpoline build results in the
      following outcome for 5 switch cases:
      
      * gcc with -mindirect-branch=thunk-inline -mindirect-branch-register:
      
        # gdb -batch -ex 'disassemble dispatch' ./c-switch
        Dump of assembler code for function dispatch:
         0x0000000000400be0 <+0>:     cmp    $0x4,%edi
         0x0000000000400be3 <+3>:     ja     0x400c35 <dispatch+85>
         0x0000000000400be5 <+5>:     lea    0x915f8(%rip),%rdx        # 0x4921e4
         0x0000000000400bec <+12>:    mov    %edi,%edi
         0x0000000000400bee <+14>:    movslq (%rdx,%rdi,4),%rax
         0x0000000000400bf2 <+18>:    add    %rdx,%rax
         0x0000000000400bf5 <+21>:    callq  0x400c01 <dispatch+33>
         0x0000000000400bfa <+26>:    pause
         0x0000000000400bfc <+28>:    lfence
         0x0000000000400bff <+31>:    jmp    0x400bfa <dispatch+26>
         0x0000000000400c01 <+33>:    mov    %rax,(%rsp)
         0x0000000000400c05 <+37>:    retq
         0x0000000000400c06 <+38>:    nopw   %cs:0x0(%rax,%rax,1)
         0x0000000000400c10 <+48>:    jmpq   0x400c90 <fn_3>
         0x0000000000400c15 <+53>:    nopl   (%rax)
         0x0000000000400c18 <+56>:    jmpq   0x400c70 <fn_2>
         0x0000000000400c1d <+61>:    nopl   (%rax)
         0x0000000000400c20 <+64>:    jmpq   0x400c50 <fn_1>
         0x0000000000400c25 <+69>:    nopl   (%rax)
         0x0000000000400c28 <+72>:    jmpq   0x400c40 <fn_0>
         0x0000000000400c2d <+77>:    nopl   (%rax)
         0x0000000000400c30 <+80>:    jmpq   0x400cb0 <fn_4>
         0x0000000000400c35 <+85>:    push   %rax
         0x0000000000400c36 <+86>:    callq  0x40dd80 <abort>
        End of assembler dump.
      
      * clang with -mretpoline emitting search tree:
      
        # gdb -batch -ex 'disassemble dispatch' ./c-switch
        Dump of assembler code for function dispatch:
         0x0000000000400b30 <+0>:     cmp    $0x1,%edi
         0x0000000000400b33 <+3>:     jle    0x400b44 <dispatch+20>
         0x0000000000400b35 <+5>:     cmp    $0x2,%edi
         0x0000000000400b38 <+8>:     je     0x400b4d <dispatch+29>
         0x0000000000400b3a <+10>:    cmp    $0x3,%edi
         0x0000000000400b3d <+13>:    jne    0x400b52 <dispatch+34>
         0x0000000000400b3f <+15>:    jmpq   0x400c50 <fn_3>
         0x0000000000400b44 <+20>:    test   %edi,%edi
         0x0000000000400b46 <+22>:    jne    0x400b5c <dispatch+44>
         0x0000000000400b48 <+24>:    jmpq   0x400c20 <fn_0>
         0x0000000000400b4d <+29>:    jmpq   0x400c40 <fn_2>
         0x0000000000400b52 <+34>:    cmp    $0x4,%edi
         0x0000000000400b55 <+37>:    jne    0x400b66 <dispatch+54>
         0x0000000000400b57 <+39>:    jmpq   0x400c60 <fn_4>
         0x0000000000400b5c <+44>:    cmp    $0x1,%edi
         0x0000000000400b5f <+47>:    jne    0x400b66 <dispatch+54>
         0x0000000000400b61 <+49>:    jmpq   0x400c30 <fn_1>
         0x0000000000400b66 <+54>:    push   %rax
         0x0000000000400b67 <+55>:    callq  0x40dd20 <abort>
        End of assembler dump.
      
        For sake of comparison, clang without -mretpoline:
      
        # gdb -batch -ex 'disassemble dispatch' ./c-switch
        Dump of assembler code for function dispatch:
         0x0000000000400b30 <+0>:	cmp    $0x4,%edi
         0x0000000000400b33 <+3>:	ja     0x400b57 <dispatch+39>
         0x0000000000400b35 <+5>:	mov    %edi,%eax
         0x0000000000400b37 <+7>:	jmpq   *0x492148(,%rax,8)
         0x0000000000400b3e <+14>:	jmpq   0x400bf0 <fn_0>
         0x0000000000400b43 <+19>:	jmpq   0x400c30 <fn_4>
         0x0000000000400b48 <+24>:	jmpq   0x400c10 <fn_2>
         0x0000000000400b4d <+29>:	jmpq   0x400c20 <fn_3>
         0x0000000000400b52 <+34>:	jmpq   0x400c00 <fn_1>
         0x0000000000400b57 <+39>:	push   %rax
         0x0000000000400b58 <+40>:	callq  0x40dcf0 <abort>
        End of assembler dump.
      
      Raising the cases to a high number (e.g. 100) will still result in similar
      code generation pattern with clang and gcc as above, in other words clang
      generally turns off jump table emission by having an extra expansion pass
      under retpoline build to turn indirectbr instructions from their IR into
      switch instructions as a built-in -mno-jump-table lowering of a switch (in
      this case, even if IR input already contained an indirect branch).
      
      For gcc, adding --param=case-values-threshold=20 as in similar fashion as
      s390 in order to raise the limit for x86 retpoline enabled builds results
      in a small vmlinux size increase of only 0.13% (before=18,027,528
      after=18,051,192). For clang this option is ignored due to i) not being
      needed as mentioned and ii) not having above cmdline
      parameter. Non-retpoline-enabled builds with gcc continue to use the
      default case-values-threshold setting, so nothing changes here.
      
      [0] https://lore.kernel.org/netdev/20190129095754.9390-1-bjorn.topel@gmail.com/
          and "The Path to DPDK Speeds for AF_XDP", LPC 2018, networking track:
        - http://vger.kernel.org/lpc_net2018_talks/lpc18_pres_af_xdp_perf-v3.pdf
        - http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdfSigned-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Acked-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: default avatarBjörn Töpel <bjorn.topel@intel.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: netdev@vger.kernel.org
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Magnus Karlsson <magnus.karlsson@intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Link: https://lkml.kernel.org/r/20190221221941.29358-1-daniel@iogearbox.netSigned-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6cfcff3c
    • Al Viro's avatar
      Fix aio_poll() races · e9e47779
      Al Viro authored
      commit af5c72b1 upstream.
      
      aio_poll() has to cope with several unpleasant problems:
      	* requests that might stay around indefinitely need to
      be made visible for io_cancel(2); that must not be done to
      a request already completed, though.
      	* in cases when ->poll() has placed us on a waitqueue,
      wakeup might have happened (and request completed) before ->poll()
      returns.
      	* worse, in some early wakeup cases request might end
      up re-added into the queue later - we can't treat "woken up and
      currently not in the queue" as "it's not going to stick around
      indefinitely"
      	* ... moreover, ->poll() might have decided not to
      put it on any queues to start with, and that needs to be distinguished
      from the previous case
      	* ->poll() might have tried to put us on more than one queue.
      Only the first will succeed for aio poll, so we might end up missing
      wakeups.  OTOH, we might very well notice that only after the
      wakeup hits and request gets completed (all before ->poll() gets
      around to the second poll_wait()).  In that case it's too late to
      decide that we have an error.
      
      req->woken was an attempt to deal with that.  Unfortunately, it was
      broken.  What we need to keep track of is not that wakeup has happened -
      the thing might come back after that.  It's that async reference is
      already gone and won't come back, so we can't (and needn't) put the
      request on the list of cancellables.
      
      The easiest case is "request hadn't been put on any waitqueues"; we
      can tell by seeing NULL apt.head, and in that case there won't be
      anything async.  We should either complete the request ourselves
      (if vfs_poll() reports anything of interest) or return an error.
      
      In all other cases we get exclusion with wakeups by grabbing the
      queue lock.
      
      If request is currently on queue and we have something interesting
      from vfs_poll(), we can steal it and complete the request ourselves.
      
      If it's on queue and vfs_poll() has not reported anything interesting,
      we either put it on the cancellable list, or, if we know that it
      hadn't been put on all queues ->poll() wanted it on, we steal it and
      return an error.
      
      If it's _not_ on queue, it's either been already dealt with (in which
      case we do nothing), or there's aio_poll_complete_work() about to be
      executed.  In that case we either put it on the cancellable list,
      or, if we know it hadn't been put on all queues ->poll() wanted it on,
      simulate what cancel would've done.
      
      It's a lot more convoluted than I'd like it to be.  Single-consumer APIs
      suck, and unfortunately aio is not an exception...
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e9e47779
    • Al Viro's avatar
      aio: store event at final iocb_put() · aab66dfb
      Al Viro authored
      commit 2bb874c0 upstream.
      
      Instead of having aio_complete() set ->ki_res.{res,res2}, do that
      explicitly in its callers, drop the reference (as aio_complete()
      used to do) and delay the rest until the final iocb_put().
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      aab66dfb
    • Al Viro's avatar
      aio: keep io_event in aio_kiocb · c20202c5
      Al Viro authored
      commit a9339b78 upstream.
      
      We want to separate forming the resulting io_event from putting it
      into the ring buffer.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c20202c5
    • Al Viro's avatar
      aio: fold lookup_kiocb() into its sole caller · 592ea630
      Al Viro authored
      commit 833f4154 upstream.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      592ea630
    • Linus Torvalds's avatar
      pin iocb through aio. · c7f2525a
      Linus Torvalds authored
      commit b53119f1 upstream.
      
      aio_poll() is not the only case that needs file pinned; worse, while
      aio_read()/aio_write() can live without pinning iocb itself, the
      proof is rather brittle and can easily break on later changes.
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c7f2525a
    • Linus Torvalds's avatar
      aio: simplify - and fix - fget/fput for io_submit() · d6b2615f
      Linus Torvalds authored
      commit 84c4e1f8 upstream.
      
      Al Viro root-caused a race where the IOCB_CMD_POLL handling of
      fget/fput() could cause us to access the file pointer after it had
      already been freed:
      
       "In more details - normally IOCB_CMD_POLL handling looks so:
      
         1) io_submit(2) allocates aio_kiocb instance and passes it to
            aio_poll()
      
         2) aio_poll() resolves the descriptor to struct file by req->file =
            fget(iocb->aio_fildes)
      
         3) aio_poll() sets ->woken to false and raises ->ki_refcnt of that
            aio_kiocb to 2 (bumps by 1, that is).
      
         4) aio_poll() calls vfs_poll(). After sanity checks (basically,
            "poll_wait() had been called and only once") it locks the queue.
            That's what the extra reference to iocb had been for - we know we
            can safely access it.
      
         5) With queue locked, we check if ->woken has already been set to
            true (by aio_poll_wake()) and, if it had been, we unlock the
            queue, drop a reference to aio_kiocb and bugger off - at that
            point it's a responsibility to aio_poll_wake() and the stuff
            called/scheduled by it. That code will drop the reference to file
            in req->file, along with the other reference to our aio_kiocb.
      
         6) otherwise, we see whether we need to wait. If we do, we unlock the
            queue, drop one reference to aio_kiocb and go away - eventual
            wakeup (or cancel) will deal with the reference to file and with
            the other reference to aio_kiocb
      
         7) otherwise we remove ourselves from waitqueue (still under the
            queue lock), so that wakeup won't get us. No async activity will
            be happening, so we can safely drop req->file and iocb ourselves.
      
        If wakeup happens while we are in vfs_poll(), we are fine - aio_kiocb
        won't get freed under us, so we can do all the checks and locking
        safely. And we don't touch ->file if we detect that case.
      
        However, vfs_poll() most certainly *does* touch the file it had been
        given. So wakeup coming while we are still in ->poll() might end up
        doing fput() on that file. That case is not too rare, and usually we
        are saved by the still present reference from descriptor table - that
        fput() is not the final one.
      
        But if another thread closes that descriptor right after our fget()
        and wakeup does happen before ->poll() returns, we are in trouble -
        final fput() done while we are in the middle of a method:
      
      Al also wrote a patch to take an extra reference to the file descriptor
      to fix this, but I instead suggested we just streamline the whole file
      pointer handling by submit_io() so that the generic aio submission code
      simply keeps the file pointer around until the aio has completed.
      
      Fixes: bfe4037e ("aio: implement IOCB_CMD_POLL")
      Acked-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Reported-by: syzbot+503d4cc169fcec1cb18c@syzkaller.appspotmail.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d6b2615f
    • Mike Marshall's avatar
      aio: initialize kiocb private in case any filesystems expect it. · 2afa01cd
      Mike Marshall authored
      commit ec51f8ee upstream.
      
      A recent optimization had left private uninitialized.
      
      Fixes: 2bc4ca9b ("aio: don't zero entire aio_kiocb aio_get_req()")
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarMike Marshall <hubcap@omnibond.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2afa01cd
    • Jens Axboe's avatar
      aio: abstract out io_event filler helper · a812f7b6
      Jens Axboe authored
      commit 875736bb upstream.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a812f7b6
    • Jens Axboe's avatar
      aio: split out iocb copy from io_submit_one() · d384f8b8
      Jens Axboe authored
      commit 88a6f18b upstream.
      
      In preparation of handing in iocbs in a different fashion as well. Also
      make it clear that the iocb being passed in isn't modified, by marking
      it const throughout.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d384f8b8
    • Jens Axboe's avatar
      aio: use iocb_put() instead of open coding it · 4d677689
      Jens Axboe authored
      commit 71ebc6fe upstream.
      
      Replace the percpu_ref_put() + kmem_cache_free() with a call to
      iocb_put() instead.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4d677689
    • Jens Axboe's avatar
      aio: don't zero entire aio_kiocb aio_get_req() · ef529eea
      Jens Axboe authored
      commit 2bc4ca9b upstream.
      
      It's 192 bytes, fairly substantial. Most items don't need to be cleared,
      especially not upfront. Clear the ones we do need to clear, and leave
      the other ones for setup when the iocb is prepared and submitted.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ef529eea
    • Christoph Hellwig's avatar
      aio: separate out ring reservation from req allocation · 730198c8
      Christoph Hellwig authored
      commit 432c7997 upstream.
      
      This is in preparation for certain types of IO not needing a ring
      reserveration.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      730198c8