1. 22 Jul, 2018 36 commits
    • net/tcp: Fix socket lookups with SO_BINDTODEVICE · 35e324eb
      David Ahern authored
      [ Upstream commit 8c43bd17 ]
      
      Similar to 69678bcd ("udp: fix SO_BINDTODEVICE"), TCP socket lookups
      need to fail if dev_match is not true. Currently, a packet to a given port
      can match a socket bound to device when it should not. In the VRF case,
      this causes the lookup to hit a VRF socket and not a global socket
      resulting in a response trying to go through the VRF when it should not.
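
      The matching rule described above can be sketched in user-space C.
      This is a simplified model under stated assumptions, not the kernel
      code: struct sock_model, lookup_score() and the score values are
      hypothetical, and only the port and bound-device checks are modeled.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a listening socket. */
struct sock_model {
    unsigned short port;
    int bound_dev_if;   /* 0 means "not bound to any device" */
};

/*
 * Score a candidate socket for a packet that arrived on interface `dif`
 * (with `sdif` as the secondary interface, e.g. a VRF master).
 * Returns -1 when the socket must not match at all.
 */
int lookup_score(const struct sock_model *sk, unsigned short port,
                 int dif, int sdif)
{
    int score = 1;

    if (sk->port != port)
        return -1;

    if (sk->bound_dev_if) {
        bool dev_match = sk->bound_dev_if == dif ||
                         sk->bound_dev_if == sdif;

        /* A device-bound socket must never catch traffic from another
         * device; failing here lets an unbound (global) socket win. */
        if (!dev_match)
            return -1;
        score += 4;
    }
    return score;
}
```

      With this rule, a socket bound to a VRF device is rejected for a
      packet from an unrelated interface, so the global socket on the
      same port is the one selected.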
      
      Fixes: 3fa6f616 ("net: ipv4: add second dif to inet socket lookups")
      Fixes: 4297a0ef ("net: ipv6: add second dif to inet6 socket lookups")
      Reported-by: Lou Berger <lberger@labn.net>
      Diagnosed-by: Renato Westphal <renato@opensourcerouting.org>
      Tested-by: Renato Westphal <renato@opensourcerouting.org>
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: sungem: fix rx checksum support · b3c66b54
      Eric Dumazet authored
      [ Upstream commit 12b03558 ]
      
      After commit 88078d98 ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE
      are friends"), sungem owners reported the infamous "eth0: hw csum failure"
      message.
      
      CHECKSUM_COMPLETE has in fact never worked for this driver, but this
      was masked by the fact that upper stacks had to strip the FCS, and
      therefore skb->ip_summed was set back to CHECKSUM_NONE before
      my recent change.
      
      Driver configures a number of bytes to skip when the chip computes
      the checksum, and for some reason only half of the Ethernet header
      was skipped.
      
      Then a second problem is that we should strip the FCS by default,
      unless the driver is updated to eventually support NETIF_F_RXFCS in
      the future.
      
      Finally, a driver should check if NETIF_F_RXCSUM feature is enabled
      or not, so that the admin can turn off rx checksum if wanted.
      
      Many thanks to Andreas Schwab and Mathieu Malaterre for their
      help in debugging this issue.

      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Meelis Roos <mroos@linux.ee>
      Reported-by: Mathieu Malaterre <malat@debian.org>
      Reported-by: Andreas Schwab <schwab@linux-m68k.org>
      Tested-by: Andreas Schwab <schwab@linux-m68k.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net_sched: blackhole: tell upper qdisc about dropped packets · b36f997a
      Konstantin Khlebnikov authored
      [ Upstream commit 7e85dc8c ]
      
      When blackhole is used on top of a classful qdisc like hfsc it breaks
      qlen and backlog counters because packets disappear without notice.
      
      In HFSC, a non-zero qlen while all classes are inactive triggers a warning:
      WARNING: ... at net/sched/sch_hfsc.c:1393 hfsc_dequeue+0xba4/0xe90 [sch_hfsc]
      and schedules watchdog work endlessly.
      
      This patch returns __NET_XMIT_BYPASS in addition to NET_XMIT_SUCCESS;
      this flag tells the upper layer that the packet is gone and isn't queued.
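
      A minimal user-space sketch of the idea follows. The constants
      XMIT_SUCCESS and XMIT_BYPASS_FLAG merely stand in for
      NET_XMIT_SUCCESS and __NET_XMIT_BYPASS (their numeric values here
      are assumptions), and parent_qdisc is a hypothetical parent that
      counts a packet as queued only on a plain success code.

```c
#define XMIT_SUCCESS      0x00000   /* stands in for NET_XMIT_SUCCESS */
#define XMIT_BYPASS_FLAG  0x20000   /* stands in for __NET_XMIT_BYPASS */

struct parent_qdisc {
    unsigned int qlen;   /* packets the parent believes are queued */
};

/* Child that discards every packet, as blackhole does. */
int blackhole_enqueue(void)
{
    return XMIT_SUCCESS | XMIT_BYPASS_FLAG;
}

/* Parent increments qlen only when the child really queued the packet. */
int parent_enqueue(struct parent_qdisc *p)
{
    int ret = blackhole_enqueue();

    if (ret != XMIT_SUCCESS)
        return ret;      /* packet is gone: leave qlen untouched */
    p->qlen++;
    return ret;
}
```

      Because the return code is no longer a bare success value, the
      parent's qlen stays at zero for packets the child has dropped.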

      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/packet: fix use-after-free · 5e6b4b9b
      Eric Dumazet authored
      [ Upstream commit 945d015e ]
      
      We should put copy_skb in receive_queue only after
      a successful call to virtio_net_hdr_from_skb().
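
      The ordering rule can be illustrated with a small user-space
      model. All names here (struct buf, struct queue, fill_header(),
      receive()) are hypothetical; fill_header() merely stands in for a
      fallible step such as virtio_net_hdr_from_skb().

```c
#include <stddef.h>
#include <stdlib.h>

struct buf { struct buf *next; };
struct queue { struct buf *head; };

/* Hypothetical fallible step; fails when `should_fail` is non-zero. */
int fill_header(int should_fail)
{
    return should_fail ? -1 : 0;
}

/*
 * Returns 0 and queues `copy` on success. On failure, frees `copy`
 * without ever having linked it, so the queue holds no dangling pointer.
 */
int receive(struct queue *q, struct buf *copy, int should_fail)
{
    if (fill_header(should_fail) != 0) {
        free(copy);
        return -1;
    }
    copy->next = q->head;   /* publish only after the fallible step */
    q->head = copy;
    return 0;
}
```

      Had the buffer been linked before the fallible call, the error
      path would free an object that a later queue purge still reaches.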
      
      syzbot report :
      
      BUG: KASAN: use-after-free in __skb_unlink include/linux/skbuff.h:1843 [inline]
      BUG: KASAN: use-after-free in __skb_dequeue include/linux/skbuff.h:1863 [inline]
      BUG: KASAN: use-after-free in skb_dequeue+0x16a/0x180 net/core/skbuff.c:2815
      Read of size 8 at addr ffff8801b044ecc0 by task syz-executor217/4553
      
      CPU: 0 PID: 4553 Comm: syz-executor217 Not tainted 4.18.0-rc1+ #111
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
       print_address_description+0x6c/0x20b mm/kasan/report.c:256
       kasan_report_error mm/kasan/report.c:354 [inline]
       kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
       __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
       __skb_unlink include/linux/skbuff.h:1843 [inline]
       __skb_dequeue include/linux/skbuff.h:1863 [inline]
       skb_dequeue+0x16a/0x180 net/core/skbuff.c:2815
       skb_queue_purge+0x26/0x40 net/core/skbuff.c:2852
       packet_set_ring+0x675/0x1da0 net/packet/af_packet.c:4331
       packet_release+0x630/0xd90 net/packet/af_packet.c:2991
       __sock_release+0xd7/0x260 net/socket.c:603
       sock_close+0x19/0x20 net/socket.c:1186
       __fput+0x35b/0x8b0 fs/file_table.c:209
       ____fput+0x15/0x20 fs/file_table.c:243
       task_work_run+0x1ec/0x2a0 kernel/task_work.c:113
       exit_task_work include/linux/task_work.h:22 [inline]
       do_exit+0x1b08/0x2750 kernel/exit.c:865
       do_group_exit+0x177/0x440 kernel/exit.c:968
       __do_sys_exit_group kernel/exit.c:979 [inline]
       __se_sys_exit_group kernel/exit.c:977 [inline]
       __x64_sys_exit_group+0x3e/0x50 kernel/exit.c:977
       do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x4448e9
      Code: Bad RIP value.
      RSP: 002b:00007ffd5f777ca8 EFLAGS: 00000202 ORIG_RAX: 00000000000000e7
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004448e9
      RDX: 00000000004448e9 RSI: 000000000000fcfb RDI: 0000000000000001
      RBP: 00000000006cf018 R08: 00007ffd0000a45b R09: 0000000000000000
      R10: 00007ffd5f777e48 R11: 0000000000000202 R12: 00000000004021f0
      R13: 0000000000402280 R14: 0000000000000000 R15: 0000000000000000
      
      Allocated by task 4553:
       save_stack+0x43/0xd0 mm/kasan/kasan.c:448
       set_track mm/kasan/kasan.c:460 [inline]
       kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
       kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
       kmem_cache_alloc+0x12e/0x760 mm/slab.c:3554
       skb_clone+0x1f5/0x500 net/core/skbuff.c:1282
       tpacket_rcv+0x28f7/0x3200 net/packet/af_packet.c:2221
       deliver_skb net/core/dev.c:1925 [inline]
       deliver_ptype_list_skb net/core/dev.c:1940 [inline]
       __netif_receive_skb_core+0x1bfb/0x3680 net/core/dev.c:4611
       __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:4693
       netif_receive_skb_internal+0x12e/0x7d0 net/core/dev.c:4767
       netif_receive_skb+0xbf/0x420 net/core/dev.c:4791
       tun_rx_batched.isra.55+0x4ba/0x8c0 drivers/net/tun.c:1571
       tun_get_user+0x2af1/0x42f0 drivers/net/tun.c:1981
       tun_chr_write_iter+0xb9/0x154 drivers/net/tun.c:2009
       call_write_iter include/linux/fs.h:1795 [inline]
       new_sync_write fs/read_write.c:474 [inline]
       __vfs_write+0x6c6/0x9f0 fs/read_write.c:487
       vfs_write+0x1f8/0x560 fs/read_write.c:549
       ksys_write+0x101/0x260 fs/read_write.c:598
       __do_sys_write fs/read_write.c:610 [inline]
       __se_sys_write fs/read_write.c:607 [inline]
       __x64_sys_write+0x73/0xb0 fs/read_write.c:607
       do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 4553:
       save_stack+0x43/0xd0 mm/kasan/kasan.c:448
       set_track mm/kasan/kasan.c:460 [inline]
       __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
       kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
       __cache_free mm/slab.c:3498 [inline]
       kmem_cache_free+0x86/0x2d0 mm/slab.c:3756
       kfree_skbmem+0x154/0x230 net/core/skbuff.c:582
       __kfree_skb net/core/skbuff.c:642 [inline]
       kfree_skb+0x1a5/0x580 net/core/skbuff.c:659
       tpacket_rcv+0x189e/0x3200 net/packet/af_packet.c:2385
       deliver_skb net/core/dev.c:1925 [inline]
       deliver_ptype_list_skb net/core/dev.c:1940 [inline]
       __netif_receive_skb_core+0x1bfb/0x3680 net/core/dev.c:4611
       __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:4693
       netif_receive_skb_internal+0x12e/0x7d0 net/core/dev.c:4767
       netif_receive_skb+0xbf/0x420 net/core/dev.c:4791
       tun_rx_batched.isra.55+0x4ba/0x8c0 drivers/net/tun.c:1571
       tun_get_user+0x2af1/0x42f0 drivers/net/tun.c:1981
       tun_chr_write_iter+0xb9/0x154 drivers/net/tun.c:2009
       call_write_iter include/linux/fs.h:1795 [inline]
       new_sync_write fs/read_write.c:474 [inline]
       __vfs_write+0x6c6/0x9f0 fs/read_write.c:487
       vfs_write+0x1f8/0x560 fs/read_write.c:549
       ksys_write+0x101/0x260 fs/read_write.c:598
       __do_sys_write fs/read_write.c:610 [inline]
       __se_sys_write fs/read_write.c:607 [inline]
       __x64_sys_write+0x73/0xb0 fs/read_write.c:607
       do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The buggy address belongs to the object at ffff8801b044ecc0
       which belongs to the cache skbuff_head_cache of size 232
      The buggy address is located 0 bytes inside of
       232-byte region [ffff8801b044ecc0, ffff8801b044eda8)
      The buggy address belongs to the page:
      page:ffffea0006c11380 count:1 mapcount:0 mapping:ffff8801d9be96c0 index:0x0
      flags: 0x2fffc0000000100(slab)
      raw: 02fffc0000000100 ffffea0006c17988 ffff8801d9bec248 ffff8801d9be96c0
      raw: 0000000000000000 ffff8801b044e040 000000010000000c 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff8801b044eb80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       ffff8801b044ec00: 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc
      >ffff8801b044ec80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                                                 ^
       ffff8801b044ed00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff8801b044ed80: fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc
      
      Fixes: 58d19b19 ("packet: vnet_hdr support for tpacket_rcv")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: mvneta: fix the Rx desc DMA address in the Rx path · ddbbd3e0
      Antoine Tenart authored
      [ Upstream commit 271f7ff5 ]
      
      When using s/w buffer management, buffers are allocated and DMA mapped.
      When doing so on an arm64 platform, an offset correction is applied on
      the DMA address, before storing it in an Rx descriptor. The issue is
      that this DMA address is then used later in the Rx path without removing
      the offset correction. Thus the DMA address is wrong, which can lead to
      various issues.
      
      This patch fixes this by removing the offset correction from the DMA
      address retrieved from the Rx descriptor before using it in the Rx path.
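
      As a rough illustration, the store and retrieve sides must be
      symmetric. In this sketch the correction is modeled as a constant
      that is added on store; the constant's value and all names
      (RX_DMA_OFFSET_CORRECTION, rx_desc_fill(), rx_desc_mapped_addr())
      are assumptions made for the example, not the driver's code.

```c
#include <stdint.h>

/* Hypothetical correction applied before the address is stored. */
#define RX_DMA_OFFSET_CORRECTION 64u

struct rx_desc {
    uint64_t buf_phys_addr;   /* corrected address as seen by hardware */
};

/* Store side: apply the correction to the mapped DMA address. */
void rx_desc_fill(struct rx_desc *d, uint64_t dma_addr)
{
    d->buf_phys_addr = dma_addr + RX_DMA_OFFSET_CORRECTION;
}

/* Rx path: undo the correction to recover the address that was mapped. */
uint64_t rx_desc_mapped_addr(const struct rx_desc *d)
{
    return d->buf_phys_addr - RX_DMA_OFFSET_CORRECTION;
}
```

      Any later sync or unmap call should use the value returned by the
      second helper, not the raw descriptor field.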
      
      Fixes: 8d5047cf ("net: mvneta: Convert to be 64 bits compatible")
      Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/mlx5: Fix wrong size allocation for QoS ETC TC regitster · 7ae129dd
      Shay Agroskin authored
      [ Upstream commit d14fcb8d ]
      
      The driver allocates the wrong size (due to a wrong struct name) when
      issuing a query/set request to the NIC's register.
      
      Fixes: d8880795 ("net/mlx5e: Implement DCBNL IEEE max rate")
      Signed-off-by: Shay Agroskin <shayag@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/mlx5: Fix required capability for manipulating MPFS · 46ff2bc7
      Eli Cohen authored
      [ Upstream commit f8119804 ]
      
      Manipulating the MPFS requires eswitch manager capabilities.
      
      Fixes: eeb66cdb ('net/mlx5: Separate between E-Switch and MPFS')
      Signed-off-by: Eli Cohen <eli@mellanox.com>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/mlx5: Fix incorrect raw command length parsing · 8b7b5f76
      Alex Vesker authored
      [ Upstream commit 603b7bcf ]
      
      The terminating NUL character was not set correctly for the string
      containing the command length; this caused failures reading the output
      of the command due to a random length. The fix is to initialize the
      output length string.
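
      The general pattern, parsing a number from bytes that are not
      guaranteed to be terminated, can be sketched as follows. This is
      user-space C under stated assumptions: parse_outlen() and its
      buffer size are hypothetical and do not reproduce the driver code.

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse a decimal length from `len` raw bytes that are NOT guaranteed
 * to be NUL-terminated. Returns 0 on success, -1 on bad input.
 */
int parse_outlen(const char *raw, size_t len, unsigned int *out)
{
    char outlen_str[8] = {0};   /* zero-initialized: always terminated */

    /* Leave room for the terminator supplied by the initializer. */
    if (len > sizeof(outlen_str) - 1)
        return -1;
    memcpy(outlen_str, raw, len);

    if (sscanf(outlen_str, "%u", out) != 1)
        return -1;
    return 0;
}
```

      Without the zero initialization, sscanf() could read stale bytes
      past the copied digits and return an arbitrary length.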
      
      Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
      Signed-off-by: Alex Vesker <valex@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/mlx5: Fix command interface race in polling mode · 075b5038
      Alex Vesker authored
      [ Upstream commit d412c31d ]
      
      The command interface can work in two modes: Events and Polling.
      In the general case, each time we invoke a command, a work is
      queued to handle it.
      
      When working in events, the interrupt handler completes the
      command execution. On the other hand, when working in polling
      mode, the work itself completes it.
      
      Due to a bug in the work handler, a command could have been
      completed by the interrupt handler while the work handler
      hadn't finished yet, causing it to be completed once again
      if the command interface mode was changed from Events to
      Polling after the interrupt handler was called.
      
      mlx5_unload_one()
              mlx5_stop_eqs()
                      // Destroy the EQ before cmd EQ
                      ...cmd_work_handler()
                              write_doorbell()
                              --> EVENT_TYPE_CMD
                                      mlx5_cmd_comp_handler() // First free
                                              free_ent(cmd, ent->idx)
                                              complete(&ent->done)
      
              <-- mlx5_stop_eqs //cmd was complete
                      // move to polling before destroying the last cmd EQ
                      mlx5_cmd_use_polling()
                              cmd->mode = POLL;
      
                      --> cmd_work_handler (continues)
                              if (cmd->mode == POLL)
                                      mlx5_cmd_comp_handler() // Double free
      
      The solution is to store the cmd->mode before writing the doorbell.
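
      The fix can be modeled deterministically in user-space C. The
      sketch below is an assumption-laden illustration, not the mlx5
      code: the after_doorbell hook simulates the interrupt completing
      the command and the mode switch happening while the work handler
      is still running, and every name is hypothetical.

```c
enum cmd_mode { CMD_MODE_EVENTS, CMD_MODE_POLLING };

struct cmd {
    enum cmd_mode mode;
    int completions;                       /* times the entry completed */
    void (*after_doorbell)(struct cmd *);  /* simulates concurrent work */
};

void comp_handler(struct cmd *c)
{
    c->completions++;
}

/* Simulated race: the interrupt completes the command, then the
 * interface is switched to polling before the work handler resumes. */
void irq_then_switch_to_polling(struct cmd *c)
{
    comp_handler(c);
    c->mode = CMD_MODE_POLLING;
}

void work_handler(struct cmd *c)
{
    /* Snapshot the mode BEFORE ringing the doorbell. */
    enum cmd_mode mode_at_submit = c->mode;

    if (c->after_doorbell)                 /* "write doorbell" */
        c->after_doorbell(c);

    /* Decide on the snapshot, not on the possibly changed c->mode. */
    if (mode_at_submit == CMD_MODE_POLLING)
        comp_handler(c);
}
```

      Reading c->mode after the doorbell instead of the snapshot would
      complete the same entry twice in the simulated sequence.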
      
      Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
      Signed-off-by: Alex Vesker <valex@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/mlx5: E-Switch, Avoid setup attempt if not being e-switch manager · c3994f4f
      Or Gerlitz authored
      [ Upstream commit 0efc8562 ]
      
      In smartnic env, the host (PF) driver might not be an e-switch
      manager, hence the FW will err on driver attempts to deal with
      setting/unsetting the eswitch and as a result the overall setup
      of sriov will fail.
      
      Fix that by avoiding the operation if e-switch management is not
      allowed for this driver instance. While here, move to use the
      correct name for the esw manager capability name.
      
      Fixes: 81848731 ('net/mlx5: E-Switch, Add SR-IOV (FDB) support')
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Reported-by: Guy Kushnir <guyk@mellanox.com>
      Reviewed-by: Eli Cohen <eli@melloanox.com>
      Tested-by: Eli Cohen <eli@melloanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/mlx5e: Don't attempt to dereference the ppriv struct if not being eswitch manager · b216867c
      Or Gerlitz authored
      [ Upstream commit 8ffd569a ]
      
      The check for cpu hit statistics was not immediately returning false
      for any non vport rep netdev, and hence we crashed (say on mlx5 probed
      VFs) if a user-space tool was calling into any possible netdev in the
      system.
      
      Fix that by doing a proper check before dereferencing.
      
      Fixes: 1d447a39 ('net/mlx5e: Extendable vport representor netdev private data')
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Reported-by: Eli Cohen <eli@melloanox.com>
      Reviewed-by: Eli Cohen <eli@melloanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net/mlx5e: Avoid dealing with vport representors if not being e-switch manager · 1d8dda44
      Or Gerlitz authored
      [ Upstream commit 733d3e54 ]
      
      In smartnic env, the host (PF) driver might not be an e-switch
      manager, hence the switchdev mode representors are running on
      the embedded cpu (EC) and not at the host.
      
      As such, we should avoid dealing with vport representors if
      not being esw manager.
      
      While here, make sure to disallow eswitch switchdev related
      setups through devlink if we are not esw managers.
      
      Fixes: cb67b832 ('net/mlx5e: Introduce SRIOV VF representors')
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Reviewed-by: Eli Cohen <eli@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: macb: Fix ptp time adjustment for large negative delta · f389c17b
      Harini Katakam authored
      [ Upstream commit 64d7839a ]
      
      When delta passed to gem_ptp_adjtime is negative, the sign is
      maintained in the ns_to_timespec64 conversion. Hence timespec_add
      should be used directly. timespec_sub will just subtract the negative
      value thus increasing the time difference.
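
      The arithmetic can be checked with a small user-space model. The
      helpers below (struct ts64, ns_to_ts64(), ts64_add(),
      adjtime_model()) are hypothetical stand-ins written for this
      example; they assume a normalized representation in which the
      nanosecond field is non-negative and the sign lives in the
      seconds field.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

struct ts64 {
    int64_t sec;
    int64_t nsec;   /* kept in [0, NSEC_PER_SEC) */
};

/* Signed nanoseconds to a normalized timestamp; the sign lives in sec. */
struct ts64 ns_to_ts64(int64_t ns)
{
    struct ts64 t = { ns / NSEC_PER_SEC, ns % NSEC_PER_SEC };

    if (t.nsec < 0) {
        t.sec--;
        t.nsec += NSEC_PER_SEC;
    }
    return t;
}

/* Add two normalized timestamps, carrying nanoseconds into seconds. */
struct ts64 ts64_add(struct ts64 a, struct ts64 b)
{
    struct ts64 r = { a.sec + b.sec, a.nsec + b.nsec };

    if (r.nsec >= NSEC_PER_SEC) {
        r.sec++;
        r.nsec -= NSEC_PER_SEC;
    }
    return r;
}

/* Adjust `now` by a signed delta: always add, never subtract. */
struct ts64 adjtime_model(struct ts64 now, int64_t delta_ns)
{
    return ts64_add(now, ns_to_ts64(delta_ns));
}
```

      Since the converted delta already carries its sign, a single add
      handles both directions; subtracting a negative delta would move
      the clock forward instead of back.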

      Signed-off-by: Harini Katakam <harini.katakam@xilinx.com>
      Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: fix use-after-free in GRO with ESP · b364a914
      Sabrina Dubroca authored
      [ Upstream commit 603d4cf8 ]
      
      Since the addition of GRO for ESP, gro_receive can consume the skb and
      return -EINPROGRESS. In that case, the lower layer GRO handler cannot
      touch the skb anymore.
      
      Commit 5f114163 ("net: Add a skb_gro_flush_final helper.") converted
      some of the gro_receive handlers that can lead to ESP's gro_receive so
      that they wouldn't access the skb when -EINPROGRESS is returned, but
      missed other spots, mainly in tunneling protocols.
      
      This patch finishes the conversion to using skb_gro_flush_final(), and
      adds a new helper, skb_gro_flush_final_remcsum(), used in VXLAN and
      GUE.
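
      The guard such a helper provides can be sketched in user-space C.
      This is a model only: gro_skb and gro_flush_final_model() are
      hypothetical, and a plain integer status stands in for the
      inner handler's result.

```c
#include <errno.h>

struct gro_skb {
    int flush;   /* accumulated "flush this packet" flag */
};

/*
 * `status` stands in for the result of the inner gro_receive handler:
 * -EINPROGRESS means that handler consumed the skb.
 */
void gro_flush_final_model(struct gro_skb *skb, int status, int flush)
{
    if (status != -EINPROGRESS)
        skb->flush |= flush;   /* safe: the caller still owns the skb */
    /* otherwise the skb is gone and must not be touched */
}
```

      Routing every final flush update through one helper keeps the
      ownership check in a single place for all tunneling handlers.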
      
      Fixes: 5f114163 ("net: Add a skb_gro_flush_final helper.")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: dccp: switch rx_tstamp_last_feedback to monotonic clock · fb6b1466
      Eric Dumazet authored
      [ Upstream commit 0ce4e70f ]
      
      To compute delays, better not use time of the day which can
      be changed by admins or malicious programs.
      
      Also change ccid3_first_li() to use s64 type for delta variable
      to avoid potential overflows.
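
      A user-space analogue of both points, using the POSIX monotonic
      clock and a signed 64-bit delta, might look like this. The helper
      names ts_delta_us() and monotonic_now() are hypothetical; the
      kernel uses its own ktime API rather than clock_gettime().

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Microseconds from `earlier` to `later`, as a signed 64-bit value. */
int64_t ts_delta_us(struct timespec earlier, struct timespec later)
{
    int64_t ns = ((int64_t)later.tv_sec - (int64_t)earlier.tv_sec) * 1000000000LL
               + ((int64_t)later.tv_nsec - (int64_t)earlier.tv_nsec);

    return ns / 1000;
}

/* Read the monotonic clock, which wall-clock adjustments do not affect. */
struct timespec monotonic_now(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts;
}
```

      Two successive monotonic_now() readings never yield a negative
      delta, which is not guaranteed for time-of-day readings.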

      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Cc: dccp@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • net: dccp: avoid crash in ccid3_hc_rx_send_feedback() · a3225a83
      Eric Dumazet authored
      [ Upstream commit 74174fe5 ]
      
      On fast hosts or malicious bots, we trigger a DCCP_BUG() which
      seems excessive.
      
      syzbot reported :
      
      BUG: delta (-6195) <= 0 at net/dccp/ccids/ccid3.c:628/ccid3_hc_rx_send_feedback()
      CPU: 1 PID: 18 Comm: ksoftirqd/1 Not tainted 4.18.0-rc1+ #112
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
       ccid3_hc_rx_send_feedback net/dccp/ccids/ccid3.c:628 [inline]
       ccid3_hc_rx_packet_recv.cold.16+0x38/0x71 net/dccp/ccids/ccid3.c:793
       ccid_hc_rx_packet_recv net/dccp/ccid.h:185 [inline]
       dccp_deliver_input_to_ccids+0xf0/0x280 net/dccp/input.c:180
       dccp_rcv_established+0x87/0xb0 net/dccp/input.c:378
       dccp_v4_do_rcv+0x153/0x180 net/dccp/ipv4.c:654
       sk_backlog_rcv include/net/sock.h:914 [inline]
       __sk_receive_skb+0x3ba/0xd80 net/core/sock.c:517
       dccp_v4_rcv+0x10f9/0x1f58 net/dccp/ipv4.c:875
       ip_local_deliver_finish+0x2eb/0xda0 net/ipv4/ip_input.c:215
       NF_HOOK include/linux/netfilter.h:287 [inline]
       ip_local_deliver+0x1e9/0x750 net/ipv4/ip_input.c:256
       dst_input include/net/dst.h:450 [inline]
       ip_rcv_finish+0x823/0x2220 net/ipv4/ip_input.c:396
       NF_HOOK include/linux/netfilter.h:287 [inline]
       ip_rcv+0xa18/0x1284 net/ipv4/ip_input.c:492
       __netif_receive_skb_core+0x2488/0x3680 net/core/dev.c:4628
       __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:4693
       process_backlog+0x219/0x760 net/core/dev.c:5373
       napi_poll net/core/dev.c:5771 [inline]
       net_rx_action+0x7da/0x1980 net/core/dev.c:5837
       __do_softirq+0x2e8/0xb17 kernel/softirq.c:284
       run_ksoftirqd+0x86/0x100 kernel/softirq.c:645
       smpboot_thread_fn+0x417/0x870 kernel/smpboot.c:164
       kthread+0x345/0x410 kernel/kthread.c:240
       ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:412

      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Cc: dccp@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ixgbe: split XDP_TX tail and XDP_REDIRECT map flushing · a2e53d69
      Jesper Dangaard Brouer authored
      [ Upstream commit ad088ec4 ]
      
      The driver was combining the XDP_TX tail flush and XDP_REDIRECT
      map flushing (xdp_do_flush_map).  This is suboptimal; these two
      flush operations should be kept separate.
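
      One way to keep the two flushes separate is to record, per poll
      loop, which kinds of XDP results occurred and then act on each
      kind independently. The sketch below is a user-space model; the
      bit names, struct flush_stats and finish_poll() are hypothetical
      and the counters only stand in for the real flush operations.

```c
/* Hypothetical per-packet result bits. */
#define XDP_RES_TX     (1u << 1)
#define XDP_RES_REDIR  (1u << 2)

struct flush_stats {
    int tail_bumps;    /* stands in for the XDP_TX tail update */
    int map_flushes;   /* stands in for xdp_do_flush_map() */
};

/* Run once at the end of a poll loop over `n` per-packet results. */
void finish_poll(const unsigned int *results, int n, struct flush_stats *st)
{
    unsigned int seen = 0;

    for (int i = 0; i < n; i++)
        seen |= results[i];

    if (seen & XDP_RES_REDIR)
        st->map_flushes++;

    if (seen & XDP_RES_TX)
        st->tail_bumps++;
}
```

      A poll loop that only transmitted locally then skips the redirect
      map flush, and one that only redirected skips the tail update.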
      
      Fixes: 11393cc9 ("xdp: Add batching support to redirect map")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ipvlan: fix IFLA_MTU ignored on NEWLINK · f5a42d63
      Xin Long authored
      [ Upstream commit 30877961 ]
      
      Commit 296d4856 ("ipvlan: inherit MTU from master device") adjusted
      the mtu from the master device when creating an ipvlan device, but it
      would also override the mtu value set in rtnl_create_link. This causes
      the IFLA_MTU param not to take effect.
      
      So this patch does not adjust the mtu if the IFLA_MTU param is set
      when creating an ipvlan device.
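
      The resulting decision is simple enough to state as a function.
      This is a hypothetical user-space model (ipvlan_initial_mtu() is
      not a kernel function); user_mtu_set models "the IFLA_MTU
      attribute was supplied".

```c
#include <stdbool.h>

/* Pick the MTU a newly created device should start with. */
unsigned int ipvlan_initial_mtu(bool user_mtu_set, unsigned int user_mtu,
                                unsigned int master_mtu)
{
    if (user_mtu_set)
        return user_mtu;     /* keep the value requested at creation */
    return master_mtu;       /* otherwise inherit from the master */
}
```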
      
      Fixes: 296d4856 ("ipvlan: inherit MTU from master device")
      Reported-by: Jianlin Shi <jishi@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ipv6: sr: fix passing wrong flags to crypto_alloc_shash() · d10c0baa
      Eric Biggers authored
      [ Upstream commit fc9c2029 ]
      
      The 'mask' argument to crypto_alloc_shash() uses the CRYPTO_ALG_* flags,
      not 'gfp_t'.  So don't pass GFP_KERNEL to it.
      
      Fixes: bf355b8d ("ipv6: sr: add core files for SR HMAC support")
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • hv_netvsc: split sub-channel setup into async and sync · e34e92d8
      Stephen Hemminger authored
      [ Upstream commit 3ffe64f1 ]
      
      When doing device hotplug the sub channel must be async to avoid
      deadlock issues because device is discovered in softirq context.
      
      When doing changes to MTU and number of channels, the setup
      must be synchronous to avoid races such as when MTU and device
      settings are done in a single ip command.

      Reported-by: Thomas Walker <Thomas.Walker@twosigma.com>
      Fixes: 8195b139 ("hv_netvsc: fix deadlock on hotplug")
      Fixes: 732e4985 ("netvsc: fix race on sub channel creation")
      Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • atm: zatm: Fix potential Spectre v1 · 43c9207d
      Gustavo A. R. Silva authored
      [ Upstream commit ced9e191 ]
      
      pool can be indirectly controlled by user-space, hence leading to
      a potential exploitation of the Spectre variant 1 vulnerability.
      
      This issue was detected with the help of Smatch:
      
      drivers/atm/zatm.c:1491 zatm_ioctl() warn: potential spectre issue
      'zatm_dev->pool_info' (local cap)
      
      Fix this by sanitizing pool before using it to index
      zatm_dev->pool_info
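
      The kind of sanitizing meant here is a branch-free clamp of the
      index. The sketch below shows only the arithmetic, in user-space
      C with hypothetical names (index_mask(), index_nospec_model()).
      It assumes that converting to a signed long wraps and that right
      shifts of negative values are arithmetic, as on common compilers,
      and it is not by itself a speculation barrier.

```c
#include <limits.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * All-ones when index < size, zero otherwise, computed without a branch.
 * Assumes arithmetic right shift of negative signed values.
 */
unsigned long index_mask(unsigned long index, unsigned long size)
{
    return (unsigned long)(~(long)(index | (size - 1UL - index))
                           >> (BITS_PER_ULONG - 1));
}

/* In-range indices pass through; out-of-range indices collapse to 0. */
unsigned long index_nospec_model(unsigned long index, unsigned long size)
{
    return index & index_mask(index, size);
}
```

      An out-of-range pool value therefore indexes element 0 even on a
      mispredicted path, instead of an attacker-chosen address.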
      
      Notice that given that speculation windows are large, the policy is
      to kill the speculation on the first load and not worry if it can be
      completed with a dependent load/store [1].
      
      [1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2

      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • atm: Preserve value of skb->truesize when accounting to vcc · f93d6593
      David Woodhouse authored
      [ Upstream commit 9bbe60a6 ]
      
      ATM accounts for in-flight TX packets in sk_wmem_alloc of the VCC on
      which they are to be sent. But it doesn't take ownership of those
      packets from the sock (if any) which originally owned them. They should
      remain owned by their actual sender until they've left the box.
      
      There's a hack in pskb_expand_head() to avoid adjusting skb->truesize
      for certain skbs, precisely to avoid messing up sk_wmem_alloc
      accounting. Ideally that hack would cover the ATM use case too, but it
      doesn't — skbs which aren't owned by any sock, for example PPP control
      frames, still get their truesize adjusted when the low-level ATM driver
      adds headroom.
      
      This has always been an issue, it seems. The truesize of a packet
      increases, and sk_wmem_alloc on the VCC goes negative. But this wasn't
      for normal traffic, only for control frames. So I think we just got away
      with it, and we probably needed to send 2GiB of LCP echo frames before
      the misaccounting would ever have caused a problem and caused
      atm_may_send() to start refusing packets.
      
      Commit 14afee4b ("net: convert sock.sk_wmem_alloc from atomic_t to
      refcount_t") did exactly what it was intended to do, and turned this
      mostly-theoretical problem into a real one, causing PPPoATM to fail
      immediately as sk_wmem_alloc underflows and atm_may_send() *immediately*
      starts refusing to allow new packets.
      
      The least intrusive solution to this problem is to stash the value of
      skb->truesize that was accounted to the VCC, in a new member of the
      ATM_SKB(skb) structure. Then in atm_pop_raw() subtract precisely that
      value instead of the then-current value of skb->truesize.
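
      The stash-and-subtract scheme can be modeled in a few lines of
      user-space C. The structures and the acct_truesize field name are
      assumptions made for this sketch, not the kernel definitions.

```c
struct vcc_model {
    long wmem_alloc;              /* bytes charged to the VCC */
};

struct skb_model {
    unsigned int truesize;        /* may grow while the skb is in flight */
    unsigned int acct_truesize;   /* hypothetical stash of the charged size */
};

/* On send: remember exactly what was charged to the VCC. */
void account_tx(struct vcc_model *vcc, struct skb_model *skb)
{
    skb->acct_truesize = skb->truesize;
    vcc->wmem_alloc += skb->acct_truesize;
}

/* On completion: uncharge the stashed value, not the current truesize. */
void pop_tx(struct vcc_model *vcc, const struct skb_model *skb)
{
    vcc->wmem_alloc -= skb->acct_truesize;
}
```

      Even if a driver enlarges the buffer between the two calls, the
      counter returns to its starting value instead of underflowing.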
      
      Fixes: 158f323b ("net: adjust skb->truesize in pskb_expand_head()")
      Signed-off-by: David Woodhouse <dwmw2@infradead.org>
      Tested-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • alx: take rtnl before calling __alx_open from resume · c62e2f08
      Sabrina Dubroca authored
      [ Upstream commit bc800e8b ]
      
      The __alx_open function can be called from ndo_open, which is called
      under RTNL, or from alx_resume, which isn't. Since commit d768319c,
      we're calling the netif_set_real_num_{tx,rx}_queues functions, which
      need to be called under RTNL.
      
      This is similar to commit 0c2cc02e ("igb: Move the calls to set the
      Tx and Rx queues into igb_open").
      
      Fixes: d768319c ("alx: enable multiple tx queues")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • crypto: crypto4xx - fix crypto4xx_build_pdr, crypto4xx_build_sdr leak · 03bb9187
      Christian Lamparter authored
      commit 5d59ad6e upstream.
      
      If one of the later memory allocations in crypto4xx_build_pdr()
      fails, dev->pdr and/or dev->pdr_uinfo are not freed.
      
      crypto4xx_build_sdr() has the same issue with dev->sdr.
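
      The general shape of such a fix is to unwind earlier allocations when a
      later one fails. The following is a minimal Python sketch of that
      pattern, not the driver code: build_rings(), FakeAllocator and the
      resource name "later_buffer" are hypothetical and exist only for
      illustration.

```python
def build_rings(alloc, free, names):
    """Allocate each named resource in order; on failure free the earlier ones."""
    acquired = []
    try:
        for name in names:
            acquired.append(alloc(name))
    except MemoryError:
        # Unwind in reverse order so nothing allocated earlier is leaked.
        for res in reversed(acquired):
            free(res)
        raise
    return acquired


class FakeAllocator:
    """Toy allocator that fails for one chosen name and tracks live resources."""

    def __init__(self, fail_on=None):
        self.fail_on = fail_on
        self.live = set()

    def alloc(self, name):
        if name == self.fail_on:
            raise MemoryError(name)
        self.live.add(name)
        return name

    def free(self, name):
        self.live.discard(name)
```

      With an allocator that fails on the last name, the error still reaches
      the caller but the set of live resources ends up empty.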
      Signed-off-by: Christian Lamparter <chunkeey@googlemail.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      03bb9187
    • Christian Lamparter's avatar
      crypto: crypto4xx - remove bad list_del · 996a6a39
      Christian Lamparter authored
      commit a728a196 upstream.
      
      alg entries are only added to the list after the registration
      has succeeded. If the registration failed, the entry was never added
      to the list in the first place, so there is nothing to list_del().
      Signed-off-by: Christian Lamparter <chunkeey@googlemail.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      996a6a39
    • Jaehoon Chung's avatar
      PCI: exynos: Fix a potential init_clk_resources NULL pointer dereference · dc3782a3
      Jaehoon Chung authored
      commit b5d6bc90 upstream.
      
      In order to avoid triggering a NULL pointer dereference in
      exynos_pcie_probe() a check must be put in place to detect if
      the init_clk_resources hook is initialized before calling it.
      
      Add the respective function pointer check in exynos_pcie_probe().
      Signed-off-by: Jaehoon Chung <jh80.chung@samsung.com>
      [lorenzo.pieralisi@arm.com: rewrote the commit log]
      Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      dc3782a3
    • Jonas Gorski's avatar
      bcm63xx_enet: do not write to random DMA channel on BCM6345 · b1c3ce0c
      Jonas Gorski authored
      commit d6213c1f upstream.
      
      The DMA controller regs actually point to DMA channel 0, so the write to
      ENETDMA_CFG_REG will actually modify a random DMA channel.
      
      Since DMA controller registers do not exist on BCM6345, guard the write
      with the usual check for dma_has_sram.
      Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b1c3ce0c
    • Jonas Gorski's avatar
      bcm63xx_enet: correct clock usage · b913a05a
      Jonas Gorski authored
      commit 9c86b846 upstream.
      
      Check the return code of prepare_enable and change one last instance of
      enable only to prepare_enable. Also properly disable and release the
      clock in error paths and on remove for enetsw.
      Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b913a05a
    • alex chen's avatar
      ocfs2: ip_alloc_sem should be taken in ocfs2_get_block() · 1ccab2bf
      alex chen authored
      commit 3e4c56d4 upstream.
      
      ip_alloc_sem should be taken in ocfs2_get_block() when reading a file in
      DIRECT mode to prevent concurrent access to the extent tree with
      ocfs2_dio_end_io_write(), which may trigger a BUG_ON in the following
      situation:
      
      read file 'A'                                  end_io of writing file 'A'
      vfs_read
       __vfs_read
        ocfs2_file_read_iter
         generic_file_read_iter
          ocfs2_direct_IO
           __blockdev_direct_IO
            do_blockdev_direct_IO
             do_direct_IO
              get_more_blocks
               ocfs2_get_block
                ocfs2_extent_map_get_blocks
                 ocfs2_get_clusters
                  ocfs2_get_clusters_nocache()
                   ocfs2_search_extent_list
                    return the index of record which
                    contains the v_cluster, that is
                    v_cluster > rec[i]->e_cpos.
                                                      ocfs2_dio_end_io
                                                       ocfs2_dio_end_io_write
                                                        down_write(&oi->ip_alloc_sem);
                                                        ocfs2_mark_extent_written
                                                         ocfs2_change_extent_flag
                                                          ocfs2_split_extent
                                                           ...
                                                       --> modify the rec[i]->e_cpos, resulting
                                                           in v_cluster < rec[i]->e_cpos.
                   BUG_ON(v_cluster < le32_to_cpu(rec->e_cpos))
      
      [alex.chen@huawei.com: v3]
        Link: http://lkml.kernel.org/r/59EF3614.6050008@huawei.com
      Link: http://lkml.kernel.org/r/59EF3614.6050008@huawei.com
      Fixes: c15471f7 ("ocfs2: fix sparse file & data ordering issue in direct io")
      Signed-off-by: Alex Chen <alex.chen@huawei.com>
      Reviewed-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
      Reviewed-by: Gang He <ghe@suse.com>
      Acked-by: Changwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Salvatore Bonaccorso <carnil@debian.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1ccab2bf
    • alex chen's avatar
      ocfs2: subsystem.su_mutex is required while accessing the item->ci_parent · c59a8f13
      alex chen authored
      commit 853bc26a upstream.
      
      The subsystem.su_mutex is required while accessing item->ci_parent;
      otherwise a NULL pointer dereference of item->ci_parent can be
      triggered in the following situation:
      
      add node                     delete node
      sys_write
       vfs_write
        configfs_write_file
         o2nm_node_store
          o2nm_node_local_write
                                   do_rmdir
                                    vfs_rmdir
                                     configfs_rmdir
                                      mutex_lock(&subsys->su_mutex);
                                      unlink_obj
                                       item->ci_group = NULL;
                                       item->ci_parent = NULL;
      	 to_o2nm_cluster_from_node
      	  node->nd_item.ci_parent->ci_parent
      	  BUG because of a NULL pointer dereference of nd_item.ci_parent
      
      Moreover, the o2nm_cluster also should be protected by the
      subsystem.su_mutex.
      
      [alex.chen@huawei.com: v2]
        Link: http://lkml.kernel.org/r/59EEAA69.9080703@huawei.com
      Link: http://lkml.kernel.org/r/59E9B36A.10700@huawei.com
      Signed-off-by: Alex Chen <alex.chen@huawei.com>
      Reviewed-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Salvatore Bonaccorso <carnil@debian.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c59a8f13
    • Chuck Lever's avatar
      xprtrdma: Fix corner cases when handling device removal · f5778c2d
      Chuck Lever authored
      commit 25524288 upstream.
      
      Michal Kalderon has found some corner cases around device unload
      with active NFS mounts that I didn't have the imagination to test
      when xprtrdma device removal was added last year.
      
      - The ULP device removal handler is responsible for deallocating
        the PD. That wasn't clear to me initially, and my own testing
        suggested it was not necessary, but that is incorrect.
      
      - The transport destruction path can no longer assume that there
        is a valid ID.
      
      - When destroying a transport, ensure that ib_free_cq() is not
        invoked on a CQ that was already released.
      Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com>
      Fixes: bebd0318 ("xprtrdma: Support unplugging an HCA from ...")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f5778c2d
    • Prashanth Prakash's avatar
      cpufreq / CPPC: Set platform specific transition_delay_us · 1083a7e8
      Prashanth Prakash authored
      commit d4f3388a upstream.
      
      Add support to specify platform specific transition_delay_us instead
      of using the transition delay derived from PCC.
      
      With commit 3d41386d (cpufreq: CPPC: Use transition_delay_us
      depending transition_latency) we are setting transition_delay_us
      directly and not applying the LATENCY_MULTIPLIER. Because of that,
      on Qualcomm Centriq we can end up with a very high rate of frequency
      change requests when using the schedutil governor (default
      rate_limit_us=10 compared to an earlier value of 10000).
      
      The PCC subspace describes the rate at which the platform can accept
      commands on the CPPC's PCC channel. This includes read and write
      commands on the PCC channel that can be used for reasons other than
      frequency transitions. Moreover the same PCC subspace can be used by
      multiple freq domains and deriving transition_delay_us from it as we
      do now can be sub-optimal.
      
      Moreover, if a platform does not use PCC for the desired_perf register then
      there is no way to compute the transition latency or the delay_us.
      
      CPPC does not have a standard defined mechanism to get the transition
      rate or the latency at the moment.
      
      Given the above limitations, it is simpler to have a platform specific
      transition_delay_us and rely on PCC derived value only if a platform
      specific value is not available.
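
      The selection rule in the last paragraph amounts to a simple fallback.
      A minimal Python sketch follows, assuming the platform-specific value is
      modelled as an optional integer; the function name and parameter names
      are hypothetical and do not come from the driver.

```python
from typing import Optional


def pick_transition_delay_us(platform_delay_us: Optional[int],
                             pcc_derived_delay_us: int) -> int:
    """Prefer a platform-specific delay; fall back to the PCC-derived value."""
    if platform_delay_us is not None:
        return platform_delay_us
    return pcc_derived_delay_us
```

      Using the two figures quoted above, a platform that supplies 10000 keeps
      10000, while a platform with no override gets the PCC-derived 10.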
      Signed-off-by: Prashanth Prakash <pprakash@codeaurora.org>
      Cc: 4.14+ <stable@vger.kernel.org> # 4.14+
      Fixes: 3d41386d (cpufreq: CPPC: Use transition_delay_us depending transition_latency)
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1083a7e8
    • Filipe Manana's avatar
      Btrfs: fix duplicate extents after fsync of file with prealloc extents · 61a9f6b7
      Filipe Manana authored
      commit 31d11b83 upstream.
      
      In commit 471d557a ("Btrfs: fix loss of prealloc extents past i_size
      after fsync log replay"), on fsync, we started to always log all prealloc
      extents beyond an inode's i_size in order to avoid losing them after a
      power failure. However, in some cases this can lead the log replay
      code to create duplicate extent items, with different lengths, in the
      extent tree. That happens because, as of that commit, we can now log
      extent items based on extent maps that are not on the "modified" list
      of extent maps of the inode's extent map tree. Logging extent items based
      on extent maps is used during the fast fsync path to save time and for
      this to work reliably it requires that the extent maps are not merged
      with other adjacent extent maps - having the extent maps in the list
      of modified extents gives such guarantee.
      
      Consider the following example, captured during a long run of fsstress,
      which illustrates this problem.
      
      We have inode 271, in the filesystem tree (root 5), to which all of the
      following operations and discussion apply.
      
      A buffered write starts at offset 312391 with a length of 933471 bytes
      (end offset at 1245862). At this point we have, for this inode, the
      following extent maps with their field values:
      
      em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
            block_len 0, orig_block_len 0
      em B, start 40960, orig_start 40960, len 376832, block_start 1106399232,
            block_len 376832, orig_block_len 376832
      em C, start 417792, orig_start 417792, len 782336, block_start
            18446744073709551613, block_len 0, orig_block_len 0
      em D, start 1200128, orig_start 1200128, len 835584, block_start
            1106776064, block_len 835584, orig_block_len 835584
      em E, start 2035712, orig_start 2035712, len 245760, block_start
            1107611648, block_len 245760, orig_block_len 245760
      
      Extent map A corresponds to a hole and extent maps D and E correspond to
      preallocated extents.
      
      Extent map D ends where extent map E begins (1106776064 + 835584 =
      1107611648), but these extent maps were not merged because they are in
      the inode's list of modified extent maps.
      
      An fsync against this inode is made, which triggers the fast path
      (BTRFS_INODE_NEEDS_FULL_SYNC is not set). This fsync triggers writeback
      of the data previously written using buffered IO, and when the respective
      ordered extent finishes, btrfs_drop_extents() is called against the
      (aligned) range 311296..1249279. This causes a split of extent map D at
      btrfs_drop_extent_cache(), replacing extent map D with a new extent map
      D', also added to the list of modified extents, with the following
      values:
      
      em D', start 1249280, orig_start of 1200128,
             block_start 1106825216 (= 1106776064 + 1249280 - 1200128),
             orig_block_len 835584,
             block_len 786432 (835584 - (1249280 - 1200128))
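
      The values of D' follow from simple offset arithmetic. The sketch below
      reproduces them in Python; it is an illustration of the calculation
      shown above, not the btrfs implementation, and split_tail() is a
      hypothetical helper name.

```python
def split_tail(start, orig_start, block_start, block_len, new_start):
    """Return the fields of the map that remains after dropping [start, new_start)."""
    dropped = new_start - start
    return {
        "start": new_start,
        "orig_start": orig_start,              # unchanged by the split
        "block_start": block_start + dropped,  # advance by the dropped bytes
        "block_len": block_len - dropped,      # shrink by the dropped bytes
    }


# Extent map D from the example, split at the end of the dropped range.
d_prime = split_tail(start=1200128, orig_start=1200128,
                     block_start=1106776064, block_len=835584,
                     new_start=1249280)
```

      The result has block_start 1106825216 and block_len 786432, matching
      the values listed for D'.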
      
      Then, during the fast fsync, btrfs_log_changed_extents() is called and
      extent maps D' and E are removed from the list of modified extents. The
      flag EXTENT_FLAG_LOGGING is also set on them. After the extents are logged
      clear_em_logging() is called on each of them, and that causes extent map E
      to be merged with extent map D' (try_merge_map()), resulting in D' being
      deleted and E adjusted to:
      
      em E, start 1249280, orig_start 1200128, len 1032192,
            block_start 1106825216, block_len 1032192,
            orig_block_len 245760
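
      The merged values can be checked the same way. This Python sketch only
      mirrors the numbers given in this message and is not the kernel's
      try_merge_map(); merge_adjacent() is a hypothetical name, and the len
      of D' (786432) is an assumption taken to be equal to its block_len,
      since the message does not list it.

```python
def merge_adjacent(prev: dict, nxt: dict) -> dict:
    """Merge two extent maps that are contiguous in the file and on disk."""
    if prev["start"] + prev["len"] != nxt["start"]:
        raise ValueError("not contiguous in the file")
    if prev["block_start"] + prev["block_len"] != nxt["block_start"]:
        raise ValueError("not contiguous on disk")
    return {
        "start": prev["start"],
        "orig_start": prev["orig_start"],
        "len": prev["len"] + nxt["len"],
        "block_start": prev["block_start"],
        "block_len": prev["block_len"] + nxt["block_len"],
        # The message lists E's original value here after the merge.
        "orig_block_len": nxt["orig_block_len"],
    }


d_prime = {"start": 1249280, "orig_start": 1200128, "len": 786432,
           "block_start": 1106825216, "block_len": 786432,
           "orig_block_len": 835584}
e = {"start": 2035712, "orig_start": 2035712, "len": 245760,
     "block_start": 1107611648, "block_len": 245760,
     "orig_block_len": 245760}
merged = merge_adjacent(d_prime, e)
```

      The merged map keeps the orig_start of D' (1200128) while covering
      1032192 bytes, which is the combination that later produces the wrong
      disk byte value.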
      
      A direct IO write at offset 1847296 and length of 360448 bytes (end offset
      at 2207744) starts, and at that moment the following extent maps exist for
      our inode:
      
      em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
            block_len 0, orig_block_len 0
      em B, start 40960, orig_start 40960, len 270336, block_start 1106399232,
            block_len 270336, orig_block_len 376832
      em C, start 311296, orig_start 311296, len 937984, block_start 1112842240,
            block_len 937984, orig_block_len 937984
      em E (prealloc), start 1249280, orig_start 1200128, len 1032192,
            block_start 1106825216, block_len 1032192, orig_block_len 245760
      
      The dio write results in drop_extent_cache() being called twice. The first
      time for a range that starts at offset 1847296 and ends at offset 2035711
      (length of 188416), which results in a double split of extent map E,
      replacing it with two new extent maps:
      
      em F, start 1249280, orig_start 1200128, block_start 1106825216,
            block_len 598016, orig_block_len 598016
      em G, start 2035712, orig_start 1200128, block_start 1107611648,
            block_len 245760, orig_block_len 1032192
      
      It also creates a new extent map that represents a part of the requested
      IO (through create_io_em()):
      
      em H, start 1847296, len 188416, block_start 1107423232, block_len 188416
      
      The second call to drop_extent_cache() has a range with a start offset of
      2035712 and end offset of 2207743 (length of 172032). This leads to
      replacing extent map G with a new extent map I with the following values:
      
      em I, start 2207744, orig_start 1200128, block_start 1107783680,
            block_len 73728, orig_block_len 1032192
      
      It also creates a new extent map that represents the second part of the
      requested IO (through create_io_em()):
      
      em J, start 2035712, len 172032, block_start 1107611648, block_len 172032
      
      The dio write set the inode's i_size to 2207744 bytes.
      
      After the dio write the inode has the following extent maps:
      
      em A, start 0, orig_start 0, len 40960, block_start 18446744073709551613,
            block_len 0, orig_block_len 0
      em B, start 40960, orig_start 40960, len 270336, block_start 1106399232,
            block_len 270336, orig_block_len 376832
      em C, start 311296, orig_start 311296, len 937984, block_start 1112842240,
            block_len 937984, orig_block_len 937984
      em F, start 1249280, orig_start 1200128, len 598016,
            block_start 1106825216, block_len 598016, orig_block_len 598016
      em H, start 1847296, orig_start 1200128, len 188416,
            block_start 1107423232, block_len 188416, orig_block_len 835584
      em J, start 2035712, orig_start 2035712, len 172032,
            block_start 1107611648, block_len 172032, orig_block_len 245760
      em I, start 2207744, orig_start 1200128, len 73728,
            block_start 1107783680, block_len 73728, orig_block_len 1032192
      
      Now make some change to the file, such as adding an xattr, and then
      fsync it again. This triggers a fast fsync path, and as of commit
      471d557a ("Btrfs: fix loss of prealloc extents past i_size after fsync
      log replay"), we use the extent map I to log a file extent item because
      it's a prealloc extent and it starts at an offset matching the inode's
      i_size. However when we log it, we create a file extent item with a value
      for the disk byte location that is wrong, as can be seen from the
      following output of "btrfs inspect-internal dump-tree":
      
       item 1 key (271 EXTENT_DATA 2207744) itemoff 3782 itemsize 53
           generation 22 type 2 (prealloc)
           prealloc data disk byte 1106776064 nr 1032192
           prealloc data offset 1007616 nr 73728
      
      Here the disk byte value corresponds to a calculation based on some fields
      from the extent map I:
      
        1106776064 = block_start (1107783680) - 1007616 (extent_offset)
        extent_offset = 2207744 (start) - 1200128 (orig_start) = 1007616
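
      The same calculation, written out as a short Python sketch. It only
      restates the arithmetic above; logged_extent() is a hypothetical name
      and this is not the logging code itself. The second call is a
      comparison that assumes an orig_start of 2035712, the start of the
      original extent map E.

```python
def logged_extent(start: int, orig_start: int, block_start: int):
    """Derive (disk byte, data offset) the way the logged item was built."""
    extent_offset = start - orig_start
    disk_bytenr = block_start - extent_offset
    return disk_bytenr, extent_offset


# Extent map I as it exists after the merge and the dio write.
logged = logged_extent(start=2207744, orig_start=1200128,
                       block_start=1107783680)

# For comparison only: the same formula with orig_start 2035712.
expected = logged_extent(start=2207744, orig_start=2035712,
                         block_start=1107783680)
```

      Here logged is (1106776064, 1007616), the pair seen in the dump above,
      whereas expected is (1107611648, 172032).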
      
      The disk byte value of 1106776064 clashes with disk byte values of the
      file extent items at offsets 1249280 and 1847296 in the fs tree:
      
              item 6 key (271 EXTENT_DATA 1249280) itemoff 3568 itemsize 53
                      generation 20 type 2 (prealloc)
                      prealloc data disk byte 1106776064 nr 835584
                      prealloc data offset 49152 nr 598016
              item 7 key (271 EXTENT_DATA 1847296) itemoff 3515 itemsize 53
                      generation 20 type 1 (regular)
                      extent data disk byte 1106776064 nr 835584
                      extent data offset 647168 nr 188416 ram 835584
                      extent compression 0 (none)
              item 8 key (271 EXTENT_DATA 2035712) itemoff 3462 itemsize 53
                      generation 20 type 1 (regular)
                      extent data disk byte 1107611648 nr 245760
                      extent data offset 0 nr 172032 ram 245760
                      extent compression 0 (none)
              item 9 key (271 EXTENT_DATA 2207744) itemoff 3409 itemsize 53
                      generation 20 type 2 (prealloc)
                      prealloc data disk byte 1107611648 nr 245760
                      prealloc data offset 172032 nr 73728
      
      Instead of the disk byte value of 1106776064, the value of 1107611648
      should have been logged. Also the data offset value should have been
      172032 and not 1007616.

      After a log replay we end up getting two extent items in the extent tree
      with different lengths, one of 835584, which is correct and existed
      before the log replay, and another one of 1032192 which is wrong and is
      based on the logged file extent item:
      
       item 12 key (1106776064 EXTENT_ITEM 835584) itemoff 3406 itemsize 53
          refs 2 gen 15 flags DATA
          extent data backref root 5 objectid 271 offset 1200128 count 2
       item 13 key (1106776064 EXTENT_ITEM 1032192) itemoff 3353 itemsize 53
          refs 1 gen 22 flags DATA
          extent data backref root 5 objectid 271 offset 1200128 count 1
      
      Obviously this leads to many problems and a filesystem check reports many
      errors:
      
       (...)
       checking extents
       Extent back ref already exists for 1106776064 parent 0 root 5 owner 271 offset 1200128 num_refs 1
       extent item 1106776064 has multiple extent items
       ref mismatch on [1106776064 835584] extent item 2, found 3
       Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 2 wanted 1 back 0x55b1d0ad7680
       Backref 1106776064 root 5 owner 271 offset 1200128 num_refs 0 not found in extent tree
       Incorrect local backref count on 1106776064 root 5 owner 271 offset 1200128 found 1 wanted 0 back 0x55b1d0ad4e70
       Backref bytes do not match extent backref, bytenr=1106776064, ref bytes=835584, backref bytes=1032192
       backpointer mismatch on [1106776064 835584]
       checking free space cache
       block group 1103101952 has wrong amount of free space
       failed to load free space cache for block group 1103101952
       checking fs roots
       (...)
      
      So fix this by logging the prealloc extents beyond the inode's i_size
      based on searches in the subvolume tree instead of the extent maps.
      
      Fixes: 471d557a ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      61a9f6b7
    • Nick Desaulniers's avatar
      x86/paravirt: Make native_save_fl() extern inline · edefb935
      Nick Desaulniers authored
      commit d0a8d937 upstream.
      
      native_save_fl() is marked static inline, but by using it as
      a function pointer in arch/x86/kernel/paravirt.c, it MUST be outlined.
      
      paravirt's use of native_save_fl() also requires that no GPRs other than
      %rax are clobbered.
      
      Compilers have different heuristics which they use to emit stack guard
      code, the emission of which can break paravirt's callee-saved assumption
      by clobbering %rcx.
      
      Marking a function definition extern inline means that if this version
      cannot be inlined, then the out-of-line version will be preferred. By
      having the out-of-line version be implemented in assembly, it cannot be
      instrumented with a stack protector, which might violate custom calling
      conventions that code like paravirt relies on.
      
      The semantics of extern inline have changed since gnu89. This means that
      folks using GCC versions >= 5.1 may see symbol redefinition errors at
      link time for subdirs that override KBUILD_CFLAGS (making the C standard
      used implicit) regardless of this patch. This has been cleaned up
      earlier in the patch set, but is left as a note in the commit message
      for future travelers.
      
      Reports:
       https://lkml.org/lkml/2018/5/7/534
       https://github.com/ClangBuiltLinux/linux/issues/16
      
      Discussion:
       https://bugs.llvm.org/show_bug.cgi?id=37512
       https://lkml.org/lkml/2018/5/24/1371
      
      Thanks to the many folks that participated in the discussion.
      Debugged-by: Alistair Strachan <astrachan@google.com>
      Debugged-by: Matthias Kaehlcke <mka@chromium.org>
      Suggested-by: Arnd Bergmann <arnd@arndb.de>
      Suggested-by: H. Peter Anvin <hpa@zytor.com>
      Suggested-by: Tom Stellar <tstellar@redhat.com>
      Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
      Acked-by: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@redhat.com
      Cc: akataria@vmware.com
      Cc: akpm@linux-foundation.org
      Cc: andrea.parri@amarulasolutions.com
      Cc: ard.biesheuvel@linaro.org
      Cc: aryabinin@virtuozzo.com
      Cc: astrachan@google.com
      Cc: boris.ostrovsky@oracle.com
      Cc: brijesh.singh@amd.com
      Cc: caoj.fnst@cn.fujitsu.com
      Cc: geert@linux-m68k.org
      Cc: ghackmann@google.com
      Cc: gregkh@linuxfoundation.org
      Cc: jan.kiszka@siemens.com
      Cc: jarkko.sakkinen@linux.intel.com
      Cc: joe@perches.com
      Cc: jpoimboe@redhat.com
      Cc: keescook@google.com
      Cc: kirill.shutemov@linux.intel.com
      Cc: kstewart@linuxfoundation.org
      Cc: linux-efi@vger.kernel.org
      Cc: linux-kbuild@vger.kernel.org
      Cc: manojgupta@google.com
      Cc: mawilcox@microsoft.com
      Cc: michal.lkml@markovi.net
      Cc: mjg59@google.com
      Cc: mka@chromium.org
      Cc: pombredanne@nexb.com
      Cc: rientjes@google.com
      Cc: rostedt@goodmis.org
      Cc: thomas.lendacky@amd.com
      Cc: tweek@google.com
      Cc: virtualization@lists.linux-foundation.org
      Cc: will.deacon@arm.com
      Cc: yamada.masahiro@socionext.com
      Link: http://lkml.kernel.org/r/20180621162324.36656-4-ndesaulniers@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      edefb935
    • H. Peter Anvin's avatar
      x86/asm: Add _ASM_ARG* constants for argument registers to <asm/asm.h> · 92e50158
      H. Peter Anvin authored
      commit 0e2e1600 upstream.
      
      i386 and x86-64 use different registers for arguments; make them
      available so we don't have to #ifdef in the actual code.
      
      Native size and specified size (q, l, w, b) versions are provided.
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
      Reviewed-by: Sedat Dilek <sedat.dilek@gmail.com>
      Acked-by: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@redhat.com
      Cc: akataria@vmware.com
      Cc: akpm@linux-foundation.org
      Cc: andrea.parri@amarulasolutions.com
      Cc: ard.biesheuvel@linaro.org
      Cc: arnd@arndb.de
      Cc: aryabinin@virtuozzo.com
      Cc: astrachan@google.com
      Cc: boris.ostrovsky@oracle.com
      Cc: brijesh.singh@amd.com
      Cc: caoj.fnst@cn.fujitsu.com
      Cc: geert@linux-m68k.org
      Cc: ghackmann@google.com
      Cc: gregkh@linuxfoundation.org
      Cc: jan.kiszka@siemens.com
      Cc: jarkko.sakkinen@linux.intel.com
      Cc: joe@perches.com
      Cc: jpoimboe@redhat.com
      Cc: keescook@google.com
      Cc: kirill.shutemov@linux.intel.com
      Cc: kstewart@linuxfoundation.org
      Cc: linux-efi@vger.kernel.org
      Cc: linux-kbuild@vger.kernel.org
      Cc: manojgupta@google.com
      Cc: mawilcox@microsoft.com
      Cc: michal.lkml@markovi.net
      Cc: mjg59@google.com
      Cc: mka@chromium.org
      Cc: pombredanne@nexb.com
      Cc: rientjes@google.com
      Cc: rostedt@goodmis.org
      Cc: thomas.lendacky@amd.com
      Cc: tstellar@redhat.com
      Cc: tweek@google.com
      Cc: virtualization@lists.linux-foundation.org
      Cc: will.deacon@arm.com
      Cc: yamada.masahiro@socionext.com
      Link: http://lkml.kernel.org/r/20180621162324.36656-3-ndesaulniers@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      92e50158
    • Nick Desaulniers's avatar
      compiler-gcc.h: Add __attribute__((gnu_inline)) to all inline declarations · 779145a6
      Nick Desaulniers authored
      commit d03db2bc upstream.
      
      Functions marked extern inline do not emit an externally visible
      function when the gnu89 C standard is used. Some KBUILD Makefiles
      overwrite KBUILD_CFLAGS. This is an issue for GCC 5.1+ users as without
      an explicit C standard specified, the default is gnu11. Since C99, the
      semantics of extern inline have changed such that an externally visible
      function is always emitted. This can lead to multiple definition errors
      of extern inline functions at link time of compilation units whose build
      files have removed an explicit C standard compiler flag for users of GCC
      5.1+ or Clang.
      Suggested-by: Arnd Bergmann <arnd@arndb.de>
      Suggested-by: H. Peter Anvin <hpa@zytor.com>
      Suggested-by: Joe Perches <joe@perches.com>
      Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
      Acked-by: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@redhat.com
      Cc: akataria@vmware.com
      Cc: akpm@linux-foundation.org
      Cc: andrea.parri@amarulasolutions.com
      Cc: ard.biesheuvel@linaro.org
      Cc: aryabinin@virtuozzo.com
      Cc: astrachan@google.com
      Cc: boris.ostrovsky@oracle.com
      Cc: brijesh.singh@amd.com
      Cc: caoj.fnst@cn.fujitsu.com
      Cc: geert@linux-m68k.org
      Cc: ghackmann@google.com
      Cc: gregkh@linuxfoundation.org
      Cc: jan.kiszka@siemens.com
      Cc: jarkko.sakkinen@linux.intel.com
      Cc: jpoimboe@redhat.com
      Cc: keescook@google.com
      Cc: kirill.shutemov@linux.intel.com
      Cc: kstewart@linuxfoundation.org
      Cc: linux-efi@vger.kernel.org
      Cc: linux-kbuild@vger.kernel.org
      Cc: manojgupta@google.com
      Cc: mawilcox@microsoft.com
      Cc: michal.lkml@markovi.net
      Cc: mjg59@google.com
      Cc: mka@chromium.org
      Cc: pombredanne@nexb.com
      Cc: rientjes@google.com
      Cc: rostedt@goodmis.org
      Cc: sedat.dilek@gmail.com
      Cc: thomas.lendacky@amd.com
      Cc: tstellar@redhat.com
      Cc: tweek@google.com
      Cc: virtualization@lists.linux-foundation.org
      Cc: will.deacon@arm.com
      Cc: yamada.masahiro@socionext.com
      Link: http://lkml.kernel.org/r/20180621162324.36656-2-ndesaulniers@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      779145a6
  2. 17 Jul, 2018 4 commits