1. 16 Jul, 2021 1 commit
    • bpf: Remove superfluous aux sanitation on subprog rejection · 59089a18
      Daniel Borkmann authored
      Follow-up to fe9a5ca7 ("bpf: Do not mark insn as seen under speculative
      path verification"). The sanitize_insn_aux_data() helper does not serve a
      particular purpose in today's code. The original intention for the helper
      was that if function-by-function verification fails, a given program would
      be cleared from temporary insn_aux_data[], and then its verification would
      be re-attempted in the context of the main program a second time.
      
      However, a failure in do_check_subprogs() will skip do_check_main() and
      propagate the error to the user instead, thus such situation can never occur.
      Given its interaction is not compatible with the Spectre v1 mitigation (due to
      comparing aux->seen with env->pass_cnt), just remove sanitize_insn_aux_data()
      to avoid future bugs in this area.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
  2. 15 Jul, 2021 14 commits
  3. 14 Jul, 2021 6 commits
    • Merge branch 'r8152-pm-fixxes' · 3ffd3dad
      David S. Miller authored
      Takashi Iwai says:
      
      ====================
      r8152: Fix a couple of PM problems
      
      It seems that the r8152 driver suffers from a deadlock at both runtime
      and system PM.  Formerly, it was seen more often at hibernation
      resume, but now it's triggered more frequently, as reported in SUSE
      Bugzilla:
        https://bugzilla.suse.com/show_bug.cgi?id=1186194
      
      While debugging the problem, I stumbled on a few obvious bugs, and here
      are the results: two patches addressing the resume problem.
      
      ***
      
      However, the story doesn't end here, unfortunately, and those patches
      don't seem to suffice.  The remaining major problem is that the driver
      calls napi_disable() and napi_enable() in the PM suspend callbacks.
      This makes the system stall at (runtime-)suspend.  If we drop the
      napi_disable() and napi_enable() calls in the PM suspend callbacks, it
      starts working (that was the result in Bugzilla comment 13):
        https://bugzilla.suse.com/show_bug.cgi?id=1186194#c13
      
      So, my patches aren't enough and we still need to investigate
      further.  It'd be appreciated if anyone can give a fix or a hint for
      more debugging.  The usage of napi_disable() at PM callbacks is unique
      in this driver and looks rather suspicious to me; but I'm no expert in
      this area so I might be wrong...
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • r8152: Fix a deadlock by doubly PM resume · 776ac63a
      Takashi Iwai authored
      The r8152 driver sets up the MAC address at reset-resume, while
      rtl8152_set_mac_address() does a temporary autopm get/put.  This may
      lead to a deadlock, as the PM lock has already been taken for the
      execution of the runtime PM callback.

      This patch adds a workaround to avoid the superfluous autopm when
      called from rtl8152_reset_resume().
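
      A minimal user-space sketch of that idea, not the driver code itself:
      the `in_resume` flag, `pm_get()`, `pm_put()` and `set_mac_address()` are
      hypothetical stand-ins that only count calls. The setter skips the PM
      get/put pair when the caller already runs inside a PM callback.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the autopm get/put pair; they only count calls. */
static int pm_get_calls;
static int pm_put_calls;

static int pm_get(void) { pm_get_calls++; return 0; }
static void pm_put(void) { pm_put_calls++; }

/* Copy a 6-byte MAC address. When in_resume is true the caller already
 * runs inside a PM callback, so the PM get/put pair is skipped. */
static int set_mac_address(unsigned char *dst, const unsigned char *src,
                           bool in_resume)
{
    int ret, i;

    if (!in_resume) {
        ret = pm_get();
        if (ret < 0)
            return ret;
    }

    for (i = 0; i < 6; i++)
        dst[i] = src[i];

    if (!in_resume)
        pm_put();

    return 0;
}
```

      The ordinary entry point would pass `false`, and only the reset-resume
      path would pass `true`.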
      
      Link: https://bugzilla.suse.com/show_bug.cgi?id=1186194
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • r8152: Fix potential PM refcount imbalance · 9c23aa51
      Takashi Iwai authored
      rtl8152_close() takes the refcount via usb_autopm_get_interface() but
      does not release it when the RTL8152_UNPLUG test hits.  This may lead to
      the imbalance of PM refcount.  This patch addresses it.
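
      As an illustration only (a sketch with hypothetical names, assuming a
      simple counter models the PM usage count): every successful get is
      paired with exactly one put, including on the unplugged path.

```c
#include <stdbool.h>

/* Hypothetical device state; pm_refcount models the PM usage counter. */
struct fake_dev {
    int pm_refcount;
    bool unplugged;
    bool stopped;
};

static int pm_get(struct fake_dev *d) { d->pm_refcount++; return 0; }
static void pm_put(struct fake_dev *d) { d->pm_refcount--; }

/* Every successful pm_get() is paired with one pm_put(), including on the
 * path taken when the device is already unplugged. */
static int dev_close(struct fake_dev *d)
{
    int ret = pm_get(d);

    if (ret < 0)
        return ret;

    if (!d->unplugged)
        d->stopped = true;   /* normal shutdown work would go here */

    pm_put(d);               /* released on both paths */
    return 0;
}
```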
      
      Link: https://bugzilla.suse.com/show_bug.cgi?id=1186194
      Signed-off-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'net-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 8096acd7
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski.
       "Including fixes from bpf and netfilter.
      
        Current release - regressions:
      
         - sock: fix parameter order in sock_setsockopt()
      
        Current release - new code bugs:
      
         - netfilter: nft_last:
             - fix incorrect arithmetic when restoring last used
             - honor NFTA_LAST_SET on restoration
      
        Previous releases - regressions:
      
         - udp: properly flush normal packet at GRO time
      
         - sfc: ensure correct number of XDP queues; don't allow enabling the
           feature if there isn't sufficient resources to Tx from any CPU
      
         - dsa: sja1105: fix address learning getting disabled on the CPU port
      
         - mptcp: addresses a rmem accounting issue that could keep packets in
           subflow receive buffers longer than necessary, delaying MPTCP-level
           ACKs
      
         - ip_tunnel: fix mtu calculation for ETHER tunnel devices
      
         - do not reuse skbs allocated from skbuff_fclone_cache in the napi
           skb cache, we'd try to return them to the wrong slab cache
      
         - tcp: consistently disable header prediction for mptcp
      
        Previous releases - always broken:
      
         - bpf: fix subprog poke descriptor tracking use-after-free
      
         - ipv6:
             - allocate enough headroom in ip6_finish_output2() in case
               iptables TEE is used
             - tcp: drop silly ICMPv6 packet too big messages to avoid
               expensive and pointless lookups (which may serve as a DDOS
               vector)
             - make sure fwmark is copied in SYNACK packets
             - fix 'disable_policy' for forwarded packets (align with IPv4)
      
         - netfilter: conntrack:
             - do not renew entry stuck in tcp SYN_SENT state
             - do not mark RST in the reply direction coming after SYN packet
               for an out-of-sync entry
      
         - mptcp: cleanly handle error conditions with MP_JOIN and syncookies
      
         - mptcp: fix double free when rejecting a join due to port mismatch
      
         - validate lwtstate->data before returning from skb_tunnel_info()
      
         - tcp: call sk_wmem_schedule before sk_mem_charge in zerocopy path
      
         - mt76: mt7921: continue to probe driver when fw already downloaded
      
         - bonding: fix multiple issues with offloading IPsec to (thru?) bond
      
         - stmmac: ptp: fix issues around Qbv support and setting time back
      
         - bcmgenet: always clear wake-up based on energy detection
      
        Misc:
      
         - sctp: move 198 addresses from unusable to private scope
      
         - ptp: support virtual clocks and timestamping
      
         - openvswitch: optimize operation for key comparison"
      
      * tag 'net-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (158 commits)
        net: dsa: properly check for the bridge_leave methods in dsa_switch_bridge_leave()
        sfc: add logs explaining XDP_TX/REDIRECT is not available
        sfc: ensure correct number of XDP queues
        sfc: fix lack of XDP TX queues - error XDP TX failed (-22)
        net: fddi: fix UAF in fza_probe
        net: dsa: sja1105: fix address learning getting disabled on the CPU port
        net: ocelot: fix switchdev objects synced for wrong netdev with LAG offload
        net: Use nlmsg_unicast() instead of netlink_unicast()
        octeontx2-pf: Fix uninitialized boolean variable pps
        ipv6: allocate enough headroom in ip6_finish_output2()
        net: hdlc: rename 'mod_init' & 'mod_exit' functions to be module-specific
        net: bridge: multicast: fix MRD advertisement router port marking race
        net: bridge: multicast: fix PIM hello router port marking race
        net: phy: marvell10g: fix differentiation of 88X3310 from 88X3340
        dsa: fix for_each_child.cocci warnings
        virtio_net: check virtqueue_add_sgs() return value
        mptcp: properly account bulk freed memory
        selftests: mptcp: fix case multiple subflows limited by server
        mptcp: avoid processing packet if a subflow reset
        mptcp: fix syncookie process if mptcp can not_accept new subflow
        ...
    • fs: add vfs_parse_fs_param_source() helper · d1d488d8
      Christian Brauner authored
      Add a simple helper that filesystems can use in their parameter parser
      to parse the "source" parameter. A few places open-coded this function
      and that already caused a bug in the cgroup v1 parser that we fixed.
      Let's make it harder to get this wrong by introducing a helper which
      performs all necessary checks.
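
      A user-space sketch of what such a helper might check, under stated
      assumptions: the types, the function name `parse_param_source()` and
      the error codes below are simplified, hypothetical stand-ins rather
      than the kernel definitions, and the duplicate-source check is an
      assumption about what "all necessary checks" covers.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the kernel's parameter types. */
enum value_type { VALUE_IS_STRING, VALUE_IS_FILE };

struct param {
    const char *key;
    enum value_type type;
    union {
        char *string;   /* valid only when type == VALUE_IS_STRING */
        void *file;     /* valid only when type == VALUE_IS_FILE */
    };
};

struct context {
    char *source;
};

/* Returns 0 on success, -ENOENT if the key is not "source" (so the caller
 * can try other parameters), and -EINVAL for a wrong type or a duplicate. */
static int parse_param_source(struct context *fc, struct param *p)
{
    if (strcmp(p->key, "source") != 0)
        return -ENOENT;
    if (p->type != VALUE_IS_STRING)
        return -EINVAL;          /* never read p->string for a file value */
    if (fc->source)
        return -EINVAL;          /* a source was already set */

    fc->source = p->string;      /* take ownership of the string */
    p->string = NULL;
    return 0;
}
```

      The key point is that the union member is read only after the type
      tag has been checked.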
      
      Link: https://syzkaller.appspot.com/bug?id=6312526aba5beae046fdae8f00399f87aab48b12
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cgroup: verify that source is a string · 3b046272
      Christian Brauner authored
      The following sequence can be used to trigger a UAF:
      
          int fscontext_fd = fsopen("cgroup", 0);
          int fd_null = open("/dev/null", O_RDONLY);
          fsconfig(fscontext_fd, FSCONFIG_SET_FD, "source", NULL, fd_null);
          close_range(3, ~0U, 0);
      
      The cgroup v1 specific fs parser expects a string for the "source"
      parameter.  However, it is perfectly legitimate to e.g.  specify a file
      descriptor for the "source" parameter.  The fs parser doesn't know what
      a filesystem allows there.  So it's a bug to assume that "source" is
      always of type fs_value_is_string when it can reasonably also be
      fs_value_is_file.
      
      This assumption in the cgroup code causes a UAF because struct
      fs_parameter uses a union for the actual value.  Access to that union is
      guarded by the param->type member.  The cgroup parameter parser didn't
      check param->type but unconditionally moved param->string into
      fc->source.  As a result, a close on the fscontext_fd triggers
      put_fs_context(), which frees fc->source and thereby frees the file
      stashed in param->file, causing a UAF during a close of the fd_null.
      
      Fix this by verifying that param->type is actually a string and report
      an error if not.
      
      In follow up patches I'll add a new generic helper that can be used here
      and by other filesystems instead of this error-prone copy-pasta fix.
      But fixing it in here first makes backporting it to stable a lot
      easier.
      
      Fixes: 8d2451f4 ("cgroup1: switch to option-by-option parsing")
      Reported-by: syzbot+283ce5a46486d6acdbaf@syzkaller.appspotmail.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: <stable@kernel.org>
      Cc: syzkaller-bugs <syzkaller-bugs@googlegroups.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 13 Jul, 2021 13 commits
    • net: dsa: properly check for the bridge_leave methods in dsa_switch_bridge_leave() · bcb9928a
      Vladimir Oltean authored
      This was not caught because there is no switch driver which implements
      the .port_bridge_join but not .port_bridge_leave method, but it should
      nonetheless be fixed, as in certain conditions (driver development) it
      might lead to NULL pointer dereference.
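
      A small sketch of the pattern, assuming a reduced, hypothetical ops
      table (`struct switch_ops` and the `demo_*` callbacks are illustrative
      only): the NULL check targets the same callback that is about to be
      called, not its join counterpart.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, reduced ops table; the real one has many more callbacks. */
struct switch_ops {
    int  (*port_bridge_join)(int port);
    void (*port_bridge_leave)(int port);
};

static int left_port = -1;
static void demo_leave(int port) { left_port = port; }
static int demo_join(int port) { (void)port; return 0; }

/* Check the same callback that is about to be called. */
static int switch_bridge_leave(const struct switch_ops *ops, int port)
{
    if (!ops->port_bridge_leave)
        return -EOPNOTSUPP;

    ops->port_bridge_leave(port);
    return 0;
}
```

      With a driver that only fills in the join callback, this returns an
      error instead of calling through a NULL pointer.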
      
      Fixes: f66a6a69 ("net: dsa: permit cross-chip bridging between all trees in the system")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'vboxsf-v5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hansg/linux · 40226a3d
      Linus Torvalds authored
      Pull vboxsf fixes from Hans de Goede:
       "This adds support for the atomic_open directory-inode op to vboxsf.
      
        Note this is not just an enhancement; it also fixes an actual issue
        which users are hitting, see the commit message of the "vboxsf: Add
        support for the atomic_open directory-inode op" patch"
      
      * tag 'vboxsf-v5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hansg/linux:
        vboxsf: Add support for the atomic_open directory-inode op
        vboxsf: Add vboxsf_[create|release]_sf_handle() helpers
        vboxsf: Make vboxsf_dir_create() return the handle for the created file
        vboxsf: Honor excl flag to the dir-inode create op
    • Merge tag 'for-5.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · f02bf857
      Linus Torvalds authored
      Pull btrfs zoned mode fixes from David Sterba:
      
       - fix deadlock when allocating system chunk
      
       - fix wrong mutex unlock on an error path
      
       - fix extent map splitting for append operation
      
       - update and fix message reporting unusable chunk space
      
       - don't block when background zone reclaim runs with balance in
         parallel
      
      * tag 'for-5.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: zoned: fix wrong mutex unlock on failure to allocate log root tree
        btrfs: don't block if we can't acquire the reclaim lock
        btrfs: properly split extent_map for REQ_OP_ZONE_APPEND
        btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
        btrfs: fix deadlock with concurrent chunk allocations involving system chunks
        btrfs: zoned: print unusable percentage when reclaiming block groups
        btrfs: zoned: fix types for u64 division in btrfs_reclaim_bgs_work
    • Merge branch 'sfc-tx-queues' · 28efd208
      David S. Miller authored
      Íñigo Huguet says:
      
      ====================
      sfc: Fix lack of XDP TX queues
      
      A change introduced in commit e26ca4b5 ("sfc: reduce the number of
      requested xdp ev queues") created a bug in XDP_TX and XDP_REDIRECT
      because it unintentionally reduced the number of XDP TX queues, leaving
      too few queues to have one per CPU, which led to errors if XDP
      TX/REDIRECT was done from a high numbered CPU.

      This patch series makes the following changes:
      - Fix the bug mentioned above
      - Revert commit 99ba0ea6 ("sfc: adjust efx->xdp_tx_queue_count with
        the real number of initialized queues"), which intended to fix a
        related problem created by the bug mentioned above and is no longer
        necessary
      - Add a new error log message if there are not enough resources to make
        XDP_TX/REDIRECT work
      
      V1 -> V2: keep the calculation of how many tx queues can handle a single
      event queue, but apply the "max. tx queues per channel" upper limit.
      V2 -> V3: WARN_ON if the number of initialized XDP TXQs differs from the
      expected.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sfc: add logs explaining XDP_TX/REDIRECT is not available · d2a16bde
      Íñigo Huguet authored
      If it's not possible to allocate enough channels for XDP, XDP_TX and
      XDP_REDIRECT don't work. However, the only message shown said that not
      enough channels were available, without stating the consequences. The
      user could not tell whether XDP is usable at all, whether performance
      is reduced, or what else to expect.
      Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sfc: ensure correct number of XDP queues · 788bc000
      Íñigo Huguet authored
      Commit 99ba0ea6 ("sfc: adjust efx->xdp_tx_queue_count with the real
      number of initialized queues") intended to fix a problem caused by a
      round up when calculating the number of XDP channels and queues.
      However, this was not the real problem. The real problem was that the
      number of XDP TX queues had been reduced to half in
      commit e26ca4b5 ("sfc: reduce the number of requested xdp ev queues"),
      but the variable xdp_tx_queue_count had remained the same.
      
      Once the correct number of XDP TX queues is created again in the
      previous patch of this series, this also can be reverted since the error
      doesn't actually exist.
      
      Only in the case that there is a bug in the code we can have different
      values in xdp_queue_number and efx->xdp_tx_queue_count. Because of this,
      and per Edward Cree's suggestion, I add instead a WARN_ON to catch if it
      happens again in the future.
      
      Note that the number of allocated queues can be higher than the number
      of used ones due to the round up, as explained in the existing comment
      in the code. That's why we also have to stop increasing xdp_queue_number
      beyond efx->xdp_tx_queue_count.
      Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sfc: fix lack of XDP TX queues - error XDP TX failed (-22) · f28100cb
      Íñigo Huguet authored
      Fixes: e26ca4b5 sfc: reduce the number of requested xdp ev queues
      
      The buggy commit intended to allocate fewer channels for XDP in order to
      make it less likely to reach the driver's limit of 32 channels.

      The idea was to use each IRQ/event queue for more XDP TX queues than
      before, calculating the maximum number of TX queues that one
      event queue can handle. For example, in EF10 each event queue could
      handle up to 8 queues, better than the 4 they were handling before the
      change. This way, it would have to allocate half as many channels as
      before for XDP TX.
      
      The problem is that the TX queues are also contained inside the channel
      structs, and there are only 4 queues per channel. Reducing the number of
      channels means also reducing the number of queues, resulting in not
      having the desired number of 1 queue per CPU.
      
      This leads to getting errors on XDP_TX and XDP_REDIRECT if they're
      executed from a high numbered CPU, because there only exist queues for
      the low half of CPUs, actually. If XDP_TX/REDIRECT is executed in a low
      numbered CPU, the error doesn't happen. This is the error in the logs
      (repeated many times, even rate limited):
      sfc 0000:5e:00.0 ens3f0np0: XDP TX failed (-22)
      
      This error happens in function efx_xdp_tx_buffers, where it expects to
      have a dedicated XDP TX queue per CPU.

      Reverting the change again makes it more likely to reach the limit of 32
      channels on machines with many CPUs. If this happens, no XDP_TX/REDIRECT
      will be possible at all, and we will have these error messages in the log:
      
      At interface probe:
      sfc 0000:5e:00.0: Insufficient resources for 12 XDP event queues (24 other channels, max 32)
      
      At every subsequent XDP_TX/REDIRECT failure, rate limited:
      sfc 0000:5e:00.0 ens3f0np0: XDP TX failed (-22)
      
      However, without reverting the change, the user is led to think that
      everything is OK at probe time, but later it fails in an unpredictable
      way, depending on the CPU that handles the packet.

      It is better to restore the predictable behaviour. If the user sees the
      error message at probe time, he/she can try to configure it in the way
      that best fits his/her needs. At least, he/she will have 2 options:
      - Accept that XDP_TX/REDIRECT is not available (he/she may not need it)
      - Load sfc module with modparam 'rss_cpus' with a lower number, thus
        creating fewer normal RX queues/channels and leaving more free
        resources for XDP, with some performance penalty.

      Anyway, keep the calculation of the maximum number of TX queues that can
      be handled by a single event queue, and use it only if it's less than
      the number of TX queues per channel. This doesn't happen in practice,
      but could happen if some constant values are tweaked in the future,
      such as
      EFX_MAX_TXQ_PER_CHANNEL, EFX_MAX_EVQ_SIZE or EFX_MAX_DMAQ_SIZE.
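
      The arithmetic can be sketched as follows. The helper names are
      hypothetical and this is not the driver's code; the figures of 8 TX
      queues per event queue and 4 TX queues per channel come from the text
      above, while the 48-CPU machine is an assumption chosen to match the
      "12 XDP event queues" in the probe log. Both per-queue limits are
      assumed to be non-zero.

```c
/* Hypothetical helper: number of XDP channels needed for one TX queue per CPU.
 * txq_per_evq:     how many TX queues a single event queue can handle
 * txq_per_channel: how many TX queues a channel structure holds */
static unsigned int xdp_channels_needed(unsigned int n_cpus,
                                        unsigned int txq_per_evq,
                                        unsigned int txq_per_channel)
{
    /* Apply the per-channel upper limit to the per-event-queue figure. */
    unsigned int per_channel = txq_per_evq < txq_per_channel ?
                               txq_per_evq : txq_per_channel;

    return (n_cpus + per_channel - 1) / per_channel;   /* round up */
}

/* TX queues that actually exist for a given number of channels. */
static unsigned int xdp_txqs_available(unsigned int n_channels,
                                       unsigned int txq_per_channel)
{
    return n_channels * txq_per_channel;
}
```

      With 48 CPUs, dividing by 8 alone yields 6 channels, which hold only
      6 * 4 = 24 TX queues; applying the limit of 4 yields 12 channels and
      48 TX queues, one per CPU.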
      
      Related mailing list thread:
      https://lore.kernel.org/bpf/20201215104327.2be76156@carbon/

      Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fddi: fix UAF in fza_probe · deb7178e
      Pavel Skripkin authored
      fp is netdev private data and it cannot be
      used after free_netdev() call. Using fp after free_netdev()
      can cause UAF bug. Fix it by moving free_netdev() after error message.
      
      Fixes: 61414f5e ("FDDI: defza: Add support for DEC FDDIcontroller 700
      TURBOchannel adapter")
      Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: sja1105: fix address learning getting disabled on the CPU port · b0b33b04
      Vladimir Oltean authored
      In May 2019 when commit 640f763f ("net: dsa: sja1105: Add support
      for Spanning Tree Protocol") was introduced, the comment that "STP does
      not get called for the CPU port" was true. This changed after commit
      0394a63a ("net: dsa: enable and disable all ports") in August 2019
      and went largely unnoticed, because the sja1105_bridge_stp_state_set()
      method did nothing different compared to the static setup done by
      sja1105_init_mac_settings().
      
      With the ability to turn address learning off introduced by the blamed
      commit, there is a new priv->learn_ena port mask in the driver. When
      sja1105_bridge_stp_state_set() gets called and we are in
      BR_STATE_LEARNING or later, address learning is enabled or not depending
      on priv->learn_ena & BIT(port).
      
      So what happens is that priv->learn_ena is not being set from anywhere
      for the CPU port, and the static configuration done by
      sja1105_init_mac_settings() is being overwritten.
      
      To solve this, acknowledge that the static configuration of STP state is
      no longer necessary because the STP state is being set by the DSA core
      now, but what is necessary is to set priv->learn_ena for the CPU port.
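
      A reduced sketch of the mask logic, with hypothetical names
      (`struct priv`, `port_learning()` and `setup_cpu_port()` are
      illustrative, and `stp_allows` stands for "BR_STATE_LEARNING or
      later"): learning depends on the port's bit in learn_ena, so the CPU
      port needs its bit set at setup time.

```c
#include <stdbool.h>

#define BIT(n) (1UL << (n))

/* Hypothetical, reduced driver state. */
struct priv {
    unsigned long learn_ena;   /* one bit per port: learning allowed */
};

/* Learning is enabled only if the STP state permits it AND the port's
 * bit is set in learn_ena. */
static bool port_learning(const struct priv *p, int port, bool stp_allows)
{
    return stp_allows && (p->learn_ena & BIT(port)) != 0;
}

/* Setup step: make sure the CPU port has its learn_ena bit set. */
static void setup_cpu_port(struct priv *p, int cpu_port)
{
    p->learn_ena |= BIT(cpu_port);
}
```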
      
      Fixes: 4d942354 ("net: dsa: sja1105: offload bridge port flags to device")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ocelot: fix switchdev objects synced for wrong netdev with LAG offload · e56c6bbd
      Vladimir Oltean authored
      The point with a *dev and a *brport_dev is that when we have a LAG net
      device that is a bridge port, *dev is an ocelot net device and
      *brport_dev is the bonding/team net device. The ocelot net device
      beneath the LAG does not exist from the bridge's perspective, so we need
      to sync the switchdev objects belonging to the brport_dev and not to the
      dev.
      
      Fixes: e4bd44e8 ("net: ocelot: replay switchdev events when joining bridge")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Use nlmsg_unicast() instead of netlink_unicast() · 01757f53
      Yajun Deng authored
      nlmsg_unicast() already contains the 'if (err > 0)' check, so use nlmsg_unicast()
      instead of netlink_unicast(), this looks more concise.
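
      The wrapper pattern can be sketched in user space as follows;
      `raw_unicast()` and `wrapped_unicast()` are hypothetical stand-ins,
      not the netlink API. Callers of the wrapper then no longer need their
      own 'if (err > 0)' normalisation.

```c
/* Hypothetical stand-in for the low-level send: returns the number of
 * bytes delivered on success or a negative errno-style value on failure. */
static int raw_unicast(int len, int fail)
{
    return fail ? -fail : len;
}

/* Wrapper in the style described above: collapse any positive return
 * value to 0 so callers only see 0 or a negative error. */
static int wrapped_unicast(int len, int fail)
{
    int err = raw_unicast(len, fail);

    if (err > 0)
        err = 0;
    return err;
}
```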
      
      v2: remove the change in netfilter.
      Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xdp, net: Fix use-after-free in bpf_xdp_link_release · 5acc7d3e
      Xuan Zhuo authored
      The problem occurs when dev_xdp_uninstall() is called between
      dev_get_by_index() and dev_xdp_attach_link(). The xdp link will then
      not be detached automatically when dev is released. But link->dev
      already points to dev, so when the xdp link is released, dev is still
      accessed even though it has already been released.
      
      dev_get_by_index()        |
      link->dev = dev           |
                                |      rtnl_lock()
                                |      unregister_netdevice_many()
                                |          dev_xdp_uninstall()
                                |      rtnl_unlock()
      rtnl_lock();              |
      dev_xdp_attach_link()     |
      rtnl_unlock();            |
                                |      netdev_run_todo() // dev released
      bpf_xdp_link_release()    |
          /* access dev.        |
             use-after-free */  |
      
      [   45.966867] BUG: KASAN: use-after-free in bpf_xdp_link_release+0x3b8/0x3d0
      [   45.967619] Read of size 8 at addr ffff00000f9980c8 by task a.out/732
      [   45.968297]
      [   45.968502] CPU: 1 PID: 732 Comm: a.out Not tainted 5.13.0+ #22
      [   45.969222] Hardware name: linux,dummy-virt (DT)
      [   45.969795] Call trace:
      [   45.970106]  dump_backtrace+0x0/0x4c8
      [   45.970564]  show_stack+0x30/0x40
      [   45.970981]  dump_stack_lvl+0x120/0x18c
      [   45.971470]  print_address_description.constprop.0+0x74/0x30c
      [   45.972182]  kasan_report+0x1e8/0x200
      [   45.972659]  __asan_report_load8_noabort+0x2c/0x50
      [   45.973273]  bpf_xdp_link_release+0x3b8/0x3d0
      [   45.973834]  bpf_link_free+0xd0/0x188
      [   45.974315]  bpf_link_put+0x1d0/0x218
      [   45.974790]  bpf_link_release+0x3c/0x58
      [   45.975291]  __fput+0x20c/0x7e8
      [   45.975706]  ____fput+0x24/0x30
      [   45.976117]  task_work_run+0x104/0x258
      [   45.976609]  do_notify_resume+0x894/0xaf8
      [   45.977121]  work_pending+0xc/0x328
      [   45.977575]
      [   45.977775] The buggy address belongs to the page:
      [   45.978369] page:fffffc00003e6600 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4f998
      [   45.979522] flags: 0x7fffe0000000000(node=0|zone=0|lastcpupid=0x3ffff)
      [   45.980349] raw: 07fffe0000000000 fffffc00003e6708 ffff0000dac3c010 0000000000000000
      [   45.981309] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
      [   45.982259] page dumped because: kasan: bad access detected
      [   45.982948]
      [   45.983153] Memory state around the buggy address:
      [   45.983753]  ffff00000f997f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [   45.984645]  ffff00000f998000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      [   45.985533] >ffff00000f998080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      [   45.986419]                                               ^
      [   45.987112]  ffff00000f998100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      [   45.988006]  ffff00000f998180: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      [   45.988895] ==================================================================
      [   45.989773] Disabling lock debugging due to kernel taint
      [   45.990552] Kernel panic - not syncing: panic_on_warn set ...
      [   45.991166] CPU: 1 PID: 732 Comm: a.out Tainted: G    B             5.13.0+ #22
      [   45.991929] Hardware name: linux,dummy-virt (DT)
      [   45.992448] Call trace:
      [   45.992753]  dump_backtrace+0x0/0x4c8
      [   45.993208]  show_stack+0x30/0x40
      [   45.993627]  dump_stack_lvl+0x120/0x18c
      [   45.994113]  dump_stack+0x1c/0x34
      [   45.994530]  panic+0x3a4/0x7d8
      [   45.994930]  end_report+0x194/0x198
      [   45.995380]  kasan_report+0x134/0x200
      [   45.995850]  __asan_report_load8_noabort+0x2c/0x50
      [   45.996453]  bpf_xdp_link_release+0x3b8/0x3d0
      [   45.997007]  bpf_link_free+0xd0/0x188
      [   45.997474]  bpf_link_put+0x1d0/0x218
      [   45.997942]  bpf_link_release+0x3c/0x58
      [   45.998429]  __fput+0x20c/0x7e8
      [   45.998833]  ____fput+0x24/0x30
      [   45.999247]  task_work_run+0x104/0x258
      [   45.999731]  do_notify_resume+0x894/0xaf8
      [   46.000236]  work_pending+0xc/0x328
      [   46.000697] SMP: stopping secondary CPUs
      [   46.001226] Dumping ftrace buffer:
      [   46.001663]    (ftrace buffer empty)
      [   46.002110] Kernel Offset: disabled
      [   46.002545] CPU features: 0x00000001,23202c00
      [   46.003080] Memory Limit: none
      
      Fixes: aa8d3a71 ("bpf, xdp: Add bpf_link-based XDP attachment API")
      Reported-by: Abaci <abaci@linux.alibaba.com>
      Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20210710031635.41649-1-xuanzhuo@linux.alibaba.com
    • bpf: Fix tail_call_reachable rejection for interpreter when jit failed · 5dd0a6b8
      Daniel Borkmann authored
      During testing of f263a814 ("bpf: Track subprog poke descriptors correctly
      and fix use-after-free") under various failure conditions, for example, when
      jit_subprogs() fails and tries to clean up the program to be run under the
      interpreter, we ran into the following freeze:
      
        [...]
        #127/8 tailcall_bpf2bpf_3:FAIL
        [...]
        [   92.041251] BUG: KASAN: slab-out-of-bounds in ___bpf_prog_run+0x1b9d/0x2e20
        [   92.042408] Read of size 8 at addr ffff88800da67f68 by task test_progs/682
        [   92.043707]
        [   92.044030] CPU: 1 PID: 682 Comm: test_progs Tainted: G   O   5.13.0-53301-ge6c08cb33a30-dirty #87
        [   92.045542] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
        [   92.046785] Call Trace:
        [   92.047171]  ? __bpf_prog_run_args64+0xc0/0xc0
        [   92.047773]  ? __bpf_prog_run_args32+0x8b/0xb0
        [   92.048389]  ? __bpf_prog_run_args64+0xc0/0xc0
        [   92.049019]  ? ktime_get+0x117/0x130
        [...] // few hundred [similar] lines more
        [   92.659025]  ? ktime_get+0x117/0x130
        [   92.659845]  ? __bpf_prog_run_args64+0xc0/0xc0
        [   92.660738]  ? __bpf_prog_run_args32+0x8b/0xb0
        [   92.661528]  ? __bpf_prog_run_args64+0xc0/0xc0
        [   92.662378]  ? print_usage_bug+0x50/0x50
        [   92.663221]  ? print_usage_bug+0x50/0x50
        [   92.664077]  ? bpf_ksym_find+0x9c/0xe0
        [   92.664887]  ? ktime_get+0x117/0x130
        [   92.665624]  ? kernel_text_address+0xf5/0x100
        [   92.666529]  ? __kernel_text_address+0xe/0x30
        [   92.667725]  ? unwind_get_return_address+0x2f/0x50
        [   92.668854]  ? ___bpf_prog_run+0x15d4/0x2e20
        [   92.670185]  ? ktime_get+0x117/0x130
        [   92.671130]  ? __bpf_prog_run_args64+0xc0/0xc0
        [   92.672020]  ? __bpf_prog_run_args32+0x8b/0xb0
        [   92.672860]  ? __bpf_prog_run_args64+0xc0/0xc0
        [   92.675159]  ? ktime_get+0x117/0x130
        [   92.677074]  ? lock_is_held_type+0xd5/0x130
        [   92.678662]  ? ___bpf_prog_run+0x15d4/0x2e20
        [   92.680046]  ? ktime_get+0x117/0x130
        [   92.681285]  ? __bpf_prog_run32+0x6b/0x90
        [   92.682601]  ? __bpf_prog_run64+0x90/0x90
        [   92.683636]  ? lock_downgrade+0x370/0x370
        [   92.684647]  ? mark_held_locks+0x44/0x90
        [   92.685652]  ? ktime_get+0x117/0x130
        [   92.686752]  ? lockdep_hardirqs_on+0x79/0x100
        [   92.688004]  ? ktime_get+0x117/0x130
        [   92.688573]  ? __cant_migrate+0x2b/0x80
        [   92.689192]  ? bpf_test_run+0x2f4/0x510
        [   92.689869]  ? bpf_test_timer_continue+0x1c0/0x1c0
        [   92.690856]  ? rcu_read_lock_bh_held+0x90/0x90
        [   92.691506]  ? __kasan_slab_alloc+0x61/0x80
        [   92.692128]  ? eth_type_trans+0x128/0x240
        [   92.692737]  ? __build_skb+0x46/0x50
        [   92.693252]  ? bpf_prog_test_run_skb+0x65e/0xc50
        [   92.693954]  ? bpf_prog_test_run_raw_tp+0x2d0/0x2d0
        [   92.694639]  ? __fget_light+0xa1/0x100
        [   92.695162]  ? bpf_prog_inc+0x23/0x30
        [   92.695685]  ? __sys_bpf+0xb40/0x2c80
        [   92.696324]  ? bpf_link_get_from_fd+0x90/0x90
        [   92.697150]  ? mark_held_locks+0x24/0x90
        [   92.698007]  ? lockdep_hardirqs_on_prepare+0x124/0x220
        [   92.699045]  ? finish_task_switch+0xe6/0x370
        [   92.700072]  ? lockdep_hardirqs_on+0x79/0x100
        [   92.701233]  ? finish_task_switch+0x11d/0x370
        [   92.702264]  ? __switch_to+0x2c0/0x740
        [   92.703148]  ? mark_held_locks+0x24/0x90
        [   92.704155]  ? __x64_sys_bpf+0x45/0x50
        [   92.705146]  ? do_syscall_64+0x35/0x80
        [   92.706953]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
        [...]
      
      Turns out that the program rejection from e411901c ("bpf: allow for tailcalls
      in BPF subprograms for x64 JIT") is buggy since env->prog->aux->tail_call_reachable
      is never true. Commit ebf7d1f5 ("bpf, x64: rework pro/epilogue and tailcall
      handling in JIT") added a tracker into check_max_stack_depth() which propagates
      the tail_call_reachable condition throughout the subprograms. This info is then
      assigned to the subprogram's func[i]->aux->tail_call_reachable. However, the
      rejection check upon JIT failure uses env->prog->aux->tail_call_reachable, and
      func[0]->aux->tail_call_reachable, which represents the main program's
      information, was never propagated to the outer env->prog->aux. Add this
      propagation into check_max_stack_depth(), where it belongs, so that the
      check can be done reliably.
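
The propagation described above can be modeled in a few lines of user-space C. This is only a sketch and not the kernel patch itself: the struct and function names (subprog_model, prog_aux_model, env_model, propagate_tail_call_reachable, must_reject_after_jit_failure) are hypothetical stand-ins, and only the single tail_call_reachable flag is modeled. It assumes that entry 0 of the subprogram array describes the main program, as the text states for func[0].

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-subprogram verifier info. */
struct subprog_model {
	bool tail_call_reachable;
};

/* Hypothetical stand-in for env->prog->aux. */
struct prog_aux_model {
	bool tail_call_reachable;
};

/* Hypothetical stand-in for the verifier environment. */
struct env_model {
	struct subprog_model *subprog;   /* subprog[0] is the main program */
	size_t subprog_cnt;
	struct prog_aux_model *prog_aux; /* models the outer env->prog->aux */
};

/*
 * Models the missing step: copy the main program's flag to the outer
 * program aux so that later checks on prog_aux see it.
 */
static void propagate_tail_call_reachable(struct env_model *env)
{
	if (env->subprog_cnt > 0 && env->subprog[0].tail_call_reachable)
		env->prog_aux->tail_call_reachable = true;
}

/*
 * Models the rejection check taken after a JIT failure: it only looks
 * at the outer aux, so it never fires if the flag was not propagated.
 */
static bool must_reject_after_jit_failure(const struct env_model *env)
{
	return env->prog_aux->tail_call_reachable;
}
```

In this model, calling must_reject_after_jit_failure() before propagate_tail_call_reachable() returns false even though subprog[0] has the flag set, which mirrors the situation where the rejection check could never trigger.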
      
      Fixes: ebf7d1f5 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT")
      Fixes: e411901c ("bpf: allow for tailcalls in BPF subprograms for x64 JIT")
      Co-developed-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Link: https://lore.kernel.org/bpf/618c34e3163ad1a36b1e82377576a6081e182f25.1626123173.git.daniel@iogearbox.net
      5dd0a6b8
  5. 12 Jul, 2021 6 commits