1. 16 Feb, 2024 5 commits
    • Kuniyuki Iwashima's avatar
      dccp/tcp: Unhash sk from ehash for tb2 alloc failure after check_estalblished(). · 66b60b0c
      Kuniyuki Iwashima authored
      syzkaller reported a warning [0] in inet_csk_destroy_sock() with no
      repro.
      
        WARN_ON(inet_sk(sk)->inet_num && !inet_csk(sk)->icsk_bind_hash);
      
      However, the syzkaller's log hinted that connect() failed just before
      the warning due to FAULT_INJECTION.  [1]
      
      When connect() is called for an unbound socket, we search for an
      available ephemeral port.  If a bhash bucket exists for the port, we
      call __inet_check_established() or __inet6_check_established() to check
      if the bucket is reusable.
      
      If reusable, we add the socket into ehash and set inet_sk(sk)->inet_num.
      
      Later, we look up the corresponding bhash2 bucket and try to allocate
      it if it does not exist.
      
      Although it rarely occurs in real use, if the allocation fails, we must
      revert the changes by check_established().  Otherwise, an unconnected
      socket could illegally occupy an ehash entry.
      
      Note that we do not put tw back into ehash because sk might have
      already responded to a packet for tw and it would be better to free
      tw earlier under such memory presure.
      
      [0]:
      WARNING: CPU: 0 PID: 350830 at net/ipv4/inet_connection_sock.c:1193 inet_csk_destroy_sock (net/ipv4/inet_connection_sock.c:1193)
      Modules linked in:
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      RIP: 0010:inet_csk_destroy_sock (net/ipv4/inet_connection_sock.c:1193)
      Code: 41 5c 41 5d 41 5e e9 2d 4a 3d fd e8 28 4a 3d fd 48 89 ef e8 f0 cd 7d ff 5b 5d 41 5c 41 5d 41 5e e9 13 4a 3d fd e8 0e 4a 3d fd <0f> 0b e9 61 fe ff ff e8 02 4a 3d fd 4c 89 e7 be 03 00 00 00 e8 05
      RSP: 0018:ffffc9000b21fd38 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 0000000000009e78 RCX: ffffffff840bae40
      RDX: ffff88806e46c600 RSI: ffffffff840bb012 RDI: ffff88811755cca8
      RBP: ffff88811755c880 R08: 0000000000000003 R09: 0000000000000000
      R10: 0000000000009e78 R11: 0000000000000000 R12: ffff88811755c8e0
      R13: ffff88811755c892 R14: ffff88811755c918 R15: 0000000000000000
      FS:  00007f03e5243800(0000) GS:ffff88811ae00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000001b32f21000 CR3: 0000000112ffe001 CR4: 0000000000770ef0
      PKRU: 55555554
      Call Trace:
       <TASK>
       ? inet_csk_destroy_sock (net/ipv4/inet_connection_sock.c:1193)
       dccp_close (net/dccp/proto.c:1078)
       inet_release (net/ipv4/af_inet.c:434)
       __sock_release (net/socket.c:660)
       sock_close (net/socket.c:1423)
       __fput (fs/file_table.c:377)
       __fput_sync (fs/file_table.c:462)
       __x64_sys_close (fs/open.c:1557 fs/open.c:1539 fs/open.c:1539)
       do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
       entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
      RIP: 0033:0x7f03e53852bb
      Code: 03 00 00 00 0f 05 48 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 43 c9 f5 ff 8b 7c 24 0c 41 89 c0 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 89 44 24 0c e8 a1 c9 f5 ff 8b 44
      RSP: 002b:00000000005dfba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
      RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f03e53852bb
      RDX: 0000000000000002 RSI: 0000000000000002 RDI: 0000000000000003
      RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000167c
      R10: 0000000008a79680 R11: 0000000000000293 R12: 00007f03e4e43000
      R13: 00007f03e4e43170 R14: 00007f03e4e43178 R15: 00007f03e4e43170
       </TASK>
      
      [1]:
      FAULT_INJECTION: forcing a failure.
      name failslab, interval 1, probability 0, space 0, times 0
      CPU: 0 PID: 350833 Comm: syz-executor.1 Not tainted 6.7.0-12272-g2121c43f #9
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      Call Trace:
       <TASK>
       dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1))
       should_fail_ex (lib/fault-inject.c:52 lib/fault-inject.c:153)
       should_failslab (mm/slub.c:3748)
       kmem_cache_alloc (mm/slub.c:3763 mm/slub.c:3842 mm/slub.c:3867)
       inet_bind2_bucket_create (net/ipv4/inet_hashtables.c:135)
       __inet_hash_connect (net/ipv4/inet_hashtables.c:1100)
       dccp_v4_connect (net/dccp/ipv4.c:116)
       __inet_stream_connect (net/ipv4/af_inet.c:676)
       inet_stream_connect (net/ipv4/af_inet.c:747)
       __sys_connect_file (net/socket.c:2048 (discriminator 2))
       __sys_connect (net/socket.c:2065)
       __x64_sys_connect (net/socket.c:2072)
       do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
       entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
      RIP: 0033:0x7f03e5284e5d
      Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48
      RSP: 002b:00007f03e4641cc8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
      RAX: ffffffffffffffda RBX: 00000000004bbf80 RCX: 00007f03e5284e5d
      RDX: 0000000000000010 RSI: 0000000020000000 RDI: 0000000000000003
      RBP: 00000000004bbf80 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
      R13: 000000000000000b R14: 00007f03e52e5530 R15: 0000000000000000
       </TASK>
      Reported-by: default avatarsyzkaller <syzkaller@googlegroups.com>
      Fixes: 28044fc1 ("net: Add a bhash2 table hashed by port and address")
      Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      66b60b0c
    • David S. Miller's avatar
      Merge branch 'bridge-mdb-events' · 82a678e2
      David S. Miller authored
      Tobias Waldekranz says:
      
      ====================
      net: bridge: switchdev: Ensure MDB events are delivered exactly once
      
      When a device is attached to a bridge, drivers will request a replay
      of objects that were created before the device joined the bridge, that
      are still of interest to the joining port. Typical examples include
      FDB entries and MDB memberships on other ports ("foreign interfaces")
      or on the bridge itself.
      
      Conversely when a device is detached, the bridge will synthesize
      deletion events for all those objects that are still live, but no
      longer applicable to the device in question.
      
      This series eliminates two races related to the synching and
      unsynching phases of a bridge's MDB with a joining or leaving device,
      that would cause notifications of such objects to be either delivered
      twice (1/2), or not at all (2/2).
      
      A similar race to the one solved by 1/2 still remains for the
      FDB. This is much harder to solve, due to the lockless operation of
      the FDB's rhashtable, and is therefore knowingly left out of this
      series.
      
      v1 -> v2:
      - Squash the previously separate addition of
        switchdev_port_obj_act_is_deferred into first consumer.
      - Use ether_addr_equal to compare MAC addresses.
      - Document switchdev_port_obj_act_is_deferred (renamed from
        switchdev_port_obj_is_deferred in v1, to indicate that we also match
        on the action).
      - Delay allocations of MDB objects until we know they're needed.
      - Use non-RCU version of the hash list iterator, now that the MDB is
        not scanned while holding the RCU read lock.
      - Add Fixes tag to commit message
      
      v2 -> v3:
      - Fix unlocking in error paths
      - Access RCU protected port list via mlock_dereference, since MDB is
        guaranteed to remain constant for the duration of the scan.
      
      v3 -> v4:
      - Limit the search for exiting deferred events in 1/2 to only apply to
        additions, since the problem does not exist in the deletion case.
      - Add 2/2, to plug a related race when unoffloading an indirectly
        associated device.
      
      v4 -> v5:
      - Fix grammatical errors in kerneldoc of
        switchdev_port_obj_act_is_deferred
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      82a678e2
    • Tobias Waldekranz's avatar
      net: bridge: switchdev: Ensure deferred event delivery on unoffload · f7a70d65
      Tobias Waldekranz authored
      When unoffloading a device, it is important to ensure that all
      relevant deferred events are delivered to it before it disassociates
      itself from the bridge.
      
      Before this change, this was true for the normal case when a device
      maps 1:1 to a net_bridge_port, i.e.
      
         br0
         /
      swp0
      
      When swp0 leaves br0, the call to switchdev_deferred_process() in
      del_nbp() makes sure to process any outstanding events while the
      device is still associated with the bridge.
      
      In the case when the association is indirect though, i.e. when the
      device is attached to the bridge via an intermediate device, like a
      LAG...
      
          br0
          /
        lag0
        /
      swp0
      
      ...then detaching swp0 from lag0 does not cause any net_bridge_port to
      be deleted, so there was no guarantee that all events had been
      processed before the device disassociated itself from the bridge.
      
      Fix this by always synchronously processing all deferred events before
      signaling completion of unoffloading back to the driver.
      
      Fixes: 4e51bf44 ("net: bridge: move the switchdev object replay helpers to "push" mode")
      Signed-off-by: default avatarTobias Waldekranz <tobias@waldekranz.com>
      Reviewed-by: default avatarVladimir Oltean <olteanv@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f7a70d65
    • Tobias Waldekranz's avatar
      net: bridge: switchdev: Skip MDB replays of deferred events on offload · dc489f86
      Tobias Waldekranz authored
      Before this change, generation of the list of MDB events to replay
      would race against the creation of new group memberships, either from
      the IGMP/MLD snooping logic or from user configuration.
      
      While new memberships are immediately visible to walkers of
      br->mdb_list, the notification of their existence to switchdev event
      subscribers is deferred until a later point in time. So if a replay
      list was generated during a time that overlapped with such a window,
      it would also contain a replay of the not-yet-delivered event.
      
      The driver would thus receive two copies of what the bridge internally
      considered to be one single event. On destruction of the bridge, only
      a single membership deletion event was therefore sent. As a
      consequence of this, drivers which reference count memberships (at
      least DSA), would be left with orphan groups in their hardware
      database when the bridge was destroyed.
      
      This is only an issue when replaying additions. While deletion events
      may still be pending on the deferred queue, they will already have
      been removed from br->mdb_list, so no duplicates can be generated in
      that scenario.
      
      To a user this meant that old group memberships, from a bridge in
      which a port was previously attached, could be reanimated (in
      hardware) when the port joined a new bridge, without the new bridge's
      knowledge.
      
      For example, on an mv88e6xxx system, create a snooping bridge and
      immediately add a port to it:
      
          root@infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \
          > ip link set dev x3 up master br0
      
      And then destroy the bridge:
      
          root@infix-06-0b-00:~$ ip link del dev br0
          root@infix-06-0b-00:~$ mvls atu
          ADDRESS             FID  STATE      Q  F  0  1  2  3  4  5  6  7  8  9  a
          DEV:0 Marvell 88E6393X
          33:33:00:00:00:6a     1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
          33:33:ff:87:e4:3f     1  static     -  -  0  .  .  .  .  .  .  .  .  .  .
          ff:ff:ff:ff:ff:ff     1  static     -  -  0  1  2  3  4  5  6  7  8  9  a
          root@infix-06-0b-00:~$
      
      The two IPv6 groups remain in the hardware database because the
      port (x3) is notified of the host's membership twice: once via the
      original event and once via a replay. Since only a single delete
      notification is sent, the count remains at 1 when the bridge is
      destroyed.
      
      Then add the same port (or another port belonging to the same hardware
      domain) to a new bridge, this time with snooping disabled:
      
          root@infix-06-0b-00:~$ ip link add dev br1 up type bridge mcast_snooping 0 && \
          > ip link set dev x3 up master br1
      
      All multicast, including the two IPv6 groups from br0, should now be
      flooded, according to the policy of br1. But instead the old
      memberships are still active in the hardware database, causing the
      switch to only forward traffic to those groups towards the CPU (port
      0).
      
      Eliminate the race in two steps:
      
      1. Grab the write-side lock of the MDB while generating the replay
         list.
      
      This prevents new memberships from showing up while we are generating
      the replay list. But it leaves the scenario in which a deferred event
      was already generated, but not delivered, before we grabbed the
      lock. Therefore:
      
      2. Make sure that no deferred version of a replay event is already
         enqueued to the switchdev deferred queue, before adding it to the
         replay list, when replaying additions.
      
      Fixes: 4f2673b3 ("net: bridge: add helper to replay port and host-joined mdb entries")
      Signed-off-by: default avatarTobias Waldekranz <tobias@waldekranz.com>
      Reviewed-by: default avatarVladimir Oltean <olteanv@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      dc489f86
    • Alexander Gordeev's avatar
      net/iucv: fix the allocation size of iucv_path_table array · b4ea9b6a
      Alexander Gordeev authored
      iucv_path_table is a dynamically allocated array of pointers to
      struct iucv_path items. Yet, its size is calculated as if it was
      an array of struct iucv_path items.
      Signed-off-by: default avatarAlexander Gordeev <agordeev@linux.ibm.com>
      Reviewed-by: default avatarAlexandra Winter <wintera@linux.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b4ea9b6a
  2. 15 Feb, 2024 28 commits
  3. 14 Feb, 2024 7 commits
    • Linus Torvalds's avatar
      Merge tag 'for-6.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 1f3a3e2a
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
       "A few regular fixes and one fix for space reservation regression since
        6.7 that users have been reporting:
      
         - fix over-reservation of metadata chunks due to not keeping proper
           balance between global block reserve and delayed refs reserve; in
           practice this leaves behind empty metadata block groups, the
           workaround is to reclaim them by using the '-musage=1' balance
           filter
      
         - other space reservation fixes:
            - do not delete unused block group if it may be used soon
            - do not reserve space for checksums for NOCOW files
      
         - fix extent map assertion failure when writing out free space inode
      
         - reject encoded write if inode has nodatasum flag set
      
         - fix chunk map leak when loading block group zone info"
      
      * tag 'for-6.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: don't refill whole delayed refs block reserve when starting transaction
        btrfs: zoned: fix chunk map leak when loading block group zone info
        btrfs: reject encoded write if inode has nodatasum flag set
        btrfs: don't reserve space for checksums when writing to nocow files
        btrfs: add new unused block groups to the list of unused block groups
        btrfs: do not delete unused block group if it may be used soon
        btrfs: add and use helper to check if block group is used
        btrfs: don't drop extent_map for free space inode on write error
      1f3a3e2a
    • Linus Torvalds's avatar
      Merge tag 'linux_kselftest-kunit-fixes-6.8-rc5' of... · 91f842ff
      Linus Torvalds authored
      Merge tag 'linux_kselftest-kunit-fixes-6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull KUnit fix from Shuah Khan:
       "One important fix to unregister kunit_bus when KUnit module is
        unloaded.
      
        Not doing so causes an error when KUnit module tries to re-register
        the bus when it gets reloaded"
      
      * tag 'linux_kselftest-kunit-fixes-6.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        kunit: device: Unregister the kunit_bus on shutdown
      91f842ff
    • Felix Fietkau's avatar
      netfilter: nf_tables: fix bidirectional offload regression · 84443741
      Felix Fietkau authored
      Commit 8f84780b ("netfilter: flowtable: allow unidirectional rules")
      made unidirectional flow offload possible, while completely ignoring (and
      breaking) bidirectional flow offload for nftables.
      Add the missing flag that was left out as an exercise for the reader :)
      
      Cc: Vlad Buslov <vladbu@nvidia.com>
      Fixes: 8f84780b ("netfilter: flowtable: allow unidirectional rules")
      Reported-by: default avatarDaniel Golle <daniel@makrotopia.org>
      Signed-off-by: default avatarFelix Fietkau <nbd@nbd.name>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      84443741
    • Kyle Swenson's avatar
      netfilter: nat: restore default DNAT behavior · 0f1ae282
      Kyle Swenson authored
      When a DNAT rule is configured via iptables with different port ranges,
      
      iptables -t nat -A PREROUTING -p tcp -d 10.0.0.2 -m tcp --dport 32000:32010
      -j DNAT --to-destination 192.168.0.10:21000-21010
      
      we seem to be DNATing to some random port on the LAN side. While this is
      expected if --random is passed to the iptables command, it is not
      expected without passing --random.  The expected behavior (and the
      observed behavior prior to the commit in the "Fixes" tag) is the traffic
      will be DNAT'd to 192.168.0.10:21000 unless there is a tuple collision
      with that destination.  In that case, we expect the traffic to be
      instead DNAT'd to 192.168.0.10:21001, so on so forth until the end of
      the range.
      
      This patch intends to restore the behavior observed prior to the "Fixes"
      tag.
      
      Fixes: 6ed5943f ("netfilter: nat: remove l4 protocol port rovers")
      Signed-off-by: default avatarKyle Swenson <kyle.swenson@est.tech>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      0f1ae282
    • Pablo Neira Ayuso's avatar
      netfilter: nft_set_pipapo: fix missing : in kdoc · f6374a82
      Pablo Neira Ayuso authored
      Add missing : in kdoc field names.
      
      Fixes: 8683f4b9 ("nft_set_pipapo: Prepare for vectorised implementation: helpers")
      Reported-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      f6374a82
    • Sasha Neftin's avatar
      igc: Remove temporary workaround · 55ea9899
      Sasha Neftin authored
      PHY_CONTROL register works as defined in the IEEE 802.3 specification
      (IEEE 802.3-2008 22.2.4.1). Tidy up the temporary workaround.
      
      User impact: PHY can now be powered down when the ethernet link is down.
      
      Testing hints: ip link set down <device> (or just disconnect the
      ethernet cable).
      
      Oldest tested NVM version is: 1045:740.
      
      Fixes: 5586838f ("igc: Add code for PHY support")
      Signed-off-by: default avatarSasha Neftin <sasha.neftin@intel.com>
      Reviewed-by: default avatarPaul Menzel <pmenzel@molgen.mpg.de>
      Tested-by: default avatarNaama Meir <naamax.meir@linux.intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      55ea9899
    • Kunwu Chan's avatar
      igb: Fix string truncation warnings in igb_set_fw_version · c56d0558
      Kunwu Chan authored
      Commit 1978d3ea ("intel: fix string truncation warnings")
      fixes '-Wformat-truncation=' warnings in igb_main.c by using kasprintf.
      
      drivers/net/ethernet/intel/igb/igb_main.c:3092:53: warning:‘%d’ directive output may be truncated writing between 1 and 5 bytes into a region of size between 1 and 13 [-Wformat-truncation=]
       3092 |                                  "%d.%d, 0x%08x, %d.%d.%d",
            |                                                     ^~
      drivers/net/ethernet/intel/igb/igb_main.c:3092:34: note:directive argument in the range [0, 65535]
       3092 |                                  "%d.%d, 0x%08x, %d.%d.%d",
            |                                  ^~~~~~~~~~~~~~~~~~~~~~~~~
      drivers/net/ethernet/intel/igb/igb_main.c:3092:34: note:directive argument in the range [0, 65535]
      drivers/net/ethernet/intel/igb/igb_main.c:3090:25: note:‘snprintf’ output between 23 and 43 bytes into a destination of size 32
      
      kasprintf() returns a pointer to dynamically allocated memory
      which can be NULL upon failure.
      
      Fix this warning by using a larger space for adapter->fw_version,
      and then fall back and continue to use snprintf.
      
      Fixes: 1978d3ea ("intel: fix string truncation warnings")
      Signed-off-by: default avatarKunwu Chan <chentao@kylinos.cn>
      Cc: Kunwu Chan <kunwu.chan@hotmail.com>
      Suggested-by: default avatarJakub Kicinski <kuba@kernel.org>
      Reviewed-by: default avatarSimon Horman <horms@kernel.org>
      Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      c56d0558