1. 17 Mar, 2022 1 commit
    • Guo Ziliang's avatar
      mm: swap: get rid of livelock in swapin readahead · 029c4628
      Guo Ziliang authored
      In our testing, a livelock task was found.  Through sysrq printing, same
      stack was found every time, as follows:
      
        __swap_duplicate+0x58/0x1a0
        swapcache_prepare+0x24/0x30
        __read_swap_cache_async+0xac/0x220
        read_swap_cache_async+0x58/0xa0
        swapin_readahead+0x24c/0x628
        do_swap_page+0x374/0x8a0
        __handle_mm_fault+0x598/0xd60
        handle_mm_fault+0x114/0x200
        do_page_fault+0x148/0x4d0
        do_translation_fault+0xb0/0xd4
        do_mem_abort+0x50/0xb0
      
      The reason for the livelock is that swapcache_prepare() always returns
      EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so that it
      cannot jump out of the loop.  We suspect that the task that clears the
      SWAP_HAS_CACHE flag never gets a chance to run.  We try to lower the
      priority of the task stuck in a livelock so that the task that clears
      the SWAP_HAS_CACHE flag will run.  The results show that the system
      returns to normal after the priority is lowered.
      
      In our testing, multiple real-time tasks are bound to the same core, and
      the task in the livelock is the highest priority task of the core, so
      the livelocked task cannot be preempted.
      
      Although cond_resched() is used by __read_swap_cache_async, it is an
      empty function in the preemptive system and cannot achieve the purpose
      of releasing the CPU.  A high-priority task cannot release the CPU
      unless preempted by a higher-priority task.  But when this task is
      already the highest priority task on this core, other tasks will not be
      able to be scheduled.  So we think we should replace cond_resched() with
      schedule_timeout_uninterruptible(1), schedule_timeout_interruptible will
      call set_current_state first to set the task state, so the task will be
      removed from the running queue, so as to achieve the purpose of giving
      up the CPU and prevent it from running in kernel mode for too long.
      
      (akpm: ugly hack becomes uglier.  But it fixes the issue in a
      backportable-to-stable fashion while we hopefully work on something
      better)
      
      Link: https://lkml.kernel.org/r/20220221111749.1928222-1-cgel.zte@gmail.comSigned-off-by: default avatarGuo Ziliang <guo.ziliang@zte.com.cn>
      Reported-by: default avatarZeal Robot <zealci@zte.com.cn>
      Reviewed-by: default avatarRan Xiaokai <ran.xiaokai@zte.com.cn>
      Reviewed-by: default avatarJiang Xuexin <jiang.xuexin@zte.com.cn>
      Reviewed-by: default avatarYang Yang <yang.yang29@zte.com.cn>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roger Quadros <rogerq@kernel.org>
      Cc: Ziliang Guo <guo.ziliang@zte.com.cn>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      029c4628
  2. 15 Mar, 2022 1 commit
  3. 14 Mar, 2022 1 commit
  4. 13 Mar, 2022 2 commits
  5. 12 Mar, 2022 8 commits
  6. 11 Mar, 2022 26 commits
  7. 10 Mar, 2022 1 commit
    • Ivan Vecera's avatar
      ice: Fix race condition during interface enslave · 5cb1ebdb
      Ivan Vecera authored
      Commit 5dbbbd01 ("ice: Avoid RTNL lock when re-creating
      auxiliary device") changes a process of re-creation of aux device
      so ice_plug_aux_dev() is called from ice_service_task() context.
      This unfortunately opens a race window that can result in dead-lock
      when interface has left LAG and immediately enters LAG again.
      
      Reproducer:
      ```
      #!/bin/sh
      
      ip link add lag0 type bond mode 1 miimon 100
      ip link set lag0
      
      for n in {1..10}; do
              echo Cycle: $n
              ip link set ens7f0 master lag0
              sleep 1
              ip link set ens7f0 nomaster
      done
      ```
      
      This results in:
      [20976.208697] Workqueue: ice ice_service_task [ice]
      [20976.213422] Call Trace:
      [20976.215871]  __schedule+0x2d1/0x830
      [20976.219364]  schedule+0x35/0xa0
      [20976.222510]  schedule_preempt_disabled+0xa/0x10
      [20976.227043]  __mutex_lock.isra.7+0x310/0x420
      [20976.235071]  enum_all_gids_of_dev_cb+0x1c/0x100 [ib_core]
      [20976.251215]  ib_enum_roce_netdev+0xa4/0xe0 [ib_core]
      [20976.256192]  ib_cache_setup_one+0x33/0xa0 [ib_core]
      [20976.261079]  ib_register_device+0x40d/0x580 [ib_core]
      [20976.266139]  irdma_ib_register_device+0x129/0x250 [irdma]
      [20976.281409]  irdma_probe+0x2c1/0x360 [irdma]
      [20976.285691]  auxiliary_bus_probe+0x45/0x70
      [20976.289790]  really_probe+0x1f2/0x480
      [20976.298509]  driver_probe_device+0x49/0xc0
      [20976.302609]  bus_for_each_drv+0x79/0xc0
      [20976.306448]  __device_attach+0xdc/0x160
      [20976.310286]  bus_probe_device+0x9d/0xb0
      [20976.314128]  device_add+0x43c/0x890
      [20976.321287]  __auxiliary_device_add+0x43/0x60
      [20976.325644]  ice_plug_aux_dev+0xb2/0x100 [ice]
      [20976.330109]  ice_service_task+0xd0c/0xed0 [ice]
      [20976.342591]  process_one_work+0x1a7/0x360
      [20976.350536]  worker_thread+0x30/0x390
      [20976.358128]  kthread+0x10a/0x120
      [20976.365547]  ret_from_fork+0x1f/0x40
      ...
      [20976.438030] task:ip              state:D stack:    0 pid:213658 ppid:213627 flags:0x00004084
      [20976.446469] Call Trace:
      [20976.448921]  __schedule+0x2d1/0x830
      [20976.452414]  schedule+0x35/0xa0
      [20976.455559]  schedule_preempt_disabled+0xa/0x10
      [20976.460090]  __mutex_lock.isra.7+0x310/0x420
      [20976.464364]  device_del+0x36/0x3c0
      [20976.467772]  ice_unplug_aux_dev+0x1a/0x40 [ice]
      [20976.472313]  ice_lag_event_handler+0x2a2/0x520 [ice]
      [20976.477288]  notifier_call_chain+0x47/0x70
      [20976.481386]  __netdev_upper_dev_link+0x18b/0x280
      [20976.489845]  bond_enslave+0xe05/0x1790 [bonding]
      [20976.494475]  do_setlink+0x336/0xf50
      [20976.502517]  __rtnl_newlink+0x529/0x8b0
      [20976.543441]  rtnl_newlink+0x43/0x60
      [20976.546934]  rtnetlink_rcv_msg+0x2b1/0x360
      [20976.559238]  netlink_rcv_skb+0x4c/0x120
      [20976.563079]  netlink_unicast+0x196/0x230
      [20976.567005]  netlink_sendmsg+0x204/0x3d0
      [20976.570930]  sock_sendmsg+0x4c/0x50
      [20976.574423]  ____sys_sendmsg+0x1eb/0x250
      [20976.586807]  ___sys_sendmsg+0x7c/0xc0
      [20976.606353]  __sys_sendmsg+0x57/0xa0
      [20976.609930]  do_syscall_64+0x5b/0x1a0
      [20976.613598]  entry_SYSCALL_64_after_hwframe+0x65/0xca
      
      1. Command 'ip link ... set nomaster' causes that ice_plug_aux_dev()
         is called from ice_service_task() context, aux device is created
         and associated device->lock is taken.
      2. Command 'ip link ... set master...' calls ice's notifier under
         RTNL lock and that notifier calls ice_unplug_aux_dev(). That
         function tries to take aux device->lock but this is already taken
         by ice_plug_aux_dev() in step 1
      3. Later ice_plug_aux_dev() tries to take RTNL lock but this is already
         taken in step 2
      4. Dead-lock
      
      The patch fixes this issue by following changes:
      - Bit ICE_FLAG_PLUG_AUX_DEV is kept to be set during ice_plug_aux_dev()
        call in ice_service_task()
      - The bit is checked in ice_clear_rdma_cap() and only if it is not set
        then ice_unplug_aux_dev() is called. If it is set (in other words
        plugging of aux device was requested and ice_plug_aux_dev() is
        potentially running) then the function only clears the bit
      - Once ice_plug_aux_dev() call (in ice_service_task) is finished
        the bit ICE_FLAG_PLUG_AUX_DEV is cleared but it is also checked
        whether it was already cleared by ice_clear_rdma_cap(). If so then
        aux device is unplugged.
      Signed-off-by: default avatarIvan Vecera <ivecera@redhat.com>
      Co-developed-by: default avatarPetr Oros <poros@redhat.com>
      Signed-off-by: default avatarPetr Oros <poros@redhat.com>
      Reviewed-by: default avatarDave Ertman <david.m.ertman@intel.com>
      Link: https://lore.kernel.org/r/20220310171641.3863659-1-ivecera@redhat.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      5cb1ebdb