• Ahmed Zaki's avatar
    iavf: fix a deadlock caused by rtnl and driver's lock circular dependencies · d1639a17
    Ahmed Zaki authored
    A driver's lock (crit_lock) is used to serialize all the driver's tasks.
    Lockdep, however, shows a circular dependency between rtnl and
    crit_lock. This happens when an ndo that already holds the rtnl requests
    the driver to reset, since the reset task (in some paths) tries to grab
    rtnl to either change real number of queues of update netdev features.
    
      [566.241851] ======================================================
      [566.241893] WARNING: possible circular locking dependency detected
      [566.241936] 6.2.14-100.fc36.x86_64+debug #1 Tainted: G           OE
      [566.241984] ------------------------------------------------------
      [566.242025] repro.sh/2604 is trying to acquire lock:
      [566.242061] ffff9280fc5ceee8 (&adapter->crit_lock){+.+.}-{3:3}, at: iavf_close+0x3c/0x240 [iavf]
      [566.242167]
                   but task is already holding lock:
      [566.242209] ffffffff9976d350 (rtnl_mutex){+.+.}-{3:3}, at: iavf_remove+0x6b5/0x730 [iavf]
      [566.242300]
                   which lock already depends on the new lock.
    
      [566.242353]
                   the existing dependency chain (in reverse order) is:
      [566.242401]
                   -> #1 (rtnl_mutex){+.+.}-{3:3}:
      [566.242451]        __mutex_lock+0xc1/0xbb0
      [566.242489]        iavf_init_interrupt_scheme+0x179/0x440 [iavf]
      [566.242560]        iavf_watchdog_task+0x80b/0x1400 [iavf]
      [566.242627]        process_one_work+0x2b3/0x560
      [566.242663]        worker_thread+0x4f/0x3a0
      [566.242696]        kthread+0xf2/0x120
      [566.242730]        ret_from_fork+0x29/0x50
      [566.242763]
                   -> #0 (&adapter->crit_lock){+.+.}-{3:3}:
      [566.242815]        __lock_acquire+0x15ff/0x22b0
      [566.242869]        lock_acquire+0xd2/0x2c0
      [566.242901]        __mutex_lock+0xc1/0xbb0
      [566.242934]        iavf_close+0x3c/0x240 [iavf]
      [566.242997]        __dev_close_many+0xac/0x120
      [566.243036]        dev_close_many+0x8b/0x140
      [566.243071]        unregister_netdevice_many_notify+0x165/0x7c0
      [566.243116]        unregister_netdevice_queue+0xd3/0x110
      [566.243157]        iavf_remove+0x6c1/0x730 [iavf]
      [566.243217]        pci_device_remove+0x33/0xa0
      [566.243257]        device_release_driver_internal+0x1bc/0x240
      [566.243299]        pci_stop_bus_device+0x6c/0x90
      [566.243338]        pci_stop_and_remove_bus_device+0xe/0x20
      [566.243380]        pci_iov_remove_virtfn+0xd1/0x130
      [566.243417]        sriov_disable+0x34/0xe0
      [566.243448]        ice_free_vfs+0x2da/0x330 [ice]
      [566.244383]        ice_sriov_configure+0x88/0xad0 [ice]
      [566.245353]        sriov_numvfs_store+0xde/0x1d0
      [566.246156]        kernfs_fop_write_iter+0x15e/0x210
      [566.246921]        vfs_write+0x288/0x530
      [566.247671]        ksys_write+0x74/0xf0
      [566.248408]        do_syscall_64+0x58/0x80
      [566.249145]        entry_SYSCALL_64_after_hwframe+0x72/0xdc
      [566.249886]
                     other info that might help us debug this:
    
      [566.252014]  Possible unsafe locking scenario:
    
      [566.253432]        CPU0                    CPU1
      [566.254118]        ----                    ----
      [566.254800]   lock(rtnl_mutex);
      [566.255514]                                lock(&adapter->crit_lock);
      [566.256233]                                lock(rtnl_mutex);
      [566.256897]   lock(&adapter->crit_lock);
      [566.257388]
                      *** DEADLOCK ***
    
    The deadlock can be triggered by a script that is continuously resetting
    the VF adapter while doing other operations requiring RTNL, e.g:
    
    	while :; do
    		ip link set $VF up
    		ethtool --set-channels $VF combined 2
    		ip link set $VF down
    		ip link set $VF up
    		ethtool --set-channels $VF combined 4
    		ip link set $VF down
    	done
    
    Any operation that triggers a reset can substitute "ethtool --set-channles"
    
    As a fix, add a new task "finish_config" that do all the work which
    needs rtnl lock. With the exception of iavf_remove(), all work that
    require rtnl should be called from this task.
    
    As for iavf_remove(), at the point where we need to call
    unregister_netdevice() (and grab rtnl_lock), we make sure the finish_config
    task is not running (cancel_work_sync()) to safely grab rtnl. Subsequent
    finish_config work cannot restart after that since the task is guarded
    by the __IAVF_IN_REMOVE_TASK bit in iavf_schedule_finish_config().
    
    Fixes: 5ac49f3c ("iavf: use mutexes for locking of critical sections")
    Signed-off-by: default avatarAhmed Zaki <ahmed.zaki@intel.com>
    Signed-off-by: default avatarMateusz Palczewski <mateusz.palczewski@intel.com>
    Tested-by: default avatarRafal Romanowski <rafal.romanowski@intel.com>
    Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
    d1639a17
iavf.h 20.7 KB