    workqueue: Warn when a new worker could not be created · 3f0ea0b8
    Petr Mladek authored
    The workqueue watchdog reports a lockup when there has been no progress
    in the worker pool for a long time. Progress means that a pending
    work item starts being processed.
    
    The progress is guaranteed by using idle workers or creating new workers
    for pending work items.
    
    There are several reasons why a new worker could not be created:
    
       + there is not enough memory
    
       + there is no free pool ID (IDR API)
    
       + the system reached the PID limit
    
       + the process creating the new worker was interrupted
    
       + the last idle worker (manager) has not been scheduled for a long
         time. It was not able to even start creating the kthread.
    
    None of these failures is reported at the moment. The only clue is that
    show_one_worker_pool() prints that there is a manager. The manager is
    the last idle worker, which is responsible for creating a new one. But
    it is not clear whether create_worker() is failing, or why.
    
    Make the debugging easier by printing errors in create_worker().
    
    The error code is important, especially the one from
    kthread_create_on_node(). It helps to distinguish the various reasons.
    For example, reaching the memory limit (-ENOMEM), reaching other system
    limits (-EAGAIN), or the creating process being interrupted (-EINTR).
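
    As an illustration only, the mapping from error code to failure reason
    described above can be sketched in userspace C. This is not code from
    workqueue.c: the helper worker_create_failure_reason() and the wording
    of its strings are hypothetical, and only the three error codes come
    from this message.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the kernel): translate a negative
 * errno value into one of the failure reasons named in the message.
 */
static const char *worker_create_failure_reason(int err)
{
	switch (err) {
	case -ENOMEM:
		return "memory limit reached";
	case -EAGAIN:
		return "other system limit reached";
	case -EINTR:
		return "creating process was interrupted";
	default:
		return "unknown reason";
	}
}

/* Usage example: each error code maps to a distinct reason. */
static void check_failure_reasons(void)
{
	assert(strcmp(worker_create_failure_reason(-ENOMEM),
		      worker_create_failure_reason(-EAGAIN)) != 0);
	assert(strcmp(worker_create_failure_reason(-EINTR),
		      "creating process was interrupted") == 0);
}
```

    Without the error code, all three cases would look identical in the log.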
    
    Use pr_err_once() to avoid repeating the same error every CREATE_COOLDOWN
    for each stuck worker pool.
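
    The print-once idea can be approximated in userspace C as follows. This
    is a minimal sketch, not the kernel's implementation: report_once(),
    try_create_worker(), and the messages_emitted counter are hypothetical
    names introduced only for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Counts how many messages were actually printed (for the demo only). */
static int messages_emitted;

/*
 * Hypothetical print-once macro: every call site gets its own static
 * flag, so a given message is printed at most once per call site.
 */
#define report_once(...)					\
	do {							\
		static bool already_reported;			\
		if (!already_reported) {			\
			already_reported = true;		\
			fprintf(stderr, __VA_ARGS__);		\
			messages_emitted++;			\
		}						\
	} while (0)

/* Hypothetical stand-in for a creation attempt that always fails. */
static int try_create_worker(int err)
{
	report_once("failed to create a worker: error %d\n", err);
	return err;
}

/* Usage example: three failed retries produce a single message. */
static void retry_demo(void)
{
	for (int i = 0; i < 3; i++)
		try_create_worker(-ENOMEM);
	assert(messages_emitted == 1);
}
```

    The trade-off is that a later failure with a different error code at
    the same call site is not printed either.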
    
    A ratelimited printk() might be better. It would show whether the problem
    persists, and it would make it clearer whether the create_worker() errors
    and the workqueue stalls are related. Also, old messages might get lost
    when the internal log buffer is full. The problem is that printk() might
    touch the watchdog. For example, see touch_nmi_watchdog() in
    serial8250_console_write(). It would require synchronizing the start and
    length of the ratelimit interval with the workqueue watchdog. Otherwise,
    the error messages might break the watchdog. This does not look worth
    the complexity.
    
    Signed-off-by: Petr Mladek <pmladek@suse.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>