    ext4: limit the number of retries after discarding preallocations blocks · 80fa46d6
    Theodore Ts'o authored
    This patch avoids threads live-locking for hours when a large number
    of threads are competing over the last few free extents as blocks
    get added to and removed from preallocation pools.  From our bug
    reporter:
    
       A reliable way to trigger this is to have multiple writers
       continuously write() to files when the filesystem is full, while
       small amounts of space are freed (e.g. by truncating a large file
       -1MiB at a time). On a local filesystem, this can be done by
       simply not checking the return code of write (0) and/or the error
       (ENOSPC) that is set. Over NFS with an async mount, even clients
       with proper error checking will behave this way since the Linux NFS
       client implementation will not propagate the server errors [the
       write syscalls immediately return success] until the file handle is
       closed. This leads to a situation where NFS clients send a
       continuous stream of WRITE RPCs which result in ENOSPC -- but
       since the client isn't seeing this, the stream of writes continues
       at maximum network speed.
    
       When some space does appear, multiple writers will all attempt to
       claim it for their current write. For NFS, we may see dozens to
       hundreds of threads that do this.
    
       The real-world scenario of this is database backup tooling (in
       particular, github.com/mdkent/percona-xtrabackup) which may write
       large files (>1TiB) to NFS for safekeeping. Some temporary files
       are written, rewound, and read back -- all before closing the file
       handle (the temp file is actually unlinked, to trigger automatic
       deletion on close/crash.) An application like this operating on an
       async NFS mount will not see an error code until TiB have been
       written/read.
    
       The lockup was observed when running this database backup on large
       filesystems (64 TiB in this case) with a high number of block
       groups and no free space. Fragmentation is generally not a factor
       in this filesystem (~thousands of large files, mostly contiguous
       except for the parts written while the filesystem is at capacity.)
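    
    The retry limit named in the subject line can be illustrated with a
    small userspace model. This is only a sketch under stated
    assumptions, not the mballoc.c change itself: struct toy_fs, the
    toy_* helpers, the contended flag and the MAX_DISCARD_RETRIES value
    are all hypothetical names chosen for illustration. It shows why an
    unbounded "discard preallocations and try again" loop may never
    terminate when competing writers keep reclaiming the freed blocks,
    and how a cap turns that into a prompt ENOSPC.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical cap on discard-and-retry rounds; the value is an assumption. */
#define MAX_DISCARD_RETRIES 3

/* Toy allocator state standing in for a nearly full filesystem. */
struct toy_fs {
    long free_blocks;          /* blocks any writer may claim right now */
    long preallocated_blocks;  /* blocks parked in preallocation pools */
    bool contended;            /* a competing writer re-grabs freed blocks */
    int  discard_calls;        /* number of discard rounds performed */
};

/* Try to take `want` blocks from the free pool. */
static bool toy_try_alloc(struct toy_fs *fs, long want)
{
    if (fs->free_blocks < want)
        return false;
    fs->free_blocks -= want;
    return true;
}

/* Release preallocated blocks; returns how many blocks were released. */
static long toy_discard_preallocations(struct toy_fs *fs)
{
    long freed = fs->preallocated_blocks;

    fs->free_blocks += freed;
    fs->preallocated_blocks = 0;
    fs->discard_calls++;

    /* Model the race: a competitor claims the space before we retry. */
    if (fs->contended) {
        fs->preallocated_blocks = fs->free_blocks;
        fs->free_blocks = 0;
    }
    return freed;
}

/*
 * Allocate with a bounded number of discard rounds. Returns 0 on success,
 * or -ENOSPC when nothing could be released or the retry cap is reached.
 */
static int toy_alloc_bounded(struct toy_fs *fs, long want)
{
    int retries = 0;

    assert(want > 0);
    for (;;) {
        if (toy_try_alloc(fs, want))
            return 0;
        if (retries++ >= MAX_DISCARD_RETRIES)
            return -ENOSPC;   /* give up instead of spinning */
        if (toy_discard_preallocations(fs) == 0)
            return -ENOSPC;   /* nothing left to release */
    }
}
```

    Without the retries check, the contended case in this model would
    loop forever, because every discard round reports progress while
    the following allocation attempt still fails.
    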
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Cc: stable@kernel.org
mballoc.c 185 KB