• Kirill Smelkov's avatar
    fuse: require /dev/fuse reads to have enough buffer capacity (take 2) · 1fb027d7
    Kirill Smelkov authored
    [ This retries commit d4b13963 ("fuse: require /dev/fuse reads to have
    enough buffer capacity"), which was reverted.  In this version we require
    only `sizeof(fuse_in_header) + sizeof(fuse_write_in)` instead of 4K for
    FUSE request header room, because, contrary to libfuse and kernel client
    behaviour, GlusterFS actually provides only so much room for request
    header. ]
    
    A FUSE filesystem server queues /dev/fuse sys_read calls to get filesystem
    requests to handle. It does not know in advance what would be that request
    as it can be anything that client issues - LOOKUP, READ, WRITE, ... Many
    requests are short and retrieve data from the filesystem. However WRITE and
    NOTIFY_REPLY write data into filesystem.
    
    Before getting into operation phase, FUSE filesystem server and kernel
    client negotiate what should be the maximum write size the client will ever
    issue. After negotiation the contract in between server/client is that the
    filesystem server then should queue /dev/fuse sys_read calls with enough
    buffer capacity to receive any client request - WRITE in particular, while
    FUSE client should not, in particular, send WRITE requests with >
    negotiated max_write payload. FUSE client in kernel and libfuse
    historically reserve 4K for request header. However an existing filesystem
    server - GlusterFS - was found which reserves only 80 bytes for header room
    (= `sizeof(fuse_in_header) + sizeof(fuse_write_in)`).
    
    Since
    
    	`sizeof(fuse_in_header) + sizeof(fuse_write_in)` ==
    	`sizeof(fuse_in_header) + sizeof(fuse_read_in)`  ==
    	`sizeof(fuse_in_header) + sizeof(fuse_notify_retrieve_in)`
    
    is the absolute minimum any sane filesystem should be using for header
    room, the contract is that filesystem server should queue sys_reads with
    `sizeof(fuse_in_header) + sizeof(fuse_write_in)` + max_write buffer.
    
    If the filesystem server does not follow this contract, what can happen
    is that fuse_dev_do_read will see that request size is > buffer size,
    and then it will return EIO to client who issued the request but won't
    indicate in any way that there is a problem to filesystem server.
    This can be hard to diagnose because for some requests, e.g. for
    NOTIFY_REPLY which mimics WRITE, there is no client thread that is
    waiting for request completion and that EIO goes nowhere, while on
    filesystem server side things look like the kernel is not replying back
    after successful NOTIFY_RETRIEVE request made by the server.
    
    We can make the problem easy to diagnose if we indicate via error return to
    filesystem server when it is violating the contract.  This should not
    practically cause problems because if a filesystem server is using shorter
    buffer, writes to it were already very likely to cause EIO, and if the
    filesystem is read-only it should be too following FUSE_MIN_READ_BUFFER
    minimum buffer size.
    
    Please see [1] for context where the problem of stuck filesystem was hit
    for real (because kernel client was incorrectly sending more than
    max_write data with NOTIFY_REPLY; see also previous patch), how the
    situation was traced and for more involving patch that did not make it
    into the tree.
    
    [1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2Signed-off-by: Kirill Smelkov's avatarKirill Smelkov <kirr@nexedi.com>
    Tested-by: default avatarSander Eikelenboom <linux@eikelenboom.it>
    Cc: Han-Wen Nienhuys <hanwen@google.com>
    Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
    Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
    1fb027d7
dev.c 55.1 KB