1. 10 Sep, 2019 10 commits
    • Miklos Szeredi's avatar
      fuse: add pages to fuse_args · 68583165
      Miklos Szeredi authored
      Derive fuse_args_pages from fuse_args. This is used to handle requests
      which use pages for input or output.  The related flags are added to
      fuse_args.
      
      New FR_ALLOC_PAGES flags is added to indicate whether the page arrays in
      fuse_req need to be freed by fuse_put_request() or not.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      68583165
    • Miklos Szeredi's avatar
      fuse: convert destroy to simple api · 1ccd1ea2
      Miklos Szeredi authored
      We can use the "force" flag to make sure the DESTROY request is always sent
      to userspace.  So no need to keep it allocated during the lifetime of the
      filesystem.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      1ccd1ea2
    • Miklos Szeredi's avatar
      fuse: add nocreds to fuse_args · e413754b
      Miklos Szeredi authored
      In some cases it makes no sense to set pid/uid/gid fields in the request
      header.  Allow fuse_simple_background() to omit these.  This is only
      required in the "force" case, so for now just WARN if set otherwise.
      
      Fold fuse_get_req_nofail_nopages() into its only caller.  Comment is
      obsolete anyway.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      e413754b
    • Miklos Szeredi's avatar
      fuse: convert fuse_force_forget() to simple api · 3545fe21
      Miklos Szeredi authored
      Move this function to the readdir.c where its only caller resides.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      3545fe21
    • Miklos Szeredi's avatar
      fuse: add noreply to fuse_args · 454a7613
      Miklos Szeredi authored
      This will be used by fuse_force_forget().
      
      We can expand fuse_request_send() into fuse_simple_request().  The
      FR_WAITING bit has already been set, no need to check.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      454a7613
    • Miklos Szeredi's avatar
      fuse: convert flush to simple api · c500ebaa
      Miklos Szeredi authored
      Add 'force' to fuse_args and use fuse_get_req_nofail_nopages() to allocate
      the request in that case.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      c500ebaa
    • Miklos Szeredi's avatar
      fuse: simplify 'nofail' request · 40ac7ab2
      Miklos Szeredi authored
      Instead of complex games with a reserved request, just use __GFP_NOFAIL.
      
      Both calers (flush, readdir) guarantee that connection was already
      initialized, so no need to wait for fc->initialized.
      
      Also remove unneeded clearing of FR_BACKGROUND flag.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      40ac7ab2
    • Miklos Szeredi's avatar
      fuse: rearrange and resize fuse_args fields · 1f4e9d03
      Miklos Szeredi authored
      This makes the structure better packed.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      1f4e9d03
    • Miklos Szeredi's avatar
      fuse: flatten 'struct fuse_args' · d5b48543
      Miklos Szeredi authored
      ...to make future expansion simpler.  The hiearachical structure is a
      historical thing that does not serve any practical purpose.
      
      The generated code is excatly the same before and after the patch.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      d5b48543
    • Eric Biggers's avatar
      fuse: fix deadlock with aio poll and fuse_iqueue::waitq.lock · 76e43c8c
      Eric Biggers authored
      When IOCB_CMD_POLL is used on the FUSE device, aio_poll() disables IRQs
      and takes kioctx::ctx_lock, then fuse_iqueue::waitq.lock.
      
      This may have to wait for fuse_iqueue::waitq.lock to be released by one
      of many places that take it with IRQs enabled.  Since the IRQ handler
      may take kioctx::ctx_lock, lockdep reports that a deadlock is possible.
      
      Fix it by protecting the state of struct fuse_iqueue with a separate
      spinlock, and only accessing fuse_iqueue::waitq using the versions of
      the waitqueue functions which do IRQ-safe locking internally.
      
      Reproducer:
      
      	#include <fcntl.h>
      	#include <stdio.h>
      	#include <sys/mount.h>
      	#include <sys/stat.h>
      	#include <sys/syscall.h>
      	#include <unistd.h>
      	#include <linux/aio_abi.h>
      
      	int main()
      	{
      		char opts[128];
      		int fd = open("/dev/fuse", O_RDWR);
      		aio_context_t ctx = 0;
      		struct iocb cb = { .aio_lio_opcode = IOCB_CMD_POLL, .aio_fildes = fd };
      		struct iocb *cbp = &cb;
      
      		sprintf(opts, "fd=%d,rootmode=040000,user_id=0,group_id=0", fd);
      		mkdir("mnt", 0700);
      		mount("foo",  "mnt", "fuse", 0, opts);
      		syscall(__NR_io_setup, 1, &ctx);
      		syscall(__NR_io_submit, ctx, 1, &cbp);
      	}
      
      Beginning of lockdep output:
      
      	=====================================================
      	WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
      	5.3.0-rc5 #9 Not tainted
      	-----------------------------------------------------
      	syz_fuse/135 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
      	000000003590ceda (&fiq->waitq){+.+.}, at: spin_lock include/linux/spinlock.h:338 [inline]
      	000000003590ceda (&fiq->waitq){+.+.}, at: aio_poll fs/aio.c:1751 [inline]
      	000000003590ceda (&fiq->waitq){+.+.}, at: __io_submit_one.constprop.0+0x203/0x5b0 fs/aio.c:1825
      
      	and this task is already holding:
      	0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: spin_lock_irq include/linux/spinlock.h:363 [inline]
      	0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: aio_poll fs/aio.c:1749 [inline]
      	0000000075037284 (&(&ctx->ctx_lock)->rlock){..-.}, at: __io_submit_one.constprop.0+0x1f4/0x5b0 fs/aio.c:1825
      	which would create a new lock dependency:
      	 (&(&ctx->ctx_lock)->rlock){..-.} -> (&fiq->waitq){+.+.}
      
      	but this new dependency connects a SOFTIRQ-irq-safe lock:
      	 (&(&ctx->ctx_lock)->rlock){..-.}
      
      	[...]
      
      Reported-by: syzbot+af05535bb79520f95431@syzkaller.appspotmail.com
      Reported-by: syzbot+d86c4426a01f60feddc7@syzkaller.appspotmail.com
      Fixes: bfe4037e ("aio: implement IOCB_CMD_POLL")
      Cc: <stable@vger.kernel.org> # v4.19+
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      76e43c8c
  2. 06 Sep, 2019 3 commits
  3. 05 Sep, 2019 4 commits
    • David Howells's avatar
      mtd: Provide fs_context-aware mount_mtd() replacement · 0f071004
      David Howells authored
      Provide a function, get_tree_mtd(), to replace mount_mtd(), using an
      fs_context struct to hold the parameters.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: David Woodhouse <dwmw2@infradead.org>
      cc: Brian Norris <computersforpeace@gmail.com>
      cc: Boris Brezillon <bbrezillon@kernel.org>
      cc: Marek Vasut <marek.vasut@gmail.com>
      cc: Richard Weinberger <richard@nod.at>
      cc: linux-mtd@lists.infradead.org
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      0f071004
    • David Howells's avatar
      vfs: Create fs_context-aware mount_bdev() replacement · fe62c3a4
      David Howells authored
      Create a function, get_tree_bdev(), that is fs_context-aware and a
      ->get_tree() counterpart of mount_bdev().
      
      It caches the block device pointer in the fs_context struct so that this
      information can be passed into sget_fc()'s test and set functions.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Jens Axboe <axboe@kernel.dk>
      cc: linux-block@vger.kernel.org
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      fe62c3a4
    • Al Viro's avatar
      new helper: get_tree_keyed() · 533770cc
      Al Viro authored
      For vfs_get_keyed_super users.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      533770cc
    • Eric Biggers's avatar
      vfs: set fs_context::user_ns for reconfigure · 1dd9bc08
      Eric Biggers authored
      fs_context::user_ns is used by fuse_parse_param(), even during remount,
      so it needs to be set to the existing value for reconfigure.
      
      Reproducer:
      
      	#include <fcntl.h>
      	#include <sys/mount.h>
      
      	int main()
      	{
      		char opts[128];
      		int fd = open("/dev/fuse", O_RDWR);
      
      		sprintf(opts, "fd=%d,rootmode=040000,user_id=0,group_id=0", fd);
      		mkdir("mnt", 0777);
      		mount("foo",  "mnt", "fuse.foo", 0, opts);
      		mount("foo", "mnt", "fuse.foo", MS_REMOUNT, opts);
      	}
      
      Crash:
      	BUG: kernel NULL pointer dereference, address: 0000000000000000
      	#PF: supervisor read access in kernel mode
      	#PF: error_code(0x0000) - not-present page
      	PGD 0 P4D 0
      	Oops: 0000 [#1] SMP
      	CPU: 0 PID: 129 Comm: syz_make_kuid Not tainted 5.3.0-rc5-next-20190821 #3
      	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
      	RIP: 0010:map_id_range_down+0xb/0xc0 kernel/user_namespace.c:291
      	[...]
      	Call Trace:
      	 map_id_down kernel/user_namespace.c:312 [inline]
      	 make_kuid+0xe/0x10 kernel/user_namespace.c:389
      	 fuse_parse_param+0x116/0x210 fs/fuse/inode.c:523
      	 vfs_parse_fs_param+0xdb/0x1b0 fs/fs_context.c:145
      	 vfs_parse_fs_string+0x6a/0xa0 fs/fs_context.c:188
      	 generic_parse_monolithic+0x85/0xc0 fs/fs_context.c:228
      	 parse_monolithic_mount_data+0x1b/0x20 fs/fs_context.c:708
      	 do_remount fs/namespace.c:2525 [inline]
      	 do_mount+0x39a/0xa60 fs/namespace.c:3107
      	 ksys_mount+0x7d/0xd0 fs/namespace.c:3325
      	 __do_sys_mount fs/namespace.c:3339 [inline]
      	 __se_sys_mount fs/namespace.c:3336 [inline]
      	 __x64_sys_mount+0x20/0x30 fs/namespace.c:3336
      	 do_syscall_64+0x4a/0x1a0 arch/x86/entry/common.c:290
      	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Reported-by: syzbot+7d6a57304857423318a5@syzkaller.appspotmail.com
      Fixes: 408cbe695350 ("vfs: Convert fuse to use the new mount API")
      Cc: David Howells <dhowells@redhat.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Reviewed-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      1dd9bc08
  4. 02 Sep, 2019 3 commits
    • Miklos Szeredi's avatar
      cuse: fix broken release · 56d250ef
      Miklos Szeredi authored
      The inode parameter in cuse_release() is likely *not* a fuse inode.  It's a
      small wonder it didn't blow up until now.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      56d250ef
    • Maxim Patlasov's avatar
      fuse: cleanup fuse_wait_on_page_writeback · 17b2cbe2
      Maxim Patlasov authored
      fuse_wait_on_page_writeback() always returns zero and nobody cares.
      Let's make it void.
      Signed-off-by: default avatarMaxim Patlasov <mpatlasov@virtuozzo.com>
      Signed-off-by: default avatarVasily Averin <vvs@virtuozzo.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      17b2cbe2
    • Kirill Smelkov's avatar
      fuse: require /dev/fuse reads to have enough buffer capacity (take 2) · 1fb027d7
      Kirill Smelkov authored
      [ This retries commit d4b13963 ("fuse: require /dev/fuse reads to have
      enough buffer capacity"), which was reverted.  In this version we require
      only `sizeof(fuse_in_header) + sizeof(fuse_write_in)` instead of 4K for
      FUSE request header room, because, contrary to libfuse and kernel client
      behaviour, GlusterFS actually provides only so much room for request
      header. ]
      
      A FUSE filesystem server queues /dev/fuse sys_read calls to get filesystem
      requests to handle. It does not know in advance what would be that request
      as it can be anything that client issues - LOOKUP, READ, WRITE, ... Many
      requests are short and retrieve data from the filesystem. However WRITE and
      NOTIFY_REPLY write data into filesystem.
      
      Before getting into operation phase, FUSE filesystem server and kernel
      client negotiate what should be the maximum write size the client will ever
      issue. After negotiation the contract in between server/client is that the
      filesystem server then should queue /dev/fuse sys_read calls with enough
      buffer capacity to receive any client request - WRITE in particular, while
      FUSE client should not, in particular, send WRITE requests with >
      negotiated max_write payload. FUSE client in kernel and libfuse
      historically reserve 4K for request header. However an existing filesystem
      server - GlusterFS - was found which reserves only 80 bytes for header room
      (= `sizeof(fuse_in_header) + sizeof(fuse_write_in)`).
      
      Since
      
      	`sizeof(fuse_in_header) + sizeof(fuse_write_in)` ==
      	`sizeof(fuse_in_header) + sizeof(fuse_read_in)`  ==
      	`sizeof(fuse_in_header) + sizeof(fuse_notify_retrieve_in)`
      
      is the absolute minimum any sane filesystem should be using for header
      room, the contract is that filesystem server should queue sys_reads with
      `sizeof(fuse_in_header) + sizeof(fuse_write_in)` + max_write buffer.
      
      If the filesystem server does not follow this contract, what can happen
      is that fuse_dev_do_read will see that request size is > buffer size,
      and then it will return EIO to client who issued the request but won't
      indicate in any way that there is a problem to filesystem server.
      This can be hard to diagnose because for some requests, e.g. for
      NOTIFY_REPLY which mimics WRITE, there is no client thread that is
      waiting for request completion and that EIO goes nowhere, while on
      filesystem server side things look like the kernel is not replying back
      after successful NOTIFY_RETRIEVE request made by the server.
      
      We can make the problem easy to diagnose if we indicate via error return to
      filesystem server when it is violating the contract.  This should not
      practically cause problems because if a filesystem server is using shorter
      buffer, writes to it were already very likely to cause EIO, and if the
      filesystem is read-only it should be too following FUSE_MIN_READ_BUFFER
      minimum buffer size.
      
      Please see [1] for context where the problem of stuck filesystem was hit
      for real (because kernel client was incorrectly sending more than
      max_write data with NOTIFY_REPLY; see also previous patch), how the
      situation was traced and for more involving patch that did not make it
      into the tree.
      
      [1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2Signed-off-by: Kirill Smelkov's avatarKirill Smelkov <kirr@nexedi.com>
      Tested-by: default avatarSander Eikelenboom <linux@eikelenboom.it>
      Cc: Han-Wen Nienhuys <hanwen@google.com>
      Cc: Jakob Unterwurzacher <jakobunt@gmail.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      1fb027d7
  5. 25 Aug, 2019 20 commits