1. 09 Aug, 2022 40 commits
    • add barriers to buffer_uptodate and set_buffer_uptodate · d4252071
      Mikulas Patocka authored
      Let's have a look at this piece of code in __bread_slow:
      
      	get_bh(bh);
      	bh->b_end_io = end_buffer_read_sync;
      	submit_bh(REQ_OP_READ, 0, bh);
      	wait_on_buffer(bh);
      	if (buffer_uptodate(bh))
      		return bh;
      
      Neither wait_on_buffer nor buffer_uptodate contains any memory barrier.
      Consequently, if someone calls sb_bread and then reads the buffer data,
      the read of buffer data may be executed before wait_on_buffer(bh) on
      architectures with weak memory ordering and it may return invalid data.
      
      Fix this bug by adding a memory barrier to set_buffer_uptodate and an
      acquire barrier to buffer_uptodate (in a similar way as
      folio_test_uptodate and folio_mark_uptodate).
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d4252071
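The ordering requirement described in this commit can be sketched in userspace with C11 atomics. This is only an analogue under stated assumptions: `struct toy_buffer`, `toy_set_uptodate()` and `toy_uptodate()` are hypothetical names, and the kernel patch itself uses kernel barrier primitives rather than `<stdatomic.h>`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical userspace stand-in for a buffer_head; not a kernel type. */
struct toy_buffer {
	char data[16];
	atomic_bool uptodate;
};

/*
 * Writer side: fill in the data, then set the flag with release ordering
 * so the data stores cannot be reordered after the flag store.
 */
static void toy_set_uptodate(struct toy_buffer *b, const char *src)
{
	strncpy(b->data, src, sizeof(b->data) - 1);
	b->data[sizeof(b->data) - 1] = '\0';
	atomic_store_explicit(&b->uptodate, true, memory_order_release);
}

/*
 * Reader side: acquire load. Reads of b->data that follow a true result
 * cannot be hoisted above the flag check.
 */
static bool toy_uptodate(struct toy_buffer *b)
{
	return atomic_load_explicit(&b->uptodate, memory_order_acquire);
}
```

A reader that sees `toy_uptodate()` return true is then guaranteed to observe the data written before the matching `toy_set_uptodate()`.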
    • Merge tag 'nfsd-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux · e394ff83
      Linus Torvalds authored
      Pull nfsd updates from Chuck Lever:
       "Work on 'courteous server', which was introduced in 5.19, continues
        apace. This release introduces a more flexible limit on the number of
        NFSv4 clients that NFSD allows, now that NFSv4 clients can remain in
        courtesy state long after the lease expiration timeout. The client
        limit is adjusted based on the physical memory size of the server.
      
        The NFSD filecache is a cache of files held open by NFSv4 clients or
        recently touched by NFSv2 or NFSv3 clients. This cache had some
        significant scalability constraints that have been relieved in this
        release. Thanks to all who contributed to this work.
      
        A data corruption bug found during the most recent NFS bake-a-thon
        that involves NFSv3 and NFSv4 clients writing the same file has been
        addressed in this release.
      
        This release includes several improvements in CPU scalability for
        NFSv4 operations. In addition, Neil Brown provided patches that
        simplify locking during file lookup, creation, rename, and removal,
        which enables subsequent work on making these operations more scalable.
        We expect to see that work materialize in the next release.
      
        There are also numerous single-patch fixes, clean-ups, and the usual
        improvements in observability"
      
      * tag 'nfsd-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (78 commits)
        lockd: detect and reject lock arguments that overflow
        NFSD: discard fh_locked flag and fh_lock/fh_unlock
        NFSD: use (un)lock_inode instead of fh_(un)lock for file operations
        NFSD: use explicit lock/unlock for directory ops
        NFSD: reduce locking in nfsd_lookup()
        NFSD: only call fh_unlock() once in nfsd_link()
        NFSD: always drop directory lock in nfsd_unlink()
        NFSD: change nfsd_create()/nfsd_symlink() to unlock directory before returning.
        NFSD: add posix ACLs to struct nfsd_attrs
        NFSD: add security label to struct nfsd_attrs
        NFSD: set attributes when creating symlinks
        NFSD: introduce struct nfsd_attrs
        NFSD: verify the opened dentry after setting a delegation
        NFSD: drop fh argument from alloc_init_deleg
        NFSD: Move copy offload callback arguments into a separate structure
        NFSD: Add nfsd4_send_cb_offload()
        NFSD: Remove kmalloc from nfsd4_do_async_copy()
        NFSD: Refactor nfsd4_do_copy()
        NFSD: Refactor nfsd4_cleanup_inter_ssc() (2/2)
        NFSD: Refactor nfsd4_cleanup_inter_ssc() (1/2)
        ...
      e394ff83
    • Merge tag 'fscache-fixes-20220809' of... · 15205c28
      Linus Torvalds authored
      Merge tag 'fscache-fixes-20220809' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
      
      Pull fscache updates from David Howells:
      
       - Fix a cookie access ref leak if a cookie is invalidated a second time
         before the first invalidation is actually processed.
      
       - Add a tracepoint to log cookie lookup failure
      
      * tag 'fscache-fixes-20220809' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
        fscache: add tracepoint when failing cookie
        fscache: don't leak cookie access refs if invalidation is in progress or failed
      15205c28
    • Merge tag 'afs-fixes-20220802' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs · 4b22e207
      Linus Torvalds authored
      Pull AFS fixes from David Howells:
       "Fix AFS refcount handling.
      
        The first patch converts afs to use refcount_t for its refcounts and
        the second patch fixes afs_put_call() and afs_put_server() to save the
        values they're going to log in the tracepoint before decrementing the
        refcount"
      
      * tag 'afs-fixes-20220802' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
        afs: Fix access after dec in put functions
        afs: Use refcount_t rather than atomic_t
      4b22e207
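The "save the values before decrementing" pattern from the AFS fix can be sketched as follows. This is a minimal userspace illustration, assuming a hypothetical `struct toy_server` and a stand-in `toy_trace()`; it is not the afs code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object; names are illustrative only. */
struct toy_server {
	atomic_int ref;
	unsigned int debug_id;
};

/* Stand-in for a tracepoint: records what it was given. */
static unsigned int last_traced_id;
static int last_traced_ref;

static void toy_trace(unsigned int debug_id, int ref)
{
	last_traced_id = debug_id;
	last_traced_ref = ref;
}

/* Returns true when the caller dropped the last reference. */
static bool toy_put_server(struct toy_server *s)
{
	/*
	 * Copy everything the trace needs before the decrement: once the
	 * count has dropped, another thread may free the object.
	 */
	unsigned int debug_id = s->debug_id;
	int ref = atomic_fetch_sub_explicit(&s->ref, 1,
					    memory_order_release) - 1;

	toy_trace(debug_id, ref);	/* touches only local copies */
	if (ref == 0) {
		atomic_thread_fence(memory_order_acquire);
		return true;
	}
	return false;
}
```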
    • Merge tag 'fs.setgid.v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux · 426b4ca2
      Linus Torvalds authored
      Pull setgid updates from Christian Brauner:
       "This contains the work to move setgid stripping out of individual
        filesystems and into the VFS itself.
      
        Creating files that have both the S_IXGRP and S_ISGID bit raised in
        directories that themselves have the S_ISGID bit set requires
        additional privileges to avoid security issues.
      
        When a filesystem creates a new inode it needs to take care that the
        caller is either in the group of the newly created inode or they have
        CAP_FSETID in their current user namespace and are privileged over the
        parent directory of the new inode. If either of these two conditions
        is true then the S_ISGID bit can be raised for an S_IXGRP file and if
        not it needs to be stripped.
      
        However, there are several key issues with the current implementation:
      
         - S_ISGID stripping logic is entangled with umask stripping.
      
           For example, if the umask removes the S_IXGRP bit from the file
           about to be created then the S_ISGID bit will be kept.
      
           The inode_init_owner() helper is responsible for S_ISGID stripping
           and is called before posix_acl_create(). So we can end up with two
           different orderings:
      
           1. FS without POSIX ACL support
      
              First strip umask then strip S_ISGID in inode_init_owner().
      
              In other words, if a filesystem doesn't support or enable POSIX
              ACLs then umask stripping is done directly in the vfs before
              calling into the filesystem.
      
           2. FS with POSIX ACL support
      
              First strip S_ISGID in inode_init_owner() then strip umask in
              posix_acl_create().
      
              In other words, if the filesystem does support POSIX ACLs then
              umask stripping may be done in the filesystem itself when
              calling posix_acl_create().
      
           Note that technically filesystems are free to impose their own
           ordering between posix_acl_create() and inode_init_owner() meaning
           that there are additional ordering issues that influence S_ISGID
           inheritance.
      
           (Note that the commit message of commit 1639a49c ("fs: move
           S_ISGID stripping into the vfs_*() helpers") gets the ordering
           between inode_init_owner() and posix_acl_create() the wrong way
           around. I realized this too late.)
      
         - Filesystems that don't rely on inode_init_owner() don't get S_ISGID
           stripping logic.
      
           While that may be intentional (e.g. network filesystems might just
           defer setgid stripping to a server) it is often just a security
           issue.
      
           Note that mandating the use of inode_init_owner() was proposed as
           an alternative solution but that wouldn't fix the ordering issues
           and there are examples such as afs where the use of
           inode_init_owner() isn't possible.
      
           In any case, we should also try the cleaner and generalized
           solution first before resorting to this approach.
      
         - We still have S_ISGID inheritance bugs years after the initial
           round of S_ISGID inheritance fixes:
      
             e014f37d ("xfs: use setattr_copy to set vfs inode attributes")
             01ea173e ("xfs: fix up non-directory creation in SGID directories")
             fd84bfdd ("ceph: fix up non-directory creation in SGID directories")
      
        All of this led us to conclude that the current state is too messy.
        While we won't be able to make it completely clean as
        posix_acl_create() is still a filesystem specific call we can improve
        the S_ISGID stripping situation quite a bit by hoisting it out of
        inode_init_owner() and into the respective vfs creation operations.
      
        The obvious advantage is that we don't need to rely on individual
        filesystems getting S_ISGID stripping right and instead can
        standardize the ordering between S_ISGID and umask stripping directly
        in the VFS.
      
        A few short implementation notes:
      
         - The stripping logic needs to happen in vfs_*() helpers for the sake
           of stacking filesystems such as overlayfs that rely on these
           helpers taking care of S_ISGID stripping.
      
         - Security hooks have never seen the mode as it is ultimately seen by
           the filesystem because of the ordering issue we mentioned. Nothing
           is changed for them. We simply continue to strip the umask before
           passing the mode down to the security hooks.
      
         - The following filesystems use inode_init_owner() and thus relied on
           S_ISGID stripping: spufs, 9p, bfs, btrfs, ext2, ext4, f2fs,
           hfsplus, hugetlbfs, jfs, minix, nilfs2, ntfs3, ocfs2, omfs,
           overlayfs, ramfs, reiserfs, sysv, ubifs, udf, ufs, xfs, zonefs,
           bpf, tmpfs.
      
           We've audited all callchains as best as we could. More details can
           be found in the commit message to 1639a49c ("fs: move S_ISGID
           stripping into the vfs_*() helpers")"
      
      * tag 'fs.setgid.v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
        ceph: rely on vfs for setgid stripping
        fs: move S_ISGID stripping into the vfs_*() helpers
        fs: Add missing umask strip in vfs_tmpfile
        fs: add mode_strip_sgid() helper
      426b4ca2
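The ordering problem described in this pull request can be reproduced with a small userspace sketch. Everything here is an assumption made for illustration: `struct toy_ctx` and the `toy_*`/`*_then_*` helpers are hypothetical, and the group-membership and CAP_FSETID checks are reduced to booleans.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical summary of the caller and the parent directory. */
struct toy_ctx {
	bool dir_is_setgid;	/* parent directory has S_ISGID set */
	bool in_group;		/* caller is in the directory's group */
	bool has_fsetid;	/* caller holds CAP_FSETID over the directory */
};

/* Strip S_ISGID from an S_IXGRP non-directory unless the caller is privileged. */
static mode_t toy_strip_sgid(const struct toy_ctx *c, mode_t mode)
{
	if ((mode & (S_ISGID | S_IXGRP)) != (S_ISGID | S_IXGRP))
		return mode;
	if (S_ISDIR(mode) || !c->dir_is_setgid)
		return mode;
	if (c->in_group || c->has_fsetid)
		return mode;
	return mode & ~(mode_t)S_ISGID;
}

/* Ordering 1 (no POSIX ACLs): umask first, then setgid stripping. */
static mode_t umask_then_strip(const struct toy_ctx *c, mode_t mode, mode_t mask)
{
	return toy_strip_sgid(c, mode & ~mask);
}

/* Ordering 2 (POSIX ACLs): setgid stripping first, then umask. */
static mode_t strip_then_umask(const struct toy_ctx *c, mode_t mode, mode_t mask)
{
	return toy_strip_sgid(c, mode) & ~mask;
}
```

With an unprivileged caller, a requested mode of `S_ISGID | 0710` and a umask of `0010`, the first ordering keeps S_ISGID (the umask removed S_IXGRP before the check ran) while the second strips it.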
    • Merge tag 'memblock-v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock · b8dcef87
      Linus Torvalds authored
      Pull memblock updates from Mike Rapoport:
      
       - An optimization in memblock_add_range() to reduce array traversals
      
       - Improvements to the memblock test suite
      
      * tag 'memblock-v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
        memblock test: Modify the obsolete description in README
        memblock tests: fix compilation errors
        memblock tests: change build options to run-time options
        memblock tests: remove completed TODO items
        memblock tests: set memblock_debug to enable memblock_dbg() messages
        memblock tests: add verbose output to memblock tests
        memblock tests: Makefile: add arguments to control verbosity
        memblock: avoid some repeat when add new range
      b8dcef87
    • Merge tag 'm68knommu-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu · 15886321
      Linus Torvalds authored
      Pull m68knommu fixes from Greg Ungerer:
      
       - spelling in comment
      
       - compilation when flexcan driver enabled
      
       - sparse warning
      
      * tag 'm68knommu-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
        m68k: Fix syntax errors in comments
        m68k: coldfire: make symbol m523x_clk_lookup static
        m68k: coldfire/device.c: protect FLEXCAN blocks
      15886321
    • Merge tag 'x86_bugs_pbrsb' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 5318b987
      Linus Torvalds authored
      Pull x86 eIBRS fixes from Borislav Petkov:
       "More from the CPU vulnerability nightmares front:
      
        Intel eIBRS machines do not sufficiently mitigate against RET
        mispredictions when doing a VM Exit; therefore an additional,
        one-entry RSB stuffing is needed"
      
      * tag 'x86_bugs_pbrsb' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/speculation: Add LFENCE to RSB fill sequence
        x86/speculation: Add RSB VM Exit protections
      5318b987
    • Jeff Layton's avatar
      1a1e3aca
    • fscache: don't leak cookie access refs if invalidation is in progress or failed · fb24771f
      Jeff Layton authored
      It's possible for a request to invalidate a fscache_cookie to come in
      while we're already processing an invalidation. If that happens we
      currently take an extra access reference that will leak. Only call
      __fscache_begin_cookie_access if the FSCACHE_COOKIE_DO_INVALIDATE bit
      was previously clear.
      
      Also, ensure that we attempt to clear the bit when the cookie is
      "FAILED" and put the reference to avoid an access leak.
      
      Fixes: 85e4ea10 ("fscache: Fix invalidation/lookup race")
      Suggested-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: David Howells <dhowells@redhat.com>
      fb24771f
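The "only take the access reference if the bit was previously clear" rule can be sketched with an atomic test-and-set. This is a userspace illustration only; `struct toy_cookie`, `TOY_DO_INVALIDATE` and both helpers are hypothetical and do not mirror the fscache data structures.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_DO_INVALIDATE (1u << 0)	/* hypothetical flag bit */

struct toy_cookie {
	atomic_uint flags;
	atomic_int n_accesses;
};

/* Returns true if this call queued a new invalidation (and took a ref). */
static bool toy_invalidate(struct toy_cookie *c)
{
	unsigned int old = atomic_fetch_or(&c->flags, TOY_DO_INVALIDATE);

	if (old & TOY_DO_INVALIDATE)
		return false;	/* already pending: do not take another ref */
	atomic_fetch_add(&c->n_accesses, 1);
	return true;
}

/* Processing (or failure) path: clear the bit and drop the matching ref. */
static void toy_invalidation_done(struct toy_cookie *c)
{
	unsigned int old = atomic_fetch_and(&c->flags, ~TOY_DO_INVALIDATE);

	if (old & TOY_DO_INVALIDATE)
		atomic_fetch_sub(&c->n_accesses, 1);
}
```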
    • Merge tag '5.20-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd · eb555cb5
      Linus Torvalds authored
      Pull ksmbd updates from Steve French:
      
       - fixes for memory access bugs (out of bounds access, oops, leak)
      
       - multichannel fixes
      
       - session disconnect performance improvement, and session register
         improvement
      
       - cleanup
      
      * tag '5.20-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
        ksmbd: fix heap-based overflow in set_ntacl_dacl()
        ksmbd: prevent out of bound read for SMB2_TREE_CONNNECT
        ksmbd: prevent out of bound read for SMB2_WRITE
        ksmbd: fix use-after-free bug in smb2_tree_disconect
        ksmbd: fix memory leak in smb2_handle_negotiate
        ksmbd: fix racy issue while destroying session on multichannel
        ksmbd: use wait_event instead of schedule_timeout()
        ksmbd: fix kernel oops from idr_remove()
        ksmbd: add channel rwlock
        ksmbd: replace sessions list in connection with xarray
        MAINTAINERS: ksmbd: add entry for documentation
        ksmbd: remove unused ksmbd_share_configs_cleanup function
      eb555cb5
    • Merge tag 'pull-work.iov_iter-rebased' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · f30adc0d
      Linus Torvalds authored
      Pull more iov_iter updates from Al Viro:
      
       - more new_sync_{read,write}() speedups - ITER_UBUF introduction
      
       - ITER_PIPE cleanups
      
       - unification of iov_iter_get_pages/iov_iter_get_pages_alloc and
         switching them to advancing semantics
      
       - making ITER_PIPE take high-order pages without splitting them
      
       - handling copy_page_from_iter() for high-order pages properly
      
      * tag 'pull-work.iov_iter-rebased' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (32 commits)
        fix copy_page_from_iter() for compound destinations
        hugetlbfs: copy_page_to_iter() can deal with compound pages
        copy_page_to_iter(): don't split high-order page in case of ITER_PIPE
        expand those iov_iter_advance()...
        pipe_get_pages(): switch to append_pipe()
        get rid of non-advancing variants
        ceph: switch the last caller of iov_iter_get_pages_alloc()
        9p: convert to advancing variant of iov_iter_get_pages_alloc()
        af_alg_make_sg(): switch to advancing variant of iov_iter_get_pages()
        iter_to_pipe(): switch to advancing variant of iov_iter_get_pages()
        block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
        iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()
        iov_iter: saner helper for page array allocation
        fold __pipe_get_pages() into pipe_get_pages()
        ITER_XARRAY: don't open-code DIV_ROUND_UP()
        unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts
        unify xarray_get_pages() and xarray_get_pages_alloc()
        unify pipe_get_pages() and pipe_get_pages_alloc()
        iov_iter_get_pages(): sanity-check arguments
        iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper
        ...
      f30adc0d
    • fix copy_page_from_iter() for compound destinations · c03f05f1
      Al Viro authored
      had been broken for ITER_BVEC et al. since ever (OK, v3.17 when
      ITER_BVEC had first appeared)...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      c03f05f1
    • hugetlbfs: copy_page_to_iter() can deal with compound pages · c7d57ab1
      Al Viro authored
      ... since April 2021
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      c7d57ab1
    • copy_page_to_iter(): don't split high-order page in case of ITER_PIPE · f0f6b614
      Al Viro authored
      ... just shove it into one pipe_buffer.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      f0f6b614
    • expand those iov_iter_advance()... · 310d9d5a
      Al Viro authored
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      310d9d5a
    • pipe_get_pages(): switch to append_pipe() · 746de1f8
      Al Viro authored
      now that we are advancing the iterator, there's no need to
      treat the first page separately - just call append_pipe()
      in a loop.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      746de1f8
    • get rid of non-advancing variants · eba2d3d7
      Al Viro authored
      mechanical change; will be further massaged in subsequent commits
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      eba2d3d7
    • ceph: switch the last caller of iov_iter_get_pages_alloc() · b5358992
      Al Viro authored
      here nothing even looks at the iov_iter after the call, so we couldn't
      care less whether it advances or not.
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      b5358992
    • 9p: convert to advancing variant of iov_iter_get_pages_alloc() · 7f024647
      Al Viro authored
      that one is somewhat clumsier than usual and needs serious testing.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      7f024647
    • af_alg_make_sg(): switch to advancing variant of iov_iter_get_pages() · dc5801f6
      Al Viro authored
      ... and adjust the callers
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      dc5801f6
    • iter_to_pipe(): switch to advancing variant of iov_iter_get_pages() · 7d690c15
      Al Viro authored
      ... and untangle the cleanup on failure to add into pipe.
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      7d690c15
    • block: convert to advancing variants of iov_iter_get_pages{,_alloc}() · 480cb846
      Al Viro authored
      ... doing revert if we end up not using some pages
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      480cb846
    • iov_iter: advancing variants of iov_iter_get_pages{,_alloc}() · 1ef255e2
      Al Viro authored
      Most of the users immediately follow successful iov_iter_get_pages()
      with advancing by the amount it had returned.
      
      Provide inline wrappers doing that, convert trivial open-coded
      uses of those.
      
      BTW, iov_iter_get_pages() never returns more than it had been asked
      to; such checks in cifs ought to be removed someday...
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      1ef255e2
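The "get pages, then advance by the amount returned" wrapper described in this commit can be sketched on a toy iterator. `struct toy_iter` and the `toy_*` functions are hypothetical stand-ins, not the iov_iter API; the toy only tracks byte counts and does not pin any pages.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>	/* ssize_t */

/* Hypothetical iterator: a position and the number of bytes remaining. */
struct toy_iter {
	size_t pos;
	size_t count;
};

/*
 * Non-advancing variant: reports how many bytes it covered (never more
 * than maxsize) and leaves the iterator untouched.
 */
static ssize_t toy_get_pages(const struct toy_iter *it, size_t maxsize)
{
	return (ssize_t)(maxsize < it->count ? maxsize : it->count);
}

static void toy_advance(struct toy_iter *it, size_t n)
{
	it->pos += n;
	it->count -= n;
}

/* Advancing wrapper: what most callers were open-coding. */
static ssize_t toy_get_pages_advancing(struct toy_iter *it, size_t maxsize)
{
	ssize_t n = toy_get_pages(it, maxsize);

	if (n > 0)
		toy_advance(it, (size_t)n);
	return n;
}
```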
    • iov_iter: saner helper for page array allocation · 3cf42da3
      Al Viro authored
      All call sites of get_pages_array() are essentially identical now.
      Replace with common helper...
      
      Returns number of slots available in resulting array or 0 on OOM;
      it's up to the caller to make sure it doesn't ask for a zero-entry
      array (i.e. neither maxpages nor size is allowed to be zero).
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      3cf42da3
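The contract stated above (return the number of usable slots, or 0 on allocation failure, and allocate only when the caller did not supply an array) might look like this in userspace. `toy_want_array()` and `TOY_PAGE_SIZE` are hypothetical; the sketch uses `calloc()` where kernel code would use a kernel allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE 4096u	/* assumed page size for the sketch */

/*
 * *res is either a caller-supplied array or NULL (allocate one).
 * size bytes starting at offset start within the first page need slots,
 * capped at maxpages. The caller must not ask for a zero-entry array.
 */
static unsigned int toy_want_array(void ***res, size_t size, size_t start,
				   unsigned int maxpages)
{
	size_t count = (size + start + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;

	if (count > maxpages)
		count = maxpages;
	if (!*res) {
		*res = calloc(count, sizeof(void *));
		if (!*res)
			return 0;	/* out of memory */
	}
	return (unsigned int)count;
}
```

A caller that wants an uncapped, freshly allocated array can pass a pointer holding NULL together with `~0U` as the cap.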
    • fold __pipe_get_pages() into pipe_get_pages() · 85200084
      Al Viro authored
      ... and don't mangle maxsize there - turn the loop into counting
      one instead.  Easier to see that we won't run out of array that
      way.  Note that special treatment of the partial buffer in that
      thing is an artifact of the non-advancing semantics of
      iov_iter_get_pages() - if not for that, it would be append_pipe(),
      same as the body of the loop that follows it.  IOW, once we make
      iov_iter_get_pages() advancing, the whole thing will turn into
      	calculate how many pages do we want
      	allocate an array (if needed)
      	call append_pipe() that many times.
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      85200084
    • ITER_XARRAY: don't open-code DIV_ROUND_UP() · 0aa4fc32
      Al Viro authored
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      0aa4fc32
    • unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts · 451c0ba9
      Al Viro authored
      same as for pipes and xarrays; after that iov_iter_get_pages() becomes
      a wrapper for __iov_iter_get_pages_alloc().
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      451c0ba9
    • unify xarray_get_pages() and xarray_get_pages_alloc() · 68fe506f
      Al Viro authored
      same as for pipes
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      68fe506f
    • unify pipe_get_pages() and pipe_get_pages_alloc() · acbdeb83
      Al Viro authored
      	The differences between those two are
      * pipe_get_pages() gets a non-NULL struct page ** value pointing to
      preallocated array + array size.
      * pipe_get_pages_alloc() gets an address of struct page ** variable that
      contains NULL, allocates the array and (on success) stores its address in
      that variable.
      
      	Not hard to combine - always pass struct page ***, have
      the previous pipe_get_pages_alloc() caller pass ~0U as cap for
      array size.
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      acbdeb83
    • iov_iter_get_pages(): sanity-check arguments · c81ce28d
      Al Viro authored
      zero maxpages is bogus, but best treated as "just return 0";
      NULL pages, OTOH, should be treated as a hard bug.
      
      get rid of now completely useless checks in xarray_get_pages{,_alloc}().
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      c81ce28d
    • iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper · 91329559
      Al Viro authored
      Incidentally, ITER_XARRAY did *not* free the sucker in case when
      iter_xarray_populate_pages() returned 0...
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      91329559
    • ITER_PIPE: fold data_start() and pipe_space_for_user() together · 12d426ab
      Al Viro authored
      All their callers are next to each other; all of them
      want the total amount of pages and, possibly, the
      offset in the partial final buffer.
      
      Combine into a new helper (pipe_npages()), fix the
      bogosity in pipe_space_for_user(), while we are at it.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      12d426ab
    • ITER_PIPE: cache the type of last buffer · 10f525a8
      Al Viro authored
      We often need to find whether the last buffer is anon or not, and
      currently it's rather clumsy:
      	check if ->iov_offset is non-zero (i.e. that pipe is not empty)
      	if so, get the corresponding pipe_buffer and check its ->ops
      	if it's &default_pipe_buf_ops, we have an anon buffer.
      
      Let's replace the use of ->iov_offset (which is nowhere near similar to
      its role for other flavours) with signed field (->last_offset), with
      the following rules:
      	empty, no buffers occupied:		0
      	anon, with bytes up to N-1 filled:	N
      	zero-copy, with bytes up to N-1 filled:	-N
      
      That way abs(i->last_offset) is equal to what used to be in i->iov_offset
      and empty vs. anon vs. zero-copy can be distinguished by the sign of
      i->last_offset.
      
      	Checks for "should we extend the last buffer or should we start
      a new one?" become easier to follow that way.
      
      	Note that most of the operations can only be done in a sane
      state - i.e. when the pipe has nothing past the current position of
      iterator.  About the only thing that could be done outside of that
      state is iov_iter_advance(), which transitions to the sane state by
      truncating the pipe.  There are only two cases where we leave the
      sane state:
      	1) iov_iter_get_pages()/iov_iter_get_pages_alloc().  Will be
      dealt with later, when we make get_pages advancing - the callers are
      actually happier that way.
      	2) iov_iter copied, then something is put into the copy.  Since
      they share the underlying pipe, the original gets behind.  When we
      decide that we are done with the copy (original is not usable until then)
      we advance the original.  direct_io used to be done that way; nowadays
      it operates on the original and we do iov_iter_revert() to discard
      the excessive data.  At the moment there's nothing in the kernel that
      could do that to ITER_PIPE iterators, so this reason for insane state
      is theoretical right now.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      10f525a8
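The sign encoding proposed for ->last_offset can be illustrated with a few helpers. The enum and function names below are hypothetical, and the page size is passed in as a plain parameter; only the three rules (0, N, -N) come from the commit message.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical classification of the last pipe buffer. */
enum toy_last { TOY_EMPTY, TOY_ANON, TOY_ZERO_COPY };

/* 0: no buffers occupied; N > 0: anon; -N < 0: zero-copy. */
static enum toy_last toy_last_kind(int last_offset)
{
	if (last_offset == 0)
		return TOY_EMPTY;
	return last_offset > 0 ? TOY_ANON : TOY_ZERO_COPY;
}

/* Number of bytes filled in the last buffer, whatever its kind. */
static int toy_filled(int last_offset)
{
	return abs(last_offset);
}

/* "Extend the last buffer or start a new one?" for an anon append. */
static int toy_can_extend_anon(int last_offset, int page_size)
{
	return last_offset > 0 && last_offset < page_size;
}
```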
    • ITER_PIPE: clean iov_iter_revert() · 92acdc4f
      Al Viro authored
      Fold pipe_truncate() into it, clean up.  We can release buffers
      in the same loop where we walk backwards to the iterator beginning
      looking for the place where the new position will be.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      92acdc4f
    • ITER_PIPE: clean pipe_advance() up · 2c855de9
      Al Viro authored
      instead of setting ->iov_offset for new position and calling
      pipe_truncate() to adjust ->len of the last buffer and discard
      everything after it, adjust ->len at the same time we set ->iov_offset
      and use pipe_discard_from() to deal with buffers past that.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      2c855de9
    • ITER_PIPE: lose iter_head argument of __pipe_get_pages() · ca591967
      Al Viro authored
      it's only used to get to the partial buffer we can add to,
      and that's always the last one, i.e. pipe->head - 1.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      ca591967
    • ITER_PIPE: fold push_pipe() into __pipe_get_pages() · e3b42964
      Al Viro authored
      	Expand the only remaining call of push_pipe() (in
      __pipe_get_pages()), combine it with the page-collecting loop there.
      
      Note that the only reason it's not a loop doing append_pipe() is
      that append_pipe() is advancing, while iov_iter_get_pages() is not.
      As soon as it switches to saner semantics, this thing will switch
      to using append_pipe().
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      e3b42964
    • ITER_PIPE: allocate buffers as we go in copy-to-pipe primitives · 8fad7767
      Al Viro authored
      New helper: append_pipe().  Extends the last buffer if possible,
      allocates a new one otherwise.  Returns page and offset in it
      on success, NULL on failure.  iov_iter is advanced past the
      data we've got.
      
      Use that instead of push_pipe() in copy-to-pipe primitives;
      they get simpler that way.  Handling of short copy (in "mc" one)
      is done simply by iov_iter_revert() - iov_iter is in consistent
      state after that one, so we can use that.
      
      [Fix for braino caught by Liu Xinpeng <liuxp11@chinatelecom.cn> folded in]
      [another braino fix, this time in copy_pipe_to_iter() and pipe_zero();
      caught by testcase from Hugh Dickins]
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      8fad7767
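The "extend the last buffer if possible, allocate a new one otherwise" behaviour described for append_pipe() can be modelled on a toy pipe. This is a rough sketch under simplifying assumptions: `struct toy_pipe` and `toy_append()` are hypothetical, every buffer is treated as anonymous, and the iterator state is reduced to the per-buffer fill length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE 4096	/* assumed page size */
#define TOY_MAX_BUFS  4		/* assumed pipe capacity */

struct toy_pipe {
	char *page[TOY_MAX_BUFS];
	size_t len[TOY_MAX_BUFS];	/* bytes filled in each buffer */
	int nbufs;
};

/*
 * Reserve up to size bytes. Returns the page to write into and stores the
 * write offset in *off, or returns NULL when no buffer can be added.
 */
static char *toy_append(struct toy_pipe *p, size_t size, size_t *off)
{
	if (p->nbufs > 0 && p->len[p->nbufs - 1] < TOY_PAGE_SIZE) {
		/* Room left in the last buffer: extend it. */
		int last = p->nbufs - 1;
		size_t room = TOY_PAGE_SIZE - p->len[last];

		*off = p->len[last];
		p->len[last] += size < room ? size : room;
		return p->page[last];
	}
	if (p->nbufs == TOY_MAX_BUFS)
		return NULL;		/* pipe is full */

	char *page = malloc(TOY_PAGE_SIZE);
	if (!page)
		return NULL;		/* allocation failed */
	p->page[p->nbufs] = page;
	p->len[p->nbufs] = size < TOY_PAGE_SIZE ? size : TOY_PAGE_SIZE;
	p->nbufs++;
	*off = 0;
	return page;
}
```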
    • ITER_PIPE: helpers for adding pipe buffers · 47b7fcae
      Al Viro authored
      There are only two kinds of pipe_buffer in the area used by ITER_PIPE.
      
      1) anonymous - copy_to_iter() et al. end up creating those and copying
      data there.  They have zero ->offset, and their ->ops points to
      default_pipe_buf_ops.
      
      2) zero-copy ones - those come from copy_page_to_iter(), and page
      comes from caller.  ->offset is also caller-supplied - it might be
      non-zero.  ->ops points to page_cache_pipe_buf_ops.
      
      Move creation and insertion of those into helpers - push_anon(pipe, size)
      and push_page(pipe, page, offset, size) resp., separating them from
      the "could we avoid creating a new buffer by merging with the current
      head?" logic.
      Acked-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      47b7fcae