1. 26 May, 2018 14 commits
  2. 24 May, 2018 1 commit
    • fix io_destroy()/aio_complete() race · 4faa9996
      Al Viro authored
      If io_destroy() gets to cancelling everything that can be cancelled and
      gets to kiocb_cancel() calling the function the driver has left in ->ki_cancel,
      it becomes vulnerable to a race with IO completion.  At that point req
      has already been taken off the list and aio_complete() does *NOT* spin until
      we (in free_ioctx_users()) release ->ctx_lock.  As a result, it proceeds
      to kiocb_free(), freeing req just as it gets passed to ->ki_cancel().
      
      The fix is simple - remove the request from the list after the call of
      kiocb_cancel().  All instances of ->ki_cancel() already have to cope with
      being called with the iocb still on the list - that's what happens in
      io_cancel(2).
      
      Cc: stable@kernel.org
      Fixes: 0460fef2 "aio: use cancellation list lazily"
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
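
      The reordering described above amounts to calling ->ki_cancel() while the
      request is still on ->active_reqs and only unlinking it afterwards.  A
      minimal sketch of that shape (simplified, reusing the names from the
      message; not the literal fs/aio.c diff):

        /* in free_ioctx_users(), under ctx->ctx_lock */
        while (!list_empty(&ctx->active_reqs)) {
                req = list_first_entry(&ctx->active_reqs,
                                       struct aio_kiocb, ki_list);
                kiocb_cancel(req);              /* cancel first...               */
                list_del_init(&req->ki_list);   /* ...then unlink, so that       */
                                                /* aio_complete() can't free req */
                                                /* while ->ki_cancel() runs      */
        }
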
  3. 21 May, 2018 10 commits
    • aio: fix io_destroy(2) vs. lookup_ioctx() race · baf10564
      Al Viro authored
      kill_ioctx() used to have an explicit RCU delay between removing the
      reference from ->ioctx_table and percpu_ref_kill() dropping the refcount.
      At some point that delay had been removed, on the theory that
      percpu_ref_kill() itself contained an RCU delay.  Unfortunately, that was
      the wrong kind of RCU delay and it didn't care about rcu_read_lock() used
      by lookup_ioctx().  As a result, we could get ctx freed right under
      lookup_ioctx().  Tejun has fixed that in a6d7cff4 ("fs/aio: Add explicit
      RCU grace period when freeing kioctx"); however, that fix is not enough.
      
      Suppose io_destroy() from one thread races with e.g. io_setup() from another;
      CPU1 removes the reference from current->mm->ioctx_table[...] just as CPU2
      has picked it (under rcu_read_lock()).  Then CPU1 proceeds to drop the
      refcount, getting it to 0 and triggering a call of free_ioctx_users(),
      which proceeds to drop the secondary refcount and once that reaches zero
      calls free_ioctx_reqs().  That does
              INIT_RCU_WORK(&ctx->free_rwork, free_ioctx);
              queue_rcu_work(system_wq, &ctx->free_rwork);
      and schedules freeing the whole thing after RCU delay.
      
      Meanwhile, CPU2 has gotten around to percpu_ref_get(), bumping the
      refcount from 0 to 1 and returned the reference to io_setup().
      
      Tejun's fix (that queue_rcu_work() in there) guarantees that ctx won't get
      freed until after percpu_ref_get().  Sure, we'd increment the counter
      before ctx can be freed.  But once we are out of rcu_read_lock(), there's
      nothing to stop the whole thing from being freed.  Unfortunately, CPU2
      assumes that since it
      has grabbed the reference, ctx is *NOT* going away until it gets around to
      dropping that reference.
      
      The fix is obvious - use percpu_ref_tryget_live() and treat failure as a miss.
      It's not costlier than what we currently do in the normal case, it's safe to
      call since freeing *is* delayed and it closes the race window - either
      lookup_ioctx() comes before percpu_ref_kill() (in which case ctx->users
      won't reach 0 until the caller of lookup_ioctx() drops it) or lookup_ioctx()
      fails, ctx->users is unaffected and caller of lookup_ioctx() doesn't see
      the object in question at all.
      
      Cc: stable@kernel.org
      Fixes: a6d7cff4 "fs/aio: Add explicit RCU grace period when freeing kioctx"
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
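
      In lookup_ioctx() terms, the fix boils down to taking the reference with
      percpu_ref_tryget_live() and treating failure as "no such ctx".  A rough
      sketch of that shape (simplified, not the literal fs/aio.c code):

        rcu_read_lock();
        table = rcu_dereference(mm->ioctx_table);
        if (table && id < table->nr) {
                ctx = rcu_dereference(table->table[id]);
                if (ctx && ctx->user_id == ctx_id) {
                        /* fails once percpu_ref_kill() has run */
                        if (percpu_ref_tryget_live(&ctx->users))
                                ret = ctx;
                }
        }
        rcu_read_unlock();
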
    • ext2: fix a block leak · 5aa1437d
      Al Viro authored
      Open a file, unlink it, then use ioctl(2) to make it immutable or
      append-only.  Now close it and watch the blocks *not* get freed...
      
      Immutable/append-only checks belong in ->setattr().
      Note: the bug is old, and a backport to anything prior to 737f2e93
      ("ext2: convert to use the new truncate convention") will need
      these checks lifted into ext2_setattr().
      
      Cc: stable@kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
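
      The reproduction described above can be driven entirely from user space;
      a minimal sketch (hypothetical test program, path and error handling
      omitted; setting the flag needs CAP_LINUX_IMMUTABLE):

        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <linux/fs.h>

        int main(void)
        {
                int flags;
                int fd = open("/mnt/ext2/victim", O_CREAT | O_RDWR, 0600);

                write(fd, "data", 4);
                unlink("/mnt/ext2/victim");          /* now held only via fd */
                ioctl(fd, FS_IOC_GETFLAGS, &flags);
                flags |= FS_IMMUTABLE_FL;            /* or FS_APPEND_FL */
                ioctl(fd, FS_IOC_SETFLAGS, &flags);
                close(fd);       /* last reference - blocks should be freed here */
                return 0;
        }
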
    • nfsd: vfs_mkdir() might succeed leaving dentry negative unhashed · 3819bb0d
      Al Viro authored
      That can (and does, on some filesystems) happen - ->mkdir() (and thus
      vfs_mkdir()) can legitimately leave its argument negative and just
      unhash it, counting upon the lookup to pick up the object we'd created
      the next time we try to look at that name.
      
      Some vfs_mkdir() callers forget about that possibility...
      Acked-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
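
      A caller that needs the positive dentry back after vfs_mkdir() therefore
      has to be prepared to look the name up again.  A rough sketch of the
      pattern (illustrative, not the actual nfsd diff; assumes the parent
      directory is still locked, as it is around vfs_mkdir()):

        err = vfs_mkdir(dir, dentry, mode);
        if (!err && unlikely(d_unhashed(dentry))) {
                /* fs left the dentry negative and unhashed; redo the lookup */
                struct dentry *d = lookup_one_len(dentry->d_name.name,
                                                  dentry->d_parent,
                                                  dentry->d_name.len);
                if (IS_ERR(d))
                        return PTR_ERR(d);
                dput(dentry);
                dentry = d;
        }
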
    • cachefiles: vfs_mkdir() might succeed leaving dentry negative unhashed · 9c3e9025
      Al Viro authored
      That can (and does, on some filesystems) happen - ->mkdir() (and thus
      vfs_mkdir()) can legitimately leave its argument negative and just
      unhash it, counting upon the lookup to pick up the object we'd created
      the next time we try to look at that name.
      
      Some vfs_mkdir() callers forget about that possibility...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • unfuck sysfs_mount() · 7b745a4e
      Al Viro authored
      new_sb is left uninitialized in case of early failures in kernfs_mount_ns(),
      and while IS_ERR(root) is true in all such cases, using IS_ERR(root) || !new_sb
      is not a solution - IS_ERR(root) is true in some cases when new_sb is true.
      
      Make sure new_sb is initialized (and matches the reality) in all cases and
      fix the condition for dropping the kobj reference - we want it done precisely
      in those situations where the reference has not been transferred into a new
      super_block instance.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
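
      The shape that follows from the above, roughly (a simplified sketch of
      the described pattern, not the literal fs/sysfs/mount.c change; the
      kobj_ns_* usage is an assumption about the reference in question):

        bool new_sb = false;    /* now set even if kernfs_mount_ns() fails early */

        root = kernfs_mount_ns(fs_type, flags, sysfs_root,
                               SYSFS_MAGIC, &new_sb, ns);
        if (!new_sb)
                /* the reference never made it into a new super_block */
                kobj_ns_drop(KOBJ_NS_TYPE_NET, ns);
        return root;
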
    • kernfs: deal with kernfs_fill_super() failures · 82382ace
      Al Viro authored
      Make sure that info->node is initialized early, so that kernfs_kill_sb()
      can list_del() it safely.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
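
      The underlying rule is generic: a list_head has to be initialized (or
      already linked) before any failure path that will list_del() it, since
      list_del() dereferences ->prev/->next unconditionally.  Illustration
      (generic sketch, not the kernfs-specific change):

        struct kernfs_super_info *info = kzalloc(sizeof(*info), GFP_KERNEL);

        INIT_LIST_HEAD(&info->node);    /* before anything that can fail and
                                         * end up in kernfs_kill_sb() */
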
    • cramfs: Fix IS_ENABLED typo · 08a8f308
      Joe Perches authored
      There's an extra C here...
      
      Fixes: 99c18ce5 ("cramfs: direct memory access support")
      Acked-by: Nicolas Pitre <nico@linaro.org>
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
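
      The reason such a typo builds cleanly: IS_ENABLED() works purely by macro
      expansion, so an unknown symbol is not an error - it simply evaluates to 0
      and the guarded code is silently dropped.  Illustration (assuming the
      option in question is CONFIG_CRAMFS_MTD, per the Fixes: reference;
      use_mtd_direct_access() is a stand-in, not the real callee):

        /* before: extra 'C' - compiles fine, but is always false */
        if (IS_ENABLED(CCONFIG_CRAMFS_MTD))
                use_mtd_direct_access();

        /* after: actually tracks the Kconfig option */
        if (IS_ENABLED(CONFIG_CRAMFS_MTD))
                use_mtd_direct_access();
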
    • befs_lookup(): use d_splice_alias() · f4e4d434
      Al Viro authored
      RTFS(Documentation/filesystems/nfs/Exporting) if you try to make
      something exportable.
      
      Fixes: ac632f5b "befs: add NFS export support"
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • affs_lookup: switch to d_splice_alias() · 87fbd639
      Al Viro authored
      Making something exportable takes more than providing ->s_export_ops.
      In particular, ->lookup() *MUST* use d_splice_alias() instead of
      d_add().
      
      Reading Documentation/filesystems/nfs/Exporting would've been a good idea;
      as it is, exporting AFFS is badly (and exploitably) broken.
      
      Partially-Fixes: ed4433d7 "fs/affs: make affs exportable"
      Acked-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
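
      The conversion this and the befs commit above call for follows the
      standard pattern for exportable filesystems: let d_splice_alias() reuse an
      existing (possibly disconnected) alias instead of unconditionally
      d_add()ing a fresh one.  Roughly (generic ->lookup() tail, not the literal
      affs diff):

        /* before: fine for local lookups, breaks open-by-handle */
        d_add(dentry, inode);
        return NULL;

        /* after: picks up an existing alias of the inode if there is one */
        return d_splice_alias(inode, dentry);
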
    • affs_lookup(): close a race with affs_remove_link() · 30da870c
      Al Viro authored
      We unlock the directory hash too early - if we are looking at a secondary
      link and the primary (in another directory) gets removed just as we unlock,
      we could have the old primary moved in place of the secondary, leaving
      us looking into a freed entry (and leaving our dentry with ->d_fsdata
      pointing to a freed entry).
      
      Cc: stable@vger.kernel.org # 2.4.4+
      Acked-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  4. 13 May, 2018 1 commit
    • fix breakage caused by d_find_alias() semantics change · b127125d
      Al Viro authored
      "VFS: don't keep disconnected dentries on d_anon" had a non-trivial
      side-effect - d_unhashed() now returns true for those dentries,
      making d_find_alias() skip them altogether.  For most of its callers
      that's fine - we really want a connected alias there.  However,
      there is a codepath where we relied upon picking such aliases
      if nothing else could be found - selinux delayed initialization
      of contexts for inodes on already mounted filesystems used to
      rely upon that.
      
      Cc: stable@kernel.org # f1ee6162 "VFS: don't keep disconnected dentries on d_anon"
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  5. 11 May, 2018 2 commits
    • fs: don't scan the inode cache before SB_BORN is set · 79f546a6
      Dave Chinner authored
      We recently had an oops reported on a 4.14 kernel in
      xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
      and so the m_perag_tree lookup walked into lala land.  It produces
      an oops down this path during the failed mount:
      
        radix_tree_gang_lookup_tag+0xc4/0x130
        xfs_perag_get_tag+0x37/0xf0
        xfs_reclaim_inodes_count+0x32/0x40
        xfs_fs_nr_cached_objects+0x11/0x20
        super_cache_count+0x35/0xc0
        shrink_slab.part.66+0xb1/0x370
        shrink_node+0x7e/0x1a0
        try_to_free_pages+0x199/0x470
        __alloc_pages_slowpath+0x3a1/0xd20
        __alloc_pages_nodemask+0x1c3/0x200
        cache_grow_begin+0x20b/0x2e0
        fallback_alloc+0x160/0x200
        kmem_cache_alloc+0x111/0x4e0
      
      The problem is that the superblock shrinker is running before the
      filesystem structures it depends on have been fully set up. i.e.
      the shrinker is registered in sget(), before ->fill_super() has been
      called, and the shrinker can call into the filesystem before
      fill_super() does its setup work. Essentially we are exposed to
      both use-after-free and use-before-initialisation bugs here.
      
      To fix this, add a check for the SB_BORN flag in super_cache_count.
      In general, this flag is not set until ->fs_mount() completes
      successfully, so we know that it is set after the filesystem
      setup has completed. This matches the trylock_super() behaviour
      which will not let super_cache_scan() run if SB_BORN is not set, and
      hence will not allow the superblock shrinker to enter the
      filesystem while it is being set up or after it has failed setup
      and is being torn down.
      
      Cc: stable@kernel.org
      Signed-Off-By: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
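
      The guard described above sits at the top of super_cache_count(); roughly
      (a sketch of the described check, not the exact diff - the pairing of the
      barrier with the SB_BORN store is an assumption):

        /* in super_cache_count(), before touching any fs structures */
        if (!(sb->s_flags & SB_BORN))
                return 0;
        smp_rmb();      /* pair with the store that sets SB_BORN in the mount path */
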
    • do d_instantiate/unlock_new_inode combinations safely · 1e2e547a
      Al Viro authored
      For anything NFS-exported we do _not_ want to unlock a new inode
      before it has grown an alias; the original set of fixes got the
      ordering right, but missed the nasty complication in case of
      lockdep being enabled - unlock_new_inode() does
      	lockdep_annotate_inode_mutex_key(inode)
      which can only be done before anyone gets a chance to touch
      ->i_mutex.  Unfortunately, flipping the order and doing
      unlock_new_inode() before d_instantiate() opens a window when
      mkdir can race with open-by-fhandle on a guessed fhandle, leading
      to multiple aliases for a directory inode and all the breakage
      that follows from that.
      
      	Correct solution: a new primitive (d_instantiate_new())
      combining these two in the right order - lockdep annotate, then
      d_instantiate(), then the rest of unlock_new_inode().  All
      combinations of d_instantiate() with unlock_new_inode() should
      be converted to that.
      
      Cc: stable@kernel.org	# 2.6.29 and later
      Tested-by: Mike Marshall <hubcap@omnibond.com>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
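
      For filesystem code the conversion asked for above is mechanical; a sketch
      of a typical ->create()/->mkdir() tail (illustrative, not any particular
      filesystem's diff):

        /* before: either ordering is wrong - one loses the lockdep
         * annotation, the other opens the open-by-fhandle race */
        d_instantiate(dentry, inode);
        unlock_new_inode(inode);

        /* after: annotate, instantiate, then clear I_NEW, all in one helper */
        d_instantiate_new(dentry, inode);
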
  6. 02 May, 2018 9 commits
  7. 29 Apr, 2018 3 commits
    • Linux v4.17-rc3 · 6da6c0db
      Linus Torvalds authored
    • Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · c61a56ab
      Linus Torvalds authored
      Pull x86 fixes from Thomas Gleixner:
       "Another set of x86 related updates:
      
         - Fix the long-broken x32 version of the IPC user space headers, which
           was noticed by Arnd Bergmann in the course of his ongoing y2038 work.
           GLIBC seems to have non-broken private copies of these headers, so
           this went unnoticed.
      
         - Two microcode fixlets which address some more fallout from the
           recent modifications in that area:
      
            - Unconditionally save the microcode patch, which was only saved
              when CPU_HOTPLUG was enabled, causing failures in the late
              loading mechanism
      
            - Make the late loader synchronization finally work under all
              circumstances. It was exiting early and causing timeout failures
              due to a missing synchronization point.
      
         - Do not use mwait_play_dead() on AMD systems to prevent excessive
           power consumption as the CPU cannot go into deep power states from
           there.
      
         - Address an annoying sparse warning due to lost type qualifiers of
           the vmemmap and vmalloc base address constants.
      
         - Prevent reserving crash kernel region on Xen PV as this leads to
           the wrong perception that crash kernels actually work there which
           is not the case. Xen PV has its own crash mechanism handled by the
           hypervisor.
      
         - Add missing TLB cpuid values to the table to make the printout on
           certain machines correct.
      
         - Enumerate the new CLDEMOTE instruction
      
         - Fix an incorrect SPDX identifier
      
         - Remove stale macros"
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/ipc: Fix x32 version of shmid64_ds and msqid64_ds
        x86/setup: Do not reserve a crash kernel region if booted on Xen PV
        x86/cpu/intel: Add missing TLB cpuid values
        x86/smpboot: Don't use mwait_play_dead() on AMD systems
        x86/mm: Make vmemmap and vmalloc base address constants unsigned long
        x86/vector: Remove the unused macro FPU_IRQ
        x86/vector: Remove the macro VECTOR_OFFSET_START
        x86/cpufeatures: Enumerate cldemote instruction
        x86/microcode: Do not exit early from __reload_late()
        x86/microcode/intel: Save microcode patch unconditionally
        x86/jailhouse: Fix incorrect SPDX identifier
    • Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 65f4d6d0
      Linus Torvalds authored
      Pull x86 pti fixes from Thomas Gleixner:
       "A set of updates for the x86/pti related code:
      
         - Preserve r8-r11 in int $0x80. r8-r11 need to be preserved, but the
           int $0x80 entry code removed that quite some time ago. Make it correct
           again.
      
         - A set of fixes for the Global Bit work which went into 4.17 and
           caused a bunch of interesting regressions:
      
            - Triggering a BUG in the page attribute code due to a missing
              check for early boot stage
      
            - Warnings in the page attribute code about holes in the kernel
              text mapping which are caused by the freeing of the init code.
              Handle such holes gracefully.
      
            - Reduce the amount of kernel memory which is set global to the
              actual text and do not incidentally overlap with data.
      
            - Disable the global bit when RANDSTRUCT is enabled as it
              partially defeats the hardening.
      
            - Make the page protection setup correct for vma->vm_page_prot
              population again. The adjustment of the protections fell through
              the crack during the Global bit rework and triggers warnings on
              machines which do not support certain features, e.g. NX"
      
      * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/entry/64/compat: Preserve r8-r11 in int $0x80
        x86/pti: Filter at vma->vm_page_prot population
        x86/pti: Disallow global kernel text with RANDSTRUCT
        x86/pti: Reduce amount of kernel text allowed to be Global
        x86/pti: Fix boot warning from Global-bit setting
        x86/pti: Fix boot problems from Global-bit setting