  1. 08 Jul, 2021 15 commits
  2. 29 Jun, 2021 3 commits
  3. 28 Jun, 2021 4 commits
    • SUNRPC: Should wake up the privileged task firstly. · 5483b904
      Zhang Xiaoxu authored
      When picking a task from the wait queue to wake up, a non-privileged
      task may be found rather than the privileged one. This may lead to a
      deadlock similar to the one fixed by commit dfe1fe75 ("NFSv4: Fix
      deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode()"):
      
      The privileged delegreturn task is queued on the privileged list because
      all the slots are assigned. If there are not enough slots to wake up the
      non-privileged batch tasks (the session has fewer than 8 slots), the
      privileged delegreturn task may never be woken up, because the task that
      was found cannot get a slot while the session is draining.
      
      So we should treat the privileged task as an emergency task and
      execute it as soon as we can.
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Fixes: 5fcdfacc ("NFSv4: Return delegations synchronously in evict_inode")
      Cc: stable@vger.kernel.org
      Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
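
The policy described in this commit message (prefer a privileged waiter when choosing which task to wake) can be illustrated with a minimal C sketch. The `struct waiter` type and the `pick_next_waiter()` helper are hypothetical names made up for this illustration; this is not the SUNRPC wait-queue code and not the actual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified waiter; not the kernel's struct rpc_task. */
struct waiter {
	int id;
	bool privileged;
};

/*
 * Return the index of the waiter to wake next: the first privileged
 * waiter if one exists, otherwise the first waiter in queue order.
 * Returns -1 when the queue is empty.
 */
static int pick_next_waiter(const struct waiter *q, size_t n)
{
	assert(q != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (q[i].privileged)
			return (int)i;
	}
	return n > 0 ? 0 : -1;
}
```

With a queue holding two ordinary waiters followed by one privileged waiter, this helper selects the privileged one even though it was queued last, which is the behaviour the commit message asks for.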
    • SUNRPC: Fix the batch tasks count wraparound. · fcb170a9
      Zhang Xiaoxu authored
      'queue->nr' wraps around from 0 to 255 when only the current
      priority queue has tasks. This may lead to a deadlock similar to the
      one fixed by commit dfe1fe75 ("NFSv4: Fix deadlock between
      nfs4_evict_inode() and nfs4_opendata_get_inode()"):
      
      The privileged delegreturn task is queued on the privileged list because
      all the slots are assigned. When a non-privileged task completes and
      releases its slot, a non-privileged task may be picked. It may then fail
      to allocate a slot while the session is draining.
      
      If 'queue->nr' has wrapped around to 255 and there are not enough slots
      to service it, the privileged delegreturn task is never woken up.
      
      So we should avoid the wraparound of 'queue->nr'.
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Fixes: 5fcdfacc ("NFSv4: Return delegations synchronously in evict_inode")
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Cc: stable@vger.kernel.org
      Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
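
The 0-to-255 wraparound mentioned in this commit message is what an 8-bit unsigned counter does when it is decremented at zero. The sketch below assumes the counter is an `unsigned char` (inferred from the 0 to 255 behaviour described above); `struct batch_queue` and both helpers are hypothetical, and the guarded decrement is only one common way to avoid the wrap, not necessarily what the patch does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a queue with an 8-bit batch counter. */
struct batch_queue {
	unsigned char nr;
};

/* Unchecked decrement: 0 wraps to 255 when unsigned char is 8 bits wide. */
static void batch_dec_unchecked(struct batch_queue *q)
{
	assert(q != NULL);
	q->nr--;
}

/* Guarded decrement: the counter stays at 0 instead of wrapping. */
static void batch_dec_guarded(struct batch_queue *q)
{
	assert(q != NULL);
	if (q->nr > 0)
		q->nr--;
}
```

Starting from `nr == 0`, the unchecked helper leaves the counter at 255, so the queue looks as if it still has a large batch to serve, while the guarded helper leaves it at 0.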
    • NFS: Remove unnecessary inode parameter from nfs_pageio_complete_read() · b42ad64f
      Dave Wysochanski authored
      Simplify nfs_pageio_complete_read() by using the inode pointer saved
      inside nfs_pageio_descriptor.
      Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
    • nfs: update has_sec_mnt_opts after cloning lsm options from parent · eae00c5d
      Scott Mayhew authored
      After calling security_sb_clone_mnt_opts() in nfs_get_root(), it's
      necessary to copy the value of has_sec_mnt_opts from the cloned
      super_block's nfs_server.  Otherwise, calls to nfs_compare_super()
      using this super_block may not return the correct result, leading to
      mount failures.
      
      For example, mounting an nfs server with the following in /etc/exports:
      /export *(rw,insecure,crossmnt,no_root_squash,security_label)
      and having /export/scratch on a separate block device.
      
      mount -o v4.2,context=system_u:object_r:root_t:s0 server:/export/test /mnt/test
      mount -o v4.2,context=system_u:object_r:swapfile_t:s0 server:/export/scratch /mnt/scratch
      
      The second mount would fail with "mount.nfs: /mnt/scratch is busy or
      already mounted or sharecache fail" and "SELinux: mount invalid.  Same
      superblock, different security settings for..." would appear in the
      syslog.
      
      Also, while we're in there, replace several instances of "NFS_SB(s)"
      with "server", which was already declared at the top of
      nfs_get_root().
      
      Fixes: ec1ade6a ("nfs: account for selinux security context when deciding to share superblock")
      Signed-off-by: Scott Mayhew <smayhew@redhat.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
  4. 26 Jun, 2021 3 commits
  5. 21 Jun, 2021 3 commits
  6. 13 Jun, 2021 10 commits
  7. 12 Jun, 2021 2 commits
    • Merge tag 'riscv-for-linus-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · 8ecfa36c
      Linus Torvalds authored
      Pull RISC-V fixes from Palmer Dabbelt:
      
       - A pair of XIP fixes: one to fix alternatives, and one to turn off the
         rest of the features that require code modification
      
       - A fix to a typo that was causing some alternatives to break
      
       - A build fix for BUILTIN_DTB
      
      * tag 'riscv-for-linus-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
        riscv: Fix BUILTIN_DTB for sifive and microchip soc
        riscv: alternative: fix typo in macro name
        riscv: code patching only works on !XIP_KERNEL
        riscv: xip: support runtime trap patching
    • mm: relocate 'write_protect_seq' in struct mm_struct · 2e302543
      Feng Tang authored
      The 0day robot reported a 9.2% regression in the will-it-scale mmap1
      test case [1], caused by commit 57efa1fe ("mm/gup: prevent gup_fast from
      racing with COW during fork").
      
      Further debugging shows that the regression occurs because that commit
      changes the offset of the hot field 'mmap_lock' inside 'struct mm_struct',
      which in turn changes its cache alignment.
      
      The perf data shows that contention on 'mmap_lock' is very severe and
      takes around 95% of CPU cycles. 'mmap_lock' is a rw_semaphore:
      
              struct rw_semaphore {
                      atomic_long_t count;	/* 8 bytes */
                      atomic_long_t owner;	/* 8 bytes */
                      struct optimistic_spin_queue osq; /* spinner MCS lock */
                      ...
      
      Before commit 57efa1fe added 'write_protect_seq', the lock happened to
      have a near-optimal cache alignment layout, as Linus explained:
      
       "and before the addition of the 'write_protect_seq' field, the
        mmap_sem was at offset 120 in 'struct mm_struct'.
      
        Which meant that count and owner were in two different cachelines,
        and then when you have contention and spend time in
        rwsem_down_write_slowpath(), this is probably *exactly* the kind
        of layout you want.
      
        Because first the rwsem_write_trylock() will do a cmpxchg on the
        first cacheline (for the optimistic fast-path), and then in the
        case of contention, rwsem_down_write_slowpath() will just access
        the second cacheline.
      
        Which is probably just optimal for a load that spends a lot of
        time contended - new waiters touch that first cacheline, and then
        they queue themselves up on the second cacheline."
      
      After the commit, the rw_semaphore is at offset 128, which means the
      'count' and 'owner' fields are now in the same cacheline, which causes
      more cache bouncing.
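
The offset arithmetic above can be checked with a small C sketch. It assumes 64-byte cache lines (the commit message does not state the line size) and relies on 'count' and 'owner' being the first two 8-byte fields of the lock, as shown in the struct excerpt. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHELINE_BYTES 64 /* assumption: typical x86-64 cache line size */

static_assert((CACHELINE_BYTES & (CACHELINE_BYTES - 1)) == 0,
	      "cache line size must be a power of two");

/* Index of the cache line that contains the given byte offset. */
static size_t cacheline_of(size_t offset)
{
	return offset / CACHELINE_BYTES;
}

/*
 * True if the two leading 8-byte fields of a lock placed at lock_offset
 * ('count' at +0, 'owner' at +8) fall into the same cache line.
 */
static bool count_and_owner_share_line(size_t lock_offset)
{
	size_t count_off = lock_offset;
	size_t owner_off = lock_offset + 8;

	return cacheline_of(count_off) == cacheline_of(owner_off);
}
```

Under these assumptions, offset 120 puts 'count' in line 1 and 'owner' in line 2, while offset 128 puts both in line 2, matching the before and after layouts described in the commit message.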
      
      Currently there are 3 "#ifdef CONFIG_XXX" before 'mmap_lock' which will
      affect its offset:
      
        CONFIG_MMU
        CONFIG_MEMBARRIER
        CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
      
      The layout above is for a 64-bit system with 0day's default kernel config
      (similar to RHEL-8.3's config), in which all 3 of these options are 'y'.
      The layout can vary with different kernel configs.
      
      Re-laying out a structure is usually a double-edged sword: it can help
      one case but hurt others.  In this case, one solution is to place the
      newly added 'write_protect_seq', which is a 4-byte seqcount_t (when
      CONFIG_DEBUG_LOCK_ALLOC=n), into an existing 4-byte hole in 'mm_struct'.
      This does not change the alignment of the other fields and recovers the
      lost performance.
      
      Link: https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/ [1]
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Signed-off-by: Feng Tang <feng.tang@intel.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
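
The "fill an existing hole" idea from the last commit message can be demonstrated on miniature structures. The layouts below are hypothetical and far smaller than 'struct mm_struct'; the field names are illustrative only. `_Alignas(8)` is used so that the offsets are the same on 32-bit and 64-bit ABIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline: a 4-byte field, a 4-byte padding hole, then the lock. */
struct layout_base {
	_Alignas(8) uint64_t a;    /* offset 0 */
	uint32_t b;                /* offset 8 */
	/* 4-byte hole at offset 12 */
	_Alignas(8) uint64_t lock; /* offset 16 */
};

/* New 4-byte field placed in the hole: 'lock' does not move. */
struct layout_hole_filled {
	_Alignas(8) uint64_t a;    /* offset 0 */
	uint32_t b;                /* offset 8 */
	uint32_t seq;              /* offset 12, fills the hole */
	_Alignas(8) uint64_t lock; /* offset 16 */
};

/* New 4-byte field placed where no hole exists: 'lock' shifts by 8. */
struct layout_shifted {
	uint32_t seq;              /* offset 0, followed by 4 bytes of padding */
	_Alignas(8) uint64_t a;    /* offset 8 */
	uint32_t b;                /* offset 16 */
	_Alignas(8) uint64_t lock; /* offset 24 */
};

static_assert(offsetof(struct layout_base, lock) == 16,
	      "baseline lock offset");
static_assert(offsetof(struct layout_hole_filled, lock) ==
	      offsetof(struct layout_base, lock),
	      "filling the hole must not move the lock");
static_assert(offsetof(struct layout_shifted, lock) ==
	      offsetof(struct layout_base, lock) + 8,
	      "adding the field elsewhere shifts the lock by 8 bytes");
```

The 8-byte shift in the third layout mirrors the 120 to 128 move described in the commit message, while the second layout shows why placing the new field in a padding hole leaves every other offset unchanged.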