1. 30 Nov, 2019 14 commits
    • s390/kaslr: store KASLR offset for early dumps · a9f2f686
      Gerald Schaefer authored
      The KASLR offset is added to vmcoreinfo in arch_crash_save_vmcoreinfo(),
      so that it can be found by crash when processing kernel dumps.
      
      However, arch_crash_save_vmcoreinfo() is called during a subsys_initcall,
      so if the kernel crashes before that, we have no vmcoreinfo and no KASLR
      offset.
      
      Fix this by storing the KASLR offset in the lowcore, where the vmcore_info
      pointer will be stored, and where it can be found by crash. In order to
      make it distinguishable from a real vmcore_info pointer, mark it as uneven
      (KASLR offset itself is aligned to THREAD_SIZE).
      
      When arch_crash_save_vmcoreinfo() stores the real vmcore_info pointer in
      the lowcore, it overwrites the KASLR offset. At that point, the KASLR
      offset is not yet added to vmcoreinfo, so we also need to move the
      mem_assign_absolute() behind the vmcoreinfo_append_str().
      
      Fixes: b2d24b97 ("s390/kernel: add support for kernel address space layout randomization (KASLR)")
      Cc: <stable@vger.kernel.org> # v5.2+
      Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
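
The tagging scheme described in the commit above can be sketched roughly as follows. This is only an illustration, not the kernel code: SKETCH_THREAD_SIZE and every sketch_* helper are hypothetical names, and the sketch assumes that a genuine vmcore_info pointer is always even, so bit 0 is free to act as a marker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: stand-in for the kernel's THREAD_SIZE (any power of two > 1). */
#define SKETCH_THREAD_SIZE 0x4000UL

/* Mark an aligned KASLR offset as "uneven" (odd) by setting bit 0. */
static inline uint64_t sketch_encode_kaslr_offset(uint64_t offset)
{
    /* The offset must be aligned, otherwise bit 0 would carry data. */
    assert((offset & (SKETCH_THREAD_SIZE - 1)) == 0);
    return offset | 1;
}

/* An odd value is a tagged KASLR offset; an even one is taken as a pointer. */
static inline bool sketch_is_kaslr_offset(uint64_t lowcore_value)
{
    return (lowcore_value & 1) != 0;
}

/* Strip the marker bit to recover the original offset. */
static inline uint64_t sketch_decode_kaslr_offset(uint64_t lowcore_value)
{
    return lowcore_value & ~(uint64_t)1;
}
```

A dump tool reading the lowcore slot would first test the low bit and only then decide whether to treat the value as an offset or to follow it as a pointer.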
    • s390/unwind: stop gracefully at task pt_regs · e76e6961
      Vasily Gorbik authored
      Treat reaching task pt_regs as graceful unwinder termination. Task
      pt_regs itself never contains a valid state to which a task might return
      within the kernel context (user task pt_regs is a special case). Since
      we already avoid printing user task pt_regs, and in most cases we don't
      even bother filling task pt_regs psw and r15 with something reasonable,
      simply skip task pt_regs altogether. With this change unwind_error() now
      accurately represents whether the unwinder reached task pt_regs
      successfully or failed along the way.
      Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
    • s390/head64: correct init_task stack setup · cb7948e8
      Vasily Gorbik authored
      Add the missing allocation of pt_regs at the bottom of the stack. This
      makes it consistent with other stack setup cases and with what the stack
      unwinder expects.
      Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
    • s390/unwind: make reuse_sp default when unwinding pt_regs · 97806dfb
      Vasily Gorbik authored
      Currently unwinder yields 2 entries when pt_regs are met:
      sp="address of pt_regs itself" ip=pt_regs->psw
      sp=pt_regs->gprs[15] ip="r14 from stack frame pointed by pt_regs->gprs[15]"
      
      And neither of those 2 states (combination of sp and ip) ever happened.
      
      reuse_sp has been introduced by commit a1d863ac ("s390/unwind: fix
      mixing regs and sp"). reuse_sp=true makes unwinder keen to produce the
      following result, when pt_regs are given (as an arg to unwind_start):
      sp=pt_regs->gprs[15] ip=pt_regs->psw
      sp=pt_regs->gprs[15] ip="r14 from stack frame pointed by pt_regs->gprs[15]"
      
      The first state is an actual state in which a task was when pt_regs were
      collected. The second state is marked unreliable and is for debugging
      purposes to cover the case when a task has been interrupted in between
      stack frame allocation and writing back_chain - in this case r14 might
      show an actual caller.
      
      Make unwinder behaviour enabled via reuse_sp=true default and drop the
      special case handling.
      Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
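
The two entries that the reuse_sp=true behaviour produces from pt_regs might be modelled as below. This is a simplified sketch under stated assumptions: the sketch_* structures are hypothetical stand-ins for the kernel's pt_regs, stack frame, and unwind state, and the position of the saved r14 slot in the frame is invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct sketch_pt_regs {
    uintptr_t psw_addr;  /* address the task was executing at */
    uintptr_t gprs[16];  /* gprs[15] holds the stack pointer */
};

struct sketch_stack_frame {
    uintptr_t back_chain;
    uintptr_t saved_r14; /* return address slot; the layout is an assumption */
};

struct sketch_unwind_state {
    uintptr_t sp;
    uintptr_t ip;
    bool reliable;
};

/* First entry: the state the task was really in when pt_regs were collected. */
static inline struct sketch_unwind_state
sketch_state_from_regs(const struct sketch_pt_regs *regs)
{
    assert(regs != NULL);
    struct sketch_unwind_state s = { regs->gprs[15], regs->psw_addr, true };
    return s;
}

/* Second entry: same sp, ip taken from the frame's r14 slot, marked unreliable. */
static inline struct sketch_unwind_state
sketch_state_from_frame(const struct sketch_pt_regs *regs)
{
    assert(regs != NULL);
    const struct sketch_stack_frame *sf =
        (const struct sketch_stack_frame *)regs->gprs[15];
    struct sketch_unwind_state s = { regs->gprs[15], sf->saved_r14, false };
    return s;
}
```

Both entries share sp=pt_regs->gprs[15]; only the source of ip and the reliability flag differ, which matches the pair of states listed in the commit message.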
    • s390/unwind: report an error if pt_regs are not on stack · 67f55934
      Vasily Gorbik authored
      If unwinder is looking at pt_regs which is not on stack then something
      went wrong and an error has to be reported rather than successful
      unwinding termination.
      Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
    • s390: avoid misusing CALL_ON_STACK for task stack setup · 7bcaad1f
      Vasily Gorbik authored
      CALL_ON_STACK is intended to be used for temporary stack switching with
      potential return to the caller.
      
      When CALL_ON_STACK is misused to switch from the nodat stack to the task
      stack, back_chain information would later lead the stack unwinder from the
      task stack into the (per cpu) nodat stack, which is reused for other
      purposes. This would yield confusing unwinding results or errors.
      
      To avoid that, introduce CALL_ON_STACK_NORETURN to be used instead. It
      makes sure that back_chain is zeroed and the unwinder finishes gracefully,
      ending up at task pt_regs.
      Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
    • s390: correct CALL_ON_STACK back_chain saving · 75794257
      Vasily Gorbik authored
      Currently CALL_ON_STACK saves r15 as back_chain in the first stack frame of
      the stack we are about to switch to. But if a function which uses
      CALL_ON_STACK calls another function, it allocates a stack frame for the
      callee. In this case r15 points to the callee stack frame and not to the
      stack frame of the function itself. This results in a dummy unwinding
      entry with random sp and ip values.
      
      Introduce and utilize the current_frame_address macro to get the address
      of the actual function stack frame.
      Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
    • s390/unwind: unify task is current checks · 103b4cca
      Vasily Gorbik authored
      Avoid mixture of task == NULL and task == current meaning the same
      thing and simply always initialize task with current in unwind_start.
      Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
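
The normalization described in the commit above could look roughly like this. It is a hedged sketch only: sketch_task, sketch_current_task (standing in for the kernel's current), and the sketch_uw_* names are all hypothetical, and the real unwind_start takes more arguments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_task {
    int pid;
};

/* Hypothetical stand-in for the kernel's `current` task. */
static struct sketch_task sketch_current_task = { 1 };

struct sketch_uw_state {
    const struct sketch_task *task;
};

/* Map a NULL task to the current task once, at the start of unwinding. */
static inline void sketch_uw_start(struct sketch_uw_state *state,
                                   const struct sketch_task *task)
{
    assert(state != NULL);
    state->task = task ? task : &sketch_current_task;
}

/* Later code needs a single comparison instead of two equivalent checks. */
static inline bool sketch_uw_task_is_current(const struct sketch_uw_state *state)
{
    return state->task == &sketch_current_task;
}
```

With the task normalized up front, no later check has to remember that NULL and the current task mean the same thing.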
    • s390: disable preemption when switching to nodat stack with CALL_ON_STACK · 7f28dad3
      Vasily Gorbik authored
      Make sure preemption is disabled when temporarily switching to the nodat
      stack with the CALL_ON_STACK helper, because the nodat stack is per cpu.
      Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
    • s390: always inline disabled_wait · c2e06e15
      Vasily Gorbik authored
      disabled_wait uses _THIS_IP_ and assumes that compiler would inline it.
      Make sure this assumption is always correct by utilizing __always_inline.
      Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
    • s390/vdso: fix getcpu · 5a5525b0
      Heiko Carstens authored
      getcpu reads the required values for cpu and node with two
      instructions. This might lead to an inconsistent result if user space
      gets preempted and migrated to a different CPU between the two
      instructions.
      
      Fix this by using just a single instruction to read both values at
      once.
      
      This is currently rather a theoretical bug, since there is no real
      NUMA support available (except for NUMA emulation).
      Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
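
The idea of reading both values in one access, as the getcpu fix above describes, can be illustrated in portable C. This is an analogy rather than the vDSO code: the packing of node and cpu into one 64-bit word and all sketch_* names are assumptions made for the example, and C11 atomics stand in for the single machine instruction.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-CPU record: node in the high 32 bits, cpu in the low 32. */
static _Atomic uint64_t sketch_cpu_node;

/* Writer side: publish both values as one 64-bit word. */
static inline void sketch_publish(uint32_t cpu, uint32_t node)
{
    atomic_store_explicit(&sketch_cpu_node,
                          ((uint64_t)node << 32) | cpu,
                          memory_order_relaxed);
}

/*
 * Reader side: one load yields a consistent (cpu, node) pair, so there is
 * no window between two separate reads in which a migration could mix
 * values belonging to different CPUs.
 */
static inline void sketch_getcpu(uint32_t *cpu, uint32_t *node)
{
    assert(cpu && node);
    uint64_t both = atomic_load_explicit(&sketch_cpu_node,
                                         memory_order_relaxed);
    *cpu = (uint32_t)both;
    *node = (uint32_t)(both >> 32);
}
```

The pair may still be stale by the time the caller uses it, as with any getcpu result, but cpu and node always belong together.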
    • s390/smp,vdso: fix ASCE handling · a2308c11
      Heiko Carstens authored
      When a secondary CPU is brought up it must initialize its control
      registers. CPU A which triggers that a secondary CPU B is brought up
      stores its control register contents into the lowcore of new CPU B,
      which then loads these values on startup.
      
      This is problematic in various ways: the control register which
      contains the home space ASCE will correctly contain the kernel ASCE;
      however control registers for primary and secondary ASCEs are
      initialized with whatever values were present in CPU A.
      
      Typically:
      - the primary ASCE will contain the user process ASCE of the process
        that triggered onlining of CPU B.
      - the secondary ASCE will contain the percpu VDSO ASCE of CPU A.
      
      Due to lazy ASCE handling we may also end up with other combinations.
      
      When CPU B then switches to a different process (!= idle) it will
      fix up the primary ASCE. However the problem is that the (wrong) ASCE
      from CPU A was loaded into control register 1: as soon as an ASCE is
      attached (aka loaded) a CPU is free to generate TLB entries using that
      address space.
      Even though it is very unlikely that CPU B will actually generate such
      entries, this could result in TLB entries of the address space of the
      process that ran on CPU A. These entries shouldn't exist at all and
      could cause problems later on.
      
      Furthermore the secondary ASCE of CPU B will not be updated correctly.
      This means that processes may see wrong results or even crash if they
      access VDSO data on CPU B. The correct VDSO ASCE will eventually be
      loaded on return to user space as soon as the kernel executed a call
      to strnlen_user or an atomic futex operation on CPU B.
      
      Fix both issues by initializing the to-be-loaded control register
      contents with the correct ASCEs and also enforce (re-)loading of the
      ASCEs upon first context switch and return to user space.
      
      Fixes: 0aaba41b ("s390: remove all code using the access register mode")
      Cc: stable@vger.kernel.org # v4.15+
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
    • s390/zcrypt: handle new reply code FILTERED_BY_HYPERVISOR · 6733775a
      Harald Freudenberger authored
      This patch introduces support for a new architected reply
      code 0x8B indicating that a hypervisor layer (if any) has
      rejected an ap message.
      
      Linux may run as a guest on top of a hypervisor like z/VM
      or KVM. So the crypto hardware seen by the ap bus may be
      restricted by the hypervisor; for example, only a subset such
      as clear key crypto requests may be supported. Other
      requests will be filtered out, i.e. rejected by the hypervisor.
      The new reply code 0x8B will appear in such cases and needs
      to be recognized by the ap bus and the zcrypt device driver zoo.
      Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
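
Recognizing the new reply code from the commit above amounts to one more case in a reply classifier. The sketch below is illustrative only: the value 0x8B comes from the commit message, while the macro name, the enum, and the function are hypothetical and do not reflect the driver's actual identifiers or error handling.

```c
#include <assert.h>
#include <stdint.h>

/* Value taken from the commit message; the macro name is hypothetical. */
#define SKETCH_REPLY_FILTERED_BY_HYPERVISOR 0x8B

/* The reply code is assumed here to be a single byte. */
static_assert(SKETCH_REPLY_FILTERED_BY_HYPERVISOR <= UINT8_MAX,
              "reply code must fit into one byte");

enum sketch_reply_class {
    SKETCH_REPLY_OTHER,           /* handled elsewhere; out of scope here */
    SKETCH_REPLY_REJECTED_BY_HYP, /* request filtered by a hypervisor layer */
};

/* Classify a reply code so callers can fail just the filtered request. */
static inline enum sketch_reply_class sketch_classify_reply(uint8_t reply_code)
{
    switch (reply_code) {
    case SKETCH_REPLY_FILTERED_BY_HYPERVISOR:
        return SKETCH_REPLY_REJECTED_BY_HYP;
    default:
        return SKETCH_REPLY_OTHER;
    }
}
```

How a caller reacts to the rejected class (which error it reports, whether the device stays online) is left open here because the commit message does not specify it.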
    • s390: implement perf_arch_fetch_caller_regs · 914d52e4
      Ilya Leoshkevich authored
      On s390 bpf_get_stack_raw_tp() returns 0 entries for both kernel and
      user stacks. While there is no practical unwinding solution for userspace
      on s390 at this moment, there certainly is a kernel unwinder. However,
      it is not properly integrated with BPF.
      
      In order to start unwinding, bpf_get_stack_raw_tp() obtains the current
      kernel register values using perf_fetch_caller_regs(), which is not
      implemented for s390. The actual unwinding then happens by passing those
      registers to perf_callchain_kernel().
      
      Implement perf_arch_fetch_caller_regs() for s390, where
      __builtin_frame_address(0) points to back_chain.
      Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
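
A register-fetch helper along the lines of the commit above might be shaped like this. It is a sketch under assumptions, not the s390 implementation: struct sketch_regs and the macro name are hypothetical, the frame address is used directly as the stack pointer (which presumes back_chain sits at offset 0 of the frame), and __builtin_frame_address is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced register set: just what an unwinder needs to start. */
struct sketch_regs {
    uintptr_t psw_addr; /* instruction pointer to start unwinding from */
    uintptr_t gprs[16]; /* gprs[15] is the stack pointer */
};

static_assert(sizeof(uintptr_t) == sizeof(void *),
              "uintptr_t is assumed to hold a frame address");

/*
 * A macro rather than a function, so that __builtin_frame_address(0)
 * refers to the frame of the function that invokes it.
 */
#define sketch_fetch_caller_regs(regs, ip) do {                   \
    (regs)->psw_addr = (uintptr_t)(ip);                           \
    (regs)->gprs[15] = (uintptr_t)__builtin_frame_address(0);     \
} while (0)
```

The filled-in registers would then be handed to the kernel unwinder as its starting point, in the way the commit message describes for perf_callchain_kernel().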
  2. 26 Nov, 2019 1 commit
    • Merge tag 's390-5.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · ea1f56fa
      Linus Torvalds authored
      Pull s390 updates from Vasily Gorbik:
      
       - Adjust PMU device drivers registration to avoid WARN_ON and few other
         perf improvements.
      
       - Enhance tracing in vfio-ccw.
      
       - Few stack unwinder fixes and improvements, convert get_wchan custom
         stack unwinding to generic api usage.
      
       - Fixes for mm helpers issues uncovered with tests validating
         architecture page table helpers.
      
       - Fix noexec bit handling when hardware doesn't support it.
      
       - Fix memleak and unsigned value compared with zero bugs in crypto
         code. Minor code simplification.
      
       - Fix crash during kdump with kasan enabled kernel.
      
       - Switch bug and alternatives from asm to asm_inline to improve
         inlining decisions.
      
       - Use 'depends on cc-option' for MARCH and TUNE options in Kconfig, add
         z13s and z14 ZR1 to TUNE descriptions.
      
       - Minor head64.S simplification.
      
       - Fix physical to logical CPU map for SMT.
      
       - Several cleanups in qdio code.
      
       - Other minor cleanups and fixes all over the code.
      
      * tag 's390-5.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (41 commits)
        s390/cpumf: Adjust registration of s390 PMU device drivers
        s390/smp: fix physical to logical CPU map for SMT
        s390/early: move access registers setup in C code
        s390/head64: remove unnecessary vdso_per_cpu_data setup
        s390/early: move control registers setup in C code
        s390/kasan: support memcpy_real with TRACE_IRQFLAGS
        s390/crypto: Fix unsigned variable compared with zero
        s390/pkey: use memdup_user() to simplify code
        s390/pkey: fix memory leak within _copy_apqns_from_user()
        s390/disassembler: don't hide instruction addresses
        s390/cpum_sf: Assign error value to err variable
        s390/cpum_sf: Replace function name in debug statements
        s390/cpum_sf: Use consistant debug print format for sampling
        s390/unwind: drop unnecessary code around calling ftrace_graph_ret_addr()
        s390: add error handling to perf_callchain_kernel
        s390: always inline current_stack_pointer()
        s390/mm: add mm_pxd_folded() checks to pxd_free()
        s390/mm: properly clear _PAGE_NOEXEC bit when it is not supported
        s390/mm: simplify page table helpers for large entries
        s390/mm: make pmd/pud_bad() report large entries as bad
        ...
  3. 25 Nov, 2019 20 commits
    • Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 4ba380f6
      Linus Torvalds authored
      Pull arm64 updates from Catalin Marinas:
       "Apart from the arm64-specific bits (core arch and perf, new arm64
        selftests), it touches the generic cow_user_page() (reviewed by
        Kirill) together with a macro for x86 to preserve the existing
        behaviour on this architecture.
      
        Summary:
      
         - On ARMv8 CPUs without hardware updates of the access flag, avoid
           failing cow_user_page() on PFN mappings if the pte is old. The
           patches introduce an arch_faults_on_old_pte() macro, defined as
           false on x86. When true, cow_user_page() makes the pte young before
           attempting __copy_from_user_inatomic().
      
         - Convert the synchronous exception handling paths in
           arch/arm64/kernel/entry.S to C.
      
         - FTRACE_WITH_REGS support for arm64.
      
         - ZONE_DMA re-introduced on arm64 to support Raspberry Pi 4
      
         - Several kselftest cases specific to arm64, together with a
           MAINTAINERS update for these files (moved to the ARM64 PORT entry).
      
         - Workaround for a Neoverse-N1 erratum where the CPU may fetch stale
           instructions under certain conditions.
      
         - Workaround for Cortex-A57 and A72 errata where the CPU may
           speculatively execute an AT instruction and associate a VMID with
           the wrong guest page tables (corrupting the TLB).
      
         - Perf updates for arm64: additional PMU topologies on HiSilicon
           platforms, support for CCN-512 interconnect, AXI ID filtering in
           the IMX8 DDR PMU, support for the CCPI2 uncore PMU in ThunderX2.
      
         - GICv3 optimisation to avoid a heavy barrier when accessing the
           ICC_PMR_EL1 register.
      
         - ELF HWCAP documentation updates and clean-up.
      
         - SMC calling convention conduit code clean-up.
      
         - KASLR diagnostics printed during boot
      
         - NVIDIA Carmel CPU added to the KPTI whitelist
      
         - Some arm64 mm clean-ups: use generic free_initrd_mem(), remove
           stale macro, simplify calculation in __create_pgd_mapping(), typos.
      
         - Kconfig clean-ups: CMDLINE_FORCE to depend on CMDLINE, choice for
           endianness to help with allmodconfig"
      
      * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (93 commits)
        arm64: Kconfig: add a choice for endianness
        kselftest: arm64: fix spelling mistake "contiguos" -> "contiguous"
        arm64: Kconfig: make CMDLINE_FORCE depend on CMDLINE
        MAINTAINERS: Add arm64 selftests to the ARM64 PORT entry
        arm64: kaslr: Check command line before looking for a seed
        arm64: kaslr: Announce KASLR status on boot
        kselftest: arm64: fake_sigreturn_misaligned_sp
        kselftest: arm64: fake_sigreturn_bad_size
        kselftest: arm64: fake_sigreturn_duplicated_fpsimd
        kselftest: arm64: fake_sigreturn_missing_fpsimd
        kselftest: arm64: fake_sigreturn_bad_size_for_magic0
        kselftest: arm64: fake_sigreturn_bad_magic
        kselftest: arm64: add helper get_current_context
        kselftest: arm64: extend test_init functionalities
        kselftest: arm64: mangle_pstate_invalid_mode_el[123][ht]
        kselftest: arm64: mangle_pstate_invalid_daif_bits
        kselftest: arm64: mangle_pstate_invalid_compat_toggle and common utils
        kselftest: arm64: extend toplevel skeleton Makefile
        drivers/perf: hisi: update the sccl_id/ccl_id for certain HiSilicon platform
        arm64: mm: reserve CMA and crashkernel in ZONE_DMA32
        ...
    • Merge tag 'linux-kselftest-5.5-rc1-kunit' of... · e25645b1
      Linus Torvalds authored
      Merge tag 'linux-kselftest-5.5-rc1-kunit' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull kselftest KUnit support from Shuah Khan:
       "This adds KUnit, a lightweight unit testing and mocking framework for
        the Linux kernel from Brendan Higgins.
      
        KUnit is not an end-to-end testing framework. It is currently
        supported on UML and sub-systems can write unit tests and run them in
        UML env. KUnit documentation is included in this update.
      
        In addition, this KUnit update adds 3 new KUnit tests:
      
         - proc sysctl test from Iurii Zaikin
      
         - the 'list' doubly linked list test from David Gow
      
         - ext4 tests for decoding extended timestamps from Iurii Zaikin
      
        In the future KUnit will be linked to Kselftest framework to provide a
        way to trigger KUnit tests from user-space"
      
      * tag 'linux-kselftest-5.5-rc1-kunit' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (23 commits)
        lib/list-test: add a test for the 'list' doubly linked list
        ext4: add kunit test for decoding extended timestamps
        Documentation: kunit: Fix verification command
        kunit: Fix '--build_dir' option
        kunit: fix failure to build without printk
        MAINTAINERS: add proc sysctl KUnit test to PROC SYSCTL section
        kernel/sysctl-test: Add null pointer test for sysctl.c:proc_dointvec()
        MAINTAINERS: add entry for KUnit the unit testing framework
        Documentation: kunit: add documentation for KUnit
        kunit: defconfig: add defconfigs for building KUnit tests
        kunit: tool: add Python wrappers for running KUnit tests
        kunit: test: add tests for KUnit managed resources
        kunit: test: add the concept of assertions
        kunit: test: add tests for kunit test abort
        kunit: test: add support for test abort
        objtool: add kunit_try_catch_throw to the noreturn list
        kunit: test: add initial tests
        lib: enable building KUnit in lib/
        kunit: test: add the concept of expectations
        kunit: test: add assertion printing library
        ...
    • Merge tag 'linux-kselftest-5.5-rc1-fixes' of... · db7d2754
      Linus Torvalds authored
      Merge tag 'linux-kselftest-5.5-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull kselftest fixes from Shuah Khan:
       "This consists of several fixes to tests and framework.
      
        Masami Hiramatsu fixed several tests to build and run correctly on arm
        and other 32bit architectures"
      
      * tag 'linux-kselftest-5.5-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        selftests: sync: Fix cast warnings on arm
        selftests: net: Fix printf format warnings on arm
        selftests: net: Use size_t and ssize_t for counting file size
        selftests: vm: Build/Run 64bit tests only on 64bit arch
        selftests: proc: Make va_max 1MB
        kselftest: Fix NULL INSTALL_PATH for TARGETS runlist
        selftests: Move kselftest_module.sh into kselftest/
        selftests: gen_kselftest_tar.sh: Do not clobber kselftest/
        selftests: breakpoints: Fix a typo of function name
        selftests: Fix O= and KBUILD_OUTPUT handling for relative paths
    • Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt · 1c1ff483
      Linus Torvalds authored
      Pull fsverity updates from Eric Biggers:
       "Expose the fs-verity bit through statx()"
      
      * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
        docs: fs-verity: mention statx() support
        f2fs: support STATX_ATTR_VERITY
        ext4: support STATX_ATTR_VERITY
        statx: define STATX_ATTR_VERITY
        docs: fs-verity: document first supported kernel version
    • Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt · ea4b71bc
      Linus Torvalds authored
      Pull fscrypt updates from Eric Biggers:
      
       - Add the IV_INO_LBLK_64 encryption policy flag which modifies the
         encryption to be optimized for UFS inline encryption hardware.
      
       - For AES-128-CBC, use the crypto API's implementation of ESSIV (which
         was added in 5.4) rather than doing ESSIV manually.
      
       - A few other cleanups.
      
      * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
        f2fs: add support for IV_INO_LBLK_64 encryption policies
        ext4: add support for IV_INO_LBLK_64 encryption policies
        fscrypt: add support for IV_INO_LBLK_64 policies
        fscrypt: avoid data race on fscrypt_mode::logged_impl_name
        docs: ioctl-number: document fscrypt ioctl numbers
        fscrypt: zeroize fscrypt_info before freeing
        fscrypt: remove struct fscrypt_ctx
        fscrypt: invoke crypto API for ESSIV handling
    • Merge tag 'affs-for-5.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · ae36607b
      Linus Torvalds authored
      Pull AFFS updates from David Sterba:
       "A minor bugfix and cleanup for AFFS"
      
      * tag 'affs-for-5.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        affs: fix a memory leak in affs_remount
        affs: Replace binary semaphores with mutexes
    • Merge tag 'for-5.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 97d0bf96
      Linus Torvalds authored
      Pull btrfs updates from David Sterba:
       "User visible changes:
         - new block group profiles: RAID1 with 3- and 4- copies
             - RAID1 in btrfs has always 2 copies, now add support for 3 and 4
             - this is an incompat feature (named RAID1C34)
             - recommended use of RAID1C3 is replacement of RAID6 profile on
               metadata, this brings a more reliable resiliency against 2
               device loss/damage
      
         - support for new checksums
             - per-filesystem, set at mkfs time
             - fast hash (crc32c successor): xxhash, 64bit digest
             - strong hashes (both 256bit): sha256 (slower, FIPS), blake2b
               (faster)
             - the blake2b module goes via the crypto tree, btrfs.ko has a
               soft dependency
      
         - speed up lseek, don't take inode locks unnecessarily, this can
           speed up parallel SEEK_CUR/SEEK_SET/SEEK_END by 80%
      
         - send:
             - allow clone operations within the same file
             - limit maximum number of sent clone references to avoid slow
               backref walking
      
         - error message improvements: device scan prints process name and PID
      
        Core changes:
         - cleanups
             - remove unique workqueue helpers, used to provide a way to avoid
               deadlocks in the workqueue code, now done in a simpler way
             - remove lots of indirect function calls in compression code
             - extent IO tree code moved out of extent_io.c
             - cleanup backup superblock handling at mount time
             - transaction life cycle documentation and cleanups
             - locking code cleanups, annotations and documentation
             - add more cold, const, pure function attributes
             - removal of unused or redundant struct members or variables
      
         - new tree-checker sanity tests
             - try to detect missing INODE_ITEM, cross-reference checks of
               DIR_ITEM, DIR_INDEX, INODE_REF, and XATTR_* items
      
         - remove own bio scheduling code (used to avoid checksum submissions
           being stuck behind other IO), replaced by cgroup controller-based
           code to allow better control and avoid priority inversions in cases
           where the custom and cgroup scheduling disagreed
      
        Fixes:
         - avoid getting stuck during cyclic writebacks
      
         - fix trimming of ranges crossing block group boundaries
      
         - fix rename exchange on subvolumes, all involved subvolumes need to
           be recorded in the transaction"
      
      * tag 'for-5.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (137 commits)
        btrfs: drop bdev argument from submit_extent_page
        btrfs: remove extent_map::bdev
        btrfs: drop bio_set_dev where not needed
        btrfs: get bdev directly from fs_devices in submit_extent_page
        btrfs: record all roots for rename exchange on a subvol
        Btrfs: fix block group remaining RO forever after error during device replace
        btrfs: scrub: Don't check free space before marking a block group RO
        btrfs: change btrfs_fs_devices::rotating to bool
        btrfs: change btrfs_fs_devices::seeding to bool
        btrfs: rename btrfs_block_group_cache
        btrfs: block-group: Reuse the item key from caller of read_one_block_group()
        btrfs: block-group: Refactor btrfs_read_block_groups()
        btrfs: document extent buffer locking
        btrfs: access eb::blocking_writers according to ACCESS_ONCE policies
        btrfs: set blocking_writers directly, no increment or decrement
        btrfs: merge blocking_writers branches in btrfs_tree_read_lock
        btrfs: drop incompat bit for raid1c34 after last block group is gone
        btrfs: add incompat for raid1 with 3, 4 copies
        btrfs: add support for 4-copy replication (raid1c4)
        btrfs: add support for 3-copy replication (raid1c3)
        ...
    • Merge tag 'mtd/for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux · 1b88176b
      Linus Torvalds authored
      Pull MTD updates from Miquel Raynal:
       "MTD core:
         - drop inactive maintainers, update the repositories and add IRC
           channel
         - debugfs functions improvements
         - initialize more structure parameters
         - misc fixes reported by robots
      
        MTD devices:
         - spear_smi: Fixed Write Burst mode
         - new Intel IXP4xx flash probing hook
      
        Raw NAND core:
         - useless extra checks dropped
         - update the detection of the bad block markers position
      
        Raw NAND controller drivers:
         - Cadence: new driver
         - Brcmnand: support for flash-dma v0 + fixes
         - Denali: drop support for the legacy controller/chip DT representation
         - superfluous dev_err() calls removed
      
        SPI NOR core changes:
         - introduce 'struct spi_nor_controller_ops'
         - clean the Register Operations methods
         - use dev_dbg instead of dev_err for low level info
         - fix retlen handling in sst_write()
         - fix silent truncations in spi_nor_read and spi_nor_read_raw()
         - fix the clearing of QE bit on lock()/unlock()
         - rework the disabling of the block write protection
         - rework the Quad Enable methods
         - make sure nor->spimem and nor->controller_ops are mutually exclusive
         - set default Quad Enable method for ISSI flashes
         - add support for few flashes
      
        SPI NOR controller drivers changes:
         - intel-spi:
            - support chips without software sequencer
            - add support for Intel Cannon Lake and Intel Comet Lake-H flashes
      
        CFI core changes:
         - code cleanups related to useless initializers and coding style issues
         - fix for a possible double free problem in cfi_cmdset_0002
         - improved HyperFlash error reporting and handling in cfi_cmdset_0002 core"
      
      * tag 'mtd/for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux: (73 commits)
        mtd: devices: fix mchp23k256 read and write
        mtd: no need to check return value of debugfs_create functions
        mtd: spi-nor: Set default Quad Enable method for ISSI flashes
        mtd: spi-nor: Add support for is25wp256
        mtd: spi-nor: Add support for w25q256jw
        mtd: spi-nor: Move condition to avoid a NULL check
        mtd: spi-nor: Make sure nor->spimem and nor->controller_ops are mutually exclusive
        mtd: spi-nor: Rename Quad Enable methods
        mtd: spi-nor: Merge spansion Quad Enable methods
        mtd: spi-nor: Rename CR_QUAD_EN_SPAN to SR2_QUAD_EN_BIT1
        mtd: spi-nor: Extend the SR Read Back test
        mtd: spi-nor: Rework the disabling of block write protection
        mtd: spi-nor: Fix clearing of QE bit on lock()/unlock()
        mtd: cfi_cmdset_0002: fix delayed error detection on HyperFlash
        mtd: cfi_cmdset_0002: only check errors when ready in cfi_check_err_status()
        mtd: cfi_cmdset_0002: don't free cfi->cfiq in error path of cfi_amdstd_setup()
        mtd: cfi_cmdset_*: kill useless 'ret' variable initializers
        mtd: cfi_util: use DIV_ROUND_UP() in cfi_udelay()
        mtd: spi-nor: Print debug message when the read back test fails
        mtd: spi-nor: Check all the bits written, not just the BP ones
        ...
      1b88176b
    • Merge tag 'for-5.5/dm-changes' of... · eeee2827
      Linus Torvalds authored
      Merge tag 'for-5.5/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
      
      Pull device mapper updates from Mike Snitzer:
      
       - Fix DM core to disallow stacking request-based DM on partitions.
      
       - Fix DM raid target to properly resync raidset even if bitmap needed
         additional pages.
      
       - Fix DM crypt performance regression due to use of WQ_HIGHPRI for the
         IO and crypt workqueues.
      
       - Fix DM integrity metadata layout that was aligned on 128K boundary
         rather than the intended 4K boundary (removes 124K of wasted space
         for each metadata block).
      
       - Improve the DM thin, cache and clone targets to use spin_lock_irq
         rather than spin_lock_irqsave where possible.
      
       - Fix DM thin single thread performance that was lost due to needless
         workqueue wakeups.
      
       - Fix DM zoned target performance that was lost due to excessive
         backing device checks.
      
       - Add ability to trigger write failure with the DM dust test target.
      
       - Fix whitespace indentation in drivers/md/Kconfig.
      
       - Various small fixes and cleanups (e.g. use struct_size, fix
         uninitialized variable, variable renames, etc).
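
      The 124K figure in the DM integrity item follows from plain alignment
      arithmetic. A minimal sketch, assuming a 4K metadata run; the run size
      and the helper names are illustrative and are not taken from the
      dm-integrity source:

```python
KIB = 1024


def round_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return -(-value // alignment) * alignment


def padding(run_size: int, alignment: int) -> int:
    """Bytes wasted when a run of run_size bytes is padded to alignment."""
    return round_up(run_size, alignment) - run_size


run = 4 * KIB  # hypothetical metadata run size (assumption)
wasted_old = padding(run, 128 * KIB)  # layout aligned on a 128K boundary
wasted_new = padding(run, 4 * KIB)    # intended 4K boundary
print(wasted_old // KIB, wasted_new // KIB)  # prints: 124 0
```

      Under that assumption, every run padded to 128K carries 124K of
      padding, which the 4K alignment removes.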
      
      * tag 'for-5.5/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (22 commits)
        Revert "dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues"
        dm: Fix Kconfig indentation
        dm thin: wakeup worker only when deferred bios exist
        dm integrity: fix excessive alignment of metadata runs
        dm raid: Remove unnecessary negation of a shift in raid10_format_to_md_layout
        dm zoned: reduce overhead of backing device checks
        dm dust: add limited write failure mode
        dm dust: change ret to r in dust_map_read and dust_map
        dm dust: change result vars to r
        dm cache: replace spin_lock_irqsave with spin_lock_irq
        dm bio prison: replace spin_lock_irqsave with spin_lock_irq
        dm thin: replace spin_lock_irqsave with spin_lock_irq
        dm clone: add bucket_lock_irq/bucket_unlock_irq helpers
        dm clone: replace spin_lock_irqsave with spin_lock_irq
        dm writecache: handle REQ_FUA
        dm writecache: fix uninitialized variable warning
        dm stripe: use struct_size() in kmalloc()
        dm raid: streamline rs_get_progress() and its raid_status() caller side
        dm raid: simplify rs_setup_recovery call chain
        dm raid: to ensure resynchronization, perform raid set grow in preresume
        ...
      eeee2827
    • Merge tag 'for-5.5/disk-revalidate-20191122' of git://git.kernel.dk/linux-block · 7e5192b9
      Linus Torvalds authored
      Pull disk revalidation updates from Jens Axboe:
       "This continues the work that Jan Kara started to thoroughly clean up
        and consolidate how we handle rescans and revalidations"
      
      * tag 'for-5.5/disk-revalidate-20191122' of git://git.kernel.dk/linux-block:
        block: move clearing bd_invalidated into check_disk_size_change
        block: remove (__)blkdev_reread_part as an exported API
        block: fix bdev_disk_changed for non-partitioned devices
        block: move rescan_partitions to fs/block_dev.c
        block: merge invalidate_partitions into rescan_partitions
        block: refactor rescan_partitions
      7e5192b9
    • Merge tag 'for-5.5/zoned-20191122' of git://git.kernel.dk/linux-block · 464a47f4
      Linus Torvalds authored
      Pull zoned block device update from Jens Axboe:
       "Enhancements and improvements to the zoned device support"
      
      * tag 'for-5.5/zoned-20191122' of git://git.kernel.dk/linux-block:
        scsi: sd_zbc: Remove set but not used variable 'buflen'
        block: rework zone reporting
        scsi: sd_zbc: Cleanup sd_zbc_alloc_report_buffer()
        null_blk: Add zone_nr_conv to features
        null_blk: clean up report zones
        null_blk: clean up the block device operations
        block: Remove partition support for zoned block devices
        block: Simplify report zones execution
        block: cleanup the !zoned case in blk_revalidate_disk_zones
        block: Enhance blk_revalidate_disk_zones()
      464a47f4
    • Merge tag 'for-5.5/drivers-post-20191122' of git://git.kernel.dk/linux-block · 323264ee
      Linus Torvalds authored
      Pull additional block driver updates from Jens Axboe:
       "Here's another block driver update, done to avoid conflicts with the
        zoned changes coming next.
      
        This contains:
      
         - Prepare SCSI sd for zone open/close/finish support
      
         - Small NVMe pull request
              - hwmon support (Akinobu)
              - add new co-maintainer (Christoph)
              - work-around for a discard issue on non-conformant drives
                (Eduard)
      
         - Small nbd leak fix"
      
      * tag 'for-5.5/drivers-post-20191122' of git://git.kernel.dk/linux-block:
        nbd: prevent memory leak
        nvme: hwmon: add quirk to avoid changing temperature threshold
        nvme: hwmon: provide temperature min and max values for each sensor
        nvmet: add another maintainer
        nvme: Discard workaround for non-conformant devices
        nvme: Add hardware monitoring support
        scsi: sd_zbc: add zone open, close, and finish support
      323264ee
    • Merge tag 'for-5.5/drivers-20191121' of git://git.kernel.dk/linux-block · 2d539430
      Linus Torvalds authored
      Pull block driver updates from Jens Axboe:
       "Here are the main block driver updates for 5.5. Nothing major in here,
        mostly just fixes. This contains:
      
         - a set of bcache changes via Coly
      
         - MD changes from Song
      
         - loop unmap write-zeroes fix (Darrick)
      
         - spelling fixes (Geert)
      
         - zoned additions and cleanups to null_blk/dm (Ajay)
      
         - allow null_blk online submit queue changes (Bart)
      
         - NVMe changes via Keith, nothing major here either"
      
      * tag 'for-5.5/drivers-20191121' of git://git.kernel.dk/linux-block: (56 commits)
        Revert "bcache: fix fifo index swapping condition in journal_pin_cmp()"
        drivers/md/raid5-ppl.c: use the new spelling of RWH_WRITE_LIFE_NOT_SET
        drivers/md/raid5.c: use the new spelling of RWH_WRITE_LIFE_NOT_SET
        bcache: don't export symbols
        bcache: remove the extra cflags for request.o
        bcache: at least try to shrink 1 node in bch_mca_scan()
        bcache: add idle_max_writeback_rate sysfs interface
        bcache: add code comments in bch_btree_leaf_dirty()
        bcache: fix deadlock in bcache_allocator
        bcache: add code comment bch_keylist_pop() and bch_keylist_pop_front()
        bcache: deleted code comments for dead code in bch_data_insert_keys()
        bcache: add more accurate error messages in read_super()
        bcache: fix static checker warning in bcache_device_free()
        bcache: fix a lost wake-up problem caused by mca_cannibalize_lock
        bcache: fix fifo index swapping condition in journal_pin_cmp()
        md/raid10: prevent access of uninitialized resync_pages offset
        md: avoid invalid memory access for array sb->dev_roles
        md/raid1: avoid soft lockup under high load
        null_blk: add zone open, close, and finish support
        dm: add zone open, close and finish support
        ...
      2d539430
    • Merge tag 'for-5.5/block-20191121' of git://git.kernel.dk/linux-block · ff6814b0
      Linus Torvalds authored
      Pull core block updates from Jens Axboe:
       "Due to more granular branches, this one is small and will be followed
        with other core branches that add specific features. I meant to just
        have a core and drivers branch, but due to external dependencies we
        ended up adding a few more that are also core.
      
        The changes are:
      
         - Fixes and improvements for the zoned device support (Ajay, Damien)
      
         - sed-opal table writing and datastore UID (Revanth)
      
         - blk-cgroup (and bfq) stat fixes (Tejun)
      
         - Improvements to the block stats tracking (Pavel)
      
         - Fix for overrunning sysfs buffer for large number of CPUs (Ming)
      
         - Optimization for small IO (Ming, Christoph)
      
         - Fix typo in RWH lifetime hint (Eugene)
      
         - Dead code removal and documentation (Bart)
      
         - Reduction in memory usage for queue and tag set (Bart)
      
         - Kerneldoc header documentation (André)
      
         - Device/partition revalidation fixes (Jan)
      
         - Stats tracking for flush requests (Konstantin)
      
         - Various other little fixes here and there (et al)"
      
      * tag 'for-5.5/block-20191121' of git://git.kernel.dk/linux-block: (48 commits)
        Revert "block: split bio if the only bvec's length is > SZ_4K"
        block: add iostat counters for flush requests
        block,bfq: Skip tracing hooks if possible
        block: sed-opal: Introduce SUM_SET_LIST parameter and append it using 'add_token_u64'
        blk-cgroup: cgroup_rstat_updated() shouldn't be called on cgroup1
        block: Don't disable interrupts in trigger_softirq()
        sbitmap: Delete sbitmap_any_bit_clear()
        blk-mq: Delete blk_mq_has_free_tags() and blk_mq_can_queue()
        block: split bio if the only bvec's length is > SZ_4K
        block: still try to split bio if the bvec crosses pages
        blk-cgroup: separate out blkg_rwstat under CONFIG_BLK_CGROUP_RWSTAT
        blk-cgroup: reimplement basic IO stats using cgroup rstat
        blk-cgroup: remove now unused blkg_print_stat_{bytes|ios}_recursive()
        blk-throtl: stop using blkg->stat_bytes and ->stat_ios
        bfq-iosched: stop using blkg->stat_bytes and ->stat_ios
        bfq-iosched: relocate bfqg_*rwstat*() helpers
        block: add zone open, close and finish ioctl support
        block: add zone open, close and finish operations
        block: Simplify REQ_OP_ZONE_RESET_ALL handling
        block: Remove REQ_OP_ZONE_RESET plugging
        ...
      ff6814b0
    • Merge tag 'for-5.5/libata-20191121' of git://git.kernel.dk/linux-block · 6e7b06a4
      Linus Torvalds authored
      Pull libata updates from Jens Axboe:
       "Just a few fixes all over the place, support for the Annapurna SATA
        controller, and a patchset that cleans up the error defines and
        ultimately fixes an issue with sata_mv"
      
      * tag 'for-5.5/libata-20191121' of git://git.kernel.dk/linux-block:
        ata: pata_artop: make arrays static const, makes object smaller
        ata_piix: remove open-coded dmi_match(DMI_OEM_STRING)
        ata: sata_mv, avoid trigerrable BUG_ON
        ata: make qc_prep return ata_completion_errors
        ata: define AC_ERR_OK
        ata: Documentation, fix function names
        libata: Ensure ata_port probe has completed before detach
        ahci: tegra: use regulator_bulk_set_supply_names()
        ahci: Add support for Amazon's Annapurna Labs SATA controller
      6e7b06a4
    • Merge tag 'for-5.5/io_uring-20191121' of git://git.kernel.dk/linux-block · fb4b3d3f
      Linus Torvalds authored
      Pull io_uring updates from Jens Axboe:
       "A lot of stuff has been going on this cycle, with improving the
        support for networked IO (and hence unbounded request completion
        times) being one of the major themes. There's been a set of fixes done
        this week, I'll send those out as well once we're certain we're fully
        happy with them.
      
        This contains:
      
         - Unification of the "normal" submit path and the SQPOLL path (Pavel)
      
         - Support for sparse (and bigger) file sets, and updating of those
           file sets without needing to unregister/register again.
      
         - Independently sized CQ ring, instead of just making it always 2x
           the SQ ring size. This makes it more flexible for networked
           applications.
      
         - Support for overflowed CQ ring, never dropping events but providing
           backpressure on submits.
      
         - Add support for absolute timeouts, not just relative ones.
      
         - Support for generic cancellations. This divorces io_uring from
           workqueues as well, which additionally gets us one step closer to
           generic async system call support.
      
         - With cancellations, we can support grabbing the process file table
           as well, just like we do mm context. This allows support for system
           calls that create file descriptors, like accept4() support that's
           built on top of that.
      
         - Support for io_uring tracing (Dmitrii)
      
         - Support for linked timeouts. These abort an operation if it isn't
           completed by the time noted in the linked timeout.
      
         - Speedup tracking of poll requests
      
         - Various cleanups making the code easier to follow (Jackie, Pavel,
           Bob, YueHaibing, me)
      
         - Update MAINTAINERS with new io_uring list"
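
      The "independently sized CQ ring" item can be pictured with a toy
      sizing model. This is only a sketch: the 2x default comes from the
      text above, while the power-of-two rounding and the rule that a
      requested CQ size may not be smaller than the SQ size are assumptions
      about the kernel's behaviour, and ring_sizes() is a hypothetical
      helper, not an io_uring API:

```python
from typing import Optional, Tuple


def round_up_pow2(n: int) -> int:
    """Smallest power of two that is >= n (n must be at least 1)."""
    if n < 1:
        raise ValueError("ring size must be positive")
    return 1 << (n - 1).bit_length()


def ring_sizes(sq_entries: int, cq_entries: Optional[int] = None) -> Tuple[int, int]:
    """Return (sq_size, cq_size) for a requested ring setup.

    Without an explicit cq_entries the CQ ring is twice the SQ ring,
    as described in the pull message. With cq_entries the CQ ring is
    sized on its own (rounding and validation are assumptions).
    """
    sq = round_up_pow2(sq_entries)
    if cq_entries is None:
        return sq, 2 * sq
    if cq_entries < sq:
        raise ValueError("CQ ring may not be smaller than the SQ ring")
    return sq, round_up_pow2(cq_entries)


print(ring_sizes(100))        # prints: (128, 256)
print(ring_sizes(8, 1000))    # prints: (8, 1024)
```

      The second call shows the networked-IO use case: a small submission
      ring paired with a much larger completion ring.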
      
      * tag 'for-5.5/io_uring-20191121' of git://git.kernel.dk/linux-block: (64 commits)
        io_uring: make POLL_ADD/POLL_REMOVE scale better
        io-wq: remove now redundant struct io_wq_nulls_list
        io_uring: Fix getting file for non-fd opcodes
        io_uring: introduce req_need_defer()
        io_uring: clean up io_uring_cancel_files()
        io-wq: ensure free/busy list browsing see all items
        io-wq: ensure we have a stable view of ->cur_work for cancellations
        io_wq: add get/put_work handlers to io_wq_create()
        io_uring: check for validity of ->rings in teardown
        io_uring: fix potential deadlock in io_poll_wake()
        io_uring: use correct "is IO worker" helper
        io_uring: fix -ENOENT issue with linked timer with short timeout
        io_uring: don't do flush cancel under inflight_lock
        io_uring: flag SQPOLL busy condition to userspace
        io_uring: make ASYNC_CANCEL work with poll and timeout
        io_uring: provide fallback request for OOM situations
        io_uring: convert accept4() -ERESTARTSYS into -EINTR
        io_uring: fix error clear of ->file_table in io_sqe_files_register()
        io_uring: separate the io_free_req and io_free_req_find_next interface
        io_uring: keep io_put_req only responsible for release and put req
        ...
      fb4b3d3f
    • Merge tag 'tpmdd-next-20191112' of git://git.infradead.org/users/jjs/linux-tpmdd · 54f0e540
      Linus Torvalds authored
      Pull tpmdd updates from Jarkko Sakkinen:
      
       - support for Cr50 fTPM
      
       - support for fTPM on AMD Zen+ CPUs
      
       - TPM 2.0 trusted keys code relocated from drivers/char/tpm to
         security/keys
      
      * tag 'tpmdd-next-20191112' of git://git.infradead.org/users/jjs/linux-tpmdd:
        KEYS: trusted: Remove set but not used variable 'keyhndl'
        tpm: Switch to platform_get_irq_optional()
        tpm_crb: fix fTPM on AMD Zen+ CPUs
        KEYS: trusted: Move TPM2 trusted keys code
        KEYS: trusted: Create trusted keys subsystem
        KEYS: Use common tpm_buf for trusted and asymmetric keys
        tpm: Move tpm_buf code to include/linux/
        tpm: use GFP_KERNEL instead of GFP_HIGHMEM for tpm_buf
        tpm: add check after commands attribs tab allocation
        tpm: tpm_tis_spi: Drop THIS_MODULE usage from driver struct
        tpm: tpm_tis_spi: Cleanup includes
        tpm: tpm_tis_spi: Support cr50 devices
        tpm: tpm_tis_spi: Introduce a flow control callback
        tpm: Add a flag to indicate TPM power is managed by firmware
        dt-bindings: tpm: document properties for cr50
        tpm_tis: override durations for STM tpm with firmware 1.2.8.28
        tpm: provide a way to override the chip returned durations
        tpm: Remove duplicate code from caps_show() in tpm-sysfs.c
      54f0e540
    • vfs: properly and reliably lock f_pos in fdget_pos() · 0be0ee71
      Linus Torvalds authored
      fdget_pos() is used by file operations that will read and update f_pos:
      things like "read()", "write()" and "lseek()" (but not, for example,
      "pread()/pwrite()" that get their file positions elsewhere).
      
      However, it had two separate escape clauses for this, because not
      everybody wants or needs serialization of the file position.
      
      The first and most obvious case is the "file descriptor doesn't have a
      position at all", ie a stream-like file.  Except we didn't actually use
      FMODE_STREAM, but instead used FMODE_ATOMIC_POS.  The reason for that
      was that FMODE_STREAM didn't exist back in the day, but also that we
      didn't want to mark all the special cases, so we only marked the ones
      that _required_ position atomicity according to POSIX - regular files
      and directories.
      
      The first case was intentionally lazy, but now that we _do_ have
      FMODE_STREAM we could and should just use it.  With the change to use
      FMODE_STREAM, there are no remaining uses for FMODE_ATOMIC_POS, and all
      the code to set it is deleted.
      
      Any cases where we don't want the serialization because the driver (or
      subsystem) doesn't use the file position should just be updated to do
      "stream_open()".  We've done that for all the obvious and common
      situations, we may need a few more.  Quoting Kirill Smelkov in the
      original FMODE_STREAM thread (see link below for full email):
      
       "And I appreciate if people could help at least somehow with "getting
        rid of mixed case entirely" (i.e. always lock f_pos_lock on
        !FMODE_STREAM), because this transition starts to diverge from my
        particular use-case too far. To me it makes sense to do that
        transition as follows:
      
         - convert nonseekable_open -> stream_open via stream_open.cocci;
         - audit other nonseekable_open calls and convert left users that
           truly don't depend on position to stream_open;
         - extend stream_open.cocci to analyze alloc_file_pseudo as well (this
           will cover pipes and sockets), or maybe convert pipes and sockets
           to FMODE_STREAM manually;
         - extend stream_open.cocci to analyze file_operations that use
           no_llseek or noop_llseek, but do not use nonseekable_open or
           alloc_file_pseudo. This might find files that have stream semantic
           but are opened differently;
         - extend stream_open.cocci to analyze file_operations whose
           .read/.write do not use ppos at all (independently of how file was
           opened);
         - ...
         - after that remove FMODE_ATOMIC_POS and always take f_pos_lock if
           !FMODE_STREAM;
         - gather bug reports for deadlocked read/write and convert missed
           cases to FMODE_STREAM, probably extending stream_open.cocci along
           the road to catch similar cases
      
        i.e. always take f_pos_lock unless a file is explicitly marked as
        being stream, and try to find and cover all files that are streams"
      
      We have not done the "extend stream_open.cocci to analyze
      alloc_file_pseudo" step, but the previous commit did manually handle
      the case of pipes and sockets.
      
      The other case where we can avoid locking f_pos is the "this file
      descriptor only has a single user and it is us, and thus there is no
      need to lock it".
      
      The second test was correct, although a bit subtle and worth just
      re-iterating here.  There are two kinds of other sources of references
      to the same file descriptor: file descriptors that have been explicitly
      shared across fork() or with dup(), and file tables having elevated
      reference counts due to threading (or explicit file sharing with
      clone()).
      
      The first case would have incremented the file count explicitly, and in
      the second case the previous __fdget() would have incremented it for us
      and set the FDPUT_FPUT flag.
      
      But in both cases the file count would be greater than one, so the
      "file_count(file) > 1" test catches both situations.  Also note that if
      file_count is 1, that also means that no other thread can have access to
      the file table, so there also cannot be races with concurrent calls to
      dup()/fork()/clone() that would increment the file count any other way.
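
      The two tests described above can be modelled in a few lines. This is
      a simplified sketch, not the kernel code: the File class, the flag
      values and the two helper functions are made up for illustration, and
      "count" stands in for file_count(file):

```python
from dataclasses import dataclass

# Illustrative bit values only; they are not the kernel's definitions.
FMODE_STREAM = 0x1
FMODE_ATOMIC_POS = 0x2


@dataclass
class File:
    f_mode: int
    count: int  # stands in for file_count(file)


def needs_pos_lock_old(f: File) -> bool:
    """Old rule: lock only files explicitly marked FMODE_ATOMIC_POS."""
    return bool(f.f_mode & FMODE_ATOMIC_POS) and f.count > 1


def needs_pos_lock_new(f: File) -> bool:
    """New rule: lock everything that is not marked FMODE_STREAM."""
    return not (f.f_mode & FMODE_STREAM) and f.count > 1


shared_regular = File(FMODE_ATOMIC_POS, count=2)
shared_unmarked = File(0, count=2)           # e.g. a driver with neither flag
shared_pipe = File(FMODE_STREAM, count=2)
private_regular = File(FMODE_ATOMIC_POS, count=1)

for f in (shared_regular, shared_unmarked, shared_pipe, private_regular):
    print(needs_pos_lock_old(f), needs_pos_lock_new(f))
```

      The only row that differs between the two rules is the shared file
      with neither flag set: it used to skip the lock and now takes it,
      which is why such drivers should be converted to stream_open() if
      they do not use the file position.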
      
      Link: https://lore.kernel.org/linux-fsdevel/20190413184404.GA13490@deco.navytux.spb.ru
      Cc: Kirill Smelkov <kirr@nexedi.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Marco Elver <elver@google.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Paul McKenney <paulmck@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0be0ee71
    • vfs: mark pipes and sockets as stream-like file descriptors · d8e464ec
      Linus Torvalds authored
      In commit 3975b097 ("convert stream-like files -> stream_open, even
      if they use noop_llseek") Kirill used a coccinelle script to change
      "nonseekable_open()" to "stream_open()", which changed the trivial cases
      of stream-like file descriptors to the new model with FMODE_STREAM.
      
      However, the two big cases - sockets and pipes - don't actually have
      that trivial pattern at all, and were thus never converted to
      FMODE_STREAM even though it makes lots of sense to do so.
      
      That's particularly true when looking forward to the next change:
      getting rid of FMODE_ATOMIC_POS entirely, and just using FMODE_STREAM to
      decide whether f_pos updates are needed or not.  And if they are, we'll
      always do them atomically.
      
      This came up because KCSAN (correctly) noted that the non-locked f_pos
      updates are data races: they are clearly benign for the case where we
      don't care, but it would be good to just not have that issue exist at
      all.
      
      Note that the reason we used FMODE_ATOMIC_POS originally is that only
      doing it for the minimal required case is "safer" in that it's possible
      that the f_pos locking can cause unnecessary serialization across the
      whole write() call.  And in the worst case, that kind of serialization
      can cause deadlock issues: think writers that need readers to empty the
      state using the same file descriptor.
      
      [ Note that the locking is per-file descriptor - because it protects
        "f_pos", which is obviously per-file descriptor - so it only affects
        cases where you literally use the same file descriptor to both read
        and write.
      
        So a regular pipe that has separate reading and writing file
        descriptors doesn't really have this situation even though it's the
        obvious case of "reader empties what a bit writer concurrently fills"
      
        But we want to mark pipes as being stream-like anyway, because we
        don't want the unnecessary overhead of locking, and because a named
        pipe can be (ab-)used by reading and writing to the same file
        descriptor. ]
      
      There are likely a lot of other cases that might want FMODE_STREAM, and
      looking for ".llseek = no_llseek" users and other cases that don't have
      an lseek file operation at all and making them use "stream_open()" might
      be a good idea.  But pipes and sockets are likely to be the two main
      cases.
      
      Cc: Kirill Smelkov <kirr@nexedi.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Marco Elver <elver@google.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Paul McKenney <paulmck@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8e464ec
    • Linux 5.4 · 219d5433
      Linus Torvalds authored
      219d5433
  4. 24 Nov, 2019 2 commits
  5. 23 Nov, 2019 2 commits
  6. 22 Nov, 2019 1 commit