1. 11 Mar, 2022 3 commits
    • David Howells's avatar
      watch_queue: Fix to release page in ->release() · c1853fba
      David Howells authored
      When a pipe ring descriptor points to a notification message, the
      refcount on the backing page is incremented by the generic get function,
      but the release function, which marks the bitmap, doesn't drop the page
      ref.
      
      Fix this by calling generic_pipe_buf_release() at the end of
      watch_queue_pipe_buf_release().
      
      Fixes: c73be61c ("pipe: Add general notification queue support")
      Reported-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c1853fba
    • David Howells's avatar
      watch_queue, pipe: Free watchqueue state after clearing pipe ring · db8facfc
      David Howells authored
      In free_pipe_info(), free the watchqueue state after clearing the pipe
      ring as each pipe ring descriptor has a release function, and in the
      case of a notification message, this is watch_queue_pipe_buf_release()
      which tries to mark the allocation bitmap that was previously released.
      
      Fix this by moving the put of the pipe's ref on the watch queue to after
      the ring has been cleared.  We still need to call watch_queue_clear()
      before doing that to make sure that the pipe is disconnected from any
      notification sources first.
      
      Fixes: c73be61c ("pipe: Add general notification queue support")
      Reported-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      db8facfc
    • David Howells's avatar
      watch_queue: Fix filter limit check · c993ee0f
      David Howells authored
      In watch_queue_set_filter(), there are a couple of places where we check
      that the filter type value does not exceed what the type_filter bitmap
      can hold.  One place calculates the number of bits by:
      
         if (tf[i].type >= sizeof(wfilter->type_filter) * 8)
      
      which is fine, but the second does:
      
         if (tf[i].type >= sizeof(wfilter->type_filter) * BITS_PER_LONG)
      
      which is not.  This can lead to a couple of out-of-bounds writes due to
      a too-large type:
      
       (1) __set_bit() on wfilter->type_filter
       (2) Writing more elements in wfilter->filters[] than we allocated.
      
      Fix this by just using the proper WATCH_TYPE__NR instead, which is the
      number of types we actually know about.
      
      The bug may cause an oops looking something like:
      
        BUG: KASAN: slab-out-of-bounds in watch_queue_set_filter+0x659/0x740
        Write of size 4 at addr ffff88800d2c66bc by task watch_queue_oob/611
        ...
        Call Trace:
         <TASK>
         dump_stack_lvl+0x45/0x59
         print_address_description.constprop.0+0x1f/0x150
         ...
         kasan_report.cold+0x7f/0x11b
         ...
         watch_queue_set_filter+0x659/0x740
         ...
         __x64_sys_ioctl+0x127/0x190
         do_syscall_64+0x43/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
        Allocated by task 611:
         kasan_save_stack+0x1e/0x40
         __kasan_kmalloc+0x81/0xa0
         watch_queue_set_filter+0x23a/0x740
         __x64_sys_ioctl+0x127/0x190
         do_syscall_64+0x43/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
        The buggy address belongs to the object at ffff88800d2c66a0
         which belongs to the cache kmalloc-32 of size 32
        The buggy address is located 28 bytes inside of
         32-byte region [ffff88800d2c66a0, ffff88800d2c66c0)
      
      Fixes: c73be61c ("pipe: Add general notification queue support")
      Reported-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c993ee0f
  2. 06 Mar, 2022 5 commits
    • Linus Torvalds's avatar
      Linux 5.17-rc7 · ffb217a1
      Linus Torvalds authored
      ffb217a1
    • Linus Torvalds's avatar
      Merge tag 'for-5.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 3ee65c0f
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
       "A few more fixes for various problems that have user visible effects
        or seem to be urgent:
      
         - fix corruption when combining DIO and non-blocking io_uring over
           multiple extents (seen on MariaDB)
      
         - fix relocation crash due to premature return from commit
      
         - fix quota deadlock between rescan and qgroup removal
      
         - fix item data bounds checks in tree-checker (found on a fuzzed
           image)
      
         - fix fsync of prealloc extents after EOF
      
         - add missing run of delayed items after unlink during log replay
      
         - don't start relocation until snapshot drop is finished
      
         - fix reversed condition for subpage writers locking
      
         - fix warning on page error"
      
      * tag 'for-5.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: fallback to blocking mode when doing async dio over multiple extents
        btrfs: add missing run of delayed items after unlink during log replay
        btrfs: qgroup: fix deadlock between rescan worker and remove qgroup
        btrfs: fix relocation crash due to premature return from btrfs_commit_transaction()
        btrfs: do not start relocation until in progress drops are done
        btrfs: tree-checker: use u64 for item data end to avoid overflow
        btrfs: do not WARN_ON() if we have PageError set
        btrfs: fix lost prealloc extents beyond eof after full fsync
        btrfs: subpage: fix a wrong check on subpage->writers
      3ee65c0f
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · f81664f7
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
       "x86 guest:
      
         - Tweaks to the paravirtualization code, to avoid using them when
           they're pointless or harmful
      
        x86 host:
      
         - Fix for SRCU lockdep splat
      
         - Brown paper bag fix for the propagation of errno"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run
        KVM: x86/mmu: Passing up the error state of mmu_alloc_shadow_roots()
        KVM: x86: Yield to IPI target vCPU only if it is busy
        x86/kvmclock: Fix Hyper-V Isolated VM's boot issue when vCPUs > 64
        x86/kvm: Don't waste memory if kvmclock is disabled
        x86/kvm: Don't use PV TLB/yield when mwait is advertised
      f81664f7
    • Linus Torvalds's avatar
      Merge tag 'powerpc-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 9bdeaca1
      Linus Torvalds authored
      Pull powerpc fix from Michael Ellerman:
       "Fix build failure when CONFIG_PPC_64S_HASH_MMU is not set.
      
        Thanks to Murilo Opsfelder Araujo, and Erhard F"
      
      * tag 'powerpc-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/64s: Fix build failure when CONFIG_PPC_64S_HASH_MMU is not set
      9bdeaca1
    • Linus Torvalds's avatar
      Merge tag 'trace-v5.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · f40a33f5
      Linus Torvalds authored
      Pull tracing fixes from Steven Rostedt:
      
       - Fix sorting on old "cpu" value in histograms
      
       - Fix return value of __setup() boot parameter handlers
      
      * tag 'trace-v5.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        tracing: Fix return value of __setup handlers
        tracing/histogram: Fix sorting on old "cpu" value
      f40a33f5
  3. 05 Mar, 2022 13 commits
  4. 04 Mar, 2022 15 commits
    • Linus Torvalds's avatar
      Merge tag 'riscv-for-linus-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux · 07ebd38a
      Linus Torvalds authored
      Pull RISC-V fixes from Palmer Dabbelt:
      
       - Fixes for a handful of KASAN-related crashes.
      
       - A fix to avoid a crash during boot for SPARSEMEM &&
         !SPARSEMEM_VMEMMAP configurations.
      
       - A fix to stop reporting some incorrect errors under DEBUG_VIRTUAL.
      
       - A fix for the K210's device tree to properly populate the interrupt
         map, so hart1 will get interrupts again.
      
      * tag 'riscv-for-linus-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
        riscv: dts: k210: fix broken IRQs on hart1
        riscv: Fix kasan pud population
        riscv: Move high_memory initialization to setup_bootmem
        riscv: Fix config KASAN && DEBUG_VIRTUAL
        riscv: Fix DEBUG_VIRTUAL false warnings
        riscv: Fix config KASAN && SPARSEMEM && !SPARSE_VMEMMAP
        riscv: Fix is_linear_mapping with recent move of KASAN region
      07ebd38a
    • Linus Torvalds's avatar
      Merge tag 'iommu-fixes-v5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · 3f509f59
      Linus Torvalds authored
      Pull iommu fixes from Joerg Roedel:
      
       - Fix a double list_add() in Intel VT-d code
      
       - Add missing put_device() in Tegra SMMU driver
      
       - Two AMD IOMMU fixes:
           - Memory leak in IO page-table freeing code
           - Add missing recovery from event-log overflow
      
      * tag 'iommu-fixes-v5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
        iommu/tegra-smmu: Fix missing put_device() call in tegra_smmu_find
        iommu/vt-d: Fix double list_add when enabling VMD in scalable mode
        iommu/amd: Fix I/O page table memory leak
        iommu/amd: Recover from event log overflow
      3f509f59
    • Linus Torvalds's avatar
      Merge tag 'thermal-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · a4ffdb61
      Linus Torvalds authored
      Pull thermal control fix from Rafael Wysocki:
       "Fix NULL pointer dereference in the thermal netlink interface (Nicolas
        Cavallari)"
      
      * tag 'thermal-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        thermal: core: Fix TZ_GET_TRIP NULL pointer dereference
      a4ffdb61
    • Linus Torvalds's avatar
      Merge tag 'sound-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 8d670948
      Linus Torvalds authored
      Pull sound fixes from Takashi Iwai:
       "Hopefully the last PR for 5.17, including just a few small changes:
        an additional fix for ASoC ops boundary check and other minor
        device-specific fixes"
      
      * tag 'sound-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ALSA: intel_hdmi: Fix reference to PCM buffer address
        ASoC: cs4265: Fix the duplicated control name
        ASoC: ops: Shift tested values in snd_soc_put_volsw() by +min
      8d670948
    • Linus Torvalds's avatar
      Merge tag 'drm-fixes-2022-03-04' of git://anongit.freedesktop.org/drm/drm · c4fc118a
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Things are quieting down as expected, just a small set of fixes, i915,
        exynos, amdgpu, vrr, bridge and hdlcd. Nothing scary at all.
      
        i915:
         - Fix GuC SLPC unset command
         - Fix misidentification of some Apple MacBook Pro laptops as Jasper Lake
      
        amdgpu:
         - Suspend regression fix
      
        exynos:
         - irq handling fixes
         - Fix two regressions to TE-gpio handling
      
        arm/hdlcd:
         - Select DRM_GEM_CMEA_HELPER for HDLCD
      
        bridge:
         - ti-sn65dsi86: Properly undo autosuspend
      
        vrr:
         - Fix potential NULL-pointer deref"
      
      * tag 'drm-fixes-2022-03-04' of git://anongit.freedesktop.org/drm/drm:
        drm/amdgpu: fix suspend/resume hang regression
        drm/vrr: Set VRR capable prop only if it is attached to connector
        drm/arm: arm hdlcd select DRM_GEM_CMA_HELPER
        drm/bridge: ti-sn65dsi86: Properly undo autosuspend
        drm/i915: s/JSP2/ICP2/ PCH
        drm/i915/guc/slpc: Correct the param count for unset param
        drm/exynos: Search for TE-gpio in DSI panel's node
        drm/exynos: Don't fail if no TE-gpio is defined for DSI driver
        drm/exynos: gsc: Use platform_get_irq() to get the interrupt
        drm/exynos/fimc: Use platform_get_irq() to get the interrupt
        drm/exynos/exynos_drm_fimd: Use platform_get_irq_byname() to get the interrupt
        drm/exynos: mixer: Use platform_get_irq() to get the interrupt
        drm/exynos/exynos7_drm_decon: Use platform_get_irq_byname() to get the interrupt
      c4fc118a
    • Linus Torvalds's avatar
      Merge tag 'pinctrl-v5.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · 0b7344a6
      Linus Torvalds authored
      Pull pin control fixes from Linus Walleij:
       "These two fixes should fix the issues seen on the OrangePi, first we
        needed the correct offset when calling pinctrl_gpio_direction(), and
        fixing that made a lockdep issue explode in our face. Both now fixed"
      
      * tag 'pinctrl-v5.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
        pinctrl: sunxi: Use unique lockdep classes for IRQs
        pinctrl-sunxi: sunxi_pinctrl_gpio_direction_in/output: use correct offset
      0b7344a6
    • Randy Dunlap's avatar
      tracing: Fix return value of __setup handlers · 1d02b444
      Randy Dunlap authored
      __setup() handlers should generally return 1 to indicate that the
      boot options have been handled.
      
      Using invalid option values causes the entire kernel boot option
      string to be reported as Unknown and added to init's environment
      strings, polluting it.
      
        Unknown kernel command line parameters "BOOT_IMAGE=/boot/bzImage-517rc6
          kprobe_event=p,syscall_any,$arg1 trace_options=quiet
          trace_clock=jiffies", will be passed to user space.
      
       Run /sbin/init as init process
         with arguments:
           /sbin/init
         with environment:
           HOME=/
           TERM=linux
           BOOT_IMAGE=/boot/bzImage-517rc6
           kprobe_event=p,syscall_any,$arg1
           trace_options=quiet
           trace_clock=jiffies
      
      Return 1 from the __setup() handlers so that init's environment is not
      polluted with kernel boot options.
      
      Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru
      Link: https://lkml.kernel.org/r/20220303031744.32356-1-rdunlap@infradead.org
      
      Cc: stable@vger.kernel.org
      Fixes: 7bcfaf54 ("tracing: Add trace_options kernel command line parameter")
      Fixes: e1e232ca ("tracing: Add trace_clock=<clock> kernel parameter")
      Fixes: 970988e1 ("tracing/kprobe: Add kprobe_event= boot parameter")
      Signed-off-by: default avatarRandy Dunlap <rdunlap@infradead.org>
      Reported-by: default avatarIgor Zhbanov <i.zhbanov@omprussia.ru>
      Acked-by: default avatarMasami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      1d02b444
    • Daniel Borkmann's avatar
      mm: Consider __GFP_NOWARN flag for oversized kvmalloc() calls · 0708a0af
      Daniel Borkmann authored
      syzkaller was recently triggering an oversized kvmalloc() warning via
      xdp_umem_create().
      
      The triggered warning was added back in 7661809d ("mm: don't allow
      oversized kvmalloc() calls"). The rationale for the warning for huge
      kvmalloc sizes was as a reaction to a security bug where the size was
      more than UINT_MAX but not everything was prepared to handle unsigned
      long sizes.
      
      Anyway, the AF_XDP related call trace from this syzkaller report was:
      
        kvmalloc include/linux/mm.h:806 [inline]
        kvmalloc_array include/linux/mm.h:824 [inline]
        kvcalloc include/linux/mm.h:829 [inline]
        xdp_umem_pin_pages net/xdp/xdp_umem.c:102 [inline]
        xdp_umem_reg net/xdp/xdp_umem.c:219 [inline]
        xdp_umem_create+0x6a5/0xf00 net/xdp/xdp_umem.c:252
        xsk_setsockopt+0x604/0x790 net/xdp/xsk.c:1068
        __sys_setsockopt+0x1fd/0x4e0 net/socket.c:2176
        __do_sys_setsockopt net/socket.c:2187 [inline]
        __se_sys_setsockopt net/socket.c:2184 [inline]
        __x64_sys_setsockopt+0xb5/0x150 net/socket.c:2184
        do_syscall_x64 arch/x86/entry/common.c:50 [inline]
        do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Björn mentioned that requests for >2GB allocation can still be valid:
      
        The structure that is being allocated is the page-pinning accounting.
        AF_XDP has an internal limit of U32_MAX pages, which is *a lot*, but
        still fewer than what memcg allows (PAGE_COUNTER_MAX is a LONG_MAX/
        PAGE_SIZE on 64 bit systems). [...]
      
        I could just change from U32_MAX to INT_MAX, but as I stated earlier
        that has a hacky feeling to it. [...] From my perspective, the code
        isn't broken, with the memcg limits in consideration. [...]
      
      Linus says:
      
        [...] Pretty much every time this has come up, the kernel warning has
        shown that yes, the code was broken and there really wasn't a reason
        for doing allocations that big.
      
        Of course, some people would be perfectly fine with the allocation
        failing, they just don't want the warning. I didn't want __GFP_NOWARN
        to shut it up originally because I wanted people to see all those
        cases, but these days I think we can just say "yeah, people can shut
        it up explicitly by saying 'go ahead and fail this allocation, don't
        warn about it'".
      
        So enough time has passed that by now I'd certainly be ok with [it].
      
      Thus allow call-sites to silence such userspace triggered splats if the
      allocation requests have __GFP_NOWARN. For xdp_umem_pin_pages()'s call
      to kvcalloc() this is already the case, so nothing else needed there.
      
      Fixes: 7661809d ("mm: don't allow oversized kvmalloc() calls")
      Reported-by: syzbot+11421fbbff99b989670e@syzkaller.appspotmail.com
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Tested-by: syzbot+11421fbbff99b989670e@syzkaller.appspotmail.com
      Cc: Björn Töpel <bjorn@kernel.org>
      Cc: Magnus Karlsson <magnus.karlsson@intel.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: David S. Miller <davem@davemloft.net>
      Link: https://lore.kernel.org/bpf/CAJ+HfNhyfsT5cS_U9EC213ducHs9k9zNxX9+abqC0kTrPbQ0gg@mail.gmail.com
      Link: https://lore.kernel.org/bpf/20211201202905.b9892171e3f5b9a60f9da251@linux-foundation.orgReviewed-by: default avatarLeon Romanovsky <leonro@nvidia.com>
      Ackd-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0708a0af
    • Filipe Manana's avatar
      btrfs: fallback to blocking mode when doing async dio over multiple extents · ca93e44b
      Filipe Manana authored
      Some users recently reported that MariaDB was getting a read corruption
      when using io_uring on top of btrfs. This started to happen in 5.16,
      after commit 51bd9563 ("btrfs: fix deadlock due to page faults
      during direct IO reads and writes"). That changed btrfs to use the new
      iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling
      iomap_dio_rw(). This was necessary to fix deadlocks when the iovector
      corresponds to a memory mapped file region. That type of scenario is
      exercised by test case generic/647 from fstests.
      
      For this MariaDB scenario, we attempt to read 16K from file offset X
      using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each
      with a size of 4K, and what happens is the following:
      
      1) btrfs_direct_read() disables page faults and calls iomap_dio_rw();
      
      2) iomap creates a struct iomap_dio object, its reference count is
         initialized to 1 and its ->size field is initialized to 0;
      
      3) iomap calls btrfs_dio_iomap_begin() with file offset X, which finds
         the first 4K extent, and setups an iomap for this extent consisting
         of a single page;
      
      4) At iomap_dio_bio_iter(), we are able to access the first page of the
         buffer (struct iov_iter) with bio_iov_iter_get_pages() without
         triggering a page fault;
      
      5) iomap submits a bio for this 4K extent
         (iomap_dio_submit_bio() -> btrfs_submit_direct()) and increments
         the refcount on the struct iomap_dio object to 2; The ->size field
         of the struct iomap_dio object is incremented to 4K;
      
      6) iomap calls btrfs_iomap_begin() again, this time with a file
         offset of X + 4K. There we setup an iomap for the next extent
         that also has a size of 4K;
      
      7) Then at iomap_dio_bio_iter() we call bio_iov_iter_get_pages(),
         which tries to access the next page (2nd page) of the buffer.
         This triggers a page fault and returns -EFAULT;
      
      8) At __iomap_dio_rw() we see the -EFAULT, but we reset the error
         to 0 because we passed the flag IOMAP_DIO_PARTIAL to iomap and
         the struct iomap_dio object has a ->size value of 4K (we submitted
         a bio for an extent already). The 'wait_for_completion' variable
         is not set to true, because our iocb has IOCB_NOWAIT set;
      
      9) At the bottom of __iomap_dio_rw(), we decrement the reference count
         of the struct iomap_dio object from 2 to 1. Because we were not
         the only ones holding a reference on it and 'wait_for_completion' is
         set to false, -EIOCBQUEUED is returned to btrfs_direct_read(), which
         just returns it up the callchain, up to io_uring;
      
      10) The bio submitted for the first extent (step 5) completes and its
          bio endio function, iomap_dio_bio_end_io(), decrements the last
          reference on the struct iomap_dio object, resulting in calling
          iomap_dio_complete_work() -> iomap_dio_complete().
      
      11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
          and return 4K (the amount of io done) to iomap_dio_complete_work();
      
      12) iomap_dio_complete_work() calls the iocb completion callback,
          iocb->ki_complete() with a second argument value of 4K (total io
          done) and the iocb with the adjust ki_pos of X + 4K. This results
          in completing the read request for io_uring, leaving it with a
          result of 4K bytes read, and only the first page of the buffer
          filled in, while the remaining 3 pages, corresponding to the other
          3 extents, were not filled;
      
      13) For the application, the result is unexpected because if we ask
          to read N bytes, it expects to get N bytes read as long as those
          N bytes don't cross the EOF (i_size).
      
      MariaDB reports this as an error, as it's not expecting a short read,
      since it knows it's asking for read operations fully within the i_size
      boundary. This is typical in many applications, but it may also be
      questionable if they should react to such short reads by issuing more
      read calls to get the remaining data. Nevertheless, the short read
      happened due to a change in btrfs regarding how it deals with page
      faults while in the middle of a read operation, and there's no reason
      why btrfs can't have the previous behaviour of returning the whole data
      that was requested by the application.
      
      The problem can also be triggered with the following simple program:
      
        /* Get O_DIRECT */
        #ifndef _GNU_SOURCE
        #define _GNU_SOURCE
        #endif
      
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <fcntl.h>
        #include <errno.h>
        #include <string.h>
        #include <liburing.h>
      
        int main(int argc, char *argv[])
        {
            char *foo_path;
            struct io_uring ring;
            struct io_uring_sqe *sqe;
            struct io_uring_cqe *cqe;
            struct iovec iovec;
            int fd;
            long pagesize;
            void *write_buf;
            void *read_buf;
            ssize_t ret;
            int i;
      
            if (argc != 2) {
                fprintf(stderr, "Use: %s <directory>\n", argv[0]);
                return 1;
            }
      
            foo_path = malloc(strlen(argv[1]) + 5);
            if (!foo_path) {
                fprintf(stderr, "Failed to allocate memory for file path\n");
                return 1;
            }
            strcpy(foo_path, argv[1]);
            strcat(foo_path, "/foo");
      
            /*
             * Create file foo with 2 extents, each with a size matching
             * the page size. Then allocate a buffer to read both extents
             * with io_uring, using O_DIRECT and IOCB_NOWAIT. Before doing
             * the read with io_uring, access the first page of the buffer
             * to fault it in, so that during the read we only trigger a
             * page fault when accessing the second page of the buffer.
             */
             fd = open(foo_path, O_CREAT | O_TRUNC | O_WRONLY |
                      O_DIRECT, 0666);
             if (fd == -1) {
                 fprintf(stderr,
                         "Failed to create file 'foo': %s (errno %d)",
                         strerror(errno), errno);
                 return 1;
             }
      
             pagesize = sysconf(_SC_PAGE_SIZE);
             ret = posix_memalign(&write_buf, pagesize, 2 * pagesize);
             if (ret) {
                 fprintf(stderr, "Failed to allocate write buffer\n");
                 return 1;
             }
      
             memset(write_buf, 0xab, pagesize);
             memset(write_buf + pagesize, 0xcd, pagesize);
      
             /* Create 2 extents, each with a size matching page size. */
             for (i = 0; i < 2; i++) {
                 ret = pwrite(fd, write_buf + i * pagesize, pagesize,
                              i * pagesize);
                 if (ret != pagesize) {
                     fprintf(stderr,
                           "Failed to write to file, ret = %ld errno %d (%s)\n",
                            ret, errno, strerror(errno));
                     return 1;
                 }
                 ret = fsync(fd);
                 if (ret != 0) {
                     fprintf(stderr, "Failed to fsync file\n");
                     return 1;
                 }
             }
      
             close(fd);
             fd = open(foo_path, O_RDONLY | O_DIRECT);
             if (fd == -1) {
                 fprintf(stderr,
                         "Failed to open file 'foo': %s (errno %d)",
                         strerror(errno), errno);
                 return 1;
             }
      
             ret = posix_memalign(&read_buf, pagesize, 2 * pagesize);
             if (ret) {
                 fprintf(stderr, "Failed to allocate read buffer\n");
                 return 1;
             }
      
             /*
              * Fault in only the first page of the read buffer.
              * We want to trigger a page fault for the 2nd page of the
              * read buffer during the read operation with io_uring
              * (O_DIRECT and IOCB_NOWAIT).
              */
             memset(read_buf, 0, 1);
      
             ret = io_uring_queue_init(1, &ring, 0);
             if (ret != 0) {
                 fprintf(stderr, "Failed to create io_uring queue\n");
                 return 1;
             }
      
             sqe = io_uring_get_sqe(&ring);
             if (!sqe) {
                 fprintf(stderr, "Failed to get io_uring sqe\n");
                 return 1;
             }
      
             iovec.iov_base = read_buf;
             iovec.iov_len = 2 * pagesize;
             io_uring_prep_readv(sqe, fd, &iovec, 1, 0);
      
             ret = io_uring_submit_and_wait(&ring, 1);
             if (ret != 1) {
                 fprintf(stderr,
                         "Failed at io_uring_submit_and_wait()\n");
                 return 1;
             }
      
             ret = io_uring_wait_cqe(&ring, &cqe);
             if (ret < 0) {
                 fprintf(stderr, "Failed at io_uring_wait_cqe()\n");
                 return 1;
             }
      
             printf("io_uring read result for file foo:\n\n");
             printf("  cqe->res == %d (expected %d)\n", cqe->res, 2 * pagesize);
             printf("  memcmp(read_buf, write_buf) == %d (expected 0)\n",
                    memcmp(read_buf, write_buf, 2 * pagesize));
      
             io_uring_cqe_seen(&ring, cqe);
             io_uring_queue_exit(&ring);
      
             return 0;
        }
      
      When running it on an unpatched kernel:
      
        $ gcc io_uring_test.c -luring
        $ mkfs.btrfs -f /dev/sda
        $ mount /dev/sda /mnt/sda
        $ ./a.out /mnt/sda
        io_uring read result for file foo:
      
          cqe->res == 4096 (expected 8192)
          memcmp(read_buf, write_buf) == -205 (expected 0)
      
      After this patch, the read always returns 8192 bytes, with the buffer
      filled with the correct data. Although that reproducer always triggers
      the bug in my test vms, it's possible that it will not be so reliable
      on other environments, as that can happen if the bio for the first
      extent completes and decrements the reference on the struct iomap_dio
      object before we do the atomic_dec_and_test() on the reference at
      __iomap_dio_rw().
      
      Fix this in btrfs by having btrfs_dio_iomap_begin() return -EAGAIN
      whenever we try to satisfy a non blocking IO request (IOMAP_NOWAIT flag
      set) over a range that spans multiple extents (or a mix of extents and
      holes). This avoids returning success to the caller when we only did
      partial IO, which is not optimal for writes and for reads it's actually
      incorrect, as the caller doesn't expect to get less bytes read than it has
      requested (unless EOF is crossed), as previously mentioned. This is also
      the type of behaviour that xfs follows (xfs_direct_write_iomap_begin()),
      even though it doesn't use IOMAP_DIO_PARTIAL.
      
      A test case for fstests will follow soon.
      
      Link: https://lore.kernel.org/linux-btrfs/CABVffEM0eEWho+206m470rtM0d9J8ue85TtR-A_oVTuGLWFicA@mail.gmail.com/
      Link: https://lore.kernel.org/linux-btrfs/CAHF2GV6U32gmqSjLe=XKgfcZAmLCiH26cJ2OnHGp5x=VAH4OHQ@mail.gmail.com/
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ca93e44b
    • Niklas Cassel's avatar
      riscv: dts: k210: fix broken IRQs on hart1 · 74583f1b
      Niklas Cassel authored
      Commit 67d96729 ("riscv: Update Canaan Kendryte K210 device tree")
      incorrectly removed two entries from the PLIC interrupt-controller node's
      interrupts-extended property.
      
      The PLIC driver cannot know the mapping between hart contexts and hart ids,
      so this information has to be provided by device tree, as specified by the
      PLIC device tree binding.
      
      The PLIC driver uses the interrupts-extended property, and initializes the
      hart context registers in the exact same order as provided by the
      interrupts-extended property.
      
      In other words, if we don't specify the S-mode interrupts, the PLIC driver
      will simply initialize the hart0 S-mode hart context with the hart1 M-mode
      configuration. It is therefore essential to specify the S-mode IRQs even
      though the system itself will only ever be running in M-mode.
      
      Re-add the S-mode interrupts, so that we get working IRQs on hart1 again.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 67d96729 ("riscv: Update Canaan Kendryte K210 device tree")
      Signed-off-by: default avatarNiklas Cassel <niklas.cassel@wdc.com>
      Signed-off-by: default avatarPalmer Dabbelt <palmer@rivosinc.com>
      74583f1b
    • Dave Airlie's avatar
      Merge tag 'drm-misc-fixes-2022-03-03' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes · 8fdb1967
      Dave Airlie authored
       * drm/arm: Select DRM_GEM_CMEA_HELPER for HDLCD
       * drm/bridge: ti-sn65dsi86: Properly undo autosuspend
       * drm/vrr: Fix potential NULL-pointer deref
      Signed-off-by: default avatarDave Airlie <airlied@redhat.com>
      
      From: Thomas Zimmermann <tzimmermann@suse.de>
      Link: https://patchwork.freedesktop.org/patch/msgid/YiCTGZ8IVCw0ilKK@linux-uq9g
      8fdb1967
    • Dave Airlie's avatar
      Merge tag 'amd-drm-fixes-5.17-2022-03-02' of... · c9585249
      Dave Airlie authored
      Merge tag 'amd-drm-fixes-5.17-2022-03-02' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
      
      amd-drm-fixes-5.17-2022-03-02:
      
      amdgpu:
      - Suspend regression fix
      Signed-off-by: default avatarDave Airlie <airlied@redhat.com>
      From: Alex Deucher <alexander.deucher@amd.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20220303045035.5650-1-alexander.deucher@amd.com
      c9585249
    • Dave Airlie's avatar
      Merge tag 'drm-intel-fixes-2022-03-03' of... · 0d9f0ee1
      Dave Airlie authored
      Merge tag 'drm-intel-fixes-2022-03-03' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
      
      - Fix GuC SLPC unset command. (Vinay Belgaumkar)
      - Fix misidentification of some Apple MacBook Pro laptops as Jasper Lake. (Ville Syrjälä)
      Signed-off-by: default avatarDave Airlie <airlied@redhat.com>
      From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/YiCXHiTyCE7TbopG@tursulin-mobl2
      0d9f0ee1
    • William Mahon's avatar
      HID: add mapping for KEY_ALL_APPLICATIONS · 327b89f0
      William Mahon authored
      This patch adds a new key definition for KEY_ALL_APPLICATIONS
      and aliases KEY_DASHBOARD to it.
      
      It also maps the 0x0c/0x2a2 usage code to KEY_ALL_APPLICATIONS.
      Signed-off-by: default avatarWilliam Mahon <wmahon@chromium.org>
      Acked-by: default avatarBenjamin Tissoires <benjamin.tissoires@redhat.com>
      Link: https://lore.kernel.org/r/20220303035618.1.I3a7746ad05d270161a18334ae06e3b6db1a1d339@changeidSigned-off-by: default avatarDmitry Torokhov <dmitry.torokhov@gmail.com>
      327b89f0
    • William Mahon's avatar
      HID: add mapping for KEY_DICTATE · bfa26ba3
      William Mahon authored
      Numerous keyboards are adding dictate keys which allows for text
      messages to be dictated by a microphone.
      
      This patch adds a new key definition KEY_DICTATE and maps 0x0c/0x0d8
      usage code to this new keycode. Additionally hid-debug is adjusted to
      recognize this new usage code as well.
      Signed-off-by: default avatarWilliam Mahon <wmahon@chromium.org>
      Acked-by: default avatarBenjamin Tissoires <benjamin.tissoires@redhat.com>
      Link: https://lore.kernel.org/r/20220303021501.1.I5dbf50eb1a7a6734ee727bda4a8573358c6d3ec0@changeidSigned-off-by: default avatarDmitry Torokhov <dmitry.torokhov@gmail.com>
      bfa26ba3
  5. 03 Mar, 2022 4 commits
    • Alexandre Ghiti's avatar
      riscv: Fix kasan pud population · e4fcfe6e
      Alexandre Ghiti authored
      In sv48, the kasan inner regions are not aligned on PGDIR_SIZE and then
      when we populate the kasan linear mapping region, we clear the kasan
      vmalloc region which is in the same PGD.
      
      Fix this by copying the content of the kasan early pud after allocating a
      new PGD for the first time.
      
      Fixes: e8a62cc2 ("riscv: Implement sv48 support")
      Signed-off-by: default avatarAlexandre Ghiti <alexandre.ghiti@canonical.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPalmer Dabbelt <palmer@rivosinc.com>
      e4fcfe6e
    • Alexandre Ghiti's avatar
      riscv: Move high_memory initialization to setup_bootmem · 625e24a5
      Alexandre Ghiti authored
      high_memory used to be initialized in mem_init, way after setup_bootmem.
      But a call to dma_contiguous_reserve in this function gives rise to the
      below warning because high_memory is equal to 0 and is used at the very
      beginning at cma_declare_contiguous_nid.
      
      It went unnoticed since the move of the kasan region redefined
      KERN_VIRT_SIZE so that it does not encompass -1 anymore.
      
      Fix this by initializing high_memory in setup_bootmem.
      
      ------------[ cut here ]------------
      virt_to_phys used for non-linear address: ffffffffffffffff (0xffffffffffffffff)
      WARNING: CPU: 0 PID: 0 at arch/riscv/mm/physaddr.c:14 __virt_to_phys+0xac/0x1b8
      Modules linked in:
      CPU: 0 PID: 0 Comm: swapper Not tainted 5.17.0-rc1-00007-ga68b89289e26 #27
      Hardware name: riscv-virtio,qemu (DT)
      epc : __virt_to_phys+0xac/0x1b8
       ra : __virt_to_phys+0xac/0x1b8
      epc : ffffffff80014922 ra : ffffffff80014922 sp : ffffffff84a03c30
       gp : ffffffff85866c80 tp : ffffffff84a3f180 t0 : ffffffff86bce657
       t1 : fffffffef09406e8 t2 : 0000000000000000 s0 : ffffffff84a03c70
       s1 : ffffffffffffffff a0 : 000000000000004f a1 : 00000000000f0000
       a2 : 0000000000000002 a3 : ffffffff8011f408 a4 : 0000000000000000
       a5 : 0000000000000000 a6 : 0000000000f00000 a7 : ffffffff84a03747
       s2 : ffffffd800000000 s3 : ffffffff86ef4000 s4 : ffffffff8467f828
       s5 : fffffff800000000 s6 : 8000000000006800 s7 : 0000000000000000
       s8 : 0000000480000000 s9 : 0000000080038ea0 s10: 0000000000000000
       s11: ffffffffffffffff t3 : ffffffff84a035c0 t4 : fffffffef09406e8
       t5 : fffffffef09406e9 t6 : ffffffff84a03758
      status: 0000000000000100 badaddr: 0000000000000000 cause: 0000000000000003
      [<ffffffff8322ef4c>] cma_declare_contiguous_nid+0xf2/0x64a
      [<ffffffff83212a58>] dma_contiguous_reserve_area+0x46/0xb4
      [<ffffffff83212c3a>] dma_contiguous_reserve+0x174/0x18e
      [<ffffffff83208fc2>] paging_init+0x12c/0x35e
      [<ffffffff83206bd2>] setup_arch+0x120/0x74e
      [<ffffffff83201416>] start_kernel+0xce/0x68c
      irq event stamp: 0
      hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      hardirqs last disabled at (0): [<0000000000000000>] 0x0
      softirqs last  enabled at (0): [<0000000000000000>] 0x0
      softirqs last disabled at (0): [<0000000000000000>] 0x0
      ---[ end trace 0000000000000000 ]---
      
      Fixes: f7ae0233 ("riscv: Move KASAN mapping next to the kernel mapping")
      Signed-off-by: default avatarAlexandre Ghiti <alexandre.ghiti@canonical.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPalmer Dabbelt <palmer@rivosinc.com>
      625e24a5
    • Alexandre Ghiti's avatar
      riscv: Fix config KASAN && DEBUG_VIRTUAL · c648c4bb
      Alexandre Ghiti authored
      __virt_to_phys function is called very early in the boot process (ie
      kasan_early_init) so it should not be instrumented by KASAN otherwise it
      bugs.
      
      Fix this by declaring phys_addr.c as non-kasan instrumentable.
      Signed-off-by: default avatarAlexandre Ghiti <alexandre.ghiti@canonical.com>
      Fixes: 8ad8b727 (riscv: Add KASAN support)
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPalmer Dabbelt <palmer@rivosinc.com>
      c648c4bb
    • Alexandre Ghiti's avatar
      riscv: Fix DEBUG_VIRTUAL false warnings · 5f763b3b
      Alexandre Ghiti authored
      KERN_VIRT_SIZE used to encompass the kernel mapping before it was
      redefined when moving the kasan mapping next to the kernel mapping to only
      match the maximum amount of physical memory.
      
      Then, kernel mapping addresses that go through __virt_to_phys are now
      declared as wrong which is not true, one can use __virt_to_phys on such
      addresses.
      
      Fix this by redefining the condition that matches wrong addresses.
      
      Fixes: f7ae0233 ("riscv: Move KASAN mapping next to the kernel mapping")
      Signed-off-by: default avatarAlexandre Ghiti <alexandre.ghiti@canonical.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPalmer Dabbelt <palmer@rivosinc.com>
      5f763b3b