1. 12 Feb, 2022 4 commits
    • Roman Gushchin's avatar
      mm: memcg: synchronize objcg lists with a dedicated spinlock · 0764db9b
      Roman Gushchin authored
      Alexander reported a circular lock dependency revealed by the mmap1 ltp
      test:
      
        LOCKDEP_CIRCULAR (suite: ltp, case: mtest06 (mmap1))
                WARNING: possible circular locking dependency detected
                5.17.0-20220113.rc0.git0.f2211f194038.300.fc35.s390x+debug #1 Not tainted
                ------------------------------------------------------
                mmap1/202299 is trying to acquire lock:
                00000001892c0188 (css_set_lock){..-.}-{2:2}, at: obj_cgroup_release+0x4a/0xe0
                but task is already holding lock:
                00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180
                which lock already depends on the new lock.
                the existing dependency chain (in reverse order) is:
                -> #1 (&sighand->siglock){-.-.}-{2:2}:
                       __lock_acquire+0x604/0xbd8
                       lock_acquire.part.0+0xe2/0x238
                       lock_acquire+0xb0/0x200
                       _raw_spin_lock_irqsave+0x6a/0xd8
                       __lock_task_sighand+0x90/0x190
                       cgroup_freeze_task+0x2e/0x90
                       cgroup_migrate_execute+0x11c/0x608
                       cgroup_update_dfl_csses+0x246/0x270
                       cgroup_subtree_control_write+0x238/0x518
                       kernfs_fop_write_iter+0x13e/0x1e0
                       new_sync_write+0x100/0x190
                       vfs_write+0x22c/0x2d8
                       ksys_write+0x6c/0xf8
                       __do_syscall+0x1da/0x208
                       system_call+0x82/0xb0
                -> #0 (css_set_lock){..-.}-{2:2}:
                       check_prev_add+0xe0/0xed8
                       validate_chain+0x736/0xb20
                       __lock_acquire+0x604/0xbd8
                       lock_acquire.part.0+0xe2/0x238
                       lock_acquire+0xb0/0x200
                       _raw_spin_lock_irqsave+0x6a/0xd8
                       obj_cgroup_release+0x4a/0xe0
                       percpu_ref_put_many.constprop.0+0x150/0x168
                       drain_obj_stock+0x94/0xe8
                       refill_obj_stock+0x94/0x278
                       obj_cgroup_charge+0x164/0x1d8
                       kmem_cache_alloc+0xac/0x528
                       __sigqueue_alloc+0x150/0x308
                       __send_signal+0x260/0x550
                       send_signal+0x7e/0x348
                       force_sig_info_to_task+0x104/0x180
                       force_sig_fault+0x48/0x58
                       __do_pgm_check+0x120/0x1f0
                       pgm_check_handler+0x11e/0x180
                other info that might help us debug this:
                 Possible unsafe locking scenario:
                       CPU0                    CPU1
                       ----                    ----
                  lock(&sighand->siglock);
                                               lock(css_set_lock);
                                               lock(&sighand->siglock);
                  lock(css_set_lock);
                 *** DEADLOCK ***
                2 locks held by mmap1/202299:
                 #0: 00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180
                 #1: 00000001892ad560 (rcu_read_lock){....}-{1:2}, at: percpu_ref_put_many.constprop.0+0x0/0x168
                stack backtrace:
                CPU: 15 PID: 202299 Comm: mmap1 Not tainted 5.17.0-20220113.rc0.git0.f2211f194038.300.fc35.s390x+debug #1
                Hardware name: IBM 3906 M04 704 (LPAR)
                Call Trace:
                  dump_stack_lvl+0x76/0x98
                  check_noncircular+0x136/0x158
                  check_prev_add+0xe0/0xed8
                  validate_chain+0x736/0xb20
                  __lock_acquire+0x604/0xbd8
                  lock_acquire.part.0+0xe2/0x238
                  lock_acquire+0xb0/0x200
                  _raw_spin_lock_irqsave+0x6a/0xd8
                  obj_cgroup_release+0x4a/0xe0
                  percpu_ref_put_many.constprop.0+0x150/0x168
                  drain_obj_stock+0x94/0xe8
                  refill_obj_stock+0x94/0x278
                  obj_cgroup_charge+0x164/0x1d8
                  kmem_cache_alloc+0xac/0x528
                  __sigqueue_alloc+0x150/0x308
                  __send_signal+0x260/0x550
                  send_signal+0x7e/0x348
                  force_sig_info_to_task+0x104/0x180
                  force_sig_fault+0x48/0x58
                  __do_pgm_check+0x120/0x1f0
                  pgm_check_handler+0x11e/0x180
                INFO: lockdep is turned off.
      
      In this example a slab allocation from __send_signal() caused a
      refilling and draining of a percpu objcg stock, resulted in a releasing
      of another non-related objcg.  Objcg release path requires taking the
      css_set_lock, which is used to synchronize objcg lists.
      
      This can create a circular dependency with the sighandler lock, which is
      taken with the locked css_set_lock by the freezer code (to freeze a
      task).
      
      In general it seems that using css_set_lock to synchronize objcg lists
      makes any slab allocations and deallocation with the locked css_set_lock
      and any intervened locks risky.
      
      To fix the problem and make the code more robust let's stop using
      css_set_lock to synchronize objcg lists and use a new dedicated spinlock
      instead.
      
      Link: https://lkml.kernel.org/r/Yfm1IHmoGdyUR81T@carbon.dhcp.thefacebook.com
      Fixes: bf4f0599 ("mm: memcg/slab: obj_cgroup API")
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Reported-by: default avatarAlexander Egorenkov <egorenar@linux.ibm.com>
      Tested-by: default avatarAlexander Egorenkov <egorenar@linux.ibm.com>
      Reviewed-by: default avatarWaiman Long <longman@redhat.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Reviewed-by: default avatarJeremy Linton <jeremy.linton@arm.com>
      Tested-by: default avatarJeremy Linton <jeremy.linton@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0764db9b
    • Mel Gorman's avatar
      mm: vmscan: remove deadlock due to throttling failing to make progress · b485c6f1
      Mel Gorman authored
      A soft lockup bug in kcompactd was reported in a private bugzilla with
      the following visible in dmesg;
      
        watchdog: BUG: soft lockup - CPU#33 stuck for 26s! [kcompactd0:479]
        watchdog: BUG: soft lockup - CPU#33 stuck for 52s! [kcompactd0:479]
        watchdog: BUG: soft lockup - CPU#33 stuck for 78s! [kcompactd0:479]
        watchdog: BUG: soft lockup - CPU#33 stuck for 104s! [kcompactd0:479]
      
      The machine had 256G of RAM with no swap and an earlier failed
      allocation indicated that node 0 where kcompactd was run was potentially
      unreclaimable;
      
        Node 0 active_anon:29355112kB inactive_anon:2913528kB active_file:0kB
          inactive_file:0kB unevictable:64kB isolated(anon):0kB isolated(file):0kB
          mapped:8kB dirty:0kB writeback:0kB shmem:26780kB shmem_thp:
          0kB shmem_pmdmapped: 0kB anon_thp: 23480320kB writeback_tmp:0kB
          kernel_stack:2272kB pagetables:24500kB all_unreclaimable? yes
      
      Vlastimil Babka investigated a crash dump and found that a task
      migrating pages was trying to drain PCP lists;
      
        PID: 52922  TASK: ffff969f820e5000  CPU: 19  COMMAND: "kworker/u128:3"
        Call Trace:
           __schedule
           schedule
           schedule_timeout
           wait_for_completion
           __flush_work
           __drain_all_pages
           __alloc_pages_slowpath.constprop.114
           __alloc_pages
           alloc_migration_target
           migrate_pages
           migrate_to_node
           do_migrate_pages
           cpuset_migrate_mm_workfn
           process_one_work
           worker_thread
           kthread
           ret_from_fork
      
      This failure is specific to CONFIG_PREEMPT=n builds.  The root of the
      problem is that kcompact0 is not rescheduling on a CPU while a task that
      has isolated a large number of the pages from the LRU is waiting on
      kcompact0 to reschedule so the pages can be released.  While
      shrink_inactive_list() only loops once around too_many_isolated, reclaim
      can continue without rescheduling if sc->skipped_deactivate == 1 which
      could happen if there was no file LRU and the inactive anon list was not
      low.
      
      Link: https://lkml.kernel.org/r/20220203100326.GD3301@suse.de
      Fixes: d818fca1 ("mm/vmscan: throttle reclaim and compaction when too may pages are isolated")
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Debugged-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b485c6f1
    • Yang Shi's avatar
      fs/proc: task_mmu.c: don't read mapcount for migration entry · 24d7275c
      Yang Shi authored
      The syzbot reported the below BUG:
      
        kernel BUG at include/linux/page-flags.h:785!
        invalid opcode: 0000 [#1] PREEMPT SMP KASAN
        CPU: 1 PID: 4392 Comm: syz-executor560 Not tainted 5.16.0-rc6-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:PageDoubleMap include/linux/page-flags.h:785 [inline]
        RIP: 0010:__page_mapcount+0x2d2/0x350 mm/util.c:744
        Call Trace:
          page_mapcount include/linux/mm.h:837 [inline]
          smaps_account+0x470/0xb10 fs/proc/task_mmu.c:466
          smaps_pte_entry fs/proc/task_mmu.c:538 [inline]
          smaps_pte_range+0x611/0x1250 fs/proc/task_mmu.c:601
          walk_pmd_range mm/pagewalk.c:128 [inline]
          walk_pud_range mm/pagewalk.c:205 [inline]
          walk_p4d_range mm/pagewalk.c:240 [inline]
          walk_pgd_range mm/pagewalk.c:277 [inline]
          __walk_page_range+0xe23/0x1ea0 mm/pagewalk.c:379
          walk_page_vma+0x277/0x350 mm/pagewalk.c:530
          smap_gather_stats.part.0+0x148/0x260 fs/proc/task_mmu.c:768
          smap_gather_stats fs/proc/task_mmu.c:741 [inline]
          show_smap+0xc6/0x440 fs/proc/task_mmu.c:822
          seq_read_iter+0xbb0/0x1240 fs/seq_file.c:272
          seq_read+0x3e0/0x5b0 fs/seq_file.c:162
          vfs_read+0x1b5/0x600 fs/read_write.c:479
          ksys_read+0x12d/0x250 fs/read_write.c:619
          do_syscall_x64 arch/x86/entry/common.c:50 [inline]
          do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
          entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The reproducer was trying to read /proc/$PID/smaps when calling
      MADV_FREE at the mean time.  MADV_FREE may split THPs if it is called
      for partial THP.  It may trigger the below race:
      
                 CPU A                         CPU B
                 -----                         -----
        smaps walk:                      MADV_FREE:
        page_mapcount()
          PageCompound()
                                         split_huge_page()
          page = compound_head(page)
          PageDoubleMap(page)
      
      When calling PageDoubleMap() this page is not a tail page of THP anymore
      so the BUG is triggered.
      
      This could be fixed by elevated refcount of the page before calling
      mapcount, but that would prevent it from counting migration entries, and
      it seems overkilling because the race just could happen when PMD is
      split so all PTE entries of tail pages are actually migration entries,
      and smaps_account() does treat migration entries as mapcount == 1 as
      Kirill pointed out.
      
      Add a new parameter for smaps_account() to tell this entry is migration
      entry then skip calling page_mapcount().  Don't skip getting mapcount
      for device private entries since they do track references with mapcount.
      
      Pagemap also has the similar issue although it was not reported.  Fixed
      it as well.
      
      [shy828301@gmail.com: v4]
        Link: https://lkml.kernel.org/r/20220203182641.824731-1-shy828301@gmail.com
      [nathan@kernel.org: avoid unused variable warning in pagemap_pmd_range()]
        Link: https://lkml.kernel.org/r/20220207171049.1102239-1-nathan@kernel.org
      Link: https://lkml.kernel.org/r/20220120202805.3369-1-shy828301@gmail.com
      Fixes: e9b61f19 ("thp: reintroduce split_huge_page()")
      Signed-off-by: default avatarYang Shi <shy828301@gmail.com>
      Signed-off-by: default avatarNathan Chancellor <nathan@kernel.org>
      Reported-by: syzbot+1f52b3a18d5633fa7f82@syzkaller.appspotmail.com
      Acked-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      24d7275c
    • Mike Rapoport's avatar
      fs/binfmt_elf: fix PT_LOAD p_align values for loaders · 925346c1
      Mike Rapoport authored
      Rui Salvaterra reported that Aisleroit solitaire crashes with "Wrong
      __data_start/_end pair" assertion from libgc after update to v5.17-rc1.
      
      Bisection pointed to commit 9630f0d6 ("fs/binfmt_elf: use PT_LOAD
      p_align values for static PIE") that fixed handling of static PIEs, but
      made the condition that guards load_bias calculation to exclude loader
      binaries.
      
      Restoring the check for presence of interpreter fixes the problem.
      
      Link: https://lkml.kernel.org/r/20220202121433.3697146-1-rppt@kernel.org
      Fixes: 9630f0d6 ("fs/binfmt_elf: use PT_LOAD p_align values for static PIE")
      Signed-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Reported-by: default avatarRui Salvaterra <rsalvaterra@gmail.com>
      Tested-by: default avatarRui Salvaterra <rsalvaterra@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: "H.J. Lu" <hjl.tools@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      925346c1
  2. 11 Feb, 2022 1 commit
    • Linus Torvalds's avatar
      Merge tag 'net-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · f1baf68e
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from netfilter and can.
      
      Current release - new code bugs:
      
         - sparx5: fix get_stat64 out-of-bound access and crash
      
         - smc: fix netdev ref tracker misuse
      
        Previous releases - regressions:
      
         - eth: ixgbevf: require large buffers for build_skb on 82599VF, avoid
           overflows
      
         - eth: ocelot: fix all IP traffic getting trapped to CPU with PTP
           over IP
      
         - bonding: fix rare link activation misses in 802.3ad mode
      
        Previous releases - always broken:
      
         - tcp: fix tcp sock mem accounting in zero-copy corner cases
      
         - remove the cached dst when uncloning an skb dst and its metadata,
           since we only have one ref it'd lead to an UaF
      
         - netfilter:
            - conntrack: don't refresh sctp entries in closed state
            - conntrack: re-init state for retransmitted syn-ack, avoid
              connection establishment getting stuck with strange stacks
            - ctnetlink: disable helper autoassign, avoid it getting lost
            - nft_payload: don't allow transport header access for fragments
      
         - dsa: fix use of devres for mdio throughout drivers
      
         - eth: amd-xgbe: disable interrupts during pci removal
      
         - eth: dpaa2-eth: unregister netdev before disconnecting the PHY
      
         - eth: ice: fix IPIP and SIT TSO offload"
      
      * tag 'net-5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (53 commits)
        net: dsa: mv88e6xxx: fix use-after-free in mv88e6xxx_mdios_unregister
        net: mscc: ocelot: fix mutex lock error during ethtool stats read
        ice: Avoid RTNL lock when re-creating auxiliary device
        ice: Fix KASAN error in LAG NETDEV_UNREGISTER handler
        ice: fix IPIP and SIT TSO offload
        ice: fix an error code in ice_cfg_phy_fec()
        net: mpls: Fix GCC 12 warning
        dpaa2-eth: unregister the netdev before disconnecting from the PHY
        skbuff: cleanup double word in comment
        net: macb: Align the dma and coherent dma masks
        mptcp: netlink: process IPv6 addrs in creating listening sockets
        selftests: mptcp: add missing join check
        net: usb: qmi_wwan: Add support for Dell DW5829e
        vlan: move dev_put into vlan_dev_uninit
        vlan: introduce vlan_dev_free_egress_priority
        ax25: fix UAF bugs of net_device caused by rebinding operation
        net: dsa: fix panic when DSA master device unbinds on shutdown
        net: amd-xgbe: disable interrupts during pci removal
        tipc: rate limit warning for received illegal binding update
        net: mdio: aspeed: Add missing MODULE_DEVICE_TABLE
        ...
      f1baf68e
  3. 10 Feb, 2022 20 commits
  4. 09 Feb, 2022 15 commits
    • Paul Moore's avatar
      audit: don't deref the syscall args when checking the openat2 open_how::flags · 7a82f89d
      Paul Moore authored
      As reported by Jeff, dereferencing the openat2 syscall argument in
      audit_match_perm() to obtain the open_how::flags can result in an
      oops/page-fault.  This patch fixes this by using the open_how struct
      that we store in the audit_context with audit_openat2_how().
      
      Independent of this patch, Richard Guy Briggs posted a similar patch
      to the audit mailing list roughly 40 minutes after this patch was
      posted.
      
      Cc: stable@vger.kernel.org
      Fixes: 1c30e3af ("audit: add support for the openat2 syscall")
      Reported-by: default avatarJeff Mahoney <jeffm@suse.com>
      Signed-off-by: default avatarPaul Moore <paul@paul-moore.com>
      7a82f89d
    • Linus Torvalds's avatar
      Merge tag 'nfsd-5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux · f4bc5bbb
      Linus Torvalds authored
      Pull more nfsd fixes from Chuck Lever:
       "Ensure that NFS clients cannot send file size or offset values that
        can cause the NFS server to crash or to return incorrect or surprising
        results.
      
        In particular, fix how the NFS server handles values larger than
        OFFSET_MAX"
      
      * tag 'nfsd-5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
        NFSD: Deprecate NFS_OFFSET_MAX
        NFSD: Fix offset type in I/O trace points
        NFSD: COMMIT operations must not return NFS?ERR_INVAL
        NFSD: Clamp WRITE offsets
        NFSD: Fix NFSv3 SETATTR/CREATE's handling of large file sizes
        NFSD: Fix ia_size underflow
        NFSD: Fix the behavior of READ near OFFSET_MAX
      f4bc5bbb
    • Linus Torvalds's avatar
      Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · f9f94c9d
      Linus Torvalds authored
      Pull crypto fixes from Herbert Xu:
       "Fix two regressions:
      
         - Potential boot failure due to missing cryptomgr on initramfs
      
         - Stack overflow in octeontx2"
      
      * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
        crypto: api - Move cryptomgr soft dependency into algapi
        crypto: octeontx2 - Avoid stack variable overflow
      f9f94c9d
    • Domenico Andreoli's avatar
      Fix regression due to "fs: move binfmt_misc sysctl to its own file" · b42bc9a3
      Domenico Andreoli authored
      Commit 3ba442d5 ("fs: move binfmt_misc sysctl to its own file") did
      not go unnoticed, binfmt-support stopped to work on my Debian system
      since v5.17-rc2 (did not check with -rc1).
      
      The existance of the /proc/sys/fs/binfmt_misc is a precondition for
      attempting to mount the binfmt_misc fs, which in turn triggers the
      autoload of the binfmt_misc module.  Without it, no module is loaded and
      no binfmt is available at boot.
      
      Building as built-in or manually loading the module and mounting the fs
      works fine, it's therefore only a matter of interaction with user-space.
      I could try to improve the Debian systemd configuration but I can't say
      anything about the other distributions.
      
      This patch restores a working system right after boot.
      
      Fixes: 3ba442d5 ("fs: move binfmt_misc sysctl to its own file")
      Signed-off-by: default avatarDomenico Andreoli <domenico.andreoli@linux.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: default avatarTong Zhang <ztong0001@gmail.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b42bc9a3
    • Linus Torvalds's avatar
      Merge tag 'kvm-s390-kernel-access' from emailed bundle · 09a93c1d
      Linus Torvalds authored
      Pull s390 kvm fix from Christian Borntraeger:
       "Add missing check for the MEMOP ioctl
      
        The SIDA MEMOPs must only be used for secure guests, otherwise
        userspace can do unwanted memory accesses"
      
      * tag 'kvm-s390-kernel-access' from emailed bundle:
        KVM: s390: Return error on SIDA memop on normal guest
      09a93c1d
    • Chuck Lever's avatar
      NFSD: Deprecate NFS_OFFSET_MAX · c306d737
      Chuck Lever authored
      NFS_OFFSET_MAX was introduced way back in Linux v2.3.y before there
      was a kernel-wide OFFSET_MAX value. As a clean up, replace the last
      few uses of it with its generic equivalent, and get rid of it.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      c306d737
    • Chuck Lever's avatar
      NFSD: Fix offset type in I/O trace points · 6a4d333d
      Chuck Lever authored
      NFSv3 and NFSv4 use u64 offset values on the wire. Record these values
      verbatim without the implicit type case to loff_t.
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      6a4d333d
    • Chuck Lever's avatar
      NFSD: COMMIT operations must not return NFS?ERR_INVAL · 3f965021
      Chuck Lever authored
      Since, well, forever, the Linux NFS server's nfsd_commit() function
      has returned nfserr_inval when the passed-in byte range arguments
      were non-sensical.
      
      However, according to RFC 1813 section 3.3.21, NFSv3 COMMIT requests
      are permitted to return only the following non-zero status codes:
      
            NFS3ERR_IO
            NFS3ERR_STALE
            NFS3ERR_BADHANDLE
            NFS3ERR_SERVERFAULT
      
      NFS3ERR_INVAL is not included in that list. Likewise, NFS4ERR_INVAL
      is not listed in the COMMIT row of Table 6 in RFC 8881.
      
      RFC 7530 does permit COMMIT to return NFS4ERR_INVAL, but does not
      specify when it can or should be used.
      
      Instead of dropping or failing a COMMIT request in a byte range that
      is not supported, turn it into a valid request by treating one or
      both arguments as zero. Offset zero means start-of-file, count zero
      means until-end-of-file, so we only ever extend the commit range.
      NFS servers are always allowed to commit more and sooner than
      requested.
      
      The range check is no longer bounded by NFS_OFFSET_MAX, but rather
      by the value that is returned in the maxfilesize field of the NFSv3
      FSINFO procedure or the NFSv4 maxfilesize file attribute.
      
      Note that this change results in a new pynfs failure:
      
      CMT4     st_commit.testCommitOverflow                             : RUNNING
      CMT4     st_commit.testCommitOverflow                             : FAILURE
                 COMMIT with offset + count overflow should return
                 NFS4ERR_INVAL, instead got NFS4_OK
      
      IMO the test is not correct as written: RFC 8881 does not allow the
      COMMIT operation to return NFS4ERR_INVAL.
      Reported-by: default avatarDan Aloni <dan.aloni@vastdata.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      Reviewed-by: default avatarBruce Fields <bfields@fieldses.org>
      3f965021
    • Chuck Lever's avatar
      NFSD: Clamp WRITE offsets · 6260d9a5
      Chuck Lever authored
      Ensure that a client cannot specify a WRITE range that falls in a
      byte range outside what the kernel's internal types (such as loff_t,
      which is signed) can represent. The kiocb iterators, invoked in
      nfsd_vfs_write(), should properly limit write operations to within
      the underlying file system's s_maxbytes.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      6260d9a5
    • Chuck Lever's avatar
      NFSD: Fix NFSv3 SETATTR/CREATE's handling of large file sizes · a648fdeb
      Chuck Lever authored
      iattr::ia_size is a loff_t, so these NFSv3 procedures must be
      careful to deal with incoming client size values that are larger
      than s64_max without corrupting the value.
      
      Silently capping the value results in storing a different value
      than the client passed in which is unexpected behavior, so remove
      the min_t() check in decode_sattr3().
      
      Note that RFC 1813 permits only the WRITE procedure to return
      NFS3ERR_FBIG. We believe that NFSv3 reference implementations
      also return NFS3ERR_FBIG when ia_size is too large.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      a648fdeb
    • Chuck Lever's avatar
      NFSD: Fix ia_size underflow · e6faac3f
      Chuck Lever authored
      iattr::ia_size is a loff_t, which is a signed 64-bit type. NFSv3 and
      NFSv4 both define file size as an unsigned 64-bit type. Thus there
      is a range of valid file size values an NFS client can send that is
      already larger than Linux can handle.
      
      Currently decode_fattr4() dumps a full u64 value into ia_size. If
      that value happens to be larger than S64_MAX, then ia_size
      underflows. I'm about to fix up the NFSv3 behavior as well, so let's
      catch the underflow in the common code path: nfsd_setattr().
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      e6faac3f
    • Chuck Lever's avatar
      NFSD: Fix the behavior of READ near OFFSET_MAX · 0cb4d23a
      Chuck Lever authored
      Dan Aloni reports:
      > Due to commit 8cfb9015 ("NFS: Always provide aligned buffers to
      > the RPC read layers") on the client, a read of 0xfff is aligned up
      > to server rsize of 0x1000.
      >
      > As a result, in a test where the server has a file of size
      > 0x7fffffffffffffff, and the client tries to read from the offset
      > 0x7ffffffffffff000, the read causes loff_t overflow in the server
      > and it returns an NFS code of EINVAL to the client. The client as
      > a result indefinitely retries the request.
      
      The Linux NFS client does not handle NFS?ERR_INVAL, even though all
      NFS specifications permit servers to return that status code for a
      READ.
      
      Instead of NFS?ERR_INVAL, have out-of-range READ requests succeed
      and return a short result. Set the EOF flag in the result to prevent
      the client from retrying the READ request. This behavior appears to
      be consistent with Solaris NFS servers.
      
      Note that NFSv3 and NFSv4 use u64 offset values on the wire. These
      must be converted to loff_t internally before use -- an implicit
      type cast is not adequate for this purpose. Otherwise VFS checks
      against sb->s_maxbytes do not work properly.
      Reported-by: default avatarDan Aloni <dan.aloni@vastdata.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarChuck Lever <chuck.lever@oracle.com>
      0cb4d23a
    • David S. Miller's avatar
      Merge branch 'vlan-QinQ-leak-fix' · 3bed06e3
      David S. Miller authored
      Xin Long says:
      
      ====================
      vlan: fix a netdev refcnt leak for QinQ
      
      This issue can be simply reproduced by:
      
        # ip link add dummy0 type dummy
        # ip link add link dummy0 name dummy0.1 type vlan id 1
        # ip link add link dummy0.1 name dummy0.1.2 type vlan id 2
        # rmmod 8021q
      
       unregister_netdevice: waiting for dummy0.1 to become free. Usage count = 1
      
      So as to fix it, adjust vlan_dev_uninit() in Patch 1/1 so that it won't
      be called twice for the same device, then do the fix in vlan_dev_uninit()
      in Patch 2/2.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3bed06e3
    • Xin Long's avatar
      vlan: move dev_put into vlan_dev_uninit · d6ff94af
      Xin Long authored
      Shuang Li reported an QinQ issue by simply doing:
      
        # ip link add dummy0 type dummy
        # ip link add link dummy0 name dummy0.1 type vlan id 1
        # ip link add link dummy0.1 name dummy0.1.2 type vlan id 2
        # rmmod 8021q
      
       unregister_netdevice: waiting for dummy0.1 to become free. Usage count = 1
      
      When rmmods 8021q, all vlan devs are deleted from their real_dev's vlan grp
      and added into list_kill by unregister_vlan_dev(). dummy0.1 is unregistered
      before dummy0.1.2, as it's using for_each_netdev() in __rtnl_kill_links().
      
      When unregisters dummy0.1, dummy0.1.2 is not unregistered in the event of
      NETDEV_UNREGISTER, as it's been deleted from dummy0.1's vlan grp. However,
      due to dummy0.1.2 still holding dummy0.1, dummy0.1 will keep waiting in
      netdev_wait_allrefs(), while dummy0.1.2 will never get unregistered and
      release dummy0.1, as it delays dev_put until calling dev->priv_destructor,
      vlan_dev_free().
      
      This issue was introduced by Commit 563bcbae ("net: vlan: fix a UAF in
      vlan_dev_real_dev()"), and this patch is to fix it by moving dev_put() into
      vlan_dev_uninit(), which is called after NETDEV_UNREGISTER event but before
      netdev_wait_allrefs().
      
      Fixes: 563bcbae ("net: vlan: fix a UAF in vlan_dev_real_dev()")
      Reported-by: default avatarShuang Li <shuali@redhat.com>
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d6ff94af
    • Xin Long's avatar
      vlan: introduce vlan_dev_free_egress_priority · 37aa50c5
      Xin Long authored
      This patch is to introduce vlan_dev_free_egress_priority() to
      free egress priority for vlan dev, and keep vlan_dev_uninit()
      static as .ndo_uninit. It makes the code more clear and safer
      when adding new code in vlan_dev_uninit() in the future.
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      37aa50c5