1. 04 Feb, 2022 8 commits
    • selftests: rtnetlink: Use more sensible tos values · 95eb6ef8
      Guillaume Nault authored
      Using tos 0x1 with 'ip route get <IPv4 address> ...' doesn't test much
      of the tos option handling: 0x1 just sets an ECN bit, which is cleared
      by inet_rtm_getroute() before doing the fib lookup. Let's use 0x10
      instead, which is actually taken into account in the route lookup (and
      is less surprising for the reader).
      
      For consistency, use 0x10 for the IPv6 route lookup too (IPv6 currently
      doesn't clear ECN bits, but might do so in the future).
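
The effect described in this message can be sketched in a few lines of C. This is an illustrative userspace sketch, not kernel code: the mask values are copied from the kernel's IPv4 definitions (IPTOS_TOS_MASK and IPTOS_RT_MASK) as an assumption of the example, and tos_lookup_key() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: these mirror the kernel's IPTOS_TOS_MASK and IPTOS_RT_MASK,
 * redefined locally so the sketch builds in userspace. */
#define TOS_MASK    0x1E
#define TOS_RT_MASK (TOS_MASK & ~3)   /* drops the two low-order ECN bits */

/* Hypothetical helper: the part of a tos value that an IPv4 route
 * lookup can still match on once the ECN bits are masked out. */
static uint8_t tos_lookup_key(uint8_t tos)
{
    return (uint8_t)(tos & TOS_RT_MASK);
}
```

With these definitions, tos_lookup_key(0x01) and tos_lookup_key(0x02) both yield 0, while tos_lookup_key(0x10) yields 0x10, so only the latter value exercises the tos handling in the lookup.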
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Link: https://lore.kernel.org/r/d61119e68d01ba7ef3ba50c1345a5123a11de123.1643815297.git.gnault@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: fib offload: use sensible tos values · bafe517a
      Guillaume Nault authored
      Although both iproute2 and the kernel accept 1 and 2 as tos values for
      new routes, those are invalid. These values only set ECN bits, which
      are ignored during IPv4 fib lookups. Therefore, no packet can actually
      match such routes. This selftest therefore only succeeds because it
      doesn't verify that the new routes do actually work in practice (it
      just checks if the routes are offloaded or not).
      
      It makes more sense to use tos values that don't conflict with ECN.
      This way, the selftest won't be affected if we later decide to warn or
      even reject invalid tos configurations for new routes.
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Link: https://lore.kernel.org/r/5e43b343720360a1c0e4f5947d9e917b26f30fbf.1643826556.git.gnault@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: minor __dev_alloc_name() optimization · 25ee1660
      Eric Dumazet authored
      __dev_alloc_name() allocates a private zeroed page,
      then sets bits in it while iterating through net devices.
      
      It can use __set_bit() to avoid unnecessary locked operations.
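
The difference between the two flavours of bit-setting can be illustrated in userspace C. This is a sketch under stated assumptions: set_bit_nonatomic() and set_bit_atomic() are hypothetical stand-ins for the kernel's __set_bit() and set_bit(), built on C11 atomics rather than the kernel's own primitives.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Plain read-modify-write: sufficient when a single thread owns the
 * bitmap, as with a freshly allocated private page. */
static void set_bit_nonatomic(size_t nr, unsigned long *map)
{
    map[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
}

/* Atomic read-modify-write: needed only when other threads may update
 * the same word concurrently; typically costs a locked instruction. */
static void set_bit_atomic(size_t nr, _Atomic unsigned long *map)
{
    atomic_fetch_or(&map[nr / BITS_PER_WORD], 1UL << (nr % BITS_PER_WORD));
}

/* Returns 1 if bit nr is set in a plain (non-atomic) bitmap, else 0. */
static int test_bit_plain(size_t nr, const unsigned long *map)
{
    return (int)((map[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL);
}
```

Both setters leave the same bit pattern behind; the non-atomic one simply skips synchronization that a private bitmap does not need.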
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20220203064609.3242863-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · c59400a6
      Jakub Kicinski authored
      No conflicts.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • gcc-plugins/stackleak: Use noinstr in favor of notrace · dcb85f85
      Kees Cook authored
      While the stackleak plugin was already using notrace, objtool is now a
      bit more picky.  Update the notrace uses to noinstr.  Silences the
      following objtool warnings when building with:
      
      CONFIG_DEBUG_ENTRY=y
      CONFIG_STACK_VALIDATION=y
      CONFIG_VMLINUX_VALIDATION=y
      CONFIG_GCC_PLUGIN_STACKLEAK=y
      
        vmlinux.o: warning: objtool: do_syscall_64()+0x9: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: do_int80_syscall_32()+0x9: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: exc_general_protection()+0x22: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: fixup_bad_iret()+0x20: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: do_machine_check()+0x27: call to stackleak_track_stack() leaves .noinstr.text section
        vmlinux.o: warning: objtool: .text+0x5346e: call to stackleak_erase() leaves .noinstr.text section
        vmlinux.o: warning: objtool: .entry.text+0x143: call to stackleak_erase() leaves .noinstr.text section
        vmlinux.o: warning: objtool: .entry.text+0x10eb: call to stackleak_erase() leaves .noinstr.text section
        vmlinux.o: warning: objtool: .entry.text+0x17f9: call to stackleak_erase() leaves .noinstr.text section
      
      Note that the plugin's addition of calls to stackleak_track_stack() from
      noinstr functions is expected to be safe, as it isn't runtime
      instrumentation and is self-contained.
      
      Cc: Alexander Popov <alex.popov@linux.com>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'net-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · eb2eb516
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from bpf, netfilter, and ieee802154.
      
        Current release - regressions:
      
         - Partially revert "net/smc: Add netlink net namespace support", fix
           uABI breakage
      
         - netfilter:
            - nft_ct: fix use after free when attaching zone template
            - nft_byteorder: track register operations
      
        Previous releases - regressions:
      
         - ipheth: fix EOVERFLOW in ipheth_rcvbulk_callback
      
         - phy: qca8081: fix speeds lower than 2.5Gb/s
      
         - sched: fix use-after-free in tc_new_tfilter()
      
        Previous releases - always broken:
      
         - tcp: fix mem under-charging with zerocopy sendmsg()
      
         - tcp: add missing tcp_skb_can_collapse() test in
           tcp_shift_skb_data()
      
         - neigh: do not trigger immediate probes on NUD_FAILED from
           neigh_managed_work, avoid a deadlock
      
         - bpf: use VM_MAP instead of VM_ALLOC for ringbuf, avoid KASAN
           false-positives
      
         - netfilter: nft_reject_bridge: fix for missing reply from prerouting
      
         - smc: forward wakeup to smc socket waitqueue after fallback
      
         - ieee802154:
            - return meaningful error codes from the netlink helpers
            - mcr20a: fix lifs/sifs periods
            - at86rf230, ca8210: stop leaking skbs on error paths
      
         - macsec: add missing un-offload call for NETDEV_UNREGISTER of parent
      
         - ax25: add refcount in ax25_dev to avoid UAF bugs
      
         - eth: mlx5e:
            - fix SFP module EEPROM query
            - fix broken SKB allocation in HW-GRO
            - IPsec offload: fix tunnel mode crypto for non-TCP/UDP flows
      
         - eth: amd-xgbe:
            - fix skb data length underflow
            - ensure reset of the tx_timer_active flag, avoid Tx timeouts
      
         - eth: stmmac: fix runtime pm use in stmmac_dvr_remove()
      
         - eth: e1000e: handshake with CSME starts from Alder Lake platforms"
      
      * tag 'net-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (69 commits)
        ax25: fix reference count leaks of ax25_dev
        net: stmmac: ensure PTP time register reads are consistent
        net: ipa: request IPA register values be retained
        dt-bindings: net: qcom,ipa: add optional qcom,qmp property
        tools/resolve_btfids: Do not print any commands when building silently
        bpf: Use VM_MAP instead of VM_ALLOC for ringbuf
        net, neigh: Do not trigger immediate probes on NUD_FAILED from neigh_managed_work
        tcp: add missing tcp_skb_can_collapse() test in tcp_shift_skb_data()
        net: sparx5: do not refer to skb after passing it on
        Partially revert "net/smc: Add netlink net namespace support"
        net/mlx5e: Avoid field-overflowing memcpy()
        net/mlx5e: Use struct_group() for memcpy() region
        net/mlx5e: Avoid implicit modify hdr for decap drop rule
        net/mlx5e: IPsec: Fix tunnel mode crypto offload for non TCP/UDP traffic
        net/mlx5e: IPsec: Fix crypto offload for non TCP/UDP encapsulated traffic
        net/mlx5e: Don't treat small ceil values as unlimited in HTB offload
        net/mlx5: E-Switch, Fix uninitialized variable modact
        net/mlx5e: Fix handling of wrong devices during bond netevent
        net/mlx5e: Fix broken SKB allocation in HW-GRO
        net/mlx5e: Fix wrong calculation of header index in HW_GRO
        ...
    • Merge tag 'selinux-pr-20220203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux · 551007a8
      Linus Torvalds authored
      Pull selinux fix from Paul Moore:
       "One small SELinux patch to ensure that a policy structure field is
        properly reset after freeing so that we don't inadvertently do a
        double-free on certain error conditions"
      
      * tag 'selinux-pr-20220203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
        selinux: fix double free of cond_list on error paths
    • Merge tag 'linux-kselftest-fixes-5.17-rc3' of... · 25b20ae8
      Linus Torvalds authored
      Merge tag 'linux-kselftest-fixes-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull Kselftest fixes from Shuah Khan:
       "Important fixes to several tests and documentation clarification on
        running mainline kselftest on stable releases. A few notable fixes:
      
         - fix kselftest run hang due to child processes that haven't been
           terminated. The fix signals all child processes
      
         - fix false pass/fail results from vdso_test_abi, openat2, mincore
      
         - build failures when using -j (multiple jobs) option
      
         - exec test build failure due to incorrect build rule for a run-time
           created "pipe"
      
         - zram test fixes related to interaction with zram-generator to make
           sure the zram test coordinates with zram-generator
      
         - zram test compression ratio calculation fix and skipping
           max_comp_streams.
      
         - increasing rtc test timeout
      
         - cpufreq test to write test results to stdout, which will be necessary
           on automated test systems"
      
      * tag 'linux-kselftest-fixes-5.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        kselftest: Fix vdso_test_abi return status
        selftests: skip mincore.check_file_mmap when fs lacks needed support
        selftests: openat2: Skip testcases that fail with EOPNOTSUPP
        selftests: openat2: Add missing dependency in Makefile
        selftests: openat2: Print also errno in failure messages
        selftests: futex: Use variable MAKE instead of make
        selftests/exec: Remove pipe from TEST_GEN_FILES
        selftests/zram: Adapt the situation that /dev/zram0 is being used
        selftests/zram01.sh: Fix compression ratio calculation
        selftests/zram: Skip max_comp_streams interface on newer kernel
        docs/kselftest: clarify running mainline tests on stables
        kselftest: signal all child processes
        selftests: cpufreq: Write test output to stdout as well
        selftests: rtc: Increase test timeout so that all tests run
  2. 03 Feb, 2022 32 commits
    • ax25: fix reference count leaks of ax25_dev · 87563a04
      Duoming Zhou authored
      The previous commit d01ffb9e ("ax25: add refcount in ax25_dev
      to avoid UAF bugs") introduces refcount into ax25_dev, but there
      are reference leak paths in ax25_ctl_ioctl(), ax25_fwd_ioctl(),
      ax25_rt_add(), ax25_rt_del() and ax25_rt_opt().
      
      This patch uses ax25_dev_put() and adjusts the position of
      ax25_addr_ax25dev() to fix reference count leaks of ax25_dev.
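
A common way to avoid this kind of leak is to funnel every exit path through a single label that drops the reference. The sketch below is a hypothetical userspace illustration of that pattern; struct dev_ref, dev_hold(), dev_put() and handle_request() are invented names and do not reproduce the ax25 code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical reference-counted object standing in for a device. */
struct dev_ref {
    int refs;
};

static void dev_hold(struct dev_ref *d) { d->refs++; }
static void dev_put(struct dev_ref *d)  { d->refs--; }

/* Hypothetical request handler: takes a reference for the duration of
 * the call and releases it on every path, including the error path. */
static int handle_request(struct dev_ref *d, int arg)
{
    int ret = 0;

    dev_hold(d);

    if (arg < 0) {
        ret = -EINVAL;
        goto out_put;   /* an early "return ret;" here would leak the reference */
    }

    /* ... normal processing would go here ... */

out_put:
    dev_put(d);
    return ret;
}
```

After either a successful or a failing call, the reference count is back at its starting value, which is the property the leak fix restores.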
      
      Fixes: d01ffb9e ("ax25: add refcount in ax25_dev to avoid UAF bugs")
      Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
      Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
      Link: https://lore.kernel.org/r/20220203150811.42256-1-duoming@zju.edu.cn
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: stmmac: ensure PTP time register reads are consistent · 80d46090
      Yannick Vignon authored
      Even if protected from preemption and interrupts, a small time window
      remains when the 2 register reads could return inconsistent values,
      each time the "seconds" register changes. This could lead to an error
      of about 1 second in the reported time.
      
      Add logic to ensure the "seconds" and "nanoseconds" values are consistent.
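
One common pattern for this kind of consistency check is to re-read the "seconds" register after reading "nanoseconds" and retry if it changed. The sketch below illustrates that pattern in userspace C under stated assumptions: struct ptp_regs, read_time_ns() and the fake_clock simulation are hypothetical names for this example, and a real driver would read memory-mapped registers instead of calling function pointers.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical register accessors supplied by the caller. */
struct ptp_regs {
    uint32_t (*read_sec)(void *ctx);
    uint32_t (*read_nsec)(void *ctx);
    void *ctx;
};

/* Read seconds, then nanoseconds, then seconds again; retry until the
 * two "seconds" reads agree, so both halves belong to the same second. */
static uint64_t read_time_ns(const struct ptp_regs *r)
{
    uint32_t sec0, sec1, nsec;

    sec1 = r->read_sec(r->ctx);
    do {
        sec0 = sec1;
        nsec = r->read_nsec(r->ctx);
        sec1 = r->read_sec(r->ctx);
    } while (sec0 != sec1);

    return (uint64_t)sec1 * NSEC_PER_SEC + nsec;
}

/* Simulated hardware clock: time advances by step_ns on every read. */
struct fake_clock {
    uint64_t now_ns;
    uint64_t step_ns;
};

static uint32_t fake_read_sec(void *ctx)
{
    struct fake_clock *c = ctx;
    uint32_t v = (uint32_t)(c->now_ns / NSEC_PER_SEC);

    c->now_ns += c->step_ns;
    return v;
}

static uint32_t fake_read_nsec(void *ctx)
{
    struct fake_clock *c = ctx;
    uint32_t v = (uint32_t)(c->now_ns % NSEC_PER_SEC);

    c->now_ns += c->step_ns;
    return v;
}
```

Starting the simulated clock just before a second boundary shows the retry at work: a single unguarded pair of reads would combine the old "seconds" value with the new "nanoseconds" value, whereas the loop returns a pair taken within the same second.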
      
      Fixes: 92ba6888 ("stmmac: add the support for PTP hw clock driver")
      Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
      Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Link: https://lore.kernel.org/r/20220203160025.750632-1-yannick.vignon@oss.nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 77b1b8b4
      Jakub Kicinski authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2022-02-03
      
      We've added 6 non-merge commits during the last 10 day(s) which contain
      a total of 7 files changed, 11 insertions(+), 236 deletions(-).
      
      The main changes are:
      
      1) Fix BPF ringbuf to allocate its area with VM_MAP instead of VM_ALLOC
         flag which otherwise trips over KASAN, from Hou Tao.
      
      2) Fix unresolved symbol warning in resolve_btfids due to LSM callback
         rename, from Alexei Starovoitov.
      
      3) Fix a possible race in inc_misses_counter() when IRQ would trigger
         during counter update, from He Fengqing.
      
      4) Fix tooling infra for cross-building with clang upon probing whether
         gcc provides the standard libraries, from Jean-Philippe Brucker.
      
      5) Fix silent mode build for resolve_btfids, from Nathan Chancellor.
      
      6) Drop unneeded and outdated lirc.h header copy from tooling infra as
         BPF does not require it anymore, from Sean Young.
      
      * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        tools/resolve_btfids: Do not print any commands when building silently
        bpf: Use VM_MAP instead of VM_ALLOC for ringbuf
        tools: Ignore errors from `which' when searching a GCC toolchain
        tools headers UAPI: remove stale lirc.h
        bpf: Fix possible race in inc_misses_counter
        bpf: Fix renaming task_getsecid_subj->current_getsecid_subj.
      ====================
      
      Link: https://lore.kernel.org/r/20220203155815.25689-1-daniel@iogearbox.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • printk: Fix incorrect __user type in proc_dointvec_minmax_sysadmin() · 1f2cfdd3
      Mickaël Salaün authored
      The move of proc_dointvec_minmax_sysadmin() from kernel/sysctl.c to
      kernel/printk/sysctl.c introduced an incorrect __user attribute to the
      buffer argument.  This change was spotted both by me in [1] and by the
      kernel test robot.  Revert it to please sparse:
      
        kernel/printk/sysctl.c:20:51: warning: incorrect type in argument 3 (different address spaces)
        kernel/printk/sysctl.c:20:51:    expected void *
        kernel/printk/sysctl.c:20:51:    got void [noderef] __user *buffer
      
      Fixes: faaa357a ("printk: move printk sysctl to printk/sysctl.c")
      Link: https://lore.kernel.org/r/20220104155024.48023-2-mic@digikod.net [1]
      Reported-by: kernel test robot <lkp@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: John Ogness <john.ogness@linutronix.de>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Xiaoming Ni <nixiaoming@huawei.com>
      Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com>
      Link: https://lore.kernel.org/r/20220203145029.272640-1-mic@digikod.net
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "module, async: async_synchronize_full() on module init iff async is used" · 67d6212a
      Igor Pylypiv authored
      This reverts commit 774a1221.
      
      We need to finish all async code before the module init sequence is
      done.  In the reverted commit the PF_USED_ASYNC flag was added to mark a
      thread that called async_schedule().  Then the PF_USED_ASYNC flag was
      used to determine whether or not async_synchronize_full() needs to be
      invoked.  This works when modprobe thread is calling async_schedule(),
      but it does not work if module dispatches init code to a worker thread
      which then calls async_schedule().
      
      For example, PCI driver probing is invoked from a worker thread based on
      a node where device is attached:
      
      	if (cpu < nr_cpu_ids)
      		error = work_on_cpu(cpu, local_pci_probe, &ddi);
      	else
      		error = local_pci_probe(&ddi);
      
      We end up in a situation where a worker thread gets the PF_USED_ASYNC
      flag set instead of the modprobe thread.  As a result,
      async_synchronize_full() is not invoked and modprobe completes without
      waiting for the async code to finish.
      
      The issue was discovered while loading the pm80xx driver:
      (scsi_mod.scan=async)
      
      modprobe pm80xx                      worker
      ...
        do_init_module()
        ...
          pci_call_probe()
            work_on_cpu(local_pci_probe)
                                           local_pci_probe()
                                             pm8001_pci_probe()
                                               scsi_scan_host()
                                                 async_schedule()
                                                 worker->flags |= PF_USED_ASYNC;
                                           ...
            < return from worker >
        ...
        if (current->flags & PF_USED_ASYNC) <--- false
        	async_synchronize_full();
      
      Commit 21c3c5d2 ("block: don't request module during elevator init")
      fixed the deadlock issue which the reverted commit 774a1221
      ("module, async: async_synchronize_full() on module init iff async is
      used") tried to fix.
      
      Since commit 0fdff3ec ("async, kmod: warn on synchronous
      request_module() from async workers") synchronous module loading from
      async is not allowed.
      
      Given that the original deadlock issue is fixed and it is no longer
      allowed to call synchronous request_module() from async we can remove
      PF_USED_ASYNC flag to make module init consistently invoke
      async_synchronize_full() unless async module probe is requested.
      Signed-off-by: Igor Pylypiv <ipylypiv@google.com>
      Reviewed-by: Changyuan Lyu <changyuanl@google.com>
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge branch 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · 305e6c42
      Linus Torvalds authored
      Pull cgroup fixes from Tejun Heo:
      
       - Eric's fix for a long standing cgroup1 permission issue where it only
         checks for uid 0 instead of CAP which inadvertently allows
         unprivileged userns roots to modify release_agent userhelper
      
       - Fixes for the fallout from Waiman's recent cpuset work
      
      * 'for-5.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning
        cgroup-v1: Require capabilities to set release_agent
        cpuset: Fix the bug that subpart_cpus updated wrongly in update_cpumask()
        cgroup/cpuset: Make child cpusets restrict parents on v1 hierarchy
    • Merge branch 'net-ipa-enable-register-retention' · 0166556a
      Jakub Kicinski authored
      Alex Elder says:
      
      ====================
      net: ipa: enable register retention
      
      With runtime power management in place, we sometimes need to issue
      a command to enable retention of IPA register values before power
      collapse.  This requires a new Device Tree property, whose presence
      will also be used to signal that the command is required.
      ====================
      
      Link: https://lore.kernel.org/r/20220201150205.468403-1-elder@linaro.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ipa: request IPA register values be retained · 34a08176
      Alex Elder authored
      In some cases, the IPA hardware needs to request the always-on
      subsystem (AOSS) to coordinate with the IPA microcontroller to
      retain IPA register values at power collapse.  This is done by
      issuing a QMP request to the AOSS microcontroller.  A similar
      request undoes that request.
      
      We must get and hold the "QMP" handle early, because we might get
      back EPROBE_DEFER for that.  But the actual request should be sent
      while we know the IPA clock is active, and when we know the
      microcontroller is operational.
      
      Fixes: 1aac309d ("net: ipa: use autosuspend")
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • dt-bindings: net: qcom,ipa: add optional qcom,qmp property · ac62a017
      Alex Elder authored
      For some systems, the IPA driver must make a request to ensure that
      its registers are retained across power collapse of the IPA hardware.
      On such systems, we'll use the existence of the "qcom,qmp" property
      as a signal that this request is required.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • cgroup/cpuset: Fix "suspicious RCU usage" lockdep warning · 2bdfd282
      Waiman Long authored
      It was found that a "suspicious RCU usage" lockdep warning was issued
      with the rcu_read_lock() call in update_sibling_cpumasks().  It is
      because the update_cpumasks_hier() function may sleep. So we have
      to release the RCU lock, call update_cpumasks_hier() and reacquire
      it afterward.
      
      Also add a percpu_rwsem_assert_held() in update_sibling_cpumasks()
      instead of stating that in the comment.
      
      Fixes: 4716909c ("cpuset: Track cpusets that use parent's effective_cpus")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Tested-by: Phil Auld <pauld@redhat.com>
      Reviewed-by: Phil Auld <pauld@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • tools/resolve_btfids: Do not print any commands when building silently · 7f3bdbc3
      Nathan Chancellor authored
      When building with 'make -s', there is some output from resolve_btfids:
      
      $ make -sj"$(nproc)" oldconfig prepare
        MKDIR     .../tools/bpf/resolve_btfids/libbpf/
        MKDIR     .../tools/bpf/resolve_btfids//libsubcmd
        LINK     resolve_btfids
      
      Silent mode means that no information should be emitted about what is
      currently being done. Use the $(silent) variable from Makefile.include
      to avoid defining the msg macro so that there is no information printed.
      
      Fixes: fbbb68de ("bpf: Add resolve_btfids tool to resolve BTF IDs in ELF object")
      Signed-off-by: Nathan Chancellor <nathan@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20220201212503.731732-1-nathan@kernel.org
    • Revert "mm/gup: small refactoring: simplify try_grab_page()" · c36c04c2
      John Hubbard authored
      This reverts commit 54d516b1
      
      That commit did a refactoring that effectively combined fast and slow
      gup paths (again).  And that was again incorrect, for two reasons:
      
       a) Fast gup and slow gup get reference counts on pages in different
          ways and with different goals: see Linus' writeup in commit
          cd1adf1b ("Revert "mm/gup: remove try_get_page(), call
          try_get_compound_head() directly""), and
      
       b) try_grab_compound_head() also has a specific check for
          "FOLL_LONGTERM && !is_pinned(page)", that assumes that the caller
          can fall back to slow gup. This resulted in new failures, as
          recently reported by Will McVicker [1].
      
      But (a) has problems too, even though they may not have been reported
      yet.  So just revert this.
      
      Link: https://lore.kernel.org/r/20220131203504.3458775-1-willmcvicker@google.com [1]
      Fixes: 54d516b1 ("mm/gup: small refactoring: simplify try_grab_page()")
      Reported-and-tested-by: Will McVicker <willmcvicker@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: stable@vger.kernel.org # 5.15
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'mips-fixes-5.17_2' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux · d394bb77
      Linus Torvalds authored
      Pull MIPS fixes from Thomas Bogendoerfer:
      
       - fix missed change for PTR->PTR_WD conversion
      
       - kernel-doc fixes
      
      * tag 'mips-fixes-5.17_2' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
        MIPS: KVM: fix vz.c kernel-doc notation
        MIPS: octeon: Fix missed PTR->PTR_WD conversion
    • Merge branch 'dsa-mv88e6xxx-phylink_generic_validate' · 9c309189
      David S. Miller authored
      Russell King says:
      
      ====================
      net: dsa: mv88e6xxx: convert to phylink_generic_validate()
      
      The overall objective of this series is to convert the mv88e6xxx DSA
      driver to use phylink_generic_validate().
      
      Patch 1 adds a new helper mv88e6352_g2_scratch_port_has_serdes() which
      indicates whether an 88e6352 port has a serdes associated with it. This
      is necessary as ports 4 and 5 will normally be in automedia mode, where
      the CMODE field in the port status register will change e.g. between 15
      (internal PHY) and 9 (1000base-X) depending on whether the serdes has
      link.
      
      The existing code caches the cmode field, and depending whether the
      serdes has link at probe time, determines whether we allow things such
      as the serdes statistics to be accessed. This means if the link isn't
      up at probe time, the serdes is essentially unavailable.
      
      Patch 1 addresses this by reading the pin configuration to find out
      whether the serdes is attached to port 4 or port 5.
      
      Patch 2 is a joint effort between myself and Marek Behún, adding the
      supported interfaces and MAC capabilities to all mv88e6xxx supported
      switch devices. This is slightly more restrictive than the original
      code as we didn't used to care too much about the interface mode, but
      with this we do - which is why we must know if there's a serdes
      associated now.
      
      Patch 3 switches mv88e6xxx to use the generic validation by removing
      the initialisation of the phylink_validate pointer in the dsa_ops
      struct.
      
      Patch 4 updates the statistics code to use the new helper in patch 1,
      so the serdes statistics are available even if the link was down at
      driver probe time.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: improve 88e6352 serdes statistics detection · 7f7d32bc
      Russell King (Oracle) authored
      The decision whether to report serdes statistics currently depends on
      the cached C_Mode value for the port, read at probe time or updated by
      configuration. However, port 4 can be in "automedia" mode when it is
      used as a serdes port, meaning it switches between the internal PHY and
      the serdes, changing the read-only C_Mode value depending on which
      first gains link. Consequently, the C_Mode value read at probe does not
      accurately reflect whether the port has the serdes associated with it.
      
      In "net: dsa: mv88e6xxx: add mv88e6352_g2_scratch_port_has_serdes()",
      we added a way to read the hardware configuration to determine which
      port has the serdes associated with it. Use this to determine which
      port reports the serdes statistics.
      Reviewed-by: Marek Behún <kabel@kernel.org>
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: convert to phylink_generic_validate() · 2ee84cfe
      Russell King (Oracle) authored
      Now that the mv88e6xxx chip drivers are supplying the supported
      interfaces and MAC capabilities, switch the driver to use the generic
      phylink validation implementation by removing our own validation
      implementations. This causes DSA to call phylink_generic_validate()
      on our behalf.
      Reviewed-by: Marek Behún <kabel@kernel.org>
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: populate supported_interfaces and mac_capabilities · d4ebf12b
      Russell King (Oracle) authored
      Populate the supported interfaces and MAC capabilities for the
      Marvell MV88E6xxx DSA switches in preparation for using these for the
      validation functionality.
      
      Patch co-authored by Marek.
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Marek Behún <kabel@kernel.org> [ fixed 6341 and 6393x ]
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: add mv88e6352_g2_scratch_port_has_serdes() · 62001548
      Russell King (Oracle) authored
      Read the hardware configuration to determine which port is attached
      to the serdes.
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'dsa-mv88e6xxx-port-isolation' · 09476443
      David S. Miller authored
      Tobias Waldekranz says:
      
      ====================
      net: dsa: mv88e6xxx: Improve standalone port isolation
      
      The ideal isolation between standalone ports satisfies two properties:
      1. Packets from one standalone port must not be forwarded to any other
         port.
      2. Packets from a standalone port must be sent to the CPU port.
      
      mv88e6xxx solves (1) by isolating standalone ports using the PVT. Up
      to this point though, (2) has not been guaranteed; as the ATU is still
      consulted, there is a chance that an incoming packet never reaches the
      CPU if its DA has previously been used as the SA of an earlier packet (see
      1/5 for more details). This is typically not a problem, except for one
      very useful setup in which switch ports are looped in order to run the
      bridge kselftests in tools/testing/selftests/net/forwarding. This
      series attempts to solve (2).
      
      Ideally, we could simply use the "ForceMap" bit of more modern chips
      (Agate and newer) to classify all incoming packets as MGMT. This is
      not available on older silicon that is still widely used (Opal Plus
      chips like the 6097 for example).
      
      Instead, this series takes a two-pronged approach:
      
      1/5: Always clear MapDA on standalone ports to make sure that no ATU
           entry can lead packets astray. This solves (2) for single-chip
           systems.
      
      2/5: Trivial prep work for 4/5.
      3/5: Trivial prep work for 4/5.
      
      4/5: On multi-chip systems though, this is not enough. On the incoming
           chip, the packet will be forced out towards the CPU thanks to
           1/5, but on any intermediate chips the ATU is still consulted. We
           override this behavior by marking the reserved standalone VID (0)
           as a policy VID and setting the DSA ports' VTU policy to TRAP. This
           will cause the packet to be reclassified as MGMT on the first
           intermediate chip, after which it's a straight shot towards the
           CPU.
      
      Finally, we allow more tests to be run on mv88e6xxx:
      
      5/5: The bridge_vlan{,un}aware suites set an ageing_time of 10s on
           the bridge it creates, but mv88e6xxx has a minimum supported time
           of 15s. Allow this time to be overridden in forwarding.config.
      
      With this series in place, mv88e6xxx passes the following kselftest
      suites:
      
      - bridge_port_isolation.sh
      - bridge_sticky_fdb.sh
      - bridge_vlan_aware.sh
      - bridge_vlan_unaware.sh
      
      v1 -> v2:
        - Wording/spelling (Vladimir)
        - Use standard iterator in dsa_switch_upstream_port (Vladimir)
        - Limit enabling of VTU port policy to downstream DSA ports (Vladimir)
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      09476443
    • Tobias Waldekranz's avatar
      selftests: net: bridge: Parameterize ageing timeout · 08119759
      Tobias Waldekranz authored
      Allow the ageing timeout that is set on bridges to be customized from
      forwarding.config. This allows the tests to be run on hardware which
      does not support a 10s timeout (e.g. mv88e6xxx).
      Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      08119759
    • Tobias Waldekranz's avatar
      net: dsa: mv88e6xxx: Improve multichip isolation of standalone ports · d352b20f
      Tobias Waldekranz authored
      Given that standalone ports are now configured to bypass the ATU and
      forward all frames towards the upstream port, extend the ATU bypass to
      multichip systems.
      
      Load VID 0 (standalone) into the VTU with the policy bit set. Since
      VID 4095 (bridged) is already loaded, we now know that all VIDs in use
      are always available in all VTUs. Therefore, we can safely enable
      802.1Q on DSA ports.
      
      Setting the DSA ports' VTU policy to TRAP means that all incoming
      frames on VID 0 will be classified as MGMT - as a result, the ATU is
      bypassed on all subsequent switches.
      
      With this isolation in place, we are able to support configurations
      that are simultaneously very quirky and very useful. Quirky because it
      involves looping cables between local switchports like in this
      example:
      
         CPU
          |     .------.
      .---0---. | .----0----.
      |  sw0  | | |   sw1   |
      '-1-2-3-' | '-1-2-3-4-'
        $ @ '---'   $ @ % %
      
      We have three physically looped pairs ($, @, and %).
      
      This is very useful because it allows us to run the kernel's
      kselftests for the bridge on mv88e6xxx hardware.
      Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
      Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d352b20f
    • Tobias Waldekranz's avatar
      net: dsa: mv88e6xxx: Enable port policy support on 6097 · 585d42bb
      Tobias Waldekranz authored
      This chip has support for the same per-port policy actions found in
      later versions of LinkStreet devices.
      
      Fixes: f3a2cd32 ("net: dsa: mv88e6xxx: introduce .port_set_policy")
      Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      585d42bb
    • Tobias Waldekranz's avatar
      net: dsa: mv88e6xxx: Support policy entries in the VTU · bb03b280
      Tobias Waldekranz authored
      A VTU entry with policy enabled is used in combination with a port's
      VTU policy setting to override normal switching behavior for frames
      assigned to the entry's VID.
      
      A typical example is to treat all frames in a particular VLAN as
      control traffic and trap them to the CPU, in which case the relevant
      user port's VTU policy would be set to TRAP.
      Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bb03b280
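The interaction described in bb03b280 between a VTU entry's policy bit and a port's VTU policy can be sketched as a small user-space model. This is only an illustration: the type and function names below (`vtu_entry`, `port_policy`, `classify`) are hypothetical and are not the mv88e6xxx driver's structures or register layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model types; not the mv88e6xxx driver's own structures. */
enum port_policy { POLICY_NORMAL, POLICY_TRAP };
enum frame_class { FRAME_NORMAL, FRAME_MGMT };

struct vtu_entry {
    uint16_t vid;
    bool policy; /* the entry's policy bit */
};

/*
 * A frame is reclassified only when both sides opt in: the VTU entry
 * for its VID has the policy bit set AND the ingress port's VTU policy
 * is TRAP. Otherwise normal switching applies.
 */
static enum frame_class classify(const struct vtu_entry *e,
                                 enum port_policy port)
{
    if (e && e->policy && port == POLICY_TRAP)
        return FRAME_MGMT;
    return FRAME_NORMAL;
}
```

In this model a policy-enabled VID has no effect on ports left at the normal policy, which is why the port setting and the VTU entry have to be configured together.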
    • Tobias Waldekranz's avatar
      net: dsa: mv88e6xxx: Improve isolation of standalone ports · 7af4a361
      Tobias Waldekranz authored
      Clear MapDA on standalone ports to bypass any ATU lookup that might
      point the packet in the wrong direction. This means that all packets
      are flooded using the PVT config. So make sure that standalone ports
      are only allowed to communicate with the local upstream port.
      
      Here is a scenario in which this is needed:
      
         CPU
          |     .----.
      .---0---. | .--0--.
      |  sw0  | | | sw1 |
      '-1-2-3-' | '-1-2-'
            '---'
      
      - sw0p1 and sw1p1 are bridged
      - sw0p2 and sw1p2 are in standalone mode
      - Learning must be enabled on sw0p3 in order for hardware forwarding
        to work properly between bridged ports
      
      1. A packet with SA :aa comes in on sw1p2
         1a. Egresses sw1p0
         1b. Ingresses sw0p3, ATU adds an entry for :aa towards port 3
         1c. Egresses sw0p0
      
      2. A packet with DA :aa comes in on sw0p2
         2a. If an ATU lookup is done at this point, the packet will be
             incorrectly forwarded towards sw0p3. With this change in place,
             the ATU is bypassed and the packet is forwarded in accordance
             with the PVT, which only contains the CPU port.
      Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
      Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7af4a361
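The scenario in 7af4a361 can be replayed with a toy forwarding model. It is a rough sketch under stated assumptions, not driver code: the structures are hypothetical, MAC addresses are shortened to one byte, and the "before" configuration assumes that the standalone port's PVT mask also permitted the DSA port (port 3), which is what allows the stale ATU entry to misdirect the frame.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_ATU 8

/* Hypothetical, simplified ATU entry: one-byte MAC for brevity. */
struct atu_entry { uint8_t mac; uint8_t port; };

struct sw_model {
    struct atu_entry atu[MAX_ATU];
    int n_atu;
};

struct port_model {
    bool map_da;       /* consult the ATU for the destination address? */
    uint32_t pvt_mask; /* ports this ingress port is allowed to reach */
};

/* Returns the egress port mask for a frame with destination address da. */
static uint32_t egress_mask(const struct sw_model *sw,
                            const struct port_model *in, uint8_t da)
{
    if (in->map_da) {
        for (int i = 0; i < sw->n_atu; i++)
            if (sw->atu[i].mac == da)
                return (1u << sw->atu[i].port) & in->pvt_mask;
    }
    /* No lookup, or unknown DA: flood within the PVT mask. */
    return in->pvt_mask;
}
```

With an ATU entry for :aa pointing at port 3, a port with `map_da` set and a mask covering ports 0 and 3 sends the frame to port 3. With `map_da` cleared and the mask reduced to port 0, the same frame goes to the CPU port only.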
    • David S. Miller's avatar
      Merge branch 'ptp-virtual-clock-improvements' · b566967c
      David S. Miller authored
      Miroslav Lichvar says:
      
      ====================
      Virtual PTP clock improvements and fix
      
      v2:
      - dropped patch changing initial time of virtual clocks
      
      The first patch fixes an oops when unloading a driver with PTP clock and
      enabled virtual clocks.
      
      The other patches add missing features to make synchronization with
      virtual clocks work as well as with the physical clock.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b566967c
    • Miroslav Lichvar's avatar
      ptp: add getcrosststamp() to virtual clocks. · 21fad630
      Miroslav Lichvar authored
      If the physical clock supports cross timestamping (it has the
      getcrosststamp() function), provide a wrapper in the virtual clock to
      enable cross timestamping.
      
      This adds support for the PTP_SYS_OFFSET_PRECISE ioctl.
      Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Cc: Yangbo Lu <yangbo.lu@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      21fad630
    • Miroslav Lichvar's avatar
      ptp: add gettimex64() to virtual clocks. · f0067ebf
      Miroslav Lichvar authored
      If the physical clock has the gettimex64() function, provide a
      gettimex64() wrapper in the virtual clock to enable more accurate
      and stable synchronization.
      
      This adds support for the PTP_SYS_OFFSET_EXTENDED ioctl.
      Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Cc: Yangbo Lu <yangbo.lu@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f0067ebf
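As background for f0067ebf: each sample returned by PTP_SYS_OFFSET_EXTENDED is a triplet of timestamps (system clock before the read, PTP clock, system clock after the read). A consumer might reduce such samples to an offset estimate roughly as follows. This is a sketch only: the struct and helper names are hypothetical, timestamps are flattened to nanoseconds, and the ioctl call itself is left out so the arithmetic stands alone.

```c
#include <stdint.h>

/* One sample: system time before, PHC time, system time after (ns). */
struct offset_sample {
    int64_t sys_before;
    int64_t phc;
    int64_t sys_after;
};

/* Width of the window in which the PHC was read. */
static int64_t sample_delay(const struct offset_sample *s)
{
    return s->sys_after - s->sys_before;
}

/* PHC minus system clock, assuming the PHC was read at the midpoint. */
static int64_t sample_offset(const struct offset_sample *s)
{
    return s->phc - (s->sys_before + sample_delay(s) / 2);
}

/* Pick the sample with the narrowest window; n must be at least 1. */
static const struct offset_sample *
best_sample(const struct offset_sample *s, int n)
{
    const struct offset_sample *best = &s[0];

    for (int i = 1; i < n; i++)
        if (sample_delay(&s[i]) < sample_delay(best))
            best = &s[i];
    return best;
}
```

A narrower window bounds the uncertainty of when the PTP clock was actually read, which is one reason bracketed reads can give a steadier offset than a single unbracketed read.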
    • Miroslav Lichvar's avatar
      ptp: increase maximum adjustment of virtual clocks. · f77222d6
      Miroslav Lichvar authored
      Increase the maximum frequency offset of virtual clocks to 50% to enable
      faster slewing corrections.
      
      This value cannot be represented as scaled ppm when long has 32 bits,
      but that is already the case for other drivers, even those that provide
      the adjfine() function, i.e. 32-bit applications are expected to check
      for the limit.
      Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Cc: Yangbo Lu <yangbo.lu@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f77222d6
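The 32-bit remark in f77222d6 follows from simple arithmetic. Scaled ppm is parts per million with a 16-bit binary fraction, so a 50% offset (500000 ppm, or 500000000 ppb) becomes 500000 * 65536 = 32768000000, well above the largest value a 32-bit long can hold. The helper names below are made up for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Scaled ppm is ppm with a 16-bit binary fraction: ppm * 2^16. */
static int64_t ppb_to_scaled_ppm(int64_t ppb)
{
    return ppb * 65536 / 1000;
}

/* Would the value survive in a 32-bit signed long? */
static bool fits_in_32bit_long(int64_t v)
{
    return v >= INT32_MIN && v <= INT32_MAX;
}
```

By the same arithmetic, the largest offset expressible as scaled ppm in a 32-bit long is just under 32768 ppm, or about 3.28%.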
    • Miroslav Lichvar's avatar
      ptp: unregister virtual clocks when unregistering physical clock. · bfcbb76b
      Miroslav Lichvar authored
      When unregistering a physical clock which has some virtual clocks,
      unregister the virtual clocks with it.
      
      This fixes the following oops, which can be triggered by unloading
      a driver providing a PTP clock when it has enabled virtual clocks:
      
      BUG: unable to handle page fault for address: ffffffffc04fc4d8
      Oops: 0000 [#1] PREEMPT SMP NOPTI
      RIP: 0010:ptp_vclock_read+0x31/0xb0
      Call Trace:
       timecounter_read+0xf/0x50
       ptp_vclock_refresh+0x2c/0x50
       ? ptp_clock_release+0x40/0x40
       ptp_aux_kworker+0x17/0x30
       kthread_worker_fn+0x9b/0x240
       ? kthread_should_park+0x30/0x30
       kthread+0xe2/0x110
       ? kthread_complete_and_exit+0x20/0x20
       ret_from_fork+0x22/0x30
      
      Fixes: 73f37068 ("ptp: support ptp physical/virtual clocks conversion")
      Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Cc: Yangbo Lu <yangbo.lu@nxp.com>
      Cc: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bfcbb76b
    • Alexander Duyck's avatar
      page_pool: Refactor page_pool to enable fragmenting after allocation · 52cc6ffc
      Alexander Duyck authored
      This change is meant to permit a driver to perform "fragmenting" of the
      page from within the driver instead of the current model which requires
      pre-partitioning the page. The main motivation behind this is to support
      use cases where the page will be split up by the driver after DMA instead
      of before.
      
      With this change it becomes possible to start using page pool to replace
      some of the existing use cases where multiple references were being used
      for a single page, but the number needed was unknown as the size could be
      dynamic.
      
      For example, with this code it would be possible to do something like
      the following to handle allocation:
        page = page_pool_alloc_pages();
        if (!page)
          return NULL;
        page_pool_fragment_page(page, DRIVER_PAGECNT_BIAS_MAX);
        rx_buf->page = page;
        rx_buf->pagecnt_bias = DRIVER_PAGECNT_BIAS_MAX;
      
      Then we would process a received buffer by handling it with:
        rx_buf->pagecnt_bias--;
      
      Once the page has been fully consumed we could then flush the remaining
      instances with:
        if (page_pool_defrag_page(page, rx_buf->pagecnt_bias))
          continue;
        page_pool_put_defragged_page(pool, page, -1, !!budget);
      
      The general idea is that we want to have the ability to allocate a page
      with excess fragment count and then trim off the unneeded fragments.
      Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
      Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      52cc6ffc
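The bookkeeping behind the snippets in 52cc6ffc can be followed with a toy user-space model. It only mirrors the counting idea (allocate with an excess fragment count, hand fragments out, then trim what was never used). The `toy_*` names and `BIAS_MAX` are hypothetical stand-ins, and the plain `long` counter is a simplification that ignores the concurrency a real driver has to handle.

```c
#define BIAS_MAX 1024 /* hypothetical stand-in for DRIVER_PAGECNT_BIAS_MAX */

struct toy_page { long frag_count; };
struct toy_rx_buf { struct toy_page *page; long bias; };

/* Allocation step: give the page an excess fragment count. */
static void toy_fragment(struct toy_rx_buf *b, struct toy_page *p)
{
    p->frag_count = BIAS_MAX;
    b->page = p;
    b->bias = BIAS_MAX;
}

/* Hand one fragment to the stack: the driver gives up one reference. */
static void toy_use_fragment(struct toy_rx_buf *b)
{
    b->bias--;
}

/* Drop nr references and return how many are still outstanding. */
static long toy_defrag(struct toy_page *p, long nr)
{
    p->frag_count -= nr;
    return p->frag_count;
}
```

After three fragments have been handed out, trimming the driver's remaining bias leaves a count of three, one per fragment still held elsewhere. The page only becomes reusable once each of those holders has dropped its reference and the count reaches zero.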
    • David S. Miller's avatar
      Merge branch 'dsa-phylink_generic_validate' · 33f7a32d
      David S. Miller authored
      Russell King says:
      
      ====================
      Trivial DSA conversions to phylink_generic_validate()
      
      This series converts five DSA drivers to use phylink_generic_validate().
      No feedback or testing reports were received from the CFT posting.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      33f7a32d
    • Russell King (Oracle)'s avatar
      net: dsa: xrs700x: convert to phylink_generic_validate() · 1f8d99de
      Russell King (Oracle) authored
      Populate the supported interfaces and MAC capabilities for the xrs700x
      family of DSA switches and remove the old validate implementation to
      allow DSA to use phylink_generic_validate() for this switch driver.
      
      According to commit ee00b24f ("net: dsa: add Arrow SpeedChips
      XRS700x driver") the switch supports one RMII port and up to three
      RGMII ports. This commit assumes that port 0 is the RMII port and the
      remainder are RGMII.
      
      This commit also results in the Autoneg bit being set in the ethtool
      link modes, which wasn't in the original; if this switch supports
      RGMII to a 10/100/1G PHY, then surely we want to allow Autoneg on the
      PHY.
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1f8d99de
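The port assumption stated in 1f8d99de (port 0 is RMII, the remaining ports are RGMII) amounts to a per-port capability declaration that generic code can check against. The sketch below models that idea in user space. The `toy_*` types are hypothetical and deliberately much simpler than the kernel's phylink structures (for instance, RGMII is collapsed into a single mode and MAC capabilities are omitted).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins; not the kernel's phylink types. */
enum toy_iface { TOY_RMII, TOY_RGMII, TOY_SGMII };

struct toy_port_caps {
    uint32_t supported_interfaces; /* bitmask indexed by enum toy_iface */
};

/* Port 0 is assumed to be RMII and every other port RGMII. */
static struct toy_port_caps toy_get_caps(int port)
{
    struct toy_port_caps caps = { 0 };

    caps.supported_interfaces = 1u << (port == 0 ? TOY_RMII : TOY_RGMII);
    return caps;
}

/* Generic check: accept an interface mode only if the port declared it. */
static bool toy_validate(const struct toy_port_caps *caps,
                         enum toy_iface mode)
{
    return (caps->supported_interfaces & (1u << mode)) != 0;
}
```

Once each port declares what it supports, the per-driver validation logic reduces to a lookup, which is the motivation for moving drivers to a shared validation routine.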