1. 20 Nov, 2020 11 commits
    • rcu/tree: nocb: Avoid raising softirq for offloaded ready-to-execute CBs · bd56e0a4
      Joel Fernandes (Google) authored
      Testing showed that rcu_pending() can return 1 when offloaded callbacks
      are ready to execute.  This invokes RCU core processing, for example,
      by raising RCU_SOFTIRQ, eventually resulting in a call to rcu_core().
      However, rcu_core() explicitly avoids in any way manipulating offloaded
      callbacks, which are instead handled by the rcuog and rcuoc kthreads,
      which work independently of rcu_core().
      
      One exception to this independence is that rcu_core() invokes
      do_nocb_deferred_wakeup().  However, rcu_pending() also checks
      rcu_nocb_need_deferred_wakeup() in order to correctly handle this case,
      invoking rcu_core() when needed.
      
      This commit therefore avoids needlessly invoking RCU core processing
      by checking rcu_segcblist_ready_cbs() only on non-offloaded CPUs.
      This reduces overhead, for example, by reducing softirq activity.
      
      This change passed 30 minute tests of TREE01 through TREE09 each.
      
      On TREE08, there is at most 150us from the time that rcu_pending() chose
      not to invoke RCU core processing to the time when the ready callbacks
      were invoked by the rcuoc kthread.  This provides further evidence that
      there is no need to invoke rcu_core() for offloaded callbacks that are
      ready to invoke.
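
      The resulting decision can be modeled in user space roughly as follows.
      This is only a sketch and is not the kernel's rcu_pending(): struct
      cpu_model, its fields, and model_rcu_pending() are hypothetical stand-ins,
      and only the two checks discussed in this commit are modeled.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU state that the check consults. */
struct cpu_model {
	bool offloaded;            /* callbacks handled by rcuog/rcuoc kthreads */
	bool ready_cbs;            /* some callbacks are ready to invoke */
	bool need_deferred_wakeup; /* a deferred nocb wakeup is pending */
};

/* Returns true when RCU core processing should be invoked. */
static bool model_rcu_pending(const struct cpu_model *cpu)
{
	/* A deferred wakeup still needs the core, offloaded or not. */
	if (cpu->need_deferred_wakeup)
		return true;

	/* Ready callbacks matter only when this CPU invokes them itself. */
	if (!cpu->offloaded && cpu->ready_cbs)
		return true;

	return false;
}
```

      With offloaded set, ready callbacks alone no longer request core
      processing in this model, but a pending deferred wakeup still does.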
      
      Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
      Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
      Reviewed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • rcu,ftrace: Fix ftrace recursion · d2098b44
      Peter Zijlstra authored
      Kim reported that perf-ftrace made his box unhappy. It turns out that
      commit:
      
        ff5c4f5c ("rcu/tree: Mark the idle relevant functions noinstr")
      
      removed one too many notrace qualifiers, probably due to there not being
      a helpful comment.
      
      This commit therefore reinstates the notrace and adds a comment to avoid
      losing it again.
      
      [ paulmck: Apply Steven Rostedt's feedback on the comment. ]
      Fixes: ff5c4f5c ("rcu/tree: Mark the idle relevant functions noinstr")
      Reported-by: Kim Phillips <kim.phillips@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • rcu/tree: Make struct kernel_param_ops definitions const · 7c47ee5a
      Joe Perches authored
      These should be const, so make it so.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • rcu/tree: Add a warning if CPU being onlined did not report QS already · 9f866dac
      Joel Fernandes (Google) authored
      Currently, rcu_cpu_starting() checks to see if the RCU core expects a
      quiescent state from the incoming CPU.  However, the current interaction
      between RCU quiescent-state reporting and CPU-hotplug operations should
      mean that the incoming CPU never needs to report a quiescent state.
      First, the outgoing CPU reports a quiescent state if needed.  Second,
      the race where the CPU is leaving just as RCU is initializing a new
      grace period is handled by an explicit check for this condition.  Third,
      the CPU's leaf rcu_node structure's ->lock serializes these checks.
      
      This means that if rcu_cpu_starting() ever feels the need to report
      a quiescent state, then there is a bug somewhere in the CPU hotplug
      code or the RCU grace-period handling code.  This commit therefore
      adds a WARN_ON_ONCE() to bring that bug to everyone's attention.
      
      Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
      Suggested-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • rcu: Clarify nocb kthreads naming in RCU_NOCB_CPU config · a3941517
      Neeraj Upadhyay authored
      This commit clarifies that the "p" and the "s" in the RCU_NOCB_CPU
      config-option description refer to the "x" in the "rcuox/N" kthread name.
      Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
      [ paulmck: While in the area, update description and advice. ]
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • rcu: Fix single-CPU check in rcu_blocking_is_gp() · ed73860c
      Neeraj Upadhyay authored
      Currently, for CONFIG_PREEMPTION=n kernels, rcu_blocking_is_gp() uses
      num_online_cpus() to determine whether there is only one CPU online.  When
      there is only a single CPU online, the simple fact that synchronize_rcu()
      could be legally called implies that a full grace period has elapsed.
      Therefore, in the single-CPU case, synchronize_rcu() simply returns
      immediately.  Unfortunately, num_online_cpus() is unreliable while a
      CPU-hotplug operation is transitioning to or from single-CPU operation
      because:
      
      1.	num_online_cpus() uses atomic_read(&__num_online_cpus) to
      	locklessly sample the number of online CPUs.  The hotplug locks
      	are not held, which means that an incoming CPU can concurrently
      	update this count.  This in turn means that an RCU read-side
      	critical section on the incoming CPU might observe updates
      	prior to the grace period, but also that this critical section
      	might extend beyond the end of the optimized synchronize_rcu().
      	This breaks RCU's fundamental guarantee.
      
      2.	In addition, num_online_cpus() does no ordering, thus providing
      	another way that RCU's fundamental guarantee can be broken by
      	the current code.
      
      3.	The most probable failure mode happens on outgoing CPUs.
      	The outgoing CPU updates the count of online CPUs in the
      	CPUHP_TEARDOWN_CPU stop-machine handler, which is fine in
      	and of itself due to preemption being disabled at the call
      	to num_online_cpus().  Unfortunately, after that stop-machine
      	handler returns, the CPU takes one last trip through the
      	scheduler (which has RCU readers) and, after the resulting
      	context switch, one final dive into the idle loop.  During this
      	time, RCU needs to keep track of two CPUs, but num_online_cpus()
      	will say that there is only one, which in turn means that the
      	surviving CPU will incorrectly ignore the outgoing CPU's RCU
      	read-side critical sections.
      
      This problem is illustrated by the following litmus test in which P0()
      corresponds to synchronize_rcu() and P1() corresponds to the incoming CPU.
      The herd7 tool confirms that the "exists" clause can be satisfied,
      thus demonstrating that this breakage can happen according to the Linux
      kernel memory model.
      
         {
           int x = 0;
           atomic_t numonline = ATOMIC_INIT(1);
         }
      
         P0(int *x, atomic_t *numonline)
         {
           int r0;
           WRITE_ONCE(*x, 1);
           r0 = atomic_read(numonline);
           if (r0 == 1) {
             smp_mb();
           } else {
             synchronize_rcu();
           }
           WRITE_ONCE(*x, 2);
         }
      
         P1(int *x, atomic_t *numonline)
         {
           int r0; int r1;
      
           atomic_inc(numonline);
           smp_mb();
           rcu_read_lock();
           r0 = READ_ONCE(*x);
           smp_rmb();
           r1 = READ_ONCE(*x);
           rcu_read_unlock();
         }
      
         locations [x;numonline;]
      
         exists (1:r0=0 /\ 1:r1=2)
      
      It is important to note that these problems arise only when the system
      is transitioning to or from single-CPU operation.
      
      One solution would be to hold the CPU-hotplug locks while sampling
      num_online_cpus(), which was in fact the intent of the (redundant)
      preempt_disable() and preempt_enable() surrounding this call to
      num_online_cpus().  Actually blocking CPU hotplug would not only result
      in excessive overhead, but would also unnecessarily impede CPU-hotplug
      operations.
      
      This commit therefore follows long-standing RCU tradition by maintaining
      a separate RCU-specific set of CPU-hotplug books.
      
      This separate set of books is implemented by a new ->n_online_cpus field
      in the rcu_state structure that maintains RCU's count of the online CPUs.
      This count is incremented early in the CPU-online process, so that
      the critical transition away from single-CPU operation will occur when
      there is only a single CPU.  Similarly for the critical transition to
      single-CPU operation, the counter is decremented late in the CPU-offline
      process, again while there is only a single CPU.  Because there is only
      ever a single CPU when the ->n_online_cpus field undergoes the critical
      1->2 and 2->1 transitions, full memory ordering and mutual exclusion are
      provided implicitly and, better yet, for free.
      
      In the case where the CPU is coming online, nothing will happen until
      the current CPU helps it come online.  Therefore, the new CPU will see
      all accesses prior to the optimized grace period, which means that RCU
      does not need to further delay this new CPU.  In the case where the CPU
      is going offline, the outgoing CPU is totally out of the picture before
      the optimized grace period starts, which means that this outgoing CPU
      cannot see any of the accesses following that grace period.  Again,
      RCU needs no further interaction with the outgoing CPU.
      
      This does mean that synchronize_rcu() will unnecessarily do a few grace
      periods the hard way just before the second CPU comes online and just
      after the second-to-last CPU goes offline, but it is not worth optimizing
      this uncommon case.
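
      The separate set of books can be modeled in user space roughly as
      follows.  This is a sketch under stated assumptions: struct
      rcu_state_model and the model_*() helpers are hypothetical names, and
      the single-CPU serialization that the kernel relies on is assumed here
      rather than enforced.

```c
#include <stdbool.h>

/* Hypothetical model of RCU's private count of online CPUs. */
struct rcu_state_model {
	int n_online_cpus;
};

/* Called early in the online path, before the incoming CPU runs readers. */
static void model_cpu_online_early(struct rcu_state_model *rs)
{
	rs->n_online_cpus++;
}

/* Called late in the offline path, once the outgoing CPU is out of the picture. */
static void model_cpu_offline_late(struct rcu_state_model *rs)
{
	rs->n_online_cpus--;
}

/* True when a synchronize_rcu()-style call may return immediately. */
static bool model_blocking_is_gp(const struct rcu_state_model *rs)
{
	return rs->n_online_cpus <= 1;
}
```

      Because the count rises early and falls late, the model takes the slow
      path for slightly longer than strictly necessary around each transition,
      which matches the trade-off described above.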
      Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • rcu: Implement rcu_segcblist_is_offloaded() config dependent · e3771c85
      Frederic Weisbecker authored
      This commit simplifies the use of the rcu_segcblist_is_offloaded() API so
      that its callers no longer need to check the RCU_NOCB_CPU Kconfig option.
      Note that rcu_segcblist_is_offloaded() is defined in the header file,
      which means that the generated code should be just as efficient as before.
      Suggested-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • list.h: Update comment to explicitly note circular lists · 1eafe075
      Asif Rasheed authored
      The students in the Operating System Lecture Section at the
      American University of Sharjah were confused by the header comment
      in include/linux/list.h, which says "Simple doubly linked list
      implementation".  This comment means "simple" as in "not complex",
      but "simple" is often used in this context to mean "not circular".
      This commit therefore avoids this ambiguity by explicitly calling out
      "circular".
      Signed-off-by: Asif Rasheed <b00073877@aus.edu>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
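
      For readers unfamiliar with the distinction, a circular doubly linked
      list can be sketched as follows.  The struct node type and the node_*()
      helpers are illustrative names local to this sketch and are not the
      list.h API.

```c
#include <stdbool.h>

/* Illustrative node type; the names are local to this sketch. */
struct node {
	struct node *next;
	struct node *prev;
};

/* An empty circular list is a head that points at itself. */
static void node_init(struct node *head)
{
	head->next = head;
	head->prev = head;
}

static bool node_list_empty(const struct node *head)
{
	return head->next == head;
}

/* Insert entry immediately after head. */
static void node_add(struct node *entry, struct node *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Count entries by walking until the traversal returns to head. */
static int node_count(const struct node *head)
{
	int n = 0;

	for (const struct node *p = head->next; p != head; p = p->next)
		n++;
	return n;
}
```

      There is no NULL terminator anywhere: the last entry's next pointer and
      the first entry's prev pointer both lead back to the head, which is what
      "circular" means here.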
    • rcu: Panic after fixed number of stalls · dfe56404
      chao authored
      Some stalls are transient, so that the system fully recovers.  This commit
      therefore allows users to configure the number of stalls that must happen
      in order to trigger a kernel panic.
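
      One way to picture this policy is a counter compared against a
      configurable limit.  The sketch below is a user-space model with
      hypothetical names (struct stall_policy, stall_should_panic()); it
      assumes that a limit of zero disables the panic, which this commit
      message does not state.

```c
#include <stdbool.h>

/* Hypothetical model of "panic after N stalls"; names are illustrative. */
struct stall_policy {
	int max_stalls_to_panic; /* assumed: zero or less never panics */
	int stall_count;         /* stalls recorded so far */
};

/* Record one stall; return true when the configured limit is reached. */
static bool stall_should_panic(struct stall_policy *p)
{
	p->stall_count++;
	return p->max_stalls_to_panic > 0 &&
	       p->stall_count >= p->max_stalls_to_panic;
}
```

      With a limit of 3, the first two stalls are tolerated in this model and
      the third one requests the panic.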
      Signed-off-by: chao <chao@eero.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • x86/smpboot: Move rcu_cpu_starting() earlier · 29368e09
      Paul E. McKenney authored
      The call to rcu_cpu_starting() in mtrr_ap_init() is not early enough
      in the CPU-hotplug onlining process, which results in lockdep splats
      as follows:
      
      =============================
      WARNING: suspicious RCU usage
      5.9.0+ #268 Not tainted
      -----------------------------
      kernel/kprobes.c:300 RCU-list traversed in non-reader section!!
      
      other info that might help us debug this:
      
      RCU used illegally from offline CPU!
      rcu_scheduler_active = 1, debug_locks = 1
      no locks held by swapper/1/0.
      
      stack backtrace:
      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0+ #268
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
      Call Trace:
       dump_stack+0x77/0x97
       __is_insn_slot_addr+0x15d/0x170
       kernel_text_address+0xba/0xe0
       ? get_stack_info+0x22/0xa0
       __kernel_text_address+0x9/0x30
       show_trace_log_lvl+0x17d/0x380
       ? dump_stack+0x77/0x97
       dump_stack+0x77/0x97
       __lock_acquire+0xdf7/0x1bf0
       lock_acquire+0x258/0x3d0
       ? vprintk_emit+0x6d/0x2c0
       _raw_spin_lock+0x27/0x40
       ? vprintk_emit+0x6d/0x2c0
       vprintk_emit+0x6d/0x2c0
       printk+0x4d/0x69
       start_secondary+0x1c/0x100
       secondary_startup_64_no_verify+0xb8/0xbb
      
      This is avoided by moving the call to rcu_cpu_starting() up near
      the beginning of the start_secondary() function.  Note that
      raw_smp_processor_id() is required in order to avoid calling into lockdep
      before RCU has declared the CPU to be watched for readers.
      
      Link: https://lore.kernel.org/lkml/160223032121.7002.1269740091547117869.tip-bot2@tip-bot2/
      Reported-by: Qian Cai <cai@redhat.com>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • rcu: Allow rcu_irq_enter_check_tick() from NMI · 6dbce04d
      Peter Zijlstra authored
      Eugenio managed to tickle #PF from NMI context which resulted in
      hitting a WARN in RCU through irqentry_enter() ->
      __rcu_irq_enter_check_tick().
      
      However, this situation is perfectly sane and does not warrant a
      WARN. The #PF will (necessarily) be atomic and not require messing
      with the tick state, so early return is correct.  This commit
      therefore removes the WARN.
      
      Fixes: aaf2bc50 ("rcu: Abstract out rcu_irq_enter_check_tick() from rcu_nmi_enter()")
      Reported-by: "Eugenio Pérez" <eupm90@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
  2. 11 Nov, 2020 1 commit
  3. 25 Oct, 2020 17 commits
  4. 24 Oct, 2020 11 commits
    • Merge tag 'block-5.10-2020-10-24' of git://git.kernel.dk/linux-block · d7691390
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
      
       - NVMe pull request from Christoph
           - rdma error handling fixes (Chao Leng)
           - fc error handling and reconnect fixes (James Smart)
           - fix the qid displace when tracing ioctl command (Keith Busch)
           - don't use BLK_MQ_REQ_NOWAIT for passthru (Chaitanya Kulkarni)
           - fix MTDT for passthru (Logan Gunthorpe)
           - blacklist Write Same on more devices (Kai-Heng Feng)
           - fix an uninitialized work struct (zhenwei pi)
      
       - lightnvm out-of-bounds fix (Colin)
      
       - SG allocation leak fix (Doug)
      
       - rnbd fixes (Gioh, Guoqing, Jack)
      
       - zone error translation fixes (Keith)
      
       - kerneldoc markup fix (Mauro)
      
       - zram lockdep fix (Peter)
      
       - Kill unused io_context members (Yufen)
      
       - NUMA memory allocation cleanup (Xianting)
      
       - NBD config wakeup fix (Xiubo)
      
      * tag 'block-5.10-2020-10-24' of git://git.kernel.dk/linux-block: (27 commits)
        block: blk-mq: fix a kernel-doc markup
        nvme-fc: shorten reconnect delay if possible for FC
        nvme-fc: wait for queues to freeze before calling update_hr_hw_queues
        nvme-fc: fix error loop in create_hw_io_queues
        nvme-fc: fix io timeout to abort I/O
        null_blk: use zone status for max active/open
        nvmet: don't use BLK_MQ_REQ_NOWAIT for passthru
        nvmet: cleanup nvmet_passthru_map_sg()
        nvmet: limit passthru MTDS by BIO_MAX_PAGES
        nvmet: fix uninitialized work for zero kato
        nvme-pci: disable Write Zeroes on Sandisk Skyhawk
        nvme: use queuedata for nvme_req_qid
        nvme-rdma: fix crash due to incorrect cqe
        nvme-rdma: fix crash when connect rejected
        block: remove unused members for io_context
        blk-mq: remove the calling of local_memory_node()
        zram: Fix __zram_bvec_{read,write}() locking order
        skd_main: remove unused including <linux/version.h>
        sgl_alloc_order: fix memory leak
        lightnvm: fix out-of-bounds write to array devices->info[]
        ...
    • Merge tag 'io_uring-5.10-2020-10-24' of git://git.kernel.dk/linux-block · af004187
      Linus Torvalds authored
      Pull io_uring fixes from Jens Axboe:
      
       - fsize was missed in previous unification of work flags
      
       - Few fixes cleaning up the flags unification creds cases (Pavel)
      
       - Fix NUMA affinities for completely unplugged/replugged node for io-wq
      
       - Two fallout fixes from the set_fs changes. One local to io_uring, one
         for the splice entry point that io_uring uses.
      
       - Linked timeout fixes (Pavel)
      
       - Removal of ->flush() ->files work-around that we don't need anymore
         with referenced files (Pavel)
      
       - Various cleanups (Pavel)
      
      * tag 'io_uring-5.10-2020-10-24' of git://git.kernel.dk/linux-block:
        splice: change exported internal do_splice() helper to take kernel offset
        io_uring: make loop_rw_iter() use original user supplied pointers
        io_uring: remove req cancel in ->flush()
        io-wq: re-set NUMA node affinities if CPUs come online
        io_uring: don't reuse linked_timeout
        io_uring: unify fsize with def->work_flags
        io_uring: fix racy REQ_F_LINK_TIMEOUT clearing
        io_uring: do poll's hash_node init in common code
        io_uring: inline io_poll_task_handler()
        io_uring: remove extra ->file check in poll prep
        io_uring: make cached_cq_overflow non atomic_t
        io_uring: inline io_fail_links()
        io_uring: kill ref get/drop in personality init
        io_uring: flags-based creds init in queue
    • Merge tag 'libata-5.10-2020-10-24' of git://git.kernel.dk/linux-block · cb6b2897
      Linus Torvalds authored
      Pull libata fixes from Jens Axboe:
       "Two minor libata fixes:
      
         - Fix a DMA boundary mask regression for sata_rcar (Geert)
      
         - kerneldoc markup fix (Mauro)"
      
      * tag 'libata-5.10-2020-10-24' of git://git.kernel.dk/linux-block:
        ata: fix some kernel-doc markups
        ata: sata_rcar: Fix DMA boundary mask
    • Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 0eac1102
      Linus Torvalds authored
      Pull misc vfs updates from Al Viro:
       "Assorted stuff all over the place (the largest group here is
        Christoph's stat cleanups)"
      
      * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        fs: remove KSTAT_QUERY_FLAGS
        fs: remove vfs_stat_set_lookup_flags
        fs: move vfs_fstatat out of line
        fs: implement vfs_stat and vfs_lstat in terms of vfs_fstatat
        fs: remove vfs_statx_fd
        fs: omfs: use kmemdup() rather than kmalloc+memcpy
        [PATCH] reduce boilerplate in fsid handling
        fs: Remove duplicated flag O_NDELAY occurring twice in VALID_OPEN_FLAGS
        selftests: mount: add nosymfollow tests
        Add a "nosymfollow" mount option.
    • Merge tag 'dma-mapping-5.10-1' of git://git.infradead.org/users/hch/dma-mapping · 1b307ac8
      Linus Torvalds authored
      Pull dma-mapping fixes from Christoph Hellwig:
      
       - document the new dma_{alloc,free}_pages() API
      
       - two fixups for the dma-mapping.h split
      
      * tag 'dma-mapping-5.10-1' of git://git.infradead.org/users/hch/dma-mapping:
        dma-mapping: document dma_{alloc,free}_pages
        dma-mapping: move more functions to dma-map-ops.h
        ARM/sa1111: add a missing include of dma-map-ops.h
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 9bf8d8bc
      Linus Torvalds authored
      Pull KVM fixes from Paolo Bonzini:
       "Two fixes for this merge window, and an unrelated bugfix for a host
        hang"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: ioapic: break infinite recursion on lazy EOI
        KVM: vmx: rename pi_init to avoid conflict with paride
        KVM: x86/mmu: Avoid modulo operator on 64-bit value to fix i386 build
    • Merge tag 'x86_seves_fixes_for_v5.10_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · c51ae124
      Linus Torvalds authored
      Pull x86 SEV-ES fixes from Borislav Petkov:
       "Three fixes to SEV-ES to correct setting up the new early pagetable on
        5-level paging machines, to always map boot_params and the kernel
        cmdline, and disable stack protector for ../compressed/head{32,64}.c.
        (Arvind Sankar)"
      
      * tag 'x86_seves_fixes_for_v5.10_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/boot/64: Explicitly map boot_params and command line
        x86/head/64: Disable stack protection for head$(BITS).o
        x86/boot/64: Initialize 5-level paging variables earlier
    • random32: add a selftest for the prandom32 code · c6e169bc
      Willy Tarreau authored
      Given that this code is new, let's add a selftest for it as well.
      It doesn't rely on fixed sets, instead it picks 1024 numbers and
      verifies that they're not more correlated than desired.
      
      Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
      Cc: George Spelvin <lkml@sdf.org>
      Cc: Amit Klein <aksecurity@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: tytso@mit.edu
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Marc Plumb <lkml.mplumb@gmail.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • random32: add noise from network and scheduling activity · 3744741a
      Willy Tarreau authored
      With the removal of the interrupt perturbations in the previous random32
      change (random32: make prandom_u32() output unpredictable), the PRNG
      has become 100% deterministic again. While SipHash is expected to be
      way more robust against brute force than the previous Tausworthe LFSR,
      there's still the risk that whoever has even one temporary access to
      the PRNG's internal state is able to predict all subsequent draws till
      the next reseed (roughly every minute). This may happen through a side
      channel attack or any data leak.
      
      This patch restores the spirit of commit f227e3ec ("random32: update
      the net random state on interrupt and activity") in that it will perturb
      the internal PRNG's state using externally collected noise, except that
      it will not pick that noise from the random pool's bits nor upon
      interrupt, but will rather combine a few elements along the Tx path
      that are collectively hard to predict, such as dev, skb and txq
      pointers, packet length and jiffies values. These ones are combined
      using a single round of SipHash into a single long variable that is
      mixed with the net_rand_state upon each invocation.
      
      The operation was inlined because it produces very small and efficient
      code, typically 3 xor, 2 add and 2 rol. The performance was measured
      to be the same as (even very slightly better than) before the switch to
      SipHash; on a 6-core 12-thread Core i7-8700k equipped with a 40G NIC
      (i40e), the connection rate dropped from 556k/s to 555k/s while the
      SYN cookie rate grew from 5.38 Mpps to 5.45 Mpps.
      
      Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
      Cc: George Spelvin <lkml@sdf.org>
      Cc: Amit Klein <aksecurity@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: tytso@mit.edu
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Marc Plumb <lkml.mplumb@gmail.com>
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • random32: make prandom_u32() output unpredictable · c51f8f88
      George Spelvin authored
      Non-cryptographic PRNGs may have great statistical properties, but
      are usually trivially predictable to someone who knows the algorithm,
      given a small sample of their output.  An LFSR like prandom_u32() is
      particularly simple, even if the sample is widely scattered bits.
      
      It turns out the network stack uses prandom_u32() for some things like
      random port numbers which it would prefer are *not* trivially predictable.
      Predictability led to a practical DNS spoofing attack.  Oops.
      
      This patch replaces the LFSR with a homebrew cryptographic PRNG based
      on the SipHash round function, which is in turn seeded with 128 bits
      of strong random key.  (The authors of SipHash have *not* been consulted
      about this abuse of their algorithm.)  Speed is prioritized over security;
      attacks are rare, while performance is always wanted.
      
      Replacing all callers of prandom_u32() is the quick fix.
      Whether to reinstate a weaker PRNG for uses which can tolerate it
      is an open question.
      
      Commit f227e3ec ("random32: update the net random state on interrupt
      and activity") was an earlier attempt at a solution.  This patch replaces
      it.
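
      For illustration, a generator built on the SipHash round function might
      look roughly like the following user-space sketch.  The rotation
      constants are those of the standard 64-bit SipRound; the state layout,
      the use of two rounds per draw, the output combination, and all names
      are assumptions of this sketch and are not copied from the kernel code.
      Seeding from a strong random key is left to the caller.

```c
#include <stdint.h>

/* Rotate left; r must be in the range 1..63. */
static inline uint64_t rol64(uint64_t x, unsigned int r)
{
	return (x << r) | (x >> (64 - r));
}

/* Hypothetical four-word generator state. */
struct sip_state {
	uint64_t v0, v1, v2, v3;
};

/* One standard 64-bit SipRound applied to the state. */
static void sip_round(struct sip_state *s)
{
	s->v0 += s->v1; s->v1 = rol64(s->v1, 13); s->v1 ^= s->v0;
	s->v0 = rol64(s->v0, 32);
	s->v2 += s->v3; s->v3 = rol64(s->v3, 16); s->v3 ^= s->v2;
	s->v0 += s->v3; s->v3 = rol64(s->v3, 21); s->v3 ^= s->v0;
	s->v2 += s->v1; s->v1 = rol64(s->v1, 17); s->v1 ^= s->v2;
	s->v2 = rol64(s->v2, 32);
}

/* Assumed draw: advance the state two rounds and fold two words. */
static uint32_t sketch_u32(struct sip_state *s)
{
	sip_round(s);
	sip_round(s);
	return (uint32_t)(s->v1 + s->v3);
}
```

      Without any external perturbation such a generator is fully determined
      by its state: two instances seeded identically produce identical
      streams.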
      Reported-by: Amit Klein <aksecurity@gmail.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: tytso@mit.edu
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Marc Plumb <lkml.mplumb@gmail.com>
      Fixes: f227e3ec ("random32: update the net random state on interrupt and activity")
      Signed-off-by: George Spelvin <lkml@sdf.org>
      Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
      [ willy: partial reversal of f227e3ec; moved SIPROUND definitions
        to prandom.h for later use; merged George's prandom_seed() proposal;
        inlined siprand_u32(); replaced the net_rand_state[] array with 4
        members to fix a build issue; cosmetic cleanups to make checkpatch
        happy; fixed RANDOM32_SELFTEST build ]
      Signed-off-by: Willy Tarreau <w@1wt.eu>
    • Merge tag 'powerpc-5.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · b6f96e75
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
      
       - A fix for undetected data corruption on Power9 Nimbus <= DD2.1 in the
         emulation of VSX loads. The affected CPUs were not widely available.
      
       - Two fixes for machine check handling in guests under PowerVM.
      
       - A fix for our recent changes to SMP setup, when
         CONFIG_CPUMASK_OFFSTACK=y.
      
       - Three fixes for races in the handling of some of our powernv sysfs
         attributes.
      
       - One change to remove TM from the set of Power10 CPU features.
      
       - A couple of other minor fixes.
      
      Thanks to: Aneesh Kumar K.V, Christophe Leroy, Ganesh Goudar, Jordan
      Niethe, Mahesh Salgaonkar, Michael Neuling, Oliver O'Halloran, Qian Cai,
      Srikar Dronamraju, Vasant Hegde.
      
      * tag 'powerpc-5.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/pseries: Avoid using addr_to_pfn in real mode
        powerpc/uaccess: Don't use "m<>" constraint with GCC 4.9
        powerpc/eeh: Fix eeh_dev_check_failure() for PE#0
        powerpc/64s: Remove TM from Power10 features
        selftests/powerpc: Make alignment handler test P9N DD2.1 vector CI load workaround
        powerpc: Fix undetected data corruption with P9N DD2.1 VSX CI load emulation
        powerpc/powernv/dump: Handle multiple writes to ack attribute
        powerpc/powernv/dump: Fix race while processing OPAL dump
        powerpc/smp: Use GFP_ATOMIC while allocating tmp mask
        powerpc/smp: Remove unnecessary variable
        powerpc/mce: Avoid nmi_enter/exit in real mode on pseries hash
        powerpc/opal_elog: Handle multiple writes to ack attribute