1. 21 Feb, 2017 1 commit
    • kernfs: fix locking around kernfs_ops->release() callback · f83f3c51
      Tejun Heo authored
      The release callback may be called from two places - file release
      operation and kernfs open file draining.  kernfs_open_file->mutex is
      used to synchronize the two callsites.  This unfortunately leads to
      possible circular locking because of->mutex is used to protect the
      usual kernfs operations which may use locking constructs which are
      held while removing and thus draining kernfs files.
      
      @of->mutex is for synchronizing concurrent kernfs access operations
      and all we need here is synchronization between the release and drain
      paths.  As the drain path has to grab kernfs_open_file_mutex anyway,
      let's use the mutex to synchronize the release operation instead.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Tony Lindgren <tony@atomide.com>
      Fixes: 0e67db2f ("kernfs: add kernfs_ops->open/release() callbacks")
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
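
      Illustration (not part of the commit): a minimal userspace sketch
      of the idea, assuming a pthread mutex stands in for
      kernfs_open_file_mutex.  The struct, the released flag and the
      function names are hypothetical; the sketch only shows both call
      sites serializing on one global lock so that the callback runs at
      most once.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for kernfs_open_file; not the kernel structure. */
struct open_file {
    bool released;      /* set once the release callback has run */
    int release_calls;  /* counts callback invocations for the demo */
};

/* One global mutex, playing the role of kernfs_open_file_mutex. */
static pthread_mutex_t open_file_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Run the release callback at most once; caller holds open_file_mutex. */
static void release_locked(struct open_file *of)
{
    if (!of->released) {
        of->release_calls++;  /* the ->release() callback would run here */
        of->released = true;
    }
}

/* Call site 1: the file release operation. */
void file_release_path(struct open_file *of)
{
    pthread_mutex_lock(&open_file_mutex);
    release_locked(of);
    pthread_mutex_unlock(&open_file_mutex);
}

/* Call site 2: draining open files while the node is being removed. */
void drain_path(struct open_file *of)
{
    pthread_mutex_lock(&open_file_mutex);
    release_locked(of);
    pthread_mutex_unlock(&open_file_mutex);
}
```

      Neither path takes a per-file mutex here, which is the point of
      the change: the per-file lock stays reserved for ordinary access
      operations.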
  2. 02 Feb, 2017 3 commits
    • Merge branch 'cgroup/for-4.11-rdmacg' into cgroup/for-4.11 · 63f1ca59
      Tejun Heo authored
      Merge in to resolve conflicts in Documentation/cgroup-v2.txt.  The
      conflicts are from multiple section additions and trivial to resolve.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: drop the matching uid requirement on migration for cgroup v2 · 576dd464
      Tejun Heo authored
      Along with the write access to the cgroup.procs or tasks file, cgroup
      has required the writer's euid, unless root, to match [s]uid of the
      target process or task.  On cgroup v1, this is necessary because
      there's nothing preventing a delegatee from pulling in tasks or
      processes from all over the system.
      
      If a user has a cgroup subdirectory delegated to it, the user would
      have write access to the cgroup.procs or tasks file.  If there are no
      further checks than file write access check, the user would be able to
      pull processes from all over the system into its subhierarchy which is
      clearly not the intended behavior.  The matching [s]uid requirement
      partially prevents this problem by allowing a delegatee to pull in the
      processes that belong to it.  This isn't a sufficient protection
      however, because a user would still be able to jump processes across
      two disjoint sub-hierarchies that have been delegated to them.
      
      cgroup v2 resolves the issue by requiring the writer to have access to
      the common ancestor of the cgroup.procs file of the source and target
      cgroups.  This confines each delegatee to their own sub-hierarchy
      proper and bases all permission decisions on the cgroup filesystem
      rather than having to pull in explicit uid matching.
      
      cgroup v2 has still been applying the matching [s]uid requirement just
      for historical reasons.  On cgroup2, the requirement doesn't serve any
      purpose while unnecessarily complicating the permission model.  Let's
      drop it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
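
      Illustration (not part of the commit): a minimal sketch of the
      common-ancestor rule described above.  The cg_node struct, its
      writable field (modelling write access to that cgroup's
      cgroup.procs file) and both function names are hypothetical; the
      kernel performs the check against real file permissions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cgroup node for the sketch. */
struct cg_node {
    struct cg_node *parent;  /* NULL for the root */
    int depth;               /* the root has depth 0 */
    bool writable;           /* writer may write this node's cgroup.procs */
};

/* Walk both nodes up to the same depth, then up together until they meet. */
static struct cg_node *common_ancestor(struct cg_node *a, struct cg_node *b)
{
    while (a->depth > b->depth)
        a = a->parent;
    while (b->depth > a->depth)
        b = b->parent;
    while (a != b) {
        a = a->parent;
        b = b->parent;
    }
    return a;  /* NULL if the nodes are in different trees */
}

/* Allow the migration only with access to the destination and the common ancestor. */
bool may_migrate(struct cg_node *src, struct cg_node *dst)
{
    struct cg_node *anc = common_ancestor(src, dst);

    return anc != NULL && anc->writable && dst->writable;
}
```

      With two delegated subtrees under a root the user cannot write
      to, a move inside one subtree passes while a move between the
      two subtrees fails at the root.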
    • cgroup, perf_event: make perf_event controller work on cgroup2 hierarchy · 968ebff1
      Tejun Heo authored
      perf_event is a utility controller whose primary role is identifying
      cgroup membership to filter perf events; however, because it also
      tracks some per-css state, it can't be replaced by pure cgroup
      membership test.  Mark the controller as implicitly enabled on the
      default hierarchy so that perf events can always be filtered based on
      cgroup v2 path as long as the controller is not mounted on a legacy
      hierarchy.
      
      "perf record" is updated accordingly so that it searches for both v1
      and v2 hierarchies.  A v1 hierarchy is used if perf_event is mounted
      on it; otherwise, it uses the v2 hierarchy.
      
      v2: Doc updated to reflect more flexible rebinding behavior.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
  3. 30 Jan, 2017 1 commit
    • cgroup: misc cleanups · b807421a
      Tejun Heo authored
      * cgrp_dfl_implicit_ss_mask is ulong instead of u16 unlike other
        ss_masks.  Make it a u16.
      
      * Move have_canfork_callback together with other callback ss_masks.
      Signed-off-by: Tejun Heo <tj@kernel.org>
  4. 26 Jan, 2017 2 commits
    • Merge branch 'for-4.10-fixes' into for-4.11 · bdf3d06b
      Tejun Heo authored
    • cgroup: don't online subsystems before cgroup_name/path() are operational · 07cd1294
      Tejun Heo authored
      While refactoring cgroup creation, a5bca215 ("cgroup: factor out
      cgroup_create() out of cgroup_mkdir()") incorrectly onlined subsystems
      before the new cgroup is associated with its kernfs_node.  This is fine
      for cgroup proper but cgroup_name/path() depend on the associated
      kernfs_node and if a subsystem makes the new cgroup_subsys_state
      visible, which they're allowed to after onlining, it can lead to NULL
      dereference.
      
      The current code performs cgroup creation and subsystem onlining in
      cgroup_create() and cgroup_mkdir() makes the cgroup and subsystems
      visible afterwards.  There's no reason to online the subsystems early
      and we can simply drop cgroup_apply_control_enable() call from
      cgroup_create() so that the subsystems are onlined and made visible at
      the same time.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Fixes: a5bca215 ("cgroup: factor out cgroup_create() out of cgroup_mkdir()")
      Cc: stable@vger.kernel.org # v4.6+
  5. 16 Jan, 2017 3 commits
    • cgroup: call subsys->*attach() only for subsystems which are actually affected by migration · bfc2cf6f
      Tejun Heo authored
      Currently, subsys->*attach() callbacks are called for all subsystems
      which are attached to the hierarchy on which the migration is taking
      place.
      
      With cgroup_migrate_prepare_dst() filtering out identity migrations,
      v1 hierarchies can avoid spurious ->*attach() callback invocations
      where the source and destination csses are identical; however, this
      isn't enough on v2 as only a subset of the attached controllers can be
      affected on controller enable/disable.
      
      While spurious ->*attach() invocations aren't critically broken,
      they're unnecessary overhead and can lead to temporary overcharges on
      certain controllers.  Fix it by tracking which subsystems are affected
      by a migration and invoking ->*attach() callbacks only on those
      subsystems.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
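
      Illustration (not part of the commit): a minimal sketch of
      tracking affected subsystems in a bitmask and invoking callbacks
      only for set bits.  NR_SUBSYS, struct cset and both functions are
      hypothetical; integers stand in for css pointers and an array
      records where ->attach() would have been called.

```c
#include <assert.h>
#include <stdint.h>

#define NR_SUBSYS 4  /* hypothetical subsystem count for the sketch */

/* Hypothetical css_set: one css identifier per subsystem. */
struct cset {
    int css[NR_SUBSYS];
};

/* Set bit i when the source and destination css differ for subsystem i. */
uint16_t affected_ss_mask(const struct cset *src, const struct cset *dst)
{
    uint16_t mask = 0;

    for (int i = 0; i < NR_SUBSYS; i++)
        if (src->css[i] != dst->css[i])
            mask |= (uint16_t)(1u << i);
    return mask;
}

/* Mark called[i] where ->attach() would run; returns the number of calls. */
int run_attach(uint16_t mask, int called[NR_SUBSYS])
{
    int calls = 0;

    for (int i = 0; i < NR_SUBSYS; i++) {
        if (mask & (1u << i)) {
            called[i] = 1;
            calls++;
        }
    }
    return calls;
}
```

      A migration that changes only one subsystem's css produces a
      single-bit mask, so the other subsystems see no callback at all.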
    • cgroup: track migration context in cgroup_mgctx · e595cd70
      Tejun Heo authored
      cgroup migration is performed in four steps - css_set preloading,
      addition of target tasks, actual migration, and clean up.  A list
      named preloaded_csets is used to track the preloading.  This is a bit
      too restricted and the code is already depending on the subtlety that
      all source css_sets appear before destination ones.
      
      Let's create struct cgroup_mgctx which keeps track of everything
      during migration.  Currently, it has separate preload lists for source
      and destination csets and also embeds cgroup_taskset which is used
      during the actual migration.  This moves struct cgroup_taskset
      definition to cgroup-internal.h.
      
      This patch doesn't cause any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
    • cgroup: cosmetic update to cgroup_taskset_add() · d8ebf519
      Tejun Heo authored
      cgroup_taskset_add() was using list_add_tail() for source csets
      but list_move_tail() for destination csets.  As the operations are
      gated by a list_empty() test, list_move_tail() is equivalent to
      list_add_tail() here.  Use list_add_tail() for destination csets too.
      
      This doesn't cause any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
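
      Illustration (not part of the commit): a minimal userspace
      re-implementation of a circular doubly linked list, assuming it
      behaves like the kernel helpers of the same names.  It shows why
      moving an entry that is on no list (list_empty() is true for it)
      gives the same result as adding it.

```c
#include <assert.h>
#include <stdbool.h>

struct list_head {
    struct list_head *next, *prev;
};

/* An initialized, unlinked entry points at itself. */
void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

bool list_empty(const struct list_head *h)
{
    return h->next == h;
}

/* Link entry just before head, i.e. at the tail of the list. */
void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/* Unlink entry from wherever it is, then add it at the tail of head. */
void list_move_tail(struct list_head *entry, struct list_head *head)
{
    /* For a self-pointing entry these two stores change nothing. */
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    list_add_tail(entry, head);
}
```

      Because the unlink step is a no-op on a self-pointing entry, the
      two calls leave identical list states when guarded by
      list_empty().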
  6. 10 Jan, 2017 5 commits
  7. 27 Dec, 2016 15 commits
  8. 26 Dec, 2016 2 commits
    • Linux 4.10-rc1 · 7ce7d89f
      Linus Torvalds authored
    • powerpc: Fix build warning on 32-bit PPC · 8ae679c4
      Larry Finger authored
      I am getting the following warning when I build kernel 4.9-git on my
      PowerBook G4 with a 32-bit PPC processor:
      
          AS      arch/powerpc/kernel/misc_32.o
        arch/powerpc/kernel/misc_32.S:299:7: warning: "CONFIG_FSL_BOOKE" is not defined [-Wundef]
      
      This problem is evident after commit 989cea5c ("kbuild: prevent
      lib-ksyms.o rebuilds"); however, this change in kbuild only exposes an
      error that has been in the code since 2005 when this source file was
      created.  That was with commit 9994a338 ("powerpc: Introduce
      entry_{32,64}.S, misc_{32,64}.S, systbl.S").
      
      The offending line does not make a lot of sense.  This error does not
      seem to cause any errors in the executable, thus I am not recommending
      that it be applied to any stable versions.
      
      Thanks to Nicholas Piggin for suggesting this solution.
      
      Fixes: 9994a338 ("powerpc: Introduce entry_{32,64}.S, misc_{32,64}.S, systbl.S")
      Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
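
      Illustration (not part of the commit): the message does not show
      the changed line, so this is only a generic sketch of how a
      -Wundef warning of this kind arises and one common way to avoid
      it.  CONFIG_EXAMPLE_FEATURE is a hypothetical symbol that is
      deliberately left undefined.

```c
#include <assert.h>

/* Returns 1 when the hypothetical feature is configured, 0 otherwise. */
int feature_enabled(void)
{
/*
 * Writing "#if CONFIG_EXAMPLE_FEATURE" would evaluate the undefined name
 * as 0 and make -Wundef warn.  defined() tests for the symbol without
 * evaluating it, so no warning is produced.
 */
#if defined(CONFIG_EXAMPLE_FEATURE)
    return 1;
#else
    return 0;
#endif
}
```

      Both spellings select the same branch when the symbol is absent,
      which matches the observation that the warning did not affect the
      resulting executable.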
  9. 25 Dec, 2016 8 commits
    • avoid spurious "may be used uninitialized" warning · d33d5a6c
      Linus Torvalds authored
      The timer type simplifications caused a new gcc warning:
      
        drivers/base/power/domain.c: In function ‘genpd_runtime_suspend’:
        drivers/base/power/domain.c:562:14: warning: ‘time_start’ may be used uninitialized in this function [-Wmaybe-uninitialized]
           elapsed_ns = ktime_to_ns(ktime_sub(ktime_get(), time_start));
      
      despite the actual use of "time_start" not having changed in any way.
      It appears that simply changing the type of ktime_t from a union to a
      plain scalar type made gcc check the use.
      
      The variable wasn't actually used uninitialized, but gcc apparently
      failed to notice that the conditional around the use was exactly the
      same as the conditional around the initialization of that variable.
      
      Add an unnecessary initialization just to shut up the compiler.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
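
      Illustration (not part of the commit): a minimal sketch of the
      code shape described above, where a variable is initialized and
      used under the same condition.  fake_now_ns() and timed_suspend()
      are hypothetical; whether a given compiler warns without the
      extra initialization depends on its version and options.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical clock: advances by 100 ns on every call. */
static int64_t fake_now_ns(void)
{
    static int64_t t;

    t += 100;
    return t;
}

/* Returns the elapsed time when measuring, or -1 when not. */
int64_t timed_suspend(bool measure)
{
    int64_t time_start = 0;  /* redundant; only there to quiet the warning */
    int64_t elapsed_ns = -1;

    if (measure)
        time_start = fake_now_ns();

    /* ... the suspend work would happen here ... */

    if (measure)  /* same condition as the initialization above */
        elapsed_ns = fake_now_ns() - time_start;

    return elapsed_ns;
}
```

      The read of time_start is reachable only when the earlier
      assignment has run, so the initializer never changes the result.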
    • Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 3ddc76df
      Linus Torvalds authored
      Pull timer type cleanups from Thomas Gleixner:
       "This series does a tree wide cleanup of types related to
        timers/timekeeping.
      
         - Get rid of cycles_t and use a plain u64. The type is not really
           helpful and caused more confusion than clarity
      
         - Get rid of the ktime union. The union has become useless as we use
           the scalar nanoseconds storage unconditionally now. The 32bit
           timespec alike storage got removed due to the Y2038 limitations
           some time ago.
      
           That leaves the odd union access around for no reason. Clean it up.
      
        Both changes have been done with coccinelle and a small amount of
        manual mopping up"
      
      * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        ktime: Get rid of ktime_equal()
        ktime: Cleanup ktime_set() usage
        ktime: Get rid of the union
        clocksource: Use a plain u64 instead of cycle_t
    • Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · b272f732
      Linus Torvalds authored
      Pull SMP hotplug notifier removal from Thomas Gleixner:
       "This is the final cleanup of the hotplug notifier infrastructure. The
        series has been reintegrated in the last two days because there came a
        new driver using the old infrastructure via the SCSI tree.
      
        Summary:
      
         - convert the last leftover drivers utilizing notifiers
      
         - fixup for a completely broken hotplug user
      
         - prevent setup of already used states
      
         - removal of the notifiers
      
         - treewide cleanup of hotplug state names
      
         - consolidation of state space
      
        There is a sphinx based documentation pending, but that needs review
        from the documentation folks"
      
      * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        irqchip/armada-xp: Consolidate hotplug state space
        irqchip/gic: Consolidate hotplug state space
        coresight/etm3/4x: Consolidate hotplug state space
        cpu/hotplug: Cleanup state names
        cpu/hotplug: Remove obsolete cpu hotplug register/unregister functions
        staging/lustre/libcfs: Convert to hotplug state machine
        scsi/bnx2i: Convert to hotplug state machine
        scsi/bnx2fc: Convert to hotplug state machine
        cpu/hotplug: Prevent overwriting of callbacks
        x86/msr: Remove bogus cleanup from the error path
        bus: arm-ccn: Prevent hotplug callback leak
        perf/x86/intel/cstate: Prevent hotplug callback leak
        ARM/imx/mmcd: Fix broken cpu hotplug handling
        scsi: qedi: Convert to hotplug state machine
    • Merge branch 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux · 10bbe759
      Linus Torvalds authored
      Pull turbostat updates from Len Brown.
      
      * 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
        tools/power turbostat: remove obsolete -M, -m, -C, -c options
        tools/power turbostat: Make extensible via the --add parameter
        tools/power turbostat: Denverton uses a 25 MHz crystal, not 19.2 MHz
        tools/power turbostat: line up headers when -M is used
        tools/power turbostat: fix SKX PKG_CSTATE_LIMIT decoding
        tools/power turbostat: Support Knights Mill (KNM)
        tools/power turbostat: Display HWP OOB status
        tools/power turbostat: fix Denverton BCLK
        tools/power turbostat: use intel-family.h model strings
        tools/power/turbostat: Add Denverton RAPL support
        tools/power/turbostat: Add Denverton support
        tools/power/turbostat: split core MSR support into status + limit
        tools/power turbostat: fix error case overflow read of slm_freq_table[]
        tools/power turbostat: Allocate correct amount of fd and irq entries
        tools/power turbostat: switch to tab delimited output
        tools/power turbostat: Gracefully handle ACPI S3
        tools/power turbostat: tidy up output on Joule counter overflow
    • mm: add PageWaiters indicating tasks are waiting for a page bit · 62906027
      Nicholas Piggin authored
      Add a new page flag, PageWaiters, to indicate the page waitqueue has
      tasks waiting. This can be tested rather than testing waitqueue_active
      which requires another cacheline load.
      
      This bit is always set when the page has tasks on page_waitqueue(page),
      and is set and cleared under the waitqueue lock. It may be set when
      there are no tasks on the waitqueue, which will cause a harmless extra
      wakeup check that will clear the bit.
      
      The generic bit-waitqueue infrastructure is no longer used for pages.
      Instead, waitqueues are used directly with a custom key type. The
      generic code was not flexible enough to have PageWaiters manipulation
      under the waitqueue lock (which simplifies concurrency).
      
      This improves the performance of page lock intensive microbenchmarks by
      2-3%.
      
      Putting two bits in the same word opens the opportunity to remove the
      memory barrier between clearing the lock bit and testing the waiters
      bit, after some work on the arch primitives (e.g., ensuring memory
      operand widths match and cover both bits).
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
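
      Illustration (not part of the commit): a single-threaded sketch
      of the waiters-bit idea, with no real locking or atomics.  The
      bit values, struct page_sketch and the function names are
      hypothetical; a counter stands in for the cost of looking up the
      waitqueue.

```c
#include <assert.h>

#define PG_LOCKED_BIT  0x1u  /* hypothetical bit layout */
#define PG_WAITERS_BIT 0x2u

struct page_sketch {
    unsigned int flags;
    int queued;             /* tasks on the modelled waitqueue */
    int waitqueue_lookups;  /* counts the expensive lookups */
};

/* A task starts waiting: enqueue it and set the waiters bit. */
void wait_on_page(struct page_sketch *p)
{
    p->queued++;
    p->flags |= PG_WAITERS_BIT;
}

/* Wake path: touch the waitqueue, wake everyone, clear the bit when empty. */
static void wake_page(struct page_sketch *p)
{
    p->waitqueue_lookups++;
    p->queued = 0;
    p->flags &= ~PG_WAITERS_BIT;
}

/* Unlock: the waitqueue is consulted only when the waiters bit is set. */
void unlock_page_sketch(struct page_sketch *p)
{
    p->flags &= ~PG_LOCKED_BIT;
    if (p->flags & PG_WAITERS_BIT)
        wake_page(p);
}
```

      An unlock with no waiters touches only the flags word; the
      waitqueue is looked up once a waiter has set the bit.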
    • mm: Use owner_priv bit for PageSwapCache, valid when PageSwapBacked · 6326fec1
      Nicholas Piggin authored
      A page is not added to the swap cache without being swap backed,
      so PageSwapBacked mappings can use PG_owner_priv_1 for PageSwapCache.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
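
      Illustration (not part of the commit): a minimal sketch of a flag
      bit whose meaning is qualified by a second bit.  The bit values
      and page_swap_cache() are hypothetical and do not reflect the
      kernel's actual flag layout.

```c
#include <assert.h>
#include <stdbool.h>

#define PG_SWAPBACKED_BIT   0x1u  /* hypothetical bit positions */
#define PG_OWNER_PRIV_1_BIT 0x2u

/* The shared bit means "in swap cache" only when the page is swap backed. */
bool page_swap_cache(unsigned int flags)
{
    return (flags & PG_SWAPBACKED_BIT) && (flags & PG_OWNER_PRIV_1_BIT);
}
```

      Pages that are not swap backed remain free to use the shared bit
      for another purpose without being mistaken for swap cache pages.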
    • ktime: Get rid of ktime_equal() · 1f3a8e49
      Thomas Gleixner authored
      No point in going through loops and hoops instead of just comparing the
      values.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
    • ktime: Cleanup ktime_set() usage · 8b0e1953
      Thomas Gleixner authored
      ktime_set(S,N) was required for the timespec storage type and is still
      useful for situations where a Seconds and Nanoseconds part of a time value
      needs to be converted. For anything where the Seconds argument is 0, this
      is pointless and can be replaced with a simple assignment.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
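
      Illustration (not part of the commit): a minimal sketch assuming
      a time value stored as scalar nanoseconds in a signed 64-bit
      integer.  The _sketch names are hypothetical and the helper omits
      any overflow handling the kernel version may perform.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t ktime_sketch_t;  /* scalar nanoseconds */

#define NSEC_PER_SEC_SKETCH 1000000000LL

/* Combine a seconds part and a nanoseconds part into one scalar value. */
ktime_sketch_t ktime_set_sketch(int64_t secs, unsigned long nsecs)
{
    return secs * NSEC_PER_SEC_SKETCH + (int64_t)nsecs;
}
```

      With a seconds argument of 0 the helper returns the nanoseconds
      unchanged, so a plain assignment such as "t = 500;" gives the
      same value, and two such values can be compared with ==.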