1. 12 Dec, 2014 3 commits
    • Eric W. Biederman's avatar
      userns; Correct the comment in map_write · 36476bea
      Eric W. Biederman authored
      
      It is important that all maps are less than PAGE_SIZE
      or else setting the last byte of the buffer to '0'
      could write off the end of the allocated storage.
      
      Correct the misleading comment.
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      36476bea
    • Eric W. Biederman's avatar
      userns: Allow setting gid_maps without privilege when setgroups is disabled · 66d2f338
      Eric W. Biederman authored
      
      Now that setgroups can be disabled and not reenabled, setting gid_map
      without privielge can now be enabled when setgroups is disabled.
      
      This restores most of the functionality that was lost when unprivileged
      setting of gid_map was removed.  Applications that use this functionality
      will need to check to see if they use setgroups or init_groups, and if they
      don't they can be fixed by simply disabling setgroups before writing to
      gid_map.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      66d2f338
    • Eric W. Biederman's avatar
      userns: Add a knob to disable setgroups on a per user namespace basis · 9cc46516
      Eric W. Biederman authored
      
      - Expose the knob to user space through a proc file /proc/<pid>/setgroups
      
        A value of "deny" means the setgroups system call is disabled in the
        current processes user namespace and can not be enabled in the
        future in this user namespace.
      
        A value of "allow" means the segtoups system call is enabled.
      
      - Descendant user namespaces inherit the value of setgroups from
        their parents.
      
      - A proc file is used (instead of a sysctl) as sysctls currently do
        not allow checking the permissions at open time.
      
      - Writing to the proc file is restricted to before the gid_map
        for the user namespace is set.
      
        This ensures that disabling setgroups at a user namespace
        level will never remove the ability to call setgroups
        from a process that already has that ability.
      
        A process may opt in to the setgroups disable for itself by
        creating, entering and configuring a user namespace or by calling
        setns on an existing user namespace with setgroups disabled.
        Processes without privileges already can not call setgroups so this
        is a noop.  Prodcess with privilege become processes without
        privilege when entering a user namespace and as with any other path
        to dropping privilege they would not have the ability to call
        setgroups.  So this remains within the bounds of what is possible
        without a knob to disable setgroups permanently in a user namespace.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      9cc46516
  2. 09 Dec, 2014 5 commits
  3. 06 Dec, 2014 1 commit
    • Eric W. Biederman's avatar
      userns: Document what the invariant required for safe unprivileged mappings. · 0542f17b
      Eric W. Biederman authored
      
      The rule is simple.  Don't allow anything that wouldn't be allowed
      without unprivileged mappings.
      
      It was previously overlooked that establishing gid mappings would
      allow dropping groups and potentially gaining permission to files and
      directories that had lesser permissions for a specific group than for
      all other users.
      
      This is the rule needed to fix CVE-2014-8989 and prevent any other
      security issues with new_idmap_permitted.
      
      The reason for this rule is that the unix permission model is old and
      there are programs out there somewhere that take advantage of every
      little corner of it.  So allowing a uid or gid mapping to be
      established without privielge that would allow anything that would not
      be allowed without that mapping will result in expectations from some
      code somewhere being violated.  Violated expectations about the
      behavior of the OS is a long way to say a security issue.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      0542f17b
  4. 08 Aug, 2014 1 commit
  5. 06 Jun, 2014 1 commit
  6. 14 Apr, 2014 1 commit
    • Mikulas Patocka's avatar
      user namespace: fix incorrect memory barriers · e79323bd
      Mikulas Patocka authored
      
      smp_read_barrier_depends() can be used if there is data dependency between
      the readers - i.e. if the read operation after the barrier uses address
      that was obtained from the read operation before the barrier.
      
      In this file, there is only control dependency, no data dependecy, so the
      use of smp_read_barrier_depends() is incorrect. The code could fail in the
      following way:
      * the cpu predicts that idx < entries is true and starts executing the
        body of the for loop
      * the cpu fetches map->extent[0].first and map->extent[0].count
      * the cpu fetches map->nr_extents
      * the cpu verifies that idx < extents is true, so it commits the
        instructions in the body of the for loop
      
      The problem is that in this scenario, the cpu read map->extent[0].first
      and map->nr_extents in the wrong order. We need a full read memory barrier
      to prevent it.
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e79323bd
  7. 03 Apr, 2014 1 commit
    • Paul Gortmaker's avatar
      kernel: audit/fix non-modular users of module_init in core code · c96d6660
      Paul Gortmaker authored
      Code that is obj-y (always built-in) or dependent on a bool Kconfig
      (built-in or absent) can never be modular.  So using module_init as an
      alias for __initcall can be somewhat misleading.
      
      Fix these up now, so that we can relocate module_init from init.h into
      module.h in the future.  If we don't do this, we'd have to add module.h
      to obviously non-modular code, and that would be a worse thing.
      
      The audit targets the following module_init users for change:
       kernel/user.c                  obj-y
       kernel/kexec.c                 bool KEXEC (one instance per arch)
       kernel/profile.c               bool PROFILING
       kernel/hung_task.c             bool DETECT_HUNG_TASK
       kernel/sched/stats.c           bool SCHEDSTATS
       kernel/user_namespace.c        bool USER_NS
      
      Note that direct use of __initcall is discouraged, vs.  one of the
      priority categorized subgroups.  As __initcall gets mapped onto
      device_initcall, our use of subsys_initcall (which makes sense for...
      c96d6660
  8. 20 Feb, 2014 1 commit
  9. 19 Feb, 2014 1 commit
  10. 24 Sep, 2013 1 commit
    • David Howells's avatar
      KEYS: Add per-user_namespace registers for persistent per-UID kerberos caches · f36f8c75
      David Howells authored
      
      Add support for per-user_namespace registers of persistent per-UID kerberos
      caches held within the kernel.
      
      This allows the kerberos cache to be retained beyond the life of all a user's
      processes so that the user's cron jobs can work.
      
      The kerberos cache is envisioned as a keyring/key tree looking something like:
      
      	struct user_namespace
      	  \___ .krb_cache keyring		- The register
      		\___ _krb.0 keyring		- Root's Kerberos cache
      		\___ _krb.5000 keyring		- User 5000's Kerberos cache
      		\___ _krb.5001 keyring		- User 5001's Kerberos cache
      			\___ tkt785 big_key	- A ccache blob
      			\___ tkt12345 big_key	- Another ccache blob
      
      Or possibly:
      
      	struct user_namespace
      	  \___ .krb_cache keyring		- The register
      		\___ _krb.0 keyring		- Root's Kerberos cache
      		\___ _krb.5000 keyring		- User 5000's Kerberos cache
      		\___ _krb.5001 keyring		- User 5001's Kerberos cache
      			\___ tkt785 keyring	- A ccache
      				\___ krbtgt/REDHAT.COM@REDHAT.COM big_key
      				\___ http/REDHAT.COM@REDHAT.COM user
      				\___ afs/REDHAT.COM@REDHAT.COM user
      				\___ nfs/REDHAT.COM@REDHAT.COM user
      				\___ krbtgt/KERNEL.ORG@KERNEL.ORG big_key
      				\___ http/KERNEL.ORG@KERNEL.ORG big_key
      
      What goes into a particular Kerberos cache is entirely up to userspace.  Kernel
      support is limited to giving you the Kerberos cache keyring that you want.
      
      The user asks for their Kerberos cache by:
      
      	krb_cache = keyctl_get_krbcache(uid, dest_keyring);
      
      The uid is -1 or the user's own UID for the user's own cache or the uid of some
      other user's cache (requires CAP_SETUID).  This permits rpc.gssd or whatever to
      mess with the cache.
      
      The cache returned is a keyring named "_krb.<uid>" that the possessor can read,
      search, clear, invalidate, unlink from and add links to.  Active LSMs get a
      chance to rule on whether the caller is permitted to make a link.
      
      Each uid's cache keyring is created when it first accessed and is given a
      timeout that is extended each time this function is called so that the keyring
      goes away after a while.  The timeout is configurable by sysctl but defaults to
      three days.
      
      Each user_namespace struct gets a lazily-created keyring that serves as the
      register.  The cache keyrings are added to it.  This means that standard key
      search and garbage collection facilities are available.
      
      The user_namespace struct's register goes away when it does and anything left
      in it is then automatically gc'd.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarSimo Sorce <simo@redhat.com>
      cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      cc: Eric W. Biederman <ebiederm@xmission.com>
      f36f8c75
  11. 27 Aug, 2013 1 commit
    • Eric W. Biederman's avatar
      userns: Better restrictions on when proc and sysfs can be mounted · e51db735
      Eric W. Biederman authored
      
      Rely on the fact that another flavor of the filesystem is already
      mounted and do not rely on state in the user namespace.
      
      Verify that the mounted filesystem is not covered in any significant
      way.  I would love to verify that the previously mounted filesystem
      has no mounts on top but there are at least the directories
      /proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly
      for other filesystems to mount on top of.
      
      Refactor the test into a function named fs_fully_visible and call that
      function from the mount routines of proc and sysfs.  This makes this
      test local to the filesystems involved and the results current of when
      the mounts take place, removing a weird threading of the user
      namespace, the mount namespace and the filesystems themselves.
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      e51db735
  12. 08 Aug, 2013 1 commit
  13. 06 Aug, 2013 1 commit
  14. 01 May, 2013 1 commit
  15. 15 Apr, 2013 3 commits
  16. 27 Mar, 2013 2 commits
    • Eric W. Biederman's avatar
      userns: Restrict when proc and sysfs can be mounted · 87a8ebd6
      Eric W. Biederman authored
      
      Only allow unprivileged mounts of proc and sysfs if they are already
      mounted when the user namespace is created.
      
      proc and sysfs are interesting because they have content that is
      per namespace, and so fresh mounts are needed when new namespaces
      are created while at the same time proc and sysfs have content that
      is shared between every instance.
      
      Respect the policy of who may see the shared content of proc and sysfs
      by only allowing new mounts if there was an existing mount at the time
      the user namespace was created.
      
      In practice there are only two interesting cases: proc and sysfs are
      mounted at their usual places, proc and sysfs are not mounted at all
      (some form of mount namespace jail).
      
      Cc: stable@vger.kernel.org
      Acked-by: default avatarSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      87a8ebd6
    • Eric W. Biederman's avatar
      userns: Don't allow creation if the user is chrooted · 3151527e
      Eric W. Biederman authored
      
      Guarantee that the policy of which files may be access that is
      established by setting the root directory will not be violated
      by user namespaces by verifying that the root directory points
      to the root of the mount namespace at the time of user namespace
      creation.
      
      Changing the root is a privileged operation, and as a matter of policy
      it serves to limit unprivileged processes to files below the current
      root directory.
      
      For reasons of simplicity and comprehensibility the privilege to
      change the root directory is gated solely on the CAP_SYS_CHROOT
      capability in the user namespace.  Therefore when creating a user
      namespace we must ensure that the policy of which files may be access
      can not be violated by changing the root directory.
      
      Anyone who runs a processes in a chroot and would like to use user
      namespace can setup the same view of filesystems with a mount
      namespace instead.  With this result that this is not a practical
      limitation for using user namespaces.
      
      Cc: stable@vger.kernel.org
      Acked-by: default avatarSerge Hallyn <serge.hallyn@canonical.com>
      Reported-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      3151527e
  17. 13 Mar, 2013 1 commit
    • Eric W. Biederman's avatar
      userns: Don't allow CLONE_NEWUSER | CLONE_FS · e66eded8
      Eric W. Biederman authored
      
      Don't allowing sharing the root directory with processes in a
      different user namespace.  There doesn't seem to be any point, and to
      allow it would require the overhead of putting a user namespace
      reference in fs_struct (for permission checks) and incrementing that
      reference count on practically every call to fork.
      
      So just perform the inexpensive test of forbidding sharing fs_struct
      acrosss processes in different user namespaces.  We already disallow
      other forms of threading when unsharing a user namespace so this
      should be no real burden in practice.
      
      This updates setns, clone, and unshare to disallow multiple user
      namespaces sharing an fs_struct.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e66eded8
  18. 27 Jan, 2013 2 commits
    • Eric W. Biederman's avatar
      userns: Allow any uid or gid mappings that don't overlap. · 0bd14b4f
      Eric W. Biederman authored
      
      When I initially wrote the code for /proc/<pid>/uid_map.  I was lazy
      and avoided duplicate mappings by the simple expedient of ensuring the
      first number in a new extent was greater than any number in the
      previous extent.
      
      Unfortunately that precludes a number of valid mappings, and someone
      noticed and complained.  So use a simple check to ensure that ranges
      in the mapping extents don't overlap.
      Acked-by: default avatarSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      0bd14b4f
    • Eric W. Biederman's avatar
      userns: Avoid recursion in put_user_ns · c61a2810
      Eric W. Biederman authored
      
      When freeing a deeply nested user namespace free_user_ns calls
      put_user_ns on it's parent which may in turn call free_user_ns again.
      When -fno-optimize-sibling-calls is passed to gcc one stack frame per
      user namespace is left on the stack, potentially overflowing the
      kernel stack.  CONFIG_FRAME_POINTER forces -fno-optimize-sibling-calls
      so we can't count on gcc to optimize this code.
      
      Remove struct kref and use a plain atomic_t.  Making the code more
      flexible and easier to comprehend.  Make the loop in free_user_ns
      explict to guarantee that the stack does not overflow with
      CONFIG_FRAME_POINTER enabled.
      
      I have tested this fix with a simple program that uses unshare to
      create a deeply nested user namespace structure and then calls exit.
      With 1000 nesteuser namespaces before this change running my test
      program causes the kernel to die a horrible death.  With 10,000,000
      nested user namespaces after this change my test program runs to
      completion and causes no harm.
      Acked-by: default avatarSerge Hallyn <serge.hallyn@canonical.com>
      Pointed-out-by: default avatarVasily Kulikov <segoon@openwall.com>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      c61a2810
  19. 15 Dec, 2012 1 commit
  20. 20 Nov, 2012 5 commits
  21. 18 Sep, 2012 1 commit
    • Eric W. Biederman's avatar
      userns: Add kprojid_t and associated infrastructure in projid.h · f76d207a
      Eric W. Biederman authored
      
      Implement kprojid_t a cousin of the kuid_t and kgid_t.
      
      The per user namespace mapping of project id values can be set with
      /proc/<pid>/projid_map.
      
      A full compliment of helpers is provided: make_kprojid, from_kprojid,
      from_kprojid_munged, kporjid_has_mapping, projid_valid, projid_eq,
      projid_eq, projid_lt.
      
      Project identifiers are part of the generic disk quota interface,
      although it appears only xfs implements project identifiers currently.
      
      The xfs code allows anyone who has permission to set the project
      identifier on a file to use any project identifier so when
      setting up the user namespace project identifier mappings I do
      not require a capability.
      
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      f76d207a
  22. 03 May, 2012 1 commit
  23. 26 Apr, 2012 2 commits
    • Eric W. Biederman's avatar
      userns: Rework the user_namespace adding uid/gid mapping support · 22d917d8
      Eric W. Biederman authored
      
      - Convert the old uid mapping functions into compatibility wrappers
      - Add a uid/gid mapping layer from user space uid and gids to kernel
        internal uids and gids that is extent based for simplicty and speed.
        * Working with number space after mapping uids/gids into their kernel
          internal version adds only mapping complexity over what we have today,
          leaving the kernel code easy to understand and test.
      - Add proc files /proc/self/uid_map /proc/self/gid_map
        These files display the mapping and allow a mapping to be added
        if a mapping does not exist.
      - Allow entering the user namespace without a uid or gid mapping.
        Since we are starting with an existing user our uids and gids
        still have global mappings so are still valid and useful they just don't
        have local mappings.  The requirement for things to work are global uid
        and gid so it is odd but perfectly fine not to have a local uid
        and gid mapping.
        Not requiring global uid and gid mappings greatly simplifies
        the logic of setting up the uid and gid mappings by allowing
        the mappings to be set after the namespace is created which makes the
        slight weirdness worth it.
      - Make the mappings in the initial user namespace to the global
        uid/gid space explicit.  Today it is an identity mapping
        but in the future we may want to twist this for debugging, similar
        to what we do with jiffies.
      - Document the memory ordering requirements of setting the uid and
        gid mappings.  We only allow the mappings to be set once
        and there are no pointers involved so the requirments are
        trivial but a little atypical.
      
      Performance:
      
      In this scheme for the permission checks the performance is expected to
      stay the same as the actuall machine instructions should remain the same.
      
      The worst case I could think of is ls -l on a large directory where
      all of the stat results need to be translated with from kuids and
      kgids to uids and gids.  So I benchmarked that case on my laptop
      with a dual core hyperthread Intel i5-2520M cpu with 3M of cpu cache.
      
      My benchmark consisted of going to single user mode where nothing else
      was running. On an ext4 filesystem opening 1,000,000 files and looping
      through all of the files 1000 times and calling fstat on the
      individuals files.  This was to ensure I was benchmarking stat times
      where the inodes were in the kernels cache, but the inode values were
      not in the processors cache.  My results:
      
      v3.4-rc1:         ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
      v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
      v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)
      
      All of the configurations ran in roughly 120ns when I performed tests
      that ran in the cpu cache.
      
      So in summary the performance impact is:
      1ns improvement in the worst case with user namespace support compiled out.
      8ns aka 5% slowdown in the worst case with user namespace support compiled in.
      Acked-by: default avatarSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      22d917d8
    • Eric W. Biederman's avatar
      userns: Simplify the user_namespace by making userns->creator a kuid. · 783291e6
      Eric W. Biederman authored
      
      - Transform userns->creator from a user_struct reference to a simple
        kuid_t, kgid_t pair.
      
        In cap_capable this allows the check to see if we are the creator of
        a namespace to become the classic suser style euid permission check.
      
        This allows us to remove the need for a struct cred in the mapping
        functions and still be able to dispaly the user namespace creators
        uid and gid as 0.
      
      - Remove the now unnecessary delayed_work in free_user_ns.
      
        All that is left for free_user_ns to do is to call kmem_cache_free
        and put_user_ns.  Those functions can be called in any context
        so call them directly from free_user_ns removing the need for delayed work.
      Acked-by: default avatarSerge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      783291e6
  24. 08 Apr, 2012 1 commit
  25. 07 Apr, 2012 1 commit