- 06 Apr, 2016 6 commits
-
-
Seth Forshee authored
Unprivileged users are normally restricted from mounting with the allow_other option by system policy, but this could be bypassed for a mount done with user namespace root permissions. In such cases allow_other should not allow users outside the userns to access the mount as doing so would give the unprivileged user the ability to manipulate processes it would otherwise be unable to manipulate. Restrict allow_other to apply to users in the same userns used at mount or a descendant of that namespace. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Seth Forshee authored
In order to support mounts from namespaces other than init_user_ns, fuse must translate uids and gids to/from the userns of the process servicing requests on /dev/fuse. This patch does that, with a couple of restrictions on the namespace: - The userns for the fuse connection is fixed to the namespace from which /dev/fuse is opened. - The namespace must be the same as s_user_ns. These restrictions simplify the implementation by avoiding the need to pass around userns references and by allowing fuse to rely on the checks in inode_change_ok for ownership changes. Either restriction could be relaxed in the future if needed. For cuse the namespace used for the connection is also simply current_user_ns() at the time /dev/cuse is opened. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Seth Forshee authored
If the userspace process servicing fuse requests is running in a pid namespace then pids passed via the fuse fd need to be translated relative to that namespace. Capture the pid namespace in use when the filesystem is mounted and use this for pid translation. Since no use case currently exists for changing namespaces all translations are done relative to the pid namespace in use when /dev/fuse is opened. Mounting or /dev/fuse IO from another namespace will return errors. Requests from processes whose pid cannot be translated into the target namespace are not permitted, except for requests allocated via fuse_get_req_nofail_nopages. For no-fail requests in.h.pid will be 0 if the pid translation fails. File locking changes based on previous work done by Eric Biederman. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Seth Forshee authored
A privileged user in s_user_ns will generally have the ability to manipulate the backing store and insert security.* xattrs into the filesystem directly. Therefore the kernel must be prepared to handle these xattrs from unprivileged mounts, and it makes little sense for commoncap to prevent writing these xattrs to the filesystem. The capability and LSM code have already been updated to appropriately handle xattrs from unprivileged mounts, so it is safe to loosen this restriction on setting xattrs. The exception to this logic is that writing xattrs to a mounted filesystem may also cause the LSM inode_post_setxattr or inode_setsecurity callbacks to be invoked. SELinux will deny the xattr update by virtue of applying mountpoint labeling to unprivileged userns mounts, and Smack will deny the writes for any user without global CAP_MAC_ADMIN, so loosening the capability check in commoncap is safe in this respect as well. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Seth Forshee authored
Superblock level remounts are currently restricted to global CAP_SYS_ADMIN, as is the path for changing the root mount to read only on umount. Loosen both of these permission checks to also allow CAP_SYS_ADMIN in any namespace which is privileged towards the userns which originally mounted the filesystem. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Seth Forshee authored
Expand the check in should_remove_suid() to keep privileges for CAP_FSETID in s_user_ns rather than init_user_ns. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
- 29 Feb, 2016 34 commits
-
-
Seth Forshee authored
ids in on-disk ACLs should be converted to s_user_ns instead of init_user_ns as is done now. This introduces the possibility for id mappings to fail, and when this happens syscalls will return EOVERFLOW. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Seth Forshee authored
Add checks to inode_change_ok to verify that uid and gid changes will map into the superblock's user namespace. If they do not, fail with -EOVERFLOW. This cannot be overridden with ATTR_FORCE. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
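The mapping check described above can be illustrated with a small userspace sketch. This is not the kernel code: the `sketch_extent` structure, the single-extent map, and the function names are hypothetical stand-ins, assuming only that an id which does not map into the superblock's namespace translates to an invalid id and that the ownership change is then rejected with -EOVERFLOW.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical single-extent id map: ns ids [ns_first, ns_first + count)
 * correspond to kernel ids [kernel_first, kernel_first + count). */
struct sketch_extent {
    uint32_t ns_first;
    uint32_t kernel_first;
    uint32_t count;
};

/* Stand-in for INVALID_UID / INVALID_GID. */
#define SKETCH_INVALID_ID ((uint32_t)-1)

/* Translate a kernel-internal id into the namespace, or return
 * SKETCH_INVALID_ID when the id falls outside the mapped extent. */
static uint32_t sketch_map_id_up(const struct sketch_extent *e, uint32_t kid)
{
    if (kid >= e->kernel_first && kid - e->kernel_first < e->count)
        return e->ns_first + (kid - e->kernel_first);
    return SKETCH_INVALID_ID;
}

/* Reject an ownership change whose new id has no mapping in the
 * superblock's namespace. There is deliberately no "force" escape hatch. */
static int sketch_check_owner_change(const struct sketch_extent *sb_map,
                                     uint32_t new_kid)
{
    if (sketch_map_id_up(sb_map, new_kid) == SKETCH_INVALID_ID)
        return -EOVERFLOW;
    return 0;
}
```

With a map covering kernel ids 100000 through 165535, a change to id 100000 is accepted, while a change to id 0 or 165536 is refused because neither can be represented in the namespace.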
-
Seth Forshee authored
Using INVALID_[UG]ID for the LSM file creation context doesn't make sense, so return an error if the inode passed to set_create_file_as() has an invalid id. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Seth Forshee authored
Filesystem uids which don't map into a user namespace may result in inode->i_uid being INVALID_UID. A symlink and its parent which have different owners in the filesystem can both get mapped to INVALID_UID, which may result in following a symlink when this would not have otherwise been permitted when protected symlinks are enabled. Add a new helper function, uid_valid_eq(), and use this to validate that the ids in may_follow_link() are both equal and valid. Also add an equivalent helper for gids, which is currently unused. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
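A minimal userspace sketch of the "equal and valid" comparison follows. The `sketch_` types and functions are hypothetical stand-ins for the kernel's kuid_t, INVALID_UID and the uid_valid_eq() helper named above; the real helper's exact signature is not shown in this log and is assumed here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-ins for kuid_t and INVALID_UID (assumptions). */
typedef struct { uint32_t val; } sketch_kuid_t;
#define SKETCH_INVALID_UID ((uint32_t)-1)

static bool sketch_uid_valid(sketch_kuid_t uid)
{
    return uid.val != SKETCH_INVALID_UID;
}

/* Plain equality: two unmapped ids compare equal, which is the
 * problem case for the protected-symlinks check. */
static bool sketch_uid_eq(sketch_kuid_t a, sketch_kuid_t b)
{
    return a.val == b.val;
}

/* Equal AND valid: two ids that both collapsed to the invalid value
 * are never treated as the same owner. */
static bool sketch_uid_valid_eq(sketch_kuid_t a, sketch_kuid_t b)
{
    return sketch_uid_valid(a) && sketch_uid_eq(a, b);
}
```

The difference shows up only for unmapped owners: plain equality reports a match for two invalid ids, while the valid-and-equal form does not.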
-
Seth Forshee authored
The SMACK64, SMACK64EXEC, and SMACK64MMAP labels are all handled differently in untrusted mounts. This is confusing and potentially problematic. Change this to handle them all the same way that SMACK64 is currently handled; that is, read the label from disk and check it at use time. For SMACK64 and SMACK64MMAP access is denied if the label does not match smk_root. To be consistent with suid, a SMACK64EXEC label which does not match smk_root will still allow execution of the file but will not run with the label supplied in the xattr. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Seth Forshee authored
All current callers of in_userns pass current_user_ns as the first argument. Simplify by replacing in_userns with current_in_userns which checks whether current_user_ns is in the namespace supplied as an argument. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: James Morris <james.l.morris@oracle.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Seth Forshee authored
Security labels from unprivileged mounts in user namespaces must be ignored. Force superblocks from user namespaces whose labeling behavior is to use xattrs to use mountpoint labeling instead. For the mountpoint label, default to converting the current task context into a form suitable for file objects, but also allow the policy writer to specify a different label through policy transition rules. Pieced together from code snippets provided by Stephen Smalley. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: James Morris <james.l.morris@oracle.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Andy Lutomirski authored
If a process gets access to a mount from a different user namespace, that process should not be able to take advantage of setuid files or selinux entrypoints from that filesystem. Prevent this by treating mounts from other mount namespaces and those not owned by current_user_ns() or an ancestor as nosuid. This will make it safer to allow more complex filesystems to be mounted in non-root user namespaces. This does not remove the need for MNT_LOCK_NOSUID. The setuid, setgid, and file capability bits can no longer be abused if code in a user namespace were to clear nosuid on an untrusted filesystem, but this patch, by itself, is insufficient to protect the system from abuse of files that, when execed, would increase MAC privilege. As a more concrete explanation, any task that can manipulate a vfsmount associated with a given user namespace already has capabilities in that namespace and all of its descendants. If they can cause a malicious setuid, setgid, or file-caps executable to appear in that mount, then that executable will only allow them to elevate privileges in exactly the set of namespaces in which they are already privileged. On the other hand, if they can cause a malicious executable to appear with a dangerous MAC label, running it could change the caller's security context in a way that should not have been possible, even inside the namespace in which the task is confined. As a hardening measure, this would have made CVE-2014-5207 much more difficult to exploit. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: James Morris <james.l.morris@oracle.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Seth Forshee authored
Unprivileged users should not be able to mount block devices when they lack sufficient privileges towards the block device inode. Update blkdev_get_by_path() to validate that the user has the required access to the inode at the specified path. The check will be skipped for CAP_SYS_ADMIN, so privileged mounts will continue working as before. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Seth Forshee authored
When looking up a block device by path no permission check is done to verify that the user has access to the block device inode at the specified path. In some cases it may be necessary to check permissions towards the inode, such as allowing unprivileged users to mount block devices in user namespaces. Add an argument to lookup_bdev() to optionally perform this permission check. A value of 0 skips the permission check and behaves the same as before. A non-zero value specifies the mask of access rights required towards the inode at the specified path. The check is always skipped if the user has CAP_SYS_ADMIN. All callers of lookup_bdev() currently pass a mask of 0, so this patch results in no functional change. Subsequent patches will add permission checks where appropriate. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
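The three rules for the new mask argument (0 skips the check, non-zero requires those rights, CAP_SYS_ADMIN always skips) can be sketched in userspace as follows. The function name, the `granted` bitmask and the -EACCES return value are assumptions made for illustration; the log does not state which error the real check returns.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical access bits, mirroring the usual read/write split. */
enum { SKETCH_MAY_WRITE = 2, SKETCH_MAY_READ = 4 };

/* mask:    access rights required towards the device inode (0 = no check)
 * granted: rights the caller actually holds on that inode
 * Returns 0 when the lookup may proceed, a negative errno otherwise. */
static int sketch_bdev_permission(int mask, int granted, bool has_cap_sys_admin)
{
    /* A zero mask keeps the pre-patch behaviour; CAP_SYS_ADMIN bypasses. */
    if (mask == 0 || has_cap_sys_admin)
        return 0;
    /* Every requested right must be granted. */
    return (granted & mask) == mask ? 0 : -EACCES;
}
```

Because existing callers pass 0, they take the first branch unconditionally, which is why the patch is described as having no functional change on its own.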
-
Seth Forshee authored
Security labels from unprivileged mounts cannot be trusted. Ideally for these mounts we would assign the objects in the filesystem the same label as the inode for the backing device passed to mount. Unfortunately it's currently impossible to determine which inode this is from the LSM mount hooks, so we settle for the label of the process doing the mount. This label is assigned to s_root, and also to smk_default to ensure that new inodes receive this label. The transmute property is also set on s_root to make this behavior more explicit, even though it is technically not necessary. If a filesystem has existing security labels, access to inodes is permitted if the label is the same as smk_root, otherwise access is denied. The SMACK64EXEC xattr is completely ignored. Explicit setting of security labels continues to require CAP_MAC_ADMIN in init_user_ns. Altogether, this ensures that filesystem objects are not accessible to subjects which cannot already access the backing store, that MAC is not violated for any objects in the filesystem which are already labeled, and that a user cannot use an unprivileged mount to gain elevated MAC privileges. sysfs, tmpfs, and ramfs are already mountable from user namespaces and support security labels. We can't rule out the possibility that these filesystems may already be used in mounts from user namespaces with security labels set from the init namespace, so failing to trust labels in these filesystems may introduce regressions. It is safe to trust labels from these filesystems, since the unprivileged user does not control the backing store and thus cannot supply security labels, so an explicit exception is made to trust labels from these filesystems. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Seth Forshee authored
Capability sets attached to files must be ignored except in the user namespaces where the mounter is privileged, i.e. s_user_ns and its descendants. Otherwise a vector exists for gaining privileges in namespaces where a user is not already privileged. Add a new helper function, in_user_ns(), to test whether a user namespace is the same as or a descendant of another namespace. Use this helper to determine whether a file's capability set should be applied to the caps constructed during exec. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
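The "same as or a descendant of" test can be sketched as a walk up the parent chain. This is a userspace illustration only: `struct sketch_userns` with a bare parent pointer is a hypothetical stand-in for the kernel's user namespace structure, and the function below is not the in_user_ns() helper itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical namespace node; parent is NULL for the initial namespace. */
struct sketch_userns {
    const struct sketch_userns *parent;
};

/* Return true if ns is target_ns itself or one of its descendants. */
static bool sketch_in_userns(const struct sketch_userns *ns,
                             const struct sketch_userns *target_ns)
{
    for (; ns != NULL; ns = ns->parent) {
        if (ns == target_ns)
            return true;
    }
    return false;
}
```

Applied to file capabilities, the exec path would honour a file's capability set only when the executing task's namespace passes this test against the superblock's namespace; an ancestor or sibling namespace fails it and the capability set is ignored.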
-
Seth Forshee authored
Initially this will be used to eliminate the implicit MNT_NODEV flag for mounts from user namespaces. In the future it will also be used for translating ids and checking capabilities for filesystems mounted from user namespaces. s_user_ns is initialized in alloc_super() and is generally set to current_user_ns(). To avoid security and corruption issues, two additional mount checks are also added: - do_new_mount() gains a check that the user has CAP_SYS_ADMIN in current_user_ns(). - sget() will fail with EBUSY when the filesystem it's looking for is already mounted from another user namespace. proc requires some special handling. The user namespace of current isn't appropriate when forking as a result of clone(2) with CLONE_NEWPID|CLONE_NEWUSER, as it will set s_user_ns to the namespace of the parent and make proc unmountable in the new user namespace. Instead, the user namespace which owns the new pid namespace is used. sget_userns() is added to allow passing in a namespace other than that of current, and sget becomes a wrapper around sget_userns() which passes current_user_ns(). Changes to original version of this patch * Documented @user_ns in sget_userns, alloc_super and fs.h * Kept a blank line in fs.h * Removed unnecessary include of user_namespace.h from fs.h * Tweaked the location of get_user_ns and put_user_ns so the security modules can (if they wish) depend on it. -- EWB Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Xiangliang Yu authored
BugLink: http://bugs.launchpad.net/bugs/1542071 This adds support for AMD's PCI-Express Non-Transparent Bridge (NTB) device on the Zeppelin platform. The driver connects to the standard NTB sub-system interface, with modification to add hooks for power management in a separate patch. The AMD NTB device has 3 memory windows, 16 doorbells, 16 scratch-pad registers, and supports up to 16 PCIe lanes running at Gen3 speeds. Signed-off-by: Xiangliang Yu <Xiangliang.Yu@amd.com> Reviewed-by: Allen Hubbe <Allen.Hubbe@emc.com> Signed-off-by: Jon Mason <jdmason@kudzu.us> (cherry picked from commit a1b36958) Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Tim Gardner authored
BugLink: http://bugs.launchpad.net/bugs/1542071 Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Joseph Salisbury authored
BugLink: http://bugs.launchpad.net/bugs/1495983 OriginalAuthor: Olaf Hering <olaf@aepfle.de> Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Brad Figg <brad.figg@canonical.com> Acked-by: Tim Gardner <tim.gardner@canonical.com> Signed-off-by: Brad Figg <brad.figg@canonical.com>
-
Tim Gardner authored
BugLink: http://bugs.launchpad.net/bugs/1545542 Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Suman Tripathi authored
ahci_xgene: Implement the workaround to fix the missing of the edge interrupt for the HOST_IRQ_STAT. Due to H/W errata, the HOST_IRQ_STAT register misses the edge interrupt when clearing the HOST_IRQ_STAT register and hardware reporting the PORT_IRQ_STAT register happens to be at the same clock cycle. Signed-off-by: Suman Tripathi <stripathi@apm.com> Signed-off-by: Tejun Heo <tj@kernel.org> (cherry picked from linux-next commit 32aea268) Signed-off-by: dann frazier <dann.frazier@canonical.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Suman Tripathi authored
The flexibility to override the irq handlers in the LLDs is already present, so controllers implementing an edge trigger latch can implement their own interrupt handler inside the driver. This patch removes the AHCI_HFLAG_EDGE_IRQ support from libahci and moves edge irq handling to ahci_xgene. tj: Minor update to description. Signed-off-by: Suman Tripathi <stripathi@apm.com> Signed-off-by: Tejun Heo <tj@kenrel.org> (cherry picked from linux-next commit d867b95f) [ dannf: offset adjustments ] Signed-off-by: dann frazier <dann.frazier@canonical.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Suman Tripathi authored
This patch implements the capability to override the generic AHCI interrupt handler so that specific ahci drivers can implement their own custom interrupt handler routines. It also exports ahci_handle_port_intr so that custom irq_handler implementations can use it. tj: s/ahci_irq_handler/irq_handler/ and updated description. Signed-off-by: Suman Tripathi <stripathi@apm.com> Signed-off-by: Tejun Heo <tj@kernel.org> (cherry picked from linux-next commit f070d671) [ dannf: backported to v4.4 ] Signed-off-by: dann frazier <dann.frazier@canonical.com> Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
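The override mechanism amounts to a per-host function pointer that defaults to the generic handler. The sketch below shows that pattern in plain userspace C; every name in it (`sketch_host`, `sketch_generic_irq`, and so on) is hypothetical, and it does not reproduce the libahci structures or the real irq_handler prototype.

```c
#include <stddef.h>

struct sketch_host;
typedef int (*sketch_irq_handler_t)(struct sketch_host *host);

/* Hypothetical per-host private data holding the handler to call. */
struct sketch_host {
    sketch_irq_handler_t irq_handler;
    int handled_by;               /* records which handler ran last */
};

enum { SKETCH_GENERIC = 1, SKETCH_CUSTOM = 2 };

/* Generic handler used when the driver supplies no override. */
static int sketch_generic_irq(struct sketch_host *host)
{
    host->handled_by = SKETCH_GENERIC;
    return 1;
}

/* A driver-specific handler, e.g. one working around an edge-latch quirk. */
static int sketch_custom_irq(struct sketch_host *host)
{
    host->handled_by = SKETCH_CUSTOM;
    return 1;
}

/* Install the override if one was given, else fall back to the generic one. */
static void sketch_host_init(struct sketch_host *host,
                             sketch_irq_handler_t override)
{
    host->irq_handler = override != NULL ? override : sketch_generic_irq;
    host->handled_by = 0;
}

/* Interrupt entry point: always dispatches through the pointer. */
static int sketch_dispatch_irq(struct sketch_host *host)
{
    return host->irq_handler(host);
}
```

Keeping the dispatch indirect means the core never needs a per-controller flag: a driver with unusual interrupt behaviour installs its own routine and can still call shared per-port helpers from inside it.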
-
Nicholas Krause authored
This adds the needed check after the call to the function mraid_mm_alloc_kioc in order to make sure that this function has not returned NULL and therefore makes sure we do not dereference a NULL pointer if one is returned by mraid_mm_alloc_kioc. Furthermore, add needed comments explaining that this function call can return NULL if the list head is empty for the pointer passed in order to allow future users to understand this required pointer check. Signed-off-by: Nicholas Krause <xerofoify@gmail.com> Acked-by: Sumit Saxena <sumit.saxena@avagotech.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> (cherry picked from commit 7296f62f) Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
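The pattern being fixed, an allocator that pops from a free list and returns NULL when the list is empty, can be sketched as below. The pool layout, the function names and the choice of -ENOMEM as the error are assumptions for illustration; the log does not say which value the driver returns.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical command packet kept on a singly linked free list. */
struct sketch_kioc {
    struct sketch_kioc *next;
};

struct sketch_pool {
    struct sketch_kioc *free_list;
};

/* Pop one packet from the pool. Returns NULL if the free list is empty,
 * so every caller must check the result before using it. */
static struct sketch_kioc *sketch_alloc_kioc(struct sketch_pool *pool)
{
    struct sketch_kioc *kioc = pool->free_list;

    if (kioc != NULL)
        pool->free_list = kioc->next;
    return kioc;
}

/* Caller with the required NULL check in place. */
static int sketch_submit(struct sketch_pool *pool)
{
    struct sketch_kioc *kioc = sketch_alloc_kioc(pool);

    if (kioc == NULL)
        return -ENOMEM;   /* assumed error code; nothing is dereferenced */

    /* ... the packet would be filled in and issued here ... */

    /* Return the packet to the pool when done. */
    kioc->next = pool->free_list;
    pool->free_list = kioc;
    return 0;
}
```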
-
Tim Gardner authored
Ignore: yes Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Tim Gardner authored
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Sebastian Ott authored
BugLink: http://bugs.launchpad.net/bugs/1541534 Per channel path measurement characteristics are obtained during channel path registration. However if some properties of a channel path change we don't update the measurement characteristics. Make sure to update the characteristics when we change the properties of a channel path or receive a notification from FW about such a change. Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com> Reviewed-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> (cherry picked from commit 9f3d6d7a) Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Sebastian Ott authored
BugLink: http://bugs.launchpad.net/bugs/1541534 Make sure that in all cases where we could not obtain measurement characteristics the associated fields are set to invalid values. Note: without this change the "shared" capability of a channel path for which we could not obtain the measurement characteristics was incorrectly displayed as 0 (not shared). We will now correctly report "unknown" in this case. Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com> Reviewed-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> (cherry picked from commit 61f0bfcf) Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Sebastian Ott authored
BugLink: http://bugs.launchpad.net/bugs/1541534 Measurement characteristics are allocated during channel path registration but not freed during deregistration. Fix this by embedding these characteristics inside struct channel_path. Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com> Reviewed-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> (cherry picked from commit 0d9bfe91) Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Ursula Braun authored
BugLink: http://bugs.launchpad.net/bugs/1541907 /sys/class/net/<interface>/operstate for an active qeth network interface often shows "unknown", which translates to "state UNKNOWN" in the output of "ip link show". It is caused by a missing initialization of the __LINK_STATE_NOCARRIER bit in the net_device state field. This patch adds a netif_carrier_off() invocation when creating the net_device for a qeth device. Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Acked-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com> Reference-ID: Bugzilla 133209 Signed-off-by: David S. Miller <davem@davemloft.net> (cherry picked from commit e5ebe632) Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Tim Gardner authored
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Dan Williams authored
BugLink: http://bugs.launchpad.net/bugs/1534647 ZONE_DEVICE (merged in 4.3) and ZONE_CMA (proposed) are examples of new mm zones that are bumping up against the current maximum limit of 4 zones, i.e. 2 bits in page->flags. When adding a zone this equation still needs to be satisfied: SECTIONS_WIDTH + ZONES_WIDTH + NODES_SHIFT + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS. ZONE_DEVICE currently tries to satisfy this equation by requiring that ZONE_DMA be disabled, but this is untenable given generic kernels want to support ZONE_DEVICE and ZONE_DMA simultaneously. ZONE_CMA would like to increase the amount of memory covered per section, but that limits the minimum granularity at which consecutive memory ranges can be added via devm_memremap_pages(). The trade-off of what is acceptable to sacrifice depends heavily on the platform. For example, ZONE_CMA is targeted for 32-bit platforms where page->flags is constrained, but those platforms likely do not care about the minimum granularity of memory hotplug. A big iron machine with 1024 numa nodes can likely sacrifice ZONE_DMA where a general purpose distribution kernel can not. CONFIG_NR_ZONES_EXTENDED is a configuration symbol that gets selected when the number of configured zones exceeds 4. It documents the configuration symbols and definitions that get modified when ZONES_WIDTH is greater than 2. For now, it steals a bit from NODES_SHIFT. Later on it can be used to document the definitions that get modified when a 32-bit configuration wants more zone bits. Note that GFP_ZONE_TABLE poses an interesting constraint since include/linux/gfp.h gets included by the 32-bit portion of a 64-bit build. We need to be careful to only build the table for zones that have a corresponding gfp_t flag. GFP_ZONES_SHIFT is introduced for this purpose. This patch does not attempt to solve the problem of adding a new zone that also has a corresponding GFP_ flag. Link: https://bugzilla.kernel.org/show_bug.cgi?id=110931 Fixes: 033fbae9 ("mm: ZONE_DEVICE for "device memory"") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Mark <markk@clara.co.uk> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from linux-next commit 27ffb3827ac71a46e8d52fc7ed7302d33a619d6c) Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
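The bit budget in the commit message above can be checked with a few lines of arithmetic. The sketch below is a userspace illustration: the struct, the function names and every numeric width used in the usage note are invented for the example and are not taken from any real kernel configuration. Only the inequality itself, and the fact that 4 zones need 2 bits, come from the commit text.

```c
#include <stdbool.h>

/* Hypothetical description of how page->flags is carved up. */
struct sketch_flags_layout {
    int sections_width;
    int zones_width;
    int nodes_shift;
    int last_cpupid_shift;
    int nr_pageflags;
    int bits_per_long;
};

/* Smallest number of bits able to encode nr_zones distinct zones
 * (4 zones fit in 2 bits, a 5th zone needs a 3rd bit). */
static int sketch_zones_width(int nr_zones)
{
    int width = 0;

    while ((1 << width) < nr_zones)
        width++;
    return width;
}

/* SECTIONS_WIDTH + ZONES_WIDTH + NODES_SHIFT + LAST_CPUPID_SHIFT
 *   <= BITS_PER_LONG - NR_PAGEFLAGS */
static bool sketch_layout_fits(const struct sketch_flags_layout *l)
{
    return l->sections_width + l->zones_width + l->nodes_shift +
           l->last_cpupid_shift <= l->bits_per_long - l->nr_pageflags;
}
```

As an illustration with made-up widths: with 64 bits per long, 22 page flags, no section bits, 10 node bits and 30 cpupid bits, a 2-bit zone field uses exactly the 42 available bits. Adding a fifth zone raises the zone field to 3 bits and overflows the budget, and taking one bit from the node field restores the fit, which is the trade described for CONFIG_NR_ZONES_EXTENDED.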
-
Tim Gardner authored
BugLink: http://bugs.launchpad.net/bugs/1534647 Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Hemant Kumar authored
BugLink: http://bugs.launchpad.net/bugs/1521678 Powerpc provides hcall events that also provide insights into guest behaviour. Enhance perf kvm stat to record and analyze hcall events. - To trace hcall events : perf kvm stat record - To show the results : perf kvm stat report --event=hcall The result shows the number of hypervisor calls from the guest grouped by their respective reasons displayed with the frequency. This patch makes use of two additional tracepoints "kvm_hv:kvm_hcall_enter" and "kvm_hv:kvm_hcall_exit". To map the hcall codes to their respective names, it needs a mapping. Such mapping is added in this patch in book3s_hcalls.h. A sample output :
  # pgrep qemu
  19378
  60515
  2 VMs running.
  # perf kvm stat record -a
  ^C[ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 4.153 MB perf.data.guest (39624 samples) ]
  # perf kvm stat report -p 60515 --event=hcall
  Analyze events for all VMs, all VCPUs:
  HCALL-EVENT      Samples  Samples%   Time%  MinTime  MaxTime  AvgTime
  H_IPI                822    66.08%  88.10%   0.63us  11.38us   2.05us (+- 1.42%)
  H_SEND_CRQ           144    11.58%   3.77%   0.41us   0.88us   0.50us (+- 1.47%)
  H_VIO_SIGNAL         118     9.49%   2.86%   0.37us   0.83us   0.47us (+- 1.43%)
  H_PUT_TERM_CHAR       76     6.11%   2.07%   0.37us   0.90us   0.52us (+- 2.43%)
  H_GET_TERM_CHAR       74     5.95%   2.23%   0.37us   1.70us   0.58us (+- 4.77%)
  H_RTAS                 6     0.48%   0.85%   1.10us   9.25us   2.70us (+-48.57%)
  H_PERFMON              4     0.32%   0.12%   0.41us   0.96us   0.59us (+-20.92%)
  Total Samples:1244, Total events handled time:1916.69us.
Signed-off-by: Hemant Kumar <hemant@linux.vnet.ibm.com> Cc: Alexander Yarygin <yarygin@linux.vnet.ibm.com> Cc: David Ahern <dsahern@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Scott Wood <scottwood@freescale.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/1453962787-15376-4-git-send-email-hemant@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> (cherry picked from linux-next commit 78e6c39b) Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Hemant Kumar authored
BugLink: http://bugs.launchpad.net/bugs/1521678 perf kvm can be used to analyze guest exit reasons. This support already exists in x86. Hence, porting it to powerpc. - To trace KVM events : perf kvm stat record If many guests are running, we can track for a specific guest by using --pid as in : perf kvm stat record --pid <pid> - To see the results : perf kvm stat report The result shows the number of exits (from the guest context to host/hypervisor context) grouped by their respective exit reasons with their frequency. Since different powerpc machines have different KVM tracepoints, this patch discovers the available tracepoints dynamically and accordingly looks for them. If any single tracepoint is not present, this support won't be enabled for reporting. To record, this will fail if any of the events we are looking to record isn't available. Right now, it's only supported on PowerPC Book3S_HV architectures. To analyze the different exits, group them and present them (in a slightly descriptive way) to the user, we need a mapping between the "exit code" (dumped in the kvm_guest_exit tracepoint data) and its related Interrupt vector description (exit reason). This patch adds this mapping in book3s_hv_exits.h. It records on two available KVM tracepoints for book3s_hv: "kvm_hv:kvm_guest_exit" and "kvm_hv:kvm_guest_enter". Here is a sample o/p:
  # pgrep qemu
  19378
  60515
  2 Guests are running on the host.
  # perf kvm stat record -a
  ^C[ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 4.153 MB perf.data.guest (39624 samples) ]
  # perf kvm stat report -p 60515
  Analyze events for pid(s) 60515, all VCPUs:
  VM-EXIT          Samples  Samples%   Time%  MinTime      MaxTime   Avg time
  SYSCALL             9141    63.67%   7.49%   1.26us    5782.39us     9.87us (+- 6.46%)
  H_DATA_STORAGE      4114    28.66%   5.07%   1.72us    4597.68us    14.84us (+-20.06%)
  HV_DECREMENTER       418     2.91%   4.26%   0.70us   30002.22us   122.58us (+-70.29%)
  EXTERNAL             392     2.73%   0.06%   0.64us     104.10us     1.94us (+-18.83%)
  RETURN_TO_HOST       287     2.00%  83.11%   1.53us  124240.15us  3486.52us (+-16.81%)
  H_INST_STORAGE         5     0.03%   0.00%   1.88us       3.73us     2.39us (+-14.20%)
  Total Samples:14357, Total events handled time:1203918.42us.
Signed-off-by: Hemant Kumar <hemant@linux.vnet.ibm.com> Cc: Alexander Yarygin <yarygin@linux.vnet.ibm.com> Cc: David Ahern <dsahern@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Scott Wood <scottwood@freescale.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/1453962787-15376-3-git-send-email-hemant@linux.vnet.ibm.com Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> (cherry picked from linux-next commit 066d3593) Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Hemant Kumar authored
BugLink: http://bugs.launchpad.net/bugs/1521678 This patch removes the "const" qualifier from kvm_events_tp declaration to account for the fact that some architectures may need to update this variable dynamically. For instance, powerpc will need to update this variable dynamically depending on the machine type. Signed-off-by: Hemant Kumar <hemant@linux.vnet.ibm.com> Acked-by: David Ahern <dsahern@gmail.com> Cc: Alexander Yarygin <yarygin@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Scott Wood <scottwood@freescale.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/1453962787-15376-2-git-send-email-hemant@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> (cherry picked from linux-next commit 48deaa74) Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-
Hemant Kumar authored
BugLink: http://bugs.launchpad.net/bugs/1521678 It's better to remove the dependency on uapi/kvm_perf.h to allow dynamic discovery of kvm events (if it's needed). To do this, some extern variables have been introduced with which we can keep the generic functions generic. Signed-off-by: Hemant Kumar <hemant@linux.vnet.ibm.com> Acked-by: Alexander Yarygin <yarygin@linux.vnet.ibm.com> Acked-by: David Ahern <dsahern@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Scott Wood <scottwood@freescale.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/1453962787-15376-1-git-send-email-hemant@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> (cherry picked from linux-next commit 162607ea) Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
-