    memcg: enable accounting for mnt_cache entries · 79f6540b
    Vasily Averin authored
    Patch series "memcg accounting from OpenVZ", v7.
    
    OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux kernels.
    Initially we used our own accounting subsystem, then partially committed
    it to upstream, and a few years ago switched to cgroups v1.  Now we're
    rebasing again, revising our old patches and trying to push them upstream.
    
    We try to protect the host system from any misuse of kernel memory
    allocation triggered by untrusted users inside the containers.
    
    This patch set is addressed mostly to the cgroups maintainers and the
    cgroups@ mailing list, though I would be very grateful for any comments
    from maintainers of affected subsystems or other people added in Cc.
    
    Compared to the upstream, we additionally account the following kernel objects:
    - network devices and their Tx/Rx queues
    - ipv4/v6 addresses and routing-related objects
    - inet_bind_bucket cache objects
    - VLAN group arrays
    - ipv6/sit: ip_tunnel_prl
    - scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
    - nsproxy and the namespace objects themselves
    - IPC objects: semaphores, message queues and shared memory segments
    - mounts
    - pollfd and select bits arrays
    - signals and posix timers
    - file locks
    - fasync_struct used by the file lease code and driver's fasync queues
    - tty objects
    - per-mm LDT
    
    We have incorrect, incomplete or obsolete accounting for a few other
    kernel objects: sk_filter, af_packets, netlink and xt_counters for
    iptables.  They require rework and will probably be dropped altogether.
    
    We are also going to add accounting for nft, but it is not ready yet.
    
    We have not tested performance on upstream.  However, our performance team
    compared our current RHEL7-based production kernel with the corresponding
    original RHEL7 kernel and reports that it is at least no worse.
    
    This patch (of 10):
    
    The kernel allocates ~400 bytes of 'struct mount' for any new mount.
    Creating a new mount namespace clones most of the parent mounts, and this
    can be repeated many times.  Additionally, each mount allocates up to
    PATH_MAX=4096 bytes for mnt->mnt_devname.
    
    It makes sense to account for these allocations to restrict the host's
    memory consumption from inside the memcg-limited container.
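
    As a rough illustration of the figures above, the following sketch
    computes an upper bound on the memory held by mounts.  It is not part of
    the patch: the helper name mount_bytes_upper_bound() and the macro names
    are hypothetical, and the calculation assumes the worst case in which
    every mount uses the full PATH_MAX for mnt_devname and every namespace
    clones the same number of mounts.

```c
#include <assert.h>
#include <stdint.h>

/* Figures quoted in the commit message; both are approximations/upper bounds. */
#define STRUCT_MOUNT_BYTES    400u   /* ~size of 'struct mount' */
#define MNT_DEVNAME_MAX_BYTES 4096u  /* PATH_MAX, worst case for mnt_devname */
#define PER_MOUNT_MAX_BYTES   (STRUCT_MOUNT_BYTES + MNT_DEVNAME_MAX_BYTES)

static_assert(PER_MOUNT_MAX_BYTES == 4496u, "400 + 4096 bytes per mount");

/*
 * Hypothetical helper (illustration only): worst-case bytes consumed when
 * 'namespaces' mount namespaces each hold 'mounts_per_ns' mounts.
 * Saturates at UINT64_MAX if the product would overflow.
 */
uint64_t mount_bytes_upper_bound(uint64_t mounts_per_ns, uint64_t namespaces)
{
    /* Guard the first multiplication: mounts_per_ns * namespaces. */
    if (mounts_per_ns != 0 && namespaces > UINT64_MAX / mounts_per_ns)
        return UINT64_MAX;

    uint64_t mounts = mounts_per_ns * namespaces;

    /* Guard the second multiplication: mounts * bytes per mount. */
    if (mounts > UINT64_MAX / PER_MOUNT_MAX_BYTES)
        return UINT64_MAX;

    return mounts * PER_MOUNT_MAX_BYTES;
}
```

    Under these assumptions, mount_bytes_upper_bound(100, 1000), that is,
    100 mounts cloned into 1000 namespaces, gives 449,600,000 bytes (roughly
    429 MiB) that would otherwise not be charged to any memcg.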
    
    Link: https://lkml.kernel.org/r/045db11f-4a45-7c9b-2664-5b32c2b44943@virtuozzo.com
    Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
    Reviewed-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Yutian Yang <nglaive@gmail.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Alexey Dobriyan <adobriyan@gmail.com>
    Cc: Andrei Vagin <avagin@gmail.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Dmitry Safonov <0x7f454c46@gmail.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: "J. Bruce Fields" <bfields@fieldses.org>
    Cc: Jeff Layton <jlayton@kernel.org>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Jiri Slaby <jirislaby@kernel.org>
    Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Serge Hallyn <serge@hallyn.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Zefan Li <lizefan.x@bytedance.com>
    Cc: Borislav Petkov <bp@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>