1. 13 Jan, 2012 26 commits
    • mm: make per-memcg LRU lists exclusive · 925b7673
      Johannes Weiner authored
      Now that all code that operated on global per-zone LRU lists is
      converted to operate on per-memory cgroup LRU lists instead, there is no
      reason to keep the double-LRU scheme around any longer.
      
      The pc->lru member is removed and page->lru is linked directly to the
      per-memory cgroup LRU lists, which removes two pointers from a
      descriptor that exists for every page frame in the system.
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Ying Han <yinghan@google.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: collect LRU list heads into struct lruvec · 6290df54
      Johannes Weiner authored
      Having a unified structure with an LRU list set for both global zones and
      per-memcg zones keeps simple the code that deals with LRU lists and does
      not care about the container itself.
      
      Once the per-memcg LRU lists directly link struct pages, the isolation
      function and all other list manipulations are shared between the memcg
      case and the global LRU case.
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
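To illustrate the idea of the lruvec commit above, here is a small Python sketch of a single structure that holds one set of LRU lists, regardless of whether a zone or a memory cgroup owns it. It is a userspace toy, not kernel code: the class `Lruvec`, the helpers `add_page()` and `isolate_pages()`, and the use of plain integers as pages are assumptions made for this example. Only the list names mirror the kernel's LRU types.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum


class LruList(Enum):
    INACTIVE_ANON = 0
    ACTIVE_ANON = 1
    INACTIVE_FILE = 2
    ACTIVE_FILE = 3
    UNEVICTABLE = 4


@dataclass
class Lruvec:
    """One set of LRU lists; who owns it (zone or memcg) does not matter."""
    lists: dict = field(
        default_factory=lambda: {lru: deque() for lru in LruList}
    )


def add_page(lruvec, lru, page):
    """Link a page at the head (most recently used end) of one list."""
    lruvec.lists[lru].appendleft(page)


def isolate_pages(lruvec, lru, nr_to_scan):
    """Take up to nr_to_scan pages from the tail (least recently used end)."""
    taken = []
    src = lruvec.lists[lru]
    while src and len(taken) < nr_to_scan:
        taken.append(src.pop())
    return taken


# The same helpers serve a zone-owned and a memcg-owned lruvec.
zone_lruvec = Lruvec()
memcg_lruvec = Lruvec()
for pfn in range(4):
    add_page(zone_lruvec, LruList.INACTIVE_FILE, pfn)
    add_page(memcg_lruvec, LruList.INACTIVE_FILE, 100 + pfn)
```

Because `isolate_pages()` only sees an `Lruvec`, the list handling is written once and the caller decides which container it belongs to.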
    • mm: vmscan: convert global reclaim to per-memcg LRU lists · b95a2f2d
      Johannes Weiner authored
      The global per-zone LRU lists are about to go away on memcg-enabled
      kernels, so global reclaim must be able to find its pages on the
      per-memcg LRU lists.
      
      Since the LRU pages of a zone are distributed over all existing memory
      cgroups, a scan target for a zone is complete when all memory cgroups
      are scanned for their proportional share of a zone's memory.
      
      The forced scanning of small scan targets from kswapd is limited to
      zones marked unreclaimable, otherwise kswapd can quickly overreclaim by
      force-scanning the LRU lists of multiple memory cgroups.
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
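The "proportional share" wording in the commit above can be made concrete with a little arithmetic. The sketch below assumes that the scan target of one LRU list is its size shifted right by the reclaim priority; the function names and the memcg sizes are invented for illustration, and real reclaim applies further adjustments that are left out here.

```python
def scan_target(lru_size, priority):
    """Pages to scan from one list: its size scaled down by the priority."""
    return lru_size >> priority


def zone_scan_targets(memcg_lru_sizes, priority):
    """Per-memcg scan targets for the pages each memcg holds in one zone."""
    return {
        name: scan_target(size, priority)
        for name, size in memcg_lru_sizes.items()
    }


# Hypothetical distribution of one zone's LRU pages over four memcgs.
sizes = {"root": 400_000, "A": 200_000, "B": 40_000, "C": 1_000}

easy = zone_scan_targets(sizes, priority=12)   # light pressure
hard = zone_scan_targets(sizes, priority=6)    # heavier pressure
```

With these numbers the small memcg "C" gets a target of zero pages at priority 12, which is the kind of small scan target the commit message refers to. Forcing a minimum scan on every such memcg would add up quickly when many memcgs share a zone.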
    • mm: memcg: remove optimization of keeping the root_mem_cgroup LRU lists empty · ad2b8e60
      Johannes Weiner authored
      root_mem_cgroup, lacking a configurable limit, was never subject to
      limit reclaim, so the pages charged to it could be kept off its LRU
      lists.  They would be found on the global per-zone LRU lists upon
      physical memory pressure and it made sense to avoid uselessly linking
      them to both lists.
      
      The global per-zone LRU lists are about to go away on memcg-enabled
      kernels, with all pages being exclusively linked to their respective
      per-memcg LRU lists.  As a result, pages of the root_mem_cgroup must
      also be linked to its LRU lists again.  This is purely about the LRU
      list; root_mem_cgroup is still not charged.
      
      The overhead is temporary, until the double-LRU scheme goes away
      completely.
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move memcg hierarchy reclaim to generic reclaim code · 5660048c
      Johannes Weiner authored
      Memory cgroup limit reclaim and traditional global pressure reclaim will
      soon share the same code to reclaim from a hierarchical tree of memory
      cgroups.
      
      In preparation for this, move the two right next to each other in
      shrink_zone().
      
      The mem_cgroup_hierarchical_reclaim() polymath is split into a soft
      limit reclaim function, which still does hierarchy walking on its own,
      and a limit (shrinking) reclaim function, which relies on generic
      reclaim code to walk the hierarchy.
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg: per-priority per-zone hierarchy scan generations · 527a5ec9
      Johannes Weiner authored
      Memory cgroup limit reclaim currently picks one memory cgroup out of the
      target hierarchy, remembers it as the last scanned child, and reclaims
      all zones in it with decreasing priority levels.
      
      The new hierarchy reclaim code will pick memory cgroups from the same
      hierarchy concurrently from different zones and priority levels, so it
      becomes necessary that hierarchy roots not only remember the last
      scanned child, but do so for each zone and priority level.
      
      Until now, we reclaimed memcgs like this:
      
          mem = mem_cgroup_iter(root)
          for each priority level:
            for each zone in zonelist:
              reclaim(mem, zone)
      
      But subsequent patches will move the memcg iteration inside the loop
      over the zones:
      
          for each priority level:
            for each zone in zonelist:
              mem = mem_cgroup_iter(root)
              reclaim(mem, zone)
      
      And to keep with the original scan order - memcg -> priority -> zone -
      the last scanned memcg has to be remembered per zone and per priority
      level.
      
      Furthermore, global reclaim will be switched to the hierarchy walk as
      well.  Different from limit reclaim, which can just recheck the limit
      after some reclaim progress, its target is to scan all memcgs for the
      desired zone pages, proportional to the memcg size, and so reliably
      detecting a full hierarchy round-trip will become crucial.
      
      Currently, the code relies on one reclaimer encountering the same memcg
      twice, but that is error-prone with concurrent reclaimers.  Instead, use
      a generation counter that is increased every time the child with the
      highest ID has been visited, so that reclaimers can stop when the
      generation changes.
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
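The generation counter described in the commit above can be sketched in a few lines of Python. This is a simplified userspace model under stated assumptions, not the kernel implementation: `SharedIter`, `Hierarchy` and `reclaim_walk()` are hypothetical names, memcgs are plain strings in a non-empty list, and concurrency is simulated by interleaving two generators rather than by real threads.

```python
class SharedIter:
    """Iteration state kept at the hierarchy root for one (zone, priority)."""

    def __init__(self):
        self.position = 0    # index of the next memcg to hand out
        self.generation = 0  # bumped every time the walk wraps around


class Hierarchy:
    def __init__(self, memcgs):
        self.memcgs = list(memcgs)  # assumed non-empty
        self.iters = {}

    def shared_iter(self, zone, priority):
        return self.iters.setdefault((zone, priority), SharedIter())


def reclaim_walk(hierarchy, zone, priority):
    """Yield memcgs to scan until the shared generation changes."""
    it = hierarchy.shared_iter(zone, priority)
    my_generation = it.generation
    while it.generation == my_generation:
        memcg = hierarchy.memcgs[it.position]
        it.position += 1
        if it.position == len(hierarchy.memcgs):
            # The last child was handed out: a full round-trip is complete.
            it.position = 0
            it.generation += 1
        yield memcg


h = Hierarchy(["A", "B", "C"])
single = list(reclaim_walk(h, "Normal", 12))  # one reclaimer, one round-trip
```

When two reclaimers share the same zone and priority, each stops as soon as the generation it started with is over, so together they cover the hierarchy once instead of each waiting to meet the same memcg twice.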
    • mm: vmscan: distinguish between memcg triggering reclaim and memcg being scanned · f16015fb
      Johannes Weiner authored
      Memory cgroup hierarchies are currently handled completely outside of
      the traditional reclaim code, which is invoked with a single memory
      cgroup as an argument for the whole call stack.
      
      Subsequent patches will switch this code to do hierarchical reclaim, so
      there needs to be a distinction between a) the memory cgroup that is
      triggering reclaim due to hitting its limit and b) the memory cgroup
      that is being scanned as a child of a).
      
      This patch introduces a struct mem_cgroup_zone that contains the
      combination of the memory cgroup and the zone being scanned, which is
      then passed down the stack instead of the zone argument.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: distinguish global reclaim from global LRU scanning · 89b5fae5
      Johannes Weiner authored
      The traditional zone reclaim code is scanning the per-zone LRU lists
      during direct reclaim and kswapd, and the per-zone per-memory cgroup LRU
      lists when reclaiming on behalf of a memory cgroup limit.
      
      Subsequent patches will convert the traditional reclaim code to reclaim
      exclusively from the per-memory cgroup LRU lists.  As a result, using
      the predicate for which LRU list is scanned will no longer be
      appropriate to tell global reclaim from limit reclaim.
      
      This patch adds a global_reclaim() predicate to tell direct/kswapd
      reclaim from memory cgroup limit reclaim and substitutes it in all
      places where currently scanning_global_lru() is used for that.
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
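The distinction drawn in the commit above is between why reclaim runs and which lists it scans. A minimal Python model might look like the following; the `ScanControl` class, its `target_memcg` field and the `lru_lists_scanned()` helper are illustrative assumptions, and only the predicate name `global_reclaim()` comes from the commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScanControl:
    # Memcg whose limit triggered reclaim; None for kswapd or direct reclaim.
    target_memcg: Optional[str] = None


def global_reclaim(sc):
    """True when reclaim is driven by physical memory pressure."""
    return sc.target_memcg is None


def lru_lists_scanned(sc):
    """After the conversion, both kinds of reclaim walk per-memcg lists."""
    return "per-memcg"


kswapd = ScanControl()
limit = ScanControl(target_memcg="A")
```

Once both cases scan the same kind of list, only a predicate on the trigger, as in `global_reclaim()`, can still tell them apart.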
    • mm: memcg: consolidate hierarchy iteration primitives · 9f3a0d09
      Johannes Weiner authored
      The memcg naturalization series:
      
      Memory control groups are currently bolted onto the side of
      traditional memory management in places where better integration would
      be preferable.  To reclaim memory, for example, memory control groups
      maintain their own LRU list and reclaim strategy aside from the global
      per-zone LRU list reclaim.  But an extra list head for each existing
      page frame is expensive and maintaining it requires additional code.
      
      This patchset disables the global per-zone LRU lists on memory cgroup
      configurations and converts all its users to operate on the per-memory
      cgroup lists instead.  As LRU pages are then exclusively on one list,
      this saves two list pointers for each page frame in the system:
      
      page_cgroup array size with 4G physical memory
      
        vanilla: allocated 31457280 bytes of page_cgroup
        patched: allocated 15728640 bytes of page_cgroup
      
      At the same time, system performance for various workloads is
      unaffected:
      
      100G sparse file cat, 4G physical memory, 10 runs, to test for code
      bloat in the traditional LRU handling and kswapd & direct reclaim
      paths, without/with the memory controller configured in
      
        vanilla: 71.603(0.207) seconds
        patched: 71.640(0.156) seconds
      
        vanilla: 79.558(0.288) seconds
        patched: 77.233(0.147) seconds
      
      100G sparse file cat in 1G memory cgroup, 10 runs, to test for code
      bloat in the traditional memory cgroup LRU handling and reclaim path
      
        vanilla: 96.844(0.281) seconds
        patched: 94.454(0.311) seconds
      
      4 unlimited memcgs running kbuild -j32 each, 4G physical memory, 500M
      swap on SSD, 10 runs, to test for regressions in kswapd & direct
      reclaim using per-memcg LRU lists with multiple memcgs and multiple
      allocators within each memcg
      
        vanilla: 717.722(1.440) seconds [ 69720.100(11600.835) majfaults ]
        patched: 714.106(2.313) seconds [ 71109.300(14886.186) majfaults ]
      
      16 unlimited memcgs running kbuild, 1900M hierarchical limit, 500M
      swap on SSD, 10 runs, to test for regressions in hierarchical memcg
      setups
      
        vanilla: 2742.058(1.992) seconds [ 26479.600(1736.737) majfaults ]
        patched: 2743.267(1.214) seconds [ 27240.700(1076.063) majfaults ]
      
      This patch:
      
      There are currently two different implementations of iterating over a
      memory cgroup hierarchy tree.
      
      Consolidate them into one worker function and base the convenience
      looping-macros on top of it.
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
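The page_cgroup figures quoted in the cover letter above can be cross-checked against its "two list pointers per page frame" claim. The short calculation below assumes 8-byte pointers (a 64-bit kernel) and 4 KiB page frames; neither is stated in the commit message, so treat the derived values as a consistency check and not as measured data.

```python
POINTER_BYTES = 8   # assumption: 64-bit kernel
PAGE_BYTES = 4096   # assumption: 4 KiB page frames

VANILLA_BYTES = 31457280  # page_cgroup array size quoted for vanilla
PATCHED_BYTES = 15728640  # page_cgroup array size quoted for patched

# Two list pointers are removed from every per-page descriptor.
saved_per_page = 2 * POINTER_BYTES
nr_pages = (VANILLA_BYTES - PATCHED_BYTES) // saved_per_page

vanilla_per_page = VANILLA_BYTES // nr_pages
patched_per_page = PATCHED_BYTES // nr_pages
covered_mib = nr_pages * PAGE_BYTES // (1024 * 1024)

print(nr_pages, vanilla_per_page, patched_per_page, covered_mib)
# prints: 983040 32 16 3840
```

Under these assumptions the descriptor shrinks from 32 to 16 bytes per page, and the array covers 3840 MiB of page frames, which is plausible for a machine described as having 4G of physical memory.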
    • memcg: add mem_cgroup_replace_page_cache() to fix LRU issue · ab936cbc
      KAMEZAWA Hiroyuki authored
      Commit ef6a3c63 ("mm: add replace_page_cache_page() function") added a
      function replace_page_cache_page().  This function replaces a page in the
      radix-tree with a new page.  When doing this, memory cgroup needs to fix
      up the accounting information.  memcg needs to check the PCG_USED bit etc.
      
      In some (many?) cases, 'newpage' is on the LRU before calling
      replace_page_cache_page().  So, memcg's LRU accounting information should be
      fixed, too.
      
      This patch adds mem_cgroup_replace_page_cache() and removes the old hooks.
      In that function, the old page will be unaccounted without touching
      res_counter and the new page will be accounted to the memcg (of the old page).
      When overwriting pc->mem_cgroup of newpage, take zone->lru_lock and avoid
      races with LRU handling.
      
      Background:
        replace_page_cache_page() is called by FUSE code in its splice() handling.
        Here, 'newpage' is replacing oldpage but this newpage is not a newly allocated
        page and may be on the LRU. LRU mis-accounting will be critical for memory cgroup
        because rmdir() checks that the whole LRU is empty and there is no account leak.
        If a page is on a different LRU than it should be, rmdir() will fail.
      
      This bug was added in March 2011, but there has been no bug report yet.  I guess there
      are not many people who use memcg and FUSE at the same time with upstream
      kernels.
      
      The result of this bug is that the admin cannot destroy a memcg because of
      the account leak.  So, no panic, no deadlock.  And, even if an active cgroup
      exists, umount can succeed.  So no problem at shutdown.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Miklos Szeredi <mszeredi@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: limit paths · 28d82dc1
      Jason Baron authored
      The current epoll code can be tickled to run basically indefinitely in
      both loop detection path check (on ep_insert()), and in the wakeup paths.
      The programs that tickle this behavior set up deeply linked networks of
      epoll file descriptors that cause the epoll algorithms to traverse them
      indefinitely.  A couple of these sample programs have been previously
      posted in this thread: https://lkml.org/lkml/2011/2/25/297.
      
      To fix the loop detection path check algorithms, I simply keep track of
      the epoll nodes that have been already visited.  Thus, the loop detection
      becomes proportional to the number of epoll file descriptors and links.
      This dramatically decreases the run-time of the loop check algorithm.  In
      one diabolical case I tried it reduced the run-time from 15 minutes (all
      in kernel time) to .3 seconds.
      
      Fixing the wakeup paths could be done at wakeup time in a similar manner
      by keeping track of nodes that have already been visited, but the
      complexity is harder, since there can be multiple wakeups on different
      CPUs.  Thus, I've opted to limit the number of possible wakeup paths when
      the paths are created.
      
      This is accomplished by noting that the end file descriptor points that
      are found during the loop detection pass (from the newly added link) are
      actually the sources for wakeup events.  I keep a list of these file
      descriptors and limit the number and length of these paths that emanate
      from these 'source file descriptors'.  In the current implementation I
      allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
      length 4 and 10 of length 5.  Note that it is sufficient to check the
      'source file descriptors' reachable from the newly added link, since no
      other 'source file descriptors' will have newly added links.  This allows
      us to check only the wakeup paths that may have gotten too long, and not
      re-check all possible wakeup paths on the system.
      
      In terms of the path limit selection, I think it's first worth noting that
      the most common case for epoll is probably the model where you have 1
      epoll file descriptor that is monitoring n 'source file
      descriptors'.  In this case, each 'source file descriptor' has 1 path of
      length 1.  Thus, I believe that the limits I'm proposing are quite
      reasonable and in fact may be too generous.  Thus, I'm hoping that the
      proposed limits will not cause any workloads that currently work to
      fail.
      
      In terms of locking, I have extended the use of the 'epmutex' to all
      epoll_ctl add and remove operations.  Currently it's only used in a subset
      of the add paths.  I need to hold the epmutex, so that we can correctly
      traverse a coherent graph, to check the number of paths.  I believe that
      this additional locking is probably ok, since it's in the setup/teardown
      paths, and doesn't affect the running paths, but it certainly is going to
      add some extra overhead.  Also worth noting is that the epmutex was
      recently added to the ep_ctl add operations in the initial path loop
      detection code using the argument that it was not on a critical path.
      
      Another thing to note here, is the length of epoll chains that is allowed.
      Currently, eventpoll.c defines:
      
      /* Maximum number of nesting allowed inside epoll sets */
      #define EP_MAX_NESTS 4
      
      This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
      + 1).  However, this limit is currently only enforced during the loop
      check detection code, and only when the epoll file descriptors are added
      in a certain order.  Thus, this limit is currently easily bypassed.  The
      newly added check for wakeup paths strictly limits the wakeup paths to a
      length of 5, regardless of the order in which ep's are linked together.
      Thus, a side-effect of the new code is a more consistent enforcement of
      the graph depth.
      
      Thus far, I've tested this using the sample programs previously
      mentioned, which now either return quickly or return -EINVAL.  I've also
      tested using the piptest.c epoll tester, which showed no difference in
      performance.  I've also created a number of different epoll networks and
      tested that they behave as expected.
      
      I believe this solves the original diabolical test cases, while still
      preserving the sane epoll nesting.
      Signed-off-by: Jason Baron <jbaron@redhat.com>
      Cc: Nelson Elhage <nelhage@ksplice.com>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
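The two ideas in the epoll commit above, a visited set for loop detection and per-length limits on wakeup paths, can be modelled in userspace Python. This is a sketch under assumptions, not the eventpoll.c code: the function names and the dictionary layouts are invented for the example, the graph is assumed to be loop-free when paths are counted, and a path is counted only when it ends at an epoll descriptor that nothing else monitors. The limit values are the ones quoted in the commit message.

```python
# Limits quoted in the commit message; index 0 holds the limit for length 1.
PATH_LIMITS = [1000, 500, 100, 50, 10]


def creates_loop(monitors, parent, child):
    """Would letting epoll fd `parent` monitor epoll fd `child` close a cycle?

    `monitors` maps an epoll fd to the epoll fds it already monitors.
    The visited set ensures that each node is expanded at most once.
    """
    visited = set()
    stack = [child]
    while stack:
        node = stack.pop()
        if node == parent:
            return True
        if node in visited:
            continue
        visited.add(node)
        stack.extend(monitors.get(node, ()))
    return False


def wakeup_paths_ok(watchers, source):
    """Check the wakeup paths that start at `source` against PATH_LIMITS.

    `watchers` maps a descriptor to the epoll fds that monitor it.
    """
    counts = [0] * len(PATH_LIMITS)

    def walk(fd, length):
        for ep in watchers.get(fd, ()):
            if length > len(PATH_LIMITS):
                return False          # path longer than 5
            if watchers.get(ep):
                if not walk(ep, length + 1):
                    return False
            else:
                counts[length - 1] += 1
                if counts[length - 1] > PATH_LIMITS[length - 1]:
                    return False      # too many paths of this length
        return True

    return walk(source, 1)


# Common case from the commit message: one epoll fd monitoring n sources.
common = {f"sock{i}": ["ep"] for i in range(2000)}
common_ok = all(wakeup_paths_ok(common, src) for src in common)
```

In the common case each source has exactly one path of length 1, so the check passes no matter how many sources a single epoll descriptor monitors. Only deeply nested or heavily fanned-out graphs run into the limits.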
    • pipe: fail cleanly when root tries F_SETPIPE_SZ with big size · 2ccd4f4d
      Sasha Levin authored
      When a user with the CAP_SYS_RESOURCE capability tries to F_SETPIPE_SZ a pipe
      with a size bigger than kmalloc() can allocate, it spits out an ugly warning:
      
        ------------[ cut here ]------------
        WARNING: at mm/page_alloc.c:2095 __alloc_pages_nodemask+0x5d3/0x7a0()
        Pid: 733, comm: a.out Not tainted 3.2.0-rc1+ #4
        Call Trace:
           warn_slowpath_common+0x75/0xb0
           warn_slowpath_null+0x15/0x20
           __alloc_pages_nodemask+0x5d3/0x7a0
           __get_free_pages+0x12/0x50
           __kmalloc+0x12b/0x150
           pipe_set_size+0x75/0x120
           pipe_fcntl+0xf8/0x140
           do_fcntl+0x2d4/0x410
           sys_fcntl+0x66/0xa0
           system_call_fastpath+0x16/0x1b
        ---[ end trace 432f702e6db7b5ee ]---
      
      Instead, make kcalloc() handle the overflow case and fail quietly.
      
      [akpm@linux-foundation.org: switch to sizeof(*bufs) for 80-column niceness]
      Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Acked-by: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
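The "fail quietly" behaviour that the pipe commit above asks of the array allocation can be illustrated with a checked multiplication. The Python sketch below only models the size check; `checked_array_bytes()` and its `max_alloc` parameter are hypothetical, a 64-bit `size_t` is assumed, and the 4 MiB cap in the usage lines is an arbitrary stand-in for whatever the allocator's largest request happens to be.

```python
SIZE_MAX = 2**64 - 1  # assumption: 64-bit size_t


def checked_array_bytes(n, size, max_alloc):
    """Return n * size, or None if it overflows or exceeds max_alloc.

    Returning None stands in for an allocation that fails without a warning.
    """
    if size != 0 and n > SIZE_MAX // size:
        return None  # n * size would not fit in size_t
    total = n * size
    if total > max_alloc:
        return None  # larger than the allocator can serve
    return total


MAX_ALLOC = 4 << 20  # hypothetical allocator cap, for illustration only

small = checked_array_bytes(16, 40, MAX_ALLOC)
too_big = checked_array_bytes(1 << 20, 40, MAX_ALLOC)
overflow = checked_array_bytes(2**62, 40, MAX_ALLOC)
```

A caller that sees `None` can return an error to user space directly, which is the clean failure the commit title refers to.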
    • slub: document setting min order with debug_guardpage_minorder > 0 · 888a214d
      Stanislaw Gruszka authored
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • parisc, exec: remove redundant set_fs(USER_DS) · 15ee2d00
      Mathias Krause authored
      The address limit is already set in flush_old_exec() so those calls to
      set_fs(USER_DS) are redundant.
      Signed-off-by: Mathias Krause <minipli@googlemail.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Helge Deller <deller@gmx.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ia64, exec: remove redundant set_fs(USER_DS) · 01fa310c
      Mathias Krause authored
      The address limit is already set in flush_old_exec() so this
      set_fs(USER_DS) is redundant.
      Signed-off-by: Mathias Krause <minipli@googlemail.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/video/nvidia/nvidia.c: fix warning · 08346bf8
      Andrew Morton authored
      Fix the int/bool confusion in there.
      
        drivers/video/nvidia/nvidia.c:1602: warning: return from incompatible pointer type
      
      Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,x86,um: move CMPXCHG_DOUBLE config option · 2565409f
      Heiko Carstens authored
      Move CMPXCHG_DOUBLE and rename it to HAVE_CMPXCHG_DOUBLE so architectures
      can simply select the option if it is supported.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,x86,um: move CMPXCHG_LOCAL config option · 4156153c
      Heiko Carstens authored
      Move CMPXCHG_LOCAL and rename it to HAVE_CMPXCHG_LOCAL so architectures
      can simply select the option if it is supported.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,slub,x86: decouple size of struct page from CONFIG_CMPXCHG_LOCAL · 43570fd2
      Heiko Carstens authored
      While implementing cmpxchg_double() on s390 I realized that we don't set
      CONFIG_CMPXCHG_LOCAL despite the fact that we have support for it.
      
      However setting that option will increase the size of struct page by
      eight bytes on 64 bit, which we certainly do not want.  Also, it doesn't
      make sense that a present cpu feature should increase the size of struct
      page.
      
      Besides that, it looks like the dependency on CMPXCHG_LOCAL is wrong and
      that it should depend on CMPXCHG_DOUBLE instead.
      
      This patch:
      
      If an architecture supports CMPXCHG_LOCAL this shouldn't result
      automatically in larger struct pages if the SLUB allocator is used.
      Instead introduce a new config option "HAVE_ALIGNED_STRUCT_PAGE" which
      can be selected if a double word aligned struct page is required.  Also
      update x86 Kconfig so that it should work as before.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43570fd2
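As a rough illustration of the size effect this commit describes, the sketch below compares a seven-word structure with and without double-word alignment. The struct names and the `DEMO_HAVE_ALIGNED_STRUCT_PAGE` macro are hypothetical stand-ins, not the kernel's `struct page` or its Kconfig symbol, and the `aligned` attribute is the GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical switch standing in for a Kconfig option. */
#define DEMO_HAVE_ALIGNED_STRUCT_PAGE 1

#if DEMO_HAVE_ALIGNED_STRUCT_PAGE
#define DEMO_PAGE_ALIGN __attribute__((aligned(2 * sizeof(unsigned long))))
#else
#define DEMO_PAGE_ALIGN
#endif

/* Seven machine words: not a multiple of a double word. */
struct demo_page_plain {
	unsigned long words[7];
};

/* Same payload, padded so that a double-word operation can target it. */
struct demo_page_aligned {
	unsigned long words[7];
} DEMO_PAGE_ALIGN;

#if DEMO_HAVE_ALIGNED_STRUCT_PAGE
static_assert(sizeof(struct demo_page_aligned) % (2 * sizeof(unsigned long)) == 0,
	      "aligned descriptor must be a whole number of double words");
#endif

/* Bytes of padding that the alignment requirement adds per descriptor. */
static size_t demo_alignment_cost(void)
{
	return sizeof(struct demo_page_aligned) - sizeof(struct demo_page_plain);
}
```

With the switch enabled, the padding comes to one machine word per descriptor, which on a 64-bit build corresponds to the eight bytes mentioned in the commit message. Setting the switch to 0 makes both structures the same size.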
    • Joe Perches's avatar
      include/linux/linkage.h: remove unused ATTRIB_NORET macro · 0d259cf8
      Joe Perches authored
      The uses have been renamed so delete the unused macro.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0d259cf8
    • Joe Perches's avatar
      treewide: convert uses of ATTRIB_NORETURN to __noreturn · ff2d8b19
      Joe Perches authored
      Use the more commonly used __noreturn instead of ATTRIB_NORETURN.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ff2d8b19
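For readers unfamiliar with the annotation, here is a minimal userspace sketch of what a `__noreturn` marker does. It assumes the macro expands to the GCC/Clang `noreturn` attribute; `demo_fatal()` and `demo_checked_div()` are hypothetical helpers and do not come from the kernel tree.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's __noreturn annotation. */
#ifndef __noreturn
#define __noreturn __attribute__((noreturn))
#endif

/* Hypothetical fatal-error helper; it never returns to its caller. */
static __noreturn void demo_fatal(const char *msg)
{
	fprintf(stderr, "fatal: %s\n", msg);
	exit(EXIT_FAILURE);
}

/*
 * Because demo_fatal() is marked noreturn, the compiler knows that control
 * cannot fall off the end of this function on the error path.
 */
static int demo_checked_div(int a, int b)
{
	if (b != 0)
		return a / b;
	demo_fatal("division by zero");
}
```

A single spelling for the attribute keeps prototypes uniform across the tree, which appears to be the motivation for the conversion in this commit.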
    • Joe Perches's avatar
      treewide: remove useless NORET_TYPE macro and uses · 9402c95f
      Joe Perches authored
      It's a very old and now unused prototype marking so just delete it.
      
      Neaten panic pointer argument style to keep checkpatch quiet.
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9402c95f
    • Joe Perches's avatar
      include/linux/linkage.h: remove unused NORET_AND macro · 80bf007f
      Joe Perches authored
      The only use in kernel.h is gone so remove the macro.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      80bf007f
    • Joe Perches's avatar
      kernel.h: neaten panic prototype · 4da47859
      Joe Perches authored
      Use __printf macro.
      Convert NORET_AND to ATTRIB_NORET.
      Use the normal kernel style for pointer arguments.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4da47859
    • Stephen Boyd's avatar
      kprobes: silence DEBUG_STRICT_USER_COPY_CHECKS=y warning · efeb156e
      Stephen Boyd authored
      Enabling DEBUG_STRICT_USER_COPY_CHECKS causes the following warning:
      
        In file included from arch/x86/include/asm/uaccess.h:573,
                         from kernel/kprobes.c:55:
        In function 'copy_from_user',
            inlined from 'write_enabled_file_bool' at
            kernel/kprobes.c:2191:
        arch/x86/include/asm/uaccess_64.h:65:
        warning: call to 'copy_from_user_overflow' declared with attribute warning: copy_from_user() buffer size is not provably correct
      
      presumably due to buf_size being signed causing GCC to fail to see that
      buf_size can't become negative.
      Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      efeb156e
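The commit message attributes the warning to a signed length. A minimal userspace sketch of the safer pattern follows, assuming the remedy is to hold the clamped length in an unsigned `size_t`. Here `memcpy()` stands in for `copy_from_user()`, and `demo_bounded_copy()` and `DEMO_BUF_LEN` are hypothetical names rather than code from kernel/kprobes.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_BUF_LEN 32

/*
 * Userspace stand-in for the pattern in the commit: memcpy() plays the role
 * of copy_from_user(), and dst must point to at least DEMO_BUF_LEN bytes.
 * Keeping the clamped length in a size_t means it can never be negative.
 */
static size_t demo_bounded_copy(char *dst, const char *src, size_t count)
{
	size_t n = count < DEMO_BUF_LEN - 1 ? count : DEMO_BUF_LEN - 1;

	assert(dst != NULL && src != NULL);
	memcpy(dst, src, n);
	dst[n] = '\0';	/* always leave room for the terminator */
	return n;
}
```

With an unsigned length, the upper bound `DEMO_BUF_LEN - 1` is the only limit the compiler has to reason about, so a bounds checker has a better chance of proving the copy fits.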
    • Xiaotian Feng's avatar
      proc: fix null pointer deref in proc_pid_permission() · a2ef990a
      Xiaotian Feng authored
      get_proc_task() can fail to find the task and return NULL;
      put_task_struct() will then crash the kernel with the following oops:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
        IP: [<ffffffff81217d34>] proc_pid_permission+0x64/0xe0
        PGD 112075067 PUD 112814067 PMD 0
        Oops: 0002 [#1] PREEMPT SMP
      
      This is a regression introduced by commit 0499680a ("procfs: add hidepid=
      and gid= mount options").  The kernel should return -ESRCH if
      get_proc_task() failed.
      Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: Stephen Wilson <wilsons@start.ca>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a2ef990a
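The fix described above follows a common shape: check the lookup result and return -ESRCH before anything dereferences it. The sketch below shows that shape in userspace. `struct demo_task`, `demo_get_task()`, `demo_put_task()` and `demo_task_permission()` are hypothetical stand-ins, not the procfs functions themselves.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical reference-counted task record. */
struct demo_task {
	int pid;
	int refs;
};

static struct demo_task demo_tasks[] = { { 1, 0 }, { 42, 0 } };

/* Look up a task by pid; take a reference, or return NULL if absent. */
static struct demo_task *demo_get_task(int pid)
{
	size_t i;

	for (i = 0; i < sizeof(demo_tasks) / sizeof(demo_tasks[0]); i++) {
		if (demo_tasks[i].pid == pid) {
			demo_tasks[i].refs++;
			return &demo_tasks[i];
		}
	}
	return NULL;
}

/* Drop a reference; must never be called with NULL. */
static void demo_put_task(struct demo_task *task)
{
	assert(task != NULL);
	task->refs--;
}

/* Return 0 if the task exists, or -ESRCH if the lookup failed. */
static int demo_task_permission(int pid)
{
	struct demo_task *task = demo_get_task(pid);

	if (!task)
		return -ESRCH;	/* bail out before touching the task */

	/* ... a real permission check would inspect the task here ... */
	demo_put_task(task);
	return 0;
}
```

The early return keeps the put call paired only with a successful get, so the reference count stays balanced on both paths.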
  2. 12 Jan, 2012 14 commits
    • Anton Vorontsov's avatar
      x86: Get rid of 'dubious one-bit signed bitfield' sprase warning · bccd1729
      Anton Vorontsov authored
      This very noisy sparse warning appears on almost every file in the
      kernel:
      
        CHECK   init/main.c
        arch/x86/include/asm/thread_info.h:43:55: error: dubious one-bit signed bitfield
        arch/x86/include/asm/thread_info.h:44:46: error: dubious one-bit signed bitfield
      
      This patch changes sig_on_uaccess_error and uaccess_err flags to unsigned
      type and thus fixes the warning.
      Signed-off-by: Anton Vorontsov <cbouatmailru@gmail.com>
      Acked-by: Andy Lutomirski <luto@mit.edu>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bccd1729
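To see why sparse calls a one-bit signed bitfield dubious, consider the small sketch below. The two structs are hypothetical and are not the kernel's `struct thread_info`. A one-bit signed field has no room for the value 1, whereas the unsigned variant stores 0 and 1 as a flag normally expects.

```c
#include <assert.h>

/* Hypothetical flag holders; not the kernel's struct thread_info. */
struct demo_signed_flag {
	signed int err:1;	/* can represent only 0 and -1 */
};

struct demo_unsigned_flag {
	unsigned int err:1;	/* can represent 0 and 1 */
};

/* Store v in a one-bit signed field and read it back. */
static int demo_store_signed(int v)
{
	struct demo_signed_flag f = { 0 };

	f.err = v;
	return f.err;
}

/* Store v in a one-bit unsigned field and read it back. */
static unsigned int demo_store_unsigned(unsigned int v)
{
	struct demo_unsigned_flag f = { 0 };

	f.err = v;
	return f.err;
}
```

Code that only tests such a flag for truth keeps working either way, but a comparison such as `flag == 1` can never succeed on the signed variant, which is the kind of surprise the warning is meant to prevent.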
    • Linus Torvalds's avatar
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · a429638c
      Linus Torvalds authored
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (526 commits)
        ASoC: twl6040 - Add method to query optimum PDM_DL1 gain
        ALSA: hda - Fix the lost power-setup of seconary pins after PM resume
        ALSA: usb-audio: add Yamaha MOX6/MOX8 support
        ALSA: virtuoso: add S/PDIF input support for all Xonars
        ALSA: ice1724 - Support for ooAoo SQ210a
        ALSA: ice1724 - Allow card info based on model only
        ALSA: ice1724 - Create capture pcm only for ADC-enabled configurations
        ALSA: hdspm - Provide unique driver id based on card serial
        ASoC: Dynamically allocate the rtd device for a non-empty release()
        ASoC: Fix recursive dependency due to select ATMEL_SSC in SND_ATMEL_SOC_SSC
        ALSA: hda - Fix the detection of "Loopback Mixing" control for VIA codecs
        ALSA: hda - Return the error from get_wcaps_type() for invalid NIDs
        ALSA: hda - Use auto-parser for HP laptops with cx20459 codec
        ALSA: asihpi - Fix potential Oops in snd_asihpi_cmode_info()
        ALSA: hdsp - Fix potential Oops in snd_hdsp_info_pref_sync_ref()
        ALSA: hda/cirrus - support for iMac12,2 model
        ASoC: cx20442: add bias control over a platform provided regulator
        ALSA: usb-audio - Avoid flood of frame-active debug messages
        ALSA: snd-usb-us122l: Delete calls to preempt_disable
        mfd: Put WM8994 into cache only mode when suspending
        ...
      
      Fix up trivial conflicts in:
       - arch/arm/mach-s3c64xx/mach-crag6410.c:
      	renamed speyside_wm8962 to tobermory, added littlemill right
      	next to it
       - drivers/base/regmap/{regcache.c,regmap.c}:
      	duplicate diff that had already come in with other changes in
      	the regmap tree
      a429638c
    • Bjorn Helgaas's avatar
      x86/PCI: build amd_bus.o only when CONFIG_AMD_NB=y · 5cf9a4e6
      Bjorn Helgaas authored
      We only need amd_bus.o for AMD systems with PCI.  arch/x86/pci/Makefile
      already depends on CONFIG_PCI=y, so this patch just adds the dependency
      on CONFIG_AMD_NB.
      
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: stable@kernel.org	# 2.6.34+ (needs adjustment for k8 -> amd rename)
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5cf9a4e6
    • Takashi Iwai's avatar
      Merge branch 'topic/hda' into for-linus · 9e4ce164
      Takashi Iwai authored
      9e4ce164
    • Takashi Iwai's avatar
      Merge branch 'topic/misc' into for-linus · 627b7962
      Takashi Iwai authored
      627b7962
    • Takashi Iwai's avatar
      Merge branch 'for-3.3' of... · 29abceb6
      Takashi Iwai authored
      Merge branch 'for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into topic/asoc
      29abceb6
    • Linus Torvalds's avatar
      Merge tag 'rmobile-for-linus' of git://github.com/pmundt/linux-sh · 4c4d285a
      Linus Torvalds authored
      SH/R-Mobile updates for 3.3 merge window.
      
      * tag 'rmobile-for-linus' of git://github.com/pmundt/linux-sh: (32 commits)
        arm: mach-shmobile: add a resource name for shdma
        ARM: mach-shmobile: r8a7779 SMP support V3
        ARM: mach-shmobile: Add kota2 defconfig.
        ARM: mach-shmobile: Add marzen defconfig.
        ARM: mach-shmobile: r8a7779 power domain support V2
        ARM: mach-shmobile: Fix up marzen build for recent GIC changes.
        ARM: mach-shmobile: r8a7779 PFC function support
        ARM: mach-shmobile: Flush caches in platform_cpu_die()
        ARM: mach-shmobile: Allow SoC specific CPU kill code
        ARM: mach-shmobile: Fix headsmp.S code to use CPUINIT
        ARM: mach-shmobile: clock-r8a7779: clkz/clkzs support
        ARM: mach-shmobile: clock-r8a7779: add DIV4 clock support
        ARM: mach-shmobile: Marzen LAN89218 support
        ARM: mach-shmobile: Marzen SCIF2/SCIF4 support
        ARM: mach-shmobile: r8a7779 PFC GPIO-only support V2
        ARM: mach-shmobile: r8a7779 and Marzen base support V2
        sh: pfc: Unlock register support
        sh: pfc: Variable bitfield width config register support
        sh: pfc: Add config_reg_helper() function
        sh: pfc: Convert index to field and value pair
        ...
      4c4d285a
    • Linus Torvalds's avatar
      Merge tag 'sh-for-linus' of git://github.com/pmundt/linux-sh · 56c8bc3b
      Linus Torvalds authored
      SuperH updates for 3.3 merge window.
      
      * tag 'sh-for-linus' of git://github.com/pmundt/linux-sh: (38 commits)
        sh: magicpanelr2: Update for parse_mtd_partitions() fallout.
        sh: mach-rsk: Update for parse_mtd_partitions() fallout.
        sh: sh2a: Improve cache flush/invalidate functions
        sh: also without PM_RUNTIME pm_runtime.o must be built
        sh: add a resource name for shdma
        sh: Remove redundant try_to_freeze() invocations.
        sh: Ensure IRQs are enabled across do_notify_resume().
        sh: Fix up store queue code for subsys_interface changes.
        sh: clkfwk: sh_clk_init_parent() should be called after clk_register()
        sh: add platform_device for renesas_usbhs in board-sh7757lcr
        sh: modify clock-sh7757 for renesas_usbhs
        sh: pfc: ioremap() support
        sh: use ioread32/iowrite32 and mapped_reg for div6
        sh: use ioread32/iowrite32 and mapped_reg for div4
        sh: use ioread32/iowrite32 and mapped_reg for mstp32
        sh: extend clock struct with mapped_reg member
        sh: clkfwk: clock-sh73a0: all div6_clks use SH_CLK_DIV6_EXT()
        sh: clkfwk: clock-sh7724: all div6_clks use SH_CLK_DIV6_EXT()
        sh: clock-sh7723: add CLKDEV_ICK_ID for cleanup
        serial: sh-sci: Handle GPIO function requests.
        ...
      56c8bc3b
    • Linus Torvalds's avatar
      Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · b8bf17d3
      Linus Torvalds authored
      * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched: Fix lockup by limiting load-balance retries on lock-break
        sched: Fix CONFIG_CGROUP_SCHED dependency
        sched: Remove empty #ifdefs
      b8bf17d3
    • Paul Mundt's avatar
      sh: magicpanelr2: Update for parse_mtd_partitions() fallout. · 1c1744cc
      Paul Mundt authored
      Follows the RSK+ change for the same rationale.
      Signed-off-by: Paul Mundt <lethal@linux-sh.org>
      1c1744cc
    • Paul Mundt's avatar
      sh: mach-rsk: Update for parse_mtd_partitions() fallout. · 603129af
      Paul Mundt authored
      The RSK+ setup code was doing some pretty dubious things with
      parse_mtd_partitions() in order to populate the physmap-flash map
      platform data. The physmap-flash driver contains all of the functionality
      that we require already, so simply drop the special casing and pad out
      the platform data accordingly.
      Signed-off-by: Paul Mundt <lethal@linux-sh.org>
      603129af
    • Paul Mundt's avatar
      Merge branch 'sh/nommu' into sh-latest · b1bdd255
      Paul Mundt authored
      b1bdd255
    • Phil Edworthy's avatar
      sh: sh2a: Improve cache flush/invalidate functions · c1537b48
      Phil Edworthy authored
      The cache functions lock out interrupts for long periods; this patch
      reduces the impact when operating on large address ranges. In such
      cases it will:
      - Invalidate the entire cache rather than individual addresses.
      - Do nothing when flushing the operand cache in write-through mode.
      - When flushing the operand cache in write-back mode, index the
        search for matching addresses on the cache entries instead of the
        addresses to flush.
      
      Note: sh2a__flush_purge_region was only invalidating the operand
      cache; this patch also adds the flush.
      Signed-off-by: Phil Edworthy <phil.edworthy@renesas.com>
      Signed-off-by: Paul Mundt <lethal@linux-sh.org>
      c1537b48
    • Paul Mundt's avatar
      Merge branch 'sh/hwblk' into sh-latest · 9d14070f
      Paul Mundt authored
      9d14070f