1. 01 Aug, 2012 40 commits
    • Merge branch 'for-3.6/core' of git://git.kernel.dk/linux-block · 8cf1a3fc
      Linus Torvalds authored
      Pull core block IO bits from Jens Axboe:
       "The most complicated part of this is the request allocation rework by
        Tejun, which has been queued up for a long time and has been in
        for-next ditto as well.
      
        There are a few commits from yesterday and today, mostly trivial and
        obvious fixes.  So I'm pretty confident that it is sound.  It's also
        smaller than usual."
      
      * 'for-3.6/core' of git://git.kernel.dk/linux-block:
        block: remove dead func declaration
        block: add partition resize function to blkpg ioctl
        block: uninitialized ioc->nr_tasks triggers WARN_ON
        block: do not artificially constrain max_sectors for stacking drivers
        blkcg: implement per-blkg request allocation
        block: prepare for multiple request_lists
        block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv
        blkcg: inline bio_blkcg() and friends
        block: allocate io_context upfront
        block: refactor get_request[_wait]()
        block: drop custom queue draining used by scsi_transport_{iscsi|fc}
        mempool: add @gfp_mask to mempool_create_node()
        blkcg: make root blkcg allocation use %GFP_KERNEL
        blkcg: __blkg_lookup_create() doesn't need radix preload
    • Merge branch 'for-next' of git://neil.brown.name/md · fcff06c4
      Linus Torvalds authored
      Pull md updates from NeilBrown.
      
      * 'for-next' of git://neil.brown.name/md:
        DM RAID: Add support for MD RAID10
        md/RAID1: Add missing case for attempting to repair known bad blocks.
        md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE.
        md/raid1: don't abort a resync on the first badblock.
        md: remove duplicated test on ->openers when calling do_md_stop()
        raid5: Add R5_ReadNoMerge flag which prevent bio from merging at block layer
        md/raid1: prevent merging too large request
        md/raid1: read balance chooses idlest disk for SSD
        md/raid1: make sequential read detection per disk based
        MD RAID10: Export md_raid10_congested
        MD: Move macros from raid1*.h to raid1*.c
        MD RAID1: rename mirror_info structure
        MD RAID10: rename mirror_info structure
        MD RAID10: Fix compiler warning.
        raid5: add a per-stripe lock
        raid5: remove unnecessary bitmap write optimization
        raid5: lockless access raid5 overrided bi_phys_segments
        raid5: reduce chance release_stripe() taking device_lock
    • locks: remove unused lm_release_private · 068535f1
      J. Bruce Fields authored
      In commit 3b6e2723 ("locks: prevent side-effects of
      locks_release_private before file_lock is initialized") we removed the
      last user of lm_release_private without removing the field itself.
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • DM RAID: Add support for MD RAID10 · 63f33b8d
      Jonathan Brassow authored
      Support the MD RAID10 personality through dm-raid.c
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • Merge commit 'c039c332' into md · bb181e2e
      NeilBrown authored
      Pull in pre-requisites for adding raid10 support to dm-raid.
    • block: remove dead func declaration · 80799fbb
      Yuanhan Liu authored
      The __generic_unplug_device() function was removed by commit
      7eaceacc, which forgot to remove its declaration at the same time.
      Remove it here.
      Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: add partition resize function to blkpg ioctl · c83f6bf9
      Vivek Goyal authored
      Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
      allows altering the size of an existing partition, even if it is currently
      in use.
      
      This patch guards hd_struct->nr_sects with a sequence counter, because
      a partition might be extended while IO is happening to it, and the update
      of nr_sects can be non-atomic on 32-bit machines with a 64-bit sector_t.
      This can lead to issues such as reading an inconsistent partition size.
      A sequence counter is used so that readers don't have to take the bdev
      mutex lock, as sector_in_part() is called very frequently.
      
      All accesses to hd_struct->nr_sects should now go through the sequence
      counter read/update helpers part_nr_sects_read()/part_nr_sects_write().
      There is one exception, though: set_capacity()/get_capacity(). I think
      that theoretically the race should exist there too, but this patch does
      not modify set_capacity()/get_capacity() because of the sheer number of
      call sites, and I am afraid that the change might break something. I have
      left that as a TODO item; we can handle it later if need be. This patch
      does not introduce any new races w.r.t. set_capacity()/get_capacity().
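
The reader/writer pattern described above can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel implementation: `struct part_sketch`, `part_size_write()` and `part_size_read()` are hypothetical names, C11 atomics stand in for the kernel's seqcount primitives, and the 64-bit size is deliberately split into two 32-bit halves to mimic a 32-bit machine where the update is not atomic. Writers are assumed to be serialized by the caller (in the commit message that role is played by the bdev mutex).

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for a partition; not the kernel's hd_struct. */
struct part_sketch {
    atomic_uint seq;            /* even: stable, odd: update in progress */
    _Atomic uint32_t lo;        /* low 32 bits of the sector count */
    _Atomic uint32_t hi;        /* high 32 bits of the sector count */
};

/* Writer side. Assumption: callers serialize writers themselves. */
static void part_size_write(struct part_sketch *p, uint64_t nr_sects)
{
    atomic_fetch_add(&p->seq, 1);                    /* counter becomes odd */
    atomic_store(&p->lo, (uint32_t)nr_sects);
    atomic_store(&p->hi, (uint32_t)(nr_sects >> 32));
    atomic_fetch_add(&p->seq, 1);                    /* counter becomes even again */
}

/* Reader side: takes no lock, retries if a writer was active or intervened. */
static uint64_t part_size_read(struct part_sketch *p)
{
    unsigned int before, after;
    uint32_t lo, hi;

    do {
        before = atomic_load(&p->seq);
        lo = atomic_load(&p->lo);
        hi = atomic_load(&p->hi);
        after = atomic_load(&p->seq);
    } while ((before & 1u) || before != after);

    return ((uint64_t)hi << 32) | lo;
}
```

A caller would zero-initialize a `struct part_sketch`, call `part_size_write()` under its own writer lock, and call `part_size_read()` from any thread. The sketch uses the default sequentially consistent atomics for simplicity; a tuned version would likely use weaker orderings.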
      
      v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Phillip Susi <psusi@ubuntu.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: uninitialized ioc->nr_tasks triggers WARN_ON · 4638a83e
      Olof Johansson authored
      Hi,
      
      I'm using the old-fashioned 'dump' backup tool, and I noticed that it spews the
      below warning as of 3.5-rc1 and later (3.4 is fine):
      
      [   10.886893] ------------[ cut here ]------------
      [   10.886904] WARNING: at include/linux/iocontext.h:140 copy_process+0x1488/0x1560()
      [   10.886905] Hardware name: Bochs
      [   10.886906] Modules linked in:
      [   10.886908] Pid: 2430, comm: dump Not tainted 3.5.0-rc7+ #27
      [   10.886908] Call Trace:
      [   10.886911]  [<ffffffff8107ce8a>] warn_slowpath_common+0x7a/0xb0
      [   10.886912]  [<ffffffff8107ced5>] warn_slowpath_null+0x15/0x20
      [   10.886913]  [<ffffffff8107c088>] copy_process+0x1488/0x1560
      [   10.886914]  [<ffffffff8107c244>] do_fork+0xb4/0x340
      [   10.886918]  [<ffffffff8108effa>] ? recalc_sigpending+0x1a/0x50
      [   10.886919]  [<ffffffff8108f6b2>] ? __set_task_blocked+0x32/0x80
      [   10.886920]  [<ffffffff81091afa>] ? __set_current_blocked+0x3a/0x60
      [   10.886923]  [<ffffffff81051db3>] sys_clone+0x23/0x30
      [   10.886925]  [<ffffffff8179bd73>] stub_clone+0x13/0x20
      [   10.886927]  [<ffffffff8179baa2>] ? system_call_fastpath+0x16/0x1b
      [   10.886928] ---[ end trace 32a14af7ee6a590b ]---
      
      Reproducing is easy, I can hit it on a KVM system with a very basic
      config (x86_64 make defconfig + enable the drivers needed). To hit it,
      just install dump (on debian/ubuntu, not sure what the package might be
      called on Fedora), and:
      
      dump -o -f /tmp/foo /
      
      You'll see the warning in dmesg once it forks off the I/O process and
      starts dumping filesystem contents.
      
      I bisected it down to the following commit:
      
      commit f6e8d01b
      Author: Tejun Heo <tj@kernel.org>
      Date:   Mon Mar 5 13:15:26 2012 -0800
      
          block: add io_context->active_ref
      
          Currently ioc->nr_tasks is used to decide two things - whether an ioc
          is done issuing IOs and whether it's shared by multiple tasks.  This
          patch separate out the first into ioc->active_ref, which is acquired
          and released using {get|put}_io_context_active() respectively.
      
          This will be used to associate bio's with a given task.  This patch
          doesn't introduce any visible behavior change.
          Signed-off-by: Tejun Heo <tj@kernel.org>
          Cc: Vivek Goyal <vgoyal@redhat.com>
          Signed-off-by: Jens Axboe <axboe@kernel.dk>
      
      It seems like the init of ioc->nr_tasks was removed in that patch,
      so it starts out at 0 instead of 1.
      
      Tejun, is the right thing here to add back the init, or should something else
      be done?
      
      The below patch removes the warning, but I haven't done any more extensive
      testing on it.
      Signed-off-by: Olof Johansson <olof@lixom.net>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: do not artificially constrain max_sectors for stacking drivers · fe86cdce
      Mike Snitzer authored
      blk_set_stacking_limits is intended to allow stacking drivers to build
      up the limits of the stacked device based on the underlying devices'
      limits.  But defaulting 'max_sectors' to BLK_DEF_MAX_SECTORS (1024)
      doesn't allow the stacking driver to inherit a max_sectors larger than
      1024 -- due to blk_stack_limits' use of min_not_zero.
      
      It is now clear that this artificial limit is getting in the way so
      change blk_set_stacking_limits's max_sectors to UINT_MAX (which allows
      stacking drivers like dm-multipath to inherit 'max_sectors' from the
      underlying paths).
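
The effect of the default can be illustrated with a small standalone sketch. It is not the kernel code: `min_not_zero_sketch()` and `stack_max_sectors()` are hypothetical helpers that only mirror the "smaller value, treating 0 as unset" behaviour the message attributes to min_not_zero, and the constant 1024 is taken from the text above.

```c
#include <limits.h>

/* Values named in the commit message; the macro names are made up for this sketch. */
#define SKETCH_OLD_STACKING_DEFAULT 1024u      /* BLK_DEF_MAX_SECTORS per the text */
#define SKETCH_NEW_STACKING_DEFAULT UINT_MAX   /* the new starting point */

/* Smaller of two values, where 0 means "no limit set". */
static unsigned int min_not_zero_sketch(unsigned int a, unsigned int b)
{
    if (a == 0)
        return b;
    if (b == 0)
        return a;
    return a < b ? a : b;
}

/* Fold one underlying device's max_sectors into the stacked device's limit. */
static unsigned int stack_max_sectors(unsigned int top, unsigned int bottom)
{
    return min_not_zero_sketch(top, bottom);
}
```

Starting from `SKETCH_OLD_STACKING_DEFAULT`, stacking a path that supports 2048 sectors still yields 1024, because the minimum can never rise above the starting value. Starting from `SKETCH_NEW_STACKING_DEFAULT`, the same path yields 2048, and stacking further paths keeps the smallest limit among them.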
      Reported-by: Vijay Chauhan <vijay.chauhan@netapp.com>
      Tested-by: Vijay Chauhan <vijay.chauhan@netapp.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6 · 2d534926
      Linus Torvalds authored
      Pull irqdomain changes from Grant Likely:
       "Round of refactoring and enhancements to irq_domain infrastructure.
        This series starts the process of simplifying irqdomain.  The ultimate
        goal is to merge LEGACY, LINEAR and TREE mappings into a single
        system, but had to back off from that after some last minute bugs.
        Instead it mainly reorganizes the code and ensures that the reverse
        map gets populated when the irq is mapped instead of the first time it
        is looked up.
      
        Merging of the irq_domain types is deferred to v3.7
      
        In other news, this series adds helpers for creating static mappings
        on a linear or tree mapping."
      
      * tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6:
        irqdomain: Improve diagnostics when a domain mapping fails
        irqdomain: eliminate slow-path revmap lookups
        irqdomain: Fix irq_create_direct_mapping() to test irq_domain type.
        irqdomain: Eliminate dedicated radix lookup functions
        irqdomain: Support for static IRQ mapping and association.
        irqdomain: Always update revmap when setting up a virq
        irqdomain: Split disassociating code into separate function
        irq_domain: correct a minor wrong comment for linear revmap
        irq_domain: Standardise legacy/linear domain selection
        irqdomain: Make ops->map hook optional
        irqdomain: Remove unnecessary test for IRQ_DOMAIN_MAP_LEGACY
        irqdomain: Simple NUMA awareness.
        devicetree: add helper inline for retrieving a node's full name
    • Merge branch 'akpm' (Andrew's patch-bomb) · ac694dbd
      Linus Torvalds authored
      Merge Andrew's second set of patches:
       - MM
       - a few random fixes
       - a couple of RTC leftovers
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits)
        rtc/rtc-88pm80x: remove unneed devm_kfree
        rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
        mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
        tmpfs: distribute interleave better across nodes
        mm: remove redundant initialization
        mm: warn if pg_data_t isn't initialized with zero
        mips: zero out pg_data_t when it's allocated
        memcg: gix memory accounting scalability in shrink_page_list
        mm/sparse: remove index_init_lock
        mm/sparse: more checks on mem_section number
        mm/sparse: optimize sparse_index_alloc
        memcg: add mem_cgroup_from_css() helper
        memcg: further prevent OOM with too many dirty pages
        memcg: prevent OOM with too many dirty pages
        mm: mmu_notifier: fix freed page still mapped in secondary MMU
        mm: memcg: only check anon swapin page charges for swap cache
        mm: memcg: only check swap cache pages for repeated charging
        mm: memcg: split swapin charge function into private and public part
        mm: memcg: remove needless !mm fixup to init_mm when charging
        mm: memcg: remove unneeded shmem charge type
        ...
    • Merge tag 'vfio-for-v3.6' of git://github.com/awilliam/linux-vfio · a40a1d3d
      Linus Torvalds authored
      Pull VFIO core from Alex Williamson:
       "This series includes the VFIO userspace driver interface for the 3.6
        kernel merge window.  This driver is intended to provide a secure
        interface for device access using IOMMU protection for applications
        like assignment of physical devices to virtual machines.
      
        Qemu will be the first user of this interface, enabling assignment of
        PCI devices to Qemu guests.  This interface is intended to eventually
        replace the x86-specific assignment mechanism currently available in
        KVM.
      
        This interface has the advantage of being more secure, by working with
        IOMMU groups to ensure device isolation and providing its own
        filtered resource access mechanism, and also more flexible, in not
        being x86 or KVM specific (extensions to enable POWER are already
        working).
      
        This driver is originally the work of Tom Lyon, but has since been
        handed over to me and gone through a complete overhaul thanks to the
        input from David Gibson, Ben Herrenschmidt, Chris Wright, Joerg
        Roedel, and others.  This driver has been available in linux-next for
        the last month."
      
      Paul Mackerras says:
       "I would be glad to see it go in since we want to use it with KVM on
        PowerPC.  If possible we'd like the PowerPC bits for it to go in as
        well."
      
      * tag 'vfio-for-v3.6' of git://github.com/awilliam/linux-vfio:
        vfio: Add PCI device driver
        vfio: Type1 IOMMU implementation
        vfio: Add documentation
        vfio: VFIO core
    • Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random · 3e9a9708
      Linus Torvalds authored
      Pull random subsystem patches from Ted Ts'o:
       "This patch series contains a major revamp of how we collect entropy
        from interrupts for /dev/random and /dev/urandom.
      
        The goal is to address weaknesses discussed in the paper "Mining
        your Ps and Qs: Detection of Widespread Weak Keys in Network Devices",
        by Nadia Heninger, Zakir Durumeric, Eric Wustrow, J.  Alex Halderman,
        which will be published in the Proceedings of the 21st Usenix Security
        Symposium, August 2012.  (See https://factorable.net for more
        information and an extended version of the paper.)"
      
      Fix up trivial conflicts due to nearby changes in
      drivers/{mfd/ab3100-core.c, usb/gadget/omap_udc.c}
      
      * tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: (33 commits)
        random: mix in architectural randomness in extract_buf()
        dmi: Feed DMI table to /dev/random driver
        random: Add comment to random_initialize()
        random: final removal of IRQF_SAMPLE_RANDOM
        um: remove IRQF_SAMPLE_RANDOM which is now a no-op
        sparc/ldc: remove IRQF_SAMPLE_RANDOM which is now a no-op
        [ARM] pxa: remove IRQF_SAMPLE_RANDOM which is now a no-op
        board-palmz71: remove IRQF_SAMPLE_RANDOM which is now a no-op
        isp1301_omap: remove IRQF_SAMPLE_RANDOM which is now a no-op
        pxa25x_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
        omap_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
        goku_udc: remove IRQF_SAMPLE_RANDOM which was commented out
        uartlite: remove IRQF_SAMPLE_RANDOM which is now a no-op
        drivers: hv: remove IRQF_SAMPLE_RANDOM which is now a no-op
        xen-blkfront: remove IRQF_SAMPLE_RANDOM which is now a no-op
        n2_crypto: remove IRQF_SAMPLE_RANDOM which is now a no-op
        pda_power: remove IRQF_SAMPLE_RANDOM which is now a no-op
        i2c-pmcmsp: remove IRQF_SAMPLE_RANDOM which is now a no-op
        input/serio/hp_sdc.c: remove IRQF_SAMPLE_RANDOM which is now a no-op
        mfd: remove IRQF_SAMPLE_RANDOM which is now a no-op
        ...
    • Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband · 941c8726
      Linus Torvalds authored
      Pull final RDMA changes from Roland Dreier:
       - Fix IPoIB to stop using unsafe linkage between networking neighbour
         layer and private path database.
       - Small fixes for bugs found by Fengguang Wu's automated builds.
      
      * tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
        IPoIB: Use a private hash table for path lookup in xmit path
        IB/qib: Fix size of cc_supported_table_entries
        RDMA/ucma: Convert open-coded equivalent to memdup_user()
        RDMA/ocrdma: Fix check of GSI CQs
        RDMA/cma: Use PTR_RET rather than if (IS_ERR(...)) + PTR_ERR
    • Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · 8762541f
      Linus Torvalds authored
      Pull second set of media updates from Mauro Carvalho Chehab:
      
       - radio API: add support to work with radio frequency bands
      
       - new AM/FM radio drivers: radio-shark, radio-shark2
      
       - new Remote Controller USB driver: iguanair
      
       - conversion of several drivers to the v4l2 core control framework
      
       - new board additions at existing drivers
      
       - the remaining (and vast majority of the patches) are due to
         drivers/DocBook fixes/cleanups.
      
      * 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (154 commits)
        [media] radio-tea5777: use library for 64bits div
        [media] tlg2300: Declare MODULE_FIRMWARE usage
        [media] lgs8gxx: Declare MODULE_FIRMWARE usage
        [media] xc5000: Add MODULE_FIRMWARE statements
        [media] s2255drv: Add MODULE_FIRMWARE statement
        [media] dib8000: move dereference after check for NULL
        [media] Documentation: Update cardlists
        [media] bttv: add support for Aposonic W-DVR
        [media] cx25821: Remove bad strcpy to read-only char*
        [media] pms.c: remove duplicated include
        [media] smiapp-core.c: remove duplicated include
        [media] via-camera: pass correct format settings to sensor
        [media] rtl2832.c: minor cleanup
        [media] Add support for the IguanaWorks USB IR Transceiver
        [media] Minor cleanups for MCE USB
        [media] drivers/media/dvb/siano/smscoreapi.c: use list_for_each_entry
        [media] Use a named union in struct v4l2_ioctl_info
        [media] mceusb: Add Twisted Melon USB IDs
        [media] staging/media/solo6x10: use module_pci_driver macro
        [media] staging/media/dt3155v4l: use module_pci_driver macro
        ...
      
      Conflicts:
      	Documentation/feature-removal-schedule.txt
    • Merge tag 'nfs-for-3.6-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · 6dbb35b0
      Linus Torvalds authored
      Pull second wave of NFS client updates from Trond Myklebust:
      
       - Patches from Bryan to allow splitting of the NFSv2/v3/v4 code into
         separate modules.
      
       - Fix Oopses in the NFSv4 idmapper
      
       - Fix a deadlock whereby rpciod tries to allocate a new socket and ends
         up recursing into the NFS code due to memory reclaim.
      
       - Increase the number of permitted callback connections.
      
      * tag 'nfs-for-3.6-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
        nfs: explicitly reject LOCK_MAND flock() requests
        nfs: increase number of permitted callback connections.
        SUNRPC: return negative value in case rpcbind client creation error
        NFS: Convert v4 into a module
        NFS: Convert v3 into a module
        NFS: Convert v2 into a module
        NFS: Keep module parameters in the generic NFS client
        NFS: Split out remaining NFS v4 inode functions
        NFS: Pass super operations and xattr handlers in the nfs_subversion
        NFS: Only initialize the ACL client in the v3 case
        NFS: Create a try_mount rpc op
        NFS: Remove the NFS v4 xdev mount function
        NFS: Add version registering framework
        NFS: Fix a number of bugs in the idmapper
        nfs: skip commit in releasepage if we're freeing memory for fs-related reasons
        sunrpc: clarify comments on rpc_make_runnable
        pnfsblock: bail out partial page IO
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · fd37ce34
      Linus Torvalds authored
      Pull networking update from David S. Miller:
       "I think Eric Dumazet and I have dealt with all of the known routing
        cache removal fallout.  Some other minor fixes all around.
      
        1) Fix RCU of cached routes, particularly of output routes which require
           liberation via call_rcu() instead of call_rcu_bh().  From Eric
           Dumazet.
      
        2) Make sure we purge net device references in cached routes properly.
      
        3) TG3 driver bug fixes from Michael Chan.
      
        4) Fix reported 'expires' value in ipv6 routes, from Li Wei.
      
        5) TUN driver ioctl leaks kernel bytes to userspace, from Mathias
           Krause."
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (22 commits)
        ipv4: Properly purge netdev references on uncached routes.
        ipv4: Cache routes in nexthop exception entries.
        ipv4: percpu nh_rth_output cache
        ipv4: Restore old dst_free() behavior.
        bridge: make port attributes const
        ipv4: remove rt_cache_rebuild_count
        net: ipv4: fix RCU races on dst refcounts
        net: TCP early demux cleanup
        tun: Fix formatting.
        net/tun: fix ioctl() based info leaks
        tg3: Update version to 3.124
        tg3: Fix race condition in tg3_get_stats64()
        tg3: Add New 5719 Read DMA workaround
        tg3: Fix Read DMA workaround for 5719 A0.
        tg3: Request APE_LOCK_PHY before PHY access
        ipv6: fix incorrect route 'expires' value passed to userspace
        mISDN: Bugfix only few bytes are transfered on a connection
        seeq: use PTR_RET at init_module of driver
        bnx2x: remove cast around the kmalloc in bnx2x_prev_mark_path
        ipv4: clean up put_child
        ...
    • rtc/rtc-88pm80x: remove unneed devm_kfree · 437ea90c
      Devendra Naga authored
      devm_kzalloc() doesn't need a matching devm_kfree(); the freeing mechanism
      will trigger when the driver unloads.
      Signed-off-by: Devendra Naga <devendra.aaru@gmail.com>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: Ashish Jangam <ashish.jangam@kpitcummins.com>
      Cc: David Dajun Chen <dchen@diasemi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails · 7ead5511
      Devendra Naga authored
      In probe, ret is assigned the return value of PTR_ERR() right after
      rtc_register_driver(), although that assignment belongs inside the
      if (IS_ERR(ptr)) check, which is the path taken when the function fails.
      Signed-off-by: Devendra Naga <devendra.aaru@gmail.com>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: Ashish Jangam <ashish.jangam@kpitcummins.com>
      Cc: David Dajun Chen <dchen@diasemi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables · d833352a
      Mel Gorman authored
      If a process creates a large hugetlbfs mapping that is eligible for page
      table sharing and forks heavily with children some of whom fault and
      others which destroy the mapping then it is possible for page tables to
      get corrupted.  Some teardowns of the mapping encounter a "bad pmd" and
      output a message to the kernel log.  The final teardown will trigger a
      BUG_ON in mm/filemap.c.
      
      This was reproduced in 3.4 but is known to have existed for a long time
      and goes back at least as far as 2.6.37.  It was probably introduced
      in 2.6.20 by [39dde65c: shared page table for hugetlb page].  The messages
      look like this:
      
      [  ..........] Lots of bad pmd messages followed by this
      [  127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
      [  127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
      [  127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
      [  127.186778] ------------[ cut here ]------------
      [  127.186781] kernel BUG at mm/filemap.c:134!
      [  127.186782] invalid opcode: 0000 [#1] SMP
      [  127.186783] CPU 7
      [  127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
      [  127.186801]
      [  127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
      [  127.186804] RIP: 0010:[<ffffffff810ed6ce>]  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
      [  127.186809] RSP: 0000:ffff8804144b5c08  EFLAGS: 00010002
      [  127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
      [  127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
      [  127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
      [  127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
      [  127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
      [  127.186815] FS:  00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
      [  127.186816] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
      [  127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [  127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
      [  127.186821] Stack:
      [  127.186822]  ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
      [  127.186824]  ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
      [  127.186825]  ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
      [  127.186827] Call Trace:
      [  127.186829]  [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80
      [  127.186832]  [<ffffffff811bc925>] truncate_hugepages+0x115/0x220
      [  127.186834]  [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30
      [  127.186837]  [<ffffffff811655c7>] evict+0xa7/0x1b0
      [  127.186839]  [<ffffffff811657a3>] iput_final+0xd3/0x1f0
      [  127.186840]  [<ffffffff811658f9>] iput+0x39/0x50
      [  127.186842]  [<ffffffff81162708>] d_kill+0xf8/0x130
      [  127.186843]  [<ffffffff81162812>] dput+0xd2/0x1a0
      [  127.186845]  [<ffffffff8114e2d0>] __fput+0x170/0x230
      [  127.186848]  [<ffffffff81236e0e>] ? rb_erase+0xce/0x150
      [  127.186849]  [<ffffffff8114e3ad>] fput+0x1d/0x30
      [  127.186851]  [<ffffffff81117db7>] remove_vma+0x37/0x80
      [  127.186853]  [<ffffffff81119182>] do_munmap+0x2d2/0x360
      [  127.186855]  [<ffffffff811cc639>] sys_shmdt+0xc9/0x170
      [  127.186857]  [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b
      [  127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
      [  127.186868] RIP  [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160
      [  127.186870]  RSP <ffff8804144b5c08>
      [  127.186871] ---[ end trace 7cbac5d1db69f426 ]---
      
      The bug is a race and not always easy to reproduce.  To reproduce it I was
      doing the following on a single socket I7-based machine with 16G of RAM.
      
      $ hugeadm --pool-pages-max DEFAULT:13G
      $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
      $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
      $ for i in `seq 1 9000`; do ./hugetlbfs-test; done
      
      On my particular machine, it usually triggers within 10 minutes but
      enabling debug options can change the timing such that it never hits.
      Once the bug is triggered, the machine is in trouble and needs to be
      rebooted.  The machine will respond but processes accessing proc like "ps
      aux" will hang due to the BUG_ON.  shutdown will also hang and needs a
      hard reset or a sysrq-b.
      
      The basic problem is a race between page table sharing and teardown.  For
      the most part page table sharing depends on i_mmap_mutex.  In some cases,
      it is also taking the mm->page_table_lock for the PTE updates but with
      shared page tables, it is the i_mmap_mutex that is more important.
      
      Unfortunately it also appears to be insufficient.  Consider the following
      situation:
      
      Process A                                       Process B
      ---------                                       ---------
      hugetlb_fault                                   shmdt
                                                      LockWrite(mmap_sem)
                                                        do_munmap
                                                          unmap_region
                                                            unmap_vmas
                                                              unmap_single_vma
                                                                unmap_hugepage_range
                                                                  Lock(i_mmap_mutex)
                                                                  Lock(mm->page_table_lock)
                                                                  huge_pmd_unshare/unmap tables <--- (1)
                                                                  Unlock(mm->page_table_lock)
                                                                  Unlock(i_mmap_mutex)
        huge_pte_alloc                                      ...
          Lock(i_mmap_mutex)                                ...
          vma_prio_walk, find svma, spte                    ...
          Lock(mm->page_table_lock)                         ...
          share spte                                        ...
          Unlock(mm->page_table_lock)                       ...
          Unlock(i_mmap_mutex)                              ...
        hugetlb_no_page <--- (2)
                                                            free_pgtables
                                                              unlink_file_vma
                                                              hugetlb_free_pgd_range
                                                          remove_vma_list
      
      In this scenario, it is possible for Process A to share page tables with
      Process B that is trying to tear them down.  The i_mmap_mutex on its own
      does not prevent Process A walking Process B's page tables.  At (1) above,
      the page tables are not shared yet so it unmaps the PMDs.  Process A sets
      up page table sharing and at (2) faults a new entry.  Process B then trips
      up on it in free_pgtables.
      
      This patch fixes the problem by adding a new function
      __unmap_hugepage_range_final that is only called when the VMA is about to
      be destroyed.  This function clears VM_MAYSHARE during
      unmap_hugepage_range() under the i_mmap_mutex.  This makes the VMA
      ineligible for sharing and avoids the race.  Superficially this looks like
      it would then be vulnerable to truncate and madvise issues but hugetlbfs
      has its own truncate handlers so does not use unmap_mapping_range() and
      does not support madvise(DONTNEED).
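
A minimal userspace sketch of the idea, not the kernel code: the teardown path clears the "may share" flag while holding the same mutex that the fault path holds when it looks for a VMA to share with. All `model_*` names and `MODEL_VM_MAYSHARE` are hypothetical stand-ins, and a pthread mutex plays the role of i_mmap_mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel structures; names are illustrative. */
#define MODEL_VM_MAYSHARE 0x1u

struct model_mapping {
	pthread_mutex_t i_mmap_mutex;	/* models mapping->i_mmap_mutex */
};

struct model_vma {
	unsigned int vm_flags;
	struct model_mapping *mapping;
};

/*
 * Fault side: decide under i_mmap_mutex whether this VMA's page tables
 * may be shared.  Returns true if sharing would be set up.
 */
static bool model_try_share(struct model_vma *svma)
{
	bool shareable;

	pthread_mutex_lock(&svma->mapping->i_mmap_mutex);
	shareable = (svma->vm_flags & MODEL_VM_MAYSHARE) != 0;
	/* ... a real implementation would take a reference here ... */
	pthread_mutex_unlock(&svma->mapping->i_mmap_mutex);
	return shareable;
}

/*
 * Teardown side: the final unmap clears the flag under the same mutex,
 * so any later model_try_share() sees an ineligible VMA.
 */
static void model_unmap_final(struct model_vma *vma)
{
	pthread_mutex_lock(&vma->mapping->i_mmap_mutex);
	/* ... page tables would be unmapped here ... */
	vma->vm_flags &= ~MODEL_VM_MAYSHARE;
	pthread_mutex_unlock(&vma->mapping->i_mmap_mutex);
}
```

Because both sides serialize on one mutex, a fault either completes before the final unmap starts or observes the cleared flag afterwards.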
      
      This should be treated as a -stable candidate if it is merged.
      
      Test program is as follows. The test case was mostly written by Michal
      Hocko with a few minor changes to reproduce this bug.
      
      ==== CUT HERE ====
      
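      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <sys/types.h>
      #include <sys/ipc.h>
      #include <sys/shm.h>
      #include <sys/time.h>
      #include <sys/wait.h>

      static size_t huge_page_size = (2UL << 20);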
      static size_t nr_huge_page_A = 512;
      static size_t nr_huge_page_B = 5632;
      
      unsigned int get_random(unsigned int max)
      {
      	struct timeval tv;
      
      	gettimeofday(&tv, NULL);
      	srandom(tv.tv_usec);
      	return random() % max;
      }
      
      static void play(void *addr, size_t size)
      {
      	unsigned char *start = addr,
      		      *end = start + size,
      		      *a;
      	start += get_random(size/2);
      
      	/* we could iterate on huge pages but let's give it more time. */
      	for (a = start; a < end; a += 4096)
      		*a = 0;
      }
      
      int main(int argc, char **argv)
      {
      	key_t key = IPC_PRIVATE;
      	size_t sizeA = nr_huge_page_A * huge_page_size;
      	size_t sizeB = nr_huge_page_B * huge_page_size;
      	int shmidA, shmidB;
      	void *addrA = NULL, *addrB = NULL;
      	int nr_children = 300, n = 0;
      
      	if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
      		perror("shmget");
      		return 1;
      	}
      
      	if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
      		perror("shmat");
      		return 1;
      	}
      	if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
      		perror("shmget");
      		return 1;
      	}
      
      	if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
      		perror("shmat");
      		return 1;
      	}
      
      fork_child:
      	switch(fork()) {
      		case 0:
      			switch (n%3) {
      			case 0:
      				play(addrA, sizeA);
      				break;
      			case 1:
      				play(addrB, sizeB);
      				break;
      			case 2:
      				break;
      			}
      			break;
      		case -1:
      			perror("fork");
      			break;
      		default:
      			if (++n < nr_children)
      				goto fork_child;
      			play(addrA, sizeA);
      			break;
      	}
      	shmdt(addrA);
      	shmdt(addrB);
      	do {
      		wait(NULL);
      	} while (--n > 0);
      	shmctl(shmidA, IPC_RMID, NULL);
      	shmctl(shmidB, IPC_RMID, NULL);
      	return 0;
      }
      
      [akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build]
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d833352a
    • Nathan Zimmer's avatar
      tmpfs: distribute interleave better across nodes · 09c231cb
      Nathan Zimmer authored
      When tmpfs has the interleave memory policy, it always starts allocating
      for each file from node 0 at offset 0.  When there are many small files,
      the lower nodes fill up disproportionately.
      
      This patch spreads out node usage by starting files at nodes other than 0,
      by using the inode number to bias the starting node for interleave.
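
A minimal sketch of the biasing, assuming the simplest possible form: the inode number is added to the page index before taking the modulus over the node count. `interleave_node()` and its parameters are hypothetical names for illustration, not the kernel's mempolicy interface.

```c
#include <assert.h>

/*
 * Hypothetical helper: pick the node for page `index` of the file with
 * inode number `ino`, interleaving over `nr_nodes` nodes.  With ino
 * fixed at 0 this degenerates to the old behaviour, where every file
 * starts on node 0.
 */
static unsigned int interleave_node(unsigned long ino, unsigned long index,
				    unsigned int nr_nodes)
{
	assert(nr_nodes > 0);
	return (unsigned int)((ino + index) % nr_nodes);
}
```

With four nodes, single-page files with inode numbers 4, 5, 6 and 7 land on nodes 0, 1, 2 and 3 instead of all on node 0.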
      Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09c231cb
    • Minchan Kim's avatar
      mm: remove redundant initialization · 6527af5d
      Minchan Kim authored
      pg_data_t is zeroed before reaching free_area_init_core(), so remove the
      now unnecessary initializations.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6527af5d
    • Minchan Kim's avatar
      mm: warn if pg_data_t isn't initialized with zero · 88fdf75d
      Minchan Kim authored
      Warn if memory-hotplug/boot code doesn't initialize pg_data_t with zero
      when it is allocated.  Arch code and memory hotplug already initialize
      pg_data_t.  So this warning should never happen.  I select fields randomly
      near the beginning, middle and end of pg_data_t for checking.
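
A minimal sketch of such a spot check, under stated assumptions: `struct model_pgdat` and its field names are hypothetical and only stand in for a large structure with fields near the start, middle and end. The check does not prove the whole structure is zero; it only catches an allocator that did not clear it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a large, zero-on-allocation structure. */
struct model_pgdat {
	int first_field;		/* near the beginning */
	char padding_a[64];
	unsigned long middle_field;	/* near the middle */
	char padding_b[64];
	int last_field;			/* near the end */
};

/*
 * Cheap sanity check: sample three fields instead of scanning every
 * byte.  A caller would warn when this returns false.
 */
static bool model_pgdat_looks_zeroed(const struct model_pgdat *p)
{
	return !(p->first_field || p->middle_field || p->last_field);
}
```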
      
      This patch isn't about performance; it is about removing initialization
      code that otherwise has to be added whenever we add a new field to
      pg_data_t or zone.
      
      Andrew first suggested clearing pg_data_t in the MM core, but Tejun
      objected: in the future, some archs may initialize some fields in arch
      code and pass them to the generic MM code, so blindly clearing the
      structure in the MM core would be very annoying.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88fdf75d
    • Minchan Kim's avatar
      mips: zero out pg_data_t when it's allocated · 93180cec
      Minchan Kim authored
      This patch is preparation for the next patch which removes the zeroing of
      the pg_data_t in core MM.  All archs except MIPS already do this.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93180cec
    • Tim Chen's avatar
      memcg: fix memory accounting scalability in shrink_page_list · 69980e31
      Tim Chen authored
      I noticed that in a multi-process parallel file reading benchmark on an
      8-socket machine, throughput dropped by a factor of 8 when I ran the
      benchmark within a cgroup container.  I traced the problem to the
      following code path (see below), taken when we try to reclaim memory
      from the file cache.  res_counter_uncharge() is called on every
      reclaimed page, which creates heavy lock contention.  The patch below
      lets the reclaimed pages be uncharged from the resource counter in a
      batch, which recovers the lost throughput.
      
      Tim
      
           40.67%           usemem  [kernel.kallsyms]                   [k] _raw_spin_lock
                            |
                            --- _raw_spin_lock
                               |
                               |--92.61%-- res_counter_uncharge
                               |          |
                               |          |--100.00%-- __mem_cgroup_uncharge_common
                               |          |          |
                               |          |          |--100.00%-- mem_cgroup_uncharge_cache_page
                               |          |          |          __remove_mapping
                               |          |          |          shrink_page_list
                               |          |          |          shrink_inactive_list
                               |          |          |          shrink_mem_cgroup_zone
                               |          |          |          shrink_zone
                               |          |          |          do_try_to_free_pages
                               |          |          |          try_to_free_pages
                               |          |          |          __alloc_pages_nodemask
                               |          |          |          alloc_pages_current
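
A minimal userspace sketch of why batching helps, not the kernel implementation: the counter below records how often its (modelled) lock is taken, so the per-page and batched variants can be compared. `struct model_res_counter`, `MODEL_PAGE_SIZE` and the `model_*` functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL

/* Hypothetical stand-in for a lock-protected resource counter. */
struct model_res_counter {
	unsigned long usage;			/* bytes currently charged */
	unsigned long lock_acquisitions;	/* times the lock was taken */
};

/* One uncharge costs one lock round trip, whatever the amount. */
static void model_uncharge(struct model_res_counter *c, unsigned long bytes)
{
	c->lock_acquisitions++;	/* stands in for spin_lock()/spin_unlock() */
	assert(c->usage >= bytes);
	c->usage -= bytes;
}

/* Old behaviour: uncharge every reclaimed page individually. */
static void model_reclaim_unbatched(struct model_res_counter *c,
				    size_t nr_pages)
{
	for (size_t i = 0; i < nr_pages; i++)
		model_uncharge(c, MODEL_PAGE_SIZE);
}

/* Batched behaviour: accumulate locally, uncharge once at the end. */
static void model_reclaim_batched(struct model_res_counter *c,
				  size_t nr_pages)
{
	unsigned long pending = 0;

	for (size_t i = 0; i < nr_pages; i++)
		pending += MODEL_PAGE_SIZE;
	if (pending)
		model_uncharge(c, pending);
}
```

Reclaiming 32 pages takes the lock 32 times in the first variant and once in the second; the final usage is the same.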
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69980e31
    • Gavin Shan's avatar
      mm/sparse: remove index_init_lock · c1c95183
      Gavin Shan authored
      sparse_index_init() uses the index_init_lock spinlock to protect root
      mem_section assignment.  The lock is not necessary anymore because the
      function is called only during boot (during paging init which is executed
      only from a single CPU) and from the hotplug code (by add_memory() via
      arch_add_memory()) which uses mem_hotplug_mutex.
      
      The lock was introduced by 28ae55c9 ("sparsemem extreme: hotplug
      preparation") and sparse_index_init() was used only during boot at that
      time.
      
      Later when the hotplug code (and add_memory()) was introduced there was no
      synchronization so it was possible to online more sections from the same
      root probably (though I am not 100% sure about that).  The first
      synchronization has been added by 6ad696d2 ("mm: allow memory hotplug and
      hibernation in the same kernel") which was later replaced by the
      mem_hotplug_mutex - 20d6c96b ("mem-hotplug: introduce
      {un}lock_memory_hotplug()").
      
      Let's remove the lock as it is not needed and it makes the code more
      confusing.
      
      [mhocko@suse.cz: changelog]
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c1c95183
    • Gavin Shan's avatar
      mm/sparse: more checks on mem_section number · db36a461
      Gavin Shan authored
      __section_nr() was implemented to retrieve the corresponding memory
      section number according to its descriptor.  It's possible that the
      specified memory section descriptor doesn't exist in the global array.  So
      add more checking on that and report an error for a wrong case.
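
A minimal userspace sketch of a descriptor-to-number lookup that reports descriptors outside every root instead of returning a bogus number. The two-level layout, the sizes and all `model_*` names are assumptions for illustration; returning -1 is this sketch's way of signalling the error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NR_ROOTS		4
#define MODEL_SECTIONS_PER_ROOT	8

/* Hypothetical stand-in for a memory section descriptor. */
struct model_section {
	unsigned long map;
};

/* Each root is either NULL or an array of MODEL_SECTIONS_PER_ROOT entries. */
static struct model_section *model_roots[MODEL_NR_ROOTS];

/*
 * Return the section number of `ms`, or -1 if the descriptor does not
 * belong to any populated root.
 */
static long model_section_nr(const struct model_section *ms)
{
	uintptr_t p = (uintptr_t)ms;

	for (size_t r = 0; r < MODEL_NR_ROOTS; r++) {
		const struct model_section *root = model_roots[r];
		uintptr_t lo, hi;

		if (!root)
			continue;
		lo = (uintptr_t)root;
		hi = (uintptr_t)(root + MODEL_SECTIONS_PER_ROOT);
		if (p >= lo && p < hi)
			return (long)(r * MODEL_SECTIONS_PER_ROOT +
				      (p - lo) / sizeof(*ms));
	}
	return -1;	/* descriptor not found in the global array */
}
```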
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      db36a461
    • Gavin Shan's avatar
      mm/sparse: optimize sparse_index_alloc · 5b760e64
      Gavin Shan authored
      With CONFIG_SPARSEMEM_EXTREME, the two levels of memory section
      descriptors are allocated from slab or bootmem.  When allocating from
      slab, let slab/bootmem allocator clear the memory chunk.  We needn't clear
      it explicitly.
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b760e64
    • Wanpeng Li's avatar
      memcg: add mem_cgroup_from_css() helper · b2145145
      Wanpeng Li authored
      Add a mem_cgroup_from_css() helper to replace open-coded invocations of
      container_of().  This clarifies the code and adds a little more type
      safety.
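
A minimal userspace sketch of the pattern, assuming simplified structures: a typed helper wraps the container_of() arithmetic so callers cannot name the wrong outer type or member. The `model_*` names and the field layout are hypothetical; only the container_of() idea is taken from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the container_of() idiom. */
#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical embedded state object. */
struct model_css {
	int refcnt;
};

/* Hypothetical outer object embedding the state at a non-zero offset. */
struct model_mem_cgroup {
	long limit;
	struct model_css css;
};

/*
 * Typed helper: the argument must be a struct model_css pointer, and
 * the result is always a struct model_mem_cgroup pointer.
 */
static inline struct model_mem_cgroup *
model_mem_cgroup_from_css(struct model_css *s)
{
	return model_container_of(s, struct model_mem_cgroup, css);
}
```

Passing any other pointer type to the helper draws a compiler diagnostic, which the open-coded macro would not give.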
      
      [akpm@linux-foundation.org: fix extensive breakage]
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b2145145
    • Hugh Dickins's avatar
      memcg: further prevent OOM with too many dirty pages · c3b94f44
      Hugh Dickins authored
      The may_enter_fs test turns out to be too restrictive: though I saw no
      problem with it when testing on 3.5-rc6, it very soon OOMed when I tested
      on 3.5-rc6-mm1.  I don't know what the difference there is, perhaps I just
      slightly changed the way I started off the testing: dd if=/dev/zero
      of=/mnt/temp bs=1M count=1024; rm -f /mnt/temp; sync repeatedly, in 20M
      memory.limit_in_bytes cgroup to ext4 on USB stick.
      
      ext4 (and gfs2 and xfs) turn out to allocate new pages for writing with
      AOP_FLAG_NOFS: that seems a little worrying, and it's unclear to me why
      the transaction needs to be started even before allocating pagecache
      memory.  But it may not be worth worrying about these days: if direct
      reclaim avoids FS writeback, does __GFP_FS now mean anything?
      
      Anyway, we insisted on the may_enter_fs test to avoid hangs with the loop
      device; but since that also masks off __GFP_IO, we can test for __GFP_IO
      directly, ignoring may_enter_fs and __GFP_FS.
      
      But even so, the test still OOMs sometimes: when originally testing on
      3.5-rc6, it OOMed about one time in five or ten; when testing just now on
      3.5-rc6-mm1, it OOMed on the first iteration.
      
      This residual problem comes from an accumulation of pages under ordinary
      writeback, not marked PageReclaim, so rightly not causing the memcg check
      to wait on their writeback: these too can prevent shrink_page_list() from
      freeing any pages, so many times that memcg reclaim fails and OOMs.
      
      Deal with these in the same way as direct reclaim now deals with dirty FS
      pages: mark them PageReclaim.  It is appropriate to rotate these to tail
      of list when writepage completes, but more importantly, the PageReclaim
      flag makes memcg reclaim wait on them if encountered again.  Increment
      NR_VMSCAN_IMMEDIATE?  That's arguable: I chose not.
      
      Setting PageReclaim here may occasionally race with end_page_writeback()
      clearing it: lru_deactivate_fn() already faced the same race, and
      correctly concluded that the window is small and the issue non-critical.
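
A minimal userspace sketch of the decision described above for a page found under writeback, as I read this changelog: wait only when this is memcg reclaim, the page is already marked PageReclaim, and the allocation allows __GFP_IO; otherwise mark the page and move on. The enum, the struct and the boolean parameters are hypothetical stand-ins for the kernel's page flags and scan control.

```c
#include <assert.h>
#include <stdbool.h>

enum model_action {
	MODEL_WAIT_ON_WRITEBACK,	/* throttle: wait for the page */
	MODEL_MARK_RECLAIM_AND_KEEP,	/* set the flag, keep the page */
};

/* Hypothetical stand-in for the two page flags involved. */
struct model_page {
	bool writeback;
	bool reclaim;
};

/*
 * Decide what to do with a page that is under writeback.
 * global_reclaim: true for global reclaim, false for memcg reclaim.
 * gfp_io: true when the allocation context permits I/O.
 */
static enum model_action model_handle_writeback(struct model_page *page,
						bool global_reclaim,
						bool gfp_io)
{
	assert(page->writeback);

	if (global_reclaim || !page->reclaim || !gfp_io) {
		/* May race with writeback completion clearing the flag. */
		page->reclaim = true;
		return MODEL_MARK_RECLAIM_AND_KEEP;
	}
	return MODEL_WAIT_ON_WRITEBACK;
}
```

On the first encounter memcg reclaim marks the page; only if the same page is met again while still under writeback does the reclaimer wait.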
      
      With these changes, the test runs indefinitely without OOMing on ext4,
      ext3 and ext2: I'll move on to test with other filesystems later.
      
      Trivia: invert conditions for a clearer block without an else, and goto
      keep_locked to do the unlock_page.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c3b94f44
    • Michal Hocko's avatar
      memcg: prevent OOM with too many dirty pages · e62e384e
      Michal Hocko authored
      The current implementation of dirty pages throttling is not memcg aware
      which makes it easy to have memcg LRUs full of dirty pages.  Without
      throttling, these LRUs can be scanned faster than the rate of writeback,
      leading to memcg OOM conditions when the hard limit is small.
      
      This patch fixes the problem by throttling the allocating process
      (possibly a writer) during the hard limit reclaim by waiting on
      PageReclaim pages.  We are waiting only for PageReclaim pages because
      those are the pages that made one full round over LRU and that means that
      the writeback is much slower than scanning.
      
      The solution is far from ideal (the long-term solution is memcg-aware
      dirty throttling) but it is meant to be a band-aid until we have a real
      fix.  We are seeing this happen during nightly backups which are placed
      into containers to prevent eviction of the real working set.
      
      The change affects only memcg reclaim and only when we encounter
      PageReclaim pages, which is a signal that reclaim cannot keep up with
      the writers, so somebody should be throttled.  This could be
      potentially unfair because it could be somebody else from the group who
      gets throttled on behalf of the writer, but as writers need to allocate
      as well and they allocate at a higher rate, the probability that only
      innocent processes would be penalized is not that high.
      
      I have tested this change by a simple dd copying /dev/zero to tmpfs or
      ext3 running under small memcg (1G copy under 5M, 60M, 300M and 2G
      containers) and dd got killed by OOM killer every time.  With the patch I
      could run the dd with the same size under 5M controller without any OOM.
      The issue is more visible with slower devices for output.
      
      * With the patch
      ================
      * tmpfs size=2G
      ---------------
      $ vim cgroup_cache_oom_test.sh
      $ ./cgroup_cache_oom_test.sh 5M
      using Limit 5M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 30.4049 s, 34.5 MB/s
      $ ./cgroup_cache_oom_test.sh 60M
      using Limit 60M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 31.4561 s, 33.3 MB/s
      $ ./cgroup_cache_oom_test.sh 300M
      using Limit 300M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 20.4618 s, 51.2 MB/s
      $ ./cgroup_cache_oom_test.sh 2G
      using Limit 2G for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 1.42172 s, 738 MB/s
      
      * ext3
      ------
      $ ./cgroup_cache_oom_test.sh 5M
      using Limit 5M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 27.9547 s, 37.5 MB/s
      $ ./cgroup_cache_oom_test.sh 60M
      using Limit 60M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 30.3221 s, 34.6 MB/s
      $ ./cgroup_cache_oom_test.sh 300M
      using Limit 300M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 24.5764 s, 42.7 MB/s
      $ ./cgroup_cache_oom_test.sh 2G
      using Limit 2G for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 3.35828 s, 312 MB/s
      
      * Without the patch
      ===================
      * tmpfs size=2G
      ---------------
      $ ./cgroup_cache_oom_test.sh 5M
      using Limit 5M for group
      ./cgroup_cache_oom_test.sh: line 46:  4668 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
      $ ./cgroup_cache_oom_test.sh 60M
      using Limit 60M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 25.4989 s, 41.1 MB/s
      $ ./cgroup_cache_oom_test.sh 300M
      using Limit 300M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 24.3928 s, 43.0 MB/s
      $ ./cgroup_cache_oom_test.sh 2G
      using Limit 2G for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 1.49797 s, 700 MB/s
      
      * ext3
      ------
      $ ./cgroup_cache_oom_test.sh 5M
      using Limit 5M for group
      ./cgroup_cache_oom_test.sh: line 46:  4689 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
      $ ./cgroup_cache_oom_test.sh 60M
      using Limit 60M for group
      ./cgroup_cache_oom_test.sh: line 46:  4692 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
      $ ./cgroup_cache_oom_test.sh 300M
      using Limit 300M for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 20.248 s, 51.8 MB/s
      $ ./cgroup_cache_oom_test.sh 2G
      using Limit 2G for group
      1000+0 records in
      1000+0 records out
      1048576000 bytes (1.0 GB) copied, 2.85201 s, 368 MB/s
      
      [akpm@linux-foundation.org: tweak changelog, reordered the test to optimize for CONFIG_CGROUP_MEM_RES_CTLR=n]
      [hughd@google.com: fix deadlock with loop driver]
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e62e384e
    • Xiao Guangrong's avatar
      mm: mmu_notifier: fix freed page still mapped in secondary MMU · 3ad3d901
      Xiao Guangrong authored
      mmu_notifier_release() is called when the process is exiting.  It will
      delete all the mmu notifiers.  But at this time the page belonging to the
      process is still present in page tables and is present on the LRU list, so
      this race will happen:
      
            CPU 0                 CPU 1
      mmu_notifier_release:    try_to_unmap:
         hlist_del_init_rcu(&mn->hlist);
                                  ptep_clear_flush_notify:
                                        mmu notifier not found
                                  free page  !!!!!!
                                  /*
                                   * At this point, the page has been
                                   * freed, but it is still mapped in
                                   * the secondary MMU.
                                   */
      
        mn->ops->release(mn, mm);
      
      Then the box is not stable and sometimes we can get this bug:
      
      [  738.075923] BUG: Bad page state in process migrate-perf  pfn:03bec
      [  738.075931] page:ffffea00000efb00 count:0 mapcount:0 mapping:          (null) index:0x8076
      [  738.075936] page flags: 0x20000000000014(referenced|dirty)
      
      The same issue is present in mmu_notifier_unregister().
      
      We can call ->release before deleting the notifier to ensure the page has
      been unmapped from the secondary MMU before it is freed.
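
A minimal userspace sketch of why the order matters, not the kernel code: reclaim can only free a page safely if the notifier is still on the list (so it gets invalidated through it) or the secondary mapping is already gone. The struct and all `model_*` functions are hypothetical; each teardown function reports whether that invariant holds in the window between its two steps.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for one registered notifier and its state. */
struct model_notifier {
	bool on_list;		/* still reachable by reclaim */
	bool secondary_mapped;	/* page still mapped in the secondary MMU */
};

static void model_unlink(struct model_notifier *mn)
{
	mn->on_list = false;
}

static void model_release(struct model_notifier *mn)
{
	mn->secondary_mapped = false;	/* the release callback unmaps */
}

/* Invariant a concurrent reclaimer relies on. */
static bool model_reclaim_is_safe(const struct model_notifier *mn)
{
	return mn->on_list || !mn->secondary_mapped;
}

/* Old order: unlink first, release afterwards. */
static bool model_teardown_old(struct model_notifier *mn)
{
	bool safe;

	model_unlink(mn);
	safe = model_reclaim_is_safe(mn);	/* window between the steps */
	model_release(mn);
	return safe;
}

/* Fixed order: release first, then unlink. */
static bool model_teardown_fixed(struct model_notifier *mn)
{
	bool safe;

	model_release(mn);
	safe = model_reclaim_is_safe(mn);	/* window between the steps */
	model_unlink(mn);
	return safe;
}
```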
      Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ad3d901
    • Johannes Weiner's avatar
      mm: memcg: only check anon swapin page charges for swap cache · bdf4f4d2
      Johannes Weiner authored
      shmem knows for sure that the page is in swap cache when attempting to
      charge a page, because the cache charge entry function has a check for it.
      Only anon pages may be removed from swap cache already when trying to
      charge their swapin.
      
      Adjust the comment, though: '4969c119 mm: fix swapin race condition' added
      a stable PageSwapCache check under the page lock in the do_swap_page()
      before calling the memory controller, so it's unuse_pte()'s pte_same()
      that may fail.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bdf4f4d2
    • Johannes Weiner's avatar
      mm: memcg: only check swap cache pages for repeated charging · 90deb788
      Johannes Weiner authored
      Only anon and shmem pages in the swap cache are attempted to be charged
      multiple times, from every swap pte fault or from shmem_unuse().  No other
      pages require checking PageCgroupUsed().
      
      Charging pages in the swap cache is also serialized by the page lock, and
      since both the try_charge and commit_charge are called under the same page
      lock section, the PageCgroupUsed() check might as well happen before the
      counter charging, let alone reclaim.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90deb788
    • Johannes Weiner's avatar
      mm: memcg: split swapin charge function into private and public part · 0435a2fd
      Johannes Weiner authored
      When shmem is charged upon swapin, it does not need to check twice whether
      the memory controller is enabled.
      
      Also, shmem pages do not have to be checked for everything that regular
      anon pages have to be checked for, so let shmem use the internal version
      directly and allow future patches to move around checks that are only
      required when swapping in anon pages.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0435a2fd
    • Johannes Weiner's avatar
      mm: memcg: remove needless !mm fixup to init_mm when charging · 24467cac
      Johannes Weiner authored
      It does not matter to __mem_cgroup_try_charge() if the passed mm is NULL
      or init_mm, it will charge the root memcg in either case.
      
      Also fix up the comment in __mem_cgroup_try_charge() that claimed the
      init_mm would be charged when no mm was passed.  It's not really
      incorrect, but confusing.  Clarify that the root memcg is charged in this
      case.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      24467cac
    • Johannes Weiner's avatar
      mm: memcg: remove unneeded shmem charge type · 62ba7442
      Johannes Weiner authored
      shmem page charges have not needed a separate charge type to tell them
      from regular file pages since 08e552c6 ("memcg: synchronized LRU").
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg: move swapin charge functions above callsites · 827a03d2
      Johannes Weiner authored
      Charging cache pages may require swapin in the shmem case.  Save the
      forward declaration and just move the swapin functions above the cache
      charging functions.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg: only check for PageSwapCache when uncharging anon · 7d188958
      Johannes Weiner authored
      Only anon pages that are uncharged at the time of the last page table
      mapping vanishing may be in swapcache.
      
      When shmem pages, file pages, swap-freed anon pages, or just migrated
      pages are uncharged, they are known for sure to be not in swapcache.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg: push down PageSwapCache check into uncharge entry functions · 0c59b89c
      Johannes Weiner authored
      Not all uncharge paths need to check if the page is swapcache, some of
      them can know for sure.
      
      Push down the check into all callsites of uncharge_common() so that the
      patch that removes some of them is more obvious.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Wanpeng Li <liwp.linux@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>