1. 09 Jun, 2014 1 commit
  2. 24 Feb, 2014 1 commit
    • KOSAKI Motohiro's avatar
      mm: __set_page_dirty_nobuffers() uses spin_lock_irqsave() instead of spin_lock_irq() · 9023aec3
      KOSAKI Motohiro authored
      commit a85d9df1 upstream.
      
      During an aio stress test, we observed the following lockdep warning.  This
      means AIO+numa_balancing is currently deadlockable.
      
      The problem is that aio_migratepage disables interrupts, but
      __set_page_dirty_nobuffers unintentionally enables them again.
      
      Generally, all helper functions should use spin_lock_irqsave() instead of
      spin_lock_irq() because they don't know their callers at all.
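      
      As a rough sketch of the safe pattern (the lock shown is the one
      __set_page_dirty_nobuffers takes; the rest is illustrative),
      spin_lock_irqsave() preserves whatever interrupt state the caller had:
      
      	unsigned long flags;
      
      	spin_lock_irqsave(&mapping->tree_lock, flags);
      	/* ... update the dirty accounting / radix tree tags ... */
      	spin_unlock_irqrestore(&mapping->tree_lock, flags);
      
      spin_unlock_irq(), by contrast, unconditionally re-enables interrupts,
      which is exactly what bit the interrupt-disabled aio_migratepage path.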
      
         other info that might help us debug this:
          Possible unsafe locking scenario:
      
                CPU0
                ----
           lock(&(&ctx->completion_lock)->rlock);
           <Interrupt>
             lock(&(&ctx->completion_lock)->rlock);
      
          *** DEADLOCK ***
      
            dump_stack+0x19/0x1b
            print_usage_bug+0x1f7/0x208
            mark_lock+0x21d/0x2a0
            mark_held_locks+0xb9/0x140
            trace_hardirqs_on_caller+0x105/0x1d0
            trace_hardirqs_on+0xd/0x10
            _raw_spin_unlock_irq+0x2c/0x50
            __set_page_dirty_nobuffers+0x8c/0xf0
            migrate_page_copy+0x434/0x540
            aio_migratepage+0xb1/0x140
            move_to_new_page+0x7d/0x230
            migrate_pages+0x5e5/0x700
            migrate_misplaced_page+0xbc/0xf0
            do_numa_page+0x102/0x190
            handle_pte_fault+0x241/0x970
            handle_mm_fault+0x265/0x370
            __do_page_fault+0x172/0x5a0
            do_page_fault+0x1a/0x70
            page_fault+0x28/0x30
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
      9023aec3
  3. 10 Feb, 2014 1 commit
    • Johannes Weiner's avatar
      mm/page-writeback.c: fix dirty_balance_reserve subtraction from dirtyable memory · bcfa693b
      Johannes Weiner authored
      commit a804552b upstream.
      
      Tejun reported stuttering and latency spikes on a system where random
      tasks would enter direct reclaim and get stuck on dirty pages.  Around
      50% of memory was occupied by tmpfs backed by an SSD, and another disk
      (rotating) was reading and writing at max speed to shrink a partition.
      
      : The problem was pretty ridiculous.  It's a 8gig machine w/ one ssd and 10k
      : rpm harddrive and I could reliably reproduce constant stuttering every
      : several seconds for as long as buffered IO was going on on the hard drive
      : either with tmpfs occupying somewhere above 4gig or a test program which
      : allocates about the same amount of anon memory.  Although swap usage was
      : zero, turning off swap also made the problem go away too.
      :
      : The trigger conditions seem quite plausible - high anon memory usage w/
      : heavy buffered IO and swap configured - and it's highly likely that this
      : is happening in the wild too.  (this can happen with copying large files
      : to usb sticks too, right?)
      
      This patch (of 2):
      
      The dirty_balance_reserve is an approximation of the fraction of free
      pages that the page allocator does not make available for page cache
      allocations.  As a result, it has to be taken into account when
      calculating the amount of "dirtyable memory", the baseline to which
      dirty_background_ratio and dirty_ratio are applied.
      
      However, currently the reserve is subtracted from the sum of free and
      reclaimable pages, which is nonsensical and leads to erroneous results
      when the system is dominated by unreclaimable pages and the
      dirty_balance_reserve is bigger than free+reclaimable.  In that case, at
      least the already allocated cache should be considered dirtyable.
      
      Fix the calculation by subtracting the reserve from the amount of free
      pages, then adding the reclaimable pages on top.
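      
      A sketch of the before/after arithmetic (variable names are
      illustrative, not the exact code):
      
      	/* before: the reserve could swallow free + reclaimable entirely */
      	x = free + reclaimable;
      	x -= min(x, dirty_balance_reserve);
      
      	/* after: the reserve only offsets the free pages, so already
      	 * allocated page cache remains dirtyable */
      	x = free - min(free, dirty_balance_reserve);
      	x += reclaimable;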
      
      [akpm@linux-foundation.org: fix CONFIG_HIGHMEM build]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Tejun Heo <tj@kernel.org>
      Tested-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
      bcfa693b
  4. 04 Nov, 2013 1 commit
    • Fengguang Wu's avatar
      writeback: fix negative bdi max pause · 7c0c220d
      Fengguang Wu authored
      commit e3b6c655 upstream.
      
      Toralf runs trinity on UML/i386.  After some time it hangs and the last
      message line is
      
      	BUG: soft lockup - CPU#0 stuck for 22s! [trinity-child0:1521]
      
      It's found that pages_dirtied becomes very large.  More than 1000000000
      pages in this case:
      
      	period = HZ * pages_dirtied / task_ratelimit;
      	BUG_ON(pages_dirtied > 2000000000);
      	BUG_ON(pages_dirtied > 1000000000);      <---------
      
      UML debug printf shows that we got negative pause here:
      
      	ick: pause : -984
      	ick: pages_dirtied : 0
      	ick: task_ratelimit: 0
      
      	 pause:
      	+       if (pause < 0)  {
      	+               extern int printf(char *, ...);
      	+               printf("ick : pause : %li\n", pause);
      	+               printf("ick: pages_dirtied : %lu\n", pages_dirtied);
      	+               printf("ick: task_ratelimit: %lu\n", task_ratelimit);
      	+               BUG_ON(1);
      	+       }
      	        trace_balance_dirty_pages(bdi,
      
      pause is bounded by [min_pause, max_pause], and min_pause is in turn
      bounded by max_pause, so it was suspected, and then demonstrated, that
      the max_pause calculation goes wrong:
      
      	ick: pause : -717
      	ick: min_pause : -177
      	ick: max_pause : -717
      	ick: pages_dirtied : 14
      	ick: task_ratelimit: 0
      
      The problem lies in the two "long = unsigned long" assignments in
      bdi_max_pause(), which may go negative if the highest bit is set, and the
      min_t(long, ...) check fails to keep the result from falling below 0.
      Fix all of them by using "unsigned long" throughout the function.
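      
      A minimal illustration of the failure mode on a 32-bit build (values
      and names are made up for the example):
      
      	unsigned long t = 3UL << 30;                  /* larger than LONG_MAX on 32-bit */
      	long max_pause = t;                           /* wraps to a negative value      */
      	long pause = min_t(long, max_pause, HZ / 5);  /* picks the negative one         */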
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      Reported-by: Toralf Förster <toralf.foerster@gmx.de>
      Tested-by: Toralf Förster <toralf.foerster@gmx.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7c0c220d
  5. 14 Jul, 2013 1 commit
    • Paul Gortmaker's avatar
      kernel: delete __cpuinit usage from all core kernel files · 0db0628d
      Paul Gortmaker authored
      The __cpuinit type of throwaway sections might have made sense
      some time ago when RAM was more constrained, but now the savings
      do not offset the cost and complications.  For example, the fix in
      commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
      is a good example of the nasty type of bugs that can be created
      with improper use of the various __init prefixes.
      
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      
      This removes all the uses of the __cpuinit macros from C files in
      the core kernel directories (kernel, init, lib, mm, and include)
      that don't really have a specific maintainer.
      
      [1] https://lkml.org/lkml/2013/5/20/589
      
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      0db0628d
  6. 29 Apr, 2013 1 commit
    • Darrick J. Wong's avatar
      mm: make snapshotting pages for stable writes a per-bio operation · 71368511
      Darrick J. Wong authored
      
      Walking a bio's page mappings has proved problematic, so create a new
      bio flag to indicate that a bio's data needs to be snapshotted in order
      to guarantee stable pages during writeback.  Next, for the one user
      (ext3/jbd) of snapshotting, hook all the places where writes can be
      initiated without PG_writeback set, and set BIO_SNAP_STABLE there.
      
      We must also flag journal "metadata" bios for stable writeout, since
      file data can be written through the journal.  Finally, the
      MS_SNAP_STABLE mount flag (only used by ext3) is now superfluous, so get
      rid of it.
      
      [akpm@linux-foundation.org: rename _submit_bh()'s `flags' to `bio_flags', delobotomize the _submit_bh declaration]
      [akpm@linux-foundation.org: teeny cleanup]
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71368511
  7. 24 Feb, 2013 1 commit
  8. 22 Feb, 2013 2 commits
    • Darrick J. Wong's avatar
      block: optionally snapshot page contents to provide stable pages during write · ffecfd1a
      Darrick J. Wong authored
      
      This is a band-aid to provide stable page writes on jbd without
      needing to backport the fixed locking and page writeback bit handling
      schemes of jbd2.  The band-aid works by using bounce buffers to snapshot
      page contents instead of waiting.
      
      For those wondering about the ext3 bandage -- fixing the jbd locking
      (which was done as part of ext4dev years ago) is a lot of surgery, and
      setting PG_writeback on data pages when we actually hold the page lock
      dropped ext3 performance by nearly an order of magnitude.  If we're
      going to migrate iscsi and raid to use stable page writes, the
      complaints about high latency will likely return.  We might as well
      centralize this page snapshotting in one place.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Tested-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz...
      ffecfd1a
    • Darrick J. Wong's avatar
      mm: only enforce stable page writes if the backing device requires it · 1d1d1a76
      Darrick J. Wong authored
      
      Create a helper function that checks whether a backing device requires
      stable page writes and, if so, performs the necessary wait.  Then, make it so
      that all points in the memory manager that handle making pages writable
      use the helper function.  This should provide stable page write support
      to most filesystems, while eliminating unnecessary waiting for devices
      that don't require the feature.
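      
      A sketch of what such a helper looks like (the names here are
      assumptions, not necessarily the ones that were merged):
      
      	void wait_for_stable_page(struct page *page)
      	{
      		struct backing_dev_info *bdi = page->mapping->backing_dev_info;
      
      		if (!bdi_cap_stable_pages_required(bdi))
      			return;			/* device copies data, no need to wait */
      		wait_on_page_writeback(page);	/* device needs stable pages, so block */
      	}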
      
      Before this patchset, all filesystems would block, regardless of whether
      or not it was necessary.  ext3 would wait, but still generate occasional
      checksum errors.  The network filesystems were left to do their own
      thing, so they'd wait too.
      
      After this patchset, all the disk filesystems except ext3 and btrfs will
      wait only if the hardware requires it.  ext3 (if necessary) snapshots
      pages instead of blocking, and btrfs provides its own bdi so the mm will
      never wait.  Network filesystems haven't been touched, so either they
      provide their own stable page guarantees or they don't block at all.
      The blocking behavior is back to what it was before 3.0 if you don't
      have a disk requiring stable page writes.
      
      Here's the result of using dbench to test latency on ext2:
      
      3.8.0-rc3:
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       WriteX        109347     0.028    59.817
       ReadX         347180     0.004     3.391
       Flush          15514    29.828   287.283
      
      Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms
      
      3.8.0-rc3 + patches:
       WriteX        105556     0.029     4.273
       ReadX         335004     0.005     4.112
       Flush          14982    30.540   298.634
      
      Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms
      
      As you can see, the maximum write latency drops considerably with this
      patch enabled.  The other filesystems (ext3/ext4/xfs/btrfs) behave
      similarly, but see the cover letter for those results.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Acked-by: Steven Whitehouse <swhiteho@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1d1d1a76
  9. 07 Feb, 2013 1 commit
  10. 24 Jan, 2013 1 commit
  11. 14 Jan, 2013 1 commit
    • Tejun Heo's avatar
      writeback: add more tracepoints · 9fb0a7da
      Tejun Heo authored
      
      Add tracepoints for page dirtying, writeback_single_inode start, inode
      dirtying and writeback.  For the latter two inode events, a pair of
      events are defined to denote start and end of the operations (the
      starting one has _start suffix and the one w/o suffix happens after
      the operation is complete).  These inode ops are FS-specific and can
      be non-trivial, so having enclosing tracepoints is useful for external
      tracers.
      
      This is part of a series of tracepoint additions to improve visibility
      into dirtying / writeback operations for the io tracer and userland.
      
      v2: writeback_dirty_inode[_start] TPs may be called for files on
          pseudo FSes w/ unregistered bdi.  Check whether bdi->dev is %NULL
          before dereferencing.
      
      v3: buffer dirtying moved to a block TP.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9fb0a7da
  12. 21 Dec, 2012 1 commit
  13. 12 Dec, 2012 1 commit
  14. 28 Sep, 2012 1 commit
  15. 03 Aug, 2012 1 commit
    • Artem Bityutskiy's avatar
      vfs: kill write_super and sync_supers · f0cd2dbb
      Artem Bityutskiy authored
      
      Finally we can kill the 'sync_supers' kernel thread along with the
      '->write_super()' superblock operation because all the users are gone.
      Now every file-system is supposed to self-manage its own superblock and
      its dirty state.
      
      The nice thing about killing this thread is that it improves power management.
      Indeed, 'sync_supers' is a source of monotonic system wake-ups - it woke up
      every 5 seconds no matter what - even if there were no dirty superblocks and
      even if there were no file-systems using this service (e.g., btrfs and
      journalled ext4 do not need it). So it was wasting power most of the time. And
      because the thread was in the core of the kernel, all systems had to have it.
      So I am quite happy to make it go away.
      
      Interestingly, this thread is a left-over from the pdflush kernel thread which
      was a self-forking kernel thread responsible for all the write-back in old
      Linux kernels. It was turned into per-block device BDI threads, and
      'sync_supers' was a left-over. Thus, R.I.P, pdflush as well.
      Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      f0cd2dbb
  16. 09 Jun, 2012 1 commit
  17. 08 Jun, 2012 1 commit
  18. 06 May, 2012 1 commit
    • Fengguang Wu's avatar
      writeback: initialize global_dirty_limit · 68809c71
      Fengguang Wu authored
      
      This prevents global_dirty_limit from remaining 0 (the initial value)
      for long time, since it's only updated in update_dirty_limit() when
      above the dirty freerun area.
      
      It will avoid unexpected consequences when some random code uses it as a
      convenient approximation of the global dirty threshold.
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      68809c71
  19. 14 Apr, 2012 1 commit
  20. 22 Mar, 2012 2 commits
    • Artem Bityutskiy's avatar
      mm: export dirty_writeback_interval · 91913a29
      Artem Bityutskiy authored
      
      Export 'dirty_writeback_interval' to make it visible to
      file-systems. We are going to push superblock management down to
      file-systems and get rid of the 'sync_supers' kernel thread completely.
      Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      91913a29
    • Fengguang Wu's avatar
      mm: use global_dirty_limit in throttle_vm_writeout() · 47a13333
      Fengguang Wu authored
      
      When starting a memory hog task, a desktop box w/o swap is found to go
      unresponsive for a long time.  It's solely caused by lots of congestion
      waits in throttle_vm_writeout():
      
       gnome-system-mo-4201 553.073384: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1
       gnome-system-mo-4201 553.073386: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000
                 gtali-4237 553.080377: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1
                 gtali-4237 553.080378: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000
                  Xorg-3483 553.103375: congestion_wait: throttle_vm_writeout+0x70/0x7f shrink_mem_cgroup_zone+0x48f/0x4a1
                  Xorg-3483 553.103377: writeback_congestion_wait: usec_timeout=100000 usec_delayed=100000
      
      The root cause is that the dirty threshold is knocked down a lot by the
      memory hog task.  Fix this by using global_dirty_limit, which decreases
      gradually on such events and guarantees that we stay above the (also
      decreasing) nr_dirty while following down to the new dirty threshold.
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      47a13333
  21. 11 Jan, 2012 4 commits
    • Johannes Weiner's avatar
      mm: try to distribute dirty pages fairly across zones · a756cf59
      Johannes Weiner authored
      
      The maximum number of dirty pages that exist in the system at any time is
      determined by a number of pages considered dirtyable and a user-configured
      percentage of those, or an absolute number in bytes.
      
      This number of dirtyable pages is the sum of memory provided by all the
      zones in the system minus their lowmem reserves and high watermarks, so
      that the system can retain a healthy number of free pages without having
      to reclaim dirty pages.
      
      But there is a flaw in that we have a zoned page allocator which does not
      care about the global state but rather the state of individual memory
      zones.  And right now there is nothing that prevents one zone from filling
      up with dirty pages while other zones are spared, which frequently leads
      to situations where kswapd, in order to restore the watermark of free
      pages, does indeed have to write pages from that zone's LRU list.  This
      can interfere so badly with IO from the flusher threads that major
      filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
      already, taking away the VM's only possibility to keep such a zone
      balanced, aside from hoping the flushers will soon clean pages from that
      zone.
      
      Enter per-zone dirty limits.  They are to a zone's dirtyable memory what
      the global limit is to the global amount of dirtyable memory, and try to
      make sure that no single zone receives more than its fair share of the
      globally allowed dirty pages in the first place.  As the number of pages
      considered dirtyable excludes the zones' lowmem reserves and high
      watermarks, the maximum number of dirty pages in a zone is such that the
      zone can always be balanced without requiring page cleaning.
      
      As this is a placement decision in the page allocator and pages are
      dirtied only after the allocation, this patch allows allocators to pass
      __GFP_WRITE when they know in advance that the page will be written to and
      become dirty soon.  The page allocator will then attempt to allocate from
      the first zone of the zonelist - which on NUMA is determined by the task's
      NUMA memory policy - that has not exceeded its dirty limit.
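      
      For instance (an illustrative call site, not taken from this patch), a
      buffered-write path that knows the page will be dirtied right away could
      allocate it as:
      
      	page = __page_cache_alloc(mapping_gfp_mask(mapping) | __GFP_WRITE);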
      
      At first glance, it would appear that the diversion to lower zones can
      increase pressure on them, but this is not the case.  With a full high
      zone, allocations will be diverted to lower zones eventually, so it is
      more of a shift in timing of the lower zone allocations.  Workloads that
      previously could fit their dirty pages completely in the higher zone may
      be forced to allocate from lower zones, but the pages that "spill over"
      are themselves limited by the lower zones' dirty constraints, and thus
      unlikely to become a problem.
      
      For now, the problem of unfair dirty page distribution remains for NUMA
      configurations where the zones allowed for allocation are in sum not big
      enough to trigger the global dirty limits, wake up the flusher threads and
      remedy the situation.  Because of this, an allocation that could not
      succeed on any of the considered zones is allowed to ignore the dirty
      limits before going into direct reclaim or even failing the allocation,
      until a future patch changes the global dirty throttling and flusher
      thread activation so that they take individual zone states into account.
      
      			Test results
      
      15M DMA + 3246M DMA32 + 504 Normal = 3765M memory
      40% dirty ratio
      16G USB thumb drive
      10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))
      
      		seconds			nr_vmscan_write
      		        (stddev)	       min|     median|        max
      xfs
      vanilla:	 549.747( 3.492)	     0.000|      0.000|      0.000
      patched:	 550.996( 3.802)	     0.000|      0.000|      0.000
      
      fuse-ntfs
      vanilla:	1183.094(53.178)	 54349.000|  59341.000|  65163.000
      patched:	 558.049(17.914)	     0.000|      0.000|     43.000
      
      btrfs
      vanilla:	 573.679(14.015)	156657.000| 460178.000| 606926.000
      patched:	 563.365(11.368)	     0.000|      0.000|   1362.000
      
      ext4
      vanilla:	 561.197(15.782)	     0.000|2725438.000|4143837.000
      patched:	 568.806(17.496)	     0.000|      0.000|      0.000
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Tested-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a756cf59
    • Johannes Weiner's avatar
      mm: writeback: cleanups in preparation for per-zone dirty limits · ccafa287
      Johannes Weiner authored
      
      The next patch will introduce per-zone dirty limiting functions in
      addition to the traditional global dirty limiting.
      
      Rename determine_dirtyable_memory() to global_dirtyable_memory() before
      adding the zone-specific version, and fix up its documentation.
      
      Also, move the functions to determine the dirtyable memory and the
      function to calculate the dirty limit based on that together so that their
      relationship is more apparent and that they can be commented on as a
      group.
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Mel Gorman <mel@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ccafa287
    • Johannes Weiner's avatar
      mm: exclude reserved pages from dirtyable memory · ab8fabd4
      Johannes Weiner authored
      
      Per-zone dirty limits try to distribute page cache pages allocated for
      writing across zones in proportion to the individual zone sizes, to reduce
      the likelihood of reclaim having to write back individual pages from the
      LRU lists in order to make progress.
      
      This patch:
      
      The amount of dirtyable pages should not include the full number of free
      pages: there is a number of reserved pages that the page allocator and
      kswapd always try to keep free.
      
      The closer (reclaimable pages - dirty pages) is to the number of reserved
      pages, the more likely it becomes for reclaim to run into dirty pages:
      
             +----------+ ---
             |   anon   |  |
             +----------+  |
             |          |  |
             |          |  -- dirty limit new    -- flusher new
             |   file   |  |                     |
             |          |  |                     |
             |          |  -- dirty limit old    -- flusher old
             |          |                        |
             +----------+                       --- reclaim
             | reserved |
             +----------+
             |  kernel  |
             +----------+
      
      This patch introduces a per-zone dirty reserve that takes both the lowmem
      reserve as well as the high watermark of the zone into account, and a
      global sum of those per-zone values that is subtracted from the global
      amount of dirtyable pages.  The lowmem reserve is unavailable to page
      cache allocations and kswapd tries to keep the high watermark free.  We
      don't want to end up in a situation where reclaim has to clean pages in
      order to balance zones.
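      
      Roughly, each zone's contribution is (max_lowmem_reserve() is a made-up
      name for the zone's largest lowmem reserve, used only for illustration):
      
      	/* pages in this zone that page cache can never count on getting */
      	reserve = max_lowmem_reserve(zone) + high_wmark_pages(zone);
      	dirty_balance_reserve += reserve;	/* global sum */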
      
      Not treating reserved pages as dirtyable on a global level is only a
      conceptual fix.  In reality, dirty pages are not distributed equally
      across zones and reclaim runs into dirty pages on a regular basis.
      
      But it is important to get this right before tackling the problem on a
      per-zone level, where the distance between reclaim and the dirty pages is
      mostly much smaller in absolute numbers.
      
      [akpm@linux-foundation.org: fix highmem build]
      Signed-off-by: Johannes Weiner <jweiner@redhat.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ab8fabd4
    • Johannes Weiner's avatar
      mm/page-writeback.c: make determine_dirtyable_memory static again · 1edf2234
      Johannes Weiner authored
      
      The tracing ring-buffer used this function briefly, but not anymore.
      Make it local to the writeback code again.
      
      Also, move the function so that no forward declaration needs to be
      reintroduced.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1edf2234
  22. 04 Jan, 2012 1 commit
  23. 18 Dec, 2011 8 commits
    • Wu Fengguang's avatar
      writeback: balanced_rate cannot exceed write bandwidth · bdaac490
      Wu Fengguang authored
      
      Add an upper limit to balanced_rate according to the below inequality.
      This filters out some rare but huge singular points, which at least
      enables more readable gnuplot figures.
      
      When there are N dd dirtiers,
      
      	balanced_dirty_ratelimit = write_bw / N
      
      So it holds that
      
      	balanced_dirty_ratelimit <= write_bw
      
      The singular points originate from dirty_rate in the formula below:
      
              balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate
      where
      	dirty_rate = (number of page dirties in the past 200ms) / 200ms
      
      In the extreme case, if all dd tasks suddenly get blocked on something
      else and hence no pages are dirtied at all, dirty_rate will be 0 and
      balanced_dirty_ratelimit will be inf. This could happen in reality.
      
      Note that these huge singular points are not a real threat, since they
      are _guaranteed_ to be filtered out by the
      	min(balanced_dirty_ratelimit, task_ratelimit)
      line in bdi_update_dirty_ratelimit(). task_ratelimit is based on the
      number of dirty pages, which will never _suddenly_ fly away like
      balanced_dirty_ratelimit. So any weirdly large balanced_dirty_ratelimit
      will be cut down to the level of task_ratelimit.
      
      There won't be tiny singular points though, as long as the dirty pages
      lie inside the dirty throttling region (above the freerun region).
      Because there the dd tasks will be throttled by balance_dirty_pages()
      and won't be able to suddenly dirty much more pages than average.
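      
      In code terms the fix amounts to a clamp along these lines (names as
      used in the text above):
      
      	if (balanced_dirty_ratelimit > write_bw)
      		balanced_dirty_ratelimit = write_bw;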
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      bdaac490
    • Wu Fengguang's avatar
      writeback: do strict bdi dirty_exceeded · 82791940
      Wu Fengguang authored
      
      This helps to reduce dirty throttling polls and hence CPU overheads.
      
      bdi->dirty_exceeded typically only helps when suddenly starting 100+
      dd's on a disk, in which case the dd's may need to poll
      balance_dirty_pages() earlier than tsk->nr_dirtied_pause.
      
      CC: Jan Kara <jack@suse.cz>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      82791940
    • Wu Fengguang's avatar
      writeback: avoid tiny dirty poll intervals · 5b9b3574
      Wu Fengguang authored
      
      The LKP tests see a big 56% regression for the case fio_mmap_randwrite_64k.
      Shaohua root-caused it to the much smaller dirty pause times and hence the
      much more frequent invocations of the IO-less balance_dirty_pages().
      Since fio_mmap_randwrite_64k effectively contains both reads and writes,
      the more frequent pauses triggered more idling in the cfq IO scheduler.
      
      The solution is to increase pause time all the way up to the max 200ms
      in this case, which is found to restore most performance. This will help
      reduce CPU overheads in other cases, too.
      
      Note that I don't expect many performance-critical workloads to run this
      access pattern: mmap read-on-write is rather inefficient and could be
      avoided by doing normal write syscalls.
      
      CC: Jan Kara <jack@suse.cz>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reported-by: Li Shaohua <shaohua.li@intel.com>
      Tested-by: Li Shaohua <shaohua.li@intel.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      5b9b3574
    • Wu Fengguang's avatar
      writeback: max, min and target dirty pause time · 7ccb9ad5
      Wu Fengguang authored
      
      Control the pause time and the call intervals to balance_dirty_pages()
      with three parameters:
      
      1) max_pause, limited by bdi_dirty and MAX_PAUSE
      
      2) the target pause time, grows with the number of dd tasks
         and is normally limited by max_pause/2
      
      3) the minimal pause, set to half the target pause
         and is used to skip short sleeps and accumulate them into bigger ones
      
      The typical behaviors after patch:
      
      - if ever task_ratelimit is far below dirty_ratelimit, the pause time
        will remain constant at max_pause and nr_dirtied_pause will be
        fluctuating with task_ratelimit
      
      - in the normal cases, nr_dirtied_pause will remain stable (keep in the
        same pace with dirty_ratelimit) and the pause time will be fluctuating
        with task_ratelimit
      
      In summary, someone has to fluctuate with task_ratelimit, because
      
      	task_ratelimit = nr_dirtied_pause / pause
      
      We normally prefer a stable nr_dirtied_pause, until reaching max_pause.
      
      The notable behavior changes are:
      
      - in stable workloads, there will no longer be sudden big trajectory
        switching of nr_dirtied_pause as concerned by Peter. It will be as
        smooth as dirty_ratelimit and changing proportionally with it (as
        always, assuming bdi bandwidth does not fluctuate across 2^N lines,
        otherwise nr_dirtied_pause will show up in 2+ parallel trajectories)
      
      - in the rare cases when something keeps task_ratelimit far below
        dirty_ratelimit, the smoothness can no longer be retained and
        nr_dirtied_pause will be "dancing" with task_ratelimit. This fixes a
        (not that destructive but still not good) bug that
      	  dirty_ratelimit gets brought down undesirably
      	  <= balanced_dirty_ratelimit is under estimated
      	  <= weakly executed task_ratelimit
      	  <= pause goes too large and gets trimmed down to max_pause
      	  <= nr_dirtied_pause (based on dirty_ratelimit) is set too large
      	  <= dirty_ratelimit being much larger than task_ratelimit
      
      - introduce min_pause to avoid small pause sleeps
      
      - when pause is trimmed down to max_pause, try to compensate it at the
        next pause time
      
      The "refactor" type of changes are:
      
      The max_pause equation is slightly transformed to make it slightly more
      efficient.
      
      We now scale target_pause by (N * 10ms) on 2^N concurrent tasks, which
      is effectively equal to the original scaling max_pause by (N * 20ms)
      because the original code does implicit target_pause ~= max_pause / 2.
      Based on the same implicit ratio, target_pause starts with 10ms on 1 dd.
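      
      For example, reading the scaling above literally: 2 concurrent dd tasks
      give a target pause of about 10ms, 4 tasks about 20ms and 256 tasks
      about 80ms, with min_pause at roughly half of each and everything still
      capped by max_pause.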
      
      CC: Jan Kara <jack@suse.cz>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      7ccb9ad5
    • Wu Fengguang's avatar
      writeback: dirty ratelimit - think time compensation · 83712358
      Wu Fengguang authored
      
      Compensate the task's think time when computing the final pause time,
      so that ->dirty_ratelimit can be executed accurately.
      
              think time := time spend outside of balance_dirty_pages()
      
      In the rare case that the task slept longer than the 200ms period time
      (result in negative pause time), the sleep time will be compensated in
      the following periods, too, if it's less than 1 second.
      
      Accumulated errors are carefully avoided as long as the max pause area
      is not hit.
      
      Pseudo code:
      
              period = pages_dirtied / task_ratelimit;
              think = jiffies - dirty_paused_when;
              pause = period - think;
      
      1) normal case: period > think
      
              pause = period - think
              dirty_paused_when = jiffies + pause
              nr_dirtied = 0
      
                                   period time
                    |===============================>|
                        think time      pause time
                    |===============>|==============>|
              ------|----------------|---------------|------------------------
              dirty_paused_when   jiffies
      
      2) no pause case: period <= think
      
              don't pause; reduce future pause time by:
              dirty_paused_when += period
              nr_dirtied = 0
      
                                 period time
                    |===============================>|
                                        think time
                    |===================================================>|
              ------|--------------------------------+-------------------|----
              dirty_paused_when                                       jiffies
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      83712358
    • Wu Fengguang's avatar
      writeback: fix dirtied pages accounting on redirty · 2f800fbd
      Wu Fengguang authored
      
      De-account the accumulative dirty counters on page redirty.
      
      Page redirties (very common in ext4) will introduce mismatch between
      counters (a) and (b)
      
      a) NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied
      b) NR_WRITTEN, BDI_WRITTEN
      
      This will introduce systematic errors in balanced_rate and result in
      dirty page position errors (ie. the dirty pages are no longer balanced
      around the global/bdi setpoints).
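      
      A sketch of the de-accounting helper this describes (body abridged,
      exact name and placement assumed):
      
      	void account_page_redirty(struct page *page)
      	{
      		struct address_space *mapping = page->mapping;
      
      		if (mapping && mapping_cap_account_dirty(mapping)) {
      			current->nr_dirtied--;
      			dec_zone_page_state(page, NR_DIRTIED);
      			dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
      		}
      	}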
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      2f800fbd
    • Wu Fengguang's avatar
      writeback: fix dirtied pages accounting on sub-page writes · d3bc1fef
      Wu Fengguang authored
      
      When dd'ing in 512-byte chunks, generic_perform_write() calls
      balance_dirty_pages_ratelimited() 8 times for the same page, but
      obviously the page is only dirtied once.
      
      Fix it by accounting tsk->nr_dirtied and bdp_ratelimits at page dirty time.
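      
      That is, the counters are bumped where the page really becomes dirty,
      once per page, roughly:
      
      	current->nr_dirtied++;
      	this_cpu_inc(bdp_ratelimits);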
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      d3bc1fef
    • Wu Fengguang's avatar
      writeback: charge leaked page dirties to active tasks · 54848d73
      Wu Fengguang authored
      
      It's a years-long problem that a large number of short-lived dirtiers
      (eg. gcc instances in a fast kernel build) may starve long-running dirtiers
      (eg. dd) as well as push the dirty pages to the global hard limit.
      
      The solution is to charge the pages dirtied by the exited gcc to the
      other random dirtying tasks.  It is not perfect, but it should behave
      well enough in practice: throttled tasks aren't actually running, so the
      tasks that are running are more likely to pick up the charge and get
      throttled, therefore promoting an equal spread.
      
      Randy: fix compile error: 'dirty_throttle_leaks' undeclared in exit.c
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      54848d73
  24. 08 Dec, 2011 3 commits
    • Wu Fengguang's avatar
      writeback: set max_pause to lowest value on zero bdi_dirty · 82e230a0
      Wu Fengguang authored
      
      Some traces show lots of bdi_dirty=0 lines where the value would actually
      be some small number if not for the accounting errors in the per-cpu bdi
      stats.
      
      In this case the max pause time should really be set to the smallest
      (non-zero) value to avoid IO queue underrun and improve throughput.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      82e230a0
    • Wu Fengguang's avatar
      writeback: permit through good bdi even when global dirty exceeded · c5c6343c
      Wu Fengguang authored
      
      On a system with 1 local mount and 1 NFS mount, if the NFS server
      stops responding while dd is writing to the NFS mount, the NFS dirty
      pages may exceed the global dirty limit and _every_ task that does writes
      will be blocked.  The whole system appears unresponsive.
      
      The workaround is to permit through the bdi's that only have a small
      number of dirty pages.  The number chosen (bdi_stat_error pages) is not
      enough to let the local disk run at optimal throughput, but it is enough
      to make the system responsive on a broken NFS mount.  The user can then
      kill the dirtiers on the NFS mount and increase the global dirty limit
      to bring up the local disk's throughput.
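      
      A simplified sketch of the pass-through check in balance_dirty_pages():
      
      	if (bdi_dirty <= bdi_stat_error(bdi))
      		break;	/* this bdi is effectively clean, let its dirtier proceed */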
      
      It risks allowing dirty pages to grow much larger than the global dirty
      limit when there are 1000+ mounts, however that's very unlikely to happen,
      especially in low memory profiles.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      c5c6343c
    • Wu Fengguang's avatar
      writeback: comment on the bdi dirty threshold · aed21ad2
      Wu Fengguang authored
      
      We do "floating proportions" to let active devices to grow its target
      share of dirty pages and stalled/inactive devices to decrease its target
      share over time.
      
      It works well except in the case of "an inactive disk suddenly goes
      busy", where the initial target share may be too small. To mitigate
      this, bdi_position_ratio() has the line below to raise a small
      bdi_thresh when it's safe to do so, so that the disk is fed with enough
      dirty pages for efficient IO and, in turn, fast rampup of bdi_thresh:
      
              bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);
      
      balance_dirty_pages() normally does negative feedback control which
      adjusts ratelimit to balance the bdi dirty pages around the target.
      In some extreme cases when that is not enough, it will have to block
      the tasks completely until the bdi dirty pages drop below bdi_thresh.
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      aed21ad2
  25. 17 Nov, 2011 2 commits
    • Wu Fengguang's avatar
      writeback: remove vm_dirties and task->dirties · 468e6a20
      Wu Fengguang authored
      
      They are not used any more.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      468e6a20
    • Wu Fengguang's avatar
      writeback: hard throttle 1000+ dd on a slow USB stick · 1df64719
      Wu Fengguang authored
      
      The sleep-based balance_dirty_pages() can pause at most MAX_PAUSE=200ms
      for every single 4KB page, which means it cannot throttle a task below
      4KB/200ms=20KB/s.  So when there are more than 512 dd tasks writing to a
      10MB/s USB stick, its bdi dirty pages can grow out of control.
      
      Even if we could increase MAX_PAUSE, the minimal (task_ratelimit = 1)
      still means a limit of 4KB/s.
      
      They can eventually be safeguarded by the global limit check
      (nr_dirty < dirty_thresh).  However, if someone is also writing to an
      HDD at the same time, it'll get poor HDD write performance.
      
      We at least want to maintain good write performance for other devices
      when one device is attacked by some "massive parallel" workload, or
      suffers from slow write bandwidth, or somehow gets stalled due to some
      error condition (eg. NFS server not responding).
      
      For a stalled device, we need to completely block its dirtiers, too,
      before its bdi dirty pages grow all the way up to the global limit and
      leave no space for the other functional devices.
      
      So change the loop exit condition to
      
      	/*
      	 * Always enforce global dirty limit; also enforce bdi dirty limit
      	 * if the normal max_pause sleeps cannot keep things under control.
      	 */
      	if (nr_dirty < dirty_thresh &&
      	    (bdi_dirty < bdi_thresh || bdi->dirty_ratelimit > 1))
      		break;
      
      which can be further simplified to
      
      	if (task_ratelimit)
      		break;
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      1df64719