1. 17 Jun, 2009 18 commits
    • Wu Fengguang's avatar
    • Wu Fengguang's avatar
      readahead: introduce context readahead algorithm · 10be0b37
      Wu Fengguang authored
      Introduce page cache context based readahead algorithm.
      This is to better support concurrent read streams in general.
      
      RATIONALE
      ---------
      The current readahead algorithm detects interleaved reads in a _passive_ way.
      Given a sequence of interleaved streams 1,1001,2,1002,3,4,1003,5,1004,1005,6,...
      By checking for (offset == prev_offset + 1), it will discover the sequentialness
      between 3,4 and between 1004,1005, and start doing sequential readahead for the
      individual streams since page 4 and page 1005.
      
      The context readahead algorithm guarantees to discover the sequentialness no
      matter how the streams are interleaved. For the above example, it will start
      sequential readahead since page 2 and 1002.
      
      The trick is to poke for page @offset-1 in the page cache when it has no other
      clues on the sequentialness of request @offset: if the current requenst belongs
      to a sequential stream, that stream must have accessed page @offset-1 recently,
      and the page will still be cached now. So if page @offset-1 is there, we can
      take request @offset as a sequential access.
      
      BENEFICIARIES
      -------------
      - strictly interleaved reads  i.e. 1,1001,2,1002,3,1003,...
        the current readahead will take them as silly random reads;
        the context readahead will take them as two sequential streams.
      
      - cooperative IO processes   i.e. NFS and SCST
        They create a thread pool, farming off (sequential) IO requests to different
        threads which will be performing interleaved IO.
      
        It was not easy(or possible) to reliably tell from file->f_ra all those
        cooperative processes working on the same sequential stream, since they will
        have different file->f_ra instances. And NFSD's file->f_ra is particularly
        unusable, since their file objects are dynamically created for each request.
        The nfsd does have code trying to restore the f_ra bits, but not satisfactory.
      
        The new scheme is to detect the sequential pattern via looking up the page
        cache, which provides one single and consistent view of the pages recently
        accessed. That makes sequential detection for cooperative processes possible.
      
      USER REPORT
      -----------
      Vladislav recommends the addition of context readahead as a result of his SCST
      benchmarks. It leads to 6%~40% performance gains in various cases and achieves
      equal performance in others.                http://lkml.org/lkml/2009/3/19/239
      
      OVERHEADS
      ---------
      In theory, it introduces one extra page cache lookup per random read.  However
      the below benchmark shows context readahead to be slightly faster, wondering..
      
      Randomly reading 200MB amount of data on a sparse file, repeat 20 times for
      each block size. The average throughputs are:
      
                             	original ra	context ra	gain
       4K random reads:	 65.561MB/s	 65.648MB/s	+0.1%
      16K random reads:	124.767MB/s	124.951MB/s	+0.1%
      64K random reads: 	162.123MB/s	162.278MB/s	+0.1%
      
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Tested-by: default avatarVladislav Bolkhovitin <vst@vlnb.net>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      10be0b37
    • Wu Fengguang's avatar
      readahead: move the random read case to bottom · 045a2529
      Wu Fengguang authored
      Split all readahead cases, and move the random one to bottom.
      
      No behavior changes.
      
      This is to prepare for the introduction of context readahead, and make it
      easy for inserting accounting/tracing points for each case.
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Vladislav Bolkhovitin <vst@vlnb.net>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      045a2529
    • Wu Fengguang's avatar
      radix-tree: add radix_tree_prev_hole() · dc566127
      Wu Fengguang authored
      The counterpart of radix_tree_next_hole(). To be used by context readahead.
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Vladislav Bolkhovitin <vst@vlnb.net>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dc566127
    • Wu Fengguang's avatar
      readahead: record mmap read-around states in file_ra_state · d30a1100
      Wu Fengguang authored
      Mmap read-around now shares the same code style and data structure with
      readahead code.
      
      This also removes do_page_cache_readahead().  Its last user, mmap
      read-around, has been changed to call ra_submit().
      
      The no-readahead-if-congested logic is dumped by the way.  Users will be
      pretty sensitive about the slow loading of executables.  So it's
      unfavorable to disabled mmap read-around on a congested queue.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarFengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d30a1100
    • Wu Fengguang's avatar
      readahead: enforce full readahead size on async mmap readahead · 2fad6f5d
      Wu Fengguang authored
      We need this in one particular case and two more general ones.
      
      Now we do async readahead for sequential mmap reads, and do it with the
      help of PG_readahead.  For normal reads, PG_readahead is the sufficient
      condition to do a sequential readahead.  But unfortunately, for mmap
      reads, there is a tiny nuisance:
      
      [11736.998347] readahead-init0(process: sh/23926, file: sda1/w3m, offset=0:4503599627370495, ra=0+4-3) = 4
      [11737.014985] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=290+32-0) = 17
      [11737.019488] readahead-around(process: w3m/23926, file: sda1/w3m, offset=0:0, ra=118+32-0) = 32
      [11737.024921] readahead-interleaved(process: w3m/23926, file: sda1/w3m, offset=0:2, ra=4+6-6) = 6
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                                 ~~~~~~~~~~~~~
      
      An unfavorably small readahead.  The original dumb read-around size could
      be more efficient.
      
      That happened because ld-linux.so does a read(832) in L1 before mmap(),
      which triggers a 4-page readahead, with the second page tagged
      PG_readahead.
      
      L0: open("/lib/libc.so.6", O_RDONLY)        = 3
      L1: read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\342"..., 832) = 832
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      L2: fstat(3, {st_mode=S_IFREG|0755, st_size=1420624, ...}) = 0
      L3: mmap(NULL, 3527256, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fac6e51d000
      L4: mprotect(0x7fac6e671000, 2097152, PROT_NONE) = 0
      L5: mmap(0x7fac6e871000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x154000) = 0x7fac6e871000
      L6: mmap(0x7fac6e876000, 16984, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fac6e876000
      L7: close(3)                                = 0
      
      In general, the PG_readahead flag will also be hit in cases
      
      - sequential reads
      
      - clustered random reads
      
      A full readahead size is desirable in both cases.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2fad6f5d
    • Wu Fengguang's avatar
      readahead: sequential mmap readahead · 70ac23cf
      Wu Fengguang authored
      Auto-detect sequential mmap reads and do readahead for them.
      
      The sequential mmap readahead will be triggered when
      - sync readahead: it's a major fault and (prev_offset == offset-1);
      - async readahead: minor fault on PG_readahead page with valid readahead state.
      
      The benefits of doing readahead instead of read-around:
      - less I/O wait thanks to async readahead
      - double real I/O size and no more cache hits
      
      The single stream case is improved a little.
      For 100,000 sequential mmap reads:
      
                                          user       system    cpu        total
      (1-1)  plain -mm, 128KB readaround: 3.224      2.554     48.40%     11.838
      (1-2)  plain -mm, 256KB readaround: 3.170      2.392     46.20%     11.976
      (2)  patched -mm, 128KB readahead:  3.117      2.448     47.33%     11.607
      
      The patched (2) has smallest total time, since it has no cache hit overheads
      and less I/O block time(thanks to async readahead). Here the I/O size
      makes no much difference, since there's only one single stream.
      
      Note that (1-1)'s real I/O size is 64KB and (1-2)'s real I/O size is 128KB,
      since the half of the read-around pages will be readahead cache hits.
      
      This is going to make _real_ differences for _concurrent_ IO streams.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      70ac23cf
    • Linus Torvalds's avatar
      readahead: clean up and simplify the code for filemap page fault readahead · ef00e08e
      Linus Torvalds authored
      This shouldn't really change behavior all that much, but the single rather
      complex function with read-ahead inside a loop etc is broken up into more
      manageable pieces.
      
      The behaviour is also less subtle, with the read-ahead being done up-front
      rather than inside some subtle loop and thus avoiding the now unnecessary
      extra state variables (ie "did_readaround" is gone).
      
      Fengguang: the code split in fact fixed a bug reported by Pavel Levshin:
      the PGMAJFAULT accounting used to be bypassed when MADV_RANDOM is set, in
      which case the original code will directly jump to no_cached_page reading.
      
      Cc: Pavel Levshin <lpk@581.spb.su>
      Cc: <wli@movementarian.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ef00e08e
    • Wu Fengguang's avatar
      readahead: remove sync/async readahead call dependency · 51daa88e
      Wu Fengguang authored
      The readahead call scheme is error-prone in that it expects the call sites
      to check for async readahead after doing a sync one.  I.e.
      
      			if (!page)
      				page_cache_sync_readahead();
      			page = find_get_page();
      			if (page && PageReadahead(page))
      				page_cache_async_readahead();
      
      This is because PG_readahead could be set by a sync readahead for the
      _current_ newly faulted in page, and the readahead code simply expects one
      more callback on the same page to start the async readahead.  If the
      caller fails to do so, it will miss the PG_readahead bits and never able
      to start an async readahead.
      
      Eliminate this insane constraint by piggy-backing the async part into the
      current readahead window.
      
      Now if an async readahead should be started immediately after a sync one,
      the readahead logic itself will do it.  So the following code becomes
      valid: (the 'else' in particular)
      
      			if (!page)
      				page_cache_sync_readahead();
      			else if (PageReadahead(page))
      				page_cache_async_readahead();
      
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      51daa88e
    • Wu Fengguang's avatar
      readahead: increase interleaved readahead size · 160334a0
      Wu Fengguang authored
      Make sure interleaved readahead size is larger than request size.  This
      also makes the readahead window grow up more quickly.
      Reported-by: default avatarXu Chenfeng <xcf@ustc.edu.cn>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      160334a0
    • Wu Fengguang's avatar
      readahead: remove one unnecessary radix tree lookup · caca7cb7
      Wu Fengguang authored
      (hit_readahead_marker != 0) means the page at @offset is present, so we
      can search for non-present page starting from @offset+1.
      Reported-by: default avatarXu Chenfeng <xcf@ustc.edu.cn>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      caca7cb7
    • Wu Fengguang's avatar
      readahead: apply max_sane_readahead() limit in ondemand_readahead() · fc31d16a
      Wu Fengguang authored
      Just in case someone aggressively sets a huge readahead size.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fc31d16a
    • Wu Fengguang's avatar
      readahead: move max_sane_readahead() calls into force_page_cache_readahead() · f7e839dd
      Wu Fengguang authored
      Impact: code simplification.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f7e839dd
    • Wu Fengguang's avatar
      readahead: make mmap_miss an unsigned int · 1ebf26a9
      Wu Fengguang authored
      This makes the performance impact of possible mmap_miss wrap around to be
      temporary and tolerable: i.e.  MMAP_LOTSAMISS=100 extra readarounds.
      
      Otherwise if ever mmap_miss wraps around to negative, it takes INT_MAX
      cache misses to bring it back to normal state.  During the time mmap
      readaround will be _enabled_ for whatever wild random workload.  That's
      almost permanent performance impact.
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1ebf26a9
    • Alexey Dobriyan's avatar
      mm: consolidate init_mm definition · bb1f17b0
      Alexey Dobriyan authored
      * create mm/init-mm.c, move init_mm there
      * remove INIT_MM, initialize init_mm with C99 initializer
      * unexport init_mm on all arches:
      
        init_mm is already unexported on x86.
      
        One strange place is some OMAP driver (drivers/video/omap/) which
        won't build modular, but it's already wants get_vm_area() export.
        Somebody should look there.
      
      [akpm@linux-foundation.org: add missing #includes]
      Signed-off-by: default avatarAlexey Dobriyan <adobriyan@gmail.com>
      Cc: Mike Frysinger <vapier.adi@gmail.com>
      Cc: Americo Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bb1f17b0
    • Yinghai Lu's avatar
      firmware_map: fix hang with x86/32bit · 3b0fde0f
      Yinghai Lu authored
      Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13484
      
      Peer reported:
      | The bug is introduced from kernel 2.6.27, if E820 table reserve the memory
      | above 4G in 32bit OS(BIOS-e820: 00000000fff80000 - 0000000120000000
      | (reserved)), system will report Int 6 error and hang up. The bug is caused by
      | the following code in drivers/firmware/memmap.c, the resource_size_t is 32bit
      | variable in 32bit OS, the BUG_ON() will be invoked to result in the Int 6
      | error. I try the latest 32bit Ubuntu and Fedora distributions, all hit this
      | bug.
      |======
      |static int firmware_map_add_entry(resource_size_t start, resource_size_t end,
      |                  const char *type,
      |                  struct firmware_map_entry *entry)
      
      and it only happen with CONFIG_PHYS_ADDR_T_64BIT is not set.
      
      it turns out we need to pass u64 instead of resource_size_t for that.
      
      [akpm@linux-foundation.org: add comment]
      Reported-and-tested-by: default avatarPeer Chen <pchen@nvidia.com>
      Signed-off-by: default avatarYinghai Lu <yinghai@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Acked-by: default avatarH. Peter Anvin <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3b0fde0f
    • Roel Kluin's avatar
      spi: takes size of a pointer to determine the size of the pointed-to type · 02141546
      Roel Kluin authored
      Do not take the size of a pointer to determine the size of the pointed-to
      type.
      Signed-off-by: default avatarRoel Kluin <roel.kluin@gmail.com>
      Acked-by: default avatarAnton Vorontsov <avorontsov@ru.mvista.com>
      Cc: David Brownell <david-b@pacbell.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Kumar Gala <galak@gate.crashing.org>
      Acked-by: default avatarGrant Likely <grant.likely@secretlab.ca>
      Cc: <stable@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      02141546
    • Arnd Bergmann's avatar
      time: move PIT_TICK_RATE to linux/timex.h · 08604bd9
      Arnd Bergmann authored
      PIT_TICK_RATE is currently defined in four architectures, but in three
      different places.  While linux/timex.h is not the perfect place for it, it
      is still a reasonable replacement for those drivers that traditionally use
      asm/timex.h to get CLOCK_TICK_RATE and expect it to be the PIT frequency.
      
      Note that for Alpha, the actual value changed from 1193182UL to 1193180UL.
       This is unlikely to make a difference, and probably can only improve
      accuracy.  There was a discussion on the correct value of CLOCK_TICK_RATE
      a few years ago, after which every existing instance was getting changed
      to 1193182.  According to the specification, it should be
      1193181.818181...
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: john stultz <johnstul@us.ibm.com>
      Cc: Dmitry Torokhov <dtor@mail.ru>
      Cc: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      08604bd9
  2. 15 Jun, 2009 22 commits