1. 03 Mar, 2009 5 commits
  2. 02 Mar, 2009 1 commit
    • x86, mm: dont use non-temporal stores in pagecache accesses · f1800536
      Ingo Molnar authored
      Impact: standardize IO on cached ops
      
      On modern CPUs it is almost always a bad idea to use non-temporal stores,
      as the regression in this commit has shown:
      
        30d697fa: x86: fix performance regression in write() syscall
      
      The kernel simply has no good information about whether using non-temporal
      stores is a good idea or not - and trying to add heuristics only increases
      complexity and inserts fragility.
      
      The regression on cached write()s took very long to be found - over two
      years. So don't take any chances and let the hardware decide how it makes
      use of its caches.
      
      The only exception is drivers/gpu/drm/i915/i915_gem.c: there we are
      absolutely sure that another entity (the GPU) will pick up the dirty
      data immediately and that the CPU will not touch that data before the
      GPU will.
      
      Also, keep the _nocache() primitives to make it easier for people to
      experiment with these details. There may be more clear-cut cases where
      non-cached copies can be used, outside of filemap.c.
      
      Cc: Salman Qazi <sqazi@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      f1800536
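As a rough userspace analogue of the distinction discussed in the commit above, the sketch below copies a buffer with non-temporal (cache-bypassing) stores using SSE2 intrinsics. It is not the kernel's `__copy_from_user_nocache()`; the function name `copy_nontemporal` and its alignment requirements are assumptions made for this illustration, and on builds without SSE2 it simply falls back to `memcpy()`.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical helper: copy n bytes with non-temporal stores.
 * Assumes dst is 16-byte aligned and n is a multiple of 16.
 */
static void copy_nontemporal(void *dst, const void *src, size_t n)
{
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t i;

    for (i = 0; i < n / 16; i++)
        _mm_stream_si128(&d[i], _mm_loadu_si128(&s[i])); /* bypass the cache */
    _mm_sfence(); /* order the streaming stores before later stores */
#else
    memcpy(dst, src, n); /* no SSE2: plain cached copy */
#endif
}
```

A plain `memcpy()` is the cached counterpart: it leaves the choice of what stays in the cache to the hardware, which is the behaviour the commit standardizes on for pagecache writes.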
  3. 25 Feb, 2009 4 commits
  4. 24 Feb, 2009 8 commits
    • x86: check range in reserve_early() · 46cb27f5
      Yinghai Lu authored
      Impact: cleanup
      
      One 32-bit system reports:
      
      BIOS-provided physical RAM map:
       BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
       BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
       BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
       BIOS-e820: 0000000000100000 - 000000001c000000 (usable)
       BIOS-e820: 00000000ffff0000 - 0000000100000000 (reserved)
      DMI 2.0 present.
      last_pfn = 0x1c000 max_arch_pfn = 0x100000
      kernel direct mapping tables up to 1c000000 @ 7000-c000
      ..
      RAMDISK: 1bc69000 - 1bfef4fa
      ..
      0MB HIGHMEM available.
      448MB LOWMEM available.
        mapped low ram: 0 - 1c000000
        low ram: 00000000 - 1c000000
        bootmap 00002000 - 00005800
      (9 early reservations) ==> bootmem [0000000000 - 001c000000]
        #0 [0000000000 - 0000001000]   BIOS data page ==> [0000000000 - 0000001000]
        #1 [0000001000 - 0000002000]    EX TRAMPOLINE ==> [0000001000 - 0000002000]
        #2 [0000006000 - 0000007000]       TRAMPOLINE ==> [0000006000 - 0000007000]
        #3 [0000400000 - 00009ed14c]    TEXT DATA BSS ==> [0000400000 - 00009ed14c]
        #4 [001bc69000 - 001bfef4fa]          RAMDISK ==> [001bc69000 - 001bfef4fa]
        #5 [00009ee000 - 00009f2000]    INIT_PG_TABLE ==> [00009ee000 - 00009f2000]
        #6 [000009f400 - 0000100000]    BIOS reserved ==> [000009f400 - 0000100000]
        #7 [0000007000 - 0000007000]          PGTABLE
        #8 [0000002000 - 0000006000]          BOOTMAP ==> [0000002000 - 0000006000]
      
      Notice the strange blank PGTABLE entry.
      
      The reason is that init_pg_table is big enough, so init_memory_mapping()
      calls reserve_early() with a zero-length range.
      
      So check the range in reserve_early().
      
      v2: fix the reversed compare
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: nickpiggin@yahoo.com.au
      Cc: ink@jurassic.park.msu.ru
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      46cb27f5
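The range check described in the commit above can be sketched in standalone C as follows. This is a simplified model, not the kernel function: the table layout, `MAX_EARLY_RES` and the name `reserve_early_checked` are assumptions made for illustration. The point is only that an empty range such as the `PGTABLE` entry `[0x7000 - 0x7000]` is rejected before it is recorded.

```c
#include <stdint.h>

#define MAX_EARLY_RES 20            /* hypothetical table size */

struct early_res {
    uint64_t start, end;            /* half-open range [start, end) */
    const char *name;
};

static struct early_res early_res[MAX_EARLY_RES];
static int early_res_count;

/* Returns 1 if the range was recorded, 0 if it was rejected. */
static int reserve_early_checked(uint64_t start, uint64_t end, const char *name)
{
    if (start >= end)               /* empty or reversed range: nothing to reserve */
        return 0;
    if (early_res_count >= MAX_EARLY_RES)
        return 0;                   /* table full */

    early_res[early_res_count].start = start;
    early_res[early_res_count].end = end;
    early_res[early_res_count].name = name;
    early_res_count++;
    return 1;
}
```

With this check, `reserve_early_checked(0x7000, 0x7000, "PGTABLE")` records nothing, while `reserve_early_checked(0x2000, 0x6000, "BOOTMAP")` adds one entry.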
    • x86: efi_stub_32,64 - add missing ENDPROCs · 9f331119
      Cyrill Gorcunov authored
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: heukelum@fastmail.fm
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      9f331119
    • x86: head_64.S - use GLOBAL macro · bc8b2b92
      Cyrill Gorcunov authored
      Impact: cleanup
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: heukelum@fastmail.fm
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      bc8b2b92
    • x86: entry_64.S - add missing ENDPROC · b3baaa13
      Cyrill Gorcunov authored
      native_usergs_sysret64 is described as
      
      	extern void native_usergs_sysret64(void)
      
      so let's add ENDPROC here.
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: heukelum@fastmail.fm
      Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b3baaa13
    • x86: invalid_vm86_irq -- use predefined macros · 57e37293
      Cyrill Gorcunov authored
      Impact: cleanup
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: heukelum@fastmail.fm
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      57e37293
    • x86: head_64.S - use IDT_ENTRIES instead of hardcoded number · 5e112ae2
      Cyrill Gorcunov authored
      Impact: cleanup
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: heukelum@fastmail.fm
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      5e112ae2
    • x86: head_64.S - remove useless balign · 2a0b1001
      Cyrill Gorcunov authored
      Impact: cleanup
      
      NEXT_PAGE already has 'balign' so no
      need to keep this redundant one.
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: heukelum@fastmail.fm
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      2a0b1001
    • x86: fix performance regression in write() syscall · 30d697fa
      Salman Qazi authored
      While the introduction of __copy_from_user_nocache (see commit 0812a579)
      may have been an improvement for sufficiently large writes, there is
      evidence that it is detrimental for small writes. Unixbench's fstime test
      gives the following results for 256-byte writes with MAX_BLOCK of 2000:
      
          2.6.29-rc6 ( 5 samples, each in KB/sec ):
          283750, 295200, 294500, 293000, 293300
      
          2.6.29-rc6 + this patch (5 samples, each in KB/sec):
          313050, 3106750, 293350, 306300, 307900
      
          2.6.18
          395700, 342000, 399100, 366050, 359850
      
          See w_test() in src/fstime.c in unixbench version 4.1.0.  Basically, the above test
          consists of counting how much we can write in this manner:
      
          alarm(10);
          while (!sigalarm) {
                  for (f_blocks = 0; f_blocks < 2000; ++f_blocks) {
                         write(f, buf, 256);
                  }
                  lseek(f, 0L, 0);
          }
      
      Note, there are other components to the write syscall regression
      that are not addressed here.
      Signed-off-by: Salman Qazi <sqazi@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      30d697fa
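The write pattern quoted in the commit above can be reproduced in a self-contained form roughly as follows. This is a sketch, not the unixbench source: instead of running until `alarm(10)` fires, it performs a fixed number of passes, and the function name `write_small_blocks` and its parameters are assumptions made for this illustration.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define SMALL_BLOCK 256   /* bytes per write(), as in the quoted test */
#define MAX_BLOCK   2000  /* writes per pass before seeking back to 0 */

/* Returns the total number of bytes written, or -1 on error. */
static long write_small_blocks(const char *path, int passes)
{
    char buf[SMALL_BLOCK];
    long total = 0;
    int pass, blk;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    memset(buf, 'x', sizeof(buf));

    for (pass = 0; pass < passes; pass++) {
        for (blk = 0; blk < MAX_BLOCK; blk++) {
            ssize_t n = write(fd, buf, sizeof(buf));
            if (n != (ssize_t)sizeof(buf)) {
                close(fd);
                return -1;
            }
            total += n;
        }
        /* Rewind so every pass overwrites the same cached pages. */
        if (lseek(fd, 0L, SEEK_SET) < 0) {
            close(fd);
            return -1;
        }
    }
    close(fd);
    return total;
}
```

Timing a call such as `write_small_blocks("/tmp/fstime_sketch.tmp", 1000)` gives a throughput figure in the spirit of the KB/sec samples above, though the numbers are not directly comparable to fstime's.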
  5. 22 Feb, 2009 1 commit
  6. 20 Feb, 2009 21 commits
    • x86, mm: fault.c, update copyrights · f8eeb2e6
      Ingo Molnar authored
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      f8eeb2e6
    • x86, mm: fault.c, give another attempt at prefetch handling before SIGBUS · cd1b68f0
      Ingo Molnar authored
      Impact: extend prefetch handling on 64-bit
      
      Currently there's an extra is_prefetch() check done in do_sigbus(),
      which we only do on 32 bits.
      
      This is a last-ditch check before we terminate a task, so it's worth
      giving prefetch instructions another chance - should none of our
      existing quirks have caught a prefetch instruction related spurious
      fault.
      
      The only risk is if a prefetch causes a real SIGBUS: in that case
      we'll not OOM but try another fault. But this code has been on
      32-bit for a long time, so it should be fine in practice.
      
      So do this on 64-bit too - and thus remove one more #ifdef.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      cd1b68f0
    • x86, mm: fault.c, remove #ifdef from fault_in_kernel_space() · 7c178a26
      Ingo Molnar authored
      Impact: cleanup
      
      Removal of an #ifdef in fault_in_kernel_space(), by making
      use of the new TASK_SIZE_MAX symbol which is now available
      on 32-bit too.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      7c178a26
    • x86, mm: rename TASK_SIZE64 => TASK_SIZE_MAX · d9517346
      Ingo Molnar authored
      Impact: cleanup
      
      Rename TASK_SIZE64 to TASK_SIZE_MAX, and provide the
      define on 32-bit too. (mapped to TASK_SIZE)
      
      This allows 32-bit code to make use of the (former-) TASK_SIZE64
      symbol as well, in a clean way.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d9517346
    • x86, mm: fault.c, remove #ifdef from do_page_fault() · c3731c68
      Ingo Molnar authored
      Impact: cleanup
      
      do_page_fault() has this ugly #ifdef in its prototype:
      
        #ifdef CONFIG_X86_64
        asmlinkage
        #endif
        void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
      
      Replace it with 'dotraplinkage' which maps to exactly the above
      construct: nothing on 32-bit and asmlinkage on 64-bit.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      c3731c68
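The mapping described in the commit above ("nothing on 32-bit and asmlinkage on 64-bit") could be expressed with a macro along these lines. This is a sketch that compiles outside the kernel tree: the empty stand-in for `asmlinkage`, the function name `do_page_fault_sketch` and the `last_error_code` variable are assumptions made for illustration, and the exact form of the kernel's own definition is not shown in the commit message.

```c
/* Stand-in so the sketch builds in userspace; the kernel provides asmlinkage. */
#ifndef asmlinkage
#define asmlinkage
#endif

/* One macro hides the per-architecture difference from the prototype. */
#ifdef CONFIG_X86_64
#define dotraplinkage asmlinkage
#else
#define dotraplinkage
#endif

struct pt_regs;                       /* opaque in this sketch */

static unsigned long last_error_code; /* hypothetical, only to observe the call */

/* The prototype no longer needs an #ifdef around the linkage annotation. */
dotraplinkage void do_page_fault_sketch(struct pt_regs *regs,
                                        unsigned long error_code)
{
    (void)regs;
    last_error_code = error_code;
}
```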
    • x86, mm: fault.c, unify oops handling · 1cc99544
      Ingo Molnar authored
      Impact: add oops-recursion check to 32-bit
      
      Unify the oops state-machine, to the 64-bit version. It is
      slightly more careful in that it does a recursion check
      in oops_begin(), and is thus more likely to show the relevant
      oops.
      
      It also means that 32-bit will print one more line at the
      end of pagefault triggered oopses:
      
       	printk(KERN_EMERG "CR2: %016lx\n", address);
      
      Which is generally good information to be seen in partial-dump
      digital-camera jpegs ;-)
      
      The downside is the somewhat more complex critical path. Both
      variants have been tested well meanwhile by kernel developers
      crashing their boxes so I don't think this is a practical worry.
      
      This removes 3 ugly #ifdefs from no_context() and makes the
      function a lot nicer to read.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      1cc99544
    • x86, mm: fault.c, unify oops printing · 8f766149
      Ingo Molnar authored
      Impact: refine/extend page fault related oops printing on 64-bit
      
       - honor the pause_on_oops logic on 64-bit too
       - print out NX fault warnings on 64-bit as well
       - factor out the NX fault message to make it git-greppable and readable
      
      Note that this means that we do the PF_INSTR check on 32-bit non-PAE
      as well where it should not occur ... normally. Cannot hurt.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      8f766149
    • x86, mm: fault.c, reorder functions · f2f13a85
      Ingo Molnar authored
      Impact: cleanup
      
      Avoid a couple more #ifdefs by moving fundamentally non-unifiable
      functions into a single #ifdef 32-bit / #else / #endif block in
      fault.c: vmalloc*(), dump_pagetable(), check_v8086_mode().
      
      No code changed:
      
         text	   data	    bss	    dec	    hex	filename
         4618	     32	     24	   4674	   1242	fault.o.before
         4618	     32	     24	   4674	   1242	fault.o.after
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      f2f13a85
    • x86, mm, kprobes: fault.c, simplify notify_page_fault() · b1801812
      Ingo Molnar authored
      Impact: cleanup
      
      Remove an #ifdef from notify_page_fault(). The function still
      compiles to nothing in the !CONFIG_KPROBES case.
      
      Introduce kprobes_built_in() and kprobe_fault_handler() helpers
      to allow this - they return 0 if !CONFIG_KPROBES.
      
      No code changed:
      
         text	   data	    bss	    dec	    hex	filename
         4618	     32	     24	   4674	   1242	fault.o.before
         4618	     32	     24	   4674	   1242	fault.o.after
      
      Cc: Masami Hiramatsu <mhiramat@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b1801812
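The stub-helper pattern from the commit above can be sketched as follows. The helper names come from the commit message, but their bodies, the simplified caller `notify_page_fault_sketch` and the use of vector number 14 for page faults are assumptions made for this illustration; the real function performs additional checks that are omitted here.

```c
struct pt_regs;   /* opaque in this sketch */

#ifdef CONFIG_KPROBES
static inline int kprobes_built_in(void) { return 1; }
int kprobe_fault_handler(struct pt_regs *regs, int trapnr);
#else
/* Stubs: constant 0, so the compiler can drop the caller's branch entirely. */
static inline int kprobes_built_in(void) { return 0; }
static inline int kprobe_fault_handler(struct pt_regs *regs, int trapnr)
{
    (void)regs;
    (void)trapnr;
    return 0;
}
#endif

/* The caller needs no #ifdef: it reads the same in both configurations. */
static inline int notify_page_fault_sketch(struct pt_regs *regs)
{
    int ret = 0;

    if (kprobes_built_in() && kprobe_fault_handler(regs, 14))
        ret = 1;
    return ret;
}
```

Because the stubs are constant-folded, the caller compiles to nothing when kprobes is disabled, which is consistent with the unchanged `fault.o` size reported above. The same approach applies to any optional subsystem hooked into a hot path.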
    • x86, mm: fault.c, simplify kmmio_fault() · b814d41f
      Ingo Molnar authored
      Impact: cleanup
      
      Remove an #ifdef from kmmio_fault() - we can do this by
      providing default implementations for is_kmmio_active()
      and kmmio_handler(). The compiler optimizes it all away
      in the !CONFIG_MMIOTRACE case.
      
      Also, while at it, clean up mmiotrace.h a bit:
      
       - standard header guards
       - standard vertical spaces for structure definitions
      
      No code changed (both with mmiotrace on and off in the config):
      
         text	   data	    bss	    dec	    hex	filename
         2947	     12	     12	   2971	    b9b	fault.o.before
         2947	     12	     12	   2971	    b9b	fault.o.after
      
      Cc: Pekka Paalanen <pq@iki.fi>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b814d41f
    • x86, mm: fault.c, enable PF_RSVD checks on 32-bit too · 121d5d0a
      Ingo Molnar authored
      Impact: improve page fault handling robustness
      
      The 'PF_RSVD' flag (bit 3) of the page-fault error_code is a
      relatively recent addition to x86 CPUs, so the 32-bit do_fault()
      implementation never had it. This flag gets set when the CPU
      detects nonzero values in any reserved bits of the page directory
      entries.
      
      Extend the existing 64-bit check for PF_RSVD in do_page_fault()
      to 32-bit too. If we detect such a fault then we print a more
      informative oops and the pagetables.
      
      This unifies the code some more, removes an ugly #ifdef and improves
      the 32-bit page fault code robustness a bit. It slightly increases
      the 32-bit kernel text size.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      121d5d0a
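To make the bit position mentioned in the commit above concrete, here is a small sketch that decodes the x86 page-fault error code. The flag values follow the architectural bit layout (bit 3 being the reserved-bit flag, as the commit states); the enum name and the helper `is_reserved_bit_fault` are hypothetical names chosen for this illustration.

```c
/* x86 page-fault error code bits */
enum pf_error_code_sketch {
    PF_PROT  = 1 << 0,  /* 0: no page found,   1: protection fault  */
    PF_WRITE = 1 << 1,  /* 0: read access,     1: write access      */
    PF_USER  = 1 << 2,  /* 0: kernel-mode,     1: user-mode access  */
    PF_RSVD  = 1 << 3,  /* 1: reserved bit set in a paging entry    */
    PF_INSTR = 1 << 4,  /* 1: fault was an instruction fetch        */
};

/* Hypothetical helper: true if the CPU reported a reserved-bit violation. */
static int is_reserved_bit_fault(unsigned long error_code)
{
    return (error_code & PF_RSVD) != 0;
}
```

A fault with `PF_RSVD` set indicates corrupted page tables rather than an ordinary missing or protected page, which is why it is worth a more informative oops.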
    • x86, mm: fault.c, factor out the vm86 fault check · 8c938f9f
      Ingo Molnar authored
      Impact: cleanup
      
      Instead of an ugly, open-coded, #ifdef-ed vm86 related legacy check
      in do_page_fault(), put it into the check_v8086_mode() helper
      function and merge it with an existing #ifdef.
      
      Also, simplify the code flow a tiny bit in the helper.
      
      No code changed:
      
      arch/x86/mm/fault.o:
      
         text	   data	    bss	    dec	    hex	filename
         2711	     12	     12	   2735	    aaf	fault.o.before
         2711	     12	     12	   2735	    aaf	fault.o.after
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      8c938f9f
    • x86, mm: fault.c, refactor/simplify the is_prefetch() code · 107a0367
      Ingo Molnar authored
      Impact: no functionality changed
      
      Factor out the opcode checker into a helper inline.
      
      The code got a tiny bit smaller:
      
         text	   data	    bss	    dec	    hex	filename
         4632	     32	     24	   4688	   1250	fault.o.before
         4618	     32	     24	   4674	   1242	fault.o.after
      
      And it got cleaner / easier to review as well.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      107a0367
    • x86, mm: fault.c cleanup · 2d4a7167
      Ingo Molnar authored
      Impact: cleanup, no code changed
      
      Clean up various small details, which can be correctness checked
      automatically:
      
       - tidy up the include file section
       - eliminate unnecessary includes
       - introduce show_signal_msg() to clean up code flow
       - standardize the code flow
       - standardize comments and other style details
       - more cleanups, pointed out by checkpatch
      
      No code changed on either 32-bit or 64-bit:
      
      arch/x86/mm/fault.o:
      
         text	   data	    bss	    dec	    hex	filename
         4632	     32	     24	   4688	   1250	fault.o.before
         4632	     32	     24	   4688	   1250	fault.o.after
      
      the md5 changed due to a change in a single instruction:
      
         2e8a8241e7f0d69706776a5a26c90bc0  fault.o.before.asm
         c5c3d36e725586eb74f0e10692f0193e  fault.o.after.asm
      
      Because a __LINE__ reference in a WARN_ONCE() has changed.
      
      On 32-bit a few stack offsets changed - no code size difference
      nor any functionality difference.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      2d4a7167
    • Merge branch 'tip/x86/urgent' of... · c9e1585b
      Ingo Molnar authored
      Merge branch 'tip/x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into x86/mm
      c9e1585b
    • x86, pat: add large-PAT check to split_large_page() · 7a5714e0
      Ingo Molnar authored
      Impact: future-proof the split_large_page() function
      
      Linus noticed that split_large_page() is not safe wrt. the
      PAT bit: it is bit 12 on the 1GB and 2MB page table level
      (_PAGE_BIT_PAT_LARGE), and it is bit 7 on the 4K page
      table level (_PAGE_BIT_PAT).
      
      Currently it is not a problem because we never set
      _PAGE_BIT_PAT_LARGE on any of the large-page mappings - but
      should this happen in the future the split_large_page() would
      silently lift bit 12 into the lowlevel 4K pte and would start
      corrupting the physical page frame offset. Not fun.
      
      So add a debug warning, to make sure if something ever sets
      the PAT bit then this function gets updated too.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      7a5714e0
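The hazard described in the commit above comes from the PAT bit living at different positions: bit 12 in a large-page entry and bit 7 in a 4K entry. The commit itself only adds a debug warning. The sketch below shows what a correct translation of the flag bits might look like if large-page PAT were ever used; the names `PAT_BIT_4K`, `PAT_BIT_LARGE` and `large_to_4k_flags` are hypothetical stand-ins (for `_PAGE_BIT_PAT` and `_PAGE_BIT_PAT_LARGE`) chosen for this illustration.

```c
#include <stdint.h>

#define PAT_BIT_4K     7   /* position of PAT in a 4K pte           */
#define PAT_BIT_LARGE 12   /* position of PAT in a 2MB/1GB entry    */

/*
 * Hypothetical helper: convert large-page flag bits to 4K pte flag bits.
 * Bit 7 of a large-page entry is the page-size bit, so it is cleared
 * before the PAT value is moved down into bit 7.
 */
static uint64_t large_to_4k_flags(uint64_t flags)
{
    uint64_t pat_large = (flags >> PAT_BIT_LARGE) & 1;

    flags &= ~((uint64_t)1 << PAT_BIT_LARGE); /* would alias the frame address */
    flags &= ~((uint64_t)1 << PAT_BIT_4K);    /* drop the page-size bit        */
    flags |= pat_large << PAT_BIT_4K;         /* PAT now at its 4K position    */
    return flags;
}
```

Without such a translation, bit 12 would be copied unchanged into the 4K pte, where it belongs to the physical frame address, which is the corruption the commit message warns about.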
    • x86: check PMD in spurious_fault handler · 3c3e5694
      Steven Rostedt authored
      Impact: fix to prevent hard lockup on bad PMD permissions
      
      If the PMD does not have the correct permissions for a page access,
      but the PTE does, the spurious fault handler will mistake the fault
      as a lazy TLB transaction. This will result in an infinite loop of:
      
       fault -> spurious_fault check (pass) -> return to code -> fault
      
      This patch adds a check, and a warning if the PTE passes the permission
      check but the PMD does not.
      
      [ Updated: Ingo Molnar suggested using WARN_ONCE with some text ]
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      3c3e5694
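The logic of the fix above can be modelled in a few lines of standalone C. This is a simplified model rather than the kernel code: the `entry_perms` structure and both function names are assumptions made for illustration, and only the present and writable bits are considered. A fault is treated as spurious (safe to retry) only if both the PTE and the PMD actually permit the access.

```c
#include <stdbool.h>

/* Hypothetical, reduced view of one page-table entry's permissions. */
struct entry_perms {
    bool present;
    bool writable;
};

static bool entry_allows(struct entry_perms e, bool is_write)
{
    if (!e.present)
        return false;
    if (is_write && !e.writable)
        return false;
    return true;
}

/*
 * Returns true if the fault can be explained by a stale TLB entry.
 * Checking only the PTE would return true for a read-only PMD with a
 * writable PTE, and the faulting instruction would be retried forever.
 */
static bool fault_is_spurious(struct entry_perms pmd, struct entry_perms pte,
                              bool is_write)
{
    if (!entry_allows(pte, is_write))
        return false;
    return entry_allows(pmd, is_write);   /* the added upper-level check */
}
```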
    • Merge branch 'x86/urgent' into x86/core · 3b6f7b9b
      Ingo Molnar authored
      3b6f7b9b
    • x86: use symbolic constants for MSR_IA32_MISC_ENABLE bits · ecab22aa
      Vegard Nossum authored
      Impact: Cleanup. No functional changes.
      Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      ecab22aa
    • x86: use the right protections for split-up pagetables · 07a66d7c
      Ingo Molnar authored
      Steven Rostedt found a bug where, in his modified kernel,
      ftrace was unable to modify the kernel text, due to the PMD
      itself having been marked read-only as well in
      split_large_page().
      
      The fix, suggested by Linus, is to not try to 'clone' the
      reference protection of a huge-page, but to use the standard
      (and permissive) page protection bits of KERNPG_TABLE.
      
      The 'cloning' makes sense for the ptes but it's a confused and
      incorrect concept at the page table level - because the
      pagetable entry is a set of all ptes and hence cannot
      'clone' any single protection attribute - the ptes can be any
      mixture of protections.
      
      With the permissive KERNPG_TABLE, even if the pte protections
      get changed after this point (due to ftrace doing code-patching
      or other similar activities like kprobes), the resulting combined
      protections will still be correct and the pte's restrictive
      (or permissive) protections will control it.
      
      Also update the comment.
      
      This bug was there for a long time but has not caused visible
      problems before as it needs a rather large read-only area to
      trigger. Steve possibly hacked his kernel with some really
      large arrays or so. Anyway, the bug is definitely worth fixing.
      
      [ Huang Ying also experienced problems in this area when writing
        the EFI code, but the real bug in split_large_page() was not
        realized back then. ]
      Reported-by: Steven Rostedt <rostedt@goodmis.org>
      Reported-by: Huang Ying <ying.huang@intel.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      07a66d7c