1. 02 May, 2007 40 commits
    • [PATCH] x86: tighten kernel image page access rights · 6fb14755
      Jan Beulich authored
      On x86-64, kernel memory freed after init can be entirely unmapped instead
      of just getting 'poisoned' by overwriting with a debug pattern.
      
      On i386 and x86-64 (under CONFIG_DEBUG_RODATA), kernel text and bug table
      can also be write-protected.
      
      Compared to the first version, this one prevents re-creating deleted
      mappings in the kernel image range on x86-64, if those got removed
      previously. This, together with the original changes, prevents temporarily
      having inconsistent mappings when cacheability attributes are being
      changed on such pages (e.g. from AGP code). While on i386 such duplicate
      mappings don't exist, the same change is done there, too, both for
      consistency and because checking pte_present() before using various other
      pte_XXX functions is a requirement anyway. At the same time, the i386 code
      is adjusted to use pte_huge() instead of open-coding the check.
      
      AK: split out cpa() changes
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
    • [PATCH] x86: Improve handling of kernel mappings in change_page_attr · d01ad8dd
      Jan Beulich authored
      Fix various broken corner cases in i386 and x86-64 change_page_attr.
      
      AK: split off from tighten kernel image access rights
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
    • [PATCH] i386: rationalize paravirt wrappers · 90a0a06a
      Rusty Russell authored
      paravirt.c used to implement native versions of all low-level
      functions.  Far cleaner is to have the native versions exposed in the
      headers and as inline native_XXX, and if !CONFIG_PARAVIRT, then simply
      #define XXX native_XXX.
      
      There are several nice side effects:
      
      1) write_dt_entry() now takes the correct "struct Xgt_desc_struct *"
         not "void *".
      
      2) load_TLS is reintroduced to the for loop, not manually unrolled
         with a #error in case the bounds ever change.
      
      3) Macros become inlines, with type checking.
      
      4) Access to the native versions is trivial for KVM, lguest, Xen and
         others who might want it.
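
      The wrapper pattern described above can be sketched roughly as follows.
      This is a minimal illustration, not the kernel's actual headers: the
      fake register, the ops structure and every name ending in _sketch are
      hypothetical stand-ins.

```c
/* Pretend hardware register, so the example can run in user space. */
static unsigned long fake_cr3_sketch;

/* Native versions are always available as typed inline functions. */
static inline void native_write_cr3(unsigned long val)
{
	fake_cr3_sketch = val;
}

static inline unsigned long native_read_cr3(void)
{
	return fake_cr3_sketch;
}

#ifdef CONFIG_PARAVIRT
/* With paravirt, calls go through a table that a hypervisor could patch. */
struct pv_ops_sketch {
	void (*write_cr3)(unsigned long);
	unsigned long (*read_cr3)(void);
};

static struct pv_ops_sketch pv_ops_sketch = {
	native_write_cr3,
	native_read_cr3,
};

static inline void write_cr3(unsigned long val)
{
	pv_ops_sketch.write_cr3(val);
}

static inline unsigned long read_cr3(void)
{
	return pv_ops_sketch.read_cr3();
}
#else
/* Without paravirt, the generic name is simply the native version. */
#define write_cr3 native_write_cr3
#define read_cr3  native_read_cr3
#endif
```

      Either way, callers use write_cr3()/read_cr3(), while a hypervisor
      backend can still reach native_write_cr3() directly.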
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Avi Kivity <avi@qumranet.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • [PATCH] i386: Rename boot_gdt_table to boot_gdt · 52de74dd
      Sebastien Dugue authored
      Rename boot_gdt_table to boot_gdt to avoid the duplicate T(able).
      Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • [PATCH] i386: clean up cpu_init() · d2cbcc49
      Rusty Russell authored
      We now have cpu_init() and secondary_cpu_init() doing nothing but calling
      _cpu_init() with the same arguments.  Rename _cpu_init() to cpu_init() and use
      it as a replacement for secondary_cpu_init().
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • [PATCH] i386: Use per-cpu GDT immediately upon boot · bf504672
      Rusty Russell authored
      Now we are no longer dynamically allocating the GDT, we don't need the
      "cpu_gdt_table" at all: we can switch straight from "boot_gdt_table" to the
      per-cpu GDT.  This means initializing the cpu_gdt array in C.
      
      The boot CPU uses the per-cpu var directly, then in smp_prepare_cpus() it
      switches to the per-cpu copy just allocated.  For secondary CPUs, the
      early_gdt_descr is set to point directly to their per-cpu copy.
      
      For UP the code is very simple: it keeps using the "per-cpu" GDT as per SMP,
      but we never have to move.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • [PATCH] i386: Use per-cpu variables for GDT, PDA · ae1ee11b
      Rusty Russell authored
      Allocating PDA and GDT at boot is a pain.  Using simple per-cpu variables adds
      happiness (although we need the GDT page-aligned for Xen, which we do in a
      followup patch).
      
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • [PATCH] x86: add command line length to boot protocol · 8f9aeca7
      Bernhard Walle authored
      Because the command line is increased to 2048 characters after 2.6.21, it's
      not possible for boot loaders and userspace tools to determine the length
      of the command line the kernel can understand.  The benefit of knowing the
      length is that users can be warned if the command line size is too long
      which prevents surprise if things don't work after bootup.
      
      This patch updates the boot protocol to contain a field called
      "cmdline_size" that contains the length of the command line (excluding the
      terminating zero).
      
      The patch also adds missing fields (of protocol version 2.05) to the x86_64
      setup code.
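
      As an illustration of how a boot loader or userspace tool might use the
      new field, here is a minimal sketch. The struct below is a hypothetical
      one-field view, not the real setup header layout, and cmdline_fits() is
      an invented helper name.

```c
#include <string.h>

/* Hypothetical view of the single header field this example needs. */
struct setup_header_sketch {
	unsigned int cmdline_size;	/* max length, excluding the terminating zero */
};

/* Return 1 if the command line fits within the advertised limit, 0 if not. */
static int cmdline_fits(const struct setup_header_sketch *hdr,
			const char *cmdline)
{
	return strlen(cmdline) <= hdr->cmdline_size;
}
```

      A loader could warn the user when cmdline_fits() returns 0 instead of
      passing a command line the kernel cannot fully read.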
      Signed-off-by: Bernhard Walle <bwalle@suse.de>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Alon Bar-Lev <alon.barlev@gmail.com>
      Acked-by: H. Peter Anvin <hpa@zytor.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • [PATCH] i386: Allow i386 crash kernels to handle x86_64 dumps · 79e03011
      Ian Campbell authored
      The specific case I am encountering is kdump under Xen with a 64 bit
      hypervisor and 32 bit kernel/userspace.  The dump created is 64 bit due to
      the hypervisor but the dump kernel is 32 bit for maximum compatibility.
      
      It's possibly less likely to be useful in a purely native scenario but I
      see no reason to disallow it.
      
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: Ian Campbell <ian.campbell@xensource.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Acked-by: Vivek Goyal <vgoyal@in.ibm.com>
      Cc: Horms <horms@verge.net.au>
      Cc: Magnus Damm <magnus.damm@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • [PATCH] x86-64: Introduce load_TLS to the "for" loop. · eab0c72a
      Rusty Russell authored
      GCC (4.1 at least) unrolls it anyway, but I can't believe this code
      was ever justifiable.  (I've also submitted a patch which cleans up
      i386, which is even uglier).
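
      For illustration only, the loop form looks roughly like the sketch
      below. The constants, the struct and the function name are hypothetical
      stand-ins, not the kernel's definitions.

```c
#include <stdint.h>

#define TLS_ENTRIES_SKETCH 3	/* assumed number of TLS slots */
#define TLS_MIN_SKETCH     6	/* assumed index of the first TLS slot in the GDT */

struct thread_sketch {
	uint64_t tls_array[TLS_ENTRIES_SKETCH];
};

/* Copy every TLS descriptor into its GDT slot with a plain loop. */
static void load_tls_sketch(const struct thread_sketch *t, uint64_t *gdt)
{
	int i;

	for (i = 0; i < TLS_ENTRIES_SKETCH; i++)
		gdt[TLS_MIN_SKETCH + i] = t->tls_array[i];
}
```

      If the number of entries ever changes, the loop follows automatically,
      and the compiler is still free to unroll it.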
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • [PATCH] i386: Initialize esp0 properly all the time · 692174b9
      Rusty Russell authored
      Whenever we schedule, __switch_to calls load_esp0 which does:
      
      	tss->esp0 = thread->esp0;
      
      This is never initialized for the initial thread (ie "swapper"), so when we're
      scheduling that, we end up setting esp0 to 0.  This is fine: the swapper never
      leaves ring 0, so this field is never used.
      
      lguest, however, gets upset that we're trying to use an unmapped page as our
      kernel stack.  Rather than work around it there, let's initialize it.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • [PATCH] i386: VDSO_PRELINK warning fix · 1b523fb5
      Andrew Morton authored
      The lguest patches somehow managed to trigger this:
      
      In file included from arch/i386/lguest/lguest.c:38:
      include/asm/asm-offsets.h:67:1: warning: "VDSO_PRELINK" redefined
      In file included from include/linux/elf.h:7,
                       from include/linux/module.h:15,
                       from include/linux/device.h:21,
                       from include/linux/interrupt.h:15,
                       from arch/i386/lguest/lguest.c:27:
      include/asm/elf.h:140:1: warning: this is the location of the previous definition
      
      I assume that using the same identifier twice was a bad idea..
      
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andi Kleen <ak@suse.de>
    • [PATCH] x86-64: fake numa for cpusets document · 20280195
      David Rientjes authored
      Create a document to explain how to use numa=fake in conjunction with cpusets
      for coarse memory resource management.
      
      An attempt to get more awareness and testing for this feature.
      
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • [PATCH] x86: remove constant_tsc reporting from /proc/cpuinfo' power flags · d824395c
      Joerg Roedel authored
      remove the reporting of the constant_tsc flag from the "power management"
      field in /proc/cpuinfo.  The NULL value there was replaced by "" because
      the former would result in a printout of [8] if the flag is set.
      Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andi Kleen <ak@suse.de>
    • [PATCH] x86-64: fixed size remaining fake nodes · 382591d5
      David Rientjes authored
      Extends the numa=fake x86_64 command-line option to split the remaining system
      memory into nodes of fixed size.  Any leftover memory is allocated to a final
      node unless the command-line ends with a comma.
      
      For example:
        numa=fake=2*512,*128	gives two 512M nodes and the remaining system
      			memory is split into nodes of 128M each.
      
      This is beneficial for systems where the exact size of RAM is unknown or not
      necessarily relevant, but the size of the remaining nodes to be allocated is
      known based on their capacity for resource management.
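
      The arithmetic behind the "*128" form can be sketched as below. This is
      a simplified illustration in megabytes; the actual code presumably
      works on available pages, and split_fixed() and struct fake_split are
      hypothetical names.

```c
/* Result of carving the remaining memory into fixed-size nodes. */
struct fake_split {
	unsigned long nodes;		/* number of full fixed-size nodes */
	unsigned long leftover_mb;	/* memory left for a possible final node */
};

static struct fake_split split_fixed(unsigned long remaining_mb,
				     unsigned long node_mb)
{
	struct fake_split r = { 0, remaining_mb };

	if (node_mb == 0)	/* nothing sensible to split by */
		return r;

	r.nodes = remaining_mb / node_mb;
	r.leftover_mb = remaining_mb % node_mb;
	return r;
}
```

      With 1100M remaining and 128M nodes this gives eight full nodes and
      76M left over, which would go to a final node unless the command line
      ends with a comma.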
      
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • [PATCH] x86-64: split remaining fake nodes equally · 14694d73
      David Rientjes authored
      Extends the numa=fake x86_64 command-line option to split the remaining
      system memory into equal-sized nodes.
      
      For example:
      numa=fake=2*512,4*	gives two 512M nodes and the remaining system
      			memory is split into four approximately equal
      			chunks.
      
      This is beneficial for systems where the exact size of RAM is unknown or not
      necessarily relevant, but the granularity with which nodes shall be allocated
      is known.
      
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • [PATCH] x86-64: configurable fake numa node sizes · 8b8ca80e
      David Rientjes authored
      Extends the numa=fake x86_64 command-line option to allow for configurable
      node sizes.  These nodes can be used in conjunction with cpusets for coarse
      memory resource management.
      
      The old command-line option is still supported:
        numa=fake=32	gives 32 fake NUMA nodes, ignoring the NUMA setup of the
      		actual machine.
      
      But now you may configure your system for the node sizes of your choice:
        numa=fake=2*512,1024,2*256
      		gives two 512M nodes, one 1024M node, two 256M nodes, and
      		the rest of system memory to a sixth node.
      
      The existing hash function is maintained to support the various node sizes
      that are possible with this implementation.
      
      Each node of the same size receives roughly the same amount of available
      pages, regardless of any reserved memory within its address range.  The total
      available pages on the system is calculated and divided by the number of equal
      nodes to allocate.  These nodes are then dynamically allocated and their
      borders extended until such time as their number of available pages reaches
      the required size.
      
      Configurable node sizes are recommended when used in conjunction with cpusets
      for memory control because it eliminates the overhead associated with scanning
      the zonelists of many smaller full nodes on page_alloc().
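
      To make the "N*SIZE,SIZE,..." syntax concrete, here is a minimal
      parsing sketch. It is not the kernel's parser: parse_fake_nodes() is a
      hypothetical helper, it handles only the explicit sizes shown in this
      patch, and it does not model the final node that receives the rest of
      system memory.

```c
#include <stdlib.h>

/*
 * Parse a comma-separated list of "SIZE" or "COUNT*SIZE" items (sizes in MB)
 * into sizes_mb[]. Returns the number of nodes written, or -1 on a syntax
 * error or if more than max nodes are requested.
 */
static int parse_fake_nodes(const char *s, unsigned long *sizes_mb, int max)
{
	int n = 0;

	while (*s) {
		char *end;
		unsigned long count = 1;
		unsigned long val = strtoul(s, &end, 10);

		if (end == s)
			return -1;		/* no digits where a number was expected */

		if (*end == '*') {		/* "COUNT*SIZE": first number is the count */
			count = val;
			s = end + 1;
			val = strtoul(s, &end, 10);
			if (end == s)
				return -1;
		}

		while (count--) {
			if (n >= max)
				return -1;
			sizes_mb[n++] = val;
		}

		if (*end == ',')
			end++;			/* step over the separator */
		else if (*end != '\0')
			return -1;
		s = end;
	}
	return n;
}
```

      For "2*512,1024,2*256" the sketch yields five explicit nodes of 512,
      512, 1024, 256 and 256 MB.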
      
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • [PATCH] i386: fix GDT's number of quadwords in comment · 8280c0c5
      Ahmed S. Darwish authored
      Fix comments to represent the true number of quadwords in GDT.
      Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • [PATCH] i386: vmi_pmd_clear() static · 8eb68fae
      Adrian Bunk authored
      This patch makes the needlessly global vmi_pmd_clear() static.
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Acked-by: Zachary Amsden <zach@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • [PATCH] x86-64: make simnow_init() static · 786142fa
      Adrian Bunk authored
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andi Kleen <ak@suse.de>
    • [PATCH] x86-64: remove extra smp_processor_id calling · f0e13ae7
      Yinghai Lu authored
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andi Kleen <ak@suse.de>
    • [PATCH] x86-64: fix ia32_binfmt.c build error · 9f7290ed
      Ralf Baechle authored
      Reorder code to avoid multiple inclusion of elf.h.
      
      #undef several symbols to avoid build errors over redefinitions.
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • [PATCH] x86: Log reason why TSC was marked unstable · 5a90cf20
      john stultz authored
      Change mark_tsc_unstable() so it takes a string argument, which holds the
      reason the TSC was marked unstable.
      
      This is then displayed the first time mark_tsc_unstable is called.
      
      This should help us better debug why the TSC was marked unstable on certain
      systems and allow us to make sure we're not being overly paranoid when
      throwing out this troublesome clocksource.
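
      The behaviour described above, where the reason is reported only on the
      first call, can be sketched like this. It is a user-space illustration
      with hypothetical _sketch names and an invented message format, using
      printf() in place of the kernel's logging.

```c
#include <stdio.h>

static int tsc_unstable_sketch;			/* set once the TSC is distrusted */
static const char *tsc_unstable_reason_sketch;	/* reason given by the first caller */

/* Mark the TSC unstable; only the first call records and prints a reason. */
static void mark_tsc_unstable_sketch(const char *reason)
{
	if (tsc_unstable_sketch)
		return;

	tsc_unstable_sketch = 1;
	tsc_unstable_reason_sketch = reason;
	printf("Marking TSC unstable due to: %s\n", reason);
}
```

      Later callers with a different reason are silently ignored, so the log
      shows what first caused the clocksource to be thrown out.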
      
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andi Kleen <ak@suse.de>
    • [PATCH] i386: workaround for a -Wmissing-prototypes warning · 27142219
      Adrian Bunk authored
      Work around a warning with -Wmissing-prototypes in
      arch/i386/kernel/asm-offsets.c
      
      The warning isn't gcc's fault - asm-offsets.c is simply a special file.
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • [PATCH] i386: type cast clean up for find_next_zero_bit · e48b30c1
      Ken Chen authored
      Clean up an unneeded type cast by properly declaring the data type.
      Signed-off-by: Ken Chen <kenchen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andi Kleen <ak@suse.de>
    • [PATCH] i386: make struct vmi_ops static · 30a1528d
      Adrian Bunk authored
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Zachary Amsden <zach@vmware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • [PATCH] i386: modpost apic related warning fixes · 1833d6bc
      Vivek Goyal authored
      o Modpost generates warnings for i386 if compiled with CONFIG_RELOCATABLE=y
      
      WARNING: vmlinux - Section mismatch: reference to .init.text:find_unisys_acpi_oem_table from .text between 'acpi_madt_oem_check' (at offset 0xc0101eda) and 'enable_apic_mode'
      WARNING: vmlinux - Section mismatch: reference to .init.text:acpi_get_table_header_early from .text between 'acpi_madt_oem_check' (at offset 0xc0101ef0) and 'enable_apic_mode'
      WARNING: vmlinux - Section mismatch: reference to .init.text:parse_unisys_oem from .text between 'acpi_madt_oem_check' (at offset 0xc0101f2e) and 'enable_apic_mode'
      WARNING: vmlinux - Section mismatch: reference to .init.text:setup_unisys from .text between 'acpi_madt_oem_check' (at offset 0xc0101f37) and 'enable_apic_mode'
      WARNING: vmlinux - Section mismatch: reference to .init.text:parse_unisys_oem from .text between 'mps_oem_check' (at offset 0xc0101ec7) and 'acpi_madt_oem_check'
      WARNING: vmlinux - Section mismatch: reference to .init.text:es7000_sw_apic from .text between 'enable_apic_mode' (at offset 0xc0101f48) and 'check_apicid_present'
      
      o Some functions which are inline (acpi_madt_oem_check) are not inlined by
        compiler as these functions are accessed using function pointer. These
        functions are put in .text section and they in-turn access __init type
        functions hence modpost generates warnings.
      
      o Do not inline acpi_madt_oem_check, instead make it __init.
      Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Len Brown <lenb@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • [PATCH] x86-64: Set HASHDIST_DEFAULT to 1 for x86_64 NUMA · e073ae1b
      Ravikiran G Thirumalai authored
      Enable system hashtable memory to be distributed among nodes on x86_64 NUMA
      
      Forcing the kernel to use node interleaved vmalloc instead of bootmem for
      the system hashtable memory (alloc_large_system_hash) reduces the memory
      imbalance on node 0 by around 40MB on a 8 node x86_64 NUMA box:
      
      Before the following patch, on bootup of a 8 node box:
      
      Node 0 MemTotal:      3407488 kB
      Node 0 MemFree:       3206296 kB
      Node 0 MemUsed:        201192 kB
      Node 0 Active:           7012 kB
      Node 0 Inactive:          512 kB
      Node 0 Dirty:               0 kB
      Node 0 Writeback:           0 kB
      Node 0 FilePages:        1912 kB
      Node 0 Mapped:            420 kB
      Node 0 AnonPages:        5612 kB
      Node 0 PageTables:        468 kB
      Node 0 NFS_Unstable:        0 kB
      Node 0 Bounce:              0 kB
      Node 0 Slab:             5408 kB
      Node 0 SReclaimable:      644 kB
      Node 0 SUnreclaim:       4764 kB
      
      After the patch (or using hashdist=1 on the kernel command line):
      
      Node 0 MemTotal:      3407488 kB
      Node 0 MemFree:       3247608 kB
      Node 0 MemUsed:        159880 kB
      Node 0 Active:           3012 kB
      Node 0 Inactive:          616 kB
      Node 0 Dirty:               0 kB
      Node 0 Writeback:           0 kB
      Node 0 FilePages:        2424 kB
      Node 0 Mapped:            380 kB
      Node 0 AnonPages:        1200 kB
      Node 0 PageTables:        396 kB
      Node 0 NFS_Unstable:        0 kB
      Node 0 Bounce:              0 kB
      Node 0 Slab:             6304 kB
      Node 0 SReclaimable:     1596 kB
      Node 0 SUnreclaim:       4708 kB
      
      I guess it is a good idea to keep HASHDIST_DEFAULT "on" for x86_64 NUMA
      since x86_64 has no dearth of vmalloc space?  Or maybe enable hash
      distribution for all 64bit NUMA arches?  The following patch does it only
      for x86_64.
      
      I ran a HPC MPI benchmark -- 'Ansys wingsolid', which takes up quite a bit of
      memory and uses up tlb entries.  This was on a 4 way, 2 socket
      Tyan AMD box (non vsmp), with 8G total memory (4G pernode).
      
      The results with and without hash distribution are:
      
      1. Vanilla - runtime of 1188.000s
      2. With hashdist=1 runtime of 1154.000s
      
      Oprofile output for the duration of run is:
      
      1. Vanilla:
      CPU: AMD64 processors, speed 2411.16 MHz (estimated)
      Counted L1_AND_L2_DTLB_MISSES events (L1 and L2 DTLB misses) with a unit
      mask of 0x00 (No unit mask) count 500
      samples  %        app name                 symbol name
      163054    6.5513  libansys1.so             MultiFront::decompose(int, int,
      Elemset *, int *, int, int, int)
      162061    6.5114  libansys3.so             blockSaxpy6L_fd
      162042    6.5107  libansys3.so             blockInnerProduct6L_fd
      156286    6.2794  libansys3.so             maxb33_
      87879     3.5309  libansys1.so             elmatrixmultpcg_
      84857     3.4095  libansys4.so             saxpy_pcg
      58637     2.3560  libansys4.so             .st4560
      46612     1.8728  libansys4.so             .st4282
      43043     1.7294  vmlinux-t                copy_user_generic_string
      41326     1.6604  libansys3.so             blockSaxpyBackSolve6L_fd
      41288     1.6589  libansys3.so             blockInnerProductBackSolve6L_fd
      
      2. With hashdist=1
      CPU: AMD64 processors, speed 2411.13 MHz (estimated)
      Counted L1_AND_L2_DTLB_MISSES events (L1 and L2 DTLB misses) with a unit
      mask of 0x00 (No unit mask) count 500
      samples  %        app name                 symbol name
      162993    6.9814  libansys1.so             MultiFront::decompose(int, int,
      Elemset *, int *, int, int, int)
      160799    6.8874  libansys3.so             blockInnerProduct6L_fd
      160459    6.8729  libansys3.so             blockSaxpy6L_fd
      156018    6.6826  libansys3.so             maxb33_
      84700     3.6279  libansys4.so             saxpy_pcg
      83434     3.5737  libansys1.so             elmatrixmultpcg_
      58074     2.4875  libansys4.so             .st4560
      46000     1.9703  libansys4.so             .st4282
      41166     1.7632  libansys3.so             blockSaxpyBackSolve6L_fd
      41033     1.7575  libansys3.so             blockInnerProductBackSolve6L_fd
      35762     1.5318  libansys1.so             inner_product_sub
      35591     1.5245  libansys1.so             inner_product_sub2
      28259     1.2104  libansys4.so             addVectors
      Signed-off-by: Pravin B. Shelar <pravin.shelar@calsoftinc.com>
      Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
      Signed-off-by: Shai Fultheim <shai@scalex86.org>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Acked-by: Christoph Lameter <clameter@engr.sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • [PATCH] x86-64: Minor white space cleanup in traps.c · d039c688
      Andi Kleen authored
      Signed-off-by: Andi Kleen <ak@suse.de>
    • [PATCH] x86-64: Allow sys_uselib unconditionally · fb60b839
      Andi Kleen authored
      Previously it wasn't enabled when binfmt_aout was built as a module.
      Signed-off-by: Andi Kleen <ak@suse.de>
    • [PATCH] x86-64: Don't disable basic block reordering · 1652fcbf
      Andi Kleen authored
      When compiling with -Os (which is default) the compiler defaults to it
      anyways. And with -O2 it probably generates somewhat better (although
      also larger) code.
      Signed-off-by: Andi Kleen <ak@suse.de>
    • [PATCH] x86-64: fix x86_64-mm-sched-clock-share · 184c44d2
      Andrew Morton authored
      Fix for the following patch. Provide dummy cpufreq functions when
      CPUFREQ is not compiled in.
      
      Cc: Andi Kleen <ak@suse.de>
      Cc: Dave Jones <davej@codemonkey.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andi Kleen <ak@suse.de>
    • [PATCH] x86-64: Move cpu verification code to common file · a4831e08
      Vivek Goyal authored
      o This patch moves the code to verify long mode and SSE to a common file.
        This code is now shared by trampoline.S, wakeup.S, boot/setup.S and
        boot/compressed/head.S
      
      o So far we used to do very limited check in trampoline.S, wakeup.S and
        in 32bit entry point. Now all the entry paths are forced to do the
        exhaustive check, including SSE because verify_cpu is shared.
      
      o I am keeping this patch as the last in the x86 relocatable series because
        the previous patches have had a fair amount of testing and I don't want
        to disturb that. If this patch introduces a problem, it can at least be
        isolated easily.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
    • [PATCH] x86-64: Extend bzImage protocol for relocatable bzImage · 8035d3ea
      Vivek Goyal authored
      o Extend the bzImage protocol (same as i386) to allow bzImage loaders to
        load the protected mode kernel at non-1MB address. Now protected mode
        component is relocatable and can be loaded at non-1MB addresses.
      
      o As of today kdump uses it to run a second kernel from a reserved memory
        area.
      Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
      Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
    • [PATCH] x86-64: build-time checking · 6a50a664
      Vivek Goyal authored
      o X86_64 kernel should run from 2MB aligned address for two reasons.
      	- Performance.
      	- For relocatable kernels, page tables are updated based on difference
      	  between compile time address and load time physical address.
      	  This difference should be multiple of 2MB as kernel text and data
      	  is mapped using 2MB pages and PMD should be pointing to a 2MB
      	  aligned address. Life is simpler if both compile time and load time
      	  kernel addresses are 2MB aligned.
      
      o Flag the error at compile time if one is trying to build a kernel which
        does not meet alignment restrictions.
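
      A build-time alignment check of this kind can be sketched in plain C11
      as below. This only illustrates the idea: the patch's actual mechanism
      is not shown here, and PHYSICAL_START_SKETCH, ALIGN_2MB_SKETCH and
      is_2mb_aligned() are hypothetical names.

```c
#define ALIGN_2MB_SKETCH      0x200000ULL	/* 2MB, the size of one large page */
#define PHYSICAL_START_SKETCH 0x200000ULL	/* assumed compile-time load address */

/* Fail the build if the configured start address is not 2MB aligned. */
_Static_assert((PHYSICAL_START_SKETCH & (ALIGN_2MB_SKETCH - 1)) == 0,
	       "kernel start address must be 2MB aligned");

/* The same test as a run-time helper, e.g. for a load-time address. */
static int is_2mb_aligned(unsigned long long addr)
{
	return (addr & (ALIGN_2MB_SKETCH - 1)) == 0;
}
```

      Changing PHYSICAL_START_SKETCH to a value such as 0x100000 makes the
      compile fail, which is the point of flagging the error early.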
      Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • [PATCH] x86-64: Relocatable Kernel Support · 1ab60e0f
      Vivek Goyal authored
      This patch modifies the x86_64 kernel so that it can be loaded and run
      at any 2M aligned address, below 512G.  The technique used is to
      compile the decompressor with -fPIC and modify it so the decompressor
      is fully relocatable.  For the main kernel the page tables are
      modified so the kernel remains at the same virtual address.  In
      addition a variable phys_base is kept that holds the physical address
      the kernel is loaded at.  __pa_symbol is modified to add that when
      we take the address of a kernel symbol.
      
      When loaded with a normal bootloader the decompressor will decompress
      the kernel to 2M and it will run there.  This both ensures the
      relocation code is always working, and makes it easier to use 2M
      pages for the kernel and the cpu.
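
      The address translation described above can be sketched as follows.
      This is a simplified model under stated assumptions: the virtual base
      constant is assumed for illustration, phys_base_sketch is treated as
      the value added when resolving a symbol, and pa_symbol_sketch() is a
      hypothetical stand-in for __pa_symbol.

```c
#include <stdint.h>

/* Assumed virtual base of the kernel image mapping (illustrative value). */
#define KERNEL_MAP_BASE_SKETCH 0xffffffff80000000ULL

/* Physical relocation value recorded at boot; 0 when not relocated. */
static uint64_t phys_base_sketch;

/*
 * The kernel keeps its virtual address, so a symbol's physical address is
 * its offset within the image mapping plus the recorded physical base.
 */
static uint64_t pa_symbol_sketch(uint64_t sym_vaddr)
{
	return sym_vaddr - KERNEL_MAP_BASE_SKETCH + phys_base_sketch;
}
```

      The extra read of phys_base_sketch is the per-lookup cost that a
      relocatable kernel pays when resolving a symbol address.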
      
      AK: changed to not make RELOCATABLE default in Kconfig
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      1ab60e0f
    • Vivek Goyal's avatar
      [PATCH] x86: __pa and __pa_symbol address space separation · 0dbf7028
      Vivek Goyal authored
      Currently __pa_symbol is for use with symbols in the kernel address
      map and __pa is for use with pointers into the physical memory map.
      But the code is implemented so you can usually interchange the two.
      
      __pa, which is much more common, can be implemented much more cheaply
      if it doesn't have to worry about any other kernel address
      spaces.  This is especially true with a relocatable kernel, as
      __pa_symbol needs to perform an extra variable read to resolve
      the address.
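
      The cost difference can be seen in a small user-space model of the
      two translations. The names ex_pa, ex_pa_symbol and ex_phys_base are
      hypothetical stand-ins, and both base addresses are assumed values
      chosen for illustration; this is a sketch of the idea, not the
      kernel's macros.

      ```c
      #include <stdint.h>

      /* Assumed base addresses, for illustration only. */
      #define EX_PAGE_OFFSET     UINT64_C(0xffff810000000000) /* linear map of physical memory */
      #define EX_KERNEL_MAP_BASE UINT64_C(0xffffffff80000000) /* kernel text/data mapping */

      /* Stand-in for phys_base: 0 unless the kernel has been relocated. */
      static uint64_t ex_phys_base;

      /* Pointer into the physical memory map: one subtraction, no memory read. */
      static inline uint64_t ex_pa(uint64_t vaddr)
      {
          return vaddr - EX_PAGE_OFFSET;
      }

      /* Kernel symbol: must additionally read ex_phys_base. */
      static inline uint64_t ex_pa_symbol(uint64_t vaddr)
      {
          return vaddr - EX_KERNEL_MAP_BASE + ex_phys_base;
      }
      ```

      Passing a kernel symbol address to ex_pa, or a linear-map pointer to
      ex_pa_symbol, yields a meaningless result in this model, which is why
      each call site has to pick the right one.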
      
      There is a third macro that is added for the vsyscall data
      __pa_vsymbol for finding the physical addresses of vsyscall pages.
      
      Most of this patch is simply sorting through the references to
      __pa or __pa_symbol and using the proper one.  A little of
      it is continuing to use a physical address when we have it
      instead of recalculating it several times.
      
      swapper_pgd is now NULL.  leave_mm now uses init_mm.pgd
      and init_mm.pgd is initialized at boot (instead of compile time)
      to the physmem virtual mapping of init_level4_pgd.  The
      physical address changed.
      
      Except for the EMPTY_ZERO page, all of the remaining references
      to __pa_symbol appear to be during kernel initialization.  So this
      should reduce the cost of __pa in the common case, even on a relocated
      kernel.
      
      As this is technically a semantic change we need to be on the lookout
      for anything I missed.  But it works for me (tm).
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      0dbf7028
    • Vivek Goyal's avatar
      [PATCH] x86-64: do not use virt_to_page on kernel data address · 1b29c164
      Vivek Goyal authored
      o virt_to_page() call should be used on kernel linear addresses and not
        on kernel text and data addresses. Swsusp code uses it on kernel data
        (statically allocated swsusp_header).
      
      o Allocate swsusp_header dynamically so that virt_to_page() can be used
        safely.
      
      o I am changing this because in the next few patches, __pa() on x86_64
        will no longer support kernel text and data addresses, and hibernation
        would otherwise break.
      Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      1b29c164
    • Vivek Goyal's avatar
      [PATCH] x86: Move swsusp __pa() dependent code to arch portion · 49c3df6a
      Vivek Goyal authored
      o __pa() should be used only on kernel linearly mapped virtual addresses
        and not on kernel text and data addresses.
      
      o Hibernation code needs to determine the physical address associated
        with a kernel symbol to mark a section boundary which contains pages
        that don't have to be saved and restored during hibernate/resume.

      o Move this piece of code into the arch-dependent section, so that
        architectures which don't have kernel text/data mapped into the kernel
        linearly mapped region can come up with their own ways of determining
        physical addresses associated with kernel text.
      Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      49c3df6a
    • Vivek Goyal's avatar
      [PATCH] x86-64: Remove the identity mapping as early as possible · cfd243d4
      Vivek Goyal authored
      With the rewrite of the SMP trampoline and the early page
      allocator, nothing needs identity-mapped pages once we start
      executing C code.

      So add zap_identity_mappings into head64.c and remove
      zap_low_mappings() from much later in the code.  The functions
      are subtly different, hence the name change.
      
      This also kills boot_level4_pgt which was from an earlier
      attempt to move the identity mappings as early as possible,
      and is now no longer needed.  Essentially I have replaced
      boot_level4_pgt with trampoline_level4_pgt in trampoline.S
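
      Removing an identity mapping amounts to clearing the top-level
      page-table entry that covers the low addresses. The following
      user-space model sketches that idea with hypothetical names; it
      assumes 4-level paging with 512 entries per table and is not the code
      added to head64.c.

      ```c
      #include <stddef.h>
      #include <stdint.h>

      #define EX_PTRS_PER_PGD 512 /* entries in the top-level table */
      #define EX_PGDIR_SHIFT  39  /* each top-level entry covers 512GB */

      /* Index of the top-level entry that maps vaddr. */
      static inline size_t ex_pgd_index(uint64_t vaddr)
      {
          return (size_t)((vaddr >> EX_PGDIR_SHIFT) & (EX_PTRS_PER_PGD - 1));
      }

      /* Drop the low identity mapping by clearing the entry covering address 0.
       * A real kernel would also have to flush the TLB afterwards. */
      static inline void ex_zap_identity(uint64_t pgd[EX_PTRS_PER_PGD])
      {
          pgd[ex_pgd_index(0)] = 0;
      }
      ```

      Only the entry for the low addresses is touched; entries covering
      the kernel's own high mappings are left as they were.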
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      cfd243d4