1. 13 Feb, 2010 1 commit
    • x86-64, rwsem: Avoid store forwarding hazard in __downgrade_write · 0d1622d7
      Avi Kivity authored
      The Intel Architecture Optimization Reference Manual states that a short
      load that follows a long store to the same object will suffer a store
      forwarding penalty, particularly if the two accesses use different addresses.
      Trivially, a long load that follows a short store will also suffer a penalty.
      
      __downgrade_write() in rwsem incurs both penalties:  the increment operation
      will not be able to reuse a recently-loaded rwsem value, and its result will
      not be reused by any closely following rwsem operation.
      
      A comment in the code states that this is because 64-bit immediates are
      special and expensive; but while they are slightly special (only a single
      instruction allows them), they aren't expensive: a test shows that two loops,
      one loading a 32-bit immediate and one loading a 64-bit immediate, both take
      1.5 cycles per iteration.
      
      Fix this by changing __downgrade_write() to use the same add instruction on
      both i386 and x86_64, so that it uses the same operand size as all the other
      rwsem functions.  (A compilable sketch of the pattern follows this entry.)
      Signed-off-by: Avi Kivity <avi@redhat.com>
      LKML-Reference: <1266049992-17419-1-git-send-email-avi@redhat.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      0d1622d7
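      A minimal userspace sketch (x86-64, GCC inline asm) of the pattern the fix
      arrives at: the locked add in __downgrade_write operates on the full-width
      'long' count, so its store matches the width of every other rwsem access.
      The structure, bias value and wake-up handling are simplified stand-ins,
      not the kernel's rwsem.h code.

       struct rwsem { long count; };

       /* illustrative 64-bit waiting bias; the real header defines its own */
       #define WAITING_BIAS (-0x0000000100000000L)

       static inline void downgrade_write(struct rwsem *sem)
       {
       	/* "er": a sign-extended 32-bit immediate if it fits, else a register;
       	   either way the add itself is full-width, like sem->count */
       	asm volatile("lock; add %1,%0"
       		     : "+m" (sem->count)
       		     : "er" (-WAITING_BIAS)
       		     : "memory", "cc");
       }

       int main(void)
       {
       	struct rwsem sem = { .count = WAITING_BIAS + 1 };	/* writer holds it */
       	downgrade_write(&sem);
       	return sem.count == 1 ? 0 : 1;	/* now encoded as one active reader */
       }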
  2. 18 Jan, 2010 2 commits
    • x86-64, rwsem: 64-bit xadd rwsem implementation · 1838ef1d
      H. Peter Anvin authored
      For x86-64, 32767 threads really is not enough.  Change rwsem_count_t
      to a signed long, so that it is 64 bits on x86-64.
      
      This required the following changes to the assembly code:
      
      a) %z0 doesn't work on all versions of gcc!  At least gcc 4.4.2 as
         shipped with Fedora 12 emits "ll" not "q" for 64 bits, even for
         integer operands.  Newer gccs apparently do this correctly, but
         avoid this problem by using the _ASM_ macros instead of %z.
      b) 64-bit immediates are only allowed in "movq $imm,%reg" constructs,
         and nowhere else.  Change some of the constraints to "e", and fix
         the one case where we would have had to use an invalid immediate --
         in that case, we only care about the upper half anyway, so just
         access the upper half.  (A sketch of the "e"-constraint and
         size-macro pattern follows this entry.)
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <tip-bafaecd1@git.kernel.org>
      1838ef1d
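      A minimal sketch (x86-64, GCC) of the two assembler-level points above:
      an explicit size-suffix macro in place of %z (which the gcc named above
      expands to "ll" for 64-bit operands), and the "e" constraint, which
      limits an immediate to the sign-extended 32-bit form that "add" actually
      accepts on x86-64.  The _ASM_ADD name mirrors the kernel macro but is
      defined locally here; everything else is illustrative.

       #ifdef __x86_64__
       # define _ASM_ADD "addq "
       #else
       # define _ASM_ADD "addl "
       #endif

       static inline void count_add(long delta, long *count)
       {
       	/* "er": an immediate only when it fits in 32 bits sign-extended,
       	   otherwise fall back to a register */
       	asm volatile(_ASM_ADD "%1,%0"
       		     : "+m" (*count)
       		     : "er" (delta)
       		     : "memory");
       }

       int main(void)
       {
       	long count = 0;
       	count_add(0x100000000L, &count);	/* too wide for an add immediate */
       	count_add(1, &count);			/* fits; gcc may emit $1 directly */
       	return count == 0x100000001L ? 0 : 1;
       }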
    • x86: Fix breakage of UML from the changes in the rwsem system · 4126faf0
      Linus Torvalds authored
      The patches 5d0b7235 and
      bafaecd1 broke the UML build:
      
      On Sun, 17 Jan 2010, Ingo Molnar wrote:
      >
      > FYI, -tip testing found that these changes break the UML build:
      >
      > kernel/built-in.o: In function `__up_read':
      > /home/mingo/tip/arch/x86/include/asm/rwsem.h:192: undefined reference to `call_rwsem_wake'
      > kernel/built-in.o: In function `__up_write':
      > /home/mingo/tip/arch/x86/include/asm/rwsem.h:210: undefined reference to `call_rwsem_wake'
      > kernel/built-in.o: In function `__downgrade_write':
      > /home/mingo/tip/arch/x86/include/asm/rwsem.h:228: undefined reference to `call_rwsem_downgrade_wake'
      > kernel/built-in.o: In function `__down_read':
      > /home/mingo/tip/arch/x86/include/asm/rwsem.h:112: undefined reference to `call_rwsem_down_read_failed'
      > kernel/built-in.o: In function `__down_write_nested':
      > /home/mingo/tip/arch/x86/include/asm/rwsem.h:154: undefined reference to `call_rwsem_down_write_failed'
      > collect2: ld returned 1 exit status
      
      Add lib/rwsem_64.o to the UML subarch objects to fix this.
      
      LKML-Reference: <alpine.LFD.2.00.1001171023440.13231@localhost.localdomain>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      4126faf0
  3. 14 Jan, 2010 2 commits
    • x86-64: support native xadd rwsem implementation · bafaecd1
      Linus Torvalds authored
      This one is much faster than the spinlock-based fallback rwsem code,
      with certain artificial benchmarks having shown 300%+ improvement on
      threaded page faults etc.  (A sketch of the xadd fast path follows
      this entry.)
      
      Again, note the 32767-thread limit here. So this really does need that
      whole "make rwsem_count_t be 64-bit and fix the BIAS values to match"
      extension on top of it, but that is conceptually a totally independent
      issue.
      
      NOT TESTED!  The original patch that this all was based on was tested by
      KAMEZAWA Hiroyuki, but maybe I screwed up something when I created the
      cleaned-up series, so caveat emptor.
      
      Also note that it _may_ be a good idea to mark some more registers
      clobbered on x86-64 in the inline asms instead of saving/restoring them.
      They are inline functions, but they are only used in places where there
      are not a lot of live registers _anyway_, so clobbering %r8-%r11 in the
      asm, for example, wouldn't make the fast-path code any worse, and would
      make the slow-path code smaller.
      
      (Not that the slow path really matters to that degree.  Saving a few
      unnecessary registers is the _least_ of our problems when we hit the slow
      path.  The instruction/cycle counting really only matters in the fast
      path.)
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <alpine.LFD.2.00.1001121810410.17145@localhost.localdomain>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      bafaecd1
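      A simplified userspace sketch (x86-64, GCC) of the xadd fast path this
      adds: atomically adjust the counter and fall into the slow path only when
      the resulting value shows a writer or queued waiters.  The count encoding
      is reduced here to "positive = readers only", and the slow path is a stub
      where the kernel would call call_rwsem_down_read_failed.

       struct rwsem { long count; };

       static void down_read_slowpath(struct rwsem *sem)
       {
       	(void)sem;			/* kernel: call_rwsem_down_read_failed */
       }

       static inline void down_read(struct rwsem *sem)
       {
       	long old = 1;			/* reader bias to add */
       	asm volatile("lock; xadd %0,%1"	/* the old count comes back in %0 */
       		     : "+r" (old), "+m" (sem->count)
       		     : : "memory", "cc");
       	if (old + 1 < 0)		/* new count negative: contended */
       		down_read_slowpath(sem);
       }

       int main(void)
       {
       	struct rwsem sem = { .count = 0 };
       	down_read(&sem);		/* uncontended: stays on the fast path */
       	return sem.count == 1 ? 0 : 1;
       }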
    • x86: clean up rwsem type system · 5d0b7235
      Linus Torvalds authored
      The fast version of the rwsems (the code that uses xadd) has
      traditionally only worked on x86-32, and as a result it mixes different
      kinds of types wildly - they just all happen to be 32-bit.  We have
      "long", we have "__s32", and we have "int".
      
      To make it work on x86-64, the types suddenly matter a lot more.  It can
      be either a 32-bit or 64-bit signed type, and both work (with the caveat
      that a 32-bit counter only has 15 effective bits of write counter, so it
      is limited to 32767 users).  But whatever type you choose, it needs to be
      used consistently.
      
      This makes a new 'rwsem_count_t', which is a 32-bit signed type.  For a
      64-bit type, you'd need to also update the BIAS values.  (A sketch of the
      type and its BIAS encoding follows this entry.)
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <alpine.LFD.2.00.1001121755220.17145@localhost.localdomain>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      5d0b7235
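      A sketch of the consistent type plus its 32-bit bias encoding.  The
      values follow the well-known 32-bit rwsem encoding and are shown for
      illustration; the 15 usable write-counter bits are what produce the
      32767-user limit mentioned above, and the 64-bit follow-up widens the
      type and the biases together.

       typedef signed int rwsem_count_t;	/* one type, used everywhere */

       #define RWSEM_UNLOCKED_VALUE	0x00000000
       #define RWSEM_ACTIVE_BIAS	0x00000001	/* per active reader/writer */
       #define RWSEM_ACTIVE_MASK	0x0000ffff
       #define RWSEM_WAITING_BIAS	(-0x00010000)	/* anyone queued */
       #define RWSEM_ACTIVE_WRITE_BIAS	(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)

       int main(void)
       {
       	rwsem_count_t count = RWSEM_UNLOCKED_VALUE;
       	count += RWSEM_ACTIVE_WRITE_BIAS;	/* a writer takes the lock */
       	return (count & RWSEM_ACTIVE_MASK) == 1 ? 0 : 1;
       }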
  4. 13 Jan, 2010 3 commits
    • x86: Merge show_regs() · 3bef4447
      Brian Gerst authored
      Using kernel_stack_pointer() allows 32-bit and 64-bit versions to
      be merged.  This is more correct for 64-bit, since the old %rsp is
      always saved on the stack.
      Signed-off-by: Brian Gerst <brgerst@gmail.com>
      LKML-Reference: <1263397555-27695-1-git-send-email-brgerst@gmail.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      3bef4447
    • x86: Macroise x86 cache descriptors · 2ca49b2f
      Dave Jones authored
      Use a macro to define the cache sizes when the cache size is > 1 MB.
      
      This is less typing, and less prone to introducing bugs like the one we
      saw in e02e0e1a, and means we
      don't have to do the maths when adding new non-power-of-2 sizes
      like those seen recently.  (An illustrative sketch follows this entry.)
      Signed-off-by: Dave Jones <davej@redhat.com>
      LKML-Reference: <20100104144735.GA18390@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      2ca49b2f
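      A small illustrative sketch of the macro approach.  The macro name, the
      struct layout and the table entries below are made up for illustration,
      not copied from the kernel's cache-descriptor table; the point is that
      sizes stay expressed in KB, and anything above 1 MB is written as MB(n)
      rather than as a hand-multiplied constant.

       #define MB(x)	((x) * 1024)		/* table entries are in KB */

       struct cache_entry {
       	unsigned char descriptor;		/* CPUID leaf 2 descriptor byte */
       	unsigned short size_kb;
       };

       static const struct cache_entry cache_table[] = {
       	{ 0x7d, MB(2) },	/* 2 MB, no mental arithmetic required */
       	{ 0xd6, MB(1) },	/* 1 MB */
       	{ 0xde, MB(6) },	/* non-power-of-2 sizes stay readable too */
       };

       int main(void)
       {
       	return cache_table[2].size_kb == 6144 ? 0 : 1;
       }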
    • x86-32: clean up rwsem inline asm statements · 59c33fa7
      Linus Torvalds authored
      This makes gcc use the right register names and instruction operand sizes
      automatically for the rwsem inline asm statements.
      
      So instead of using "(%%eax)" to specify the memory address that is the
      semaphore, we use "(%1)" or similar.  And instead of forcing the operation
      to always be 32-bit, we use "%z0", taking the size from the actual
      semaphore data structure itself.  (A compilable illustration of both
      follows this entry.)
      
      This doesn't actually matter on x86-32, but if we want to use the same
      inline asm for x86-64, we'll need to have the compiler generate the proper
      64-bit names for the registers (%rax instead of %eax), and if we want to
      use a 64-bit counter too (in order to avoid the 15-bit limit on the
      write counter that limits concurrent users to 32767 threads), we'll need
      to be able to generate instructions with "q" accesses rather than "l".
      
      Since this header currently isn't enabled on x86-64, none of that matters,
      but we do want to use the xadd version of the semaphores rather than have
      to take spinlocks to do a rwsem. The mm->mmap_sem can be heavily contended
      when you have lots of threads all taking page faults, and the fallback
      rwsem code that uses a spinlock performs abysmally badly in that case.
      
      [ hpa: modified the patch to skip size suffixes entirely when they are
        redundant due to register operands. ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <alpine.LFD.2.00.1001121613560.17145@localhost.localdomain>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      59c33fa7
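      A small compilable illustration (GCC inline asm) of both changes: the
      semaphore address comes from whatever register the compiler picked for
      operand 1, and %z0 derives the size suffix from the C type of operand 0.
      The structure is a stand-in for struct rw_semaphore, not the kernel type.

       struct sem { long count; };	/* 32-bit on x86-32, 64-bit on x86-64 */

       static inline void sem_inc(struct sem *sem)
       {
       	/* emits "incl (%reg)" or "incq (%reg)" depending on sizeof(count) */
       	asm volatile("lock; inc%z0 (%1)"
       		     : "+m" (sem->count)
       		     : "r" (sem)
       		     : "memory", "cc");
       }

       int main(void)
       {
       	struct sem s = { 0 };
       	sem_inc(&s);
       	return s.count == 1 ? 0 : 1;
       }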
  5. 07 Jan, 2010 3 commits
  6. 30 Dec, 2009 3 commits
    • x86-64: Modify memcpy()/memset() alternatives mechanism · 7269e881
      Jan Beulich authored
      In order to avoid unnecessary chains of branches, rather than
      implementing memcpy()/memset()'s access to their alternative
      implementations via a jump, patch the (larger) original function
      directly.
      
      The memcpy() part of this is slightly subtle: while alternative
      instruction patching does itself use memcpy(), the replacement block is
      less than 64 bytes in size, so the main loop of the original function
      doesn't get used for copying memcpy_c() over memcpy(), and hence we can
      safely write over its beginning.
      
      Also note that the CFI annotations are fine for both variants of
      each of the functions.
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <4B2BB8D30200007800026AF2@vpn.id2.novell.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      7269e881
    • x86-64: Modify copy_user_generic() alternatives mechanism · 1b1d9258
      Jan Beulich authored
      In order to avoid unnecessary chains of branches, rather than
      implementing copy_user_generic() as a function consisting of
      just a single (possibly patched) branch, instead properly deal
      with patching call instructions in the alternative instructions
      framework, and move the patching into the callers.
      
      As a follow-on, one could also introduce something like
      __EXPORT_SYMBOL_ALT() to avoid patching call sites in modules.
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <4B2BB8180200007800026AE7@vpn.id2.novell.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      1b1d9258
    • x86: Lift restriction on the location of FIX_BTMAP_* · 499a5f1e
      Jan Beulich authored
      The early ioremap fixmap entries cover half (or, for 32-bit
      non-PAE, a quarter) of a page table, yet so far they were
      unconditionally aligned to a 256-entry boundary.  This is
      not necessary if the range of page table entries falls
      into a single page table anyway.  (A worked check of that
      condition follows this entry.)
      
      This buys back, for (theoretically) 50% of all configurations
      (25% of all non-PAE ones), at least some of the lowmem
      necessarily lost with commit e621bd18.
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <4B2BB66F0200007800026AD6@vpn.id2.novell.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      499a5f1e
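      A worked check of the condition the commit relies on, as standalone C.
      The constants are assumptions for illustration: 256 early-ioremap fixmap
      slots in total and 512 PTEs per page table (64-bit or 32-bit PAE; a
      32-bit non-PAE table has 1024 entries, hence the "quarter" above).
      Alignment is only needed when the block would otherwise straddle two
      page tables.

       #define PTRS_PER_PTE		512	/* 1024 on 32-bit non-PAE */
       #define TOTAL_FIX_BTMAPS	256	/* early ioremap fixmap entries */

       /* do fixmap indices [start, start + TOTAL_FIX_BTMAPS) share one page table? */
       static int fits_in_one_pte_page(unsigned int start)
       {
       	unsigned int end = start + TOTAL_FIX_BTMAPS - 1;
       	return (start / PTRS_PER_PTE) == (end / PTRS_PER_PTE);
       }

       int main(void)
       {
       	/* index 100 keeps the block inside table 0; index 400 would straddle
       	   tables 0 and 1 and would still need realigning */
       	return (fits_in_one_pte_page(100) && !fits_in_one_pte_page(400)) ? 0 : 1;
       }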
  7. 28 Dec, 2009 1 commit
    • x86, core: Optimize hweight32() · 39d997b5
      Akinobu Mita authored
      Optimize hweight32 by using the same technique as in hweight64 (a sketch
      of the technique follows this entry).
      
      The proof of this technique can be found in the commit log for
      f9b41929 ("bitops: hweight()
      speedup").
      
      The userspace benchmark below, run on x86_32, showed a 20% speedup with
      bitmap_weight(), which uses hweight32 to count the bits in each
      unsigned long on 32-bit architectures.
      
       int main(void)
       {
      	#define SZ (1024 * 1024 * 512)
      
      	static DECLARE_BITMAP(bitmap, SZ) = {
      	        [0 ... 100] = 1,
      	};
      
      	return bitmap_weight(bitmap, SZ);
       }
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <1258603932-4590-1-git-send-email-akinobu.mita@gmail.com>
      [ only x86 sets ARCH_HAS_FAST_MULTIPLIER so we do this via the x86 tree ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      39d997b5
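      A sketch of the technique itself, generic C rather than the kernel
      source: classic parallel bit-summing followed by a single multiply that
      gathers the per-byte counts into the top byte, which is the step that
      ARCH_HAS_FAST_MULTIPLIER makes worthwhile.

       static unsigned int hweight32(unsigned int w)
       {
       	w -= (w >> 1) & 0x55555555;				/* 2-bit sums */
       	w  = (w & 0x33333333) + ((w >> 2) & 0x33333333);	/* 4-bit sums */
       	w  = (w + (w >> 4)) & 0x0f0f0f0f;			/* per-byte sums */
       	return (w * 0x01010101) >> 24;				/* add all bytes at once */
       }

       int main(void)
       {
       	return hweight32(0xf0f0000fu) == 12 ? 0 : 1;
       }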
  8. 24 Dec, 2009 25 commits