1. 08 Sep, 2008 1 commit
    • powerpc: Fix rare boot build breakage · 4ff23fa9
      Hugh Dickins authored
      A make -j20 powerpc kernel build broke a couple of months ago saying:
      In file included from arch/powerpc/boot/gunzip_util.h:13,
                       from arch/powerpc/boot/prpmc2800.c:21:
      arch/powerpc/boot/zlib.h:85: error: expected ‘:’, ‘,’, ‘;’, ‘}’ or ‘__attribute__’ before ‘*’ token
      arch/powerpc/boot/zlib.h:630: warning: type defaults to ‘int’ in declaration of ‘Byte’
      arch/powerpc/boot/zlib.h:630: error: expected ‘;’, ‘,’ or ‘)’ before ‘*’ token
      
      It happened again yesterday: too rare for me to confirm the fix, but
      it looks like the list of dependants on gunzip_util.h was incomplete.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  2. 07 Sep, 2008 1 commit
  3. 05 Sep, 2008 2 commits
    • powerpc/spufs: Fix race for a free SPU · b65fe035
      Jeremy Kerr authored
      We currently have a race for a free SPE, with one thread doing a
      spu_yield() and another doing a spu_activate():
      
      thread 1				thread 2
      spu_yield(oldctx)			spu_activate(ctx)
        __spu_deactivate(oldctx)
        spu_unschedule(oldctx, spu)
        spu->alloc_state = SPU_FREE
      					spu = spu_get_idle(ctx)
      					    - searches for a SPE in
      					      state SPU_FREE, gets
      					      the context just
      					      freed by thread 1
      					spu_schedule(ctx, spu)
      					  spu->alloc_state = SPU_USED
      spu_schedule(newctx, spu)
        - assumes spu is still free
        - tries to schedule context on
          already-used spu
      
      This change introduces a 'free_spu' flag to spu_unschedule, to indicate
      whether or not the function should free the spu after descheduling the
      context. We only set this flag if we're not going to re-schedule
      another context on this SPU.
      
      Add a comment to document this behaviour.
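
      The effect of the flag can be illustrated with a toy model. This is only
      a sketch under stated assumptions: the struct layout and the toy_*
      functions are hypothetical stand-ins, not the spufs code, and all
      locking is omitted.

```c
#include <stddef.h>

enum spu_alloc_state { SPU_FREE, SPU_USED };

struct spu {
	enum spu_alloc_state alloc_state;
	int ctx_id;	/* hypothetical: id of the bound context, -1 if none */
};

/*
 * Toy counterpart of spu_unschedule(): always unbind the context, but
 * only release the SPU when the caller asks for it.
 */
static void toy_unschedule(struct spu *spu, int free_spu)
{
	spu->ctx_id = -1;
	if (free_spu)
		spu->alloc_state = SPU_FREE;
}

/*
 * Toy counterpart of spu_get_idle(): claim the first SPU that is in
 * state SPU_FREE, or return NULL if there is none.
 */
static struct spu *toy_get_idle(struct spu *spus, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (spus[i].alloc_state == SPU_FREE) {
			spus[i].alloc_state = SPU_USED;
			return &spus[i];
		}
	}
	return NULL;
}
```

      With free_spu == 0 the SPU never becomes visible as SPU_FREE, so a
      concurrent search cannot claim it before the yielding thread schedules
      its new context.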
      Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
    • powerpc/spufs: Fix multiple get_spu_context() · 9f43e391
      Jeremy Kerr authored
      Commit 8d5636fb introduced a reference
      count on SPU contexts during find_victim, but this may cause a leak in
      the reference count if we later find a better contender for a context to
      unschedule.
      
      Take the reference only after we've found our victim context, so we
      don't do the extra get_spu_context().
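
      A minimal sketch of the pattern, assuming a hypothetical toy_ctx type
      with a plain integer reference count (the real code uses
      get_spu_context() and proper locking): candidates are compared without
      touching their reference counts, and a single reference is taken on the
      final choice.

```c
#include <stddef.h>

struct toy_ctx {
	int prio;	/* in this sketch, a larger value means lower priority */
	int refcount;
};

/*
 * Pick the lowest-priority context that is worse than 'prio'.
 * No reference is taken while scanning, so a candidate that is later
 * replaced by a better one cannot leak a reference.
 */
static struct toy_ctx *toy_find_victim(struct toy_ctx *ctxs, size_t n, int prio)
{
	struct toy_ctx *victim = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (ctxs[i].prio > prio &&
		    (victim == NULL || ctxs[i].prio > victim->prio))
			victim = &ctxs[i];
	}

	if (victim)
		victim->refcount++;	/* one reference, taken after the choice */

	return victim;
}
```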
      Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
  4. 03 Sep, 2008 36 commits
    • powerpc: Fix for getting CPU number in power_save_ppc32_restore() · 7888bc2b
      Kumar Gala authored
      The calculation to get TI_CPU based off of SPRG3 was just plain wrong,
      meaning that we were getting garbage for the CPU number on 6xx/G3/G4
      based SMP boxes in this code.
      
      Just offset off the stack pointer (to get to thread_info) like all the
      other references to TI_CPU do.
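
      In C terms, "offsetting off the stack pointer" amounts to masking the
      low bits of the stack pointer to reach the base of the kernel stack,
      where thread_info lives. The sketch below is an illustration only: the
      8 KiB stack size and the toy_* names are assumptions, and the real code
      is assembly.

```c
/* Assumed stack size for this sketch: 1 << 13 = 8 KiB. */
#define TOY_THREAD_SHIFT 13
#define TOY_THREAD_SIZE  (1UL << TOY_THREAD_SHIFT)

/*
 * Clear the low TOY_THREAD_SHIFT bits of a stack pointer to get the
 * address of the stack base, i.e. where thread_info would be found.
 */
static unsigned long toy_thread_info_base(unsigned long sp)
{
	return sp & ~(TOY_THREAD_SIZE - 1);
}
```

      The CPU number would then be read at offset TI_CPU from that base.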
      
      This was pointed out by Chen Gong <G.Chen@freescale.com>
      
      [paulus@samba.org - use rlwinm r12,r11,... instead of
       rlwinm r12,r1,...; tophys()]
      Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
    • powerpc: Fix build error with 64K pages and !hugetlbfs · 94ee815c
      Benjamin Herrenschmidt authored
      HAVE_ARCH_UNMAPPED_AREA and HAVE_ARCH_UNMAPPED_AREA_TOPDOWN must
      be defined whenever CONFIG_PPC_MM_SLICES is enabled, not just when
      CONFIG_HUGETLB_PAGE is.  They used to be always defined together but
      this is no longer the case since 3a8247cc
      ("powerpc: Only demote individual slices rather than whole process").
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
    • powerpc: Work around gcc's -fno-omit-frame-pointer bug · 7563dc64
      Tony Breeds authored
      This bug is causing random crashes
      (http://bugzilla.kernel.org/show_bug.cgi?id=11414).
      
      -fno-omit-frame-pointer is only needed on powerpc when -pg is also
      supplied, and there is a gcc bug that causes incorrect code generation
      on 32-bit powerpc when -fno-omit-frame-pointer is used---it uses stack
      locations below the stack pointer, which is not allowed by the ABI
      because those locations can and sometimes do get corrupted by an
      interrupt.
      
      This ensures that CONFIG_FRAME_POINTER is only selected by ftrace.
      When CONFIG_FTRACE is enabled we also pass -mno-sched-epilog to work
      around the gcc codegen bug.
      
      Patch based on work by:
      	Andreas Schwab <schwab@suse.de>
      	Segher Boessenkool <segher@kernel.crashing.org>
      Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
    • powerpc: Make sure _etext is after all kernel text · 303996da
      Stephen Rothwell authored
      This makes core_kernel_text() (and therefore kernel_text_address())
      return the correct result.  Currently all the __devinit routines (at
      least) will not be considered to be kernel text.
      
      This is just a quick fix for 2.6.27 - hopefully we will be able to fix
      this better in 2.6.28.
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
    • powerpc: Only make kernel text pages of linear mapping executable · 9e88ba4e
      Paul Mackerras authored
      Commit bc033b63 ("powerpc/mm: Fix
      attribute confusion with htab_bolt_mapping()") moved the check for
      whether we should make pages of the linear mapping executable from
      htab_bolt_mapping into its callers, including htab_initialize.
      A side-effect of this is that the decision is now made once for
      each contiguous section in the LMB array rather than for each page
      individually.  This can often mean that the whole of the linear
      mapping ends up being executable.
      
      This reverts to the previous behaviour, where individual pages are
      checked for being part of the kernel text or not, by moving the check
      back down into htab_bolt_mapping.
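
      The difference between a per-region and a per-page decision can be
      sketched as follows. The toy_* names, the range type and the addresses
      used are hypothetical and only illustrate the idea; they are not the
      htab code.

```c
/* Half-open address range [start, end). */
struct toy_range {
	unsigned long start;
	unsigned long end;
};

/* Does [start, end) overlap the kernel text range? */
static int toy_overlaps(unsigned long start, unsigned long end,
			struct toy_range text)
{
	return start < text.end && text.start < end;
}

/*
 * Count the pages of [start, end) that would be mapped executable when
 * the text check is made for each page individually.
 */
static unsigned long toy_exec_pages(unsigned long start, unsigned long end,
				    unsigned long page_size,
				    struct toy_range text)
{
	unsigned long addr, n = 0;

	for (addr = start; addr < end; addr += page_size)
		if (toy_overlaps(addr, addr + page_size, text))
			n++;

	return n;
}
```

      Checking a whole region once with toy_overlaps() answers "yes" as soon
      as any part of it is text, which is how an entire region can end up
      executable; the per-page loop marks only the pages that actually hold
      text.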
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
    • powerpc: Fix uninitialised variable in VSX alignment code · 78fbc824
      Michael Neuling authored
      This fixes an uninitialised variable in the VSX alignment code.  It can
      cause warnings from GCC (noticed with gcc-4.1.1).  Gcc is actually
      correct in this instance, and this bug could cause the alignment
      interrupt handler to send a SIGSEGV to the process on a legitimate
      access.
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 · d26acd92
      Linus Torvalds authored
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
        ipsec: Fix deadlock in xfrm_state management.
        ipv: Re-enable IP when MTU > 68
        net/xfrm: Use an IS_ERR test rather than a NULL test
        ath9: Fix ath_rx_flush_tid() for IRQs disabled kernel warning message.
        ath9k: Incorrect key used when group and pairwise ciphers are different.
        rt2x00: Compiler warning unmasked by fix of BUILD_BUG_ON
        mac80211: Fix debugfs union misuse and pointer corruption
        wireless/libertas/if_cs.c: fix memory leaks
        orinoco: Multicast to the specified addresses
        iwlwifi: fix 64bit platform firmware loading
        iwlwifi: fix apm_stop (wrong bit polarity for FLAG_INIT_DONE)
        iwlwifi: workaround interrupt handling no some platforms
        iwlwifi: do not use GFP_DMA in iwl_tx_queue_init
        net/wireless/Kconfig: clarify the description for CONFIG_WIRELESS_EXT_SYSFS
        net: Unbreak userspace usage of linux/mroute.h
        pkt_sched: Fix locking of qdisc_root with qdisc_root_sleeping_lock()
        ipv6: When we droped a packet, we should return NET_RX_DROP instead of 0
    • [x86] Fix TSC calibration issues · fbb16e24
      Thomas Gleixner authored
      Larry Finger reported at http://lkml.org/lkml/2008/9/1/90:
      An ancient laptop of mine started throwing errors from b43legacy when
      I started using 2.6.27 on it. This has been bisected to commit bfc0f594
      "x86: merge tsc calibration".
      
      The unification of the TSC code adopted mostly the 64bit code, which
      prefers PMTIMER/HPET over the PIT calibration.
      
      Larry's system has an AMD K6 CPU. Such systems are known to have
      PMTIMER incarnations which run at double speed. This results in a
      miscalibration of the TSC by factor 0.5. So the resulting calibrated
      CPU/TSC speed is half of the real CPU speed, which means that the TSC
      based delay loop will run half the time it should run. That might
      explain why the b43legacy driver went berserk.
      
      On the other hand we know about systems, where the PIT based
      calibration results in random crap due to heavy SMI/SMM
      disturbance. On those systems the PMTIMER/HPET based calibration logic
      with SMI detection shows better results.
      
      According to Alok also virtualized systems suffer from the PIT
      calibration method.
      
      The solution is to use a more wreckage-aware approach than the current
      either/or decision.
      
      1) reimplement the retry loop which was dropped from the 32bit code
      during the merge. It repeats the calibration and selects the lowest
      frequency value as this is probably the closest estimate to the real
      frequency
      
      2) Monitor the delta of the TSC values in the delay loop which waits
      for the PIT counter to reach zero. If the maximum value is
      significantly different from the minimum, then we have a pretty safe
      indicator that the loop was disturbed by an SMI.
      
      3) keep the pmtimer/hpet reference as a backup solution for systems
      where the SMI disturbance is a permanent point of failure for PIT
      based calibration
      
      4) do the loop iteration for both methods, record the lowest value and
      decide after all iterations finished.
      
      5) Set a clear preference to PIT based calibration when the result
      makes sense.
      
      The implementation does the reference calibration based on
      HPET/PMTIMER around the delay, which is necessary for the PIT anyway,
      but keeps separate TSC values to ensure the "independency" of the
      resulting calibration values.
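
      The selection logic described in the five steps can be sketched roughly
      as below. This is a simplified model, not the kernel implementation:
      the toy_sample layout, the threshold TOY_SMI_RATIO and the function
      names are assumptions made for illustration, and "makes sense" is
      reduced to "at least one undisturbed PIT sample exists".

```c
#include <stdint.h>

/* Hypothetical threshold: max delta more than 10x min delta => disturbed. */
#define TOY_SMI_RATIO 10

struct toy_sample {
	uint64_t pit_khz;	/* TSC frequency estimate from the PIT loop */
	uint64_t ref_khz;	/* estimate from HPET/PMTIMER, 0 if unavailable */
	uint64_t delta_min;	/* smallest TSC delta seen in the PIT wait loop */
	uint64_t delta_max;	/* largest TSC delta seen in the PIT wait loop */
};

/* Step 2: a large spread of TSC deltas suggests an SMI hit the loop. */
static int toy_pit_disturbed(const struct toy_sample *s)
{
	return s->delta_max > TOY_SMI_RATIO * s->delta_min;
}

/* Steps 1, 3, 4, 5: iterate, keep the lowest values, decide at the end. */
static uint64_t toy_calibrate(const struct toy_sample *s, int n)
{
	uint64_t pit_min = UINT64_MAX, ref_min = UINT64_MAX;
	int i;

	for (i = 0; i < n; i++) {
		if (!toy_pit_disturbed(&s[i]) && s[i].pit_khz < pit_min)
			pit_min = s[i].pit_khz;
		if (s[i].ref_khz != 0 && s[i].ref_khz < ref_min)
			ref_min = s[i].ref_khz;
	}

	if (pit_min != UINT64_MAX)
		return pit_min;		/* prefer PIT when it produced a result */
	if (ref_min != UINT64_MAX)
		return ref_min;		/* fall back to the reference timer */
	return 0;			/* calibration failed */
}
```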
      
      Tested on various 32bit/64bit machines including Geode 266Mhz, AMD K6
      (affected machine with a double speed pmtimer which I grabbed out of
      the dump), Pentium class machines and AMD/Intel 64 bit boxen.
      Bisected-by: Larry Finger <Larry.Finger@lwfinger.net>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Larry Finger <Larry.Finger@lwfinger.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipsec: Fix deadlock in xfrm_state management. · 37b08e34
      David S. Miller authored
      Ever since commit 4c563f76
      ("[XFRM]: Speed up xfrm_policy and xfrm_state walking") it is
      illegal to call __xfrm_state_destroy (and thus xfrm_state_put())
      with xfrm_state_lock held.  If we do, we'll deadlock since we
      have the lock already and __xfrm_state_destroy() tries to take
      it again.
      
      Fix this by pushing the xfrm_state_put() calls after the lock
      is dropped.
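
      The shape of the fix can be modelled with a toy, non-recursive lock.
      Everything here is hypothetical (toy_lock, toy_state and the return
      codes); the sketch only shows why the final put has to happen after the
      lock is released.

```c
#include <stddef.h>

struct toy_lock {
	int held;
};

struct toy_state {
	int refcount;
	int destroyed;
};

static struct toy_lock toy_state_lock;

/*
 * Drop a reference. The destroy path needs toy_state_lock, so dropping
 * the last reference while the lock is held would self-deadlock in real
 * code; the toy reports that case as -1 instead of hanging.
 */
static int toy_state_put(struct toy_state *x)
{
	if (--x->refcount > 0)
		return 0;
	if (toy_state_lock.held)
		return -1;
	toy_state_lock.held = 1;
	x->destroyed = 1;
	toy_state_lock.held = 0;
	return 0;
}

/* Remember what to release while locked, release it after unlocking. */
static int toy_lookup_and_release(struct toy_state *x)
{
	struct toy_state *to_put;

	toy_state_lock.held = 1;
	to_put = x;		/* work done under the lock; no put here */
	toy_state_lock.held = 0;

	return to_put ? toy_state_put(to_put) : 0;
}
```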
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drivers/char/random.c: fix a race which can lead to a bogus BUG() · 8b76f46a
      Andrew Morton authored
      Fix a bug reported by and diagnosed by Aaron Straus.
      
      This is a regression introduced into 2.6.26 by
      
          commit adc782da
          Author: Matt Mackall <mpm@selenic.com>
          Date:   Tue Apr 29 01:03:07 2008 -0700
      
              random: simplify and rename credit_entropy_store
      
      credit_entropy_bits() does:
      
      	spin_lock_irqsave(&r->lock, flags);
      	...
      	if (r->entropy_count > r->poolinfo->POOLBITS)
      		r->entropy_count = r->poolinfo->POOLBITS;
      
      so there is a time window in which this BUG_ON():
      
      static size_t account(struct entropy_store *r, size_t nbytes, int min,
      		      int reserved)
      {
      	unsigned long flags;
      
      	BUG_ON(r->entropy_count > r->poolinfo->POOLBITS);
      
      	/* Hold lock while accounting */
      	spin_lock_irqsave(&r->lock, flags);
      
      can trigger.
      
      We could fix this by moving the assertion inside the lock, but it seems
      safer and saner to revert to the old behaviour wherein
      entropy_store.entropy_count at no time exceeds
      entropy_store.poolinfo->POOLBITS.
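
      One way to keep that invariant is to clamp a local copy and store the
      result once, so that no reader can observe an out-of-range value. The
      sketch below uses a hypothetical toy_pool type and leaves out the
      spinlock; it is an illustration, not the patch itself.

```c
struct toy_pool {
	int entropy_count;
	int poolbits;	/* stands in for poolinfo->POOLBITS */
};

/*
 * Credit (or debit) nbits. The new value is computed and clamped in a
 * local variable, and entropy_count is written exactly once, so it never
 * holds a value above poolbits or below zero, even transiently.
 */
static void toy_credit_bits(struct toy_pool *r, int nbits)
{
	int count = r->entropy_count + nbits;

	if (count < 0)
		count = 0;
	else if (count > r->poolbits)
		count = r->poolbits;

	r->entropy_count = count;
}
```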
      Reported-by: Aaron Straus <aaron@merfinllc.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: <stable@kernel.org>		[2.6.26.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pm_qos_requirement might sleep · 9d359357
      John Kacur authored
      Make PM_QOS and CPU_IDLE play nicer when run with the RT-Preempt kernel.
      
      The purpose of the patch is to remove the spin_lock around the read in the
      function pm_qos_requirement - since spinlocks can sleep in -rt and this
      function is called from idle.
      
      CPU_IDLE polls the target_value's of some of the pm_qos parameters from
      the idle loop causing sleeping locking warnings.  Changing the
      target_value to an atomic avoids this issue.
      
      Remove the spinlock in pm_qos_requirement by making target_value an atomic
      type.
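
      The idea can be shown in user-space C with C11 atomics. This is an
      analogy only: the kernel uses its own atomic_t helpers, and toy_qos and
      the toy_* functions are hypothetical names.

```c
#include <stdatomic.h>

struct toy_qos {
	atomic_int target_value;
};

/* Writer side: publish a new target value. */
static void toy_update_target(struct toy_qos *q, int value)
{
	atomic_store(&q->target_value, value);
}

/*
 * Reader side, as would be called from an idle loop: a single atomic
 * load, with no lock that could sleep.
 */
static int toy_requirement(struct toy_qos *q)
{
	return atomic_load(&q->target_value);
}
```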
      Signed-off-by: mark gross <mgross@linux.intel.com>
      Signed-off-by: John Kacur <jkacur@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rtc-cmos: wake again from S5 · 74c4633d
      Rafael J. Wysocki authored
      Update rtc-cmos shutdown handling to leave RTC alarms active, resolving
      http://bugzilla.kernel.org/show_bug.cgi?id=11411 on several boards.  There
      are still some systems where the ACPI event handling doesn't cooperate.
      (Possibly related to bugid 11312, reporting the spontaneous disabling of
      RTC events.)
      
      Bug 11411 reported that changes to work around some ACPI event issues
      broke wake-from-S5 handling, as used for DVR applications.  (They like to
      power off, then wake later to record programs.)
      
      [yakui.zhao@intel.com: add shutdown for PNP devices]
      [dbrownell@users.sourceforge.net: update comments]
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Zhao Yakui <yakui.zhao@intel.com>
      Signed-off-by: Zhang Rui <rui.zhang@intel.com>
      Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
      Cc: Stefan Bauer <stefan.bauer@cs.tu-chemnitz.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sysfs: document files in /sys/firmware/sgi_uv/ · 8b3a8944
      Russ Anderson authored
      Document files in /sys/firmware/sgi_uv/.
      Signed-off-by: Russ Anderson <rja@sgi.com>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ibft: fix target info parsing in ibft module · bb8fb4e6
      Mike Christie authored
      I got this patch through Red Hat's bugzilla from the bug submitter and
      patch creator.  I have just fixed it up so it applies without fuzz to
      upstream kernels.
      
      Original patch and description from Shyam kumar Iyer:
      
      The issue [ibft module not displaying targets with short names] is because
      of an offset calculation error in the iscsi_ibft.c code.  Due to this
      error directory structure for the target in /sys/firmware/ibft does not
      get created and so the initiator is unable to connect to the target.
      
      Note that this bug surfaced only with a name that had a short section at
      the end.  eg: "iqn.1984-05.com.dell:dell".  It did not surface when the
      iqn's had a longer section at the end.  eg:
      "iqn.2001-04.com.example:storage.disk2.sys1.xyz"
      
      So, the eot_offset was calculated such that 48 bytes, i.e. the size of
      the ibft_header, which had already been accounted for, were subtracted
      twice.
      
      This was not evident with longer iqn names because they would overshoot
      the total ibft length more than 48 bytes and thus would escape the bug.
      Signed-off-by: Shyam Kumar Iyer <shyam_iyer@dell.com>
      Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Konrad Rzeszutek <konrad@virtualiron.com>
      Cc: Peter Jones <pjones@redhat.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rtc_time_to_tm: fix signed/unsigned arithmetic · 73442daf
      Jan Altenberg authored
      commit 945185a6 ("rtc: rtc_time_to_tm: use
      unsigned arithmetic") changed some types in rtc_time_to_tm() to
      unsigned:
      
       void rtc_time_to_tm(unsigned long time, struct rtc_time *tm)
       {
      -       register int days, month, year;
      +       unsigned int days, month, year;
      
      This doesn't work for all cases, because days is checked for < 0 later
      on:
      
      if (days < 0) {
      	year -= 1;
      	days += 365 + LEAP_YEAR(year);
      }
      
      I think the correct fix would be to keep days signed and do an appropriate
      cast later on.
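
      The problem is easy to reproduce in isolation. The sketch below is a
      cut-down year/day-of-year split with 'days' kept signed; the function
      name toy_split_days and the helper macros are written for this example
      and only approximate the shape of the real function. With an unsigned
      'days', the subtraction would wrap around and the 'days < 0' branch
      could never run.

```c
#define TOY_LEAPS_THRU_END_OF(y) ((y) / 4 - (y) / 100 + (y) / 400)
#define TOY_LEAP_YEAR(y) ((!((y) % 4) && ((y) % 100)) || !((y) % 400))

/*
 * Split seconds since 1970-01-01 into a year and a zero-based day of
 * the year. 'days' must be signed: the first estimate of the year can
 * overshoot by one, which leaves 'days' negative until corrected.
 */
static void toy_split_days(unsigned long time, int *year, int *yday)
{
	int days = time / 86400;

	*year = 1970 + days / 365;
	days -= (*year - 1970) * 365
		+ TOY_LEAPS_THRU_END_OF(*year - 1)
		- TOY_LEAPS_THRU_END_OF(1970 - 1);

	if (days < 0) {
		*year -= 1;
		days += 365 + TOY_LEAP_YEAR(*year);
	}

	*yday = days;
}
```

      For example, 1230681600 seconds (31 December 2008, 00:00 UTC) first
      estimates the year as 2009 and leaves days at -1; the correction then
      yields year 2008, day 365.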
      Signed-off-by: Jan Altenberg <jan.altenberg@linutronix.de>
      Cc: Maciej W. Rozycki <macro@linux-mips.org>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: David Brownell <david-b@pacbell.net>
      Cc: Dmitri Vorobiev <dmitri.vorobiev@gmail.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tdfxfb: fix frame buffer name overrun · b4a49b12
      Krzysztof Helt authored
      If more than one graphics card is handled by the tdfxfb driver, the
      name of the frame buffer overruns its reserved size.
      Signed-off-by: Krzysztof Helt <krzysztof.h1@wp.pl>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tdfxfb: fix SDRAM memory size detection · bf6910c0
      Krzysztof Helt authored
      Fix memory detection on Voodoo3 cards with SDRAM memory.
      Signed-off-by: Krzysztof Helt <krzysztof.h1@wp.pl>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hp-wmi: add proper hotkey support · a8823aef
      Matthew Garrett authored
      It turns out that event 0x4 merely indicates that a hotkey has been
      pressed, not which one.  A further query is required in order to determine
      the actual keypress.  The following patch adds support for that along with
      the known keycodes.
      Signed-off-by: Matthew Garrett <mjg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hp-wmi: update to match current rfkill semantics · 3f6e2f13
      Matthew Garrett authored
      hp-wmi currently changes the RFKill state by altering the struct members
      rather than using the dedicated interface, meaning that update events
      won't be pushed to userspace.  This patch fixes that, along with fixing
      the declared type of the WWAN kill switch.  It also ensures that rfkill
      interfaces are only registered for hardware that exists.
      Signed-off-by: Matthew Garrett <mjg@redhat.com>
      Acked-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
      Cc: Ivo van Doorn <ivdoorn@gmail.com>
      Cc: Dave Young <hidave.darkstar@gmail.com>
      Cc: Marcel Holtmann <marcel@holtmann.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc: document the new auto_msgmni proc file · 61e55d05
      Nadia Derbey authored
      Update Documentation/filesystems/proc.txt: it describes the file
      auto_msgmni introduced to enable/disable msgmni automatic recomputing upon
      memory add/remove (see thread http://lkml.org/lkml/2008/7/4/27).  Also
      added a description for msgmni (this file is only listed in
      Documentation/sysctl/kernel.txt).
      Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: size of quicklists shouldn't be proportional to the number of CPUs · b9541852
      KOSAKI Motohiro authored
      Quicklists store pages for each CPU as caches.  (Each CPU can cache
      node_free_pages/16 pages)
      
      They are used as a page table cache.  exit() increases the cache size,
      while fork() consumes it.
      
      So for example if an apache-style application runs (one parent and many
      child model), one CPU process will fork() while another CPU will process
      the middleware work and exit().
      
      At that point, the CPU on which the parent runs has no page table
      cache at all, while the others (on which the children run) have full
      caches.
      
      	QList_max = (#ofCPUs - 1) x Free / 16
      	=> QList_max / (Free + QList_max) = (#ofCPUs - 1) / (16 + #ofCPUs - 1)
      
      So, how much quicklist memory is used in the maximum case?

      It is proportional to the number of CPUs, because the per-CPU quicklist
      cache limit does not take the number of CPUs into account.

      The above calculation means
      
      	 Number of CPUs per node            2    4    8   16
      	 ==============================  ====================
      	 QList_max / (Free + QList_max)   5.8%  16%  30%  48%
      
      Wow! Quicklist can spend about 50% memory at worst case.
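
      The percentages in the table follow directly from the formula above.
      A small helper (the name toy_qlist_fraction is made up for this
      example) reproduces them:

```c
/*
 * Worst-case share of (Free + QList_max) held in quicklists on a node
 * with 'cpus' CPUs: (cpus - 1) / (16 + cpus - 1).
 */
static double toy_qlist_fraction(int cpus)
{
	return (double)(cpus - 1) / (16 + cpus - 1);
}
```

      For 2, 4, 8 and 16 CPUs per node this gives 1/17, 3/19, 7/23 and
      15/31, which round to the values in the table.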
      
      My demonstration program is here
      --------------------------------------------------------------------------------
      #define _GNU_SOURCE
      
      #include <stdio.h>
      #include <errno.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sched.h>
      #include <unistd.h>
      #include <sys/mman.h>
      #include <sys/wait.h>
      
      #define BUFFSIZE 512
      
      int max_cpu(void)	/* get max number of logical cpus from /proc/cpuinfo */
      {
        FILE *fd;
        char *ret, buffer[BUFFSIZE];
        int cpu = 1;
      
        fd = fopen("/proc/cpuinfo", "r");
        if (fd == NULL) {
          perror("fopen(/proc/cpuinfo)");
          exit(EXIT_FAILURE);
        }
        while (1) {
          ret = fgets(buffer, BUFFSIZE, fd);
          if (ret == NULL)
            break;
          if (!strncmp(buffer, "processor", 9))
            cpu = atoi(strchr(buffer, ':') + 2);
        }
        fclose(fd);
        return cpu;
      }
      
      void cpu_bind(int cpu)	/* bind current process to one cpu */
      {
        cpu_set_t mask;
        int ret;
      
        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        ret = sched_setaffinity(0, sizeof(mask), &mask);
        if (ret == -1) {
          perror("sched_setaffinity()");
          exit(EXIT_FAILURE);
        }
        sched_yield();	/* not necessary */
      }
      
      #define MMAP_SIZE (10 * 1024 * 1024)	/* 10 MB */
      #define FORK_INTERVAL 1	/* 1 second */
      
      int main(int argc, char *argv[])
      {
        int cpu_max, nextcpu;
        long pagesize;
        pid_t pid;
      
        /* set max number of logical cpu */
        if (argc > 1)
          cpu_max = atoi(argv[1]) - 1;
        else
          cpu_max = max_cpu();
      
        /* get the page size */
        pagesize = sysconf(_SC_PAGESIZE);
        if (pagesize == -1) {
          perror("sysconf(_SC_PAGESIZE)");
          exit(EXIT_FAILURE);
        }
      
        /* prepare parent process */
        cpu_bind(0);
        nextcpu = cpu_max;
      
      loop:
      
        /* select destination cpu for child process by round-robin rule */
        if (++nextcpu > cpu_max)
          nextcpu = 1;
      
        pid = fork();
      
        if (pid == 0) { /* child action */
      
          char *p;
          int i;
      
          /* consume page tables */
          p = mmap(0, MMAP_SIZE, PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
          i = MMAP_SIZE / pagesize;
          while (i-- > 0) {
            *p = 1;
            p += pagesize;
          }
      
          /* move to other cpu */
          cpu_bind(nextcpu);
      /*
          printf("a child moved to cpu%d after mmap().\n", nextcpu);
          fflush(stdout);
       */
      
          /* back page tables to pgtable_quicklist */
          exit(0);
      
        } else if (pid > 0) { /* parent action */
      
          sleep(FORK_INTERVAL);
          waitpid(pid, NULL, WNOHANG);
      
        }
      
        goto loop;
      }
      ----------------------------------------
      
      When the above program, which does task migration, runs, my 8GB box
      spends 800MB of memory on quicklists.  This is not a memory leak, but it
      doesn't seem good.
      
      % cat /proc/meminfo
      
      MemTotal:        7701568 kB
      MemFree:         4724672 kB
      (snip)
      Quicklists:       844800 kB
      
      because
      
      - My machine spec is
      	number of numa node: 2
      	number of cpus:      8 (4CPU x2 node)
              total mem:           8GB (4GB x2 node)
              free mem:            about 5GB
      
      - Then, 4.7GB x 16% ~= 880MB.
        So, Quicklist can use 800MB.
      
      So, if a machine with the following spec runs that program
      
         CPUs: 64 (8cpu x 8node)
         Mem:  1TB (128GB x8node)
      
      Then, quicklist can waste 300GB (= 1TB x 30%).  It is too large.
      
      So, I don't like cache policies which are proportional to the number of CPUs.
      
      My patch changes the number of caches
      from:
         per-cpu-cache-amount = memory_on_node / 16
      to
         per-cpu-cache-amount = memory_on_node / 16 / number_of_cpus_on_node.
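
      The effect of the change on the worst case can be checked with a few
      lines of integer arithmetic. The function names are hypothetical and
      the amounts are in arbitrary units (for instance pages); this is a
      sketch of the two formulas, not the patched kernel code.

```c
/* Old policy: each CPU may cache memory_on_node / 16. */
static unsigned long toy_limit_old(unsigned long memory_on_node)
{
	return memory_on_node / 16;
}

/* New policy: the per-node budget is shared among the node's CPUs. */
static unsigned long toy_limit_new(unsigned long memory_on_node,
				   unsigned int cpus_on_node)
{
	return memory_on_node / 16 / cpus_on_node;
}

/* Worst case from the message: every CPU except the parent's is full. */
static unsigned long toy_worst_case(unsigned long per_cpu_limit,
				    unsigned int cpus_on_node)
{
	return per_cpu_limit * (cpus_on_node - 1);
}
```

      With 1600 units on a node with 8 CPUs, the old limit allows up to
      7 x 100 = 700 units in quicklists, while the new limit allows at most
      7 x 12 = 84, which stays below one sixteenth of the node regardless of
      the CPU count.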
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Keiichiro Tokunaga <tokunaga.keiich@jp.fujitsu.com>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      Tested-by: David Miller <davem@davemloft.net>
      Acked-by: Mike Travis <travis@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: show quicklist usage in /proc/meminfo · 4b856152
      KOSAKI Motohiro authored
      Quicklists can consume several GB of memory.  We should provide a means of
      monitoring this.
      
      After this patch is applied, /proc/meminfo will output the following:
      
      % cat /proc/meminfo
      
      MemTotal:      7715392 kB
      MemFree:       5401600 kB
      Buffers:         80384 kB
      Cached:         300800 kB
      SwapCached:          0 kB
      Active:         235584 kB
      Inactive:       262656 kB
      SwapTotal:     2031488 kB
      SwapFree:      2031488 kB
      Dirty:            3520 kB
      Writeback:           0 kB
      AnonPages:      117696 kB
      Mapped:          38528 kB
      Slab:          1589952 kB
      SReclaimable:    23104 kB
      SUnreclaim:    1566848 kB
      PageTables:      14656 kB
      NFS_Unstable:        0 kB
      Bounce:              0 kB
      WritebackTmp:        0 kB
      CommitLimit:   5889152 kB
      Committed_AS:   393152 kB
      VmallocTotal: 17592177655808 kB
      VmallocUsed:     29056 kB
      VmallocChunk: 17592177626432 kB
      Quicklists:     130944 kB
      HugePages_Total:     0
      HugePages_Free:      0
      HugePages_Rsvd:      0
      HugePages_Surp:      0
      Hugepagesize:    262144 kB
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Keiichiro Tokunaga <tokunaga.keiich@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • devcgroup: fix race against rmdir() · 36fd71d2
      Li Zefan authored
      During the use of a dev_cgroup, we should guarantee the corresponding
      cgroup won't be deleted (i.e.  via rmdir).  This can be done through
      css_get(&dev_cgroup->css), but here we can just get and use the dev_cgroup
      under rcu_read_lock.
      
      Also remove the NULL check on dev_cgroup; it can't be NULL, since a task
      always belongs to a cgroup.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cirrusfb: check_par fixes · 09a2910e
      Krzysztof Helt authored
      1. Check if virtual resolution fits into memory.
         Otherwise, Linux hangs during panning.
      2. When selected, use all available memory to
         maximize yres_virtual to speed up panning
         (previously xres_virtual was also increased).
      3. Simplify memory restriction calculations.
      Signed-off-by: Krzysztof Helt <krzysztof.h1@poczta.fm>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09a2910e
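The first two points of the cirrusfb change can be illustrated with a small userspace sketch. This is not the driver's code: the `struct mode` type, the `fit_virtual_res` name, and the byte-based memory size are assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few mode fields used in this sketch. */
struct mode {
    uint32_t xres_virtual;   /* virtual width in pixels */
    uint32_t yres_virtual;   /* virtual height in pixels */
    uint32_t yres;           /* visible height in pixels */
    uint32_t bits_per_pixel;
};

/*
 * Return false if the virtual resolution does not fit into video memory.
 * If maximize is true, grow yres_virtual (and only yres_virtual) so that
 * the virtual screen uses all of the memory.
 */
static bool fit_virtual_res(struct mode *m, uint64_t vram_bytes, bool maximize)
{
    /* Bytes per scanline, rounded up to cope with depths below 8 bpp. */
    uint64_t line = ((uint64_t)m->xres_virtual * m->bits_per_pixel + 7) / 8;
    uint64_t max_lines;

    if (line == 0)
        return false;

    max_lines = vram_bytes / line;
    if (m->yres_virtual > max_lines || m->yres > max_lines)
        return false;   /* panning would run past the end of memory */

    if (maximize)
        m->yres_virtual = max_lines > UINT32_MAX ? UINT32_MAX
                                                 : (uint32_t)max_lines;
    return true;
}
```

Rejecting the mode up front, rather than clamping it silently, mirrors the commit's stated goal of avoiding a hang during panning.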
    • Oleg Nesterov's avatar
      pid_ns: (BUG 11391) change ->child_reaper when init->group_leader exits · 950bbabb
      Oleg Nesterov authored
      We don't change pid_ns->child_reaper when the main thread of the
      subnamespace init exits.  As Robert Rex <robert.rex@exasol.com> pointed
      out, this is wrong.
      
      Yes, the re-parenting itself works correctly, but if the reparented task
      exits it needs ->parent->nsproxy->pid_ns in do_notify_parent(), and if the
      main thread is a zombie its ->nsproxy was already cleared by
      exit_task_namespaces().
      
      Introduce the new function, find_new_reaper(), which finds the new
      ->parent for the re-parenting and changes ->child_reaper if needed.  Kill
      the now unneeded exit_child_reaper().
      
      Also move the changing of ->child_reaper from zap_pid_ns_processes() to
      find_new_reaper(); this consolidates the games with ->child_reaper and
      makes it stable under tasklist_lock.
      
      Addresses http://bugzilla.kernel.org/show_bug.cgi?id=11391
      Reported-by: Robert Rex <robert.rex@exasol.com>
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Acked-by: Pavel Emelyanov <xemul@openvz.org>
      Acked-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      950bbabb
    • Oleg Nesterov's avatar
      pid_ns: zap_pid_ns_processes: fix the ->child_reaper changing · add0d4df
      Oleg Nesterov authored
      zap_pid_ns_processes() sets pid_ns->child_reaper = NULL; this is wrong.
      
      Yes, we have already killed all tasks in this namespace, and sys_wait4()
      doesn't see any child.  But this doesn't mean the ->children list is
      empty: we may have EXIT_DEAD tasks which are not visible to do_wait().
      In that case the subsequent forget_original_parent() will crash the
      kernel because it will try to re-parent these tasks to the NULL reaper.
      
      Even if there are no children, it is not good that
      forget_original_parent() uses reaper == NULL.
      
      Change the code to set ->child_reaper = init_pid_ns.child_reaper instead.
      We could use pid_ns->parent->child_reaper as well; I think this does not
      really matter.  These EXIT_DEAD tasks are not visible to the new ->parent
      after re-parenting; they will silently do release_task() eventually.
      
      Note that we must change ->child_reaper, otherwise
      forget_original_parent() will use reaper == father, and in that case we
      will hit the (correct) BUG_ON(!list_empty(&father->children)).
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Acked-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Acked-by: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      add0d4df
    • David Brownell's avatar
      mmc: at91_mci: don't use coherent dma buffers · e385ea63
      David Brownell authored
      At91_mci is abusing dma_free_coherent(), which must not be called with
      IRQs disabled.  I saw "mkfs.ext3" on an MMC card objecting voluminously
      as each write completed:
      
       WARNING: at arch/arm/mm/consistent.c:368 dma_free_coherent+0x2c/0x224()
       [<c002726c>] (dump_stack+0x0/0x14) from [<c00387d4>] (warn_on_slowpath+0x4c/0x68)
       [<c0038788>] (warn_on_slowpath+0x0/0x68) from [<c0028768>] (dma_free_coherent+0x2c/0x224)
        r6:00008008 r5:ffc06000 r4:00000000
       [<c002873c>] (dma_free_coherent+0x0/0x224) from [<c01918ac>] (at91_mci_irq+0x374/0x420)
       [<c0191538>] (at91_mci_irq+0x0/0x420) from [<c0065d9c>] (handle_IRQ_event+0x2c/0x6c)
       ...
      
      This bug has been around for a LONG time.  The MM warning is from late
      2005, but the driver merged a year later ...  so I'm puzzled why nobody
      noticed this before now.
      
      The fix involves noting that this buffer shouldn't be DMA-coherent; it's
      just used for normal DMA writes.  So replace it with standard kmalloc()
      buffering and DMA mapping calls.
      
      This is the quickie fix.  A better one would not rely on allocating large
      bounce buffers.  (Note that dma_alloc_coherent could have failed too, but
      that case was ignored...  kmalloc is a bit more likely to fail though.)
      Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
      Acked-by: Pierre Ossman <drzeus-mmc@drzeus.cx>
      Cc: Andrew Victor <linux@maxim.org.za>
      Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e385ea63
    • Will Newton's avatar
      8250: improve workaround for UARTs that don't re-assert THRE correctly · 363f66fe
      Will Newton authored
      Recent changes to tighten the check for UARTs that don't correctly
      re-assert THRE (01c194d9: "serial 8250: tighten test for using backup
      timer") caused problems when such a UART was opened for the second
      time: the bug could only successfully be detected at first
      initialization.  For users of this version of this particular UART IP
      it is fatal.
      
      This patch stores the information about the bug in the bugs field of the
      port structure when the port is first started up so subsequent opens can
      check this bit even if the test for the bug fails.
      
      David Brownell: "My own exposure to this is that the UART on DaVinci
      hardware, which TI allegedly derived from its original 16550 logic, has
      periodically gone from working to unusable with the mainline 8250.c ...
      and back and forth a bunch.  Currently it's "unusable", a regression from
      some previous versions.  With this patch from Will, it's usable."
      Signed-off-by: Will Newton <will.newton@gmail.com>
      Acked-by: Alex Williamson <alex.williamson@hp.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: David Brownell <david-b@pacbell.net>
      Cc: <stable@kernel.org>		[2.6.26.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      363f66fe
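The idea of recording the quirk in the port's bugs field can be modelled in a few lines of userspace C. The sketch below is an illustration, not the 8250 driver's code: `struct port`, `BUG_THRE` and its bit value, and `startup_needs_backup_timer` are hypothetical names, and the `probe_detects` argument stands in for the run-time THRE test.

```c
#include <stdbool.h>

#define BUG_THRE (1u << 3)   /* hypothetical flag bit for this sketch */

/* Hypothetical port structure: only the sticky quirk flags are modelled. */
struct port {
    unsigned int bugs;       /* survives across open/close cycles */
};

/*
 * Model of the startup path. probe_detects is the outcome of the THRE
 * test, which may succeed only on the first open. Returns true if the
 * backup timer workaround should be used for this open.
 */
static bool startup_needs_backup_timer(struct port *p, bool probe_detects)
{
    if (probe_detects)
        p->bugs |= BUG_THRE;            /* remember the quirk */

    /* True on later opens too, even when the probe itself fails. */
    return (p->bugs & BUG_THRE) != 0;
}
```

Because the flag is only ever set, never cleared, a port that was detected as buggy once keeps the workaround for every later open.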
    • Henrik Rydberg's avatar
      MAINTAINERS: add a maintainer for the BCM5974 multitouch driver · bd7aa4b2
      Henrik Rydberg authored
      Signed-off-by: Henrik Rydberg <rydberg@euromail.se>
      Cc: Dmitry Torokhov <dtor@mail.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd7aa4b2
    • Marcin Slusarz's avatar
      mm/bootmem: silence section mismatch warning - contig_page_data/bootmem_node_data · 52765583
      Marcin Slusarz authored
      WARNING: vmlinux.o(.data+0x1f5c0): Section mismatch in reference from the variable contig_page_data to the variable .init.data:bootmem_node_data
      The variable contig_page_data references
      the variable __initdata bootmem_node_data
      If the reference is valid then annotate the
      variable with __init* (see linux/init.h) or name the variable:
      *driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console,
      Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
      Cc: Johannes Weiner <hannes@saeurebad.de>
      Cc: Sean MacLennan <smaclennan@pikatech.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      52765583
    • Russ Dill's avatar
      acer-wmi: remove debugfs entries upon unloading · 39dbbb45
      Russ Dill authored
      The exit function neglects to remove debugfs entries, leading to a BUG
      on reload.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: Russ Dill <Russ.Dill@gmail.com>
      Acked-by: Carlos Corbacho <carlos@strangeworlds.co.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      39dbbb45
    • Hisashi Hifumi's avatar
      VFS: fix dio write returning EIO when try_to_release_page fails · 6ccfa806
      Hisashi Hifumi authored
      Dio write returns EIO when try_to_release_page fails because bh is
      still referenced.
      
      The patch
      
          commit 3f31fddf
          Author: Mingming Cao <cmm@us.ibm.com>
          Date:   Fri Jul 25 01:46:22 2008 -0700
      
              jbd: fix race between free buffer and commit transaction
      
      was merged into 2.6.27-rc1, but I noticed that this patch is not enough
      to fix the race.
      
      I ran fsstress heavily against 2.6.27-rc1 and found that dio writes
      still sometimes returned EIO.
      
      The patch above fixed the race between freeing a buffer (dio) and
      committing a transaction (jbd), but I discovered another race, between
      freeing a buffer (dio) and ext3/4_ordered_writepage.
      
      : background_writeout()
           ->write_cache_pages()
             ->ext3_ordered_writepage()
                 walk_page_buffers() -> take a bh ref
                 block_write_full_page() -> unlock_page
                     : <- end_page_writeback
                     : <- race! (dio write->try_to_release_page fails)
                 walk_page_buffers() -> release a bh ref
      
      ext3_ordered_writepage takes a bh ref and calls unlock_page while still
      holding that ref, which opens the race and makes try_to_release_page
      fail.
      
      To fix this race, I used the approach of falling back to buffered
      writes if try_to_release_page() fails on a page.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mingming Cao <cmm@us.ibm.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6ccfa806
    • Adam Litke's avatar
      mm: make setup_zone_migrate_reserve() aware of overlapping nodes · 344c790e
      Adam Litke authored
      I have gotten to the root cause of the hugetlb badness I reported back on
      August 15th.  My system has the following memory topology (note the
      overlapping node):
      
                  Node 0 Memory: 0x8000000-0x44000000
                  Node 1 Memory: 0x0-0x8000000 0x44000000-0x80000000
      
      setup_zone_migrate_reserve() scans the address range 0x0-0x8000000 looking
      for a pageblock to move onto the MIGRATE_RESERVE list.  Finding no
      candidates, it happily continues the scan into 0x8000000-0x44000000.  When
      a pageblock is found, the pages are moved to the MIGRATE_RESERVE list on
      the wrong zone.  Oops.
      
      setup_zone_migrate_reserve() should skip pageblocks in overlapping nodes.
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: <stable@kernel.org>		[2.6.25.x, 2.6.26.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      344c790e
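The overlapping-node problem described above can be reproduced in a toy model. The sketch below is not the kernel's setup_zone_migrate_reserve(): `struct pageblock`, its fields, and `find_reserve_block` are hypothetical, and each array element stands for one pageblock tagged with the node that owns it.

```c
#include <stddef.h>

/* Hypothetical model: each pageblock records the node that owns it. */
struct pageblock {
    int nid;        /* owning node */
    int movable;    /* non-zero if it may be moved to the reserve list */
};

/*
 * Scan blocks[start..end) on behalf of a zone that lives on node zone_nid
 * and return the index of the first block that can be reserved, or -1 if
 * there is none. Blocks owned by another node are skipped even though
 * they fall inside the zone's address span.
 */
static long find_reserve_block(const struct pageblock *blocks,
                               size_t start, size_t end, int zone_nid)
{
    for (size_t i = start; i < end; i++) {
        if (blocks[i].nid != zone_nid)
            continue;   /* overlapping node: not this zone's memory */
        if (blocks[i].movable)
            return (long)i;
    }
    return -1;
}
```

With a layout like the one in the commit message (node 1 owning the first and last ranges, node 0 owning the middle), a scan for node 1 passes over the node 0 blocks instead of reserving one of them.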
    • Adrian Bunk's avatar
      NTFS: update homepage · 169ccbd4
      Adrian Bunk authored
      Update the location of the NTFS homepage in several files.
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Cc: Jeff Garzik <jeff@garzik.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      169ccbd4
    • Breno Leitao's avatar
      ipv: Re-enable IP when MTU > 68 · 06770843
      Breno Leitao authored
      Re-enable IP when the MTU gets back to a valid size.
      
      This patch checks whether in_dev is NULL on a NETDEV_CHANGEMTU event
      and, if the MTU is valid (bigger than 68), re-enables in_dev.
      
      Also add a function that checks whether the MTU size is valid.
      Signed-off-by: Breno Leitao <leitao@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      06770843
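A rough userspace model of the MTU check and the re-enable step might look as follows. None of these names come from the patch: `struct netdev`, `valid_mtu`, and `on_change_mtu` are hypothetical, and treating exactly 68 bytes as valid is an assumption of this sketch (68 octets is the minimum datagram size every IPv4 module must forward per RFC 791).

```c
#include <stdbool.h>

#define MIN_IPV4_MTU 68   /* smallest MTU this sketch treats as valid */

/* Hypothetical device model: ip_enabled stands in for "in_dev != NULL". */
struct netdev {
    unsigned int mtu;
    bool ip_enabled;
};

/* Helper that decides whether an MTU is large enough for IPv4. */
static bool valid_mtu(unsigned int mtu)
{
    return mtu >= MIN_IPV4_MTU;
}

/* Model of the MTU-change notification for one device. */
static void on_change_mtu(struct netdev *dev)
{
    if (!dev->ip_enabled && valid_mtu(dev->mtu))
        dev->ip_enabled = true;     /* MTU is valid again: re-enable IP */
    else if (dev->ip_enabled && !valid_mtu(dev->mtu))
        dev->ip_enabled = false;    /* MTU too small: disable IP */
}
```

Keeping the size test in one helper means the disable and re-enable paths cannot disagree about where the threshold lies.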
    • Julien Brunel's avatar
      net/xfrm: Use an IS_ERR test rather than a NULL test · 9d7d7402
      Julien Brunel authored
      In case of error, the function xfrm_bundle_create returns an ERR
      pointer, but never returns a NULL pointer. So a NULL test that comes
      after an IS_ERR test should be deleted.
      
      The semantic match that finds this problem is as follows:
      (http://www.emn.fr/x-info/coccinelle/)
      
      // <smpl>
      @match_bad_null_test@
      expression x, E;
      statement S1,S2;
      @@
      x =  xfrm_bundle_create(...)
      ... when != x = E
      *  if (x != NULL) 
      S1 else S2
      // </smpl>
      Signed-off-by: Julien Brunel <brunel@diku.dk>
      Signed-off-by: Julia Lawall <julia@diku.dk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9d7d7402
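The error-pointer convention behind this fix can be imitated in userspace. The helpers below are lowercase stand-ins written for this sketch, not the kernel's ERR_PTR/IS_ERR macros, and `create_bundle` is a hypothetical function that, like the one described above, returns an error pointer on failure and never NULL.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095   /* error codes occupy the top 4095 addresses */

/* Encode a negative errno value as a pointer. */
static inline void *err_ptr(long error) { return (void *)(intptr_t)error; }

/* Decode the errno value from an error pointer. */
static inline long ptr_err(const void *ptr) { return (long)(intptr_t)ptr; }

/* True if ptr lies in the range reserved for encoded error codes. */
static inline bool is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical creator: an error pointer on failure, never NULL. */
static void *create_bundle(bool fail)
{
    static int bundle;
    return fail ? err_ptr(-ENOMEM) : (void *)&bundle;
}

/* Caller: one is_err() test is enough; a further NULL test is dead code. */
static long use_bundle(bool fail)
{
    void *b = create_bundle(fail);

    if (is_err(b))
        return ptr_err(b);
    return 0;
}
```

Note that NULL is not an error pointer under this encoding, which is why a function must document which of the two conventions it follows.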