Commits · 02ac81a812fe0575a8475a93bdc22fb291ebad91 · nexedi / linux

16 Feb, 2011 1 commit

Merge branch 'x86/bootmem' into x86/mm · 02ac81a8

Ingo Molnar authored Feb 16, 2011

Merge reason: the topic is ready - consolidate it into the more generic x86/mm tree
              and prevent conflicts.
Signed-off-by: Ingo Molnar <mingo@elte.hu>

02ac81a8

14 Feb, 2011 8 commits

x86: Emit "mem=nopentium ignored" warning when not supported · 9a6d44b9

Kamal Mostafa authored Feb 03, 2011

Emit warning when "mem=nopentium" is specified on any arch other
than x86_32 (the only that arch supports it).
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
BugLink: http://bugs.launchpad.net/bugs/553464
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
LKML-Reference: <1296783486-23033-2-git-send-email-kamal@canonical.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>

9a6d44b9

x86: Fix panic when handling "mem={invalid}" param · 77eed821

Kamal Mostafa authored Feb 03, 2011

Avoid removing all of memory and panicing when "mem={invalid}"
is specified, e.g. mem=blahblah, mem=0, or mem=nopentium (on
platforms other than x86_32).
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
BugLink: http://bugs.launchpad.net/bugs/553464
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: <stable@kernel.org> # .3x: as far back as it applies
LKML-Reference: <1296783486-23033-1-git-send-email-kamal@canonical.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

77eed821

x86: Avoid tlbstate lock if not enough cpus · 7064d865

Shaohua Li authored Jan 17, 2011

This one isn't related to previous patch. If online cpus are
below NUM_INVALIDATE_TLB_VECTORS, we don't need the lock. The
comments in the code declares we don't need the check, but a hot
lock still needs an atomic operation and expensive, so add the
check here.

Uses nr_cpu_ids here as suggested by Eric Dumazet.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Andi Kleen <andi@firstfloor.org>
LKML-Reference: <1295232730.1949.710.camel@sli10-conroe>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

7064d865

x86: Scale up the number of TLB invalidate vectors with NR_CPUs, up to 32 · 70e4a369

Shaohua Li authored Jan 17, 2011

Make the maxium TLB invalidate vectors depend on NR_CPUS linearly,
with a maximum of 32 vectors.

We currently only have 8 vectors for TLB invalidation and that is clearly
inadequate. If we have a lot of CPUs, the CPUs need share the 8 vectors and
tlbstate_lock is used to protect them. flush_tlb_page() is
heavily used in page reclaim, which will cause a lot of lock
contention for tlbstate_lock.

Andi Kleen suggested increasing the vectors number to 32, which should be
good for current typical systems to reduce the tlbstate_lock contention.

My test system has 4 sockets and 64G memory, and 64 CPUs. My
workload creates 64 processes. Each process mmap reads a big
empty sparse file. The total size of the files are 2*total_mem,
so this will cause a lot of page reclaim.

Below is the result I get from perf call-graph profiling:

 without the patch:
 ------------------

    24.25%           usemem  [kernel]                                   [k] _raw_spin_lock
                     |
                     --- _raw_spin_lock
                        |
                        |--42.15%-- native_flush_tlb_others

 with the patch:
 ------------------

    14.96%           usemem  [kernel]                                   [k] _raw_spin_lock
                     |
                     --- _raw_spin_lock
                        |--13.89%-- native_flush_tlb_others

So this heavily reduces the tlbstate_lock contention.
Suggested-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1295232727.1949.709.camel@sli10-conroe>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

70e4a369

x86: Allocate 32 tlb_invalidate_interrupt handler stubs · 3a09fb45

Shaohua Li authored Jan 17, 2011

Add up to 32 invalidate_interrupt handlers. How many handlers are
added depends on NUM_INVALIDATE_TLB_VECTORS. So if
NUM_INVALIDATE_TLB_VECTORS is smaller than 32, we reduce code
size.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
LKML-Reference: <1295232725.1949.708.camel@sli10-conroe>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

3a09fb45

x86: Cleanup vector usage · 60f6e65d

Shaohua Li authored Jan 17, 2011

Cleanup the vector usage and make them continuous if possible.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
LKML-Reference: <1295232722.1949.707.camel@sli10-conroe>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

60f6e65d

Merge branch 'linus' into x86/bootmem · d2137d5a

Ingo Molnar authored Feb 14, 2011

Conflicts:
	arch/x86/mm/numa_64.c

Merge reason: fix the conflict, update to latest -rc and pick up this
              dependent fix from Yinghai:

  e6d2e2b2: memblock: don't adjust size in memblock_find_base()
Signed-off-by: Ingo Molnar <mingo@elte.hu>

d2137d5a

klist: Fix object alignment on 64-bit. · 795abaf1

David Miller authored Feb 13, 2011

Commit c0e69a5b ("klist.c: bit 0 in pointer can't be used as flag")
intended to make sure that all klist objects were at least pointer size
aligned, but used the constant "4" which only works on 32-bit.

Use "sizeof(void *)" which is correct in all cases.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: stable <stable@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

795abaf1

13 Feb, 2011 6 commits

Merge branch 'spi/merge' of git://git.secretlab.ca/git/linux-2.6 · 091994cf

Linus Torvalds authored Feb 13, 2011

* 'spi/merge' of git://git.secretlab.ca/git/linux-2.6:
  devicetree-discuss is moderated for non-subscribers
  MAINTAINERS: Add entry for GPIO subsystem
  dt: add documentation of ARM dt boot interface
  dt: Remove obsolete description of powerpc boot interface
  dt: Move device tree documentation out of powerpc directory
  spi/spi_sh_msiof: fix wrong address calculation, which leads to an Oops

091994cf

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6 · d8ed516f

Linus Torvalds authored Feb 13, 2011

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
  ALSA: hda - add quirk for Ordissimo EVE using a realtek ALC662
  ALSA: hrtimer: remove superfluous tasklet invocation
  ALSA: hrtimer: handle delayed timer interrupts
  ALSA: HDA: Add subwoofer quirk for Acer Aspire 8942G
  ALSA: hda - Don't handle empty patch files
  ALSA: hda - Fix missing CA initialization for HDMI/DP
  ALSA: usbaudio - Enable the E-MU 0204 USB
  ALSA: hda - switch lfe with side in mixer for 4930g
  ASoC: Improve WM8994 digital power sequencing
  ASoC: Create an AIF1ADCDAT signal widget to match AIF2
  asoc: davinci: da830/omap-l137: correct cpu_dai_name
  ASoC: fill in snd_soc_pcm_runtime.card before calling snd_soc_dai_link.init()

d8ed516f

Revert "pci: use security_capable() when checking capablities during config space read" · f00eaeea

Linus Torvalds authored Feb 13, 2011

This reverts commit 47970b1b.

It turns out it breaks several distributions.  Looks like the stricter
selinux checks fail due to selinux policies not being set to allow the
access - breaking X, but also lspci.

So while the change was clearly the RightThing(tm) to do in theory, in
practice we have backwards compatibility issues making it not work.
Reported-by: Dave Young <hidave.darkstar@gmail.com>
Acked-by: David Airlie <airlied@linux.ie>
Acked-by: Alex Riesen <raa.lkml@gmail.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f00eaeea

Merge branch 'fix/asoc' into for-linus · 61461241
Takashi Iwai authored Feb 13, 2011

61461241
Merge branch 'devicetree/merge' into spi/merge · c170093d
Grant Likely authored Feb 12, 2011

c170093d

devicetree-discuss is moderated for non-subscribers · 78bba987

Paul Bolle authored Feb 12, 2011

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>

78bba987

12 Feb, 2011 25 commits

MAINTAINERS: Add entry for GPIO subsystem · a0dc00b4

Grant Likely authored Feb 12, 2011

I'll probably regret this....
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a0dc00b4

Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · c8e0b00e

Linus Torvalds authored Feb 12, 2011

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  jbd2: call __jbd2_log_start_commit with j_state_lock write locked
  ext4: serialize unaligned asynchronous DIO
  ext4: make grpinfo slab cache names static
  ext4: Fix data corruption with multi-block writepages support
  ext4: fix up ext4 error handling
  ext4: unregister features interface on module unload
  ext4: fix panic on module unload when stopping lazyinit thread

c8e0b00e

jbd2: call __jbd2_log_start_commit with j_state_lock write locked · e4471831

Theodore Ts'o authored Feb 12, 2011

On an SMP ARM system running ext4, I've received a report that the
first J_ASSERT in jbd2_journal_commit_transaction has been triggering:

	J_ASSERT(journal->j_running_transaction != NULL);

While investigating possible causes for this problem, I noticed that
__jbd2_log_start_commit() is getting called with j_state_lock only
read-locked, in spite of the fact that it's possible for it might
j_commit_request.  Fix this by grabbing the necessary information so
we can test to see if we need to start a new transaction before
dropping the read lock, and then calling jbd2_log_start_commit() which
will grab the write lock.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

e4471831

ext4: serialize unaligned asynchronous DIO · e9e3bcec

Eric Sandeen authored Feb 12, 2011

ext4 has a data corruption case when doing non-block-aligned
asynchronous direct IO into a sparse file, as demonstrated
by xfstest 240.

The root cause is that while ext4 preallocates space in the
hole, mappings of that space still look "new" and 
dio_zero_block() will zero out the unwritten portions.  When
more than one AIO thread is going, they both find this "new"
block and race to zero out their portion; this is uncoordinated
and causes data corruption.

Dave Chinner fixed this for xfs by simply serializing all
unaligned asynchronous direct IO.  I've done the same here.
The difference is that we only wait on conversions, not all IO.
This is a very big hammer, and I'm not very pleased with
stuffing this into ext4_file_write().  But since ext4 is
DIO_LOCKING, we need to serialize it at this high level.

I tried to move this into ext4_ext_direct_IO, but by then
we have the i_mutex already, and we will wait on the
work queue to do conversions - which must also take the
i_mutex.  So that won't work.

This was originally exposed by qemu-kvm installing to
a raw disk image with a normal sector-63 alignment.  I've
tested a backport of this patch with qemu, and it does
avoid the corruption.  It is also quite a lot slower
(14 min for package installs, vs. 8 min for well-aligned)
but I'll take slow correctness over fast corruption any day.

Mingming suggested that we can track outstanding
conversions, and wait on those so that non-sparse
files won't be affected, and I've implemented that here;
unaligned AIO to nonsparse files won't take a perf hit.

[tytso@mit.edu: Keep the mutex as a hashed array instead
 of bloating the ext4 inode]

[tytso@mit.edu: Fix up namespace issues so that global
 variables are protected with an "ext4_" prefix.]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

e9e3bcec

ext4: make grpinfo slab cache names static · 2892c15d

Eric Sandeen authored Feb 12, 2011

In 2.6.37 I was running into oopses with repeated module
loads & unloads.  I tracked this down to:

fb1813f4 ext4: use dedicated slab caches for group_info structures

(this was in addition to the features advert unload problem)

The kstrdup & subsequent kfree of the cache name was causing
a double free.  In slub, at least, if I read it right it allocates
& frees the name itself, slab seems to do something different...
so in slub I think we were leaking -our- cachep->name, and double
freeing the one allocated by slub.

After getting lost in slab/slub/slob a bit, I just looked at other
sized-caches that get allocated.  jbd2, biovec, sgpool all do it
more or less the way jbd2 does.  Below patch follows the jbd2
method of dynamically allocating a cache at mount time from
a list of static names.

(This might also possibly fix a race creating the caches with
parallel mounts running).

[Folded in a fix from Dan Carpenter which fixed an off-by-one error in
the original patch]

Cc: stable@kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

2892c15d

MAINTAINERS: Add entry for GPIO subsystem · 557218e2
Grant Likely authored Feb 12, 2011
```
I'll probably regret this....
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
```
557218e2

Merge branch 'kvm-updates/2.6.38' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 3c6c0d6c

Linus Torvalds authored Feb 11, 2011

* 'kvm-updates/2.6.38' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: SVM: Make sure KERNEL_GS_BASE is valid when loading gs_index

3c6c0d6c

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp · 5b49378e
Linus Torvalds authored Feb 11, 2011
```
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp:
  amd64_edac: Fix DIMMs per DCTs output
```
5b49378e
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm · d40b0c34
Linus Torvalds authored Feb 11, 2011
```
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
  dlm: use single thread workqueues
```
d40b0c34

Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 · 3aec46c1

Linus Torvalds authored Feb 11, 2011

* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: don't always drop malformed replies on the floor (try #3)
  cifs: clean up checks in cifs_echo_request
  [CIFS] Do not send SMBEcho requests on new sockets until SMBNegotiate

3aec46c1

Merge branch 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/staging · 68c3d4b2

Linus Torvalds authored Feb 11, 2011

* 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/staging:
  hwmon: (emc1403) Fix I2C address range
  hwmon: (lm63) Consider LM64 temperature offset

68c3d4b2

Merge branch 'for-linus' of... · f7909fb8

Linus Torvalds authored Feb 11, 2011

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
  pci: use security_capable() when checking capablities during config space read
  security: add cred argument to security_capable()
  tpm_tis: Use timeouts returned from TPM

f7909fb8

Merge branch 's5p-fixes-for-linus' of... · c41d40b5

Linus Torvalds authored Feb 11, 2011

Merge branch 's5p-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/kgene/linux-samsung

* 's5p-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/kgene/linux-samsung:
  ARM: SAMSUNG: Ensure struct sys_device is declared in plat/pm.h
  ARM: S5PV310: Cleanup System MMU
  ARM: S5PV310: Add support System MMU on SMDKV310

c41d40b5

Merge branch 'next' of git://git.monstr.eu/linux-2.6-microblaze · a288465f

Linus Torvalds authored Feb 11, 2011

* 'next' of git://git.monstr.eu/linux-2.6-microblaze:
  microblaze: Fix msr instruction detection
  microblaze: Fix pte_update function
  microblaze: Fix asm compilation warning
  microblaze: Fix IRQ flag handling for MSR=0

a288465f

drivers/w1/masters/omap_hdq.c: add missing clk_put · 80d02d27

Julia Lawall authored Feb 10, 2011

This code makes two calls to clk_get, then test both return values and
fails if either failed.

The problem is that in the first inner if, where the first call to
clk_get has failed, it don't know if the second call has failed as well.
So it don't know whether clk_get should be called on the result of the
second call.  Of course, it would be possible to test that value again.
A simpler solution is just to test the result of calling clk_get
directly after each call.

The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@r@
position p1,p2;
expression e;
statement S;
@@

e = clk_get@p1(...)
...
if@p2 (IS_ERR(e)) S

@@
expression e;
statement S;
identifier l;
position r.p1, p2 != r.p2;
@@

*e = clk_get@p1(...)
... when != clk_put(e)
*if@p2 (...)
{
  ... when != clk_put(e)
* return ...;
}// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Acked-by: Tony Lindgren <tony@atomide.com>
Acked-by: Amit Kucheria <amit.kucheria@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

80d02d27

memcg: fix leak of accounting at failure path of hugepage collapsing · 678ff896

KAMEZAWA Hiroyuki authored Feb 10, 2011

mem_cgroup_uncharge_page() should be called in all failure cases after
mem_cgroup_charge_newpage() is called in huge_memory.c::collapse_huge_page()

 [ 4209.076861] BUG: Bad page state in process khugepaged  pfn:1e9800
 [ 4209.077601] page:ffffea0006b14000 count:0 mapcount:0 mapping:          (null) index:0x2800
 [ 4209.078674] page flags: 0x40000000004000(head)
 [ 4209.079294] pc:ffff880214a30000 pc->flags:2146246697418756 pc->mem_cgroup:ffffc9000177a000
 [ 4209.082177] (/A)
 [ 4209.082500] Pid: 31, comm: khugepaged Not tainted 2.6.38-rc3-mm1 #1
 [ 4209.083412] Call Trace:
 [ 4209.083678]  [<ffffffff810f4454>] ? bad_page+0xe4/0x140
 [ 4209.084240]  [<ffffffff810f53e6>] ? free_pages_prepare+0xd6/0x120
 [ 4209.084837]  [<ffffffff8155621d>] ? rwsem_down_failed_common+0xbd/0x150
 [ 4209.085509]  [<ffffffff810f5462>] ? __free_pages_ok+0x32/0xe0
 [ 4209.086110]  [<ffffffff810f552b>] ? free_compound_page+0x1b/0x20
 [ 4209.086699]  [<ffffffff810fad6c>] ? __put_compound_page+0x1c/0x30
 [ 4209.087333]  [<ffffffff810fae1d>] ? put_compound_page+0x4d/0x200
 [ 4209.087935]  [<ffffffff810fb015>] ? put_page+0x45/0x50
 [ 4209.097361]  [<ffffffff8113f779>] ? khugepaged+0x9e9/0x1430
 [ 4209.098364]  [<ffffffff8107c870>] ? autoremove_wake_function+0x0/0x40
 [ 4209.099121]  [<ffffffff8113ed90>] ? khugepaged+0x0/0x1430
 [ 4209.099780]  [<ffffffff8107c236>] ? kthread+0x96/0xa0
 [ 4209.100452]  [<ffffffff8100dda4>] ? kernel_thread_helper+0x4/0x10
 [ 4209.101214]  [<ffffffff8107c1a0>] ? kthread+0x0/0xa0
 [ 4209.101842]  [<ffffffff8100dda0>] ? kernel_thread_helper+0x0/0x10
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

678ff896

vmscan: fix zone shrinking exit when scan work is done · f0fdc5e8

Johannes Weiner authored Feb 10, 2011

Commit 3e7d3449 ("mm: vmscan: reclaim order-0 and use compaction
instead of lumpy reclaim") introduced an indefinite loop in
shrink_zone().

It meant to break out of this loop when no pages had been reclaimed and
not a single page was even scanned.  The way it would detect the latter
is by taking a snapshot of sc->nr_scanned at the beginning of the
function and comparing it against the new sc->nr_scanned after the scan
loop.  But it would re-iterate without updating that snapshot, looping
forever if sc->nr_scanned changed at least once since shrink_zone() was
invoked.

This is not the sole condition that would exit that loop, but it
requires other processes to change the zone state, as the reclaimer that
is stuck obviously can not anymore.

This is only happening for higher-order allocations, where reclaim is
run back to back with compaction.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Kent Overstreet<kent.overstreet@gmail.com>
Reported-by: Kent Overstreet <kent.overstreet@gmail.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f0fdc5e8

mlock: do not munlock pages in __do_fault() · 419d8c96

Michel Lespinasse authored Feb 10, 2011

If the page is going to be written to, __do_page needs to break COW.

However, the old page (before breaking COW) was never mapped mapped into
the current pte (__do_fault is only called when the pte is not present),
so vmscan can't have marked the old page as PageMlocked due to being
mapped in __do_fault's VMA.  Therefore, __do_fault() does not need to
worry about clearing PageMlocked() on the old page.
Signed-off-by: Michel Lespinasse <walken@google.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

419d8c96

mlock: fix race when munlocking pages in do_wp_page() · e15f8c01

Michel Lespinasse authored Feb 10, 2011

vmscan can lazily find pages that are mapped within VM_LOCKED vmas, and
set the PageMlocked bit on these pages, transfering them onto the
unevictable list.  When do_wp_page() breaks COW within a VM_LOCKED vma,
it may need to clear PageMlocked on the old page and set it on the new
page instead.

This change fixes an issue where do_wp_page() was clearing PageMlocked
on the old page while the pte was still pointing to it (as well as
rmap).  Therefore, we were not protected against vmscan immediately
transfering the old page back onto the unevictable list.  This could
cause pages to get stranded there forever.

I propose to move the corresponding code to the end of do_wp_page(),
after the pte (and rmap) have been pointed to the new page.
Additionally, we can use munlock_vma_page() instead of
clear_page_mlock(), so that the old page stays mlocked if there are
still other VM_LOCKED vmas mapping it.
Signed-off-by: Michel Lespinasse <walken@google.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e15f8c01

memblock: don't adjust size in memblock_find_base() · e6d2e2b2

Yinghai Lu authored Feb 10, 2011

While applying patch to use memblock to find aperture for 64bit x86.
Ingo found system with 1g + force_iommu

> No AGP bridge found
> Node 0: aperture @ 38000000 size 32 MB
> Aperture pointing to e820 RAM. Ignoring.
> Your BIOS doesn't leave a aperture memory hole
> Please enable the IOMMU option in the BIOS setup
> This costs you 64 MB of RAM
> Cannot allocate aperture memory hole (0,65536K)

the corresponding code:

	addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
	if (addr == MEMBLOCK_ERROR || addr + aper_size > 0xffffffff) {
		printk(KERN_ERR
			"Cannot allocate aperture memory hole (%lx,%uK)\n",
				addr, aper_size>>10);
		return 0;
	}
	memblock_x86_reserve_range(addr, addr + aper_size, "aperture64")

fails because memblock core code align the size with 512M.  That could
make size way too big.

So don't align the size in that case.

actually __memblock_alloc_base, the another caller already align that
before calling that function.

BTW. x86 does not use __memblock_alloc_base...
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e6d2e2b2

nbd: remove module-level ioctl mutex · de1f016f

Soren Hansen authored Feb 10, 2011

Commit 2a48fc0a ("block: autoconvert trivial BKL users to private
mutex") replaced uses of the BKL in the nbd driver with mutex
operations.  Since then, I've been been seeing these lock ups:

 INFO: task qemu-nbd:16115 blocked for more than 120 seconds.
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 qemu-nbd      D 0000000000000001     0 16115  16114 0x00000004
  ffff88007d775d98 0000000000000082 ffff88007d775fd8 ffff88007d774000
  0000000000013a80 ffff8800020347e0 ffff88007d775fd8 0000000000013a80
  ffff880133730000 ffff880002034440 ffffea0004333db8 ffffffffa071c020
 Call Trace:
  [<ffffffff815b9997>] __mutex_lock_slowpath+0xf7/0x180
  [<ffffffff815b93eb>] mutex_lock+0x2b/0x50
  [<ffffffffa071a21c>] nbd_ioctl+0x6c/0x1c0 [nbd]
  [<ffffffff812cb970>] blkdev_ioctl+0x230/0x730
  [<ffffffff811967a1>] block_ioctl+0x41/0x50
  [<ffffffff81175c03>] do_vfs_ioctl+0x93/0x370
  [<ffffffff81175f61>] sys_ioctl+0x81/0xa0
  [<ffffffff8100c0c2>] system_call_fastpath+0x16/0x1b

Instrumenting the nbd module's ioctl handler with some extra logging
clearly shows the NBD_DO_IT ioctl being invoked which is a long-lived
ioctl in the sense that it doesn't return until another ioctl asks the
driver to disconnect.  However, that other ioctl blocks, waiting for the
module-level mutex that replaced the BKL, and then we're stuck.

This patch removes the module-level mutex altogether.  It's clearly
wrong, and as far as I can see, it's entirely unnecessary, since the nbd
driver maintains per-device mutexes, and I don't see anything that would
require a module-level (or kernel-level, for that matter) mutex.
Signed-off-by: Soren Hansen <soren@linux2go.dk>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Paul Clements <paul.clements@steeleye.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@kernel.org>		[2.6.37.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

de1f016f

drivers/rtc/rtc-proc.c: add module_put on error path in rtc_proc_open() · 24a6f5b8

Alexander Strakh authored Feb 10, 2011

In file drivers/rtc/rtc-proc.c seq_open() can return -ENOMEM.

 86        if (!try_module_get(THIS_MODULE))
 87                return -ENODEV;
 88
 89        return single_open(file, rtc_proc_show, rtc);

In this case before exiting (line 89) from rtc_proc_open the
module_put(THIS_MODULE) must be called.

Found by Linux Device Drivers Verification Project
Signed-off-by: Alexander Strakh <strakh@ispras.ru>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

24a6f5b8

drivers/gpio/pca953x.c: add a mutex to fix race condition · 6e20fb18

Roland Stigge authored Feb 10, 2011

Add a mutex to register communication and handling.  Without the mutex,
GPIOs didn't switch as expected when toggled in a fast sequence of
status changes of multiple outputs.
Signed-off-by: Roland Stigge <stigge@antcom.de>
Acked-by: Eric Miao <eric.y.miao@gmail.com>
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: Marc Zyngier <maz@misterjones.org>
Cc: Ben Gardner <bgardner@wabtec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6e20fb18

ptrace: use safer wake up on ptrace_detach() · 01e05e9a

Tejun Heo authored Feb 10, 2011

The wake_up_process() call in ptrace_detach() is spurious and not
interlocked with the tracee state.  IOW, the tracee could be running or
sleeping in any place in the kernel by the time wake_up_process() is
called.  This can lead to the tracee waking up unexpectedly which can be
dangerous.

The wake_up is spurious and should be removed but for now reduce its
toxicity by only waking up if the tracee is in TRACED or STOPPED state.

This bug can possibly be used as an attack vector.  I don't think it
will take too much effort to come up with an attack which triggers oops
somewhere.  Most sleeps are wrapped in condition test loops and should
be safe but we have quite a number of places where sleep and wakeup
conditions are expected to be interlocked.  Although the window of
opportunity is tiny, ptrace can be used by non-privileged users and with
some loading the window can definitely be extended and exploited.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

01e05e9a

vfs: call rcu_barrier after ->kill_sb() · d863b50a

Boaz Harrosh authored Feb 10, 2011

In commit fa0d7e3d ("fs: icache RCU free inodes"), we use rcu free
inode instead of freeing the inode directly.  It causes a crash when we
rmmod immediately after we umount the volume[1].

So we need to call rcu_barrier after we kill_sb so that the inode is
freed before we do rmmod.  The idea is inspired by Aneesh Kumar.
rcu_barrier will wait for all callbacks to end before preceding.  The
original patch was done by Tao Ma, but synchronize_rcu() is not enough
here.

1. http://marc.info/?l=linux-fsdevel&m=129680863330185&w=2Tested-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d863b50a