- 01 Mar, 2013 2 commits
-
-
Steven Noonan authored
Adding an include of linux/mm.h resolves this: drivers/xen/xenbus/xenbus_client.c: In function ‘xenbus_map_ring_valloc_hvm’: drivers/xen/xenbus/xenbus_client.c:532:66: error: implicit declaration of function ‘page_to_section’ [-Werror=implicit-function-declaration] CC: stable@vger.kernel.org Signed-off-by: Steven Noonan <steven@uplinklabs.net> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
-
Konrad Rzeszutek Wilk authored
There is no hypercall to setup multiple MSI per PCI device. As such with these two new commits: - 08261d87 PCI/MSI: Enable multiple MSIs with pci_enable_msi_block_auto() - 5ca72c4f AHCI: Support multiple MSIs we would call the PHYSDEVOP_map_pirq 'nvec' times with the same contents of the PCI device. Sander discovered that we would get the same PIRQ value 'nvec' times and return said values to the caller. That of course meant that the device was configured only with one MSI and AHCI would fail with: ahci 0000:00:11.0: version 3.0 xen: registering gsi 19 triggering 0 polarity 1 xen: --> pirq=19 -> irq=19 (gsi=19) (XEN) [2013-02-27 19:43:07] IOAPIC[0]: Set PCI routing entry (6-19 -> 0x99 -> IRQ 19 Mode:1 Active:1) ahci 0000:00:11.0: AHCI 0001.0200 32 slots 4 ports 6 Gbps 0xf impl SATA mode ahci 0000:00:11.0: flags: 64bit ncq sntf ilck pm led clo pmp pio slum part ahci: probe of 0000:00:11.0 failed with error -22 That is b/c in ahci_host_activate the second call to devm_request_threaded_irq would return -EINVAL as we passed in (on the second run) an IRQ that was never initialized. CC: stable@vger.kernel.org Reported-and-Tested-by: Sander Eikelenboom <linux@eikelenboom.it> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
-
- 28 Feb, 2013 1 commit
-
-
Konrad Rzeszutek Wilk authored
The git commit 8eaffa67 (xen/pat: Disable PAT support for now) explains in details why we want to disable PAT for right now. However that change was not enough and we should have also disabled the pat_enabled value. Otherwise we end up with: mmap-example:3481 map pfn expected mapping type write-back for [mem 0x00010000-0x00010fff], got uncached-minus ------------[ cut here ]------------ WARNING: at /build/buildd/linux-3.8.0/arch/x86/mm/pat.c:774 untrack_pfn+0xb8/0xd0() mem 0x00010000-0x00010fff], got uncached-minus ------------[ cut here ]------------ WARNING: at /build/buildd/linux-3.8.0/arch/x86/mm/pat.c:774 untrack_pfn+0xb8/0xd0() ... Pid: 3481, comm: mmap-example Tainted: GF 3.8.0-6-generic #13-Ubuntu Call Trace: [<ffffffff8105879f>] warn_slowpath_common+0x7f/0xc0 [<ffffffff810587fa>] warn_slowpath_null+0x1a/0x20 [<ffffffff8104bcc8>] untrack_pfn+0xb8/0xd0 [<ffffffff81156c1c>] unmap_single_vma+0xac/0x100 [<ffffffff81157459>] unmap_vmas+0x49/0x90 [<ffffffff8115f808>] exit_mmap+0x98/0x170 [<ffffffff810559a4>] mmput+0x64/0x100 [<ffffffff810560f5>] dup_mm+0x445/0x660 [<ffffffff81056d9f>] copy_process.part.22+0xa5f/0x1510 [<ffffffff81057931>] do_fork+0x91/0x350 [<ffffffff81057c76>] sys_clone+0x16/0x20 [<ffffffff816ccbf9>] stub_clone+0x69/0x90 [<ffffffff816cc89d>] ? system_call_fastpath+0x1a/0x1f ---[ end trace 4918cdd0a4c9fea4 ]--- (a similar message shows up if you end up launching 'mcelog') The call chain is (as analyzed by Liu, Jinsong): do_fork --> copy_process --> dup_mm --> dup_mmap --> copy_page_range --> track_pfn_copy --> reserve_pfn_range --> line 624: flags != want_flags It comes from different memory types of page table (_PAGE_CACHE_WB) and MTRR (_PAGE_CACHE_UC_MINUS). Stefan Bader dug in this deep and found out that: "That makes it clearer as this will do reserve_memtype(...) --> pat_x_mtrr_type --> mtrr_type_lookup --> __mtrr_type_lookup And that can return -1/0xff in case of MTRR not being enabled/initialized. Which is not the case (given there are no messages for it in dmesg). This is not equal to MTRR_TYPE_WRBACK and thus becomes _PAGE_CACHE_UC_MINUS. It looks like the problem starts early in reserve_memtype: if (!pat_enabled) { /* This is identical to page table setting without PAT */ if (new_type) { if (req_type == _PAGE_CACHE_WC) *new_type = _PAGE_CACHE_UC_MINUS; else *new_type = req_type & _PAGE_CACHE_MASK; } return 0; } This would be what we want, that is clearing the PWT and PCD flags from the supported flags - if pat_enabled is disabled." This patch does that - disabling PAT. CC: stable@vger.kernel.org # 3.3 and further Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Reported-and-Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reported-and-Tested-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
-
- 25 Feb, 2013 3 commits
-
-
Liu Jinsong authored
Recently at native Rafael did some cleanup for acpi, say, drop acpi_bus_add, remove unnecessary argument of acpi_bus_scan, and run acpi_bus_scan under acpi_scan_lock. This patch does similar cleanup for xen cpu hotplug, removing redundant logic, and adding lock. Signed-off-by: Liu Jinsong <jinsong.liu@intel.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
-
Liu Jinsong authored
Dan Carpenter found current xen memory hotplug logic has potential issue: at func acpi_memory_get_device() *mem_device = acpi_driver_data(device); while the device may be NULL and then dereference. At native side, Rafael recently updated acpi_memory_get_device(), dropping acpi_bus_add, adding lock, and avoiding above issue. This patch updates xen memory hotplug logic accordingly, removing redundant logic, adding lock, and avoiding dereference. Signed-off-by: Liu Jinsong <jinsong.liu@intel.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
-
Linus Torvalds authored
Merge tag 'stable/for-linus-3.9-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen Pull Xen update from Konrad Rzeszutek Wilk: "This has two new ACPI drivers for Xen - a physical CPU offline/online and a memory hotplug. The way this works is that ACPI kicks the drivers and they make the appropiate hypercall to the hypervisor to tell it that there is a new CPU or memory. There also some changes to the Xen ARM ABIs and couple of fixes. One particularly nasty bug in the Xen PV spinlock code was fixed by Stefan Bader - and has been there since the 2.6.32! Features: - Xen ACPI memory and CPU hotplug drivers - allowing Xen hypervisor to be aware of new CPU and new DIMMs - Cleanups Bug-fixes: - Fixes a long-standing bug in the PV spinlock wherein we did not kick VCPUs that were in a tight loop. - Fixes in the error paths for the event channel machinery" Fix up a few semantic conflicts with the ACPI interface changes in drivers/xen/xen-acpi-{cpu,mem}hotplug.c. * tag 'stable/for-linus-3.9-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen: event channel arrays are xen_ulong_t and not unsigned long xen: Send spinlock IPI to all waiters xen: introduce xen_remap, use it instead of ioremap xen: close evtchn port if binding to irq fails xen-evtchn: correct comment and error output xen/tmem: Add missing %s in the printk statement. xen/acpi: move xen_acpi_get_pxm under CONFIG_XEN_DOM0 xen/acpi: ACPI cpu hotplug xen/acpi: Move xen_acpi_get_pxm to Xen's acpi.h xen/stub: driver for CPU hotplug xen/acpi: ACPI memory hotplug xen/stub: driver for memory hotplug xen: implement updated XENMEM_add_to_physmap_range ABI xen/smp: Move the common CPU init code a bit to prep for PVH patch.
-
- 24 Feb, 2013 34 commits
-
-
git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds authored
Pull KVM updates from Marcelo Tosatti: "KVM updates for the 3.9 merge window, including x86 real mode emulation fixes, stronger memory slot interface restrictions, mmu_lock spinlock hold time reduction, improved handling of large page faults on shadow, initial APICv HW acceleration support, s390 channel IO based virtio, amongst others" * tag 'kvm-3.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (143 commits) Revert "KVM: MMU: lazily drop large spte" x86: pvclock kvm: align allocation size to page size KVM: nVMX: Remove redundant get_vmcs12 from nested_vmx_exit_handled_msr x86 emulator: fix parity calculation for AAD instruction KVM: PPC: BookE: Handle alignment interrupts booke: Added DBCR4 SPR number KVM: PPC: booke: Allow multiple exception types KVM: PPC: booke: use vcpu reference from thread_struct KVM: Remove user_alloc from struct kvm_memory_slot KVM: VMX: disable apicv by default KVM: s390: Fix handling of iscs. KVM: MMU: cleanup __direct_map KVM: MMU: remove pt_access in mmu_set_spte KVM: MMU: cleanup mapping-level KVM: MMU: lazily drop large spte KVM: VMX: cleanup vmx_set_cr0(). KVM: VMX: add missing exit names to VMX_EXIT_REASONS array KVM: VMX: disable SMEP feature when guest is in non-paging mode KVM: Remove duplicate text in api.txt Revert "KVM: MMU: split kvm_mmu_free_page" ...
-
git://git.kernel.org/pub/scm/linux/kernel/git/viro/signalLinus Torvalds authored
Pull signal handling cleanups from Al Viro: "This is the first pile; another one will come a bit later and will contain SYSCALL_DEFINE-related patches. - a bunch of signal-related syscalls (both native and compat) unified. - a bunch of compat syscalls switched to COMPAT_SYSCALL_DEFINE (fixing several potential problems with missing argument validation, while we are at it) - a lot of now-pointless wrappers killed - a couple of architectures (cris and hexagon) forgot to save altstack settings into sigframe, even though they used the (uninitialized) values in sigreturn; fixed. - microblaze fixes for delivery of multiple signals arriving at once - saner set of helpers for signal delivery introduced, several architectures switched to using those." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (143 commits) x86: convert to ksignal sparc: convert to ksignal arm: switch to struct ksignal * passing alpha: pass k_sigaction and siginfo_t using ksignal pointer burying unused conditionals make do_sigaltstack() static arm64: switch to generic old sigaction() (compat-only) arm64: switch to generic compat rt_sigaction() arm64: switch compat to generic old sigsuspend arm64: switch to generic compat rt_sigqueueinfo() arm64: switch to generic compat rt_sigpending() arm64: switch to generic compat rt_sigprocmask() arm64: switch to generic sigaltstack sparc: switch to generic old sigsuspend sparc: COMPAT_SYSCALL_DEFINE does all sign-extension as well as SYSCALL_DEFINE sparc: kill sign-extending wrappers for native syscalls kill sparc32_open() sparc: switch to use of generic old sigaction sparc: switch sys_compat_rt_sigaction() to COMPAT_SYSCALL_DEFINE mips: switch to generic sys_fork() and sys_clone() ...
-
Linus Torvalds authored
Merge second patch-bomb from Andrew Morton: - A little DM fix - the MM queue * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (154 commits) ksm: allocate roots when needed mm: cleanup "swapcache" in do_swap_page mm,ksm: swapoff might need to copy mm,ksm: FOLL_MIGRATION do migration_entry_wait ksm: shrink 32-bit rmap_item back to 32 bytes ksm: treat unstable nid like in stable tree ksm: add some comments tmpfs: fix mempolicy object leaks tmpfs: fix use-after-free of mempolicy object mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages mm: export mmu notifier invalidates mm: accelerate mm_populate() treatment of THP pages mm: use long type for page counts in mm_populate() and get_user_pages() mm: accurately document nr_free_*_pages functions with code comments HWPOISON: change order of error_states[]'s elements HWPOISON: fix misjudgement of page_action() for errors on mlocked pages memcg: stop warning on memcg_propagate_kmem net: change type of virtio_chan->p9_max_pages vmscan: change type of vm_total_pages to unsigned long fs/nfsd: change type of max_delegations, nfsd_drc_max_mem and nfsd_drc_mem_used ...
-
Hugh Dickins authored
It is a pity to have MAX_NUMNODES+MAX_NUMNODES tree roots statically allocated, particularly when very few users will ever actually tune merge_across_nodes 0 to use more than 1+1 of those trees. Not a big deal (only 16kB wasted on each machine with CONFIG_MAXSMP), but a pity. Start off with 1+1 statically allocated, then if merge_across_nodes is ever tuned, allocate for nr_node_ids+nr_node_ids. Do not attempt to free up the extra if it's tuned back, that would be a waste of effort. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Hugh Dickins authored
I dislike the way in which "swapcache" gets used in do_swap_page(): there is always a page from swapcache there (even if maybe uncached by the time we lock it), but tests are made according to "swapcache". Rework that with "page != swapcache", as has been done in unuse_pte(). Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Hugh Dickins authored
Before establishing that KSM page migration was the cause of my WARN_ON_ONCE(page_mapped(page))s, I suspected that they came from the lack of a ksm_might_need_to_copy() in swapoff's unuse_pte() - which in many respects is equivalent to faulting in a page. In fact I've never caught that as the cause: but in theory it does at least need the KSM_RUN_UNMERGE check in ksm_might_need_to_copy(), to avoid bringing a KSM page back in when it's not supposed to be. I intended to copy how it's done in do_swap_page(), but have a strong aversion to how "swapcache" ends up being used there: rework it with "page != swapcache". Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Hugh Dickins authored
In "ksm: remove old stable nodes more thoroughly" I said that I'd never seen its WARN_ON_ONCE(page_mapped(page)). True at the time of writing, but it soon appeared once I tried fuller tests on the whole series. It turned out to be due to the KSM page migration itself: unmerge_and_ remove_all_rmap_items() failed to locate and replace all the KSM pages, because of that hiatus in page migration when old pte has been replaced by migration entry, but not yet by new pte. follow_page() finds no page at that instant, but a KSM page reappears shortly after, without a fault. Add FOLL_MIGRATION flag, so follow_page() can do migration_entry_wait() for KSM's break_cow(). I'd have preferred to avoid another flag, and do it every time, in case someone else makes the same easy mistake; but did not find another transgressor (the common get_user_pages() is of course safe), and cannot be sure that every follow_page() caller is prepared to sleep - ia64's xencomm_vtop()? Now, THP's wait_split_huge_page() can already sleep there, since anon_vma locking was changed to mutex, but maybe that's somehow excluded. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Hugh Dickins authored
Think of struct rmap_item as an extension of struct page (restricted to MADV_MERGEABLE areas): there may be a lot of them, we need to keep them small, especially on 32-bit architectures of limited lowmem. Siting "int nid" after "unsigned int checksum" works nicely on 64-bit, making no change to its 64-byte struct rmap_item; but bloats the 32-bit struct rmap_item from (nicely cache-aligned) 32 bytes to 36 bytes, which rounds up to 40 bytes once allocated from slab. We'd better avoid that. Hey, I only just remembered that the anon_vma pointer in struct rmap_item has no purpose until the rmap_item is hung from a stable tree node (which has its own nid field); and rmap_item's nid field no purpose than to say which tree root to tell rb_erase() when unlinking from an unstable tree. Double them up in a union. There's just one place where we set anon_vma early (when we already hold mmap_sem): now we must remove tree_rmap_item from its unstable tree there, before overwriting nid. No need to spatter BUG()s around: we'd be seeing oopses if this were wrong. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Hugh Dickins authored
An inconsistency emerged in reviewing the NUMA node changes to KSM: when meeting a page from the wrong NUMA node in a stable tree, we say that it's okay for comparisons, but not as a leaf for merging; whereas when meeting a page from the wrong NUMA node in an unstable tree, we bail out immediately. Now, it might be that a wrong NUMA node in an unstable tree is more likely to correlate with instablility (different content, with rbnode now misplaced) than page migration; but even so, we are accustomed to instablility in the unstable tree. Without strong evidence for which strategy is generally better, I'd rather be consistent with what's done in the stable tree: accept a page from the wrong NUMA node for comparison, but not as a leaf for merging. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Hugh Dickins authored
Added slightly more detail to the Documentation of merge_across_nodes, a few comments in areas indicated by review, and renamed get_ksm_page()'s argument from "locked" to "lock_it". No functional change. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Greg Thelen authored
Fix several mempolicy leaks in the tmpfs mount logic. These leaks are slow - on the order of one object leaked per mount attempt. Leak 1 (umount doesn't free mpol allocated in mount): while true; do mount -t tmpfs -o mpol=interleave,size=100M nodev /mnt umount /mnt done Leak 2 (errors parsing remount options will leak mpol): mount -t tmpfs -o size=100M nodev /mnt while true; do mount -o remount,mpol=interleave,size=x /mnt 2> /dev/null done umount /mnt Leak 3 (multiple mpol per mount leak mpol): while true; do mount -t tmpfs -o mpol=interleave,mpol=interleave,size=100M nodev /mnt umount /mnt done This patch fixes all of the above. I could have broken the patch into three pieces but is seemed easier to review as one. [akpm@linux-foundation.org: fix handling of mpol_parse_str() errors, per Hugh] Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Greg Thelen authored
The tmpfs remount logic preserves filesystem mempolicy if the mpol=M option is not specified in the remount request. A new policy can be specified if mpol=M is given. Before this patch remounting an mpol bound tmpfs without specifying mpol= mount option in the remount request would set the filesystem's mempolicy object to a freed mempolicy object. To reproduce the problem boot a DEBUG_PAGEALLOC kernel and run: # mkdir /tmp/x # mount -t tmpfs -o size=100M,mpol=interleave nodev /tmp/x # grep /tmp/x /proc/mounts nodev /tmp/x tmpfs rw,relatime,size=102400k,mpol=interleave:0-3 0 0 # mount -o remount,size=200M nodev /tmp/x # grep /tmp/x /proc/mounts nodev /tmp/x tmpfs rw,relatime,size=204800k,mpol=??? 0 0 # note ? garbage in mpol=... output above # dd if=/dev/zero of=/tmp/x/f count=1 # panic here Panic: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [< (null)>] (null) [...] Oops: 0010 [#1] SMP DEBUG_PAGEALLOC Call Trace: mpol_shared_policy_init+0xa5/0x160 shmem_get_inode+0x209/0x270 shmem_mknod+0x3e/0xf0 shmem_create+0x18/0x20 vfs_create+0xb5/0x130 do_last+0x9a1/0xea0 path_openat+0xb3/0x4d0 do_filp_open+0x42/0xa0 do_sys_open+0xfe/0x1e0 compat_sys_open+0x1b/0x20 cstar_dispatch+0x7/0x1f Non-debug kernels will not crash immediately because referencing the dangling mpol will not cause a fault. Instead the filesystem will reference a freed mempolicy object, which will cause unpredictable behavior. The problem boils down to a dropped mpol reference below if shmem_parse_options() does not allocate a new mpol: config = *sbinfo shmem_parse_options(data, &config, true) mpol_put(sbinfo->mpol) sbinfo->mpol = config.mpol /* BUG: saves unreferenced mpol */ This patch avoids the crash by not releasing the mempolicy if shmem_parse_options() doesn't create a new mpol. How far back does this issue go? I see it in both 2.6.36 and 3.3. I did not look back further. Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Mel Gorman authored
Rob van der Heij reported the following (paraphrased) on private mail. The scenario is that I want to avoid backups to fill up the page cache and purge stuff that is more likely to be used again (this is with s390x Linux on z/VM, so I don't give it as much memory that we don't care anymore). So I have something with LD_PRELOAD that intercepts the close() call (from tar, in this case) and issues a posix_fadvise() just before closing the file. This mostly works, except for small files (less than 14 pages) that remains in page cache after the face. Unfortunately Rob has not had a chance to test this exact patch but the test program below should be reproducing the problem he described. The issue is the per-cpu pagevecs for LRU additions. If the pages are added by one CPU but fadvise() is called on another then the pages remain resident as the invalidate_mapping_pages() only drains the local pagevecs via its call to pagevec_release(). The user-visible effect is that a program that uses fadvise() properly is not obeyed. A possible fix for this is to put the necessary smarts into invalidate_mapping_pages() to globally drain the LRU pagevecs if a pagevec page could not be discarded. The downside with this is that an inode cache shrink would send a global IPI and memory pressure potentially causing global IPI storms is very undesirable. Instead, this patch adds a check during fadvise(POSIX_FADV_DONTNEED) to check if invalidate_mapping_pages() discarded all the requested pages. If a subset of pages are discarded it drains the LRU pagevecs and tries again. If the second attempt fails, it assumes it is due to the pages being mapped, locked or dirty and does not care. With this patch, an application using fadvise() correctly will be obeyed but there is a downside that a malicious application can force the kernel to send global IPIs and increase overhead. If accepted, I would like this to be considered as a -stable candidate. It's not an urgent issue but it's a system call that is not working as advertised which is weak. The following test program demonstrates the problem. It should never report that pages are still resident but will without this patch. It assumes that CPU 0 and 1 exist. int main() { int fd; int pagesize = getpagesize(); ssize_t written = 0, expected; char *buf; unsigned char *vec; int resident, i; cpu_set_t set; /* Prepare a buffer for writing */ expected = FILESIZE_PAGES * pagesize; buf = malloc(expected + 1); if (buf == NULL) { printf("ENOMEM\n"); exit(EXIT_FAILURE); } buf[expected] = 0; memset(buf, 'a', expected); /* Prepare the mincore vec */ vec = malloc(FILESIZE_PAGES); if (vec == NULL) { printf("ENOMEM\n"); exit(EXIT_FAILURE); } /* Bind ourselves to CPU 0 */ CPU_ZERO(&set); CPU_SET(0, &set); if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) { perror("sched_setaffinity"); exit(EXIT_FAILURE); } /* open file, unlink and write buffer */ fd = open("fadvise-test-file", O_CREAT|O_EXCL|O_RDWR); if (fd == -1) { perror("open"); exit(EXIT_FAILURE); } unlink("fadvise-test-file"); while (written < expected) { ssize_t this_write; this_write = write(fd, buf + written, expected - written); if (this_write == -1) { perror("write"); exit(EXIT_FAILURE); } written += this_write; } free(buf); /* * Force ourselves to another CPU. If fadvise only flushes the local * CPUs pagevecs then the fadvise will fail to discard all file pages */ CPU_ZERO(&set); CPU_SET(1, &set); if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) { perror("sched_setaffinity"); exit(EXIT_FAILURE); } /* sync and fadvise to discard the page cache */ fsync(fd); if (posix_fadvise(fd, 0, expected, POSIX_FADV_DONTNEED) == -1) { perror("posix_fadvise"); exit(EXIT_FAILURE); } /* map the file and use mincore to see which parts of it are resident */ buf = mmap(NULL, expected, PROT_READ, MAP_SHARED, fd, 0); if (buf == NULL) { perror("mmap"); exit(EXIT_FAILURE); } if (mincore(buf, expected, vec) == -1) { perror("mincore"); exit(EXIT_FAILURE); } /* Check residency */ for (i = 0, resident = 0; i < FILESIZE_PAGES; i++) { if (vec[i]) resident++; } if (resident != 0) { printf("Nr unexpected pages resident: %d\n", resident); exit(EXIT_FAILURE); } munmap(buf, expected); close(fd); free(vec); exit(EXIT_SUCCESS); } Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Rob van der Heij <rvdheij@gmail.com> Tested-by: Rob van der Heij <rvdheij@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Cliff Wickman authored
We at SGI have a need to address some very high physical address ranges with our GRU (global reference unit), sometimes across partitioned machine boundaries and sometimes with larger addresses than the cpu supports. We do this with the aid of our own 'extended vma' module which mimics the vma. When something (either unmap or exit) frees an 'extended vma' we use the mmu notifiers to clean them up. We had been able to mimic the functions __mmu_notifier_invalidate_range_start() and __mmu_notifier_invalidate_range_end() by locking the per-mm lock and walking the per-mm notifier list. But with the change to a global srcu lock (static in mmu_notifier.c) we can no longer do that. Our module has no access to that lock. So we request that these two functions be exported. Signed-off-by: Cliff Wickman <cpw@sgi.com> Acked-by: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Michel Lespinasse authored
This change adds a follow_page_mask function which is equivalent to follow_page, but with an extra page_mask argument. follow_page_mask sets *page_mask to HPAGE_PMD_NR - 1 when it encounters a THP page, and to 0 in other cases. __get_user_pages() makes use of this in order to accelerate populating THP ranges - that is, when both the pages and vmas arrays are NULL, we don't need to iterate HPAGE_PMD_NR times to cover a single THP page (and we also avoid taking mm->page_table_lock that many times). Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Michel Lespinasse authored
Use long type for page counts in mm_populate() so as to avoid integer overflow when running the following test code: int main(void) { void *p = mmap(NULL, 0x100000000000, PROT_READ, MAP_PRIVATE | MAP_ANON, -1, 0); printf("p: %p\n", p); mlockall(MCL_CURRENT); printf("done\n"); return 0; } Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Zhang Yanfei authored
nr_free_zone_pages(), nr_free_buffer_pages() and nr_free_pagecache_pages() are horribly badly named, so accurately document them with code comments in case of the misuse of them. [akpm@linux-foundation.org: tweak comments] Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Naoya Horiguchi authored
error_states[] has two separate states "unevictable LRU page" and "mlocked LRU page", and the former one has the higher priority now. But because of that the latter one is rarely chosen because pages with PageMlocked highly likely have PG_unevictable set. On the other hand, PG_unevictable without PageMlocked is common for ramfs or SHM_LOCKed shared memory, so reversing the priority of these two states helps us clearly distinguish them. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Chen Gong <gong.chen@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Naoya Horiguchi authored
memory_failure() can't handle memory errors on mlocked pages correctly, because page_action() judges such errors as ones on "unknown pages" instead of ones on "unevictable LRU page" or "mlocked LRU page". In order to determine page_state page_action() checks page flags at the timing of the judgement, but such page flags are not the same with those just after memory_failure() is called, because memory_failure() does unmapping of the error pages before doing page_action(). This unmapping changes the page state, especially page_remove_rmap() (called from try_to_unmap_one()) clears PG_mlocked, so page_action() can't catch mlocked pages after that. With this patch, we store the page flag of the error page before doing unmap, and (only) if the first check with page flags at the time decided the error page is unknown, we do the second check with the stored page flag. This implementation doesn't change error handling for the page types for which the first check can determine the page state correctly. [akpm@linux-foundation.org: tweak comments] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Chen Gong <gong.chen@linux.intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Hugh Dickins authored
Whilst I run the risk of a flogging for disloyalty to the Lord of Sealand, I do have CONFIG_MEMCG=y CONFIG_MEMCG_KMEM not set, and grow tired of the "mm/memcontrol.c:4972:12: warning: `memcg_propagate_kmem' defined but not used [-Wunused-function]" seen in 3.8-rc: move the #ifdef outwards. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Zhang Yanfei authored
This member of struct virtio_chan is calculated from nr_free_buffer_pages so change its type to unsigned long in case of overflow. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: David Miller <davem@davemloft.net> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Zhang Yanfei authored
This variable is calculated from nr_free_pagecache_pages so change its type to unsigned long. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Zhang Yanfei authored
The three variables are calculated from nr_free_buffer_pages so change their types to unsigned long in case of overflow. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Zhang Yanfei authored
max_buffer_heads is calculated from nr_free_buffer_pages(), so change its type to unsigned long in case of overflow. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Zhang Yanfei authored
Now the function nr_free_buffer_pages returns unsigned long, so use %ld to print its return value. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Zhang Yanfei authored
Currently, the amount of RAM that functions nr_free_*_pages return is held in unsigned int. But in machines with big memory (exceeding 16TB), the amount may be incorrect because of overflow, so fix it. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Simon Horman <horms@verge.net.au> Cc: Julian Anastasov <ja@ssi.bg> Cc: David Miller <davem@davemloft.net> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Michal Hocko authored
We should encourage all memcg controller initialization independent on a specific mem_cgroup to be done here rather than exploit css_alloc callback and assume that nothing happens before root cgroup is created. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <htejun@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Michal Hocko authored
memcg_stock are currently initialized during the root cgroup allocation which is OK but it pointlessly pollutes memcg allocation code with something that can be called when the memcg subsystem is initialized by mem_cgroup_init along with other controller specific parts. This patch wraps the current memcg_stock initialization code into a helper calls it from the controller subsystem initialization code. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <htejun@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Michal Hocko authored
Per-node-zone soft limit tree is currently initialized when the root cgroup is created which is OK but it pointlessly pollutes memcg allocation code with something that can be called when the memcg subsystem is initialized by mem_cgroup_init along with other controller specific parts. While we are at it let's make mem_cgroup_soft_limit_tree_init void because it doesn't make much sense to report memory failure because if we fail to allocate memory that early during the boot then we are screwed anyway (this saves some code). Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <htejun@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Minchan Kim authored
Recently, Luigi reported there are lots of free swap space when OOM happens. It's easily reproduced on zram-over-swap, where many instance of memory hogs are running and laptop_mode is enabled. He said there was no problem when he disabled laptop_mode. The problem when I investigate problem is following as. Assumption for easy explanation: There are no page cache page in system because they all are already reclaimed. 1. try_to_free_pages disable may_writepage when laptop_mode is enabled. 2. shrink_inactive_list isolates victim pages from inactive anon lru list. 3. shrink_page_list adds them to swapcache via add_to_swap but it doesn't pageout because sc->may_writepage is 0 so the page is rotated back into inactive anon lru list. The add_to_swap made the page Dirty by SetPageDirty. 4. 3 couldn't reclaim any pages so do_try_to_free_pages increase priority and retry reclaim with higher priority. 5. shrink_inactlive_list try to isolate victim pages from inactive anon lru list but got failed because it try to isolate pages with ISOLATE_CLEAN mode but inactive anon lru list is full of dirty pages by 3 so it just returns without any reclaim progress. 6. do_try_to_free_pages doesn't set may_writepage due to zero total_scanned. Because sc->nr_scanned is increased by shrink_page_list but we don't call shrink_page_list in 5 due to short of isolated pages. Above loop is continued until OOM happens. The problem didn't happen before [1] was merged because old logic's isolatation in shrink_inactive_list was successful and tried to call shrink_page_list to pageout them but it still ends up failed to page out by may_writepage. But important point is that sc->nr_scanned was increased although we couldn't swap out them so do_try_to_free_pages could set may_writepages. Since commit f80c0673 ("mm: zone_reclaim: make isolate_lru_page() filter-aware") was introduced, it's not a good idea any more to depends on only the number of scanned pages for setting may_writepage. So this patch adds new trigger point of setting may_writepage as below DEF_PRIOIRTY - 2 which is used to show the significant memory pressure in VM so it's good fit for our purpose which would be better to lose power saving or clickety rather than OOM killing. Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Luigi Semenzato <semenzato@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
David Rientjes authored
Make a sweep through mm/ and convert code that uses -1 directly to using the more appropriate NUMA_NO_NODE. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Robin Holt authored
There is a race condition between mmu_notifier_unregister() and __mmu_notifier_release(). Assume two tasks, one calling mmu_notifier_unregister() as a result of a filp_close() ->flush() callout (task A), and the other calling mmu_notifier_release() from an mmput() (task B). A B t1 srcu_read_lock() t2 if (!hlist_unhashed()) t3 srcu_read_unlock() t4 srcu_read_lock() t5 hlist_del_init_rcu() t6 synchronize_srcu() t7 srcu_read_unlock() t8 hlist_del_rcu() <--- NULL pointer deref. Additionally, the list traversal in __mmu_notifier_release() is not protected by the by the mmu_notifier_mm->hlist_lock which can result in callouts to the ->release() notifier from both mmu_notifier_unregister() and __mmu_notifier_release(). -stable suggestions: The stable trees prior to 3.7.y need commits 21a92735 and 70400303 cherry-picked in that order prior to cherry-picking this commit. The 3.7.y tree already has those two commits. Signed-off-by: Robin Holt <holt@sgi.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Avi Kivity <avi@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Sagi Grimberg <sagig@mellanox.co.il> Cc: Haggai Eran <haggaie@mellanox.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Cody P Schafer authored
Replace open coded pgdat_end_pfn() with helper function. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Hansen <dave@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Cody P Schafer authored
Remove open coding of ensure_zone_is_initialzied(). Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Hansen <dave@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-