1. 14 Dec, 2012 13 commits
    • Satoru Takeuchi's avatar
      module: Remove a extra null character at the top of module->strtab. · 54523ec7
      Satoru Takeuchi authored
      There is a extra null character('\0') at the top of module->strtab for
      each module. Commit 59ef28b1 introduced this bug and this patch fixes it.
      
      Live dump log of the current linus git kernel(HEAD is 2844a487):
      ============================================================================
      crash> mod | grep loop
      ffffffffa01db0a0  loop             16689  (not loaded)  [CONFIG_KALLSYMS]
      crash> module.core_symtab ffffffffa01db0a0
        core_symtab = 0xffffffffa01db320crash> rd 0xffffffffa01db320 12
      ffffffffa01db320:  0000005500000001 0000000000000000   ....U...........
      ffffffffa01db330:  0000000000000000 0002007400000002   ............t...
      ffffffffa01db340:  ffffffffa01d8000 0000000000000038   ........8.......
      ffffffffa01db350:  001a00640000000e ffffffffa01daeb0   ....d...........
      ffffffffa01db360:  00000000000000a0 0002007400000019   ............t...
      ffffffffa01db370:  ffffffffa01d8068 000000000000001b   h...............
      crash> module.core_strtab ffffffffa01db0a0
        core_strtab = 0xffffffffa01dbb30 ""
      crash> rd 0xffffffffa01dbb30 4
      ffffffffa01dbb30:  615f70616d6b0000 66780063696d6f74   ..kmap_atomic.xf
      ffffffffa01dbb40:  73636e75665f7265 72665f646e696600   er_funcs.find_fr
      ============================================================================
      
      We expect Just first one byte of '\0', but actually first two bytes
      are '\0'. Here is The relationship between symtab and strtab.
      
      	symtab_idx	strtab_idx	symbol
      	-----------------------------------------------
      	0		0x1		"\0" # startab_idx should be 0
      	1		0x2		"kmap_atomic"
      	2		0xe		"xfer_funcs"
      	3		0x19		"find_fr..."
      
      By applying this patch, it becomes as follows.
      
      	symtab_idx	strtab_idx	symbol
      	-----------------------------------------------
      	0		0x0		"\0"	# extra byte is removed
      	1		0x1		"kmap_atomic"
      	2		0xd		"xfer_funcs"
      	3		0x18		"find_fr..."
      Signed-off-by: default avatarSatoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Cc: Masaki Kimura <masaki.kimura.kz@hitachi.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      54523ec7
    • David Howells's avatar
      ASN.1: Use the ASN1_LONG_TAG and ASN1_INDEFINITE_LENGTH constants · 99cca91e
      David Howells authored
      Use the ASN1_LONG_TAG and ASN1_INDEFINITE_LENGTH constants in the ASN.1
      general decoder instead of the equivalent numbers.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      99cca91e
    • David Howells's avatar
      ASN.1: Define indefinite length marker constant · facc0a6b
      David Howells authored
      Define a constant to hold the marker value seen in an indefinite-length
      element.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      facc0a6b
    • Rusty Russell's avatar
      moduleparam: use __UNIQUE_ID() · 34182eea
      Rusty Russell authored
      Signed-off-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      34182eea
    • Rusty Russell's avatar
      __UNIQUE_ID() · 6f33d587
      Rusty Russell authored
      Jan Beulich points out __COUNTER__ (gcc 4.3 and above), so let's use
      that to create unique ids.  This is better than __LINE__ which we use
      today, so provide a wrapper.
      
      Stanislaw Gruszka <sgruszka@redhat.com> reported that some module parameters
      start with a digit, so we need to prepend when we for the unique id.
      Signed-off-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      Acked-by: default avatarJan Beulich <jbeulich@suse.com>
      6f33d587
    • Josh Boyer's avatar
      MODSIGN: Add modules_sign make target · d890f510
      Josh Boyer authored
      If CONFIG_MODULE_SIG is set, and 'make modules_sign' is called then this
      patch will cause the modules to get a signature appended.  The make target
      is intended to be run after 'make modules_install', and will modify the
      modules in-place in the installed location.  It can be used to produce
      signed modules after they have been processed by distribution build
      scripts.
      Signed-off-by: default avatarJosh Boyer <jwboyer@redhat.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (minor typo fix)
      d890f510
    • Rusty Russell's avatar
      powerpc: add finit_module syscall. · 71eac702
      Rusty Russell authored
      (This is just for Acks: this won't work without the actual syscall patches,
       sitting in my tree for -next at the moment).
      Acked-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      71eac702
    • Mimi Zohar's avatar
      ima: support new kernel module syscall · fdf90729
      Mimi Zohar authored
      With the addition of the new kernel module syscall, which defines two
      arguments - a file descriptor to the kernel module and a pointer to a NULL
      terminated string of module arguments - it is now possible to measure and
      appraise kernel modules like any other file on the file system.
      
      This patch adds support to measure and appraise kernel modules in an
      extensible and consistent manner.
      
      To support filesystems without extended attribute support, additional
      patches could pass the signature as the first parameter.
      Signed-off-by: default avatarMimi Zohar <zohar@us.ibm.com>
      Signed-off-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      fdf90729
    • Kees Cook's avatar
      add finit_module syscall to asm-generic · 1625cee5
      Kees Cook authored
      This adds the finit_module syscall to the generic syscall list.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      1625cee5
    • Kees Cook's avatar
      ARM: add finit_module syscall to ARM · 4926f652
      Kees Cook authored
      Add finit_module syscall to the ARM syscall list.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      4926f652
    • Kees Cook's avatar
      security: introduce kernel_module_from_file hook · 2e72d51b
      Kees Cook authored
      Now that kernel module origins can be reasoned about, provide a hook to
      the LSMs to make policy decisions about the module file. This will let
      Chrome OS enforce that loadable kernel modules can only come from its
      read-only hash-verified root filesystem. Other LSMs can, for example,
      read extended attributes for signatures, etc.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarSerge E. Hallyn <serge.hallyn@canonical.com>
      Acked-by: default avatarEric Paris <eparis@redhat.com>
      Acked-by: default avatarMimi Zohar <zohar@us.ibm.com>
      Acked-by: default avatarJames Morris <james.l.morris@oracle.com>
      Signed-off-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      2e72d51b
    • Rusty Russell's avatar
      module: add flags arg to sys_finit_module() · 2f3238ae
      Rusty Russell authored
      Thanks to Michael Kerrisk for keeping us honest.  These flags are actually
      useful for eliminating the only case where kmod has to mangle a module's
      internals: for overriding module versioning.
      Signed-off-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      Acked-by: default avatarLucas De Marchi <lucas.demarchi@profusion.mobi>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      2f3238ae
    • Kees Cook's avatar
      module: add syscall to load module from fd · 34e1169d
      Kees Cook authored
      As part of the effort to create a stronger boundary between root and
      kernel, Chrome OS wants to be able to enforce that kernel modules are
      being loaded only from our read-only crypto-hash verified (dm_verity)
      root filesystem. Since the init_module syscall hands the kernel a module
      as a memory blob, no reasoning about the origin of the blob can be made.
      
      Earlier proposals for appending signatures to kernel modules would not be
      useful in Chrome OS, since it would involve adding an additional set of
      keys to our kernel and builds for no good reason: we already trust the
      contents of our root filesystem. We don't need to verify those kernel
      modules a second time. Having to do signature checking on module loading
      would slow us down and be redundant. All we need to know is where a
      module is coming from so we can say yes/no to loading it.
      
      If a file descriptor is used as the source of a kernel module, many more
      things can be reasoned about. In Chrome OS's case, we could enforce that
      the module lives on the filesystem we expect it to live on.  In the case
      of IMA (or other LSMs), it would be possible, for example, to examine
      extended attributes that may contain signatures over the contents of
      the module.
      
      This introduces a new syscall (on x86), similar to init_module, that has
      only two arguments. The first argument is used as a file descriptor to
      the module and the second argument is a pointer to the NULL terminated
      string of module arguments.
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (merge fixes)
      34e1169d
  2. 03 Dec, 2012 2 commits
    • James Hogan's avatar
      modsign: add symbol prefix to certificate list · 84ecfd15
      James Hogan authored
      Add the arch symbol prefix (if applicable) to the asm definition of
      modsign_certificate_list and modsign_certificate_list_end. This uses the
      recently defined SYMBOL_PREFIX which is derived from
      CONFIG_SYMBOL_PREFIX.
      
      This fixes the build of module signing on the blackfin and metag
      architectures.
      Signed-off-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Signed-off-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      84ecfd15
    • James Hogan's avatar
      linux/kernel.h: define SYMBOL_PREFIX · cbdbf2ab
      James Hogan authored
      Define SYMBOL_PREFIX to be the same as CONFIG_SYMBOL_PREFIX if set by
      the architecture, or "" otherwise. This avoids the need for ugly #ifdefs
      whenever symbols are referenced in asm blocks.
      Signed-off-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joe Perches <joe@perches.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Jean Delvare <khali@linux-fr.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Signed-off-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      cbdbf2ab
  3. 02 Dec, 2012 2 commits
    • Linus Torvalds's avatar
      Merge branch 'for-3.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq · 3c46f3d6
      Linus Torvalds authored
      Pull  late workqueue fixes from Tejun Heo:
       "Unfortunately, I have two really late fixes.  One was for a
        long-standing bug and queued for 3.8 but I found out about a
        regression introduced during 3.7-rc1 two days ago, so I'm sending out
        the two fixes together.
      
        The first (long-standing) one is rescuer_thread() entering exit path
        w/ TASK_INTERRUPTIBLE.  It only triggers on workqueue destructions
        which isn't very frequent and the exit path can usually survive being
        called with TASK_INTERRUPT, so it was hidden pretty well.  Apparently,
        if you're reiserfs, this could lead to the exiting kthread sleeping
        indefinitely holding a mutex, which is never good.
      
        The fix is simple - restoring TASK_RUNNING before returning from the
        kthread function.
      
        The second one is introduced by the new mod_delayed_work().
        mod_delayed_work() was missing special case handling for 0 delay.
        Instead of queueing the work item immediately, it queued the timer
        which expires on the closest next tick.  Some users of the new
        function converted from "[__]cancel_delayed_work() +
        queue_delayed_work()" combination became unhappy with the extra delay.
      
        Block unplugging led to noticeably higher number of context switches
        and intel 6250 wireless failed to associate with WPA-Enterprise
        network.  The fix, again, is fairly simple.  The 0 delay special case
        logic from queue_delayed_work_on() should be moved to
        __queue_delayed_work() which is shared by both queue_delayed_work_on()
        and mod_delayed_work_on().
      
        The first one is difficult to trigger and the failure mode for the
        latter isn't completely catastrophic, so missing these two for 3.7
        wouldn't make it a disastrous release, but both bugs are nasty and the
        fixes are fairly safe"
      
      * 'for-3.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
        workqueue: mod_delayed_work_on() shouldn't queue timer on 0 delay
        workqueue: exit rescuer_thread() as TASK_RUNNING
      3c46f3d6
    • Tejun Heo's avatar
      workqueue: mod_delayed_work_on() shouldn't queue timer on 0 delay · 8852aac2
      Tejun Heo authored
      8376fe22 ("workqueue: implement mod_delayed_work[_on]()")
      implemented mod_delayed_work[_on]() using the improved
      try_to_grab_pending().  The function is later used, among others, to
      replace [__]candel_delayed_work() + queue_delayed_work() combinations.
      
      Unfortunately, a delayed_work item w/ zero @delay is handled slightly
      differently by mod_delayed_work_on() compared to
      queue_delayed_work_on().  The latter skips timer altogether and
      directly queues it using queue_work_on() while the former schedules
      timer which will expire on the closest tick.  This means, when @delay
      is zero, that [__]cancel_delayed_work() + queue_delayed_work_on()
      makes the target item immediately executable while
      mod_delayed_work_on() may induce delay of upto a full tick.
      
      This somewhat subtle difference breaks some of the converted users.
      e.g. block queue plugging uses delayed_work for deferred processing
      and uses mod_delayed_work_on() when the queue needs to be immediately
      unplugged.  The above problem manifested as noticeably higher number
      of context switches under certain circumstances.
      
      The difference in behavior was caused by missing special case handling
      for 0 delay in mod_delayed_work_on() compared to
      queue_delayed_work_on().  Joonsoo Kim posted a patch to add it -
      ("workqueue: optimize mod_delayed_work_on() when @delay == 0")[1].
      The patch was queued for 3.8 but it was described as optimization and
      I missed that it was a correctness issue.
      
      As both queue_delayed_work_on() and mod_delayed_work_on() use
      __queue_delayed_work() for queueing, it seems that the better approach
      is to move the 0 delay special handling to the function instead of
      duplicating it in mod_delayed_work_on().
      
      Fix the problem by moving 0 delay special case handling from
      queue_delayed_work_on() to __queue_delayed_work().  This replaces
      Joonsoo's patch.
      
      [1] http://thread.gmane.org/gmane.linux.kernel/1379011/focus=1379012Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-and-tested-by: default avatarAnders Kaseorg <andersk@MIT.EDU>
      Reported-and-tested-by: default avatarZlatko Calusic <zlatko.calusic@iskon.hr>
      LKML-Reference: <alpine.DEB.2.00.1211280953350.26602@dr-wily.mit.edu>
      LKML-Reference: <50A78AA9.5040904@iskon.hr>
      Cc: Joonsoo Kim <js1304@gmail.com>
      8852aac2
  4. 01 Dec, 2012 11 commits
  5. 30 Nov, 2012 12 commits
    • Vincent Palatin's avatar
      x86, fpu: Avoid FPU lazy restore after suspend · 644c1541
      Vincent Palatin authored
      When a cpu enters S3 state, the FPU state is lost.
      After resuming for S3, if we try to lazy restore the FPU for a process running
      on the same CPU, this will result in a corrupted FPU context.
      
      Ensure that "fpu_owner_task" is properly invalided when (re-)initializing a CPU,
      so nobody will try to lazy restore a state which doesn't exist in the hardware.
      
      Tested with a 64-bit kernel on a 4-core Ivybridge CPU with eagerfpu=off,
      by doing thousands of suspend/resume cycles with 4 processes doing FPU
      operations running. Without the patch, a process is killed after a
      few hundreds cycles by a SIGFPE.
      
      Cc: Duncan Laurie <dlaurie@chromium.org>
      Cc: Olof Johansson <olofj@chromium.org>
      Cc: <stable@kernel.org> v3.4+ # for 3.4 need to replace this_cpu_write by percpu_write
      Signed-off-by: default avatarVincent Palatin <vpalatin@chromium.org>
      Link: http://lkml.kernel.org/r/1354306532-1014-1-git-send-email-vpalatin@chromium.orgSigned-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      644c1541
    • Linus Torvalds's avatar
      Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux · cc19528b
      Linus Torvalds authored
      Pull DRM fixes from Dave Airlie:
       "Just driver fixes, nothing major, except maybe the Ironlake rc6
        disable:
      
         - intel:
           * revert ironlake rc6 - we still have one ilk regression, but this
             gets rid of one big one
           * turn off cloning
           * a directed fix for Apple edp
         - radeon: one modesetting fix
         - exynos: minor fixes"
      
      * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
        radeon: fix pll/ctrc mapping on dce2 and dce3 hardware
        Revert "drm/i915: enable rc6 on ilk again"
        drm/i915: do not default to 18 bpp for eDP if missing from VBT
        drm/exynos: Fix potential NULL pointer dereference in exynos_drm_encoder.c
        drm/exynos: Make exynos4/5_fimd_driver_data static
        drm/exynos: fix overlay updating issue
        drm/exynos: remove unnecessary code.
        drm/exynos: fix linux framebuffer address setting.
        drm/i915: disable cloning on sdvo
      cc19528b
    • Linus Torvalds's avatar
      Merge branch 'akpm' (Fixes from Andrew) · 50a53bbe
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "Seven fixes, some of them fingers-crossed :("
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (7 patches)
        drivers/rtc/rtc-tps65910.c: fix invalid pointer access on _remove()
        mm: soft offline: split thp at the beginning of soft_offline_page()
        mm: avoid waking kswapd for THP allocations when compaction is deferred or contended
        revert "Revert "mm: remove __GFP_NO_KSWAPD""
        mm: vmscan: fix endless loop in kswapd balancing
        mm/vmemmap: fix wrong use of virt_to_page
        mm: compaction: fix return value of capture_free_page()
      50a53bbe
    • Linus Torvalds's avatar
      Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · 73efd00d
      Linus Torvalds authored
      Pull ARM SoC fixes from Arnd Bergmann:
       "These are three fixes for the Marvell EBU family and one for the
        Samsung s3c platforms.  All of them are obvious should still make it
        into 3.7."
      
      * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
        ARM: Kirkwood: Update PCI-E fixup
        Dove: Fix irq_to_pmu()
        Dove: Attempt to fix PMU/RTC interrupts
        ARM: S3C24XX: Fix potential NULL pointer dereference error
      73efd00d
    • Linus Torvalds's avatar
      Merge tag 'ixp4xx-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · 90bf80a1
      Linus Torvalds authored
      Pull ARM ixp4xx bug fixes from Arnd Bergmann:
       "These were originally prepared by Krzysztof Halasa but not submitted
        in time for v3.7 due to some confusion about how ixp4xx patches should
        be handled.  Jason Cooper thankfully offered to help out sending the
        patches upstream through arm-soc now, but given the timing, we could
        as well delay them for 3.8."
      
      * tag 'ixp4xx-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
        IXP4xx: use __iomem for MMIO
        IXP4xx: map CPU config registers within VMALLOC region.
        IXP4xx: Always ioremap() Queue Manager MMIO region at boot.
        ixp4xx: Declare MODULE_FIRMWARE usage
        IXP4xx crypto: MOD_AES{128,192,256} already include key size.
        WAN: Remove redundant HDLC info printed by IXP4xx HSS driver.
        IXP4xx: Remove time limit for PCI TRDY to enable use of slow devices.
        IXP4xx: ixp4xx_crypto driver requires Queue Manager and NPE drivers.
        IXP4xx: HW pseudo-random generator is available on IXP45x/46x only.
        IXP4xx: Fix off-by-one bug in Goramo MultiLink platform.
        IXP4xx: Fix Goramo MultiLink platform compilation.
      90bf80a1
    • Linus Torvalds's avatar
      Merge branch 'fixes' of git://git.linaro.org/people/rmk/linux-arm · 50a561ca
      Linus Torvalds authored
      Pull final ARM fix from Russell King:
       "One final fix, spotted by Will, to do with what happens when we boot a
        SMP kernel on UP."
      
      * 'fixes' of git://git.linaro.org/people/rmk/linux-arm:
        ARM: 7586/1: sp804: set cpumask to cpu_possible_mask for clock event device
      50a561ca
    • Kim, Milo's avatar
      drivers/rtc/rtc-tps65910.c: fix invalid pointer access on _remove() · 1430e178
      Kim, Milo authored
      The tps65910_rtc data is registered as the platform driver data in
      _probe(= ).  Therefore the tps65910_rtc should be used on unregistering
      the rtc device.  And device pointer should be retrieved from the
      platform_device structure.
      
      This patch fixes the below oops:
      
       Unable to handle kernel NULL pointer dereference at virtual address 00000008
       Modules linked in: rtc_tps65910(-)
       CPU: 0    Not tainted  (3.7.0-rc7-next-20121128-g6b1f974-dirty #7)
       PC is at tps65910_rtc_alarm_irq_enable+0x20/0x2c [rtc_tps65910]
           (tps65910_rtc_alarm_irq_enable+0x20/0x2c [rtc_tps65910])
           (tps65910_rtc_remove+0x18/0x28 [rtc_tps65910])
           (platform_drv_remove+0x18/0x1c)
           (__device_release_driver+0x70/0xcc)
           (driver_detach+0xb4/0xb8)
           (bus_remove_driver+0x7c/0xc0)
           (sys_delete_module+0x148/0x21c)
      Signed-off-by: default avatarMilo(Woogyom) Kim <milo.kim@ti.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1430e178
    • Naoya Horiguchi's avatar
      mm: soft offline: split thp at the beginning of soft_offline_page() · 783657a7
      Naoya Horiguchi authored
      When we try to soft-offline a thp tail page, put_page() is called on the
      tail page unthinkingly and VM_BUG_ON is triggered in put_compound_page().
      
      This patch splits thp before going into the main body of soft-offlining.
      Signed-off-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      783657a7
    • Mel Gorman's avatar
      mm: avoid waking kswapd for THP allocations when compaction is deferred or contended · 782fd304
      Mel Gorman authored
      With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
      based on failures" reverted, Zdenek Kabelac reported the following
      
        Hmm,  so it's just took longer to hit the problem and observe
        kswapd0 spinning on my CPU again - it's not as endless like before -
        but still it easily eats minutes - it helps to turn off  Firefox
        or TB  (memory hungry apps) so kswapd0 stops soon - and restart
        those apps again.  (And I still have like >1GB of cached memory)
      
        kswapd0         R  running task        0    30      2 0x00000000
        Call Trace:
          preempt_schedule+0x42/0x60
          _raw_spin_unlock+0x55/0x60
          put_super+0x31/0x40
          drop_super+0x22/0x30
          prune_super+0x149/0x1b0
          shrink_slab+0xba/0x510
      
      The sysrq+m indicates the system has no swap so it'll never reclaim
      anonymous pages as part of reclaim/compaction.  That is one part of the
      problem but not the root cause as file-backed pages could also be
      reclaimed.
      
      The likely underlying problem is that kswapd is woken up or kept awake
      for each THP allocation request in the page allocator slow path.
      
      If compaction fails for the requesting process then compaction will be
      deferred for a time and direct reclaim is avoided.  However, if there
      are a storm of THP requests that are simply rejected, it will still be
      the the case that kswapd is awake for a prolonged period of time as
      pgdat->kswapd_max_order is updated each time.  This is noticed by the
      main kswapd() loop and it will not call kswapd_try_to_sleep().  Instead
      it will loopp, shrinking a small number of pages and calling
      shrink_slab() on each iteration.
      
      This patch defers when kswapd gets woken up for THP allocations.  For
      !THP allocations, kswapd is always woken up.  For THP allocations,
      kswapd is woken up iff the process is willing to enter into direct
      reclaim/compaction.
      
      [akpm@linux-foundation.org: fix typo in comment]
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Zdenek Kabelac <zkabelac@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      782fd304
    • Andrew Morton's avatar
      revert "Revert "mm: remove __GFP_NO_KSWAPD"" · a5091539
      Andrew Morton authored
      It apepars that this patch was innocent, and we hope that "mm: avoid
      waking kswapd for THP allocations when compaction is deferred or
      contended" will fix the final kswapd-spinning cause.
      
      Cc: Zdenek Kabelac <zkabelac@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a5091539
    • Johannes Weiner's avatar
      mm: vmscan: fix endless loop in kswapd balancing · 60cefed4
      Johannes Weiner authored
      Kswapd does not in all places have the same criteria for a balanced
      zone.  Zones are only being reclaimed when their high watermark is
      breached, but compaction checks loop over the zonelist again when the
      zone does not meet the low watermark plus two times the size of the
      allocation.  This gets kswapd stuck in an endless loop over a small
      zone, like the DMA zone, where the high watermark is smaller than the
      compaction requirement.
      
      Add a function, zone_balanced(), that checks the watermark, and, for
      higher order allocations, if compaction has enough free memory.  Then
      use it uniformly to check for balanced zones.
      
      This makes sure that when the compaction watermark is not met, at least
      reclaim happens and progress is made - or the zone is declared
      unreclaimable at some point and skipped entirely.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: default avatarGeorge Spelvin <linux@horizon.com>
      Reported-by: default avatarJohannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
      Reported-by: default avatarTomas Racek <tracek@redhat.com>
      Tested-by: default avatarJohannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      60cefed4
    • Jianguo Wu's avatar
      mm/vmemmap: fix wrong use of virt_to_page · ae64ffca
      Jianguo Wu authored
      I enable CONFIG_DEBUG_VIRTUAL and CONFIG_SPARSEMEM_VMEMMAP, when doing
      memory hotremove, there is a kernel BUG at arch/x86/mm/physaddr.c:20.
      
      It is caused by free_section_usemap()->virt_to_page(), virt_to_page() is
      only used for kernel direct mapping address, but sparse-vmemmap uses
      vmemmap address, so it is going wrong here.
      
        ------------[ cut here ]------------
        kernel BUG at arch/x86/mm/physaddr.c:20!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: acpihp_drv acpihp_slot edd cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse vfat fat loop dm_mod coretemp kvm crc32c_intel ipv6 ixgbe igb iTCO_wdt i7core_edac edac_core pcspkr iTCO_vendor_support ioatdma microcode joydev sr_mod i2c_i801 dca lpc_ich mfd_core mdio tpm_tis i2c_core hid_generic tpm cdrom sg tpm_bios rtc_cmos button ext3 jbd mbcache usbhid hid uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif processor thermal_sys hwmon scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh_emc scsi_dh ata_generic ata_piix libata megaraid_sas scsi_mod
        CPU 39
        Pid: 6454, comm: sh Not tainted 3.7.0-rc1-acpihp-final+ #45 QCI QSSC-S4R/QSSC-S4R
        RIP: 0010:[<ffffffff8103c908>]  [<ffffffff8103c908>] __phys_addr+0x88/0x90
        RSP: 0018:ffff8804440d7c08  EFLAGS: 00010006
        RAX: 0000000000000006 RBX: ffffea0012000000 RCX: 000000000000002c
        ...
      Signed-off-by: default avatarJianguo Wu <wujianguo@huawei.com>
      Signed-off-by: default avatarJiang Liu <jiang.liu@huawei.com>
      Reviewd-by: default avatarWen Congyang <wency@cn.fujitsu.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ae64ffca