1. 10 Jan, 2014 10 commits
    • kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers · 1ae06819
      Tejun Heo authored
      Sometimes it's necessary to implement a node which wants to delete
      nodes including itself.  This isn't straightforward because of kernfs
      active reference.  While a file operation is in progress, an active
      reference is held and kernfs_remove() waits for all such references to
      drain before completing.  For a self-deleting node, this is a deadlock
      as kernfs_remove() ends up waiting for an active reference that it is
      itself sitting on top of.
      
      This currently is worked around in the sysfs layer using
      sysfs_schedule_callback() which makes such removals asynchronous.
      While it works, it's rather cumbersome and inherently breaks
      synchronicity of the operation - the file operation which triggered
      the operation may complete before the removal is finished (or even
      started) and the removal may fail asynchronously.  If a removal
      operation is immediately followed by another operation which expects
      the specific name to be available (e.g. removal followed by rename
      onto the same name), there's no way to make the latter operation
      reliable.
      
      The thing is there's no inherent reason for this to be asynchronous.
      All that's necessary to make this synchronous is a dedicated operation
      which drops its own active ref and deactivates self.  This patch
      implements kernfs_remove_self() and its wrappers in sysfs and driver
      core.  kernfs_remove_self() is to be called from one of the file
      operations, drops the active ref and deactivates using
      __kernfs_deactivate_self(), removes the self node, and restores active
      ref to the dead node using __kernfs_reactivate_self() so that the ref
      is balanced afterwards.  __kernfs_remove() is updated so that it takes
      an early exit if the target node is already fully removed so that the
      active ref restored by kernfs_remove_self() after removal doesn't
      confuse the deactivation path.
      
      This makes implementing self-deleting nodes very easy.  The normal
      removal path doesn't even need to be changed to use
      kernfs_remove_self() for the self-deleting node.  The method can
      invoke kernfs_remove_self() on itself before proceeding with the normal
      removal path.  kernfs_remove() invoked on the node by the normal
      deletion path will simply be ignored.
      
      This will replace sysfs_schedule_callback().  A subtle feature of
      sysfs_schedule_callback() is that it collapses multiple invocations -
      even if multiple removals are triggered, the removal callback is run
      only once.  An equivalent effect can be achieved by testing the return
      value of kernfs_remove_self() - only the one which gets %true return
      value should proceed with actual deletion.  All other instances of
      kernfs_remove_self() will wait till the enclosing kernfs operation
      which invoked the winning instance of kernfs_remove_self() finishes
      and then return %false.  This trivially makes all users of
      kernfs_remove_self() automatically show correct synchronous behavior
      even when there are multiple concurrent operations - all "echo 1 >
      delete" instances will finish only after the whole operation is
      completed by one of the instances.
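      The return-value convention above can be pictured with a small
      userland model.  This is only a sketch, not kernel code:
      node_model, remove_self() and delete_store() are hypothetical
      names, and the model only reports who won, whereas the real
      kernfs_remove_self() also makes the losing callers wait until the
      winner's enclosing operation finishes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a self-deleting node. */
struct node_model {
    atomic_flag removing;   /* set by the first remover */
    int teardowns;          /* how many times the real deletion ran */
};

/* Models the return value only: true for the first caller, false after. */
static bool remove_self(struct node_model *n)
{
    return !atomic_flag_test_and_set(&n->removing);
}

/* Models the write handler of a "delete" file. */
static void delete_store(struct node_model *n)
{
    if (remove_self(n)) {
        n->teardowns++;             /* only the winner deletes */
        assert(n->teardowns == 1);  /* deletion never runs twice */
    }
}
```

      However many times delete_store() runs on the same node, the
      teardown counter ends at one, which mirrors the collapsing
      behavior of sysfs_schedule_callback() described above.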
      
      v2: For !CONFIG_SYSFS, the dummy version of kernfs_remove_self() was
          missing and sysfs_remove_file_self() had an incorrect return type.
          Fix it.  Reported by kbuild test bot.
      
      v3: Updated to use __kernfs_{de|re}activate_self().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: implement kernfs_{de|re}activate[_self]() · 9f010c2a
      Tejun Heo authored
      This patch implements four functions to manipulate deactivation state
      - deactivate, reactivate and the _self suffixed pair.  A new field
      kernfs_node->deact_depth is added so that concurrent and nested
      deactivations are handled properly.  kernfs_node->hash is moved so
      that it's paired with the new field so that it doesn't increase the
      size of kernfs_node.
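      The message does not spell out how deact_depth is used.  One
      plausible reading is a nesting counter, sketched below in
      userland C.  Everything here is an assumption made for
      illustration: node_model, deactivate(), reactivate() and
      is_active() are hypothetical names, and the model is
      single-threaded with no locking or draining.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: the depth counts outstanding deactivations. */
struct node_model {
    int deact_depth;
};

static void deactivate(struct node_model *n)
{
    n->deact_depth++;   /* only the 0 -> 1 step changes the state */
}

static void reactivate(struct node_model *n)
{
    assert(n->deact_depth > 0);   /* must pair with a deactivate() */
    n->deact_depth--;             /* active again only at depth 0 */
}

static bool is_active(const struct node_model *n)
{
    return n->deact_depth == 0;
}
```

      With such a counter, an inner deactivate/reactivate pair cannot
      reactivate a node that an outer caller still expects to be
      deactivated.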
      
      A kernfs user's lock would normally nest inside active ref but during
      removal the user may want to perform kernfs_remove() while holding the
      said lock, which would introduce a reverse locking dependency.  These
      functions can be used to break such a reverse dependency by allowing
      the deactivation step to be performed separately outside the user's
      critical section.
      
      This will also be used to implement kernfs_remove_self().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: make kernfs_get_active() block if the node is deactivated but not removed · 895a068a
      Tejun Heo authored
      Currently, kernfs_get_active() fails if the target node is
      deactivated.  This is fine as a node always gets removed after
      deactivation; however, we're gonna add reactivation so the assumption
      won't hold.  It'd be incorrect for kernfs_get_active() to fail for a
      node which was deactivated only temporarily.
      
      This patch makes kernfs_get_active() block if the node is deactivated
      but not removed.  If the node gets reactivated (not yet implemented),
      it will be retried and succeed.  If the node gets removed, it will be
      woken up and fail.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: remove kernfs_addrm_cxt · 99177a34
      Tejun Heo authored
      kernfs_addrm_cxt and the accompanying kernfs_addrm_start/finish() were
      added because there were operations which should be performed outside
      kernfs_mutex after adding and removing kernfs_nodes.  The necessary
      operations were recorded in kernfs_addrm_cxt and performed by
      kernfs_addrm_finish(); however, after the recent changes which
      relocated deactivation and unmapping so that they're performed
      directly during removal, the only operation kernfs_addrm_finish()
      performs is kernfs_put(), which can be moved inside the removal path
      too.
      
      This patch moves the kernfs_put() of the base ref to __kernfs_remove()
      and removes kernfs_addrm_cxt and kernfs_addrm_start/finish().
      
      * kernfs_add_one() is updated to grab and release the parent's active
        ref and kernfs_mutex itself.  kernfs_get/put_active() and
        kernfs_addrm_start/finish() invocations around it are removed from
        all users.
      
      * __kernfs_remove() puts an unlinked node directly instead of chaining
        it to kernfs_addrm_cxt.  Its callers are updated to grab and release
        kernfs_mutex instead of calling kernfs_addrm_start/finish() around
        it.
      
      v2: Updated to fit the v2 restructuring of removal path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: invoke kernfs_unmap_bin_file() directly from __kernfs_remove() · f601f9a2
      Tejun Heo authored
      kernfs_unmap_bin_file() is supposed to unmap all memory mappings of
      the target file before kernfs_remove() finishes; however, it currently
      is being called from kernfs_addrm_finish() and has the same race
      problem as the original implementation of deactivation when there are
      multiple removers - only the remover which snatches the node to its
      addrm_cxt->removed list is guaranteed to wait for its completion
      before returning.
      
      It can be fixed by moving kernfs_unmap_bin_file() invocation from
      kernfs_addrm_finish() to __kernfs_remove().  The function may be
      called multiple times but that shouldn't do any harm.
      
      We end up dropping kernfs_mutex in the removal loop and the node may
      be removed in between by someone else.  kernfs_unlink_sibling() is
      updated to test whether the node has already been removed and return
      accordingly.  __kernfs_remove() in turn performs post-unlinking
      cleanup only if it actually unlinked the node.
      
      KERNFS_HAS_MMAP test is moved out of the unmap function into
      __kernfs_remove() so that we don't unlock kernfs_mutex unnecessarily.
      While at it, drop the now meaningless "bin" qualifier from the
      function name.
      
      v2: Rewritten to fit the v2 restructuring of removal path.  HAS_MMAP
          test relocated.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: restructure removal path to fix possible premature return · 45a140e5
      Tejun Heo authored
      The recursive nature of kernfs_remove() means that, even if
      kernfs_remove() is not allowed to be called multiple times on the same
      node, there may be race conditions between removal of parent and its
      descendants.  While we can claim that kernfs_remove() shouldn't be
      called on one of the descendants while the removal of an ancestor is
      in progress, such rule is unnecessarily restrictive and very difficult
      to enforce.  It's better to simply allow invoking kernfs_remove() as
      the caller sees fit as long as the caller ensures that the node is
      accessible.
      
      The current behavior in such situations is broken.  Whoever enters
      the removal path first takes the node off the hierarchy and then
      deactivates it.  Subsequent removers either return as soon as they
      notice that they are not the first or can't even find the target node
      as it has already been removed from the hierarchy.  In both cases, the
      following removers may finish prematurely while the nodes which should
      be removed and drained are still being processed by the first one.
      
      This patch restructures the removal path so that multiple removers,
      whether through recursion or direct invocation, always follow these
      rules.
      
      * When there are multiple concurrent removers, only one puts the base
        ref.
      
      * Regardless of which one puts the base ref, all removers are blocked
        until the target node is fully deactivated and removed.
      
      To achieve the above, removal path now first deactivates the subtree,
      drains it and then unlinks one-by-one.  __kernfs_deactivate() is
      called directly from __kernfs_remove() and drops and regrabs
      kernfs_mutex for each descendant to drain active refs.  As this means
      that multiple removers can enter __kernfs_deactivate() for the same
      node, the function is updated so that it can handle multiple
      deactivators of the same node - only one actually deactivates but all
      wait till drain completion.
      
      The restructured removal path guarantees that a removed node gets
      unlinked only after the node is deactivated and drained.  Combined
      with proper multiple deactivator handling, this guarantees that any
      invocation of kernfs_remove() returns only after the node itself and
      all its descendants are deactivated, drained and removed.
      
      v2: Draining separated into a separate loop (used to be in the same
          loop as unlink) and done from __kernfs_deactivate().  This is to
          allow exposing deactivation as a separate interface later.
      
          Root node removal was broken in v1 patch.  Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: remove KERNFS_REMOVED · ae34372e
      Tejun Heo authored
      KERNFS_REMOVED is used to mark half-initialized and dying nodes so
      that they don't show up in lookups and so that adding new nodes under
      them or renaming them is denied; however, its role overlaps those of
      deactivation and removal from rbtree.
      
      It's necessary to deny addition of new children while removal is in
      progress; however, this role considerably intersects with deactivation
      - KERNFS_REMOVED prevents new children while deactivation prevents new
      file operations.  There's no reason to have them separate making
      things more complex than necessary.
      
      KERNFS_REMOVED is also used to decide whether a node is still visible
      to vfs layer, which is rather redundant as equivalent determination
      can be made by testing whether the node is on its parent's children
      rbtree or not.
      
      This patch removes KERNFS_REMOVED.
      
      * Instead of KERNFS_REMOVED, each node now starts its life
        deactivated.  This means that we now use both atomic_add() and
        atomic_sub() on KN_DEACTIVATED_BIAS, which is INT_MIN.  The compiler
        generates an overflow warning when negating INT_MIN as the negation
        can't be represented as a positive number.  Nothing is actually
        broken but let's bump BIAS by one to avoid the warnings for archs
        which negate the subtrahend.
      
      * KERNFS_REMOVED tests in add and rename paths are replaced with
        kernfs_get/put_active() of the target nodes.  Due to the way the add
        path is structured now, active ref handling is done in the callers
        of kernfs_add_one().  This will be consolidated up later.
      
      * kernfs_remove_one() is updated to deactivate instead of setting
        KERNFS_REMOVED.  This removes deactivation from kernfs_deactivate(),
        which is now renamed to kernfs_drain().
      
      * kernfs_dop_revalidate() now tests RB_EMPTY_NODE(&kn->rb) instead of
        KERNFS_REMOVED and KERNFS_REMOVED test in kernfs_dir_pos() is
        dropped.  A node which is removed from the children rbtree is not
        included in the iteration in the first place.  This means that a
        node may be visible through vfs a bit longer - it's now also visible
        after deactivation until the actual removal.  This slightly enlarged
        window makes no difference to userland.
      
      * Sanity check on KERNFS_REMOVED in kernfs_put() is replaced with
        checks on the active ref.
      
      * Some comment style updates in the affected area.
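      The INT_MIN point in the first bullet can be checked in plain
      userland C.  A minimal sketch, assuming the usual two's
      complement int: BIAS_OLD, BIAS_NEW, negation_is_representable()
      and sub_by_negated_add() are illustrative names, not kernel
      identifiers (the kernel constant is KN_DEACTIVATED_BIAS).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Illustrative stand-ins for the bias before and after the bump. */
#define BIAS_OLD INT_MIN
#define BIAS_NEW (INT_MIN + 1)

/* -v fits in an int for every value except INT_MIN. */
static bool negation_is_representable(int v)
{
    return v != INT_MIN;
}

/* Models a subtraction implemented as "add the negated operand". */
static int sub_by_negated_add(int counter, int bias)
{
    assert(negation_is_representable(bias));
    return counter + (-bias);
}
```

      A node that starts at BIAS_NEW and has BIAS_NEW subtracted ends
      up at zero, and -BIAS_NEW is exactly INT_MAX, so no step of the
      computation leaves the range of int.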
      
      v2: Reordered before removal path restructuring.  kernfs_active()
          dropped and kernfs_get/put_active() used instead.  RB_EMPTY_NODE()
          used in the lookup paths.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep() · a69d001c
      Tejun Heo authored
      There currently are two mechanisms gating active ref lockdep
      annotations - KERNFS_LOCKDEP flag and KERNFS_ACTIVE_REF type mask.
      The former disables lockdep annotations in kernfs_get/put_active()
      while the latter disables all of kernfs_deactivate().
      
      While KERNFS_ACTIVE_REF also behaves as an optimization to skip the
      deactivation step for non-file nodes, the benefit is marginal and it
      needlessly diverges code paths.  Let's drop KERNFS_ACTIVE_REF and use
      KERNFS_LOCKDEP in kernfs_deactivate() too.
      
      While at it, add a test helper kernfs_lockdep() to test KERNFS_LOCKDEP
      flag so that it's more convenient and the related code can be compiled
      out when not enabled.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq · ea1c472d
      Tejun Heo authored
      kernfs_node->u.completion is used to notify deactivation completion
      from kernfs_put_active() to kernfs_deactivate().  We now allow
      multiple racing removals of the same node and the current removal
      scheme is no longer correct - kernfs_remove() invocation may return
      before the node is properly deactivated if it races against another
      removal.  The removal path will be restructured to address the issue.
      
      To help such restructuring, which requires supporting multiple waiters,
      this patch replaces kernfs_node->u.completion with
      kernfs_root->deactivate_waitq.  This makes deactivation event
      notifications share a per-root waitqueue_head; however, the wait path
      is quite cold and this will also allow shaving one pointer off
      kernfs_node.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kernfs: fix get_active failure handling in kernfs_seq_*() · d92d2e6b
      Tejun Heo authored
      When kernfs_seq_start() fails to obtain an active reference, it
      returns ERR_PTR(-ENODEV).  kernfs_seq_stop() is then invoked with the
      error pointer value; however, it still proceeds to invoke
      kernfs_put_active() on the node, leading to an unbalanced put.
      
      If kernfs_seq_stop() is called even after active ref failure, it
      should skip invocation of @ops->seq_stop() and put_active.
      Unfortunately, this is a bit complicated because active ref failure
      isn't the only thing which may fail with ERR_PTR(-ENODEV).
      @ops->seq_start/next() may also fail with the error value and
      kernfs_seq_stop() doesn't have a way to tell apart those failures.
      
      Work around it by factoring out the active part of kernfs_seq_stop()
      into kernfs_seq_stop_active() and invoking it directly if
      @ops->seq_start/next() fail with ERR_PTR(-ENODEV) and updating
      kernfs_seq_stop() to skip kernfs_seq_stop_active() on
      ERR_PTR(-ENODEV).  This is a bit nasty but ensures that the active put
      is skipped iff get_active failed in kernfs_seq_start().
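      The balancing rule can be modelled in userland C.  This is a
      simplified sketch under stated assumptions, not the kernel code:
      node_model, err_ptr(), ptr_err() and the unprefixed seq_*()
      functions are hypothetical stand-ins, there is a single user
      callback modelled by the op_fails flag, and locking is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Userland stand-ins for the kernel's ERR_PTR()/PTR_ERR(). */
static void *err_ptr(long err) { return (void *)(intptr_t)err; }
static long ptr_err(const void *p) { return (long)(intptr_t)p; }

struct node_model {
    int active;         /* active reference count */
    bool deactivated;   /* get_active() fails when set */
    bool op_fails;      /* user seq_start returns -ENODEV */
};

static bool get_active(struct node_model *n)
{
    if (n->deactivated)
        return false;
    n->active++;
    return true;
}

static void put_active(struct node_model *n)
{
    assert(n->active > 0);   /* an unbalanced put trips this */
    n->active--;
}

/* The "active part" of stop: user seq_stop would run here, then the put. */
static void seq_stop_active(struct node_model *n)
{
    put_active(n);
}

static void *seq_start(struct node_model *n)
{
    static int record;   /* dummy non-error iterator value */

    if (!get_active(n))
        return err_ptr(-ENODEV);   /* nothing was taken */
    if (n->op_fails) {
        seq_stop_active(n);        /* undo here: stop will skip */
        return err_ptr(-ENODEV);
    }
    return &record;
}

static void seq_stop(struct node_model *n, void *v)
{
    if (ptr_err(v) != -ENODEV)
        seq_stop_active(n);
}
```

      In all three cases (success, get_active failure, callback
      failure) the active count is back at zero after seq_stop(), and
      the put never runs without a matching get.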
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 09 Jan, 2014 1 commit
  3. 08 Jan, 2014 3 commits
    • driver-core: Fix use-after-free triggered by bus_unregister() · 174be70b
      Bart Van Assche authored
      Prevent bus_unregister() from triggering a use-after-free with
      CONFIG_DEBUG_KOBJECT_RELEASE=y. This patch prevents the following
      sequence from triggering a kernel crash with memory poisoning
      enabled:
      * bus_register()
      * driver_register()
      * driver_unregister()
      * bus_unregister()
      
      The above sequence causes the bus private data to be freed from
      inside the bus_unregister() call although it is not guaranteed in
      that function that the reference count on the bus private data has
      dropped to zero. As an example, with CONFIG_DEBUG_KOBJECT_RELEASE=y
      the ${bus}/drivers kobject is still holding a reference on
      bus->p->subsys.kobj via its parent pointer at the time the bus
      private data is freed. Fix this by deferring freeing the bus private
      data until the last kobject_put() call on bus->p->subsys.kobj.
      
      With memory poisoning enabled, the above sequence triggers the
      following kernel oops, which this patch fixes:
      
      general protection fault: 0000 [#1] PREEMPT SMP
      CPU: 3 PID: 2711 Comm: kworker/3:32 Tainted: G        W  O 3.13.0-rc4-debug+ #1
      Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      Workqueue: events kobject_delayed_cleanup
      task: ffff880037f866d0 ti: ffff88003b638000 task.ti: ffff88003b638000
      Call Trace:
       [<ffffffff81263105>] ? kobject_get_path+0x25/0x100
       [<ffffffff81264354>] kobject_uevent_env+0x134/0x600
       [<ffffffff8126482b>] kobject_uevent+0xb/0x10
       [<ffffffff81262fa2>] kobject_delayed_cleanup+0xc2/0x1b0
       [<ffffffff8106c047>] process_one_work+0x217/0x700
       [<ffffffff8106bfdb>] ? process_one_work+0x1ab/0x700
       [<ffffffff8106c64b>] worker_thread+0x11b/0x3a0
       [<ffffffff8106c530>] ? process_one_work+0x700/0x700
       [<ffffffff81074b70>] kthread+0xf0/0x110
       [<ffffffff81074a80>] ? insert_kthread_work+0x80/0x80
       [<ffffffff815673bc>] ret_from_fork+0x7c/0xb0
       [<ffffffff81074a80>] ? insert_kthread_work+0x80/0x80
      Code: 89 f8 48 89 e5 f6 82 c0 27 63 81 20 74 15 0f 1f 44 00 00 48 83 c0 01 0f b6 10 f6 82 c0 27 63 81 20 75 f0 5d c3 66 0f 1f 44 00 00 <80> 3f 00 55 48 89 e5 74 15 48 89 f8 0f 1f 40 00 48 83 c0 01 80
      RIP  [<ffffffff81267ed0>] strlen+0x0/0x30
       RSP <ffff88003b639c70>
      ---[ end trace 210f883ef80376aa ]---
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Acked-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • firmware loader: Add sparse annotation · 98233b21
      Bart Van Assche authored
      Prevent sparse from reporting the following warning on __fw_free_buf():
      
      drivers/base/firmware_class.c:230:9: warning: context imbalance in '__fw_free_buf' - unexpected unlock
      Signed-off-by: Bart Van Assche <bvanassche@acm.org>
      Acked-by: Ming Lei <ming.lei@canonical.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  4. 05 Jan, 2014 1 commit
  5. 24 Dec, 2013 1 commit
  6. 22 Dec, 2013 6 commits
    • Linux 3.13-rc5 · 413541dd
      Linus Torvalds authored
    • Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · 93579aee
      Linus Torvalds authored
      Pull ARM SoC fixes from Olof Johansson:
       "Much smaller batch of fixes this week.
      
        Biggest one is a revert of an OMAP display change that removed some
        non-DT pinmux code that was still needed for 3.13 to get DSI displays
        to work.
      
        There's also a fix that resolves some misdescribed GPIO controller
        resources on shmobile.  The rest are mostly smaller fixes, a couple of
        MAINTAINERS updates, etc"
      
      * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
        Revert "ARM: OMAP2+: Remove legacy mux code for display.c"
        MAINTAINERS: Add keystone clock drivers
        MAINTAINERS: Add keystone git tree information
        ARM: s3c64xx: dt: Fix boot failure due to double clock initialization
        ARM: shmobile: r8a7790: Fix GPIO resources in DTS
        irqchip: renesas-intc-irqpin: Fix register bitfield shift calculation
        ARM: shmobile: lager: phy fixup needs CONFIG_PHYLIB
    • Merge tag 'firewire-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394 · ba8b844f
      Linus Torvalds authored
      Pull firewire fixlet from Stefan Richter:
       "A one-liner to reenable WRITE SAME over SBP-2 like in v3.8...v3.12.
        Buggy targets which could malfunction when being subjected to this
        command are already sufficiently protected by a scsi_level check in sd
        + SCSI core"
      
      * tag 'firewire-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
        firewire: sbp2: bring back WRITE SAME support
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending · 1733348b
      Linus Torvalds authored
      Pull SCSI target fixes from Nicholas Bellinger:
       "Mostly minor items this time around, the most notable being a FILEIO
        backend change to enforce hw_max_sectors based upon the current
        block_size to address a bug where large sized I/Os (> 1M) were being
        rejected"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
        qla2xxx: Fix scsi_host leak on qlt_lport_register callback failure
        target: Remove extra percpu_ref_init
        target/file: Update hw_max_sectors based on current block_size
        iser-target: Move INIT_WORK setup into isert_create_device_ib_res
        iscsi-target: Fix incorrect np->np_thread NULL assignment
        qla2xxx: Fix schedule_delayed_work() for target timeout calculations
        iser-target: fix error return code in isert_create_device_ib_res()
        iscsi-target: Fix-up all zero data-length CDBs with R/W_BIT set
        target: Remove write-only stats fields and lock from struct se_node_acl
        iscsi-target: return -EINVAL on oversized configfs parameter
    • Merge git://git.kvack.org/~bcrl/aio-next · a8472b4b
      Linus Torvalds authored
      Pull AIO leak fixes from Ben LaHaise:
       "I've put these two patches plus Linus's change through a round of
        tests, and it passes millions of iterations of the aio numa
        migratepage test, as well as a number of repetitions of a few simple
        read and write tests.
      
        The first patch fixes the memory leak Kent introduced, while the
        second patch makes aio_migratepage() much more paranoid and robust"
      
      * git://git.kvack.org/~bcrl/aio-next:
        aio/migratepages: make aio migrate pages sane
        aio: fix kioctx leak introduced by "aio: Fix a trinity splat"
    • aio: clean up and fix aio_setup_ring page mapping · 3dc9acb6
      Linus Torvalds authored
      Since commit 36bc08cc ("fs/aio: Add support to aio ring pages
      migration") the aio ring setup code has used a special per-ring backing
      inode for the page allocations, rather than just using random anonymous
      pages.
      
      However, rather than remembering the pages as it allocated them, it
      would allocate the pages, insert them into the file mapping (dirty, so
      that they couldn't be free'd), and then forget about them.  And then to
      look them up again, it would mmap the mapping, and then use
      "get_user_pages()" to get back an array of the pages we just created.
      
      Now, not only is that incredibly inefficient, it also leaked all the
      pages if the mmap failed (which could happen due to excessive number of
      mappings, for example).
      
      So clean it all up, making it much more straightforward.  Also remove
      some left-overs of the previous (broken) mm_populate() usage that was
      removed in commit d6c355c7 ("aio: fix race in ring buffer page
      lookup introduced by page migration support") but left the pointless and
      now misleading MAP_POPULATE flag around.
      Tested-and-acked-by: Benjamin LaHaise <bcrl@kvack.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 21 Dec, 2013 3 commits
    • aio/migratepages: make aio migrate pages sane · 8e321fef
      Benjamin LaHaise authored
      The arbitrary restriction on page counts offered by the core
      migrate_page_move_mapping() code results in rather suspicious looking
      fiddling with page reference counts in the aio_migratepage() operation.
      To fix this, make migrate_page_move_mapping() take an extra_count parameter
      that allows aio to tell the code about its own reference count on the page
      being migrated.
      
      While cleaning up aio_migratepage(), make it validate that the old page
      being passed in is actually what aio_migratepage() expects to prevent
      misbehaviour in the case of races.
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
    • aio: fix kioctx leak introduced by "aio: Fix a trinity splat" · 1881686f
      Benjamin LaHaise authored
      e34ecee2 reworked the percpu reference
      counting to correct a bug trinity found.  Unfortunately, the change led
      to kioctxes being leaked because there was no final reference count to
      put.  Add that reference count back in to fix things.
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      Cc: stable@vger.kernel.org
    • Don't set the INITRD_COMPRESS environment variable automatically · b7000ade
      Linus Torvalds authored
      Commit 1bf49dd4 ("./Makefile: export initial ramdisk compression
      config option") started setting the INITRD_COMPRESS environment variable
      depending on which decompression models the kernel had available.
      
      That is completely broken.
      
      For example, we by default have CONFIG_RD_LZ4 enabled, and are able to
      decompress such an initrd, but the user tools to *create* such an initrd
      may not be available.  So trying to tell dracut to generate an
      lz4-compressed image just because we can decode such an image is
      completely inappropriate.
      
      Cc: J P <ppandit@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jan Beulich <JBeulich@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 20 Dec, 2013 15 commits
    • Merge tag 'xfs-for-linus-v3.13-rc5' of git://oss.sgi.com/xfs/xfs · a6ddeee3
      Linus Torvalds authored
      Pull xfs bugfixes from Ben Myers:
       "This contains fixes for some asserts
         related to project quotas, a memory leak, a hang when disabling group or
         project quotas before disabling user quotas, Dave's email address, several
         fixes for the alignment of file allocation to stripe unit/width geometry, a
         fix for an assertion with xfs_zero_remaining_bytes, and the behavior of
         metadata writeback in the face of IO errors.
      
         Details:
         - fix memory leak in xfs_dir2_node_removename
         - fix quota assertion in xfs_setattr_size
         - fix quota assertions in xfs_qm_vop_create_dqattach
         - fix for hang when disabling group and project quotas before
           disabling user quotas
         - fix Dave Chinner's email address in MAINTAINERS
         - fix for file allocation alignment
         - fix for assertion in xfs_buf_stale by removing xfsbdstrat
         - fix for alignment with swalloc mount option
         - fix for "retry forever" semantics on IO errors"
      
      * tag 'xfs-for-linus-v3.13-rc5' of git://oss.sgi.com/xfs/xfs:
        xfs: abort metadata writeback on permanent errors
        xfs: swalloc doesn't align allocations properly
        xfs: remove xfsbdstrat error
        xfs: align initial file allocations correctly
        MAINTAINERS: fix incorrect mail address of XFS maintainer
        xfs: fix infinite loop by detaching the group/project hints from user dquot
        xfs: fix assertion failure at xfs_setattr_nonsize
        xfs: fix false assertion at xfs_qm_vop_create_dqattach
        xfs: fix memory leak in xfs_dir2_node_removename
    • mm: fix build of split ptlock code · 40b64acd
      Olof Johansson authored
      Commit 597d795a ('mm: do not allocate page->ptl dynamically, if
      spinlock_t fits to long') restructures some allocators that are compiled
      even if USE_SPLIT_PTLOCKS isn't used.  It results in compilation
      failure:
      
        mm/memory.c:4282:6: error: 'struct page' has no member named 'ptl'
        mm/memory.c:4288:12: error: 'struct page' has no member named 'ptl'
      
      Add in the missing ifdef.
      
      Fixes: 597d795a ('mm: do not allocate page->ptl dynamically, if spinlock_t fits to long')
      Signed-off-by: Olof Johansson <olof@lixom.net>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'arc-fixes-for-3.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc · 4773ef22
      Linus Torvalds authored
      Pull ARC fix from Vineet Gupta:
       "Fix busted syscall table due to unistd header inclusion issue"
      
      * tag 'arc-fixes-for-3.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
        ARC: Allow conditional multiple inclusion of uapi/asm/unistd.h
    • Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · a81ce79b
      Linus Torvalds authored
      Pull arm64 ptrace fix from Catalin Marinas.
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: ptrace: avoid using HW_BREAKPOINT_EMPTY for disabled events
    • pstore: Don't allow high traffic options on fragile devices · df36ac1b
      Luck, Tony authored
      Some pstore backing devices use on board flash as persistent
      storage. These have limited numbers of write cycles so it
      is a poor idea to use them for high frequency operations.
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'dmaengine-fixes-3.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine · eaadcfeb
      Linus Torvalds authored
      
      Pull dmaengine fixes from Dan Williams:
      
       - deprecation of net_dma to be removed in 3.14
      
       - crash regression fix in pl330 from the dmaengine_unmap rework
      
       - crash regression fix for any channel running raid ops without
         CONFIG_ASYNC_TX_DMA from dmaengine_unmap
      
       - memory leak regression in mv_xor from dmaengine_unmap
      
       - build warning regressions in mv_xor, fsldma, ppc4xx, txx9, and
         at_hdmac from dmaengine_unmap
      
       - sleep in atomic regression in dma_async_memcpy_pg_to_pg
      
       - new fix in mv_xor for handling channel initialization failures
      
      * tag 'dmaengine-fixes-3.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine:
        net_dma: mark broken
        dma: pl330: ensure DMA descriptors are zero-initialised
        dmaengine: fix sleep in atomic
        dmaengine: mv_xor: fix oops when channels fail to initialise
        dma: mv_xor: Use dmaengine_unmap_data for the self-tests
        dmaengine: fix enable for high order unmap pools
        dma: fix build warnings in txx9
        dmatest: fix build warning on mips
        dma: fix fsldma build warnings
        dma: fix build warnings in ppc4xx
        dmaengine: at_hdmac: remove unused function
        dma: mv_xor: remove mv_desc_get_dest_addr()
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 46dd0835
      Linus Torvalds authored
      Pull KVM fixes from Paolo Bonzini:
       "The PPC folks had a large amount of changes queued for 3.13, and now
        they are fixing the bugs"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: PPC: Book3S HV: Don't drop low-order page address bits
        powerpc: book3s: kvm: Don't abuse host r2 in exit path
        powerpc/kvm/booke: Fix build break due to stack frame size warning
        KVM: PPC: Book3S: PR: Enable interrupts earlier
        KVM: PPC: Book3S: PR: Make svcpu -> vcpu store preempt savvy
        KVM: PPC: Book3S: PR: Export kvmppc_copy_to|from_svcpu
        KVM: PPC: Book3S: PR: Don't clobber our exit handler id
        powerpc: kvm: fix rare but potential deadlock scene
        KVM: PPC: Book3S HV: Take SRCU read lock around kvm_read_guest() call
        KVM: PPC: Book3S HV: Make tbacct_lock irq-safe
        KVM: PPC: Book3S HV: Refine barriers in guest entry/exit
        KVM: PPC: Book3S HV: Fix physical address calculations
    • mm: do not allocate page->ptl dynamically, if spinlock_t fits to long · 597d795a
      Kirill A. Shutemov authored
      In struct page we have enough space to fit long-size page->ptl there,
      but we use dynamically-allocated page->ptl if sizeof(spinlock_t) is larger
      than sizeof(int).
      
      It hurts 64-bit architectures with CONFIG_GENERIC_LOCKBREAK, where
      sizeof(spinlock_t) == 8, but it easily fits into struct page.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
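
The sizing rule changed by this commit can be illustrated with a small userspace sketch. Everything below is a hypothetical stand-in rather than kernel code: `struct demo_lock` imitates an 8-byte lock, and the two helpers only contrast the old threshold (`sizeof(int)`) with the new one (`sizeof(long)`) described in the message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a lock made of two 32-bit words, similar in
 * size to the 8-byte spinlock_t the commit message mentions. */
struct demo_lock {
    unsigned int slock;
    unsigned int break_lock;
};

/* Old rule (assumed from the message): allocate the lock dynamically as
 * soon as it is larger than an int. */
static bool needs_dynamic_ptl_old(size_t lock_size)
{
    return lock_size > sizeof(int);
}

/* New rule (assumed from the message): allocate dynamically only when
 * the lock does not fit into a long. */
static bool needs_dynamic_ptl_new(size_t lock_size)
{
    return lock_size > sizeof(long);
}
```

On a platform where `long` is 8 bytes, `needs_dynamic_ptl_old(sizeof(struct demo_lock))` is true while `needs_dynamic_ptl_new(sizeof(struct demo_lock))` is false, which is the case the commit is meant to improve.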
    • drivers: base: Add prototype declaration to the header file · 41f10726
      Rashika Kheria authored
      Add prototype declaration of function memory_block_size_bytes() to
      the header file include/linux/memory.h.
      
      This eliminates the following warning in memory.c:
      drivers/base/memory.c:87:1: warning: no previous prototype for ‘memory_block_size_bytes’ [-Wmissing-prototypes]
      Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
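
As a rough illustration of the pattern, a non-static function that is defined without a visible earlier declaration is what `-Wmissing-prototypes` reports; declaring it first (normally in a shared header) silences the warning. The function name and return value below are hypothetical and are not taken from the kernel sources.

```c
#include <assert.h>

/* What would normally live in a header: the prototype that makes the
 * function's signature visible before its definition. */
unsigned long demo_block_size_bytes(void);

/* What would normally live in the .c file: the definition.  Because the
 * prototype above precedes it, -Wmissing-prototypes has nothing to
 * report.  The value is an arbitrary example (128 MiB). */
unsigned long demo_block_size_bytes(void)
{
    return 128UL << 20;
}
```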
    • mm: page_alloc: revert NUMA aspect of fair allocation policy · fff4068c
      Johannes Weiner authored
      Commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") meant
      to bring aging fairness among zones in system, but it was overzealous
      and badly regressed basic workloads on NUMA systems.
      
      Due to the way kswapd and page allocator interact, we still want to
      make sure that all zones in any given node are used equally for all
      allocations to maximize memory utilization and prevent thrashing on the
      highest zone in the node.
      
      While the same principle applies to NUMA nodes - memory utilization is
      obviously improved by spreading allocations throughout all nodes -
      remote references can be costly and so many workloads prefer locality
      over memory utilization.  The original change assumed that
      zone_reclaim_mode would be a good enough predictor for that, but it
      turned out to be as indicative as a coin flip.
      
      Revert the NUMA aspect of the fairness until we can find a proper way to
      make it configurable and agree on a sane default.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: <stable@kernel.org> # 3.12
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "mm: page_alloc: exclude unreclaimable allocations from zone fairness policy" · 8798cee2
      Mel Gorman authored
      This reverts commit 73f038b8.  The NUMA behaviour of this patch is
      less than ideal.  An alternative approach is to interleave allocations
      only within local zones which is implemented in the next patch.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: Fix NULL pointer dereference in madvise(MADV_WILLNEED) support · ee53664b
      Kirill A. Shutemov authored
      Sasha Levin found a NULL pointer dereference that is due to a missing
      page table lock, which in turn is due to the pmd entry in question being
      a transparent huge-table entry.
      
      The code - introduced in commit 1998cc04 ("mm: make
      madvise(MADV_WILLNEED) support swap file prefetch") - correctly checks
      for this situation using pmd_none_or_trans_huge_or_clear_bad(), but it
      turns out that that function doesn't work correctly.
      
      pmd_none_or_trans_huge_or_clear_bad() expected that pmd_bad() would
      trigger if the transparent hugepage bit was set, but it doesn't do that
      if pmd_numa() is also set. Note that the NUMA bit only gets set on real
      NUMA machines, so people trying to reproduce this on most normal
      development systems would never actually trigger this.
      
      Fix it by removing the very subtle (and subtly incorrect) expectation,
      and instead just checking pmd_trans_huge() explicitly.
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Acked-by: Andrea Arcangeli <aarcange@redhat.com>
      [ Additionally remove the now stale test for pmd_trans_huge() inside the
        pmd_bad() case - Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
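
The failure mode described in this message can be modelled with a toy userspace sketch. The flag values and `demo_*` helpers are hypothetical and do not reflect the kernel's real pmd layout; `demo_bad()` simply encodes the behaviour the message reports, namely that the "bad" test fires for a huge entry unless the NUMA bit is also set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for a toy page-table entry. */
enum {
    DEMO_PRESENT = 1 << 0,
    DEMO_HUGE    = 1 << 1,
    DEMO_NUMA    = 1 << 2,
};

static bool demo_none(unsigned int e)
{
    return e == 0;
}

static bool demo_trans_huge(unsigned int e)
{
    return (e & DEMO_HUGE) != 0;
}

/* Models the reported behaviour: a huge entry counts as "bad" only when
 * the NUMA bit is clear. */
static bool demo_bad(unsigned int e)
{
    return (e & DEMO_HUGE) != 0 && (e & DEMO_NUMA) == 0;
}

/* Old check: relies on demo_bad() to catch huge entries implicitly. */
static bool skip_old(unsigned int e)
{
    return demo_none(e) || demo_bad(e);
}

/* New check: tests for a huge entry explicitly before the "bad" test. */
static bool skip_new(unsigned int e)
{
    return demo_none(e) || demo_trans_huge(e) || demo_bad(e);
}
```

For an entry with both `DEMO_HUGE` and `DEMO_NUMA` set, `skip_old()` returns false and the caller would carry on as if the entry were a normal page table, whereas `skip_new()` returns true.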
    • Merge tag 'renesas-fixes-for-v3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/horms/renesas into fixes · 95fcfa70
      Kevin Hilman authored
      
      From Simon Horman:
      Renesas ARM based SoC fixes for v3.13
      
      * r8a7790 (R-Car H2) SoC
        - Correct GPIO resources in DT.
      
          This problem has been present since GPIOs were added to the r8a7790 SoC
          by f98e10c8 ("ARM: shmobile: r8a7790: Add GPIO controller
          devices to device tree") in v3.12-rc1.
      
      * irqchip renesas-intc-irqpin
        - Correct register bitfield shift calculation
      
          This bug has been present since the renesas-intc-irqpin driver was
          introduced by 44358048 ("irqchip: Renesas INTC External IRQ pin
          driver") in v3.10-rc1
      
      * Lager board
        - Do not build the phy fixup unless CONFIG_PHYLIB is enabled
      
          This problem was introduced by 48c8b96f
      
      * tag 'renesas-fixes-for-v3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/horms/renesas:
        ARM: shmobile: r8a7790: Fix GPIO resources in DTS
        irqchip: renesas-intc-irqpin: Fix register bitfield shift calculation
        ARM: shmobile: lager: phy fixup needs CONFIG_PHYLIB
      Signed-off-by: Kevin Hilman <khilman@linaro.org>
    • Merge tag 'signed-for-3.13' of git://github.com/agraf/linux-2.6 into kvm-master · 5e6d26cf
      Paolo Bonzini authored
      Patch queue for 3.13 - 2013-12-18
      
      This fixes some grave issues we've only found after 3.13-rc1:
      
        - Make the modularized HV/PR book3s kvm work well as modules
        - Fix some race conditions
        - Fix compilation with certain compilers (booke)
        - Fix THP for book3s_hv
        - Fix preemption for book3s_pr
      
      Alexander Graf (4):
            KVM: PPC: Book3S: PR: Don't clobber our exit handler id
            KVM: PPC: Book3S: PR: Export kvmppc_copy_to|from_svcpu
            KVM: PPC: Book3S: PR: Make svcpu -> vcpu store preempt savvy
            KVM: PPC: Book3S: PR: Enable interrupts earlier
      
      Aneesh Kumar K.V (1):
            powerpc: book3s: kvm: Don't abuse host r2 in exit path
      
      Paul Mackerras (5):
            KVM: PPC: Book3S HV: Fix physical address calculations
            KVM: PPC: Book3S HV: Refine barriers in guest entry/exit
            KVM: PPC: Book3S HV: Make tbacct_lock irq-safe
            KVM: PPC: Book3S HV: Take SRCU read lock around kvm_read_guest() call
            KVM: PPC: Book3S HV: Don't drop low-order page address bits
      
      Scott Wood (1):
            powerpc/kvm/booke: Fix build break due to stack frame size warning
      
      pingfan liu (1):
            powerpc: kvm: fix rare but potential deadlock scene
    • Merge tag 'stable/for-linus-3.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · 4203d0eb
      Linus Torvalds authored
      Pull Xen bugfixes from Konrad Rzeszutek Wilk:
       - Fix balloon driver for auto-translate guests (PVHVM, ARM) to not use
         scratch pages.
       - Fix block API header for ARM32 and ARM64 to have proper layout
       - On ARM when mapping guests, stick on PTE_SPECIAL
       - When using SWIOTLB under ARM, don't call swiotlb functions twice
       - When unmapping guests memory and if we fail, don't return pages which
         failed to be unmapped.
       - Grant driver was using the wrong address on ARM.
      
      * tag 'stable/for-linus-3.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
        xen/balloon: Seperate the auto-translate logic properly (v2)
        xen/block: Correctly define structures in public headers on ARM32 and ARM64
        arm: xen: foreign mapping PTEs are special.
        xen/arm64: do not call the swiotlb functions twice
        xen: privcmd: do not return pages which we have failed to unmap
        XEN: Grant table address, xen_hvm_resume_frames, is a phys_addr not a pfn