1. 06 Dec, 2014 9 commits
  2. 21 Nov, 2014 31 commits
    • Greg Kroah-Hartman's avatar
      Linux 3.10.61 · 252f23ea
      Greg Kroah-Hartman authored
      252f23ea
    • Johannes Weiner's avatar
      mm: memcg: handle non-error OOM situations more gracefully · f8a51179
      Johannes Weiner authored
      commit 49426420 upstream.
      
      Commit 3812c8c8 ("mm: memcg: do not trap chargers with full
      callstack on OOM") assumed that only a few places that can trigger a
      memcg OOM situation do not return VM_FAULT_OOM, like optional page cache
      readahead.  But there are many more and it's impractical to annotate
      them all.
      
      First of all, we don't want to invoke the OOM killer when the failed
      allocation is gracefully handled, so defer the actual kill to the end of
      the fault handling as well.  This simplifies the code quite a bit as an
      added bonus.
      
      Second, since a failed allocation might not be the abrupt end of the
      fault, the memcg OOM handler needs to be re-entrant until the fault
      finishes for subsequent allocation attempts.  If an allocation is
      attempted after the task already OOMed, allow it to bypass the limit so
      that it can quickly finish the fault and invoke the OOM killer.
      Reported-by: azurIt <azurit@pobox.sk>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f8a51179
    • Johannes Weiner's avatar
      mm: memcg: do not trap chargers with full callstack on OOM · f79d6a46
      Johannes Weiner authored
      commit 3812c8c8 upstream.
      
      The memcg OOM handling is incredibly fragile and can deadlock.  When a
      task fails to charge memory, it invokes the OOM killer and loops right
      there in the charge code until it succeeds.  Comparably, any other task
      that enters the charge path at this point will go to a waitqueue right
      then and there and sleep until the OOM situation is resolved.  The problem
      is that these tasks may hold filesystem locks and the mmap_sem; locks that
      the selected OOM victim may need to exit.
      
      For example, in one reported case, the task invoking the OOM killer was
      about to charge a page cache page during a write(), which holds the
      i_mutex.  The OOM killer selected a task that was just entering truncate()
      and trying to acquire the i_mutex:
      
      OOM invoking task:
        mem_cgroup_handle_oom+0x241/0x3b0
        mem_cgroup_cache_charge+0xbe/0xe0
        add_to_page_cache_locked+0x4c/0x140
        add_to_page_cache_lru+0x22/0x50
        grab_cache_page_write_begin+0x8b/0xe0
        ext3_write_begin+0x88/0x270
        generic_file_buffered_write+0x116/0x290
        __generic_file_aio_write+0x27c/0x480
        generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
        do_sync_write+0xea/0x130
        vfs_write+0xf3/0x1f0
        sys_write+0x51/0x90
        system_call_fastpath+0x18/0x1d
      
      OOM kill victim:
        do_truncate+0x58/0xa0              # takes i_mutex
        do_last+0x250/0xa30
        path_openat+0xd7/0x440
        do_filp_open+0x49/0xa0
        do_sys_open+0x106/0x240
        sys_open+0x20/0x30
        system_call_fastpath+0x18/0x1d
      
      The OOM handling task will retry the charge indefinitely while the OOM
      killed task is not releasing any resources.
      
      A similar scenario can happen when the kernel OOM killer for a memcg is
      disabled and a userspace task is in charge of resolving OOM situations.
      In this case, ALL tasks that enter the OOM path will be made to sleep on
      the OOM waitqueue and wait for userspace to free resources or increase
      the group's limit.  But a userspace OOM handler is prone to deadlock
      itself on the locks held by the waiting tasks.  For example one of the
      sleeping tasks may be stuck in a brk() call with the mmap_sem held for
      writing but the userspace handler, in order to pick an optimal victim,
      may need to read files from /proc/<pid>, which tries to acquire the same
      mmap_sem for reading and deadlocks.
      
      This patch changes the way tasks behave after detecting a memcg OOM and
      makes sure nobody loops or sleeps with locks held:
      
      1. When OOMing in a user fault, invoke the OOM killer and restart the
         fault instead of looping on the charge attempt.  This way, the OOM
         victim can not get stuck on locks the looping task may hold.
      
      2. When OOMing in a user fault but somebody else is handling it
         (either the kernel OOM killer or a userspace handler), don't go to
         sleep in the charge context.  Instead, remember the OOMing memcg in
         the task struct and then fully unwind the page fault stack with
         -ENOMEM.  pagefault_out_of_memory() will then call back into the
         memcg code to check if the -ENOMEM came from the memcg, and then
         either put the task to sleep on the memcg's OOM waitqueue or just
         restart the fault.  The OOM victim can no longer get stuck on any
         lock a sleeping task may hold.
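      A minimal sketch of that two-step flow (the task field and charge helper
      names are illustrative, not the exact upstream interface):

        /* charge path: only record the OOMing memcg, never sleep here */
        if (user_fault)
                current->memcg_in_oom = memcg;  /* illustrative task field */
        return -ENOMEM;                         /* unwind; locks get dropped */

        /* in pagefault_out_of_memory(), after the stack is fully unwound */
        if (mem_cgroup_oom_synchronize())       /* sleep or kill for that memcg */
                return;
        /* otherwise fall through to the global OOM path */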
      
      Debugged by Michal Hocko.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: azurIt <azurit@pobox.sk>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f79d6a46
    • Johannes Weiner's avatar
      mm: memcg: rework and document OOM waiting and wakeup · 7a147e0c
      Johannes Weiner authored
      commit fb2a6fc5 upstream.
      
      The memcg OOM handler open-codes a sleeping lock for OOM serialization
      (trylock, wait, repeat) because the required locking is so specific to
      memcg hierarchies.  However, it would be nice if this construct would be
      clearly recognizable and not be as obfuscated as it is right now.  Clean
      up as follows:
      
      1. Remove the return value of mem_cgroup_oom_unlock()
      
      2. Rename mem_cgroup_oom_lock() to mem_cgroup_oom_trylock().
      
      3. Pull the prepare_to_wait() out of the memcg_oom_lock scope.  This
         makes it more obvious that the task has to be on the waitqueue
         before attempting to OOM-trylock the hierarchy, to not miss any
         wakeups before going to sleep.  It just didn't matter until now
         because it was all lumped together into the global memcg_oom_lock
         spinlock section.
      
      4. Pull the mem_cgroup_oom_notify() out of the memcg_oom_lock scope.
         It is protected by the hierarchical OOM-lock.
      
      5. The memcg_oom_lock spinlock is only required to propagate the OOM
         lock in any given hierarchy atomically.  Restrict its scope to
         mem_cgroup_oom_(trylock|unlock).
      
      6. Do not wake up the waitqueue unconditionally at the end of the
         function.  Only the lockholder has to wake up the next in line
         after releasing the lock.
      
         Note that the lockholder kicks off the OOM-killer, which in turn
         leads to wakeups from the uncharges of the exiting task.  But a
         contender is not guaranteed to see them if it enters the OOM path
         after the OOM kills but before the lockholder releases the lock.
         Thus there has to be an explicit wakeup after releasing the lock.
      
      7. Put the OOM task on the waitqueue before marking the hierarchy as
         under OOM as that is the point where we start to receive wakeups.
         No point in listening before being on the waitqueue.
      
      8. Likewise, unmark the hierarchy before finishing the sleep, for
         symmetry.
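      A minimal sketch of the waiter-side ordering from points 3 and 6 above
      (helper names simplified from the memcg code, not an exact copy):

        struct oom_wait_info owait = { .memcg = memcg };

        init_waitqueue_entry(&owait.wait, current);
        prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);

        if (mem_cgroup_oom_trylock(memcg)) {
                /* lockholder: kick the OOM killer, then wake the next waiter */
                mem_cgroup_out_of_memory(memcg, mask, order);
                mem_cgroup_oom_unlock(memcg);
                memcg_wakeup_oom(memcg);
        } else {
                schedule();             /* cannot miss a wakeup: already queued */
        }

        finish_wait(&memcg_oom_waitq, &owait.wait);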
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7a147e0c
    • Johannes Weiner's avatar
      mm: memcg: enable memcg OOM killer only for user faults · 11f34787
      Johannes Weiner authored
      commit 519e5247 upstream.
      
      System calls and kernel faults (uaccess, gup) can handle an out of memory
      situation gracefully and just return -ENOMEM.
      
      Enable the memcg OOM killer only for user faults, where it's really the
      only option available.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      11f34787
    • Johannes Weiner's avatar
      x86: finish user fault error path with fatal signal · ed368ae7
      Johannes Weiner authored
      commit 3a13c4d7 upstream.
      
      The x86 fault handler bails in the middle of error handling when the
      task has a fatal signal pending.  For a subsequent patch this is a
      problem in OOM situations because it relies on pagefault_out_of_memory()
      being called even when the task has been killed, to perform proper
      per-task OOM state unwinding.
      
      Shortcutting the fault like this is a rather minor optimization that
      saves a few instructions in rare cases.  Just remove it for
      user-triggered faults.
      
      Use the opportunity to split the fault retry handling from actual fault
      errors and add locking documentation that reads surprisingly similar to
      ARM's.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ed368ae7
    • Johannes Weiner's avatar
      arch: mm: pass userspace fault flag to generic fault handler · e2ec2c2b
      Johannes Weiner authored
      commit 759496ba upstream.
      
      Unlike global OOM handling, memory cgroup code will invoke the OOM killer
      in any OOM situation because it has no way of telling faults occurring in
      kernel context - which could be handled more gracefully - from
      user-triggered faults.
      
      Pass a flag that identifies faults originating in user space from the
      architecture-specific fault handlers to generic code so that memcg OOM
      handling can be improved.
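      In sketch form, the per-architecture change boils down to the following
      (simplified, not any specific architecture's actual diff):

        unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

        if (user_mode(regs))
                flags |= FAULT_FLAG_USER;       /* the new flag */

        fault = handle_mm_fault(mm, vma, address, flags);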
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e2ec2c2b
    • Johannes Weiner's avatar
      arch: mm: do not invoke OOM killer on kernel fault OOM · 086c6cc5
      Johannes Weiner authored
      commit 87134102 upstream.
      
      Kernel faults are expected to handle OOM conditions gracefully (gup,
      uaccess etc.), so they should never invoke the OOM killer.  Reserve this
      for faults triggered in user context when it is the only option.
      
      Most architectures already do this, fix up the remaining few.
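      Sketched out, the OOM branch of an arch fault handler ends up looking
      like this (illustrative; no_context() stands in for the arch's kernel
      fixup path):

        if (fault & VM_FAULT_OOM) {
                up_read(&mm->mmap_sem);
                if (!user_mode(regs)) {
                        /* kernel fault (gup, uaccess): fail it, never OOM-kill */
                        no_context(regs, address);
                        return;
                }
                /* user fault: the OOM killer is the only option left */
                pagefault_out_of_memory();
                return;
        }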
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      086c6cc5
    • Johannes Weiner's avatar
      arch: mm: remove obsolete init OOM protection · 20c92c01
      Johannes Weiner authored
      commit 94bce453 upstream.
      
      The memcg code can trap tasks in the context of the failing allocation
      until an OOM situation is resolved.  They can hold all kinds of locks
      (fs, mm) at this point, which makes it prone to deadlocking.
      
      This series converts memcg OOM handling into a two step process that is
      started in the charge context, but any waiting is done after the fault
      stack is fully unwound.
      
      Patches 1-4 prepare architecture handlers to support the new memcg
      requirements, but in doing so they also remove old cruft and unify
      out-of-memory behavior across architectures.
      
      Patch 5 disables the memcg OOM handling for syscalls, readahead, kernel
      faults, because they can gracefully unwind the stack with -ENOMEM.  OOM
      handling is restricted to user triggered faults that have no other
      option.
      
      Patch 6 reworks memcg's hierarchical OOM locking to make it a little
      more obvious what is going on in there: reduce locked regions, rename
      locking functions, reorder and document.
      
      Patch 7 implements the two-part OOM handling such that tasks are never
      trapped with the full charge stack in an OOM situation.
      
      This patch:
      
      Back before smart OOM killing, when faulting tasks were killed directly on
      allocation failures, the arch-specific fault handlers needed special
      protection for the init process.
      
      Now that all fault handlers call into the generic OOM killer (see commit
      609838cf: "mm: invoke oom-killer from remaining unconverted page
      fault handlers"), which already provides init protection, the
      arch-specific leftovers can be removed.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Acked-by: Vineet Gupta <vgupta@synopsys.com>	[arch/arc bits]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      20c92c01
    • Johannes Weiner's avatar
      mm: invoke oom-killer from remaining unconverted page fault handlers · b13a714f
      Johannes Weiner authored
      commit 609838cf upstream.
      
      A few remaining architectures directly kill the page faulting task in an
      out of memory situation.  This is usually not a good idea since that
      task might not even use a significant amount of memory and so may not be
      the optimal victim to resolve the situation.
      
      Since 2.6.29's 1c0fe6e3 ("mm: invoke oom-killer from page fault") there
      is a hook that architecture page fault handlers are supposed to call to
      invoke the OOM killer and let it pick the right task to kill.  Convert
      the remaining architectures over to this hook.
      
      To have the previous behavior of simply taking out the faulting task the
      vm.oom_kill_allocating_task sysctl can be set to 1.
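      As a rough before/after sketch of such a conversion (not any particular
      architecture's diff):

        /* before: take out the faulting task directly */
        if (fault & VM_FAULT_OOM) {
                printk(KERN_ERR "VM: killing process %s\n", current->comm);
                do_group_exit(SIGKILL);
        }

        /* after: let the generic OOM killer pick a suitable victim */
        if (fault & VM_FAULT_OOM) {
                up_read(&mm->mmap_sem);
                pagefault_out_of_memory();
                return;
        }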
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Vineet Gupta <vgupta@synopsys.com>   [arch/arc bits]
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Chen Liqin <liqin.chen@sunplusct.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b13a714f
    • Daniel Borkmann's avatar
      net: sctp: fix skb_over_panic when receiving malformed ASCONF chunks · cda702df
      Daniel Borkmann authored
      commit 9de7922b upstream.
      
      Commit 6f4c618d ("SCTP : Add paramters validity check for
      ASCONF chunk") added basic verification of ASCONF chunks, however,
      it is still possible to remotely crash a server by sending a
      special crafted ASCONF chunk, even up to pre 2.6.12 kernels:
      
      skb_over_panic: text:ffffffffa01ea1c3 len:31056 put:30768
       head:ffff88011bd81800 data:ffff88011bd81800 tail:0x7950
       end:0x440 dev:<NULL>
       ------------[ cut here ]------------
      kernel BUG at net/core/skbuff.c:129!
      [...]
      Call Trace:
       <IRQ>
       [<ffffffff8144fb1c>] skb_put+0x5c/0x70
       [<ffffffffa01ea1c3>] sctp_addto_chunk+0x63/0xd0 [sctp]
       [<ffffffffa01eadaf>] sctp_process_asconf+0x1af/0x540 [sctp]
       [<ffffffff8152d025>] ? _read_unlock_bh+0x15/0x20
       [<ffffffffa01e0038>] sctp_sf_do_asconf+0x168/0x240 [sctp]
       [<ffffffffa01e3751>] sctp_do_sm+0x71/0x1210 [sctp]
       [<ffffffff8147645d>] ? fib_rules_lookup+0xad/0xf0
       [<ffffffffa01e6b22>] ? sctp_cmp_addr_exact+0x32/0x40 [sctp]
       [<ffffffffa01e8393>] sctp_assoc_bh_rcv+0xd3/0x180 [sctp]
       [<ffffffffa01ee986>] sctp_inq_push+0x56/0x80 [sctp]
       [<ffffffffa01fcc42>] sctp_rcv+0x982/0xa10 [sctp]
       [<ffffffffa01d5123>] ? ipt_local_in_hook+0x23/0x28 [iptable_filter]
       [<ffffffff8148bdc9>] ? nf_iterate+0x69/0xb0
       [<ffffffff81496d10>] ? ip_local_deliver_finish+0x0/0x2d0
       [<ffffffff8148bf86>] ? nf_hook_slow+0x76/0x120
       [<ffffffff81496d10>] ? ip_local_deliver_finish+0x0/0x2d0
       [<ffffffff81496ded>] ip_local_deliver_finish+0xdd/0x2d0
       [<ffffffff81497078>] ip_local_deliver+0x98/0xa0
       [<ffffffff8149653d>] ip_rcv_finish+0x12d/0x440
       [<ffffffff81496ac5>] ip_rcv+0x275/0x350
       [<ffffffff8145c88b>] __netif_receive_skb+0x4ab/0x750
       [<ffffffff81460588>] netif_receive_skb+0x58/0x60
      
      This can be triggered e.g., through a simple scripted nmap
      connection scan injecting the chunk after the handshake, for
      example, ...
      
        -------------- INIT[ASCONF; ASCONF_ACK] ------------->
        <----------- INIT-ACK[ASCONF; ASCONF_ACK] ------------
        -------------------- COOKIE-ECHO -------------------->
        <-------------------- COOKIE-ACK ---------------------
        ------------------ ASCONF; UNKNOWN ------------------>
      
      ... where ASCONF chunk of length 280 contains 2 parameters ...
      
        1) Add IP address parameter (param length: 16)
        2) Add/del IP address parameter (param length: 255)
      
      ... followed by an UNKNOWN chunk of e.g. 4 bytes. Here, the
      Address Parameter in the ASCONF chunk is even missing, too.
      This is just an example and similarly-crafted ASCONF chunks
      could be used just as well.
      
      The ASCONF chunk passes through sctp_verify_asconf() as all
      parameters passed sanity checks, and after walking, we ended
      up successfully at the chunk end boundary, and thus may invoke
      sctp_process_asconf(). Parameter walking is done with
      WORD_ROUND() to take padding into account.
      
      In sctp_process_asconf()'s TLV processing, we may fail in
      sctp_process_asconf_param() e.g., due to removal of the IP
      address that is also the source address of the packet containing
      the ASCONF chunk, and thus we need to add all TLVs after the
      failure to our ASCONF response to remote via helper function
      sctp_add_asconf_response(), which basically invokes a
      sctp_addto_chunk() adding the error parameters to the given
      skb.
      
      When walking to the next parameter this time, we proceed
      with ...
      
        length = ntohs(asconf_param->param_hdr.length);
        asconf_param = (void *)asconf_param + length;
      
      ... instead of the WORD_ROUND()'ed length, thus resulting here
      in an off-by-one that leads to reading the follow-up garbage
      parameter length of 12336, and thus throwing an skb_over_panic
      for the reply when trying to sctp_addto_chunk() next time,
      which implicitly calls the skb_put() with that length.
      
      Fix it by using sctp_walk_params() [ which is also used in
      INIT parameter processing ] macro in the verification *and*
      in ASCONF processing: it will make sure we don't spill over,
      that we walk parameters WORD_ROUND()'ed. Moreover, we're being
      more defensive and guard against unknown parameter types and
      missized addresses.
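      In sketch form, the difference in the walking step is:

        /* broken walk: the raw length ignores the 4-byte TLV padding */
        length = ntohs(asconf_param->param_hdr.length);
        asconf_param = (void *)asconf_param + length;

        /* fixed idea: advance by the padded length and stay inside the chunk,
         * which is what the sctp_walk_params() macro does per parameter */
        length = WORD_ROUND(ntohs(asconf_param->param_hdr.length));
        if ((void *)asconf_param + length > chunk_end)
                break;
        asconf_param = (void *)asconf_param + length;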
      
      Joint work with Vlad Yasevich.
      
      Fixes: b896b82b ("[SCTP] ADDIP: Support for processing incoming ASCONF_ACK chunks.")
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Vlad Yasevich <vyasevich@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cda702df
    • Daniel Borkmann's avatar
      net: sctp: fix panic on duplicate ASCONF chunks · 33291255
      Daniel Borkmann authored
      commit b69040d8 upstream.
      
      When receiving e.g. a semi-well-formed connection scan in the
      form of ...
      
        -------------- INIT[ASCONF; ASCONF_ACK] ------------->
        <----------- INIT-ACK[ASCONF; ASCONF_ACK] ------------
        -------------------- COOKIE-ECHO -------------------->
        <-------------------- COOKIE-ACK ---------------------
        ---------------- ASCONF_a; ASCONF_b ----------------->
      
      ... where ASCONF_a equals ASCONF_b chunk (at least both serials
      need to be equal), we panic an SCTP server!
      
      The problem is that good-formed ASCONF chunks that we reply with
      ASCONF_ACK chunks are cached per serial. Thus, when we receive a
      same ASCONF chunk twice (e.g. through a lost ASCONF_ACK), we do
      not need to process them again on the server side (that was the
      idea, also proposed in the RFC). Instead, we know it was cached
      and we just resend the cached chunk instead. So far, so good.
      
      Where things get nasty is in SCTP's side effect interpreter, that
      is, sctp_cmd_interpreter():
      
      While incoming ASCONF_a (chunk = event_arg) is being marked
      !end_of_packet and !singleton, and we have an association context,
      we do not flush the outqueue the first time after processing the
      ASCONF_ACK singleton chunk via SCTP_CMD_REPLY. Instead, we keep it
      queued up, although we set local_cork to 1. Commit 2e3216cd
      changed the precedence, so that as long as we receive bundled incoming
      chunks, we try possible bundling on the outgoing queue as well. Before
      this commit, we would just flush the output queue.
      
      Now, while ASCONF_a's ASCONF_ACK sits in the corked outq, we
      continue to process the same ASCONF_b chunk from the packet. As
      we have cached the previous ASCONF_ACK, we find it, grab it and
      do another SCTP_CMD_REPLY command on it. So, effectively, we rip
      the chunk->list pointers and requeue the same ASCONF_ACK chunk
      another time. Since we process ASCONF_b, it is correctly marked
      with end_of_packet and we enforce an uncork, and thus a flush,
      crashing the kernel.
      
      Fix it by testing if the ASCONF_ACK is currently pending and if
      that is the case, do not requeue it. When flushing the output
      queue we may relink the chunk for preparing an outgoing packet,
      but eventually unlink it when it's copied into the skb right
      before transmission.
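      A minimal sketch of the guard described above (illustrative; the exact
      check in the fix may be placed differently):

        asconf_ack = sctp_assoc_lookup_asconf_ack(asoc, hdr->serial);
        if (asconf_ack) {
                if (!list_empty(&asconf_ack->list))
                        /* already sitting on the corked outqueue: don't relink */
                        return SCTP_DISPOSITION_DISCARD;
                sctp_add_cmd_sf(commands, SCTP_CMD_REPLY, SCTP_CHUNK(asconf_ack));
        }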
      
      Joint work with Vlad Yasevich.
      
      Fixes: 2e3216cd ("sctp: Follow security requirement of responding with 1 packet")
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      33291255
    • Daniel Borkmann's avatar
      net: sctp: fix remote memory pressure from excessive queueing · bf53932b
      Daniel Borkmann authored
      commit 26b87c78 upstream.
      
      This scenario is not limited to ASCONF, just taken as one
      example triggering the issue. When receiving ASCONF probes
      in the form of ...
      
        -------------- INIT[ASCONF; ASCONF_ACK] ------------->
        <----------- INIT-ACK[ASCONF; ASCONF_ACK] ------------
        -------------------- COOKIE-ECHO -------------------->
        <-------------------- COOKIE-ACK ---------------------
        ---- ASCONF_a; [ASCONF_b; ...; ASCONF_n;] JUNK ------>
        [...]
        ---- ASCONF_m; [ASCONF_o; ...; ASCONF_z;] JUNK ------>
      
      ... where ASCONF_a, ASCONF_b, ..., ASCONF_z are good-formed
      ASCONFs and have increasing serial numbers, we process such
      ASCONF chunk(s) marked with !end_of_packet and !singleton,
      since we have not yet reached the SCTP packet end. SCTP does
      only do verification on a chunk by chunk basis, as an SCTP
      packet is nothing more than just a container of a stream of
      chunks which it eats up one by one.
      
      We could run into the case that we receive a packet with a
      malformed tail, above marked as trailing JUNK. All previous
      chunks here are well-formed, so the stack will eat up all
      previous chunks up to this point. In case JUNK does not fit
      into a chunk header and there are no more other chunks in
      the input queue, or in case JUNK contains a garbage chunk
      header, but the encoded chunk length would exceed the skb
      tail, or we came here from an entirely different scenario
      and the chunk has the pdiscard=1 mark (without having had a flush
      point), it can happen that we excessively queue up
      the association's output queue (a correct final chunk may
      then turn it into a response flood when flushing the
      queue ;)): I ran a simple script with incremental ASCONF
      serial numbers and could see the server side consuming
      excessive amount of RAM [before/after: up to 2GB and more].
      
      The issue at heart is that the chunk train basically ends
      with !end_of_packet and !singleton markers and since commit
      2e3216cd ("sctp: Follow security requirement of responding
      with 1 packet") therefore preventing an output queue flush
      point in sctp_do_sm() -> sctp_cmd_interpreter() on the input
      chunk (chunk = event_arg) even though local_cork is set,
      but its precedence has changed since then. In the normal
      case, the last chunk with end_of_packet=1 would trigger the
      queue flush to accommodate possible outgoing bundling.
      
      In the input queue, sctp_inq_pop() seems to do the right thing
      in terms of discarding invalid chunks. So, above JUNK will
      not enter the state machine and instead be released and exit
      the sctp_assoc_bh_rcv() chunk processing loop. It's simply
      the flush point being missing at loop exit. Adding a try-flush
      approach on the output queue might not work as the underlying
      infrastructure might be long gone at this point due to the
      side-effect interpreter run.
      
      One possibility, albeit a bit of a kludge, would be to defer
      invalid chunk freeing into the state machine in order to
      possibly trigger packet discards and thus indirectly a queue
      flush on error. It would surely be better to discard chunks
      as in the current, perhaps better controlled environment, but
      going back and forth, it's simply architecturally not possible.
      I tried various trailing JUNK attack cases and it seems to
      look good now.
      
      Joint work with Vlad Yasevich.
      
      Fixes: 2e3216cd ("sctp: Follow security requirement of responding with 1 packet")
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      bf53932b
    • Nadav Amit's avatar
      KVM: x86: Don't report guest userspace emulation error to userspace · c75f3949
      Nadav Amit authored
      commit a2b9e6c1 upstream.
      
      Commit fc3a9157 ("KVM: X86: Don't report L2 emulation failures to
      user-space") disabled the reporting of L2 (nested guest) emulation failures to
      userspace due to a race condition between a vmexit and the instruction emulator.
      The same rationale also applies to userspace applications that are permitted by
      the guest OS to access MMIO areas or perform PIO.
      
      This patch extends the current behavior - of injecting a #UD instead of
      reporting it to userspace - also for guest userspace code.
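      In sketch form, the emulation-failure policy becomes (simplified from
      the actual handler):

        if (!is_guest_mode(vcpu) && kvm_x86_ops->get_cpl(vcpu) == 0) {
                /* guest kernel mode (CPL 0), not nested: report to userspace */
                vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
                vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_EMULATION;
                return 0;                       /* exit to userspace */
        }
        /* nested guest or guest user space: inject #UD and keep running */
        kvm_queue_exception(vcpu, UD_VECTOR);
        return 1;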
      Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c75f3949
    • Tomas Henzl's avatar
      SCSI: hpsa: fix a race in cmd_free/scsi_done · 2e4ce498
      Tomas Henzl authored
      commit 2cc5bfaf upstream.
      
      When the driver calls scsi_done and after that frees its internal
      preallocated memory, it can happen that a new job is enqueued before
      the memory is freed.  The allocation fails and the message
      "cmd_alloc returned NULL" is shown.
      The patch below fixes it by moving cmd->scsi_done after cmd_free.
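      Schematically, the reordering looks like this:

        /* before: completing first opens a window where the midlayer can
         * issue a new command while this internal slot is still in use */
        cmd->scsi_done(cmd);
        cmd_free(h, c);

        /* after: return the preallocated slot first, then complete */
        cmd_free(h, c);
        cmd->scsi_done(cmd);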
      Signed-off-by: Tomas Henzl <thenzl@redhat.com>
      Acked-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
      Signed-off-by: James Bottomley <JBottomley@Parallels.com>
      Cc: Masoud Sharbiani <msharbiani@twitter.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2e4ce498
    • Eugenia Emantayev's avatar
      net/mlx4_en: Fix BlueFlame race · 50e0289d
      Eugenia Emantayev authored
      commit 2d4b6466 upstream.
      
      Fix a race between BlueFlame flow and stamping in post send flow.
      Example:
      	SW: Build WQE 0 on the TX buffer, except the ownership bit
      	SW: Set ownership for WQE 0 on the TX buffer
      	SW: Ring doorbell for WQE 0
      	SW: Build WQE 1 on the TX buffer, except the ownership bit
      	SW: Set ownership for WQE 1 on the TX buffer
      	HW: Read WQE 0 and then WQE 1, before doorbell was rung/BF was done for WQE 1
      	HW: Produce CQEs for WQE 0 and WQE 1
      	SW: Process the CQEs, and stamp WQE 0 and WQE 1 accordingly (on the TX buffer)
      	SW: Copy WQE 1 from the TX buffer to the BF register - ALREADY STAMPED!
      	HW: CQE error with index 0xFFFF  - the BF WQE's control segment is STAMPED,
      		so the BF index is 0xFFFF. Error: Invalid Opcode.
      As a result QP enters the error state and no traffic can be sent.
      
      Solution:
      When stamping - do not stamp last completed wqe.
      Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
      Signed-off-by: Amir Vadai <amirv@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Cc: Vinson Lee <vlee@twopensource.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      50e0289d
    • Ben Dooks's avatar
      ARM: Correct BUG() assembly to ensure it is endian-agnostic · 9f6bb0c2
      Ben Dooks authored
      commit 63328070 upstream.
      
      Currently BUG() uses .word or .hword to create the necessary illegal
      instructions. However if we are building BE8 then these get swapped
      by the linker into different illegal instructions in the text. This
      means that the BUG() macro does not get trapped properly.
      
      Change to using <asm/opcodes.h> to provide the necessary ARM instruction
      building as we cannot rely on gcc/gas having the `.inst` instructions
      which were added to try and resolve this issue (reported by Dave Martin
      <Dave.Martin@arm.com>).
      Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
      Reviewed-by: Dave Martin <Dave.Martin@arm.com>
      Cc: Wang Nan <wangnan0@huawei.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9f6bb0c2
    • Vince Weaver's avatar
      perf/x86/intel: Use proper dTLB-load-misses event on IvyBridge · f3c34e7e
      Vince Weaver authored
      commit 1996388e upstream.
      
      This was discussed back in February:
      
      	https://lkml.org/lkml/2014/2/18/956
      
      But I never saw a patch come out of it.
      
      On IvyBridge we share the SandyBridge cache event tables, but the
      dTLB-load-miss event is not compatible.  Patch it up after
      the fact to the proper DTLB_LOAD_MISSES.DEMAND_LD_MISS_CAUSES_A_WALK event.
      Signed-off-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1407141528200.17214@vincent-weaver-1.umelst.maine.edu
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: Hou Pengyang <houpengyang@huawei.com>
      f3c34e7e
    • Alexander Usyskin's avatar
      mei: bus: fix possible boundaries violation · ba8beb4c
      Alexander Usyskin authored
      commit cfda2794 upstream.
      
      The function 'strncpy' may fill the whole 32-byte buffer 'id.name' with
      the string value and leave no room for a NUL terminator, leading to
      possible buffer boundary violations in subsequent string operations.
      Replace strncpy with strlcpy.
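      Schematically (the 32-byte buffer size is taken from the description above):

        char name[32];                          /* id.name */

        /* strncpy() may fill all 32 bytes and leave no NUL terminator */
        strncpy(name, src, sizeof(name));

        /* strlcpy() always NUL-terminates, truncating if necessary */
        strlcpy(name, src, sizeof(name));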
      Signed-off-by: Alexander Usyskin <alexander.usyskin@intel.com>
      Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      ba8beb4c
    • Pawel Moll's avatar
      perf: Handle compat ioctl · 85887973
      Pawel Moll authored
      commit b3f20785 upstream.
      
      When running a 32-bit userspace on a 64-bit kernel (eg. i386
      application on x86_64 kernel or 32-bit arm userspace on arm64
      kernel) some of the perf ioctls must be treated with special
      care, as they have a pointer size encoded in the command.
      
      For example, PERF_EVENT_IOC_ID in the 32-bit world will be encoded
      as 0x80042407, but a 64-bit kernel will expect 0x80082407.  As a
      result the ioctl fails with -ENOTTY.
      
      This patch solves the problem by adding a compat_ioctl file operation
      that fixes up the encoded size.
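      A sketch of such a compat fix-up (close to the pattern described here,
      with details simplified):

        static long perf_compat_ioctl(struct file *file, unsigned int cmd,
                                      unsigned long arg)
        {
                switch (_IOC_NR(cmd)) {
                case _IOC_NR(PERF_EVENT_IOC_ID):
                        /* command encodes a pointer size: widen 4 -> 8 bytes */
                        if (_IOC_SIZE(cmd) == sizeof(compat_uptr_t)) {
                                cmd &= ~IOCSIZE_MASK;
                                cmd |= sizeof(void *) << IOCSIZE_SHIFT;
                        }
                        break;
                }
                return perf_ioctl(file, cmd, arg);
        }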
      Reported-by: Drew Richardson <drew.richardson@arm.com>
      Signed-off-by: Pawel Moll <pawel.moll@arm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Link: http://lkml.kernel.org/r/1402671812-9078-1-git-send-email-pawel.moll@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: David Ahern <daahern@cisco.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      85887973
    • Yoichi Yuasa's avatar
      MIPS: Fix forgotten preempt_enable() when CPU has inclusive pcaches · 3b851c17
      Yoichi Yuasa authored
      commit 5596b0b2 upstream.
      
      [    1.904000] BUG: scheduling while atomic: swapper/1/0x00000002
      [    1.908000] Modules linked in:
      [    1.916000] CPU: 0 PID: 1 Comm: swapper Not tainted 3.12.0-rc2-lemote-los.git-5318619-dirty #1
      [    1.920000] Stack : 0000000031aac000 ffffffff810d0000 0000000000000052 ffffffff802730a4
                0000000000000000 0000000000000001 ffffffff810cdf90 ffffffff810d0000
                ffffffff8068b968 ffffffff806f5537 ffffffff810cdf90 980000009f0782e8
                0000000000000001 ffffffff80720000 ffffffff806b0000 980000009f078000
                980000009f290000 ffffffff805f312c 980000009f05b5d8 ffffffff80233518
                980000009f05b5e8 ffffffff80274b7c 980000009f078000 ffffffff8068b968
                0000000000000000 0000000000000000 0000000000000000 0000000000000000
                0000000000000000 980000009f05b520 0000000000000000 ffffffff805f2f6c
                0000000000000000 ffffffff80700000 ffffffff80700000 ffffffff806fc758
                ffffffff80700000 ffffffff8020be98 ffffffff806fceb0 ffffffff805f2f6c
                ...
      [    2.028000] Call Trace:
      [    2.032000] [<ffffffff8020be98>] show_stack+0x80/0x98
      [    2.036000] [<ffffffff805f2f6c>] __schedule_bug+0x44/0x6c
      [    2.040000] [<ffffffff805fac58>] __schedule+0x518/0x5b0
      [    2.044000] [<ffffffff805f8a58>] schedule_timeout+0x128/0x1f0
      [    2.048000] [<ffffffff80240314>] msleep+0x3c/0x60
      [    2.052000] [<ffffffff80495400>] do_probe+0x238/0x3a8
      [    2.056000] [<ffffffff804958b0>] ide_probe_port+0x340/0x7e8
      [    2.060000] [<ffffffff80496028>] ide_host_register+0x2d0/0x7a8
      [    2.064000] [<ffffffff8049c65c>] ide_pci_init_two+0x4e4/0x790
      [    2.068000] [<ffffffff8049f9b8>] amd74xx_probe+0x148/0x2c8
      [    2.072000] [<ffffffff803f571c>] pci_device_probe+0xc4/0x130
      [    2.076000] [<ffffffff80478f60>] driver_probe_device+0x98/0x270
      [    2.080000] [<ffffffff80479298>] __driver_attach+0xe0/0xe8
      [    2.084000] [<ffffffff80476ab0>] bus_for_each_dev+0x78/0xe0
      [    2.088000] [<ffffffff80478468>] bus_add_driver+0x230/0x310
      [    2.092000] [<ffffffff80479b44>] driver_register+0x84/0x158
      [    2.096000] [<ffffffff80200504>] do_one_initcall+0x104/0x160
      Signed-off-by: Yoichi Yuasa <yuasa@linux-mips.org>
      Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi>
      Tested-by: Aaro Koskinen <aaro.koskinen@iki.fi>
      Cc: linux-mips@linux-mips.org
      Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
      Patchwork: https://patchwork.linux-mips.org/patch/5941/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      Cc: Alexandre Oliva <lxoliva@fsfla.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3b851c17
    • Pali Rohár's avatar
      dell-wmi: Fix access out of memory · 2a7978ef
      Pali Rohár authored
      commit a666b6ff upstream.
      
      Without this patch, dell-wmi tries to access elements of a dynamically
      allocated array without checking the array size.  This can lead to memory
      corruption or a kernel panic.  This patch adds the missing checks for
      the array size.
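      The shape of the missing check, with illustrative names:

        /* validate a firmware-provided index before using it to index the
         * dynamically allocated keymap */
        if (code >= keymap_size) {
                pr_info("dell-wmi: key code 0x%x out of range, ignoring\n", code);
                return;
        }
        key = &keymap[code];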
      Signed-off-by: Pali Rohár <pali.rohar@gmail.com>
      Signed-off-by: Darren Hart <dvhart@linux.intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2a7978ef
    • Ben Dooks's avatar
      ARM: probes: fix instruction fetch order with <asm/opcodes.h> · a2ad9bef
      Ben Dooks authored
      commit 888be254 upstream.
      
      If we are running BE8, the data and instruction endianness do not
      match, so use <asm/opcodes.h> to correctly translate memory accesses
      into ARM instructions.
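      The resulting pattern, roughly:

        #include <asm/opcodes.h>

        /* fetch the instruction as laid out in memory, then convert memory
         * order to canonical instruction order (safe for BE8) */
        insn = __mem_to_opcode_arm(*(u32 *)addr);

        /* Thumb probes use the 16-bit variant in the same way */
        tinsn = __mem_to_opcode_thumb16(*(u16 *)addr);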
      Acked-by: Jon Medhurst <tixy@linaro.org>
      Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
      [taras.kondratiuk@linaro.org: fixed Thumb instruction fetch order]
      Signed-off-by: Taras Kondratiuk <taras.kondratiuk@linaro.org>
      [wangnan: backport to 3.10 and 3.14:
       - adjust context
       - backport all changes on arch/arm/kernel/probes.c to
         arch/arm/kernel/kprobes-common.c since we don't have
         commit c18377c3.
       - After the above adjustments, becomes same to Taras Kondratiuk's
         original patch:
           http://lists.linaro.org/pipermail/linaro-kernel/2014-January/010346.html
      ]
      Signed-off-by: Wang Nan <wangnan0@huawei.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a2ad9bef
    • Jiri Pirko's avatar
      br: fix use of ->rx_handler_data in code executed on non-rx_handler path · a4ad890a
      Jiri Pirko authored
      commit 859828c0 upstream.
      
      br_stp_rcv() is reached via a non-rx_handler path.  That means there is no
      guarantee that dev is a bridge port, so a simple NULL check of
      ->rx_handler_data is not enough.  We need to check whether dev really is a
      bridge port and, since only the RCU read lock is held here, do it by
      checking the ->rx_handler pointer.
      
      Note that synchronize_net() in netdev_rx_handler_unregister() ensures
      that this approach is valid.
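      A sketch of the check (the helper name here is illustrative):

        /* only trust ->rx_handler_data once we know dev really is a bridge
         * port, i.e. its rx_handler is the bridge's frame handler */
        static inline struct net_bridge_port *
        br_port_get_check_sketch(const struct net_device *dev)
        {
                if (rcu_access_pointer(dev->rx_handler) != br_handle_frame)
                        return NULL;
                return rcu_dereference(dev->rx_handler_data);
        }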
      
      Introduced originally by:
      commit f350a0a8
        "bridge: use rx_handler_data pointer to store net_bridge_port pointer"
      
      Fixed but not in the best way by:
      commit b5ed54e9
        "bridge: fix RCU races with bridge port"
      
      Reintroduced by:
      commit 716ec052
        "bridge: fix NULL pointer deref of br_port_get_rcu"
      
      Please apply to stable trees as well. Thanks.
      
      RH bugzilla reference: https://bugzilla.redhat.com/show_bug.cgi?id=1025770
      Reported-by: Laine Stump <laine@redhat.com>
      Debugged-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Cc: Andrew Collins <bsderandrew@gmail.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a4ad890a
    • Florian Westphal's avatar
      netfilter: nf_nat: fix oops on netns removal · 5eb4491e
      Florian Westphal authored
      commit 945b2b2d upstream.
      
      Quoting Samu Kallio:
      
       Basically what's happening is, during netns cleanup,
       nf_nat_net_exit gets called before ipv4_net_exit. As I understand
       it, nf_nat_net_exit is supposed to kill any conntrack entries which
       have NAT context (through nf_ct_iterate_cleanup), but for some
       reason this doesn't happen (perhaps something else is still holding
       refs to those entries?).
      
       When ipv4_net_exit is called, conntrack entries (including those
       with NAT context) are cleaned up, but the
       nat_bysource hashtable is long gone - freed in nf_nat_net_exit. The
       bug happens when attempting to free a conntrack entry whose NAT hash
       'prev' field points to a slot in the freed hash table (head for that
       bin).
      
      We ignore conntracks with null nat bindings.  But this is wrong,
      as these are in the bysource hash table as well.
      
      Restore nat-cleaning for the netns-is-being-removed case.
      
      bug:
      https://bugzilla.kernel.org/show_bug.cgi?id=65191
      
      Fixes: c2d421e1 ('netfilter: nf_nat: fix race when unloading protocol modules')
      Reported-by: Samu Kallio <samu.kallio@aberdeencloud.com>
      Debugged-by: Samu Kallio <samu.kallio@aberdeencloud.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Tested-by: Samu Kallio <samu.kallio@aberdeencloud.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      [samu.kallio@aberdeencloud.com: backport to 3.10-stable]
      Signed-off-by: Samu Kallio <samu.kallio@aberdeencloud.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5eb4491e
    • Pablo Neira's avatar
      netfilter: xt_bpf: add mising opaque struct sk_filter definition · 7c059c04
      Pablo Neira authored
      commit e10038a8 upstream.
      
      This structure is not exposed to userspace, so fix this by forward-declaring
      struct sk_filter; that way we skip the casting in kernelspace.  This is safe
      since userspace has no way to tamper with that internal pointer.
      
      Fixes: e6f30c73 ("netfilter: x_tables: add xt_bpf match")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7c059c04
    • Houcheng Lin's avatar
      netfilter: nf_log: release skbuff on nlmsg put failure · 0bf7a5e1
      Houcheng Lin authored
      commit b51d3fa3 upstream.
      
      The kernel should reserve enough room in the skb so that the DONE
      message can always be appended.  However, in case of e.g. a new attribute
      erroneously not being size-accounted for, __nfulnl_send() will still
      try to put the next nlmsg into this full skb, causing the skb to be stuck
      forever and blocking delivery of further messages.
      
      Fix the issue by releasing the skb immediately after the nlmsg_put error
      and WARN() so we can track down the cause of such a size mismatch.
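      The shape of the fix, simplified:

        nlh = nlmsg_put(inst->skb, 0, 0,
                        NFNL_SUBSYS_ULOG << 8 | NFULNL_MSG_PACKET,
                        sizeof(struct nfgenmsg), 0);
        if (!nlh) {
                /* should not happen if all attributes were size-accounted */
                WARN_ONCE(1, "bad skb size %u, tailroom %d\n",
                          inst->skb->len, skb_tailroom(inst->skb));
                kfree_skb(inst->skb);           /* don't keep a stuck, full skb */
                inst->skb = NULL;
                goto out;
        }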
      
      [ fw@strlen.de: add tailroom/len info to WARN ]
      Signed-off-by: Houcheng Lin <houcheng@gmail.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0bf7a5e1
    • Florian Westphal's avatar
      netfilter: nfnetlink_log: fix maximum packet length logged to userspace · 07b17069
      Florian Westphal authored
      commit c1e7dc91 upstream.
      
      Don't try to queue payloads > 0xffff - NLA_HDRLEN; it does not work.
      The nla length includes the size of the nla struct, so anything larger
      results in u16 integer overflow.
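      In code, the cap described above is simply:

        /* largest representable nla payload: u16 length minus the nla header */
        if (data_len > 0xffff - NLA_HDRLEN)
                data_len = 0xffff - NLA_HDRLEN;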
      
      This patch is similar to
      9cefbbc9 (netfilter: nfnetlink_queue: cleanup copy_range usage).
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      07b17069
    • Florian Westphal's avatar
      netfilter: nf_log: account for size of NLMSG_DONE attribute · 3a758a2b
      Florian Westphal authored
      commit 9dfa1dfe upstream.
      
      We currently neither account for the nlattr size, nor do we consider
      the size of the trailing NLMSG_DONE when allocating nlmsg skb.
      
      This can result in nflog to stop working, as __nfulnl_send() re-tries
      sending forever if it failed to append NLMSG_DONE (which will never
      work if buffer is not large enough).
      Reported-by: Houcheng Lin <houcheng@gmail.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3a758a2b
    • Andrey Vagin's avatar
      ipc: always handle a new value of auto_msgmni · f2f25589
      Andrey Vagin authored
      commit 1195d94e upstream.
      
      proc_dointvec_minmax() returns zero if a new value has been set.  So we
      don't need to check that all characters have been handled.
      
      Below you can find two examples.  In the first one, the new value has not
      been handled properly.
      
      $ strace ./a.out
      open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
      write(3, "0\n\0", 3)                    = 2
      close(3)                                = 0
      exit_group(0)
      $ cat /sys/kernel/debug/tracing/trace
      
      $strace ./a.out
      open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
      write(3, "0\n", 2)                      = 2
      close(3)                                = 0
      
      $ cat /sys/kernel/debug/tracing/trace
      a.out-697   [000] ....  3280.998235: unregister_ipcns_notifier <-proc_ipcauto_dointvec_minmax
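      A minimal sketch of the corrected check (the recompute helper name is
      illustrative):

        err = proc_dointvec_minmax(&ipc_table, write, buffer, lenp, ppos);
        if (err < 0)
                return err;
        if (write)
                /* err == 0 already means the new value was stored */
                recompute_msgmni_sketch(table->data);
        return err;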
      
      Fixes: 9eefe520 ("ipc: do not use a negative value to re-enable msgmni automatic recomputin")
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f2f25589
    • Bjorn Helgaas's avatar
      clocksource: Remove "weak" from clocksource_default_clock() declaration · 88d96d8e
      Bjorn Helgaas authored
      commit 96a2adbc upstream.
      
      kernel/time/jiffies.c provides a default clocksource_default_clock()
      definition explicitly marked "weak".  arch/s390 provides its own definition
      intended to override the default, but the "weak" attribute on the
      declaration applied to the s390 definition as well, so the linker chose one
      based on link order (see 10629d71 ("PCI: Remove __weak annotation from
      pcibios_get_phb_of_node decl")).
      
      Remove the "weak" attribute from the clocksource_default_clock()
      declaration so we always prefer a non-weak definition over the weak one,
      independent of link order.
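      Schematically:

        /* header: the declaration must not be weak, or it taints every
         * definition built from a file that includes it */
        extern struct clocksource *clocksource_default_clock(void);

        /* kernel/time/jiffies.c: only the default definition is weak */
        struct clocksource * __weak clocksource_default_clock(void)
        {
                return &clocksource_jiffies;
        }

        /* arch/s390: a normal, strong definition now always wins */
        struct clocksource *clocksource_default_clock(void)
        {
                return &clocksource_tod;
        }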
      
      Fixes: f1b82746 ("clocksource: Cleanup clocksource selection")
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Acked-by: John Stultz <john.stultz@linaro.org>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      CC: Daniel Lezcano <daniel.lezcano@linaro.org>
      CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      88d96d8e