1. 21 Mar, 2016 1 commit
    • Rainer Weikusat's avatar
      unix: avoid use-after-free in ep_remove_wait_queue · ec54d5ae
      Rainer Weikusat authored
      commit 7d267278
      
       upstream.
      
      Rainer Weikusat <rweikusat@mobileactivedefense.com> writes:
      An AF_UNIX datagram socket being the client in an n:1 association with
      some server socket is only allowed to send messages to the server if the
      receive queue of this socket contains at most sk_max_ack_backlog
      datagrams. This implies that prospective writers might be forced to go
      to sleep despite none of the message presently enqueued on the server
      receive queue were sent by them. In order to ensure that these will be
      woken up once space becomes again available, the present unix_dgram_poll
      routine does a second sock_poll_wait call with the peer_wait wait queue
      of the server socket as queue argument (unix_dgram_recvmsg does a wake
      up on this queue after a datagram was received). This is inherently
      problematic because the server socket is only guaranteed to remain alive
      for as long as the client still holds a reference to it. In case the
      connection is dissolved via connect or by the dead peer detection logic
      in unix_dgram_sendmsg, the server socket may be freed despite "the
      polling mechanism" (in particular, epoll) still has a pointer to the
      corresponding peer_wait queue. There's no way to forcibly deregister a
      wait queue with epoll.
      
      Based on an idea by Jason Baron, the patch below changes the code such
      that a wait_queue_t belonging to the client socket is enqueued on the
      peer_wait queue of the server whenever the peer receive queue full
      condition is detected by either a sendmsg or a poll. A wake up on the
      peer queue is then relayed to the ordinary wait queue of the client
      socket via wake function. The connection to the peer wait queue is again
      dissolved if either a wake up is about to be relayed or the client
      socket reconnects or a dead peer is detected or the client socket is
      itself closed. This enables removing the second sock_poll_wait from
      unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
      that no blocked writer sleeps forever.
      Signed-off-by: default avatarRainer Weikusat <rweikusat@mobileactivedefense.com>
      Fixes: ec0d215f
      
       ("af_unix: fix 'poll for write'/connected DGRAM sockets")
      Reviewed-by: default avatarJason Baron <jbaron@akamai.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarZefan Li <lizefan@huawei.com>
      ec54d5ae
  2. 27 Apr, 2014 1 commit
  3. 15 Jan, 2014 2 commits
  4. 08 Dec, 2013 1 commit
    • Hannes Frederic Sowa's avatar
      net: rework recvmsg handler msg_name and msg_namelen logic · 18719a4c
      Hannes Frederic Sowa authored
      [ Upstream commit f3d33426
      
       ]
      
      This patch now always passes msg->msg_namelen as 0. recvmsg handlers must
      set msg_namelen to the proper size <= sizeof(struct sockaddr_storage)
      to return msg_name to the user.
      
      This prevents numerous uninitialized memory leaks we had in the
      recvmsg handlers and makes it harder for new code to accidentally leak
      uninitialized memory.
      
      Optimize for the case recvfrom is called with NULL as address. We don't
      need to copy the address at all, so set it to NULL before invoking the
      recvmsg handler. We can do so, because all the recvmsg handlers must
      cope with the case a plain read() is called on them. read() also sets
      msg_name to NULL.
      
      Also document these changes in include/linux/net.h as suggested by David
      Miller.
      
      Changes since RFC:
      
      Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
      non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
      affect sendto as it would bail out earlier while trying to copy-in the
      address. It also more naturally reflects the logic by the callers of
      verify_iovec.
      
      With this change in place I could remove "
      if (!uaddr || msg_sys->msg_namelen == 0)
      	msg->msg_name = NULL
      ".
      
      This change does not alter the user visible error logic as we ignore
      msg_namelen as long as msg_name is NULL.
      
      Also remove two unnecessary curly brackets in ___sys_recvmsg and change
      comments to netdev style.
      
      Cc: David Miller <davem@davemloft.net>
      Suggested-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      18719a4c
  5. 04 Nov, 2013 2 commits
    • Daniel Borkmann's avatar
      net: unix: inherit SOCK_PASS{CRED, SEC} flags from socket to fix race · 4dde1cb0
      Daniel Borkmann authored
      [ Upstream commit 90c6bd34 ]
      
      In the case of credentials passing in unix stream sockets (dgram
      sockets seem not affected), we get a rather sparse race after
      commit 16e57262 ("af_unix: dont send SCM_CREDENTIALS by default").
      
      We have a stream server on receiver side that requests credential
      passing from senders (e.g. nc -U). Since we need to set SO_PASSCRED
      on each spawned/accepted socket on server side to 1 first (as it's
      not inherited), it can happen that in the time between accept() and
      setsockopt() we get interrupted, the sender is being scheduled and
      continues with passing data to our receiver. At that time SO_PASSCRED
      is neither set on sender nor receiver side, hence in cmsg's
      SCM_CREDENTIALS we get eventually pid:0, uid:65534, gid:65534
      (== overflow{u,g}id) instead of what we actually would like to see.
      
      On the sender side, here nc -U, the tests in maybe_add_creds()
      invoked through unix_stream_sendmsg() would fail, as at that exact
      time, as mentioned, the sender has neither SO_PASSCRED on his side
      nor sees it on the server side, and we have a valid 'other' socket
      in place. Thus, sender believes it would just look like a normal
      connection, not needing/requesting SO_PASSCRED at that time.
      
      As reverting 16e57262
      
       would not be an option due to the significant
      performance regression reported when having creds always passed,
      one way/trade-off to prevent that would be to set SO_PASSCRED on
      the listener socket and allow inheriting these flags to the spawned
      socket on server side in accept(). It seems also logical to do so
      if we'd tell the listener socket to pass those flags onwards, and
      would fix the race.
      
      Before, strace:
      
      recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"blub\n", 4096}],
              msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET,
              cmsg_type=SCM_CREDENTIALS{pid=0, uid=65534, gid=65534}},
              msg_flags=0}, 0) = 5
      
      After, strace:
      
      recvmsg(4, {msg_name(0)=NULL, msg_iov(1)=[{"blub\n", 4096}],
              msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET,
              cmsg_type=SCM_CREDENTIALS{pid=11580, uid=1000, gid=1000}},
              msg_flags=0}, 0) = 5
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4dde1cb0
    • Mathias Krause's avatar
      unix_diag: fix info leak · f4358dfd
      Mathias Krause authored
      [ Upstream commit 6865d1e8
      
       ]
      
      When filling the netlink message we miss to wipe the pad field,
      therefore leak one byte of heap memory to userland. Fix this by
      setting pad to 0.
      Signed-off-by: default avatarMathias Krause <minipli@googlemail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f4358dfd
  6. 01 May, 2013 1 commit
  7. 05 Apr, 2013 1 commit
    • Paul Moore's avatar
      unix: fix a race condition in unix_release() · fbb7347e
      Paul Moore authored
      [ Upstream commit ded34e0f
      
       ]
      
      As reported by Jan, and others over the past few years, there is a
      race condition caused by unix_release setting the sock->sk pointer
      to NULL before properly marking the socket as dead/orphaned.  This
      can cause a problem with the LSM hook security_unix_may_send() if
      there is another socket attempting to write to this partially
      released socket in between when sock->sk is set to NULL and it is
      marked as dead/orphaned.  This patch fixes this by only setting
      sock->sk to NULL after the socket has been marked as dead; I also
      take the opportunity to make unix_release_sock() a void function
      as it only ever returned 0/success.
      
      Dave, I think this one should go on the -stable pile.
      
      Special thanks to Jan for coming up with a reproducer for this
      problem.
      Reported-by: default avatarJan Stancek <jan.stancek@gmail.com>
      Signed-off-by: default avatarPaul Moore <pmoore@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fbb7347e
  8. 02 Oct, 2012 1 commit
    • Eric Dumazet's avatar
      af_netlink: force credentials passing [CVE-2012-3520] · 7c799a1e
      Eric Dumazet authored
      [ Upstream commit e0e3cea4 ]
      
      Pablo Neira Ayuso discovered that avahi and
      potentially NetworkManager accept spoofed Netlink messages because of a
      kernel bug.  The kernel passes all-zero SCM_CREDENTIALS ancillary data
      to the receiver if the sender did not provide such data, instead of not
      including any such data at all or including the correct data from the
      peer (as it is the case with AF_UNIX).
      
      This bug was introduced in commit 16e57262
      
      
      (af_unix: dont send SCM_CREDENTIALS by default)
      
      This patch forces passing credentials for netlink, as
      before the regression.
      
      Another fix would be to not add SCM_CREDENTIALS in
      netlink messages if not provided by the sender, but it
      might break some programs.
      
      With help from Florian Weimer & Petr Matousek
      
      This issue is designated as CVE-2012-3520
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7c799a1e
  9. 23 Mar, 2012 1 commit
    • Hans Verkuil's avatar
      poll: add poll_requested_events() and poll_does_not_wait() functions · 626cf236
      Hans Verkuil authored
      
      In some cases the poll() implementation in a driver has to do different
      things depending on the events the caller wants to poll for.  An example
      is when a driver needs to start a DMA engine if the caller polls for
      POLLIN, but doesn't want to do that if POLLIN is not requested but instead
      only POLLOUT or POLLPRI is requested.  This is something that can happen
      in the video4linux subsystem among others.
      
      Unfortunately, the current epoll/poll/select implementation doesn't
      provide that information reliably.  The poll_table_struct does have it: it
      has a key field with the event mask.  But once a poll() call matches one
      or more bits of that mask any following poll() calls are passed a NULL
      poll_table pointer.
      
      Also, the eventpoll implementation always left the key field at ~0 instead
      of using the requested events mask.
      
      This was changed in eventpoll.c so the key field now contains the actual
      events that should be polled for as set by the caller.
      
      The solution to the NULL poll_table pointer is to set the qproc field to
      NULL in poll_table once poll() matches the events, not the poll_table
      pointer itself.  That way drivers can obtain the mask through a new
      poll_requested_events inline.
      
      The poll_table_struct can still be NULL since some kernel code calls it
      internally (netfs_state_poll() in ./drivers/staging/pohmelfs/netfs.h).  In
      that case poll_requested_events() returns ~0 (i.e.  all events).
      
      Very rarely drivers might want to know whether poll_wait will actually
      wait.  If another earlier file descriptor in the set already matched the
      events the caller wanted to wait for, then the kernel will return from the
      select() call without waiting.  This might be useful information in order
      to avoid doing expensive work.
      
      A new helper function poll_does_not_wait() is added that drivers can use
      to detect this situation.  This is now used in sock_poll_wait() in
      include/net/sock.h.  This was the only place in the kernel that needed
      this information.
      
      Drivers should no longer access any of the poll_table internals, but use
      the poll_requested_events() and poll_does_not_wait() access functions
      instead.  In order to enforce that the poll_table fields are now prepended
      with an underscore and a comment was added warning against using them
      directly.
      
      This required a change in unix_dgram_poll() in unix/af_unix.c which used
      the key field to get the requested events.  It's been replaced by a call
      to poll_requested_events().
      
      For qproc it was especially important to change its name since the
      behavior of that field changes with this patch since this function pointer
      can now be NULL when that wasn't possible in the past.
      
      Any driver accessing the qproc or key fields directly will now fail to compile.
      
      Some notes regarding the correctness of this patch: the driver's poll()
      function is called with a 'struct poll_table_struct *wait' argument.  This
      pointer may or may not be NULL, drivers can never rely on it being one or
      the other as that depends on whether or not an earlier file descriptor in
      the select()'s fdset matched the requested events.
      
      There are only three things a driver can do with the wait argument:
      
      1) obtain the key field:
      
      	events = wait ? wait->key : ~0;
      
         This will still work although it should be replaced with the new
         poll_requested_events() function (which does exactly the same).
         This will now even work better, since wait is no longer set to NULL
         unnecessarily.
      
      2) use the qproc callback. This could be deadly since qproc can now be
         NULL. Renaming qproc should prevent this from happening. There are no
         kernel drivers that actually access this callback directly, BTW.
      
      3) test whether wait == NULL to determine whether poll would return without
         waiting. This is no longer sufficient as the correct test is now
         wait == NULL || wait->_qproc == NULL.
      
         However, the worst that can happen here is a slight performance hit in
         the case where wait != NULL and wait->_qproc == NULL. In that case the
         driver will assume that poll_wait() will actually add the fd to the set
         of waiting file descriptors. Of course, poll_wait() will not do that
         since it tests for wait->_qproc. This will not break anything, though.
      
         There is only one place in the whole kernel where this happens
         (sock_poll_wait() in include/net/sock.h) and that code will be replaced
         by a call to poll_does_not_wait() in the next patch.
      
         Note that even if wait->_qproc != NULL drivers cannot rely on poll_wait()
         actually waiting. The next file descriptor from the set might match the
         event mask and thus any possible waits will never happen.
      Signed-off-by: default avatarHans Verkuil <hans.verkuil@cisco.com>
      Reviewed-by: default avatarJonathan Corbet <corbet@lwn.net>
      Reviewed-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: default avatarHans de Goede <hdegoede@redhat.com>
      Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      626cf236
  10. 21 Mar, 2012 2 commits
  11. 26 Feb, 2012 1 commit
  12. 22 Feb, 2012 1 commit
    • Eric Dumazet's avatar
      af_unix: MSG_TRUNC support for dgram sockets · 9f6f9af7
      Eric Dumazet authored
      
      Piergiorgio Beruto expressed the need to fetch size of first datagram in
      queue for AF_UNIX sockets and suggested a patch against SIOCINQ ioctl.
      
      I suggested instead to implement MSG_TRUNC support as a recv() input
      flag, as already done for RAW, UDP & NETLINK sockets.
      
      len = recv(fd, &byte, 1, MSG_PEEK | MSG_TRUNC);
      
      MSG_TRUNC asks recv() to return the real length of the packet, even when
      is was longer than the passed buffer.
      
      There is risk that a userland application used MSG_TRUNC by accident
      (since it had no effect on af_unix sockets) and this might break after
      this patch.
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Tested-by: default avatarPiergiorgio Beruto <piergiorgio.beruto@gmail.com>
      CC: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9f6f9af7
  13. 21 Feb, 2012 2 commits
  14. 30 Jan, 2012 1 commit
    • Eric Dumazet's avatar
      af_unix: fix EPOLLET regression for stream sockets · 6f01fd6e
      Eric Dumazet authored
      Commit 0884d7aa
      
       (AF_UNIX: Fix poll blocking problem when reading from
      a stream socket) added a regression for epoll() in Edge Triggered mode
      (EPOLLET)
      
      Appropriate fix is to use skb_peek()/skb_unlink() instead of
      skb_dequeue(), and only call skb_unlink() when skb is fully consumed.
      
      This remove the need to requeue a partial skb into sk_receive_queue head
      and the extra sk->sk_data_ready() calls that added the regression.
      
      This is safe because once skb is given to sk_receive_queue, it is not
      modified by a writer, and readers are serialized by u->readlock mutex.
      
      This also reduce number of spinlock acquisition for small reads or
      MSG_PEEK users so should improve overall performance.
      Reported-by: default avatarNick Mathewson <nickm@freehaven.net>
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Cc: Alexey Moiseytsev <himeraster@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6f01fd6e
  15. 07 Jan, 2012 1 commit
  16. 04 Jan, 2012 1 commit
  17. 30 Dec, 2011 3 commits
  18. 26 Dec, 2011 2 commits
  19. 20 Dec, 2011 1 commit
  20. 16 Dec, 2011 10 commits
  21. 26 Nov, 2011 1 commit
  22. 28 Sep, 2011 1 commit
    • Eric Dumazet's avatar
      af_unix: dont send SCM_CREDENTIALS by default · 16e57262
      Eric Dumazet authored
      Since commit 7361c36c (af_unix: Allow credentials to work across
      user and pid namespaces) af_unix performance dropped a lot.
      
      This is because we now take a reference on pid and cred in each write(),
      and release them in read(), usually done from another process,
      eventually from another cpu. This triggers false sharing.
      
      # Events: 154K cycles
      #
      # Overhead  Command       Shared Object        Symbol
      # ........  .......  ..................  .........................
      #
          10.40%  hackbench  [kernel.kallsyms]   [k] put_pid
           8.60%  hackbench  [kernel.kallsyms]   [k] unix_stream_recvmsg
           7.87%  hackbench  [kernel.kallsyms]   [k] unix_stream_sendmsg
           6.11%  hackbench  [kernel.kallsyms]   [k] do_raw_spin_lock
           4.95%  hackbench  [kernel.kallsyms]   [k] unix_scm_to_skb
           4.87%  hackbench  [kernel.kallsyms]   [k] pid_nr_ns
           4.34%  hackbench  [kernel.kallsyms]   [k] cred_to_ucred
           2.39%  hackbench  [kernel.kallsyms]   [k] unix_destruct_scm
         ...
      16e57262
  23. 16 Sep, 2011 1 commit
  24. 25 Aug, 2011 1 commit
    • Tim Chen's avatar
      Scm: Remove unnecessary pid & credential references in Unix socket's send and receive path · 0856a304
      Tim Chen authored
      Patch series 109f6e39..7361c36c
      
       back in 2.6.36 added functionality to
      allow credentials to work across pid namespaces for packets sent via
      UNIX sockets.  However, the atomic reference counts on pid and
      credentials caused plenty of cache bouncing when there are numerous
      threads of the same pid sharing a UNIX socket.  This patch mitigates the
      problem by eliminating extraneous reference counts on pid and
      credentials on both send and receive path of UNIX sockets. I found a 2x
      improvement in hackbench's threaded case.
      
      On the receive path in unix_dgram_recvmsg, currently there is an
      increment of reference count on pid and credentials in scm_set_cred.
      Then there are two decrement of the reference counts.  Once in scm_recv
      and once when skb_free_datagram call skb->destructor function
      unix_destruct_scm.  One pair of increment and decrement of ref count on
      pid and credentials can be eliminated from the receive path.  Until we
      destroy the skb, we already set a reference when we created the skb on
      the send side.
      
      On the send path, there are two increments of ref count on pid and
      credentials, once in scm_send and once in unix_scm_to_skb.  Then there
      is a decrement of the reference counts in scm_destroy's call to
      scm_destroy_cred at the end of unix_dgram_sendmsg functions.   One pair
      of increment and decrement of the reference counts can be removed so we
      only need to increment the ref counts once.
      
      By incorporating these changes, for hackbench running on a 4 socket
      NHM-EX machine with 40 cores, the execution of hackbench on
      50 groups of 20 threads sped up by factor of 2.
      
      Hackbench command used for testing:
      ./hackbench 50 thread 2000
      Signed-off-by: default avatarTim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0856a304