1. 28 Oct, 2015 21 commits
  2. 27 Oct, 2015 19 commits
    • Stas Sergeev's avatar
      of_mdio: add new DT property 'managed' to specify the PHY management type · 71a386c7
      Stas Sergeev authored
      [ Upstream commit 4cba5c21 ]
      
      Currently the PHY management type is selected by the MAC driver arbitrary.
      The decision is based on the presence of the "fixed-link" node and on a
      will of the driver's authors.
      This caused a regression recently, when mvneta driver suddenly started
      to use the in-band status for auto-negotiation on fixed links.
      It appears the auto-negotiation may not work when expected by the MAC driver.
      Sebastien Rannou explains:
      << Yes, I confirm that my HW does not generate an in-band status. AFAIK, it's
      a PHY that aggregates 4xSGMIIs to 1xQSGMII ; the MAC side of the PHY (with
      inband status) is connected to the switch through QSGMII, and in this context
      we are on the media side of the PHY. >>
      https://lkml.org/lkml/2015/7/10/206
      
      This patch introduces the new string property 'managed' that allows
      the user to set the management type explicitly.
      The supported values are:
      "auto" - default. Uses either MDIO or nothing, depending on the presence
      of the fixed-link node
      "in-band-status" - use in-band status
      Signed-off-by: default avatarStas Sergeev <stsp@users.sourceforge.net>
      
      CC: Rob Herring <robh+dt@kernel.org>
      CC: Pawel Moll <pawel.moll@arm.com>
      CC: Mark Rutland <mark.rutland@arm.com>
      CC: Ian Campbell <ijc+devicetree@hellion.org.uk>
      CC: Kumar Gala <galak@codeaurora.org>
      CC: Florian Fainelli <f.fainelli@gmail.com>
      CC: Grant Likely <grant.likely@linaro.org>
      CC: devicetree@vger.kernel.org
      CC: linux-kernel@vger.kernel.org
      CC: netdev@vger.kernel.org
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      71a386c7
    • Florian Fainelli's avatar
      net: dsa: bcm_sf2: Do not override speed settings · c0fb0993
      Florian Fainelli authored
      [ Upstream commit d2eac98f ]
      
      The SF2 driver currently overrides speed settings for its port
      configured using a fixed PHY, this is both unnecessary and incorrect,
      because we keep feedback to the hardware parameters that we read from
      the PHY device, which in the case of a fixed PHY cannot possibly change
      speed.
      
      This is a required change to allow the fixed PHY code to allow
      registering a PHY with a link configured as DOWN by default and avoid
      some sort of circular dependency where we require the link_update
      callback to run to program the hardware, and we then utilize the fixed
      PHY parameters to program the hardware with the same settings.
      
      Fixes: 246d7f77 ("net: dsa: add Broadcom SF2 switch driver")
      Signed-off-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      c0fb0993
    • Eric Dumazet's avatar
      tcp: add proper TS val into RST packets · 9a2c1f52
      Eric Dumazet authored
      [ Upstream commit 675ee231 ]
      
      RST packets sent on behalf of TCP connections with TS option (RFC 7323
      TCP timestamps) have incorrect TS val (set to 0), but correct TS ecr.
      
      A > B: Flags [S], seq 0, win 65535, options [mss 1000,nop,nop,TS val 100
      ecr 0], length 0
      B > A: Flags [S.], seq 2444755794, ack 1, win 28960, options [mss
      1460,nop,nop,TS val 7264344 ecr 100], length 0
      A > B: Flags [.], ack 1, win 65535, options [nop,nop,TS val 110 ecr
      7264344], length 0
      
      B > A: Flags [R.], seq 1, ack 1, win 28960, options [nop,nop,TS val 0
      ecr 110], length 0
      
      We need to call skb_mstamp_get() to get proper TS val,
      derived from skb->skb_mstamp
      
      Note that RFC 1323 was advocating to not send TS option in RST segment,
      but RFC 7323 recommends the opposite :
      
        Once TSopt has been successfully negotiated, that is both <SYN> and
        <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
        segment for the duration of the connection, and SHOULD be sent in an
        <RST> segment (see Section 5.2 for details)
      
      Note this RFC recommends to send TS val = 0, but we believe it is
      premature : We do not know if all TCP stacks are properly
      handling the receive side :
      
         When an <RST> segment is
         received, it MUST NOT be subjected to the PAWS check by verifying an
         acceptable value in SEG.TSval, and information from the Timestamps
         option MUST NOT be used to update connection state information.
         SEG.TSecr MAY be used to provide stricter <RST> acceptance checks.
      
      In 5 years, if/when all TCP stack are RFC 7323 ready, we might consider
      to decide to send TS val = 0, if it buys something.
      
      Fixes: 7faee5c0 ("tcp: remove TCP_SKB_CB(skb)->when")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      9a2c1f52
    • Florian Fainelli's avatar
      net: dsa: bcm_sf2: Fix 64-bits register writes · 646cd5ed
      Florian Fainelli authored
      [ Upstream commit 03679a14 ]
      
      The macro to write 64-bits quantities to the 32-bits register swapped
      the value and offsets arguments, we want to preserve the ordering of the
      arguments with respect to how writel() is implemented for instance:
      value first, offset/base second.
      
      Fixes: 246d7f77 ("net: dsa: add Broadcom SF2 switch driver")
      Signed-off-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: default avatarVivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      646cd5ed
    • Atsushi Nemoto's avatar
      net: eth: altera: fix napi poll_list corruption · ca41797a
      Atsushi Nemoto authored
      [ Upstream commit 4548a697 ]
      
      tse_poll() calls __napi_complete() with irq enabled.  This leads napi
      poll_list corruption and may stop all napi drivers working.
      Use napi_complete() instead of __napi_complete().
      Signed-off-by: default avatarAtsushi Nemoto <nemoto@toshiba-tops.co.jp>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      ca41797a
    • Eric Sandeen's avatar
      ext4: don't manipulate recovery flag when freezing no-journal fs · 826d518a
      Eric Sandeen authored
      [ Upstream commit c642dc9e ]
      
      At some point along this sequence of changes:
      
      f6e63f90 ext4: fold ext4_nojournal_sops into ext4_sops
      bb044576 ext4: support freezing ext2 (nojournal) file systems
      9ca92389 ext4: Use separate super_operations structure for no_journal filesystems
      
      ext4 started setting needs_recovery on filesystems without journals
      when they are unfrozen.  This makes no sense, and in fact confuses
      blkid to the point where it doesn't recognize the filesystem at all.
      
      (freeze ext2; unfreeze ext2; run blkid; see no output; run dumpe2fs,
      see needs_recovery set on fs w/ no journal).
      
      To fix this, don't manipulate the INCOMPAT_RECOVER feature on
      filesystems without journals.
      Reported-by: default avatarStu Mark <smark@datto.com>
      Reviewed-by: default avatarJan Kara <jack@suse.com>
      Signed-off-by: default avatarEric Sandeen <sandeen@redhat.com>
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      826d518a
    • Daniel Axtens's avatar
      cxl: Fix unbalanced pci_dev_get in cxl_probe · 5324b253
      Daniel Axtens authored
      [ Upstream commit 2925c2fd ]
      
      Currently the first thing we do in cxl_probe is to grab a reference
      on the pci device. Later on, we call device_register on our adapter.
      In our remove path, we call device_unregister, but we never call
      pci_dev_put. We therefore leak the device every time we do a
      reflash.
      
      device_register/unregister is sufficient to hold the reference.
      Therefore, drop the call to pci_dev_get.
      
      Here's why this is safe.
      The proposed cxl_probe(pdev) calls cxl_adapter_init:
          a) init calls cxl_adapter_alloc, which creates a struct cxl,
             conventionally called adapter. This struct contains a
             device entry, adapter->dev.
      
          b) init calls cxl_configure_adapter, where we set
             adapter->dev.parent = &dev->dev (here dev is the pci dev)
      
      So at this point, the cxl adapter's device's parent is the PCI
      device that I want to be refcounted properly.
      
          c) init calls cxl_register_adapter
             *) cxl_register_adapter calls device_register(&adapter->dev)
      
      So now we're in device_register, where dev is the adapter device, and
      we want to know if the PCI device is safe after we return.
      
      device_register(&adapter->dev) calls device_initialize() and then
      device_add().
      
      device_add() does a get_device(). device_add() also explicitly grabs
      the device's parent, and calls get_device() on it:
      
               parent = get_device(dev->parent);
      
      So therefore, device_register() takes a lock on the parent PCI dev,
      which is what pci_dev_get() was guarding. pci_dev_get() can therefore
      be safely removed.
      
      Fixes: f204e0b8 ("cxl: Driver code for powernv PCIe based cards for userspace access")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarDaniel Axtens <dja@axtens.net>
      Acked-by: default avatarIan Munsie <imunsie@au1.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      5324b253
    • Shota Suzuki's avatar
      igb: Fix oops caused by missing queue pairing · 5be042b1
      Shota Suzuki authored
      [ Upstream commit 72ddef05 ]
      
      When initializing igb driver (e.g. 82576, I350), IGB_FLAG_QUEUE_PAIRS is
      set if adapter->rss_queues exceeds half of max_rss_queues in
      igb_init_queue_configuration().
      On the other hand, IGB_FLAG_QUEUE_PAIRS is not set even if the number of
      queues exceeds half of max_combined in igb_set_channels() when changing
      the number of queues by "ethtool -L".
      In this case, if numvecs is larger than MAX_MSIX_ENTRIES (10), the size
      of adapter->msix_entries[], an overflow can occur in
      igb_set_interrupt_capability(), which in turn leads to an oops.
      
      Fix this problem as follows:
       - When changing the number of queues by "ethtool -L", set
         IGB_FLAG_QUEUE_PAIRS in the same way as initializing igb driver.
       - When increasing the size of q_vector, reallocate it appropriately.
         (With IGB_FLAG_QUEUE_PAIRS set, the size of q_vector gets larger.)
      
      Another possible way to fix this problem is to cap the queues at its
      initial number, which is the number of the initial online cpus. But this
      is not the optimal way because we cannot increase queues when another
      cpu becomes online.
      
      Note that before commit cd14ef54 ("igb: Change to use statically
      allocated array for MSIx entries"), this problem did not cause oops
      but just made the number of queues become 1 because of entering msi_only
      mode in igb_set_interrupt_capability().
      
      Fixes: 907b7835 ("igb: Add ethtool support to configure number of channels")
      CC: stable <stable@vger.kernel.org>
      Signed-off-by: default avatarShota Suzuki <suzuki_shota_t3@lab.ntt.co.jp>
      Tested-by: default avatarAaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      5be042b1
    • Larry Finger's avatar
      rtlwifi: rtl8821ae: Fix an expression that is always false · e936a4c6
      Larry Finger authored
      [ Upstream commit 251086f5 ]
      
      In routine _rtl8821ae_set_media_status(), an incorrect mask results in a test
      for AP status to always be false. Similar bugs were fixed in rtl8192cu and
      rtl8192de, but this instance was missed at that time.
      Reported-by: default avatarDavid Binderman <dcb314@hotmail.com>
      Signed-off-by: default avatarLarry Finger <Larry.Finger@lwfinger.net>
      Cc: Stable <stable@vger.kernel.org> [3.18+]
      Cc: David Binderman <dcb314@hotmail.com>
      Signed-off-by: default avatarKalle Valo <kvalo@codeaurora.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      e936a4c6
    • Andy Lutomirski's avatar
      x86/nmi/64: Use DF to avoid userspace RSP confusing nested NMI detection · 4bc532d8
      Andy Lutomirski authored
      [ Upstream commit 810bc075 ]
      
      We have a tricky bug in the nested NMI code: if we see RSP
      pointing to the NMI stack on NMI entry from kernel mode, we
      assume that we are executing a nested NMI.
      
      This isn't quite true.  A malicious userspace program can point
      RSP at the NMI stack, issue SYSCALL, and arrange for an NMI to
      happen while RSP is still pointing at the NMI stack.
      
      Fix it with a sneaky trick.  Set DF in the region of code that
      the RSP check is intended to detect.  IRET will clear DF
      atomically.
      
      ( Note: other than paravirt, there's little need for all this
        complexity. We could check RIP instead of RSP. )
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Reviewed-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      4bc532d8
    • Andy Lutomirski's avatar
      x86/nmi/64: Reorder nested NMI checks · eb0bad52
      Andy Lutomirski authored
      [ Upstream commit a27507ca ]
      
      Check the repeat_nmi .. end_repeat_nmi special case first.  The
      next patch will rework the RSP check and, as a side effect, the
      RSP check will no longer detect repeat_nmi .. end_repeat_nmi, so
      we'll need this ordering of the checks.
      
      Note: this is more subtle than it appears.  The check for
      repeat_nmi .. end_repeat_nmi jumps straight out of the NMI code
      instead of adjusting the "iret" frame to force a repeat.  This
      is necessary, because the code between repeat_nmi and
      end_repeat_nmi sets "NMI executing" and then writes to the
      "iret" frame itself.  If a nested NMI comes in and modifies the
      "iret" frame while repeat_nmi is also modifying it, we'll end up
      with garbage.  The old code got this right, as does the new
      code, but the new code is a bit more explicit.
      
      If we were to move the check right after the "NMI executing"
      check, then we'd get it wrong and have random crashes.
      
      ( Because the "NMI executing" check would jump to the code that would
        modify the "iret" frame without checking if the interrupted NMI was
        currently modifying it. )
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Reviewed-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      eb0bad52
    • Andy Lutomirski's avatar
      x86/nmi/64: Improve nested NMI comments · 092f7a2a
      Andy Lutomirski authored
      [ Upstream commit 0b22930e ]
      
      I found the nested NMI documentation to be difficult to follow.
      Improve the comments.
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Reviewed-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      092f7a2a
    • Ivan Vecera's avatar
      bna: fix interrupts storm caused by erroneous packets · 5e4c0ae9
      Ivan Vecera authored
      [ Upstream commit ade4dc3e ]
      
      The commit "e29aa339 bna: Enable Multi Buffer RX" moved packets counter
      increment from the beginning of the NAPI processing loop after the check
      for erroneous packets so they are never accounted. This counter is used
      to inform firmware about number of processed completions (packets).
      As these packets are never acked the firmware fires IRQs for them again
      and again.
      
      Fixes: e29aa339 ("bna: Enable Multi Buffer RX")
      Signed-off-by: default avatarIvan Vecera <ivecera@redhat.com>
      Acked-by: default avatarRasesh Mody <rasesh.mody@qlogic.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      5e4c0ae9
    • Eric Dumazet's avatar
      udp: fix dst races with multicast early demux · 2ab4f113
      Eric Dumazet authored
      [ Upstream commit 10e2eb87 ]
      
      Multicast dst are not cached. They carry DST_NOCACHE.
      
      As mentioned in commit f8864972 ("ipv4: fix dst race in
      sk_dst_get()"), these dst need special care before caching them
      into a socket.
      
      Caching them is allowed only if their refcnt was not 0, ie we
      must use atomic_inc_not_zero()
      
      Also, we must use READ_ONCE() to fetch sk->sk_rx_dst, as mentioned
      in commit d0c294c5 ("tcp: prevent fetching dst twice in early demux
      code")
      
      Fixes: 421b3885 ("udp: ipv4: Add udp early demux")
      Tested-by: default avatarGregory Hoggarth <Gregory.Hoggarth@alliedtelesis.co.nz>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarGregory Hoggarth <Gregory.Hoggarth@alliedtelesis.co.nz>
      Reported-by: default avatarAlex Gartrell <agartrell@fb.com>
      Cc: Michal Kubeček <mkubecek@suse.cz>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      2ab4f113
    • Lars Westerhoff's avatar
      packet: missing dev_put() in packet_do_bind() · f0efe010
      Lars Westerhoff authored
      [ Upstream commit 158cd4af ]
      
      When binding a PF_PACKET socket, the use count of the bound interface is
      always increased with dev_hold in dev_get_by_{index,name}.  However,
      when rebound with the same protocol and device as in the previous bind
      the use count of the interface was not decreased.  Ultimately, this
      caused the deletion of the interface to fail with the following message:
      
      unregister_netdevice: waiting for dummy0 to become free. Usage count = 1
      
      This patch moves the dev_put out of the conditional part that was only
      executed when either the protocol or device changed on a bind.
      
      Fixes: 902fefb8 ('packet: improve socket create/bind latency in some cases')
      Signed-off-by: default avatarLars Westerhoff <lars.westerhoff@newtec.eu>
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      f0efe010
    • Wilson Kok's avatar
      fib_rules: fix fib rule dumps across multiple skbs · 71960d66
      Wilson Kok authored
      [ Upstream commit 41fc0143 ]
      
      dump_rules returns skb length and not error.
      But when family == AF_UNSPEC, the caller of dump_rules
      assumes that it returns an error. Hence, when family == AF_UNSPEC,
      we continue trying to dump on -EMSGSIZE errors resulting in
      incorrect dump idx carried between skbs belonging to the same dump.
      This results in fib rule dump always only dumping rules that fit
      into the first skb.
      
      This patch fixes dump_rules to return error so that we exit correctly
      and idx is correctly maintained between skbs that are part of the
      same dump.
      Signed-off-by: default avatarWilson Kok <wkok@cumulusnetworks.com>
      Signed-off-by: default avatarRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      71960d66
    • Jesse Gross's avatar
      openvswitch: Zero flows on allocation. · ae688bc6
      Jesse Gross authored
      [ Upstream commit ae5f2fb1 ]
      
      When support for megaflows was introduced, OVS needed to start
      installing flows with a mask applied to them. Since masking is an
      expensive operation, OVS also had an optimization that would only
      take the parts of the flow keys that were covered by a non-zero
      mask. The values stored in the remaining pieces should not matter
      because they are masked out.
      
      While this works fine for the purposes of matching (which must always
      look at the mask), serialization to netlink can be problematic. Since
      the flow and the mask are serialized separately, the uninitialized
      portions of the flow can be encoded with whatever values happen to be
      present.
      
      In terms of functionality, this has little effect since these fields
      will be masked out by definition. However, it leaks kernel memory to
      userspace, which is a potential security vulnerability. It is also
      possible that other code paths could look at the masked key and get
      uninitialized data, although this does not currently appear to be an
      issue in practice.
      
      This removes the mask optimization for flows that are being installed.
      This was always intended to be the case as the mask optimizations were
      really targetting per-packet flow operations.
      
      Fixes: 03f0d916 ("openvswitch: Mega flow implementation")
      Signed-off-by: default avatarJesse Gross <jesse@nicira.com>
      Acked-by: default avatarPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      ae688bc6
    • Marcelo Ricardo Leitner's avatar
      sctp: fix race on protocol/netns initialization · 779c19e0
      Marcelo Ricardo Leitner authored
      [ Upstream commit 8e2d61e0 ]
      
      Consider sctp module is unloaded and is being requested because an user
      is creating a sctp socket.
      
      During initialization, sctp will add the new protocol type and then
      initialize pernet subsys:
      
              status = sctp_v4_protosw_init();
              if (status)
                      goto err_protosw_init;
      
              status = sctp_v6_protosw_init();
              if (status)
                      goto err_v6_protosw_init;
      
              status = register_pernet_subsys(&sctp_net_ops);
      
      The problem is that after those calls to sctp_v{4,6}_protosw_init(), it
      is possible for userspace to create SCTP sockets like if the module is
      already fully loaded. If that happens, one of the possible effects is
      that we will have readers for net->sctp.local_addr_list list earlier
      than expected and sctp_net_init() does not take precautions while
      dealing with that list, leading to a potential panic but not limited to
      that, as sctp_sock_init() will copy a bunch of blank/partially
      initialized values from net->sctp.
      
      The race happens like this:
      
           CPU 0                           |  CPU 1
        socket()                           |
         __sock_create                     | socket()
          inet_create                      |  __sock_create
           list_for_each_entry_rcu(        |
              answer, &inetsw[sock->type], |
              list) {                      |   inet_create
            /* no hits */                  |
           if (unlikely(err)) {            |
            ...                            |
            request_module()               |
            /* socket creation is blocked  |
             * the module is fully loaded  |
             */                            |
             sctp_init                     |
              sctp_v4_protosw_init         |
               inet_register_protosw       |
                list_add_rcu(&p->list,     |
                             last_perm);   |
                                           |  list_for_each_entry_rcu(
                                           |     answer, &inetsw[sock->type],
              sctp_v6_protosw_init         |     list) {
                                           |     /* hit, so assumes protocol
                                           |      * is already loaded
                                           |      */
                                           |  /* socket creation continues
                                           |   * before netns is initialized
                                           |   */
              register_pernet_subsys       |
      
      Simply inverting the initialization order between
      register_pernet_subsys() and sctp_v4_protosw_init() is not possible
      because register_pernet_subsys() will create a control sctp socket, so
      the protocol must be already visible by then. Deferring the socket
      creation to a work-queue is not good specially because we loose the
      ability to handle its errors.
      
      So, as suggested by Vlad, the fix is to split netns initialization in
      two moments: defaults and control socket, so that the defaults are
      already loaded by when we register the protocol, while control socket
      initialization is kept at the same moment it is today.
      
      Fixes: 4db67e80 ("sctp: Make the address lists per network namespace")
      Signed-off-by: default avatarVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: default avatarMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      779c19e0
    • Daniel Borkmann's avatar
      netlink, mmap: transform mmap skb into full skb on taps · d3820009
      Daniel Borkmann authored
      [ Upstream commit 1853c949 ]
      
      Ken-ichirou reported that running netlink in mmap mode for receive in
      combination with nlmon will throw a NULL pointer dereference in
      __kfree_skb() on nlmon_xmit(), in my case I can also trigger an "unable
      to handle kernel paging request". The problem is the skb_clone() in
      __netlink_deliver_tap_skb() for skbs that are mmaped.
      
      I.e. the cloned skb doesn't have a destructor, whereas the mmap netlink
      skb has it pointed to netlink_skb_destructor(), set in the handler
      netlink_ring_setup_skb(). There, skb->head is being set to NULL, so
      that in such cases, __kfree_skb() doesn't perform a skb_release_data()
      via skb_release_all(), where skb->head is possibly being freed through
      kfree(head) into slab allocator, although netlink mmap skb->head points
      to the mmap buffer. Similarly, the same has to be done also for large
      netlink skbs where the data area is vmalloced. Therefore, as discussed,
      make a copy for these rather rare cases for now. This fixes the issue
      on my and Ken-ichirou's test-cases.
      
      Reference: http://thread.gmane.org/gmane.linux.network/371129
      Fixes: bcbde0d4 ("net: netlink: virtual tap device management")
      Reported-by: default avatarKen-ichirou MATSUZAWA <chamaken@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Tested-by: default avatarKen-ichirou MATSUZAWA <chamaken@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      d3820009