1. 13 Mar, 2020 1 commit
  2. 12 Mar, 2020 14 commits
    • Chris Packham's avatar
      net: mvmdio: avoid error message for optional IRQ · e1f550dc
      Chris Packham authored
      Per the dt-binding the interrupt is optional so use
      platform_get_irq_optional() instead of platform_get_irq(). Since
      commit 7723f4c5 ("driver core: platform: Add an error message to
      platform_get_irq*()") platform_get_irq() produces an error message
      
        orion-mdio f1072004.mdio: IRQ index 0 not found
      
      which is perfectly normal if one hasn't specified the optional property
      in the device tree.
      Signed-off-by: default avatarChris Packham <chris.packham@alliedtelesis.co.nz>
      Reviewed-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e1f550dc
    • Andrew Lunn's avatar
      net: dsa: mv88e6xxx: Add missing mask of ATU occupancy register · 012fc745
      Andrew Lunn authored
      Only the bottom 12 bits contain the ATU bin occupancy statistics. The
      upper bits need masking off.
      
      Fixes: e0c69ca7 ("net: dsa: mv88e6xxx: Add ATU occupancy via devlink resources")
      Signed-off-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      012fc745
    • Eric Dumazet's avatar
      net: memcg: fix lockdep splat in inet_csk_accept() · 06669ea3
      Eric Dumazet authored
      Locking newsk while still holding the listener lock triggered
      a lockdep splat [1]
      
      We can simply move the memcg code after we release the listener lock,
      as this can also help if multiple threads are sharing a common listener.
      
      Also fix a typo while reading socket sk_rmem_alloc.
      
      [1]
      WARNING: possible recursive locking detected
      5.6.0-rc3-syzkaller #0 Not tainted
      --------------------------------------------
      syz-executor598/9524 is trying to acquire lock:
      ffff88808b5b8b90 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline]
      ffff88808b5b8b90 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x69f/0xd30 net/ipv4/inet_connection_sock.c:492
      
      but task is already holding lock:
      ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline]
      ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x8d/0xd30 net/ipv4/inet_connection_sock.c:445
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(sk_lock-AF_INET6);
        lock(sk_lock-AF_INET6);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      1 lock held by syz-executor598/9524:
       #0: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline]
       #0: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x8d/0xd30 net/ipv4/inet_connection_sock.c:445
      
      stack backtrace:
      CPU: 0 PID: 9524 Comm: syz-executor598 Not tainted 5.6.0-rc3-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x188/0x20d lib/dump_stack.c:118
       print_deadlock_bug kernel/locking/lockdep.c:2370 [inline]
       check_deadlock kernel/locking/lockdep.c:2411 [inline]
       validate_chain kernel/locking/lockdep.c:2954 [inline]
       __lock_acquire.cold+0x114/0x288 kernel/locking/lockdep.c:3954
       lock_acquire+0x197/0x420 kernel/locking/lockdep.c:4484
       lock_sock_nested+0xc5/0x110 net/core/sock.c:2947
       lock_sock include/net/sock.h:1541 [inline]
       inet_csk_accept+0x69f/0xd30 net/ipv4/inet_connection_sock.c:492
       inet_accept+0xe9/0x7c0 net/ipv4/af_inet.c:734
       __sys_accept4_file+0x3ac/0x5b0 net/socket.c:1758
       __sys_accept4+0x53/0x90 net/socket.c:1809
       __do_sys_accept4 net/socket.c:1821 [inline]
       __se_sys_accept4 net/socket.c:1818 [inline]
       __x64_sys_accept4+0x93/0xf0 net/socket.c:1818
       do_syscall_64+0xf6/0x790 arch/x86/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x4445c9
      Code: e8 0c 0d 03 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007ffc35b37608 EFLAGS: 00000246 ORIG_RAX: 0000000000000120
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004445c9
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
      RBP: 0000000000000000 R08: 0000000000306777 R09: 0000000000306777
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      R13: 00000000004053d0 R14: 0000000000000000 R15: 0000000000000000
      
      Fixes: d752a498 ("net: memcg: late association of sock to memcg")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      06669ea3
    • David S. Miller's avatar
      Merge branch 's390-qeth-fixes' · 5e72b237
      David S. Miller authored
      Julian Wiedmann says:
      
      ====================
      s390/qeth: fixes 2020-03-11
      
      please apply the following patch series for qeth to netdev's net tree.
      
      Just one fix to get the RX buffer pool resizing right, with two
      preparatory cleanups.
      This is on the larger side given where we are in the -rc cycle, but a
      big chunk of the delta is just refactoring to make the fix look nice.
      
      I intentionally split these off from yesterday's series. No objections
      if you'd rather punt them to net-next, the series should apply cleanly.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5e72b237
    • Julian Wiedmann's avatar
      s390/qeth: implement smarter resizing of the RX buffer pool · 5d4f7856
      Julian Wiedmann authored
      The RX buffer pool is allocated in qeth_alloc_qdio_queues().
      A subsequent pool resizing is then handled in a very simple way:
      first free the current pool, then allocate a new pool of the requested
      size.
      
      There's two ways where this can go wrong:
      1. if the resize action happens _before_ the initial pool was allocated,
         then a subsequent initialization will call qeth_alloc_qdio_queues()
         and fill the pool with a second(!) set of pages. We consume twice the
         planned amount of memory.
         This is easy to fix - just skip the resizing if the queues haven't
         been allocated yet.
      2. if the initial pool was created by qeth_alloc_qdio_queues() but a
         subsequent resizing fails, then the device has no(!) RX buffer pool.
         The next initialization will _not_ call qeth_alloc_qdio_queues(), and
         attempting to back the RX buffers with pages in
         qeth_init_qdio_queues() will fail.
         Not very difficult to fix either - instead of re-allocating the whole
         pool, just allocate/free as many entries to match the desired size.
      
      Fixes: 4a71df50 ("qeth: new qeth device driver")
      Signed-off-by: default avatarJulian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5d4f7856
    • Julian Wiedmann's avatar
      s390/qeth: refactor buffer pool code · 0f75e149
      Julian Wiedmann authored
      In preparation for a subsequent fix, split out helpers to allocate/free
      individual pool entries.
      Signed-off-by: default avatarJulian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0f75e149
    • Julian Wiedmann's avatar
      s390/qeth: use page pointers to manage RX buffer pool · f81649df
      Julian Wiedmann authored
      The RX buffer elements are always backed with full pages, reflect this
      in the pointer type.
      Signed-off-by: default avatarJulian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f81649df
    • Paolo Lungaroni's avatar
      seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol number · 26776253
      Paolo Lungaroni authored
      The Internet Assigned Numbers Authority (IANA) has recently assigned
      a protocol number value of 143 for Ethernet [1].
      
      Before this assignment, encapsulation mechanisms such as Segment Routing
      used the IPv6-NoNxt protocol number (59) to indicate that the encapsulated
      payload is an Ethernet frame.
      
      In this patch, we add the definition of the Ethernet protocol number to the
      kernel headers and update the SRv6 L2 tunnels to use it.
      
      [1] https://www.iana.org/assignments/protocol-numbers/protocol-numbers.xhtmlSigned-off-by: default avatarPaolo Lungaroni <paolo.lungaroni@cnit.it>
      Reviewed-by: default avatarAndrea Mayer <andrea.mayer@uniroma2.it>
      Acked-by: default avatarAhmed Abdelsalam <ahmed.abdelsalam@gssi.it>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      26776253
    • Andrew Lunn's avatar
      net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed · a20f9970
      Andrew Lunn authored
      By default, DSA drivers should configure CPU and DSA ports to their
      maximum speed. In many configurations this is sufficient to make the
      link work.
      
      In some cases it is necessary to configure the link to run slower,
      e.g. because of limitations of the SoC it is connected to. Or back to
      back PHYs are used and the PHY needs to be driven in order to
      establish link. In this case, phylink is used.
      
      Only instantiate phylink if it is required. If there is no PHY, or no
      fixed link properties, phylink can upset a link which works in the
      default configuration.
      
      Fixes: 0e279218 ("net: dsa: Use PHYLINK for the CPU/DSA ports")
      Signed-off-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a20f9970
    • Willem de Bruijn's avatar
      net/packet: tpacket_rcv: do not increment ring index on drop · 46e4c421
      Willem de Bruijn authored
      In one error case, tpacket_rcv drops packets after incrementing the
      ring producer index.
      
      If this happens, it does not update tp_status to TP_STATUS_USER and
      thus the reader is stalled for an iteration of the ring, causing out
      of order arrival.
      
      The only such error path is when virtio_net_hdr_from_skb fails due
      to encountering an unknown GSO type.
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      46e4c421
    • Dominik Czarnota's avatar
      sxgbe: Fix off by one in samsung driver strncpy size arg · f3cc008b
      Dominik Czarnota authored
      This patch fixes an off-by-one error in strncpy size argument in
      drivers/net/ethernet/samsung/sxgbe/sxgbe_main.c. The issue is that in:
      
              strncmp(opt, "eee_timer:", 6)
      
      the passed string literal: "eee_timer:" has 10 bytes (without the NULL
      byte) and the passed size argument is 6. As a result, the logic will
      also accept other, malformed strings, e.g. "eee_tiXXX:".
      
      This bug doesn't seem to have any security impact since its present in
      module's cmdline parsing code.
      Signed-off-by: default avatarDominik Czarnota <dominik.b.czarnota@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f3cc008b
    • Amol Grover's avatar
      net: caif: Add lockdep expression to RCU traversal primitive · f9fc28a8
      Amol Grover authored
      caifdevs->list is traversed using list_for_each_entry_rcu()
      outside an RCU read-side critical section but under the
      protection of rtnl_mutex. Hence, add the corresponding lockdep
      expression to silence the following false-positive warning:
      
      [   10.868467] =============================
      [   10.869082] WARNING: suspicious RCU usage
      [   10.869817] 5.6.0-rc1-00177-g06ec0a154aae4 #1 Not tainted
      [   10.870804] -----------------------------
      [   10.871557] net/caif/caif_dev.c:115 RCU-list traversed in non-reader section!!
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Signed-off-by: default avatarAmol Grover <frextrite@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f9fc28a8
    • Jakub Kicinski's avatar
      MAINTAINERS: remove Sathya Perla as Emulex NIC maintainer · eecba79e
      Jakub Kicinski authored
      Remove Sathya Perla, sathya.perla@broadcom.com is bouncing.
      The driver has 3 more maintainers.
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      eecba79e
    • Jakub Kicinski's avatar
      net: fec: validate the new settings in fec_enet_set_coalesce() · ab14961d
      Jakub Kicinski authored
      fec_enet_set_coalesce() validates the previously set params
      and if they are within range proceeds to apply the new ones.
      The new ones, however, are not validated. This seems backwards,
      probably a copy-paste error?
      
      Compile tested only.
      
      Fixes: d851b47b ("net: fec: add interrupt coalescence feature support")
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      Acked-by: default avatarFugang Duan <fugang.duan@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ab14961d
  3. 11 Mar, 2020 6 commits
  4. 10 Mar, 2020 19 commits
    • David S. Miller's avatar
      Merge branch 's390-qeth-fixes' · 2165fdf4
      David S. Miller authored
      Julian Wiedmann says:
      
      ====================
      s390/qeth: fixes 2020-03-10
      
      This fixes three minor issues:
      1) a setup parameter gets cleared unnecessarily when the HW config
         changes,
      2) insufficient error handling when initially filling the RX ring, and
      3) a rarely used worker that needs to be cancelled during tear down.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2165fdf4
    • Julian Wiedmann's avatar
      s390/qeth: cancel RX reclaim work earlier · 0e635c2a
      Julian Wiedmann authored
      When qeth's napi poll code fails to refill an entirely empty RX ring, it
      kicks off buffer_reclaim_work to try again later.
      
      Make sure that this worker is cancelled when setting the qeth device
      offline. Otherwise a RX refill action can unexpectedly end up running
      concurrently to bigger re-configurations (eg. resizing the buffer pool),
      without any locking.
      
      Fixes: b3332930 ("qeth: add support for af_iucv HiperSockets transport")
      Signed-off-by: default avatarJulian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0e635c2a
    • Julian Wiedmann's avatar
      s390/qeth: handle error when backing RX buffer · 17413852
      Julian Wiedmann authored
      qeth_init_qdio_queues() fills the RX ring with an initial set of
      RX buffers. If qeth_init_input_buffer() fails to back one of the RX
      buffers with memory, we need to bail out and report the error.
      
      Fixes: 4a71df50 ("qeth: new qeth device driver")
      Signed-off-by: default avatarJulian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      17413852
    • Julian Wiedmann's avatar
      s390/qeth: don't reset default_out_queue · 240c1948
      Julian Wiedmann authored
      When an OSA device in prio-queue setup is reduced to 1 TX queue due to
      HW restrictions, we reset its the default_out_queue to 0.
      
      In the old code this was needed so that qeth_get_priority_queue() gets
      the queue selection right. But with proper multiqueue support we already
      reduced dev->real_num_tx_queues to 1, and so the stack puts all traffic
      on txq 0 without even calling .ndo_select_queue.
      
      Thus we can preserve the user's configuration, and apply it if the OSA
      device later re-gains support for multiple TX queues.
      
      Fixes: 73dc2daf ("s390/qeth: add TX multiqueue support for OSA devices")
      Signed-off-by: default avatarJulian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      240c1948
    • David S. Miller's avatar
      Merge branch 'MACSec-bugfixes-related-to-MAC-address-change' · a2d8bf77
      David S. Miller authored
      Igor Russkikh says:
      
      ====================
      MACSec bugfixes related to MAC address change
      
      We found out that there's an issue in MACSec code when the MAC address
      is changed.
      Both s/w and offloaded implementations don't update SCI when the MAC
      address changes at the moment, but they should do so, because SCI contains
      MAC in its first 6 octets.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a2d8bf77
    • Dmitry Bogdanov's avatar
      net: macsec: invoke mdo_upd_secy callback when mac address changed · 09f4136c
      Dmitry Bogdanov authored
      Notify the offload engine about MAC address change to reconfigure it
      accordingly.
      
      Fixes: 3cf3227a ("net: macsec: hardware offloading infrastructure")
      Signed-off-by: default avatarDmitry Bogdanov <dbogdanov@marvell.com>
      Signed-off-by: default avatarMark Starovoytov <mstarovoitov@marvell.com>
      Signed-off-by: default avatarIgor Russkikh <irusskikh@marvell.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      09f4136c
    • Dmitry Bogdanov's avatar
      net: macsec: update SCI upon MAC address change. · 6fc498bc
      Dmitry Bogdanov authored
      SCI should be updated, because it contains MAC in its first 6 octets.
      
      Fixes: c09440f7 ("macsec: introduce IEEE 802.1AE driver")
      Signed-off-by: default avatarDmitry Bogdanov <dbogdanov@marvell.com>
      Signed-off-by: default avatarMark Starovoytov <mstarovoitov@marvell.com>
      Signed-off-by: default avatarIgor Russkikh <irusskikh@marvell.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6fc498bc
    • Juliet Kim's avatar
      ibmvnic: Do not process device remove during device reset · 7d7195a0
      Juliet Kim authored
      The ibmvnic driver does not check the device state when the device
      is removed. If the device is removed while a device reset is being
      processed, the remove may free structures needed by the reset,
      causing an oops.
      
      Fix this by checking the device state before processing device remove.
      Signed-off-by: default avatarJuliet Kim <julietk@linux.vnet.ibm.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7d7195a0
    • Karsten Graul's avatar
      net/smc: cancel event worker during device removal · ece0d7bd
      Karsten Graul authored
      During IB device removal, cancel the event worker before the device
      structure is freed.
      
      Fixes: a4cf0443 ("smc: introduce SMC as an IB-client")
      Reported-by: syzbot+b297c6825752e7a07272@syzkaller.appspotmail.com
      Signed-off-by: default avatarKarsten Graul <kgraul@linux.ibm.com>
      Reviewed-by: default avatarUrsula Braun <ubraun@linux.ibm.com>
      Reviewed-by: default avatarLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ece0d7bd
    • Hangbin Liu's avatar
      ipv6/addrconf: call ipv6_mc_up() for non-Ethernet interface · 60380488
      Hangbin Liu authored
      Rafał found an issue that for non-Ethernet interface, if we down and up
      frequently, the memory will be consumed slowly.
      
      The reason is we add allnodes/allrouters addressed in multicast list in
      ipv6_add_dev(). When link down, we call ipv6_mc_down(), store all multicast
      addresses via mld_add_delrec(). But when link up, we don't call ipv6_mc_up()
      for non-Ethernet interface to remove the addresses. This makes idev->mc_tomb
      getting bigger and bigger. The call stack looks like:
      
      addrconf_notify(NETDEV_REGISTER)
      	ipv6_add_dev
      		ipv6_dev_mc_inc(ff01::1)
      		ipv6_dev_mc_inc(ff02::1)
      		ipv6_dev_mc_inc(ff02::2)
      
      addrconf_notify(NETDEV_UP)
      	addrconf_dev_config
      		/* Alas, we support only Ethernet autoconfiguration. */
      		return;
      
      addrconf_notify(NETDEV_DOWN)
      	addrconf_ifdown
      		ipv6_mc_down
      			igmp6_group_dropped(ff02::2)
      				mld_add_delrec(ff02::2)
      			igmp6_group_dropped(ff02::1)
      			igmp6_group_dropped(ff01::1)
      
      After investigating, I can't found a rule to disable multicast on
      non-Ethernet interface. In RFC2460, the link could be Ethernet, PPP, ATM,
      tunnels, etc. In IPv4, it doesn't check the dev type when calls ip_mc_up()
      in inetdev_event(). Even for IPv6, we don't check the dev type and call
      ipv6_add_dev(), ipv6_dev_mc_inc() after register device.
      
      So I think it's OK to fix this memory consumer by calling ipv6_mc_up() for
      non-Ethernet interface.
      
      v2: Also check IFF_MULTICAST flag to make sure the interface supports
          multicast
      Reported-by: default avatarRafał Miłecki <zajec5@gmail.com>
      Tested-by: default avatarRafał Miłecki <zajec5@gmail.com>
      Fixes: 74235a25 ("[IPV6] addrconf: Fix IPv6 on tuntap tunnels")
      Fixes: 1666d49e ("mld: do not remove mld souce list info when set link down")
      Signed-off-by: default avatarHangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      60380488
    • Shakeel Butt's avatar
      net: memcg: late association of sock to memcg · d752a498
      Shakeel Butt authored
      If a TCP socket is allocated in IRQ context or cloned from unassociated
      (i.e. not associated to a memcg) in IRQ context then it will remain
      unassociated for its whole life. Almost half of the TCPs created on the
      system are created in IRQ context, so, memory used by such sockets will
      not be accounted by the memcg.
      
      This issue is more widespread in cgroup v1 where network memory
      accounting is opt-in but it can happen in cgroup v2 if the source socket
      for the cloning was created in root memcg.
      
      To fix the issue, just do the association of the sockets at the accept()
      time in the process context and then force charge the memory buffer
      already used and reserved by the socket.
      Signed-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d752a498
    • Shakeel Butt's avatar
      cgroup: memcg: net: do not associate sock with unrelated cgroup · e876ecc6
      Shakeel Butt authored
      We are testing network memory accounting in our setup and noticed
      inconsistent network memory usage and often unrelated cgroups network
      usage correlates with testing workload. On further inspection, it
      seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in
      irq context specially for cgroup v1.
      
      mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq context
      and kind of assumes that this can only happen from sk_clone_lock()
      and the source sock object has already associated cgroup. However in
      cgroup v1, where network memory accounting is opt-in, the source sock
      can be unassociated with any cgroup and the new cloned sock can get
      associated with unrelated interrupted cgroup.
      
      Cgroup v2 can also suffer if the source sock object was created by
      process in the root cgroup or if sk_alloc() is called in irq context.
      The fix is to just do nothing in interrupt.
      
      WARNING: Please note that about half of the TCP sockets are allocated
      from the IRQ context, so, memory used by such sockets will not be
      accouted by the memcg.
      
      The stack trace of mem_cgroup_sk_alloc() from IRQ-context:
      
      CPU: 70 PID: 12720 Comm: ssh Tainted:  5.6.0-smp-DEV #1
      Hardware name: ...
      Call Trace:
       <IRQ>
       dump_stack+0x57/0x75
       mem_cgroup_sk_alloc+0xe9/0xf0
       sk_clone_lock+0x2a7/0x420
       inet_csk_clone_lock+0x1b/0x110
       tcp_create_openreq_child+0x23/0x3b0
       tcp_v6_syn_recv_sock+0x88/0x730
       tcp_check_req+0x429/0x560
       tcp_v6_rcv+0x72d/0xa40
       ip6_protocol_deliver_rcu+0xc9/0x400
       ip6_input+0x44/0xd0
       ? ip6_protocol_deliver_rcu+0x400/0x400
       ip6_rcv_finish+0x71/0x80
       ipv6_rcv+0x5b/0xe0
       ? ip6_sublist_rcv+0x2e0/0x2e0
       process_backlog+0x108/0x1e0
       net_rx_action+0x26b/0x460
       __do_softirq+0x104/0x2a6
       do_softirq_own_stack+0x2a/0x40
       </IRQ>
       do_softirq.part.19+0x40/0x50
       __local_bh_enable_ip+0x51/0x60
       ip6_finish_output2+0x23d/0x520
       ? ip6table_mangle_hook+0x55/0x160
       __ip6_finish_output+0xa1/0x100
       ip6_finish_output+0x30/0xd0
       ip6_output+0x73/0x120
       ? __ip6_finish_output+0x100/0x100
       ip6_xmit+0x2e3/0x600
       ? ipv6_anycast_cleanup+0x50/0x50
       ? inet6_csk_route_socket+0x136/0x1e0
       ? skb_free_head+0x1e/0x30
       inet6_csk_xmit+0x95/0xf0
       __tcp_transmit_skb+0x5b4/0xb20
       __tcp_send_ack.part.60+0xa3/0x110
       tcp_send_ack+0x1d/0x20
       tcp_rcv_state_process+0xe64/0xe80
       ? tcp_v6_connect+0x5d1/0x5f0
       tcp_v6_do_rcv+0x1b1/0x3f0
       ? tcp_v6_do_rcv+0x1b1/0x3f0
       __release_sock+0x7f/0xd0
       release_sock+0x30/0xa0
       __inet_stream_connect+0x1c3/0x3b0
       ? prepare_to_wait+0xb0/0xb0
       inet_stream_connect+0x3b/0x60
       __sys_connect+0x101/0x120
       ? __sys_getsockopt+0x11b/0x140
       __x64_sys_connect+0x1a/0x20
       do_syscall_64+0x51/0x200
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The stack trace of mem_cgroup_sk_alloc() from IRQ-context:
      Fixes: 2d758073 ("mm: memcontrol: consolidate cgroup socket tracking")
      Fixes: d979a39d ("cgroup: duplicate cgroup reference when cloning sockets")
      Signed-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Reviewed-by: default avatarRoman Gushchin <guro@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e876ecc6
    • Jakub Kicinski's avatar
      MAINTAINERS: update cxgb4vf maintainer to Vishal · 65dfcf08
      Jakub Kicinski authored
      Casey Leedomn <leedom@chelsio.com> is bouncing,
      Vishal indicated he's happy to take the role.
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      65dfcf08
    • David S. Miller's avatar
      Merge tag 'batadv-net-for-davem-20200306' of git://git.open-mesh.org/linux-merge · 23620594
      David S. Miller authored
      Simon Wunderlich says:
      
      ====================
      Here is a batman-adv bugfix:
      
       - Don't schedule OGM for disabled interface, by Sven Eckelmann
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      23620594
    • Vladimir Oltean's avatar
      net: mscc: ocelot: properly account for VLAN header length when setting MRU · a8015ded
      Vladimir Oltean authored
      What the driver writes into MAC_MAXLEN_CFG does not actually represent
      VLAN_ETH_FRAME_LEN but instead ETH_FRAME_LEN + ETH_FCS_LEN. Yes they are
      numerically equal, but the difference is important, as the switch treats
      VLAN-tagged traffic specially and knows to increase the maximum accepted
      frame size automatically. So it is always wrong to account for VLAN in
      the MAC_MAXLEN_CFG register.
      
      Unconditionally increase the maximum allowed frame size for
      double-tagged traffic. Accounting for the additional length does not
      mean that the other VLAN membership checks aren't performed, so there's
      no harm done.
      
      Also, stop abusing the MTU name for configuring the MRU. There is no
      support for configuring the MRU on an interface at the moment.
      
      Fixes: a556c76a ("net: mscc: Add initial Ocelot switch support")
      Fixes: fa914e9c ("net: mscc: ocelot: create a helper for changing the port MTU")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a8015ded
    • Eric Dumazet's avatar
      ipvlan: do not use cond_resched_rcu() in ipvlan_process_multicast() · afe207d8
      Eric Dumazet authored
      Commit e18b353f ("ipvlan: add cond_resched_rcu() while
      processing muticast backlog") added a cond_resched_rcu() in a loop
      using rcu protection to iterate over slaves.
      
      This is breaking rcu rules, so lets instead use cond_resched()
      at a point we can reschedule
      
      Fixes: e18b353f ("ipvlan: add cond_resched_rcu() while processing muticast backlog")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      afe207d8
    • Dmitry Yakunin's avatar
      cgroup, netclassid: periodically release file_lock on classid updating · 018d26fc
      Dmitry Yakunin authored
      In our production environment we have faced with problem that updating
      classid in cgroup with heavy tasks cause long freeze of the file tables
      in this tasks. By heavy tasks we understand tasks with many threads and
      opened sockets (e.g. balancers). This freeze leads to an increase number
      of client timeouts.
      
      This patch implements following logic to fix this issue:
      аfter iterating 1000 file descriptors file table lock will be released
      thus providing a time gap for socket creation/deletion.
      
      Now update is non atomic and socket may be skipped using calls:
      
      dup2(oldfd, newfd);
      close(oldfd);
      
      But this case is not typical. Moreover before this patch skip is possible
      too by hiding socket fd in unix socket buffer.
      
      New sockets will be allocated with updated classid because cgroup state
      is updated before start of the file descriptors iteration.
      
      So in common cases this patch has no side effects.
      Signed-off-by: default avatarDmitry Yakunin <zeil@yandex-team.ru>
      Reviewed-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      018d26fc
    • Mahesh Bandewar's avatar
      macvlan: add cond_resched() during multicast processing · ce9a4186
      Mahesh Bandewar authored
      The Rx bound multicast packets are deferred to a workqueue and
      macvlan can also suffer from the same attack that was discovered
      by Syzbot for IPvlan. This solution is not as effective as in
      IPvlan. IPvlan defers all (Tx and Rx) multicast packet processing
      to a workqueue while macvlan does this way only for the Rx. This
      fix should address the Rx codition to certain extent.
      
      Tx is still suseptible. Tx multicast processing happens when
      .ndo_start_xmit is called, hence we cannot add cond_resched().
      However, it's not that severe since the user which is generating
       / flooding will be affected the most.
      
      Fixes: 412ca155 ("macvlan: Move broadcasts into a work queue")
      Signed-off-by: default avatarMahesh Bandewar <maheshb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ce9a4186
    • Mahesh Bandewar's avatar
      ipvlan: add cond_resched_rcu() while processing muticast backlog · e18b353f
      Mahesh Bandewar authored
      If there are substantial number of slaves created as simulated by
      Syzbot, the backlog processing could take much longer and result
      into the issue found in the Syzbot report.
      
      INFO: rcu_sched detected stalls on CPUs/tasks:
              (detected by 1, t=10502 jiffies, g=5049, c=5048, q=752)
      All QSes seen, last rcu_sched kthread activity 10502 (4294965563-4294955061), jiffies_till_next_fqs=1, root ->qsmask 0x0
      syz-executor.1  R  running task on cpu   1  10984 11210   3866 0x30020008 179034491270
      Call Trace:
       <IRQ>
       [<ffffffff81497163>] _sched_show_task kernel/sched/core.c:8063 [inline]
       [<ffffffff81497163>] _sched_show_task.cold+0x2fd/0x392 kernel/sched/core.c:8030
       [<ffffffff8146a91b>] sched_show_task+0xb/0x10 kernel/sched/core.c:8073
       [<ffffffff815c931b>] print_other_cpu_stall kernel/rcu/tree.c:1577 [inline]
       [<ffffffff815c931b>] check_cpu_stall kernel/rcu/tree.c:1695 [inline]
       [<ffffffff815c931b>] __rcu_pending kernel/rcu/tree.c:3478 [inline]
       [<ffffffff815c931b>] rcu_pending kernel/rcu/tree.c:3540 [inline]
       [<ffffffff815c931b>] rcu_check_callbacks.cold+0xbb4/0xc29 kernel/rcu/tree.c:2876
       [<ffffffff815e3962>] update_process_times+0x32/0x80 kernel/time/timer.c:1635
       [<ffffffff816164f0>] tick_sched_handle+0xa0/0x180 kernel/time/tick-sched.c:161
       [<ffffffff81616ae4>] tick_sched_timer+0x44/0x130 kernel/time/tick-sched.c:1193
       [<ffffffff815e75f7>] __run_hrtimer kernel/time/hrtimer.c:1393 [inline]
       [<ffffffff815e75f7>] __hrtimer_run_queues+0x307/0xd90 kernel/time/hrtimer.c:1455
       [<ffffffff815e90ea>] hrtimer_interrupt+0x2ea/0x730 kernel/time/hrtimer.c:1513
       [<ffffffff844050f4>] local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1031 [inline]
       [<ffffffff844050f4>] smp_apic_timer_interrupt+0x144/0x5e0 arch/x86/kernel/apic/apic.c:1056
       [<ffffffff84401cbe>] apic_timer_interrupt+0x8e/0xa0 arch/x86/entry/entry_64.S:778
      RIP: 0010:do_raw_read_lock+0x22/0x80 kernel/locking/spinlock_debug.c:153
      RSP: 0018:ffff8801dad07ab8 EFLAGS: 00000a02 ORIG_RAX: ffffffffffffff12
      RAX: 0000000000000000 RBX: ffff8801c4135680 RCX: 0000000000000000
      RDX: 1ffff10038826afe RSI: ffff88019d816bb8 RDI: ffff8801c41357f0
      RBP: ffff8801dad07ac0 R08: 0000000000004b15 R09: 0000000000310273
      R10: ffff88019d816bb8 R11: 0000000000000001 R12: ffff8801c41357e8
      R13: 0000000000000000 R14: ffff8801cfb19850 R15: ffff8801cfb198b0
       [<ffffffff8101460e>] __raw_read_lock_bh include/linux/rwlock_api_smp.h:177 [inline]
       [<ffffffff8101460e>] _raw_read_lock_bh+0x3e/0x50 kernel/locking/spinlock.c:240
       [<ffffffff840d78ca>] ipv6_chk_mcast_addr+0x11a/0x6f0 net/ipv6/mcast.c:1006
       [<ffffffff84023439>] ip6_mc_input+0x319/0x8e0 net/ipv6/ip6_input.c:482
       [<ffffffff840211c8>] dst_input include/net/dst.h:449 [inline]
       [<ffffffff840211c8>] ip6_rcv_finish+0x408/0x610 net/ipv6/ip6_input.c:78
       [<ffffffff840214de>] NF_HOOK include/linux/netfilter.h:292 [inline]
       [<ffffffff840214de>] NF_HOOK include/linux/netfilter.h:286 [inline]
       [<ffffffff840214de>] ipv6_rcv+0x10e/0x420 net/ipv6/ip6_input.c:278
       [<ffffffff83a29efa>] __netif_receive_skb_one_core+0x12a/0x1f0 net/core/dev.c:5303
       [<ffffffff83a2a15c>] __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:5417
       [<ffffffff83a2f536>] process_backlog+0x216/0x6c0 net/core/dev.c:6243
       [<ffffffff83a30d1b>] napi_poll net/core/dev.c:6680 [inline]
       [<ffffffff83a30d1b>] net_rx_action+0x47b/0xfb0 net/core/dev.c:6748
       [<ffffffff846002c8>] __do_softirq+0x2c8/0x99a kernel/softirq.c:317
       [<ffffffff813e656a>] invoke_softirq kernel/softirq.c:399 [inline]
       [<ffffffff813e656a>] irq_exit+0x16a/0x1a0 kernel/softirq.c:439
       [<ffffffff84405115>] exiting_irq arch/x86/include/asm/apic.h:561 [inline]
       [<ffffffff84405115>] smp_apic_timer_interrupt+0x165/0x5e0 arch/x86/kernel/apic/apic.c:1058
       [<ffffffff84401cbe>] apic_timer_interrupt+0x8e/0xa0 arch/x86/entry/entry_64.S:778
       </IRQ>
      RIP: 0010:__sanitizer_cov_trace_pc+0x26/0x50 kernel/kcov.c:102
      RSP: 0018:ffff880196033bd8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff12
      RAX: ffff88019d8161c0 RBX: 00000000ffffffff RCX: ffffc90003501000
      RDX: 0000000000000002 RSI: ffffffff816236d1 RDI: 0000000000000005
      RBP: ffff880196033bd8 R08: ffff88019d8161c0 R09: 0000000000000000
      R10: 1ffff10032c067f0 R11: 0000000000000000 R12: 0000000000000000
      R13: 0000000000000080 R14: 0000000000000000 R15: 0000000000000000
       [<ffffffff816236d1>] do_futex+0x151/0x1d50 kernel/futex.c:3548
       [<ffffffff816260f0>] C_SYSC_futex kernel/futex_compat.c:201 [inline]
       [<ffffffff816260f0>] compat_SyS_futex+0x270/0x3b0 kernel/futex_compat.c:175
       [<ffffffff8101da17>] do_syscall_32_irqs_on arch/x86/entry/common.c:353 [inline]
       [<ffffffff8101da17>] do_fast_syscall_32+0x357/0xe1c arch/x86/entry/common.c:415
       [<ffffffff84401a9b>] entry_SYSENTER_compat+0x8b/0x9d arch/x86/entry/entry_64_compat.S:139
      RIP: 0023:0xf7f23c69
      RSP: 002b:00000000f5d1f12c EFLAGS: 00000282 ORIG_RAX: 00000000000000f0
      RAX: ffffffffffffffda RBX: 000000000816af88 RCX: 0000000000000080
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000816af8c
      RBP: 00000000f5d1f228 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      rcu_sched kthread starved for 10502 jiffies! g5049 c5048 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=1
      rcu_sched       R  running task on cpu   1  13048     8      2 0x90000000 179099587640
      Call Trace:
       [<ffffffff8147321f>] context_switch+0x60f/0xa60 kernel/sched/core.c:3209
       [<ffffffff8100095a>] __schedule+0x5aa/0x1da0 kernel/sched/core.c:3934
       [<ffffffff810021df>] schedule+0x8f/0x1b0 kernel/sched/core.c:4011
       [<ffffffff8101116d>] schedule_timeout+0x50d/0xee0 kernel/time/timer.c:1803
       [<ffffffff815c13f1>] rcu_gp_kthread+0xda1/0x3b50 kernel/rcu/tree.c:2327
       [<ffffffff8144b318>] kthread+0x348/0x420 kernel/kthread.c:246
       [<ffffffff84400266>] ret_from_fork+0x56/0x70 arch/x86/entry/entry_64.S:393
      
      Fixes: ba35f858 (“ipvlan: Defer multicast / broadcast processing to a work-queue”)
      Signed-off-by: default avatarMahesh Bandewar <maheshb@google.com>
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e18b353f