1. 19 Jun, 2019 40 commits
    • inet: fix various use-after-free in defrags units · d5dd8879
      Eric Dumazet authored
      syzbot reported another issue caused by my recent patches. [1]
      
      The issue here is that fqdir_exit() is initiating a work queue
      and immediately returns. A bit later cleanup_net() was able
      to free the MIB (percpu data) and the whole struct net was freed,
      but we had active frag timers that fired and triggered use-after-free.
      
      We need to make sure that timers can catch fqdir->dead being set,
      and bail out.
      
      Since RCU is used for the reader side, this means
      we want to respect an RCU grace period between these operations:

      1) fqdir->dead = 1;

      2) netns dismantle (freeing of various data structures)
      
      This patch uses the new (struct pernet_operations)->pre_exit
      infrastructure to ensure that a full RCU grace period
      happens between fqdir_pre_exit() and fqdir_exit().

      This also means we can use a regular work queue; we no
      longer need rcu_work.
      
      Tested:
      
      $ time for i in {1..1000}; do unshare -n /bin/false;done
      
      real	0m2.585s
      user	0m0.160s
      sys	0m2.214s
      
      [1]
      
      BUG: KASAN: use-after-free in ip_expire+0x73e/0x800 net/ipv4/ip_fragment.c:152
      Read of size 8 at addr ffff88808b9fe330 by task syz-executor.4/11860
      
      CPU: 1 PID: 11860 Comm: syz-executor.4 Not tainted 5.2.0-rc2+ #22
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
       __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       kasan_report+0x12/0x20 mm/kasan/common.c:614
       __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
       ip_expire+0x73e/0x800 net/ipv4/ip_fragment.c:152
       call_timer_fn+0x193/0x720 kernel/time/timer.c:1322
       expire_timers kernel/time/timer.c:1366 [inline]
       __run_timers kernel/time/timer.c:1685 [inline]
       __run_timers kernel/time/timer.c:1653 [inline]
       run_timer_softirq+0x66f/0x1740 kernel/time/timer.c:1698
       __do_softirq+0x25c/0x94c kernel/softirq.c:293
       invoke_softirq kernel/softirq.c:374 [inline]
       irq_exit+0x180/0x1d0 kernel/softirq.c:414
       exiting_irq arch/x86/include/asm/apic.h:536 [inline]
       smp_apic_timer_interrupt+0x13b/0x550 arch/x86/kernel/apic/apic.c:1068
       apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:806
       </IRQ>
      RIP: 0010:tomoyo_domain_quota_is_ok+0x131/0x540 security/tomoyo/util.c:1035
      Code: 24 4c 3b 65 d0 0f 84 9c 00 00 00 e8 19 1d 73 fe 49 8d 7c 24 18 48 ba 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03 0f b6 04 10 <48> 89 fa 83 e2 07 38 d0 7f 08 84 c0 0f 85 69 03 00 00 41 0f b6 5c
      RSP: 0018:ffff88806ae079c0 EFLAGS: 00000a02 ORIG_RAX: ffffffffffffff13
      RAX: 0000000000000000 RBX: 0000000000000010 RCX: ffffc9000e655000
      RDX: dffffc0000000000 RSI: ffffffff82fd88a7 RDI: ffff888086202398
      RBP: ffff88806ae07a00 R08: ffff88808b6c8700 R09: ffffed100d5c0f4d
      R10: ffffed100d5c0f4c R11: 0000000000000000 R12: ffff888086202380
      R13: 0000000000000030 R14: 00000000000000d3 R15: 0000000000000000
       tomoyo_supervisor+0x2e8/0xef0 security/tomoyo/common.c:2087
       tomoyo_audit_path_number_log security/tomoyo/file.c:235 [inline]
       tomoyo_path_number_perm+0x42f/0x520 security/tomoyo/file.c:734
       tomoyo_file_ioctl+0x23/0x30 security/tomoyo/tomoyo.c:335
       security_file_ioctl+0x77/0xc0 security/security.c:1370
       ksys_ioctl+0x57/0xd0 fs/ioctl.c:711
       __do_sys_ioctl fs/ioctl.c:720 [inline]
       __se_sys_ioctl fs/ioctl.c:718 [inline]
       __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x4592c9
      Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f8db5e44c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004592c9
      RDX: 0000000020000080 RSI: 00000000000089f1 RDI: 0000000000000006
      RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8db5e456d4
      R13: 00000000004cc770 R14: 00000000004d5cd8 R15: 00000000ffffffff
      
      Allocated by task 9047:
       save_stack+0x23/0x90 mm/kasan/common.c:71
       set_track mm/kasan/common.c:79 [inline]
       __kasan_kmalloc mm/kasan/common.c:489 [inline]
       __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
       kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497
       slab_post_alloc_hook mm/slab.h:437 [inline]
       slab_alloc mm/slab.c:3326 [inline]
       kmem_cache_alloc+0x11a/0x6f0 mm/slab.c:3488
       kmem_cache_zalloc include/linux/slab.h:732 [inline]
       net_alloc net/core/net_namespace.c:386 [inline]
       copy_net_ns+0xed/0x340 net/core/net_namespace.c:426
       create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:107
       unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:206
       ksys_unshare+0x440/0x980 kernel/fork.c:2692
       __do_sys_unshare kernel/fork.c:2760 [inline]
       __se_sys_unshare kernel/fork.c:2758 [inline]
       __x64_sys_unshare+0x31/0x40 kernel/fork.c:2758
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 2541:
       save_stack+0x23/0x90 mm/kasan/common.c:71
       set_track mm/kasan/common.c:79 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
       kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
       __cache_free mm/slab.c:3432 [inline]
       kmem_cache_free+0x86/0x260 mm/slab.c:3698
       net_free net/core/net_namespace.c:402 [inline]
       net_drop_ns.part.0+0x70/0x90 net/core/net_namespace.c:409
       net_drop_ns net/core/net_namespace.c:408 [inline]
       cleanup_net+0x538/0x960 net/core/net_namespace.c:571
       process_one_work+0x989/0x1790 kernel/workqueue.c:2269
       worker_thread+0x98/0xe40 kernel/workqueue.c:2415
       kthread+0x354/0x420 kernel/kthread.c:255
       ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
      
      The buggy address belongs to the object at ffff88808b9fe100
       which belongs to the cache net_namespace of size 6784
      The buggy address is located 560 bytes inside of
       6784-byte region [ffff88808b9fe100, ffff88808b9ffb80)
      The buggy address belongs to the page:
      page:ffffea00022e7f80 refcount:1 mapcount:0 mapping:ffff88821b6f60c0 index:0x0 compound_mapcount: 0
      flags: 0x1fffc0000010200(slab|head)
      raw: 01fffc0000010200 ffffea000256f288 ffffea0001bbef08 ffff88821b6f60c0
      raw: 0000000000000000 ffff88808b9fe100 0000000100000001 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff88808b9fe200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88808b9fe280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      >ffff88808b9fe300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                           ^
       ffff88808b9fe380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88808b9fe400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 3c8fc878 ("inet: frags: rework rhashtable dismantle")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netns: add pre_exit method to struct pernet_operations · d7d99872
      Eric Dumazet authored
      Current struct pernet_operations exit() handlers are strongly
      discouraged from calling synchronize_rcu().
      
      There are cases where we need them, and exit_batch() does
      not help the common case where a single netns is dismantled.
      
      This patch leverages the existing synchronize_rcu() call
      in cleanup_net().

      Calling the optional ->pre_exit() method before ->exit() or
      ->exit_batch() lets us benefit from a single synchronize_rcu()
      call.
      
      Note that the synchronize_rcu() calls added in this patch
      are only in error paths or slow paths.
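
The two-phase teardown described above can be sketched as a toy userspace model. This is only an illustration of the ordering (all pre_exit handlers, then one grace period, then all exit handlers); every toy_* name is hypothetical, the grace period is a stub that just records a step, and the real iteration order over registered operations is not modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy userspace model. Every toy_* name is hypothetical, not a kernel API. */
struct toy_pernet_ops {
    void (*pre_exit)(void); /* optional: e.g. mark objects as dead */
    void (*exit)(void);     /* optional: free per-netns data */
};

static char toy_log[32]; /* records the order of steps, one char each */

static void toy_log_step(char c)
{
    size_t len = strlen(toy_log);

    assert(len + 1 < sizeof(toy_log));
    toy_log[len] = c;
    toy_log[len + 1] = '\0';
}

/* Stand-in for the single grace period shared by all handlers. */
static void toy_synchronize_rcu(void) { toy_log_step('R'); }

static void toy_cleanup_net(const struct toy_pernet_ops *ops, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)     /* phase 1: all pre_exit handlers */
        if (ops[i].pre_exit)
            ops[i].pre_exit();

    toy_synchronize_rcu();      /* one grace period for everyone */

    for (i = 0; i < n; i++)     /* phase 2: all exit handlers */
        if (ops[i].exit)
            ops[i].exit();
}

static void toy_frag_pre_exit(void) { toy_log_step('1'); }
static void toy_frag_exit(void)     { toy_log_step('2'); }
static void toy_other_exit(void)    { toy_log_step('3'); }
```

Registering { toy_frag_pre_exit, toy_frag_exit } and { NULL, toy_other_exit } and calling toy_cleanup_net() records the pre_exit step first, then the grace period, then both exit steps, so no handler needs its own grace period.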
      
      Tested:
      
      $ time for i in {1..1000}; do unshare -n /bin/false;done
      
      real	0m2.612s
      user	0m0.171s
      sys	0m2.216s
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'xdp-page_pool-fixes-and-in-flight-accounting' · 2a54003e
      David S. Miller authored
      Jesper Dangaard Brouer says:
      
      ====================
      xdp: page_pool fixes and in-flight accounting
      
      This patchset fixes the page_pool API and its users, so that drivers can
      use it for DMA-mapping. There were a number of places where the DMA-mapping
      would not get released/unmapped; all of these are fixed. This occurs e.g.
      when an xdp_frame gets converted to an SKB, as the network stack doesn't
      have any callback for XDP memory models.

      The patchset also addresses a shutdown race-condition. Today, removing an
      XDP memory model based on page_pool is only delayed by one RCU grace period.
      This isn't enough, as redirected xdp_frames can still be in-flight on
      different queues (remote driver TX, cpumap or veth).
      
      We stress that when drivers use page_pool for DMA-mapping, then they MUST
      use one packet per page. This might change in the future, but more work lies
      ahead, before we can lift this restriction.
      
      This patchset changes the page_pool API to be stricter, as in-flight page
      accounting is added.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • page_pool: make sure struct device is stable · f71fec47
      Jesper Dangaard Brouer authored
      For the DMA mapping use-case, the page_pool keeps a pointer
      to the struct device, which is used in DMA map/unmap calls.

      For our in-flight handling, we also need to make sure that
      the struct device has not disappeared. This is ensured
      by using the get_device/put_device API.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reported-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • page_pool: add tracepoints for page_pool with details need by XDP · 32c28f7e
      Jesper Dangaard Brouer authored
      The xdp tracepoints for mem id disconnect don't carry information about why
      it was not safe_to_remove. The tracepoint page_pool:page_pool_inflight in
      this patch can be used to extract this info for further debugging.

      This patchset also adds tracepoints for the pages_state_* release/hold
      transitions, including a pointer to the page. This can be used for stats
      about in-flight pages, or to debug page leakage by keeping track of the
      page pointer and combining this with a kprobe for __put_page().
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xdp: add tracepoints for XDP mem · f033b688
      Jesper Dangaard Brouer authored
      These tracepoints make it easier to troubleshoot XDP mem id disconnect.
      
      The xdp:mem_disconnect tracepoint cannot be replaced via kprobe. It is
      placed at the last stable place for the pointer to struct xdp_mem_allocator,
      just before it's scheduled for RCU removal. It also extracts info on
      'safe_to_remove' and 'force'.

      Detailed info about in-flight pages is not available at this layer. The next
      patch will add the tracepoints needed at the page_pool layer for this.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xdp: force mem allocator removal and periodic warning · d956a048
      Jesper Dangaard Brouer authored
      If bugs exist or are introduced later, e.g. by drivers misusing the API,
      then we want to warn about the issue, so that developers notice. This patch
      will generate a bit of noise in the form of a periodic pr_warn every 30
      seconds.

      It is not nice to have this stall warning running forever. Thus, this patch
      will (after 120 attempts) force disconnect the mem id (from the rhashtable)
      and free the page_pool object. This will cause a fallback to put_page() as
      before, which only potentially leaks DMA-mappings, if objects are really
      stuck for this long. In that unlikely case, a WARN_ONCE should show us the
      call stack.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xdp: tracking page_pool resources and safe removal · 99c07c43
      Jesper Dangaard Brouer authored
      This patch is needed before we can allow drivers to use page_pool for
      DMA-mappings. Today with page_pool and the XDP return API, it is possible to
      remove the page_pool object (from the rhashtable) while there are still
      in-flight packet-pages. This is safely handled via RCU, and failed lookups in
      __xdp_return() fall back to calling put_page() when the page_pool object is
      gone. In case the page is still DMA mapped, this will result in the page not
      getting correctly DMA unmapped.

      To solve this, the page_pool is extended with tracking of in-flight pages, and
      the XDP disconnect system queries page_pool and waits, via workqueue, for all
      in-flight pages to be returned.

      To avoid killing performance when tracking in-flight pages, the implementation
      uses two (unsigned) counters that are placed on different cache-lines and
      can be used to deduce the number of in-flight packets. This is done by mapping
      the unsigned "sequence" counters onto signed two's complement arithmetic
      operations. This is e.g. used by the kernel's time_after macros, described in
      kernel commits 1ba3aab3 and 5a581b36, and also explained in RFC1982.

      The trick is that these two incrementing counters only need to be read and
      compared when checking if it's safe to free the page_pool structure, which
      will only happen when the driver has disconnected the RX/alloc side. Thus, on
      a non-fast-path.
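
The counter trick can be illustrated with a small userspace sketch. It assumes 32-bit counters and the usual two's complement conversion from unsigned to signed (what mainstream compilers do); the struct and function names are hypothetical and the cache-line placement is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the two free-running counters. */
struct toy_pool {
    uint32_t hold_cnt;    /* incremented when a page leaves the pool */
    uint32_t release_cnt; /* incremented when a page is returned */
};

/*
 * The unsigned subtraction wraps modulo 2^32, so the difference stays
 * correct even after one or both counters have overflowed. Reading it as
 * a signed value gives the serial-number style distance (RFC1982).
 */
static int32_t toy_inflight(const struct toy_pool *p)
{
    uint32_t distance = p->hold_cnt - p->release_cnt;

    return (int32_t)distance;
}

/* Only needed on the slow path, when deciding whether to free the pool. */
static int toy_safe_to_remove(const struct toy_pool *p)
{
    return toy_inflight(p) == 0;
}
```

For instance, with release_cnt at 0xFFFFFFFE and hold_cnt already wrapped around to 3, the distance is still 5 pages in flight. The fast path only ever increments one counter; the subtraction happens at shutdown.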
      
      It is chosen that page_pool tracking is also enabled for the non-DMA
      use-case, as this can be used for statistics later.
      
      After this patch, using page_pool requires stricter resource "release",
      e.g. via page_pool_release_page() that was introduced in this patchset, and
      previous patches implement/fix this stricter requirement.

      Drivers no longer call page_pool_destroy(). Drivers already call
      xdp_rxq_info_unreg(), which calls xdp_rxq_info_unreg_mem_model(), which will
      attempt to disconnect the mem id and, if the attempt fails, schedule the
      disconnect for later via a delayed workqueue.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlx5: more strict use of page_pool API · 29b006a6
      Jesper Dangaard Brouer authored
      The mlx5 driver is using page_pool, but not for DMA-mapping (currently), and
      is a little too relaxed about returning or releasing page resources, as it
      is not strictly necessary when not using DMA-mappings.

      As this patchset is working towards tracking page_pool resources, to know
      about in-flight frames on shutdown, fix the places where mlx5 leaks
      page_pool resources.

      In case of dma_mapping_error, recycle the page into the page_pool.

      In mlx5e_free_rq(), move the page_pool_destroy() call to after the
      mlx5e_page_release() calls, as it is more correct.

      In mlx5e_page_release(), when no recycle was requested, release the page
      from the page_pool via page_pool_release_page().
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • page_pool: introduce page_pool_free and use in mlx5 · e54cfd7e
      Jesper Dangaard Brouer authored
      In case a driver fails to register the page_pool with the XDP return API (via
      xdp_rxq_info_reg_mem_model()), the driver can free the page_pool
      resources more directly than by calling page_pool_destroy(), which performs
      an unnecessary RCU free procedure.

      This patch prepares for removing page_pool_destroy() from driver
      invocation.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • veth: use xdp_release_frame for XDP_PASS · cbf33510
      Jesper Dangaard Brouer authored
      Like cpumap, use xdp_release_frame() when an xdp_frame gets
      converted into an SKB and sent towards the network stack.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xdp: page_pool related fix to cpumap · 6bf071bf
      Jesper Dangaard Brouer authored
      When converting an xdp_frame into an SKB, and sending this into the network
      stack, the underlying XDP memory model needs to release associated
      resources, because the network stack doesn't have callbacks for XDP memory
      models. The only memory model that needs this is page_pool, when a driver
      uses the DMA-mapping feature.

      Introduce page_pool_release_page(), which basically does the same as
      page_pool_unmap_page(). Add xdp_release_frame() as the XDP memory model
      interface for calling it if the memory model matches MEM_TYPE_PAGE_POOL, to
      save the function call overhead for others. Have cpumap call
      xdp_release_frame() before xdp_scrub_frame().
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xdp: fix leak of IDA cyclic id if rhashtable_insert_slow fails · 516a7593
      Jesper Dangaard Brouer authored
      Fix the error handling case where inserting the ID with rhashtable_insert_slow
      fails in xdp_rxq_info_reg_mem_model, which leads to never releasing the IDA
      ID, as the lookup in xdp_rxq_info_unreg_mem_model fails and thus
      ida_simple_remove() is never called.

      Fix by releasing the ID via ida_simple_remove(), and marking xdp_rxq->mem.id
      with zero, which is already checked in xdp_rxq_info_unreg_mem_model().
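
The shape of this error-path fix can be sketched with a toy ID allocator in userspace. This is not the kernel code: the toy_* names, the fixed-size bitmap and the insert_fails switch are assumptions made for illustration, standing in for the IDA and the rhashtable insert.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_IDS 8

/* Index 0 is never handed out; id 0 means "not registered". */
static bool toy_id_used[TOY_MAX_IDS + 1];

static int toy_id_alloc(void)
{
    for (int i = 1; i <= TOY_MAX_IDS; i++) {
        if (!toy_id_used[i]) {
            toy_id_used[i] = true;
            return i;
        }
    }
    return -1; /* no free id */
}

static void toy_id_free(int id)
{
    assert(id > 0 && id <= TOY_MAX_IDS);
    toy_id_used[id] = false;
}

struct toy_mem {
    int id;
};

/* insert_fails simulates the table insert failing after the id was taken. */
static int toy_reg_mem_model(struct toy_mem *mem, bool insert_fails)
{
    int id = toy_id_alloc();

    if (id < 0)
        return -1;
    mem->id = id;

    if (insert_fails) {
        /* Without these two lines the id would be leaked forever. */
        toy_id_free(mem->id);
        mem->id = 0; /* the unregister path can skip the lookup */
        return -1;
    }
    return 0;
}
```

After a failed registration the id is available again and mem->id reads as zero, so a later unregister call has nothing left to release.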
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: page_pool: add helper function to unmap dma addresses · a25d50bf
      Ilias Apalodimas authored
      In a previous patch, the DMA address was stored in 'struct page'.
      Use that to unmap DMA addresses used by network drivers.
      Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: page_pool: add helper function to retrieve dma addresses · 0afdeeed
      Ilias Apalodimas authored
      In a previous patch, the DMA address was stored in 'struct page'.
      Use that to retrieve DMA addresses used by network drivers.
      Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: netsec: remove loops in napi Rx process · 9371a56f
      Ilias Apalodimas authored
      netsec_process_rx was running in a loop trying to process as many packets
      as possible before re-enabling interrupts. With the recent DMA changes
      this is not needed anymore, as we manage to consume all the budget without
      looping over the function.
      Since removing it has no performance penalty, let's do that and simplify
      the Rx path a bit.
      Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: netsec: initialize tx ring on ndo_open · 39e3622e
      Ilias Apalodimas authored
      Since we changed the Tx ring handling and now depend on bit31 to figure
      out the owner of the descriptor, we should initialize this every time
      the device goes down-up instead of doing it once on driver init. If the
      value is not correctly initialized, the device won't have any available
      descriptors.
      
      Changes since v1:
      - Typo fixes
      
      Fixes: 35e07d23 ("net: socionext: remove mmio reads on Tx")
      Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: fix shift of FID bits in mv88e6250_g1_vtu_loadpurge() · e41d4bc5
      Rasmus Villemoes authored
      The comment is correct, but the code ends up moving the bits four
      places too far, into the VTUOp field.
      
      Fixes: bec8e572 (net: dsa: mv88e6xxx: implement vtu_getnext and vtu_loadpurge for mv88e6250)
      Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • act_ctinfo: Don't use BIT() in UAPI headers. · 23cdf875
      David S. Miller authored
      Use _BITUL() instead.
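
As background, BIT() is a kernel-internal helper, so a UAPI header that uses it will not build when included from userspace, whereas _BITUL() comes from the exported const.h header. The sketch below uses a local stand-in macro with the same shape so that it compiles anywhere; the EXAMPLE_* names are hypothetical and are not the flags touched by this commit.

```c
#include <assert.h>

/*
 * Local stand-in with the same shape as the exported helper:
 * an unsigned long 1 shifted left by n. Hypothetical name.
 */
#define EXAMPLE_BITUL(n) (1UL << (n))

/* Hypothetical flag definitions, as they might appear in a UAPI header. */
#define EXAMPLE_F_FIRST  EXAMPLE_BITUL(0)
#define EXAMPLE_F_SECOND EXAMPLE_BITUL(1)
#define EXAMPLE_F_THIRD  EXAMPLE_BITUL(2)

/* Userspace can test flags without any kernel-internal header. */
static int example_has_flag(unsigned long flags, unsigned long flag)
{
    return (flags & flag) != 0;
}
```

A userspace program can then combine flags such as EXAMPLE_F_FIRST | EXAMPLE_F_THIRD and query them with example_has_flag().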
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mlxsw-Implement-flower-ingress-device-matching-offload' · cfecf0d0
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      mlxsw: Implement flower ingress device matching offload
      
      Jiri says:
      
      When using a shared block, the user might find it handy to be able to insert
      filters that match on a particular ingress device. This patchset exposes the
      ingress ifindex through flow_dissector and flow_offload so mlxsw can use it to
      push down to HW. See the selftests for examples of usage.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: tc: add ingress device matching support · dcc5e1f9
      Jiri Pirko authored
      Extend tc_flower to test plain ingress device matching and also
      tc_shblock to test ingress device matching on shared block.
      Add new tc_flower_router.sh where ingress device matching on egress
      (after routing) is done.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_flower: Implement support for ingress device matching · 0c1f391d
      Jiri Pirko authored
      Benefit from the previously extended flow_dissector infrastructure and
      offload matching on ingress port.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_acl: Fix SRC_SYS_PORT element size · d8e94614
      Jiri Pirko authored
      Fix the size of the SRC_SYS_PORT element to be 16.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_acl: Avoid size check for RX_ACL_SYSTEM_PORT element · ff5405f6
      Jiri Pirko authored
      RX_ACL_SYSTEM_PORT is 8 bits but SRC_SYS_PORT is 16 bits. Internally,
      SRC_SYS_PORT is used to carry the value. Relax the checker in case of
      RX_ACL_SYSTEM_PORT and allow a different size.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_acl: Write RX_ACL_SYSTEM_PORT acl element correctly · 511a5adc
      Jiri Pirko authored
      RX_ACL_SYSTEM_PORT is equal to SRC_SYS_PORT - 1. So before writing to the
      block we need to adjust the key value. Introduce a new "EXT" helper to
      implement this.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: flow_offload: implement support for meta key · 9558a83a
      Jiri Pirko authored
      Implement support for previously added flow dissector meta key.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: cls_flower: use flow_dissector for ingress ifindex · 8212ed77
      Jiri Pirko authored
      Use the previously introduced infra to obtain and store the ingress ifindex
      instead of doing it locally.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • flow_dissector: add support for ingress ifindex dissection · 82828b88
      Jiri Pirko authored
      Add a new key, meta, that contains the ingress ifindex value and add a
      function to dissect this from the skb. The key and function are prepared to
      cover dissection of other potential skb metadata values.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5: add missing void argument to function mlx5_devlink_alloc · 39f58860
      Colin Ian King authored
      Function mlx5_devlink_alloc is missing a void argument, add it
      to clean up the non-ANSI function declaration.
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'net-mvpp2-cls-Allow-steering-based-on-vlan-tag' · da21ad27
      David S. Miller authored
      Maxime Chevallier says:
      
      ====================
      net: mvpp2: cls: Allow steering based on vlan tag
      
      The PPv2 classifier can perform flow steering based on keys extracted
      from the VLAN tag. This series adds support for using the vlan id and
      the vlan prio as keys, using the ethtool interface.
      
      Patch 1 is a preparatory patch that prevents false-positive matches,
      using a dedicated lookup id for the RSS C2 lookup.

      Patch 2 allows separating the flows based on the header fields they
      contain. The main goal is to be able to separate tagged traffic from
      untagged traffic for flow steering, just as we already do for RSS.
      
      Patch 3 solves an issue we have when extracting fields that aren't full
      bytes, such as the vlan tag which is 12 bits wide, or the priority which
      is 3 bits wide.
      
      Finally, patch 4 adds support for steering based on both vlan id and
      priority, extracted from the outermost tag.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mvpp2: cls: Add steering based on vlan Id and priority. · 1274daed
      Maxime Chevallier authored
      This commit allows using the vlan Id and priority as parts of the key
      for classification offload. These fields are extracted from the
      outermost tag, if multiple tags are present.
      
      Vlan Id and priority are considered as 2 different fields by the
      classifier, however the fields are both appended in the Header Extracted
      Key in the same layout as they are found in the tags. This means that
      when steering only based on the prio, a 16-bit slot is still taken in
      the HEK.
      
      The classifier doesn't allow extracting the DEI bit from the tag, so we
      explicitly prevent the user from using this bit in the key.

      This commit adds the vlan priority as a compatible HEK field for
      tagged traffic, meaning that we limit the possibility of extracting this
      field only to the flows that contain tagged traffic.
      Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mvpp2: cls: right-justify the C2 TCAM keys · 12b8e2dd
      Maxime Chevallier authored
      The C2 TCAM used for classification uses a key (Header Extracted Key)
      built by concatenating several fields extracted from the packet header.
      
      After a lot of trial-and-error and some guess work, it seems the HEK is
      right justified, with the first fields being stored in the MSB, then
      concatenated up until the LSB.
      
      Until now, this hasn't caused any issue since all HEK fields we use are
      full bytes. However this is an issue for the upcoming VLAN id and pri
      extraction, which aren't full bytes.
      
      Rework the way we build that TCAM key, by changing the order in which we
      append the fields.
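One way to picture a right-justified key whose first field sits in the most significant bits is a shift-and-or accumulator. This is only a sketch of the general idea, assuming a key that fits in 64 bits; `struct hek` and `hek_append` are hypothetical names and do not reflect the driver's actual data structures or the hardware's real key width:

```c
#include <stdint.h>

/* Hypothetical key accumulator: `bits` holds the key, `len` its bit length. */
struct hek {
	uint64_t bits;
	unsigned int len;
};

/*
 * Append a field of `width` bits (1..32) to the key.
 * Earlier fields are shifted towards the MSB, the newest field lands in
 * the LSBs, so the whole key stays right-justified.
 * Returns 0 on success, -1 if the width is invalid or the key is full.
 */
static int hek_append(struct hek *k, uint32_t val, unsigned int width)
{
	uint64_t mask;

	if (width == 0 || width > 32 || k->len + width > 64)
		return -1;

	mask = (1ULL << width) - 1;
	k->bits = (k->bits << width) | (val & mask);
	k->len += width;
	return 0;
}
```

Appending an 8-bit field 0xAB followed by a 12-bit field 0x123 yields the 20-bit key 0xAB123. With fields that are not byte-sized, the order of the appends decides where each field ends up, which is why the append order matters here.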
      Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      12b8e2dd
    • Maxime Chevallier's avatar
      net: mvpp2: cls: Only select applicable flows of classification offload · 834df6ea
      Maxime Chevallier authored
      The way we currently handle classification offload and RSS is by having
      dedicated lookup sequences in the flow table, each being selected
      depending on several fields being present in the packet header.
      
      We need to make sure the classification operation we want to perform can
      be done in each flow we want to insert it into. As an example,
      classifying on VLAN tag can only be done on flows used for tagged
      traffic.
      
      This commit makes sure we don't insert rules in flows we aren't
      compatible with.
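The compatibility check can be thought of as a subset test on bitmasks: a rule fits a flow only if every field the rule needs is one the flow can extract. A minimal sketch under that assumption; the flag names and `flow_supports_rule` are made up for illustration and are not the driver's identifiers:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical field flags, one bit per extractable header field. */
enum {
	FIELD_VLAN_ID  = 1u << 0,
	FIELD_VLAN_PRI = 1u << 1,
	FIELD_IP4_SRC  = 1u << 2,
	FIELD_IP4_DST  = 1u << 3,
};

/* True if every field required by the rule is supported by the flow. */
static bool flow_supports_rule(uint32_t flow_fields, uint32_t rule_fields)
{
	return (rule_fields & ~flow_fields) == 0;
}
```

A flow for untagged IPv4 traffic would not carry the VLAN flags, so a rule keyed on the VLAN id would be skipped for it.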
      Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      834df6ea
    • Maxime Chevallier's avatar
      net: mvpp2: cls: Use a dedicated lu_type for the RSS lookup · c641af4f
      Maxime Chevallier authored
      When performing a TCAM lookup in the C2 engine, it's possible that
      multiple entries match the packet. To make sure the correct entry matches
      when performing a lookup, the Flow Table can set a lookup type, which
      will be used in the TCAM lookup, thus preventing such false-positives.
      
      We need to make sure the RSS match doesn't interfere with other
      classification lookups, hence we use a dedicated lookup_type for it.
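The effect of a lookup type can be illustrated with a small software model of a ternary match. This is a simplified sketch, not the C2 engine's real behaviour: the entry layout, the `LU_TYPE_*` values and `tcam_lookup` are all assumptions made for the example:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookup types. */
enum { LU_TYPE_RSS = 1, LU_TYPE_CLS = 2 };

/* Simplified TCAM entry: key bits are compared only where mask is 1. */
struct tcam_entry {
	uint32_t value;
	uint32_t mask;
	uint8_t lu_type;
};

/*
 * Return the index of the first entry whose lookup type equals `lu_type`
 * and whose masked value matches `key`, or -1 if nothing matches.
 */
static int tcam_lookup(const struct tcam_entry *tbl, size_t n,
		       uint32_t key, uint8_t lu_type)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].lu_type != lu_type)
			continue;
		if (((key ^ tbl[i].value) & tbl[i].mask) == 0)
			return (int)i;
	}
	return -1;
}
```

In this model a catch-all RSS entry (mask 0) would match every key, so without the lookup type it would shadow or be shadowed by more specific classification entries.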
      Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c641af4f
    • David S. Miller's avatar
      Merge branch 'macb-SiFive-FU540-C000' · 9368b8e2
      David S. Miller authored
      Yash Shah says:
      
      ====================
      Add macb support for SiFive FU540-C000
      
      On FU540, the management IP block is tightly coupled with the Cadence
      MACB IP block. It manages many of the boundary signals from the MACB IP.
      This patchset controls the tx_clk input signal to the MACB IP. It
      switches between the local TX clock (125MHz) and PHY TX clocks. This
      is necessary to toggle between 1Gb and 100/10Mb speeds.
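The clock selection described above amounts to a mapping from link speed to a clock source. A minimal sketch, assuming a hypothetical two-value selector; the enum values and the function name are illustrative and say nothing about the real register encoding:

```c
#include <stdint.h>

/* Hypothetical selector values for the tx_clk input of the MACB IP. */
enum tx_clk_sel {
	TX_CLK_LOCAL_125MHZ = 0,	/* local TX clock, used for 1Gb */
	TX_CLK_PHY = 1,			/* PHY-provided TX clock, used for 100/10Mb */
};

/* Pick the tx_clk source for a link speed given in Mb/s. */
static uint32_t tx_clk_sel_for_speed(unsigned int speed_mbps)
{
	if (speed_mbps == 1000)
		return TX_CLK_LOCAL_125MHZ;

	return TX_CLK_PHY;
}
```

Returning a named selector value, rather than the raw result of a comparison, keeps the value that is eventually written to hardware explicit.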
      
      Future patches may add support for monitoring or controlling other IP
      boundary signals.
      
      This patchset is mostly based on work done by
      Wesley Terpstra <wesley@sifive.com>
      
      This patchset is based on Linux v5.2-rc1 and tested on the HiFive
      Unleashed board. Additional board-related patches needed for testing can
      be found at the dev/yashs/ethernet_v3 branch of:
      https://github.com/yashshah7/riscv-linux.git
      
      Change History:
      V3:
      - Revert "MACB_SIFIVE_FU540" config changes in Kconfig and driver code.
        The driver does not depend on SiFive GPIO driver.
      
      V2:
      - Change compatible string from "cdns,fu540-macb" to "sifive,fu540-macb"
      - Add "MACB_SIFIVE_FU540" in Kconfig to support SiFive FU540 in macb
        driver. This is needed because on FU540, the macb driver depends on
        SiFive GPIO driver.
      - Avoid writing the result of a comparison to a register.
      - Fix the issue of probe fail on reloading the module reported by:
        Andreas Schwab <schwab@suse.de>
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9368b8e2
    • Yash Shah's avatar
      macb: Add support for SiFive FU540-C000 · c218ad55
      Yash Shah authored
      The management IP block is tightly coupled with the Cadence MACB IP
      block on the FU540, and manages many of the boundary signals from the
      MACB IP. This patch only controls the tx_clk input signal to the MACB
      IP. Future patches may add support for monitoring or controlling other
      IP boundary signals.
      Signed-off-by: Yash Shah <yash.shah@sifive.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c218ad55
    • Yash Shah's avatar
      macb: bindings doc: add sifive fu540-c000 binding · d4993e19
      Yash Shah authored
      Add the compatibility string documentation for SiFive FU540-C000
      interface.
      On the FU540, this driver also needs to read and write registers in a
      management IP block that monitors or drives boundary signals for the
      GEMGXL IP block that are not directly mapped to GEMGXL registers.
      Therefore, add an additional range to the "reg" property for SiFive GEMGXL
      management IP registers.
      Signed-off-by: Yash Shah <yash.shah@sifive.com>
      Reviewed-by: Paul Walmsley <paul.walmsley@sifive.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d4993e19
    • David S. Miller's avatar
      Merge branch 'hinic-add-rss-support-and-rss-parameters-configuration' · d75d5f97
      David S. Miller authored
      Xue Chaojing says:
      
      ====================
      hinic: add rss support and rss parameters configuration
      
      This series adds rss support for the HINIC driver and implements the
      ethtool interface related to rss parameter configuration. Users can use
      ethtool to configure or show rss parameters.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d75d5f97
    • Xue Chaojing's avatar
      hinic: add support for rss parameters with ethtool · 4fdc51bb
      Xue Chaojing authored
      This patch adds support for rss parameters with ethtool. Users can
      change the hash key, hash indirection table and hash function with
      ethtool -X, and show rss parameters with ethtool -x.
      Signed-off-by: Xue Chaojing <xuechaojing@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4fdc51bb
    • Xue Chaojing's avatar
      hinic: move ethtool code into hinic_ethtool · eb8ce9ac
      Xue Chaojing authored
      This patch moves ethtool code from hinic_main.c to hinic_ethtool.c
      Signed-off-by: Xue Chaojing <xuechaojing@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      eb8ce9ac