1. 04 Nov, 2022 19 commits
    • bpf: Refactor map->off_arr handling · f71b2f64
      Kumar Kartikeya Dwivedi authored
      Refactor map->off_arr handling into generic functions that can work on
      their own without hardcoding map specific code. The btf_field_offs
      structure is now returned from btf_parse_field_offs, which can be reused
      later for types in program BTF.
      
      All functions like copy_map_value, zero_map_value call generic
      underlying functions so that they can also be reused later for copying
      to values allocated in programs which encode specific fields.
      
      Later, some helper functions will also require access to this
      btf_field_offs structure to be able to skip over special fields at
      runtime.
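      
      The idea of skipping over special fields while copying a value can be sketched as follows. This is a minimal Python model, not kernel code: `copy_skipping_fields` and its parameters are hypothetical names, and the two lists merely stand in for the offsets and sizes that a structure like btf_field_offs would carry.

```python
def copy_skipping_fields(dst, src, field_offs, field_sizes):
    """Copy src into dst, leaving the byte ranges of special fields untouched."""
    if len(dst) != len(src):
        raise ValueError("dst and src must have the same length")
    cur = 0
    # Walk the special fields in offset order and copy the gaps between them.
    for off, size in sorted(zip(field_offs, field_sizes)):
        dst[cur:off] = src[cur:off]
        cur = off + size
    dst[cur:] = src[cur:]  # tail after the last special field
    return dst


# Hypothetical 16-byte value with special fields at [4, 8) and [12, 14).
src = bytearray(range(16))
dst = bytearray(b"\xff" * 16)
copy_skipping_fields(dst, src, field_offs=[4, 12], field_sizes=[4, 2])
```

      In this model the bytes of `dst` that overlap a special field keep their previous contents, while everything else is overwritten from `src`.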
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221103191013.1236066-9-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Consolidate spin_lock, timer management into btf_record · db559117
      Kumar Kartikeya Dwivedi authored
      Now that kptr_off_tab has been refactored into btf_record, and can hold
      more than one specific field type, accommodate bpf_spin_lock and
      bpf_timer as well.
      
      While they don't require any more metadata than offset, having all
      special fields in one place allows us to share the same code for
      allocated user defined types and handle both map values and these
      allocated objects in a similar fashion.
      
      As an optimization, we still keep spin_lock_off and timer_off offsets in
      the btf_record structure, just to avoid having to find the btf_field
      struct each time their offset is needed. This is mostly needed to
      manipulate such objects in a map value at runtime. It's ok to hardcode
      just one offset as more than one field is disallowed.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221103191013.1236066-8-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'veristat: replay, filtering, sorting' · af085f55
      Alexei Starovoitov authored
      Andrii Nakryiko says:
      
      ====================
      
      This patch set adds a bunch of new features and improvements that were sorely
      missing during recent active use of veristat to develop BPF verifier precision
      changes. Individual patches provide justification, explanation and often
      examples showing how new capabilities can be used.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: support stat filtering in comparison mode in veristat · d5ce4b89
      Andrii Nakryiko authored
      Finally add support for filtering stat values, similar to
      non-comparison mode filtering. In comparison mode, four variants of each
      stat matter for filtering: they allow filtering on either the A or B side,
      but even more importantly they allow filtering on the value
      difference, which for the verdict stat is the MATCH/MISMATCH
      classification. So with these changes it's finally possible to easily
      check if there were any mismatches between failure/success outcomes on
      two separate data sets, as in the example below:
      
        $ ./veristat -e file,prog,verdict,insns -C ~/baseline-results.csv ~/shortest-results.csv -f verdict_diff=mismatch
        File                                   Program                Verdict (A)  Verdict (B)  Verdict (DIFF)  Insns (A)  Insns (B)  Insns        (DIFF)
        -------------------------------------  ---------------------  -----------  -----------  --------------  ---------  ---------  -------------------
        dynptr_success.bpf.linked1.o           test_data_slice        success      failure      MISMATCH               85          0       -85 (-100.00%)
        dynptr_success.bpf.linked1.o           test_read_write        success      failure      MISMATCH             1992          0     -1992 (-100.00%)
        dynptr_success.bpf.linked1.o           test_ringbuf           success      failure      MISMATCH               74          0       -74 (-100.00%)
        kprobe_multi.bpf.linked1.o             test_kprobe            failure      success      MISMATCH                0        246      +246 (+100.00%)
        kprobe_multi.bpf.linked1.o             test_kprobe_manual     failure      success      MISMATCH                0        246      +246 (+100.00%)
        kprobe_multi.bpf.linked1.o             test_kretprobe         failure      success      MISMATCH                0        248      +248 (+100.00%)
        kprobe_multi.bpf.linked1.o             test_kretprobe_manual  failure      success      MISMATCH                0        248      +248 (+100.00%)
        kprobe_multi.bpf.linked1.o             trigger                failure      success      MISMATCH                0          2        +2 (+100.00%)
        netcnt_prog.bpf.linked1.o              bpf_nextcnt            failure      success      MISMATCH                0         56       +56 (+100.00%)
        pyperf600_nounroll.bpf.linked1.o       on_event               success      failure      MISMATCH           568128    1000001    +431873 (+76.02%)
        ringbuf_bench.bpf.linked1.o            bench_ringbuf          success      failure      MISMATCH                8          0        -8 (-100.00%)
        strobemeta.bpf.linked1.o               on_event               success      failure      MISMATCH           557149    1000001    +442852 (+79.49%)
        strobemeta_nounroll1.bpf.linked1.o     on_event               success      failure      MISMATCH            57240    1000001  +942761 (+1647.03%)
        strobemeta_nounroll2.bpf.linked1.o     on_event               success      failure      MISMATCH           501725    1000001    +498276 (+99.31%)
        strobemeta_subprogs.bpf.linked1.o      on_event               success      failure      MISMATCH            65420    1000001  +934581 (+1428.59%)
        test_map_in_map_invalid.bpf.linked1.o  xdp_noop0              success      failure      MISMATCH                2          0        -2 (-100.00%)
        test_mmap.bpf.linked1.o                test_mmap              success      failure      MISMATCH               46          0       -46 (-100.00%)
        test_verif_scale3.bpf.linked1.o        balancer_ingress       success      failure      MISMATCH           845499    1000001    +154502 (+18.27%)
        -------------------------------------  ---------------------  -----------  -----------  --------------  ---------  ---------  -------------------
      
      Note that by filtering on verdict_diff=mismatch, it's now extremely easy and
      fast to see any changes in verdict. The example above showcases both failure ->
      success transitions (which are generally surprising) and success -> failure
      transitions (which are expected if bugs are present).
      
      Because veristat allows querying relative percent difference values, the internal
      logic for comparison mode is based on floating point numbers, and so requires a bit
      of epsilon precision logic, deviating from the simpler rules for handling
      integers.
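      
      A minimal sketch of what epsilon-aware comparison can look like, in Python rather than veristat's C. The tolerance value, the function name `float_cmp`, and the operator spellings are assumptions chosen for illustration, not veristat's actual implementation.

```python
EPS = 1e-9  # hypothetical tolerance; the real value used by veristat is not stated here


def float_cmp(op, value, target, eps=EPS):
    """Compare two floats, treating values within eps of each other as equal."""
    if op == "=":
        return abs(value - target) < eps
    if op == "!=":
        return abs(value - target) >= eps
    if op == "<":
        return value < target - eps
    if op == "<=":
        return value <= target + eps
    if op == ">":
        return value > target + eps
    if op == ">=":
        return value >= target - eps
    raise ValueError(f"unknown operator: {op}")


# Relative difference for a row taken from the sample output: 71736 -> 73430 insns.
pct = (73430 - 71736) / 71736 * 100.0
```

      With plain float equality, `0.1 + 0.2 == 0.3` is false; with the tolerance above, `float_cmp("=", 0.1 + 0.2, 0.3)` treats the two as equal, which is the kind of behaviour a percentage filter needs.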
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221103055304.2904589-11-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: support stats ordering in comparison mode in veristat · fa9bb590
      Andrii Nakryiko authored
      Introduce the concept of "stat variant", by which it's possible to
      specify whether to use the value from A (baseline) side, B (comparison
      or control) side, the absolute difference value or relative (percentage)
      difference value.
      
      To support specifying this, veristat recognizes `_a`, `_b`, `_diff`,
      `_pct` suffixes, which can be appended to stat name(s). In
      non-comparison mode variants are ignored (there is effectively only the
      `_a` variant). If no variant suffix is provided, `_b` is assumed, as the
      control group is of primary interest in comparison mode.
      
      These stat variants can be flexibly combined with asc/desc orders.
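      
      The variant suffixes can be modelled roughly as below. This is a Python sketch under stated assumptions: `split_variant` and `variant_value` are hypothetical names, any asc/desc order symbol is assumed to have been stripped already, and the handling of a zero baseline (reporting +/-100%) is inferred from the sample output in this series rather than taken from veristat's source.

```python
def split_variant(spec):
    """Split 'insns_pct' into ('insns', 'pct'); default to the B side."""
    for suffix in ("_diff", "_pct", "_a", "_b"):
        if spec.endswith(suffix):
            return spec[: -len(suffix)], suffix[1:]
    return spec, "b"


def variant_value(a, b, variant="b"):
    """Pick the value to sort or filter on; None stands for a missing (N/A) side."""
    if variant == "a":
        return a
    if variant == "b":
        return b
    if a is None or b is None:
        return None  # no difference can be computed without both sides
    if variant == "diff":
        return b - a
    if variant == "pct":
        if a == b:
            return 0.0
        if a == 0:
            # Assumption inferred from sample rows such as 0 -> 246 (+100.00%).
            return 100.0 if b > 0 else -100.0
        return (b - a) / a * 100.0
    raise ValueError(f"unknown variant: {variant}")
```

      For instance, `split_variant("verdict_diff")` yields the stat name `verdict` with the `diff` variant, and a bare `insns` falls back to the B side.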
      
      Here's an example of ordering results first by verdict match/mismatch (or n/a
      if one of the sides is missing; n/a is always considered to be the lowest
      value), and within each match/mismatch/n/a group further sort by number of
      instructions in B side. In this case we don't have MISMATCH cases, but N/A are
      split from MATCH, demonstrating this custom ordering.
      
        $ ./veristat -e file,prog,verdict,insns -s verdict_diff,insns_b_ -C ~/base.csv ~/comp.csv
        File                Program                         Verdict (A)  Verdict (B)  Verdict (DIFF)  Insns (A)  Insns (B)  Insns   (DIFF)
        ------------------  ------------------------------  -----------  -----------  --------------  ---------  ---------  --------------
        bpf_xdp.o           tail_lb_ipv6                    N/A          success      N/A                   N/A     151895             N/A
        bpf_xdp.o           tail_nodeport_nat_egress_ipv4   N/A          success      N/A                   N/A      15619             N/A
        bpf_xdp.o           tail_nodeport_ipv6_dsr          N/A          success      N/A                   N/A       1206             N/A
        bpf_xdp.o           tail_nodeport_ipv4_dsr          N/A          success      N/A                   N/A       1162             N/A
        bpf_alignchecker.o  tail_icmp6_send_echo_reply      N/A          failure      N/A                   N/A         74             N/A
        bpf_alignchecker.o  __send_drop_notify              success      N/A          N/A                    53        N/A             N/A
        bpf_host.o          __send_drop_notify              success      N/A          N/A                    53        N/A             N/A
        bpf_host.o          cil_from_host                   success      N/A          N/A                   762        N/A             N/A
        bpf_xdp.o           tail_lb_ipv4                    success      success      MATCH               71736      73430  +1694 (+2.36%)
        bpf_xdp.o           tail_handle_nat_fwd_ipv4        success      success      MATCH               21547      20920   -627 (-2.91%)
        bpf_xdp.o           tail_rev_nodeport_lb6           success      success      MATCH               17954      17905    -49 (-0.27%)
        bpf_xdp.o           tail_handle_nat_fwd_ipv6        success      success      MATCH               16974      17039    +65 (+0.38%)
        bpf_xdp.o           tail_nodeport_nat_ingress_ipv4  success      success      MATCH                7658       7713    +55 (+0.72%)
        bpf_xdp.o           tail_rev_nodeport_lb4           success      success      MATCH                7126       6934   -192 (-2.69%)
        bpf_xdp.o           tail_nodeport_nat_ingress_ipv6  success      success      MATCH                6405       6397     -8 (-0.12%)
        bpf_xdp.o           tail_nodeport_nat_ipv6_egress   failure      failure      MATCH                 752        752     +0 (+0.00%)
        bpf_xdp.o           cil_xdp_entry                   success      success      MATCH                 423        423     +0 (+0.00%)
        bpf_xdp.o           __send_drop_notify              success      success      MATCH                 151        151     +0 (+0.00%)
        bpf_alignchecker.o  tail_icmp6_handle_ns            failure      failure      MATCH                  33         33     +0 (+0.00%)
        ------------------  ------------------------------  -----------  -----------  --------------  ---------  ---------  --------------
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221103055304.2904589-10-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: handle missing records in comparison mode better in veristat · a5710848
      Andrii Nakryiko authored
      When comparing two datasets, if either side is missing a corresponding
      record with the same file and prog name, veristat currently emits
      misleading zeros/failures, and even tries to calculate a difference,
      even though there is no data to compare against.
      
      This patch improves the internal logic for handling such situations. Now
      we'll emit "N/A" in places where data is missing and comparison is
      nonsensical.
      
      As an example, with artificially truncated and mismatched Cilium
      results, the output looks like below:
      
        $ ./veristat -e file,prog,verdict,insns -C ~/base.csv ~/comp.csv
        File                Program                         Verdict (A)  Verdict (B)  Verdict (DIFF)  Insns (A)  Insns (B)  Insns   (DIFF)
        ------------------  ------------------------------  -----------  -----------  --------------  ---------  ---------  --------------
        bpf_alignchecker.o  __send_drop_notify              success      N/A          N/A                    53        N/A             N/A
        bpf_alignchecker.o  tail_icmp6_handle_ns            failure      failure      MATCH                  33         33     +0 (+0.00%)
        bpf_alignchecker.o  tail_icmp6_send_echo_reply      N/A          failure      N/A                   N/A         74             N/A
        bpf_host.o          __send_drop_notify              success      N/A          N/A                    53        N/A             N/A
        bpf_host.o          cil_from_host                   success      N/A          N/A                   762        N/A             N/A
        bpf_xdp.o           __send_drop_notify              success      success      MATCH                 151        151     +0 (+0.00%)
        bpf_xdp.o           cil_xdp_entry                   success      success      MATCH                 423        423     +0 (+0.00%)
        bpf_xdp.o           tail_handle_nat_fwd_ipv4        success      success      MATCH               21547      20920   -627 (-2.91%)
        bpf_xdp.o           tail_handle_nat_fwd_ipv6        success      success      MATCH               16974      17039    +65 (+0.38%)
        bpf_xdp.o           tail_lb_ipv4                    success      success      MATCH               71736      73430  +1694 (+2.36%)
        bpf_xdp.o           tail_lb_ipv6                    N/A          success      N/A                   N/A     151895             N/A
        bpf_xdp.o           tail_nodeport_ipv4_dsr          N/A          success      N/A                   N/A       1162             N/A
        bpf_xdp.o           tail_nodeport_ipv6_dsr          N/A          success      N/A                   N/A       1206             N/A
        bpf_xdp.o           tail_nodeport_nat_egress_ipv4   N/A          success      N/A                   N/A      15619             N/A
        bpf_xdp.o           tail_nodeport_nat_ingress_ipv4  success      success      MATCH                7658       7713    +55 (+0.72%)
        bpf_xdp.o           tail_nodeport_nat_ingress_ipv6  success      success      MATCH                6405       6397     -8 (-0.12%)
        bpf_xdp.o           tail_nodeport_nat_ipv6_egress   failure      failure      MATCH                 752        752     +0 (+0.00%)
        bpf_xdp.o           tail_rev_nodeport_lb4           success      success      MATCH                7126       6934   -192 (-2.69%)
        bpf_xdp.o           tail_rev_nodeport_lb6           success      success      MATCH               17954      17905    -49 (-0.27%)
        ------------------  ------------------------------  -----------  -----------  --------------  ---------  ---------  --------------
      
      Internally veristat now separates joining two datasets and remembering the
      join, and actually emitting a comparison view. This will come in handy when we add
      support for filtering and custom ordering in comparison mode.
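      
      Separating the join from the rendering can be sketched like this. It is a Python illustration only: `join_results` and `fmt_insns_diff` are hypothetical names, and the dictionaries hold a few instruction counts copied from the sample output above.

```python
def join_results(base, comp):
    """Join two {(file, prog): insns} mappings; a missing side becomes None."""
    keys = sorted(set(base) | set(comp))
    return [(key, base.get(key), comp.get(key)) for key in keys]


def fmt_insns_diff(a, b):
    """Render the difference column, or N/A when either side is missing."""
    if a is None or b is None:
        return "N/A"
    return f"{b - a:+d}"


base = {
    ("bpf_host.o", "cil_from_host"): 762,
    ("bpf_xdp.o", "tail_lb_ipv4"): 71736,
}
comp = {
    ("bpf_xdp.o", "tail_lb_ipv4"): 73430,
    ("bpf_xdp.o", "tail_lb_ipv6"): 151895,
}
joined = join_results(base, comp)
diffs = [fmt_insns_diff(a, b) for _, a, b in joined]
```

      Because the joined rows are kept as data, later steps such as filtering or custom ordering can operate on them before anything is printed.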
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221103055304.2904589-9-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: make veristat emit all stats in CSV mode by default · 77534401
      Andrii Nakryiko authored
      Make veristat distinguish between table and CSV output formats and use
      a different default set of stats (columns) for each. While for
      human-readable table output it doesn't make sense to output all known
      stats, it is very useful for CSV mode to record all possible data, so
      that it can later be queried and filtered in replay or comparison mode.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221103055304.2904589-8-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: support simple filtering of stats in veristat · 1bb4ec81
      Andrii Nakryiko authored
      Define simple expressions to filter not just by file and program name,
      but also by resulting values of collected stats. Support the usual
      equality and inequality operators. Verdict, which is a boolean-like
      field, can also be filtered as 0/1, as failure/success (with f/s,
      fail/succ, and failure/success aliases), or as false/true (f/t).
      Aliases are case insensitive.
      
      Currently this filtering is honored only in verification and replay
      modes. Comparison mode support will be added in the next patch.
      
      Here's an example of verifying a bunch of BPF object files and emitting
      only results for successfully validated programs that have more than 100
      total instructions processed by BPF verifier, sorted by number of
      instructions in ascending order:
      
        $ sudo ./veristat *.bpf.o -s insns^ -f 'insns>100'
      
      There can be many filters (both allow and deny flavors), all of them are
      combined.
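      
      A rough Python sketch of how such a filter expression could be parsed and evaluated. The names `parse_filter` and `matches`, the regular expression, and the exact set of operator spellings are assumptions for illustration; veristat itself is written in C and may accept a different grammar.

```python
import operator
import re

# Verdict aliases as described above; matching is case insensitive.
VERDICT_ALIASES = {
    "0": 0, "f": 0, "fail": 0, "failure": 0, "false": 0,
    "1": 1, "s": 1, "succ": 1, "success": 1, "t": 1, "true": 1,
}
OPS = {
    "=": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}
# Two-character operators are listed first so that '<=' is not read as '<'.
FILTER_RE = re.compile(r"^(\w+)\s*(<=|>=|!=|=|<|>)\s*(.+)$")


def parse_filter(expr):
    """Turn 'insns>100' into ('insns', operator.gt, 100)."""
    m = FILTER_RE.match(expr)
    if not m:
        raise ValueError(f"cannot parse filter: {expr}")
    name, op, raw = m.groups()
    value = VERDICT_ALIASES[raw.lower()] if name == "verdict" else int(raw)
    return name, OPS[op], value


def matches(row, expr):
    """Check one result row (a dict of stat name to integer) against a filter."""
    name, op, value = parse_filter(expr)
    return op(row[name], value)


row = {"insns": 246, "verdict": 1}  # hypothetical successful program
```

      With this model, `matches(row, "insns>100")` holds for the row above, while `matches(row, "verdict=FAILURE")` does not.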
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221103055304.2904589-7-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: allow to define asc/desc ordering for sort specs in veristat · d68c07e2
      Andrii Nakryiko authored
      Allow specifying '^' at the end of a stat name to designate that it should
      be sorted in ascending order. Similarly, allow any of the 'v', 'V', '.',
      '!', or '_' suffix "symbols" to designate descending order. It's such
      a zoo for descending order because there is no single intuitive symbol
      that could be used (using 'v' looks pretty weird in practice), so a few
      symbols that are "downwards leaning or pointing" were chosen. Either
      way, it shouldn't cause any trouble in practice.
      
      This new feature allows customizing the sort order to match the user's
      needs.
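      
      The suffix convention can be sketched in a few lines of Python. `parse_sort_spec` is a hypothetical name, and returning `None` when no suffix is present (meaning "use whatever the default order is") is an assumption, since the default is not described here.

```python
ASC_SUFFIX = "^"
DESC_SUFFIXES = ("v", "V", ".", "!", "_")


def parse_sort_spec(spec):
    """Split a sort spec into (stat name, 'asc' | 'desc' | None)."""
    if spec.endswith(ASC_SUFFIX):
        return spec[:-1], "asc"
    if spec.endswith(DESC_SUFFIXES):  # str.endswith accepts a tuple of suffixes
        return spec[:-1], "desc"
    return spec, None
```

      Note that this naive version would misread any stat whose own name ends in one of the descending symbols; the stat names that appear in this log (file, prog, verdict, insns) do not.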
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221103055304.2904589-6-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: ensure we always have non-ambiguous sorting in veristat · b9670b90
      Andrii Nakryiko authored
      Always fall back to unique file/prog comparison if user's custom order
      specs are ambiguous. This ensures stable output no matter what.
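      
      One way to picture the fallback is a two-pass stable sort, sketched here in Python. `sort_rows` is a hypothetical helper and the rows reuse a few values from the sample output in this series; veristat's C comparator is not reproduced.

```python
def sort_rows(rows, stat, descending=False):
    """Sort by the requested stat, breaking ties by (file, prog)."""
    # First establish the unique file/prog order, then rely on sort stability
    # so that rows with equal stat values keep that order.
    ordered = sorted(rows, key=lambda r: (r["file"], r["prog"]))
    return sorted(ordered, key=lambda r: r[stat], reverse=descending)


rows = [
    {"file": "bpf_host.o", "prog": "__send_drop_notify", "insns": 53},
    {"file": "bpf_xdp.o", "prog": "cil_xdp_entry", "insns": 423},
    {"file": "bpf_alignchecker.o", "prog": "__send_drop_notify", "insns": 53},
    {"file": "bpf_xdp.o", "prog": "__send_drop_notify", "insns": 151},
]
ascending = [(r["file"], r["insns"]) for r in sort_rows(rows, "insns")]
```

      The two rows with 53 instructions tie on the requested stat, so their relative order is decided by file name regardless of the order in which they were supplied.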
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221103055304.2904589-5-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: consolidate and improve file/prog filtering in veristat · 10b1b3f3
      Andrii Nakryiko authored
      Slightly change rules of specifying file/prog glob filters. In practice
      it's quite often inconvenient to do `*/<prog-glob>` if that program glob
      is unique enough and won't accidentally match any file names.
      
      This patch changes the rules so that `-f <glob>` will apply the specified
      glob to both file and program names. The user still has full control by
      doing '*/<prog-only-glob>' or '<file-only-glob>/*'. We also now allow
      '/<prog-glob>' and '<file-glob>/' (an all-matching wildcard is assumed if
      missing).
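      
      These rules can be approximated in Python as follows. This is a sketch, not veristat's code: it swaps in the standard `fnmatch` module for veristat's own glob matcher, the helper names are hypothetical, and treating a slash-less glob as matching when either the file or the program name matches is an assumption about what "both" means here.

```python
from fnmatch import fnmatchcase


def parse_glob_filter(spec):
    """Split '<file-glob>/<prog-glob>'; a missing side defaults to '*'."""
    if "/" not in spec:
        return {"any": spec}
    file_glob, prog_glob = spec.split("/", 1)
    return {"file": file_glob or "*", "prog": prog_glob or "*"}


def should_process(filt, file_name, prog_name):
    """Decide whether a (file, prog) pair passes a single allow filter."""
    if "any" in filt:
        # Assumption: a bare glob is tried against both names, either may match.
        return fnmatchcase(file_name, filt["any"]) or fnmatchcase(prog_name, filt["any"])
    return fnmatchcase(file_name, filt["file"]) and fnmatchcase(prog_name, filt["prog"])
```

      Under this model `-f on_event` selects every `on_event` program without needing the `*/on_event` form, while `-f 'kprobe_multi*/'` still restricts by file only.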
      
      Also, internally unify file-only and file+prog checks
      (should_process_file and should_process_prog are now
      should_process_file_prog that can handle prog name as optional). This
      makes maintaining and extending this code easier.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221103055304.2904589-4-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: shorten "Total insns/states" column names in veristat · 62d2c08b
      Andrii Nakryiko authored
      In comparison mode the "Total " part is pretty useless, but takes
      a considerable amount of horizontal space. Drop the "Total " parts.
      
      Also make sure that table headers for numerical columns are aligned in
      the same fashion as integer values in those columns. This looks better
      and is now more obvious with shorter "Insns" and "States" column
      headers.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221103055304.2904589-3-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: add veristat replay mode · 9b5e3536
      Andrii Nakryiko authored
      Replay mode allows parsing a previously stored CSV file with verification
      results and presenting it in the desired output format (presumably a human-readable
      table, but CSV to CSV conversion is supported as well). While doing
      that, it's possible to use veristat's sorting rules, specify a subset of
      columns, and filter by file and program name.
      
      In subsequent patches veristat's filtering capabilities will just grow
      making replay mode even more useful in practice for post-processing
      results.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20221103055304.2904589-2-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Refactor kptr_off_tab into btf_record · aa3496ac
      Kumar Kartikeya Dwivedi authored
      To prepare the BPF verifier to handle special fields in both map values
      and program allocated types coming from program BTF, we need to refactor
      the kptr_off_tab handling code into something more generic and reusable
      across both cases to avoid code duplication.
      
      Later patches also require passing this data to helpers at runtime, so
      that they can work on user defined types, initialize them, destruct
      them, etc.
      
      The main observation is that both map values and such allocated types
      point to a type in program BTF, hence they can be handled similarly. We
      can prepare a field metadata table for both cases and store them in
      struct bpf_map or struct btf depending on the use case.
      
      Hence, refactor the code into generic btf_record and btf_field member
      structs. The btf_record represents the fields of a specific btf_type in
      user BTF. The cnt indicates the number of special fields we successfully
      recognized, and field_mask is a bitmask of fields that were found, to
      enable quick determination of availability of a certain field.
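      
      The relationship between the record, its fields, the count and the bitmask can be modelled as below. This is a conceptual Python sketch only: the class names, the `kind` attribute and the flag values are hypothetical stand-ins, not the kernel's C definitions.

```python
from dataclasses import dataclass, field
from enum import IntFlag


class FieldType(IntFlag):
    # Illustrative flag values; one bit per kind of special field.
    SPIN_LOCK = 1 << 0
    TIMER = 1 << 1
    KPTR_UNREF = 1 << 2
    KPTR_REF = 1 << 3


@dataclass
class Field:
    offset: int
    kind: FieldType


@dataclass
class Record:
    fields: list = field(default_factory=list)
    field_mask: FieldType = FieldType(0)

    def add(self, f):
        """Register a recognized special field and update the bitmask."""
        self.fields.append(f)
        self.field_mask |= f.kind

    @property
    def cnt(self):
        return len(self.fields)

    def has_field(self, kind):
        """Constant-time presence check, without walking the field list."""
        return bool(self.field_mask & kind)


rec = Record()
rec.add(Field(offset=0, kind=FieldType.KPTR_UNREF))
rec.add(Field(offset=8, kind=FieldType.KPTR_REF))
```

      The point of keeping the mask next to the list is that a question like "does this type contain a referenced kptr?" is a single bit test.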
      
      Subsequently, refactor the rest of the code to work with these generic
      types, remove assumptions about kptr and kptr_off_tab, rename variables
      to more meaningful names, etc.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221103191013.1236066-7-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Drop reg_type_may_be_refcounted_or_null · a28ace78
      Kumar Kartikeya Dwivedi authored
      It is not scalable to maintain a list of types that can have non-zero
      ref_obj_id. It is never set for scalars anyway, so just remove the
      conditional on register types and print it whenever it is non-zero.
      Acked-by: Dave Marchevsky <davemarchevsky@fb.com>
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Acked-by: David Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20221103191013.1236066-6-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Fix slot type check in check_stack_write_var_off · f5e477a8
      Kumar Kartikeya Dwivedi authored
      For the case where allow_ptr_leaks is false, code is checking whether
      slot type is STACK_INVALID and STACK_SPILL and rejecting other cases.
      This is a consequence of incorrectly checking for register type instead
      of the slot type (NOT_INIT and SCALAR_VALUE respectively). Fix the
      check.
      
      Fixes: 01f810ac ("bpf: Allow variable-offset stack access")
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221103191013.1236066-5-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Clobber stack slot when writing over spilled PTR_TO_BTF_ID · 261f4664
      Kumar Kartikeya Dwivedi authored
      When support was added for spilled PTR_TO_BTF_ID to be accessed by
      helper memory access, the stack slot was not overwritten to STACK_MISC
      (and that too is only safe when env->allow_ptr_leaks is true).
      
      This means that helpers that take ARG_PTR_TO_MEM and write to it may
      essentially overwrite the value while the verifier continues to track
      the slot for the spilled register.
      
      This can cause issues when PTR_TO_BTF_ID is spilled to stack, and then
      overwritten by helper write access, which can then be passed to BPF
      helpers or kfuncs.
      
      Handle this by falling back to the case introduced in a later commit,
      which will also handle PTR_TO_BTF_ID along with other pointer types,
      i.e. cd17d38f ("bpf: Permits pointers on stack for helper calls").
      
      Finally, include a comment on why REG_LIVE_WRITTEN is not being set when
      clobber is set to true. In short, when clobber is unset we know that we
      won't be writing, but when it is true we *may* write to any of the stack
      slots in that range. It may be a partial or complete write, to just one
      or many stack slots.
      
      We cannot be sure, hence to be conservative, we leave things as is and
      never set REG_LIVE_WRITTEN for any stack slot. Clobber still
      needs to reset them to STACK_MISC, assuming writes happened. Read
      marks, however, still need to be propagated upwards from a liveness point of view,
      as the parent stack slot's contents may still continue to matter to child
      states.
      
      Cc: Yonghong Song <yhs@meta.com>
      Fixes: 1d68f22b ("bpf: Handle spilled PTR_TO_BTF_ID properly when checking stack_boundary")
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20221103191013.1236066-4-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Allow specifying volatile type modifier for kptrs · 23da464d
      Kumar Kartikeya Dwivedi authored
      This is useful in particular to mark the pointer as volatile, so that
      the compiler treats each load and store to the field as a volatile access.
      The alternative is having to define and use READ_ONCE and WRITE_ONCE in
      the BPF program.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Acked-by: David Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20221103191013.1236066-3-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Document UAPI details for special BPF types · 9805af8d
      Kumar Kartikeya Dwivedi authored
      The kernel recognizes some special BPF types in map values or local
      kptrs. Document that only bpf_spin_lock and bpf_timer will preserve
      backwards compatibility, and kptr will preserve backwards compatibility
      for the operations on the pointer, not the types supported for such
      kptrs.
      
      For local kptrs, document that there are no stability guarantees at all.
      
      Finally, document that 'bpf_' namespace is reserved for adding future
      special fields, hence BPF programs must not declare types with such
      names in their programs and still expect backwards compatibility.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Acked-by: David Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20221103191013.1236066-2-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 03 Nov, 2022 21 commits
    • bpf: make sure skb->len != 0 when redirecting to a tunneling device · 07ec7b50
      Stanislav Fomichev authored
      syzkaller managed to trigger another case where skb->len == 0
      when we enter __dev_queue_xmit:
      
      WARNING: CPU: 0 PID: 2470 at include/linux/skbuff.h:2576 skb_assert_len include/linux/skbuff.h:2576 [inline]
      WARNING: CPU: 0 PID: 2470 at include/linux/skbuff.h:2576 __dev_queue_xmit+0x2069/0x35e0 net/core/dev.c:4295
      
      Call Trace:
       dev_queue_xmit+0x17/0x20 net/core/dev.c:4406
       __bpf_tx_skb net/core/filter.c:2115 [inline]
       __bpf_redirect_no_mac net/core/filter.c:2140 [inline]
       __bpf_redirect+0x5fb/0xda0 net/core/filter.c:2163
       ____bpf_clone_redirect net/core/filter.c:2447 [inline]
       bpf_clone_redirect+0x247/0x390 net/core/filter.c:2419
       bpf_prog_48159a89cb4a9a16+0x59/0x5e
       bpf_dispatcher_nop_func include/linux/bpf.h:897 [inline]
       __bpf_prog_run include/linux/filter.h:596 [inline]
       bpf_prog_run include/linux/filter.h:603 [inline]
       bpf_test_run+0x46c/0x890 net/bpf/test_run.c:402
       bpf_prog_test_run_skb+0xbdc/0x14c0 net/bpf/test_run.c:1170
       bpf_prog_test_run+0x345/0x3c0 kernel/bpf/syscall.c:3648
       __sys_bpf+0x43a/0x6c0 kernel/bpf/syscall.c:5005
       __do_sys_bpf kernel/bpf/syscall.c:5091 [inline]
       __se_sys_bpf kernel/bpf/syscall.c:5089 [inline]
       __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5089
       do_syscall_64+0x54/0x70 arch/x86/entry/common.c:48
       entry_SYSCALL_64_after_hwframe+0x61/0xc6
      
      The reproducer doesn't really reproduce outside of syzkaller
      environment, so I'm taking a guess here. It looks like we
      do generate correct ETH_HLEN-sized packet, but we redirect
      the packet to the tunneling device. Before we do so, we
      __skb_pull l2 header and arrive again at skb->len == 0.
      Doesn't seem like we can do anything better than having
      an explicit check after __skb_pull?
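      The explicit check proposed above can be sketched in plain C. This is a
      user-space toy model under stated assumptions, not the kernel patch:
      struct toy_skb, toy_pull() and toy_redirect_no_mac() are hypothetical
      names, and the error codes are chosen for illustration only.

```c
#include <errno.h>
#include <stddef.h>

#define TOY_ETH_HLEN 14 /* length of an Ethernet header in bytes */

/* Hypothetical stand-in for struct sk_buff: only the length is modeled. */
struct toy_skb {
    size_t len;
};

/* Drop hdr_len bytes from the front; the caller guarantees hdr_len <= len. */
static void toy_pull(struct toy_skb *skb, size_t hdr_len)
{
    skb->len -= hdr_len;
}

/*
 * Redirect to a device that carries no L2 header (for example a tunnel).
 * Returns 0 on success or a negative errno value.
 */
static int toy_redirect_no_mac(struct toy_skb *skb, size_t mac_len)
{
    if (mac_len > skb->len)
        return -EINVAL;

    toy_pull(skb, mac_len);

    /* The explicit check: a header-only packet is empty after the pull. */
    if (skb->len == 0)
        return -ERANGE;

    return 0;
}
```

      In this model a packet of exactly TOY_ETH_HLEN bytes is rejected after
      the pull instead of being handed on with a length of zero.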
      
      Cc: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot+f635e86ec3fa0a37e019@syzkaller.appspotmail.com
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/r/20221027225537.353077-1-sdf@google.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      07ec7b50
    • Jakub Kicinski's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · fbeb229a
      Jakub Kicinski authored
      No conflicts.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      fbeb229a
    • Linus Torvalds's avatar
      Merge tag 'net-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 9521c9d6
      Linus Torvalds authored
      Pull networking fixes from Paolo Abeni:
       "Including fixes from bluetooth and netfilter.
      
        Current release - regressions:
      
         - net: several zerocopy flags fixes
      
         - netfilter: fix possible memory leak in nf_nat_init()
      
         - openvswitch: add missing .resv_start_op
      
        Previous releases - regressions:
      
         - neigh: fix null-ptr-deref in neigh_table_clear()
      
         - sched: fix use after free in red_enqueue()
      
         - dsa: fall back to default tagger if we can't load the one from DT
      
         - bluetooth: fix use-after-free in l2cap_conn_del()
      
        Previous releases - always broken:
      
         - netfilter: netlink notifier might race to release objects
      
         - nfc: fix potential memory leak of skb
      
         - bluetooth: fix use-after-free caused by l2cap_reassemble_sdu
      
         - bluetooth: use skb_put to set length
      
         - eth: tun: fix bugs for oversize packet when napi frags enabled
      
         - eth: lan966x: fixes for when MTU is changed
      
         - eth: dwmac-loongson: fix invalid mdio_node"
      
      * tag 'net-6.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (53 commits)
        vsock: fix possible infinite sleep in vsock_connectible_wait_data()
        vsock: remove the unused 'wait' in vsock_connectible_recvmsg()
        ipv6: fix WARNING in ip6_route_net_exit_late()
        bridge: Fix flushing of dynamic FDB entries
        net, neigh: Fix null-ptr-deref in neigh_table_clear()
        net/smc: Fix possible leaked pernet namespace in smc_init()
        stmmac: dwmac-loongson: fix invalid mdio_node
        ibmvnic: Free rwi on reset success
        net: mdio: fix undefined behavior in bit shift for __mdiobus_register
        Bluetooth: L2CAP: Fix attempting to access uninitialized memory
        Bluetooth: L2CAP: Fix l2cap_global_chan_by_psm
        Bluetooth: L2CAP: Fix accepting connection request for invalid SPSM
        Bluetooth: hci_conn: Fix not restoring ISO buffer count on disconnect
        Bluetooth: L2CAP: Fix memory leak in vhci_write
        Bluetooth: L2CAP: fix use-after-free in l2cap_conn_del()
        Bluetooth: virtio_bt: Use skb_put to set length
        Bluetooth: hci_conn: Fix CIS connection dst_type handling
        Bluetooth: L2CAP: Fix use-after-free caused by l2cap_reassemble_sdu
        netfilter: ipset: enforce documented limit to prevent allocating huge memory
        isdn: mISDN: netjet: fix wrong check of device registration
        ...
      9521c9d6
    • Linus Torvalds's avatar
      Merge tag 'powerpc-6.1-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 4d740391
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
      
       - Fix an endian thinko in the asm-generic compat_arg_u64() which led to
         syscall arguments being swapped for some compat syscalls.
      
       - Fix syscall wrapper handling of syscalls with 64-bit arguments on
         32-bit kernels, which led to syscall arguments being misplaced.
      
       - A build fix for amdgpu on Book3E with AltiVec disabled.
      
      Thanks to Andreas Schwab, Christian Zigotzky, and Arnd Bergmann.
      
      * tag 'powerpc-6.1-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/32: Select ARCH_SPLIT_ARG64
        powerpc/32: fix syscall wrappers with 64-bit arguments
        asm-generic: compat: fix compat_arg_u64() and compat_arg_u64_dual()
        powerpc/64e: Fix amdgpu build on Book3E w/o AltiVec
      4d740391
    • Paolo Abeni's avatar
      Merge branch 'add-new-pcp-and-apptrust-attributes-to-dcbnl' · d9095f92
      Paolo Abeni authored
      Daniel Machon says:
      
      ====================
      Add new PCP and APPTRUST attributes to dcbnl
      
      This patch series adds new extension attributes to dcbnl, to support PCP
      prioritization (and thereby hw offloadable pcp-based queue
      classification) and per-selector trust and trust order. Additionally,
      the microchip sparx5 driver has been dcb-enabled to make use of the new
      attributes to offload PCP, DSCP and Default prio to the switch, and
      implement trust order of selectors.
      
      For pre-RFC discussion see:
      https://lore.kernel.org/netdev/Yv9VO1DYAxNduw6A@DEN-LT-70577/
      
      For RFC series see:
      https://lore.kernel.org/netdev/20220915095757.2861822-1-daniel.machon@microchip.com/
      
      In summary: there currently exists no convenient way to offload per-port
      PCP-based queue classification to hardware. The DCB subsystem offers
      different ways to prioritize through its APP table, but lacks an option
      for PCP. Similarly, there is no way to indicate the notion of trust for
      APP table selectors. This patch series addresses both topics.
      
      PCP based queue classification:
        - 8021Q standardizes the Priority Code Point table (see 6.9.3 of IEEE
          Std 802.1Q-2018).  This patch series makes it possible to offload
          the PCP classification to said table.  The new PCP selector is not a
          standard part of the APP managed object, therefore it is
          encapsulated in a new non-std extension attribute.
      
      Selector trust:
        - ASICs often have the notion of trust DSCP and trust PCP. The new
          attribute makes it possible to specify a trust order of app
          selectors, which drivers can then react on.
      
      DCB-enable sparx5 driver:
       - Now supports offloading of DSCP, PCP and default priority. Only one
         mapping of protocol:priority is allowed. Consecutive mappings of the
         same protocol to some new priority will overwrite the previous. This
         is to keep a consistent view of the app table and the hardware.
       - Now supports dscp and pcp trust, by use of the introduced
         dcbnl_set/getapptrust ops. Sparx5 supports trust orders: [], [dscp],
         [pcp] and [dscp, pcp]. For now, only DSCP and PCP selectors are
         supported by the driver, everything else is bounced.
      
      Patch #1 introduces a new PCP selector to the APP object, which makes it
      possible to encode PCP and DEI in the app triplet and offload it to the
      PCP table of the ASIC.
      
      Patch #2 Introduces the new extension attributes
      DCB_ATTR_DCB_APP_TRUST_TABLE and DCB_ATTR_DCB_APP_TRUST. Trusted
      selectors are passed in the nested DCB_ATTR_DCB_APP_TRUST_TABLE
      attribute, and assembled into an array of selectors:
      
        u8 selectors[256];
      
      where lower indexes have higher precedence.  In the array, selectors are
      stored consecutively, starting from index zero. With a maximum number of
      256 unique selectors, the list has the same maximum size.
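      The selector array described above can be sketched as follows. This is
      a minimal user-space model, assuming a hypothetical struct toy_apptrust
      and hypothetical helpers toy_trust_rank() and toy_trust_add(); it is not
      the dcbnl implementation.

```c
#include <stdint.h>

#define TOY_MAX_SELECTORS 256

/* Hypothetical container for a per-port trust order. */
struct toy_apptrust {
    uint8_t selectors[TOY_MAX_SELECTORS]; /* index 0 = highest precedence */
    int nselectors;                       /* number of slots in use */
};

/* Return the precedence rank of sel (0 is highest), or -1 if not trusted. */
static int toy_trust_rank(const struct toy_apptrust *t, uint8_t sel)
{
    for (int i = 0; i < t->nselectors; i++)
        if (t->selectors[i] == sel)
            return i;
    return -1;
}

/* Append sel at the lowest precedence; reject duplicates with -1. */
static int toy_trust_add(struct toy_apptrust *t, uint8_t sel)
{
    if (toy_trust_rank(t, sel) >= 0)
        return -1;
    /*
     * A uint8_t takes 256 distinct values and duplicates are rejected,
     * so nselectors is always below TOY_MAX_SELECTORS at this point.
     */
    t->selectors[t->nselectors++] = sel;
    return 0;
}
```

      Because selectors are stored consecutively from index zero and each
      may appear only once, the array index doubles as the precedence rank.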
      
      Patch #3 Sets up the dcbnl ops hook, and adds support for offloading pcp
      app entries, to the PCP table of the switch.
      
      Patch #4 Makes use of the dcbnl_set/getapptrust ops, to set a per-port
      trust order.
      
      Patch #5 Adds support for offloading dscp app entries to the DSCP table
      of the switch.
      
      Patch #6 Adds support for offloading default prio app entries to the
      switch.
      
      ====================
      
      Link: https://lore.kernel.org/r/20221101094834.2726202-1-daniel.machon@microchip.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      d9095f92
    • Daniel Machon's avatar
      net: microchip: sparx5: add support for offloading default prio · c58ff3ed
      Daniel Machon authored
      Add support for offloading default prio {ETHERTYPE, 0, prio}.
      Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      c58ff3ed
    • Daniel Machon's avatar
      net: microchip: sparx5: add support for offloading dscp table · 8dcf69a6
      Daniel Machon authored
      Add support for offloading dscp app entries. Dscp values are global for
      all ports on the sparx5 switch. Therefore, we replicate each dscp app
      entry per-port.
      Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      8dcf69a6
    • Daniel Machon's avatar
      net: microchip: sparx5: add support for apptrust · 23f8382c
      Daniel Machon authored
      Make use of set/getapptrust() to implement per-selector trust and trust
      order.
      Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      23f8382c
    • Daniel Machon's avatar
      net: microchip: sparx5: add support for offloading pcp table · 92ef3d01
      Daniel Machon authored
      Add new registers and functions to support offload of pcp app entries.
      Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      92ef3d01
    • Daniel Machon's avatar
      net: dcb: add new apptrust attribute · 6182d587
      Daniel Machon authored
      Add new apptrust extension attributes to the 8021Qaz APP managed object.
      
      Two new attributes, DCB_ATTR_DCB_APP_TRUST_TABLE and
      DCB_ATTR_DCB_APP_TRUST, have been added. Trusted selectors are passed in
      the nested attribute DCB_ATTR_DCB_APP_TRUST, in order of precedence.
      
      The new attributes are meant to allow drivers, whose hw supports the
      notion of trust, to be able to set whether a particular app selector is
      trusted - and in which order.
      Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      6182d587
    • Daniel Machon's avatar
      net: dcb: add new pcp selector to app object · ec32c0c4
      Daniel Machon authored
      Add new PCP selector for the 8021Qaz APP managed object.
      
      As the PCP selector is not part of the 8021Qaz standard, a new non-std
      extension attribute DCB_ATTR_DCB_APP has been introduced. Two helper
      functions to translate between selector and app attribute type have
      also been added. The new selector has been given a value of 255, to
      minimize the risk of future overlap of std- and non-std attributes.
      
      The new DCB_ATTR_DCB_APP is sent alongside the ieee std attribute in the
      app table. This means that the dcb_app struct can now contain both std-
      and non-std app attributes. Currently there is no overlap between the
      selector values of the two attributes.
      
      The purpose of adding the PCP selector is to be able to offload
      PCP-based queue classification to the 8021Q Priority Code Point table,
      see 6.9.3 of IEEE Std 802.1Q-2018.
      
      PCP and DEI are encoded in the protocol field as 8*dei+pcp, so that a
      mapping of PCP 2 and DEI 1 to priority 3 is encoded as {255, 10, 3}.
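      The encoding above can be checked with a short sketch. The struct
      toy_app layout and the toy_pcp_*() helpers are hypothetical names used
      only for illustration; the selector value 255 and the 8*dei+pcp formula
      are taken from the commit message.

```c
#include <stdint.h>

#define TOY_SEL_PCP 255 /* selector value given in the commit message */

/* Hypothetical app triplet {selector, protocol, priority}. */
struct toy_app {
    uint8_t selector;
    uint16_t protocol;
    uint8_t priority;
};

/* Pack DEI (0..1) and PCP (0..7) into the protocol field as 8*dei+pcp. */
static uint16_t toy_pcp_encode(unsigned int dei, unsigned int pcp)
{
    return (uint16_t)(8 * dei + pcp);
}

/* Recover DEI and PCP from an encoded protocol value. */
static unsigned int toy_pcp_dei(uint16_t protocol) { return protocol / 8; }
static unsigned int toy_pcp_pcp(uint16_t protocol) { return protocol % 8; }

/* Build the triplet that maps (dei, pcp) to priority prio. */
static struct toy_app toy_pcp_app(unsigned int dei, unsigned int pcp,
                                  uint8_t prio)
{
    struct toy_app app = {
        .selector = TOY_SEL_PCP,
        .protocol = toy_pcp_encode(dei, pcp),
        .priority = prio,
    };
    return app;
}
```

      Calling toy_pcp_app(1, 2, 3) yields the triplet {255, 10, 3}, since
      8*1 + 2 = 10.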
      Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      ec32c0c4
    • Saurabh Sengar's avatar
      net: mana: Assign interrupts to CPUs based on NUMA nodes · 71fa6887
      Saurabh Sengar authored
      In large VMs with multiple NUMA nodes, network performance is usually
      best if network interrupts are all assigned to the same virtual NUMA
      node. This patch assigns online CPUs according to a NUMA-aware policy:
      local CPUs are returned first, followed by non-local ones, and then the
      assignment wraps around.
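      The selection order described above can be sketched in user space as
      follows. This is an illustration under stated assumptions, not the
      driver code: toy_numa_pick() is a hypothetical helper, and the CPU to
      node mapping is passed in as a plain array.

```c
/*
 * Return the i-th CPU for the given NUMA node: CPUs on that node come
 * first, then CPUs on other nodes, and the sequence wraps around once
 * all ncpus CPUs have been handed out. cpu_node[cpu] holds the node of
 * each CPU. Returns -1 on invalid input.
 */
static int toy_numa_pick(const int *cpu_node, int ncpus, int node, int i)
{
    if (ncpus <= 0 || i < 0)
        return -1;

    int k = i % ncpus; /* wrap around after every CPU was used once */

    /* First pass: CPUs local to the requested node. */
    for (int cpu = 0; cpu < ncpus; cpu++) {
        if (cpu_node[cpu] != node)
            continue;
        if (k == 0)
            return cpu;
        k--;
    }

    /* Second pass: the remaining, non-local CPUs. */
    for (int cpu = 0; cpu < ncpus; cpu++) {
        if (cpu_node[cpu] == node)
            continue;
        if (k == 0)
            return cpu;
        k--;
    }

    return -1; /* not reached: k < ncpus and both passes cover all CPUs */
}
```

      With four CPUs on nodes {0, 1, 0, 1} and node 1 as the local node,
      successive indexes yield CPUs 1, 3, 0, 2 and then 1 again.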
      Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
      Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
      Link: https://lore.kernel.org/r/1667282761-11547-1-git-send-email-ssengar@linux.microsoft.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      71fa6887
    • Paolo Abeni's avatar
      Merge branch 'vsock-remove-an-unused-variable-and-fix-infinite-sleep' · 715aee0f
      Paolo Abeni authored
      Dexuan Cui says:
      
      ====================
      vsock: remove an unused variable and fix infinite sleep
      
      Patch 1 removes the unused 'wait' variable.
      Patch 2 fixes an infinite sleep issue reported by a hv_sock user.
      ====================
      
      Link: https://lore.kernel.org/r/20221101021706.26152-1-decui@microsoft.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      715aee0f
    • Dexuan Cui's avatar
      vsock: fix possible infinite sleep in vsock_connectible_wait_data() · 466a8533
      Dexuan Cui authored
      Currently vsock_connectible_has_data() may miss a wakeup operation
      between vsock_connectible_has_data() == 0 and the prepare_to_wait().
      
      Fix the race by adding the process to the wait queue before checking
      vsock_connectible_has_data().
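      The ordering problem can be illustrated with a deterministic,
      single-threaded model. It is only a sketch: struct toy_sock and the
      toy_*() functions are hypothetical, and the producer is called inline
      at the point where the race window would be.

```c
#include <stdbool.h>

/* Hypothetical socket state: data flag plus a one-entry wait queue. */
struct toy_sock {
    bool has_data;
    bool waiter_queued; /* set by the "prepare to wait" step */
    bool waiter_woken;  /* set when a queued waiter receives a wakeup */
};

/* Producer: publish data and wake the waiter only if one is queued. */
static void toy_producer(struct toy_sock *s)
{
    s->has_data = true;
    if (s->waiter_queued)
        s->waiter_woken = true;
}

/*
 * Racy order: check for data, then join the wait queue. Data arriving
 * between the two steps wakes nobody. Returns true if the consumer
 * would go to sleep without a pending wakeup.
 */
static bool toy_check_then_queue(struct toy_sock *s)
{
    bool empty = !s->has_data; /* check: no data yet */
    toy_producer(s);           /* data arrives inside the window */
    s->waiter_queued = true;   /* too late, the wakeup was missed */
    return empty && !s->waiter_woken;
}

/* Fixed order: join the wait queue first, then check for data. */
static bool toy_queue_then_check(struct toy_sock *s)
{
    s->waiter_queued = true;
    toy_producer(s);           /* same arrival point as above */
    bool empty = !s->has_data; /* the check now sees the data */
    return empty && !s->waiter_woken;
}
```

      In this model only the first ordering reports a consumer that sleeps
      with no wakeup pending.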
      
      Fixes: b3f7fd54 ("af_vsock: separate wait data loop")
      Signed-off-by: Dexuan Cui <decui@microsoft.com>
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Reported-by: Frédéric Dalleau <frederic.dalleau@docker.com>
      Tested-by: Frédéric Dalleau <frederic.dalleau@docker.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      466a8533
    • Dexuan Cui's avatar
      vsock: remove the unused 'wait' in vsock_connectible_recvmsg() · cf6ff0df
      Dexuan Cui authored
      Remove the unused variable introduced by 19c1b90e.
      
      Fixes: 19c1b90e ("af_vsock: separate receive data loop")
      Signed-off-by: Dexuan Cui <decui@microsoft.com>
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      cf6ff0df
    • Shenwei Wang's avatar
      net: fec: add initial XDP support · 6d6b39f1
      Shenwei Wang authored
      This patch adds the initial XDP support to Freescale driver. It supports
      XDP_PASS, XDP_DROP and XDP_REDIRECT actions. Upcoming patches will add
      support for XDP_TX and Zero Copy features.
      
      As the patch is rather large, the code that collects the statistics
      has been split out and will be submitted as a dedicated patch.
      
      I tested with the xdpsock application.
        -- Native here means running the command "xdpsock -i eth0"
        -- SKB-Mode means running the command "xdpsock -S -i eth0"
      
      The following are the test results for each XDP mode:
      
      root@imx8qxpc0mek:~/bpf# ./xdpsock -i eth0
       sock0@eth0:0 rxdrop xdp-drv
                         pps            pkts           1.00
      rx                 371347         2717794
      tx                 0              0
      
      root@imx8qxpc0mek:~/bpf# ./xdpsock -S -i eth0
       sock0@eth0:0 rxdrop xdp-skb
                         pps            pkts           1.00
      rx                 202229         404528
      tx                 0              0
      
      root@imx8qxpc0mek:~/bpf# ./xdp2 eth0
      proto 0:     496708 pkt/s
      proto 0:     505469 pkt/s
      proto 0:     505283 pkt/s
      proto 0:     505443 pkt/s
      proto 0:     505465 pkt/s
      
      root@imx8qxpc0mek:~/bpf# ./xdp2 -S eth0
      proto 0:          0 pkt/s
      proto 17:     118778 pkt/s
      proto 17:     118989 pkt/s
      proto 0:          1 pkt/s
      proto 17:     118987 pkt/s
      proto 0:          0 pkt/s
      proto 17:     118943 pkt/s
      proto 17:     118976 pkt/s
      proto 0:          1 pkt/s
      proto 17:     119006 pkt/s
      proto 0:          0 pkt/s
      proto 17:     119071 pkt/s
      proto 17:     119092 pkt/s
      Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Link: https://lore.kernel.org/r/20221031185350.2045675-1-shenwei.wang@nxp.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      6d6b39f1
    • Ilya Maximets's avatar
      net: tun: bump the link speed from 10Mbps to 10Gbps · 598d2982
      Ilya Maximets authored
      The 10Mbps link speed was set in 2004 when the ethtool interface was
      initially added to the tun driver.  It might have been a good
      assumption 18 years ago, but CPUs and the network stack have come a long
      way since then.
      
      Other virtual ports typically report much higher speeds.  For example,
      veth reports 10Gbps since its introduction in 2007.
      
      Some userspace applications rely on the current link speed in
      certain situations.  For example, Open vSwitch is using link speed
      as an upper bound for QoS configuration if the user didn't specify the
      maximum rate.  Advertised 10Mbps doesn't match reality in a modern
      world, so users have to always manually override the value with
      something more sensible to avoid configuration issues, e.g. limiting
      the traffic too much.  This also creates additional confusion among
      users.
      
      Bump the advertised speed to at least match the veth.
      
      An alternative might be to explicitly report UNKNOWN and let the user
      decide on the right value for them.  And it is indeed "the right way"
      of fixing the problem.  However, that may cause issues with bonding
      or with some userspace applications that may rely on speed value to
      be reported (even though they should not).  Just changing the speed
      value should be a safer option.
      
      Users can still override the speed with ethtool, if necessary.
      
      RFC discussion is linked below.
      
      Link: https://lore.kernel.org/lkml/20221021114921.3705550-1-i.maximets@ovn.org/
      Link: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-July/051958.html
      Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
      Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Link: https://lore.kernel.org/r/20221031173953.614577-1-i.maximets@ovn.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      598d2982
    • Zhengchao Shao's avatar
      ipv6: fix WARNING in ip6_route_net_exit_late() · 768b3c74
      Zhengchao Shao authored
      During the initialization in ip6_route_net_init_late(), if the
      ipv6_route or rt6_stats file fails to be created, the initialization
      still reports success. As a result, the ipv6_route or rt6_stats file
      cannot be found during removal in ip6_route_net_exit_late(), which
      triggers a WARNING.
      
      The following is the stack information:
      name 'rt6_stats'
      WARNING: CPU: 0 PID: 9 at fs/proc/generic.c:712 remove_proc_entry+0x389/0x460
      Modules linked in:
      Workqueue: netns cleanup_net
      RIP: 0010:remove_proc_entry+0x389/0x460
      PKRU: 55555554
      Call Trace:
      <TASK>
      ops_exit_list+0xb0/0x170
      cleanup_net+0x4ea/0xb00
      process_one_work+0x9bf/0x1710
      worker_thread+0x665/0x1080
      kthread+0x2e4/0x3a0
      ret_from_fork+0x1f/0x30
      </TASK>
      
      Fixes: cdb18761 ("[NETNS][IPV6] route6 - create route6 proc files for the namespace")
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20221102020610.351330-1-shaozhengchao@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      768b3c74
    • Ido Schimmel's avatar
      bridge: Fix flushing of dynamic FDB entries · 628ac04a
      Ido Schimmel authored
      The following commands should result in all the dynamic FDB entries
      being flushed, but instead all the non-local (non-permanent) entries are
      flushed:
      
       # bridge fdb add 00:aa:bb:cc:dd:ee dev dummy1 master static
       # bridge fdb add 00:11:22:33:44:55 dev dummy1 master dynamic
       # ip link set dev br0 type bridge fdb_flush
       # bridge fdb show brport dummy1
       00:00:00:00:00:01 master br0 permanent
       33:33:00:00:00:01 self permanent
       01:00:5e:00:00:01 self permanent
      
      This is because br_fdb_flush() works with FDB flags and not the
      corresponding enumerator values. Fix by passing the FDB flag instead.
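      The difference between an enumerator value and the flag derived from
      it can be sketched as follows. All names here are hypothetical, and
      the model assumes that a permanent entry has both the local and the
      static bit set, a static entry only the static bit, and a dynamic
      entry neither.

```c
/* Hypothetical enumerators: these are bit positions, not masks. */
enum toy_fdb_bit {
    TOY_FDB_LOCAL = 0,
    TOY_FDB_STATIC = 1,
};

#define TOY_BIT(n) (1UL << (n))

/* An entry is flushed when its flags, restricted to mask, equal want. */
static int toy_flush_matches(unsigned long entry_flags,
                             unsigned long want, unsigned long mask)
{
    return (entry_flags & mask) == want;
}

/*
 * Buggy: passes the enumerator value where a mask is expected. The
 * value 1 happens to equal TOY_BIT(TOY_FDB_LOCAL), so this matches
 * every non-local entry, static ones included.
 */
static int toy_flush_dynamic_buggy(unsigned long entry_flags)
{
    return toy_flush_matches(entry_flags, 0, TOY_FDB_STATIC);
}

/* Fixed: passes the flag, so only entries without the static bit match. */
static int toy_flush_dynamic_fixed(unsigned long entry_flags)
{
    return toy_flush_matches(entry_flags, 0, TOY_BIT(TOY_FDB_STATIC));
}
```

      Under these assumptions the buggy variant flushes both the static and
      the dynamic entry, while the fixed variant flushes only the dynamic
      one and keeps static and permanent entries.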
      
      After the fix:
      
       # bridge fdb add 00:aa:bb:cc:dd:ee dev dummy1 master static
       # bridge fdb add 00:11:22:33:44:55 dev dummy1 master dynamic
       # ip link set dev br0 type bridge fdb_flush
       # bridge fdb show brport dummy1
       00:aa:bb:cc:dd:ee master br0 static
       00:00:00:00:00:01 master br0 permanent
       33:33:00:00:00:01 self permanent
       01:00:5e:00:00:01 self permanent
      
      Fixes: 1f78ee14 ("net: bridge: fdb: add support for fine-grained flushing")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Link: https://lore.kernel.org/r/20221101185753.2120691-1-idosch@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      628ac04a
    • Jakub Kicinski's avatar
      Merge branch 'rocker-two-small-changes' · d3a47063
      Jakub Kicinski authored
      Ido Schimmel says:
      
      ====================
      rocker: Two small changes
      
      Patch #1 avoids allocating and scheduling a work item when it is not
      doing any work.
      
      Patch #2 aligns rocker with other switchdev drivers to explicitly mark
      FDB entries as offloaded. Needed for upcoming MAB offload [1].
      
      [1] https://lore.kernel.org/netdev/20221025100024.1287157-1-idosch@nvidia.com/
      ====================
      
      Link: https://lore.kernel.org/r/20221101123936.1900453-1-idosch@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      d3a47063
    • Ido Schimmel's avatar
      rocker: Explicitly mark learned FDB entries as offloaded · 386b4174
      Ido Schimmel authored
      Currently, FDB entries that are notified to the bridge driver via
      'SWITCHDEV_FDB_ADD_TO_BRIDGE' are always marked as offloaded by the
      bridge. With MAB enabled, this will no longer be universally true.
      Device drivers will report locked FDB entries to the bridge to let it
      know that the corresponding hosts required authorization, but it does
      not mean that these entries are necessarily programmed in the underlying
      hardware.
      
      We would like to solve it by having the bridge driver determine the
      offload indication based on the 'offloaded' bit in the FDB notification
      [1].
      
      Prepare for that change by having rocker explicitly mark learned FDB
      entries as offloaded. This is consistent with all the other switchdev
      drivers.
      
      [1] https://lore.kernel.org/netdev/20221025100024.1287157-4-idosch@nvidia.com/

      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      386b4174