1. 10 Jun, 2019 7 commits
  2. 09 Jun, 2019 4 commits
  3. 07 Jun, 2019 21 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 38e406f6
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2019-06-07
      
      The following pull-request contains BPF updates for your *net* tree.
      
      The main changes are:
      
      1) Fix several bugs in riscv64 JIT code emission which forgot to clear high
         32-bits for alu32 ops, from Björn and Luke with selftests covering all
         relevant BPF alu ops from Björn and Jiong.
      
      2) Two fixes for UDP BPF reuseport that avoid calling the program in case of
         __udp6_lib_err and UDP GRO which broke reuseport_select_sock() assumption
         that skb->data is pointing to transport header, from Martin.
      
      3) Two fixes for BPF sockmap: a use-after-free from sleep in psock's backlog
         workqueue, and a missing restore of sk_write_space when psock gets dropped,
         from Jakub and John.
      
      4) Fix unconnected UDP sendmsg hook API which is insufficient as-is since it
         breaks standard applications like DNS if reverse NAT is not performed upon
         receive, from Daniel.
      
      5) Fix an out-of-bounds read in __bpf_skc_lookup which in case of AF_INET6
         fails to verify that the length of the tuple is long enough, from Lorenz.
      
      6) Fix libbpf's libbpf__probe_raw_btf to return an fd instead of 0/1 (for
         {un,}successful probe) as that is expected to be propagated as an fd to
         load_sk_storage_btf() and thus closing the wrong descriptor otherwise,
         from Michal.
      
      7) Fix bpftool's JSON output for the case when a lookup fails, from Krzesimir.
      
      8) Minor misc fixes in docs, samples and selftests, from various others.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
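Item 1 of the pull request above concerns BPF alu32 semantics: a 32-bit ALU op produces a 32-bit result that is zero-extended into the 64-bit destination register. The sketch below is a minimal Python model of what goes wrong when the upper 32 bits are not cleared. It treats registers as plain integers, the function names are hypothetical, and it is not the riscv64 JIT code itself.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1


def alu32_add_buggy(dst: int, src: int) -> int:
    """Add in a 64-bit register without zero-extension: stale upper bits survive."""
    return (dst + src) & MASK64


def alu32_add_fixed(dst: int, src: int) -> int:
    """alu32 semantics: keep the low 32 bits of the result, clear bits 63..32."""
    return (dst + src) & MASK32


dst = 0xDEADBEEF_00000001  # upper half holds leftover data from an earlier op
src = 0x00000000_00000002

buggy = alu32_add_buggy(dst, src)
fixed = alu32_add_fixed(dst, src)
print(hex(buggy), hex(fixed))
```

In the buggy variant the leftover upper half leaks into later 64-bit comparisons and pointer arithmetic, which is the class of problem the selftests mentioned above are meant to catch.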
    • net/mlx5e: Support tagged tunnel over bond · 45e7d4c0
      Eli Britstein authored
      Stacked devices such as a bond interface may have a VLAN device on top of
      them. Detect the LAG state correctly under this condition, and return the
      correct routed net device, according to which the encap header is built.
      
      Fixes: e32ee6c7 ("net/mlx5e: Support tunnel encap over tagged Ethernet")
      Signed-off-by: Eli Britstein <elibr@mellanox.com>
      Reviewed-by: Roi Dayan <roid@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: Avoid detaching non-existing netdev under switchdev mode · 47c9d2c9
      Alaa Hleihel authored
      After introducing dedicated uplink representor, the netdev instance
      set over the esw manager vport (PF) became no longer in use, so it was
      removed in the cited commit once we're on switchdev mode.
      However, the mlx5e_detach function was not updated accordingly, and it
      still tries to detach a non-existing netdev, causing a kernel crash.
      
      This patch fixes this issue.
      
      Fixes: aec002f6 ("net/mlx5e: Uninstantiate esw manager vport netdev on switchdev mode")
      Signed-off-by: Alaa Hleihel <alaa@mellanox.com>
      Reviewed-by: Roi Dayan <roid@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: Fix source port matching in fdb peer flow rule · b83c0730
      Raed Salem authored
      The cited commit changed the initialization placement of the eswitch
      attributes so it is done prior to parse tc actions function call,
      including among others the in_rep and in_mdev fields which are mistakenly
      reassigned inside the parse actions function.
      
      This breaks the source port matching criteria of the peer redirect rule.
      
      Fix by removing the now redundant reassignment of the already initialized
      fields.
      
      Fixes: 988ab9c7 ("net/mlx5e: Introduce mlx5e_flow_esw_attr_init() helper")
      Signed-off-by: Raed Salem <raeds@mellanox.com>
      Reviewed-by: Roi Dayan <roid@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5e: Replace reciprocal_scale in TX select queue function · 57c70d87
      Shay Agroskin authored
      The TX queue index returned by the fallback function ranges
      between [0,NUM CHANNELS - 1] if QoS isn't set and
      [0, (NUM CHANNELS)*(NUM TCs) -1] otherwise.
      
      Our HW uses different TC mapping than the fallback function
      (which is denoted as 'up', user priority) so we only need to extract
      a channel number out of the returned value.
      
      Since (NUM CHANNELS)*(NUM TCs) is a relatively small number, using
      reciprocal scale almost always returns zero.
      We instead access the 'txq2sq' table to extract the sq (and with it the
      channel number) associated with the tx queue, thus getting
      a more evenly distributed channel number.
      
      Perf:
      
      Rx/Tx side with Intel(R) Xeon(R) Silver 4108 CPU @ 1.80GHz and ConnectX-5.
      Used 'iperf' UDP traffic, 10 threads, and priority 5.
      
      Before:	0.566Mpps
      After:	 2.37Mpps
      
      As expected, releasing the existing bottleneck of steering all traffic
      to TX queue zero significantly improves transmission rates.
      
      Fixes: 7ccdd084 ("net/mlx5e: Fix select queue callback")
      Signed-off-by: Shay Agroskin <shayag@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
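The reasoning in the commit above can be checked with a few lines of arithmetic. The kernel's reciprocal_scale(val, ep_ro) computes (val * ep_ro) >> 32, which spreads a full 32-bit hash over [0, ep_ro) but collapses to zero when val is itself a small queue index. The Python sketch below models that helper and compares it with a table lookup. The channel and TC counts and the table layout (txq = channel + tc * NUM_CHANNELS) are assumptions made for illustration, not values taken from the driver.

```python
def reciprocal_scale(val: int, ep_ro: int) -> int:
    """Model of the kernel helper: scale a 32-bit value into [0, ep_ro)."""
    return ((val & 0xFFFFFFFF) * ep_ro) >> 32


NUM_CHANNELS = 8  # hypothetical
NUM_TCS = 8       # hypothetical
NUM_TXQS = NUM_CHANNELS * NUM_TCS

# Stand-in for a txq2sq-style table; the layout is an assumption for this sketch.
txq_to_channel = [txq % NUM_CHANNELS for txq in range(NUM_TXQS)]

# Channels chosen when the small txq index is fed to reciprocal_scale().
scaled = {reciprocal_scale(txq, NUM_CHANNELS) for txq in range(NUM_TXQS)}

# Channels chosen when the txq index is resolved through the table instead.
looked_up = {txq_to_channel[txq] for txq in range(NUM_TXQS)}

print(sorted(scaled), sorted(looked_up))
```

With inputs this small every scaled result is zero, while the lookup reaches every channel, which matches the bottleneck on TX queue zero described in the commit message.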
    • net/mlx5e: Add ndo_set_feature for uplink representor · d3cbd425
      Chris Mi authored
      After we have a dedicated uplink representor, the new netdev ops
      don't support ndo_set_feature. Because of that, we can't change
      some features, e.g. rxvlan. Now add it back.

      In this patch, I also clean up the features flag handling,
      e.g. remove the duplicate NETIF_F_HW_TC flag setting.
      
      Fixes: aec002f6 ("net/mlx5e: Uninstantiate esw manager vport netdev on switchdev mode")
      Signed-off-by: Chris Mi <chrism@mellanox.com>
      Reviewed-by: Roi Dayan <roid@mellanox.com>
      Reviewed-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5: Avoid reloading already removed devices · dd80857b
      Alaa Hleihel authored
      Prior to reloading a device we must first verify that it was not already
      removed. Otherwise, the attempt to remove the device will do nothing, and
      in that case we will end up proceeding with adding a new device that no
      one was expecting to remove, leaving behind used resources such as EQs,
      which causes a failure to destroy comp EQs and syndrome (0x30f433).
      
      Fix that by making sure that we try to remove and add a device (based on a
      protocol) only if the device is already added.
      
      Fixes: c5447c70 ("net/mlx5: E-Switch, Reload IB interface when switching devlink modes")
      Signed-off-by: Alaa Hleihel <alaa@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • net/mlx5: Update pci error handler entries and command translation · 6a6fabbf
      Edward Srouji authored
      Add missing entries for create/destroy UCTX and UMEM commands.
      This could get us wrong "unknown FW command" error in flows
      where we unbind the device or reset the driver.
      
      Also the translation of these commands from opcodes to string
      was missing.
      
      Fixes: 6e3722ba ("IB/mlx5: Use the correct commands for UMEM and UCTX allocation")
      Signed-off-by: Edward Srouji <edwards@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
    • can: purge socket error queue on sock destruct · fd704bd5
      Willem de Bruijn authored
      CAN supports software tx timestamps as of the below commit. Purge
      any queued timestamp packets on socket destroy.
      
      Fixes: 51f31cab ("ip: support for TX timestamps on UDP and RAW sockets")
      Reported-by: syzbot+a90604060cb40f5bdd16@syzkaller.appspotmail.com
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Cc: linux-stable <stable@vger.kernel.org>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
    • can: flexcan: Remove unneeded registration message · eb503004
      Fabio Estevam authored
      Currently the following message is observed when the flexcan
      driver is probed:
      
      flexcan 2090000.flexcan: device registered (reg_base=(ptrval), irq=23)
      
      The reason for printing 'ptrval' is explained at
      Documentation/core-api/printk-formats.rst:
      
      "Pointers printed without a specifier extension (i.e unadorned %p) are
      hashed to prevent leaking information about the kernel memory layout. This
      has the added benefit of providing a unique identifier. On 64-bit machines
      the first 32 bits are zeroed. The kernel will print ``(ptrval)`` until it
      gathers enough entropy."
      
      Instead of passing %pK, which can print the correct address, simply
      remove the entire message as it is not really that useful.
      Signed-off-by: Fabio Estevam <festevam@gmail.com>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
    • can: af_can: Fix error path of can_init() · c5a3aed1
      YueHaibing authored
      This patch adds an error path for can_init() to avoid a possible crash
      if some error occurs.
      
      Fixes: 0d66548a ("[CAN]: Add PF_CAN core module")
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
    • can: m_can: implement errata "Needless activation of MRAF irq" · 3e82f2f3
      Eugen Hristev authored
      During frame reception while the MCAN is in Error Passive state and the
      Receive Error Counter has the value MCAN_ECR.REC = 127, it may happen
      that MCAN_IR.MRAF is set although there was no Message RAM access
      failure. If MCAN_IR.MRAF is enabled, an interrupt to the Host CPU is
      generated.
      
      Work around:
      The Message RAM Access Failure interrupt routine needs to check whether
      
          MCAN_ECR.RP = '1' and MCAN_ECR.REC = '127'.
      
      In this case, reset MCAN_IR.MRAF. No further action is required.
      This affects versions older than 3.2.0.
      
      Errata explained on Sama5d2 SoC which includes this hardware block:
      http://ww1.microchip.com/downloads/en/DeviceDoc/SAMA5D2-Family-Silicon-Errata-and-Data-Sheet-Clarification-DS80000803B.pdf
      chapter 6.2
      
      Reproducibility: If 2 devices with m_can are connected back to back,
      configuring different bitrate on them will lead to interrupt storm on
      the receiving side, with error "Message RAM access failure occurred".
      Another way is to have a bad hardware connection. Bad wire connection
      can lead to this issue as well.
      
      This patch fixes the issue according to provided workaround.
      Signed-off-by: Eugen Hristev <eugen.hristev@microchip.com>
      Reviewed-by: Ludovic Desroches <ludovic.desroches@microchip.com>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
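The workaround described in the commit above boils down to one check in the interrupt path: if MRAF is raised while RP = 1 and REC = 127, treat it as spurious and clear it. The Python sketch below models that check on raw register values. The bit positions (RP at bit 15, REC in bits 14..8, MRAF at bit 17) are assumptions based on the M_CAN register layout, and the function names are hypothetical; the sketch also leaves out the "older than 3.2.0" version gate.

```python
ECR_RP = 1 << 15                      # Receive Error Passive flag (assumed position)
ECR_REC_SHIFT = 8                     # Receive Error Counter field (assumed position)
ECR_REC_MASK = 0x7F << ECR_REC_SHIFT
IR_MRAF = 1 << 17                     # Message RAM Access Failure (assumed position)


def is_spurious_mraf(ir: int, ecr: int) -> bool:
    """True when MRAF is set under the erratum conditions RP = 1 and REC = 127."""
    if not ir & IR_MRAF:
        return False
    rec = (ecr & ECR_REC_MASK) >> ECR_REC_SHIFT
    return bool(ecr & ECR_RP) and rec == 127


def filter_irq(ir: int, ecr: int) -> int:
    """Return the interrupt flags with a spurious MRAF bit cleared."""
    return ir & ~IR_MRAF if is_spurious_mraf(ir, ecr) else ir


erratum_ecr = ECR_RP | (127 << ECR_REC_SHIFT)
print(hex(filter_irq(IR_MRAF | 0x1, erratum_ecr)))
```

Any MRAF event outside the erratum conditions passes through unchanged, so a genuine Message RAM access failure is still reported.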
    • can: mcp251x: add support for mcp25625 · 35b7fa4d
      Sean Nyekjaer authored
      Fully compatible with mcp2515, the mcp25625 has an integrated transceiver.
      
      This patch adds support for the mcp25625 to the existing mcp251x driver.
      Signed-off-by: Sean Nyekjaer <sean@geanix.com>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
    • dt-bindings: can: mcp251x: add mcp25625 support · 0df82dcd
      Sean Nyekjaer authored
      Fully compatible with mcp2515, the mcp25625 has an integrated transceiver.

      This patch adds the mcp25625 to the device tree bindings documentation.
      Signed-off-by: Sean Nyekjaer <sean@geanix.com>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
    • can: xilinx_can: use correct bittiming_const for CAN FD core · 904044dd
      Anssi Hannula authored
      Commit 9e5f1b27 ("can: xilinx_can: add support for Xilinx CAN FD
      core") added a new can_bittiming_const structure for CAN FD cores that
      support larger values for tseg1, tseg2, and sjw than previous Xilinx CAN
      cores, but the commit did not actually put that structure to use.
      
      Fix that.
      
      Tested with CAN FD core on a ZynqMP board.
      
      Fixes: 9e5f1b27 ("can: xilinx_can: add support for Xilinx CAN FD core")
      Reported-by: Shubhrajyoti Datta <shubhrajyoti.datta@gmail.com>
      Signed-off-by: Anssi Hannula <anssi.hannula@bitwise.fi>
      Cc: Michal Simek <michal.simek@xilinx.com>
      Reviewed-by: Shubhrajyoti Datta <shubhrajyoti.datta@gmail.com>
      Cc: linux-stable <stable@vger.kernel.org>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
    • can: flexcan: fix timeout when set small bitrate · 247e5356
      Joakim Zhang authored
      Currently we can hit a timeout issue when setting a small bitrate like
      10000, as follows, on the i.MX6UL EVK board (ipg clock = 66MHz, per
      clock = 30MHz):
      
      | root@imx6ul7d:~# ip link set can0 up type can bitrate 10000
      
      A link change request failed with some changes committed already.
      Interface can0 may have been left with an inconsistent configuration,
      please check.
      
      | RTNETLINK answers: Connection timed out
      
      It is caused by flexcan_chip_unfreeze() timing out.

      Originally the code used usleep_range(10, 20) for the unfreeze
      operation, but the patch (8badd65e can: flexcan: avoid calling
      usleep_range from interrupt context) changed it to udelay(10), which is
      only half of the previous delay; there are also some other delay changes.

      Doubling FLEXCAN_TIMEOUT_US to 100 fixes the issue.
      
      Meanwhile, Rasmus Villemoes reported that even with a timeout of 100,
      flexcan_probe() fails on the MPC8309, which requires a value of at least
      140 to work reliably. 250 works for everyone.
      Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
      Reviewed-by: Dong Aisheng <aisheng.dong@nxp.com>
      Cc: linux-stable <stable@vger.kernel.org>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
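The effect of the timeout budget in the commit above can be sketched as a polling loop that checks a status bit every 10 microseconds until the budget runs out. The Python model below is only an illustration: wait_for_ack is a hypothetical name, the 120 microsecond hardware latency is a made-up value, and the loop counts simulated time instead of calling a real delay.

```python
POLL_STEP_US = 10  # one udelay(10) between polls, as described in the commit message


def wait_for_ack(ready_after_us: int, timeout_us: int) -> bool:
    """Poll a simulated ack bit that becomes ready after ready_after_us."""
    for poll in range(timeout_us // POLL_STEP_US):
        if poll * POLL_STEP_US >= ready_after_us:
            return True
        # In a driver, the 10 microsecond delay would run here before the next poll.
    return False


SLOW_HW_US = 120  # made-up latency for a slow controller
results = {timeout: wait_for_ack(SLOW_HW_US, timeout) for timeout in (50, 100, 250)}
print(results)
```

A budget of 50 or 100 gives up before the simulated hardware responds, while 250 leaves enough headroom, which is the trade-off the commit settles on.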
    • can: usb: Kconfig: Remove duplicate menu entry · 0ed89d77
      Alexander Dahl authored
      This seems to have slipped in by accident when sorting the entries.
      
      Fixes: ffbdd917
      Signed-off-by: Alexander Dahl <ada@thorsis.com>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
    • Merge tag 'wireless-drivers-for-davem-2019-06-07' of... · c7e3c93a
      David S. Miller authored
      Merge tag 'wireless-drivers-for-davem-2019-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers
      
      Kalle Valo says:
      
      ====================
      wireless-drivers fixes for 5.2
      
      First set of fixes for 5.2. Most important here are buffer overflow
      fixes for mwifiex.
      
      rtw88
      
      * fix out of bounds compiler warning
      
      * fix rssi handling to get 4x more throughput
      
      * avoid circular locking
      
      rsi
      
      * fix uninitialised data warning; these are hopefully the last ones, so
        that the warning can be enabled by default
      
      mwifiex
      
      * fix buffer overflows
      
      iwlwifi
      
      * remove unused debugfs file
      
      * various fixes
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 1e1d9263
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Free AF_PACKET po->rollover properly, from Willem de Bruijn.
      
       2) Read SFP eeprom in max 16 byte increments to avoid problems with
          some SFP modules, from Russell King.
      
       3) Fix UDP socket lookup wrt. VRF, from Tim Beale.
      
       4) Handle route invalidation properly in s390 qeth driver, from Julian
          Wiedmann.
      
       5) Memory leak on unload in RDS, from Zhu Yanjun.
      
       6) sctp_process_init leak, from Neil Horman.
      
       7) Fix fib_rules rule insertion semantic change that broke Android,
          from Hangbin Liu.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (33 commits)
        pktgen: do not sleep with the thread lock held.
        net: mvpp2: Use strscpy to handle stat strings
        net: rds: fix memory leak in rds_ib_flush_mr_pool
        ipv6: fix EFAULT on sendto with icmpv6 and hdrincl
        ipv6: use READ_ONCE() for inet->hdrincl as in ipv4
        Revert "fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied"
        net: aquantia: fix wol configuration not applied sometimes
        ethtool: fix potential userspace buffer overflow
        Fix memory leak in sctp_process_init
        net: rds: fix memory leak when unload rds_rdma
        ipv6: fix the check before getting the cookie in rt6_get_cookie
        ipv4: not do cache for local delivery if bc_forwarding is enabled
        s390/qeth: handle error when updating TX queue count
        s390/qeth: fix VLAN attribute in bridge_hostnotify udev event
        s390/qeth: check dst entry before use
        s390/qeth: handle limited IPv4 broadcast in L3 TX path
        net: fix indirect calls helpers for ptype list hooks.
        net: ipvlan: Fix ipvlan device tso disabled while NETIF_F_IP_CSUM is set
        udp: only choose unbound UDP socket for multicast when not in a VRF
        net/tls: replace the sleeping lock around RX resync with a bit lock
        ...
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma · 6e38335d
      Linus Torvalds authored
      Pull rdma fixes from Jason Gunthorpe:
       "Things are looking pretty quiet here in RDMA, not too many bug fixes
        rolling in right now. The usual driver bug fixes and fixes for a
        couple of regressions introduced in 5.2:
      
         - Fix a race on bootup with RDMA device renaming and srp. SRP also
           needs to rename its internal sys files
      
         - Fix a memory leak in hns
      
         - Don't leak resources in efa on certain error unwinds
      
         - Don't panic in certain error unwinds in ib_register_device
      
         - Various small user visible bug fix patches for the hfi and efa
           drivers
      
         - Fix the 32 bit compilation break"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
        RDMA/efa: Remove MAYEXEC flag check from mmap flow
        mlx5: avoid 64-bit division
        IB/hfi1: Validate page aligned for a given virtual address
        IB/{qib, hfi1, rdmavt}: Correct ibv_devinfo max_mr value
        IB/hfi1: Insure freeze_work work_struct is canceled on shutdown
        IB/rdmavt: Fix alloc_qpn() WARN_ON()
        RDMA/core: Fix panic when port_data isn't initialized
        RDMA/uverbs: Pass udata on uverbs error unwind
        RDMA/core: Clear out the udata before error unwind
        RDMA/hns: Fix PD memory leak for internal allocation
        RDMA/srp: Rename SRP sysfs name after IB device rename trigger
    • Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · a02a532c
      Linus Torvalds authored
      Pull arm64 fixes from Will Deacon:
       "Another round of mostly-benign fixes, the exception being a boot crash
        on SVE2-capable CPUs (although I don't know where you'd find such a
        thing, so maybe it's benign too).
      
        We're in the process of resolving some big-endian ptrace breakage, so
        I'll probably have some more for you next week.
      
        Summary:
      
         - Fix boot crash on platforms with SVE2 due to missing register
           encoding
      
         - Fix architected timer accessors when CONFIG_OPTIMIZE_INLINING=y
      
         - Move cpu_logical_map into smp.h for use by upcoming irqchip drivers
      
         - Trivial typo fix in comment
      
         - Disable some useless, noisy warnings from GCC 9"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: Silence gcc warnings about arch ABI drift
        ARM64: trivial: s/TIF_SECOMP/TIF_SECCOMP/ comment typo fix
        arm64: arch_timer: mark functions as __always_inline
        arm64: smp: Moved cpu_logical_map[] to smp.h
        arm64: cpufeature: Fix missing ZFR0 in __read_sysreg_by_encoding()
  4. 06 Jun, 2019 8 commits
    • Merge branch 'fix-unconnected-udp' · 4aeba328
      Alexei Starovoitov authored
      Daniel Borkmann says:
      
      ====================
      Please refer to the patch 1/6 as the main patch with the details
      on the current sendmsg hook API limitations and proposal to fix
      it in order to work with basic applications like DNS. Remaining
      patches are the usual uapi and tooling updates as well as test
      cases. Thanks a lot!
      
      v2 -> v3:
        - Add attach types to test_section_names.c and libbpf (Andrey)
        - Added given Acks, rest as-is
      v1 -> v2:
        - Split off uapi header sync and bpftool bits (Martin, Alexei)
        - Added missing bpftool doc and bash completion as well
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: expand section tests for test_section_names · b714560f
      Daniel Borkmann authored
      Add cgroup/recvmsg{4,6} to test_section_names as well. Test run output:
      
        # ./test_section_names
        libbpf: failed to guess program type based on ELF section name 'InvAliD'
        libbpf: supported section(type) names are: [...]
        libbpf: failed to guess attach type based on ELF section name 'InvAliD'
        libbpf: attachable section(type) names are: [...]
        libbpf: failed to guess program type based on ELF section name 'cgroup'
        libbpf: supported section(type) names are: [...]
        libbpf: failed to guess attach type based on ELF section name 'cgroup'
        libbpf: attachable section(type) names are: [...]
        Summary: 38 PASSED, 0 FAILED
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: more msg_name rewrite tests to test_sock_addr · 1812291e
      Daniel Borkmann authored
      Extend test_sock_addr with recvmsg test cases; larger parts of the
      sendmsg code can be reused for this. Below is the strace view of
      the recvmsg rewrites; the sendmsg side does not have a BPF prog
      attached to it in the context of this test:
      
      IPv4 test case:
      
        [pid  4846] bpf(BPF_PROG_ATTACH, {target_fd=3, attach_bpf_fd=4, attach_type=0x13 /* BPF_??? */, attach_flags=BPF_F_ALLOW_OVERRIDE}, 112) = 0
        [pid  4846] socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 5
        [pid  4846] bind(5, {sa_family=AF_INET, sin_port=htons(4444), sin_addr=inet_addr("127.0.0.1")}, 128) = 0
        [pid  4846] socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 6
        [pid  4846] sendmsg(6, {msg_name={sa_family=AF_INET, sin_port=htons(4444), sin_addr=inet_addr("127.0.0.1")}, msg_namelen=128, msg_iov=[{iov_base="a", iov_len=1}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 1
        [pid  4846] select(6, [5], NULL, NULL, {tv_sec=2, tv_usec=0}) = 1 (in [5], left {tv_sec=1, tv_usec=999995})
        [pid  4846] recvmsg(5, {msg_name={sa_family=AF_INET, sin_port=htons(4040), sin_addr=inet_addr("192.168.1.254")}, msg_namelen=128->16, msg_iov=[{iov_base="a", iov_len=64}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 1
        [pid  4846] close(6)                    = 0
        [pid  4846] close(5)                    = 0
        [pid  4846] bpf(BPF_PROG_DETACH, {target_fd=3, attach_type=0x13 /* BPF_??? */}, 112) = 0
      
      IPv6 test case:
      
        [pid  4846] bpf(BPF_PROG_ATTACH, {target_fd=3, attach_bpf_fd=4, attach_type=0x14 /* BPF_??? */, attach_flags=BPF_F_ALLOW_OVERRIDE}, 112) = 0
        [pid  4846] socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP) = 5
        [pid  4846] bind(5, {sa_family=AF_INET6, sin6_port=htons(6666), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 128) = 0
        [pid  4846] socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP) = 6
        [pid  4846] sendmsg(6, {msg_name={sa_family=AF_INET6, sin6_port=htons(6666), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, msg_namelen=128, msg_iov=[{iov_base="a", iov_len=1}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 1
        [pid  4846] select(6, [5], NULL, NULL, {tv_sec=2, tv_usec=0}) = 1 (in [5], left {tv_sec=1, tv_usec=999996})
        [pid  4846] recvmsg(5, {msg_name={sa_family=AF_INET6, sin6_port=htons(6060), inet_pton(AF_INET6, "face:b00c:1234:5678::abcd", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, msg_namelen=128->28, msg_iov=[{iov_base="a", iov_len=64}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 1
        [pid  4846] close(6)                    = 0
        [pid  4846] close(5)                    = 0
        [pid  4846] bpf(BPF_PROG_DETACH, {target_fd=3, attach_type=0x14 /* BPF_??? */}, 112) = 0
      
      test_sock_addr run w/o strace view:
      
        # ./test_sock_addr.sh
        [...]
        Test case: recvmsg4: return code ok .. [PASS]
        Test case: recvmsg4: return code !ok .. [PASS]
        Test case: recvmsg6: return code ok .. [PASS]
        Test case: recvmsg6: return code !ok .. [PASS]
        Test case: recvmsg4: rewrite IP & port (asm) .. [PASS]
        Test case: recvmsg6: rewrite IP & port (asm) .. [PASS]
        [...]
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, bpftool: enable recvmsg attach types · 000aa125
      Daniel Borkmann authored
      Trivial patch to bpftool in order to complete enabling attaching programs
      to BPF_CGROUP_UDP{4,6}_RECVMSG.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, libbpf: enable recvmsg attach types · 9bb59ac1
      Daniel Borkmann authored
      Another trivial patch to libbpf in order to enable identifying and
      attaching programs to BPF_CGROUP_UDP{4,6}_RECVMSG by section name.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: sync tooling uapi header · 3dbc6ada
      Daniel Borkmann authored
      Sync BPF uapi header in order to pull in BPF_CGROUP_UDP{4,6}_RECVMSG
      attach types. This is done and preferred as an extra patch in order
      to ease sync of libbpf.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: fix unconnected udp hooks · 983695fa
      Daniel Borkmann authored
      Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
      to applications as also stated in original motivation in 7828f20e ("Merge
      branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
      two hooks into Cilium to enable host based load-balancing with Kubernetes,
      I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
      typically sets up DNS as a service and is thus subject to load-balancing.
      
      Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
      is currently insufficient and thus not usable as-is for standard applications
      shipped with most distros. To break down the issue we ran into with a simple
      example:
      
        # cat /etc/resolv.conf
        nameserver 147.75.207.207
        nameserver 147.75.207.208
      
      For the purpose of a simple test, we set up above IPs as service IPs and
      transparently redirect traffic to a different DNS backend server for that
      node:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      The attached BPF program is basically selecting one of the backends if the
      service IP/port matches on the cgroup hook. DNS breaks here, because the
      hooks are not transparent enough to applications which have built-in msg_name
      address checks:
      
        # nslookup 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
        ;; connection timed out; no servers could be reached
      
        # dig 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; connection timed out; no servers could be reached
      
      For comparison, if none of the service IPs is used, and we tell nslookup
      to use 8.8.8.8 directly it works just fine, of course:
      
        # nslookup 1.1.1.1 8.8.8.8
        1.1.1.1.in-addr.arpa	name = one.one.one.one.
      
      In order to fix this and thus act more transparently towards the
      application, reverse translation is needed on the recvmsg() side. A minimal
      fix for this API is to add similar recvmsg() hooks behind the BPF cgroups
      static key such that the program can track state and replace the current
      sockaddr_in{,6} with the original service IP. From the BPF side, one
      example is to track the service tuple plus socket cookie in an LRU map,
      from which the reverse NAT can then be retrieved via the map value.
      Side-note: the BPF cgroups static key should be converted to a per-hook
      static key in future.
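
      The tracking scheme described above can be modelled in plain user-space
      C. This is only a sketch under stated assumptions, not the attached BPF
      program: the names IPV4, struct endpoint, struct service, rev_nat_*,
      model_sendmsg_hook and model_recvmsg_hook are all hypothetical, a small
      fixed array stands in for the LRU map, and the socket cookie is just a
      caller-supplied number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space model only; none of these names exist in the kernel or Cilium. */
#define IPV4(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | ((uint32_t)(c) << 8) | (uint32_t)(d))
#define REV_NAT_SLOTS 4

struct endpoint {
	uint32_t addr;	/* host byte order, for readability */
	uint16_t port;
};

struct service {
	struct endpoint frontend;
	struct endpoint backend;
};

/* Reverse NAT entry keyed by (socket cookie, backend); value is the frontend. */
struct rev_nat_entry {
	uint64_t cookie;
	struct endpoint backend;
	struct endpoint frontend;
	uint64_t last_used;	/* logical clock driving LRU eviction */
	int in_use;
};

/* The two service entries from the example in the text. */
static const struct service services[] = {
	{ { IPV4(147, 75, 207, 207), 53 }, { IPV4(8, 8, 8, 8), 53 } },
	{ { IPV4(147, 75, 207, 208), 53 }, { IPV4(8, 8, 8, 8), 53 } },
};

static struct rev_nat_entry rev_nat[REV_NAT_SLOTS];
static uint64_t rev_nat_clock;

static int endpoint_eq(const struct endpoint *a, const struct endpoint *b)
{
	return a->addr == b->addr && a->port == b->port;
}

static struct rev_nat_entry *rev_nat_lookup(uint64_t cookie,
					    const struct endpoint *backend)
{
	for (size_t i = 0; i < REV_NAT_SLOTS; i++) {
		struct rev_nat_entry *e = &rev_nat[i];

		if (e->in_use && e->cookie == cookie &&
		    endpoint_eq(&e->backend, backend)) {
			e->last_used = ++rev_nat_clock;
			return e;
		}
	}
	return NULL;
}

static void rev_nat_update(uint64_t cookie, const struct service *svc)
{
	struct rev_nat_entry *e;

	assert(svc != NULL);
	e = rev_nat_lookup(cookie, &svc->backend);
	if (!e) {
		/* Take the first free slot, else evict the least recently used. */
		e = &rev_nat[0];
		for (size_t i = 0; i < REV_NAT_SLOTS; i++) {
			if (!rev_nat[i].in_use) {
				e = &rev_nat[i];
				break;
			}
			if (rev_nat[i].last_used < e->last_used)
				e = &rev_nat[i];
		}
	}
	e->cookie = cookie;
	e->backend = svc->backend;
	e->frontend = svc->frontend;
	e->last_used = ++rev_nat_clock;
	e->in_use = 1;
}

/* sendmsg side: rewrite frontend -> backend and remember the mapping. */
static int model_sendmsg_hook(uint64_t cookie, struct endpoint *dst)
{
	for (size_t i = 0; i < sizeof(services) / sizeof(services[0]); i++) {
		if (endpoint_eq(dst, &services[i].frontend)) {
			rev_nat_update(cookie, &services[i]);
			*dst = services[i].backend;
			break;
		}
	}
	return 1;	/* always allow, matching the [1,1] return range */
}

/* recvmsg side: rewrite backend -> original frontend if we tracked it. */
static int model_recvmsg_hook(uint64_t cookie, struct endpoint *src)
{
	struct rev_nat_entry *e = rev_nat_lookup(cookie, src);

	if (e)
		*src = e->frontend;
	return 1;
}
```

      In this model a socket that sends to 147.75.207.207:53 has its
      destination rewritten to 8.8.8.8:53, and a later reply from 8.8.8.8:53
      on the same cookie is reported as coming from 147.75.207.207:53 again.
      Because both frontends share one backend, a second send on the same
      socket to the other frontend overwrites the entry; the model does not
      try to resolve that.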
      
      Same example after this fix:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      Lookups work fine now:
      
        # nslookup 1.1.1.1
        1.1.1.1.in-addr.arpa    name = one.one.one.one.
      
        Authoritative answers can be found from:
      
        # dig 1.1.1.1
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; Got answer:
        ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
        ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
      
        ;; OPT PSEUDOSECTION:
        ; EDNS: version: 0, flags:; udp: 512
        ;; QUESTION SECTION:
        ;1.1.1.1.                       IN      A
      
        ;; AUTHORITY SECTION:
        .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400
      
        ;; Query time: 17 msec
        ;; SERVER: 147.75.207.207#53(147.75.207.207)
        ;; WHEN: Tue May 21 12:59:38 UTC 2019
        ;; MSG SIZE  rcvd: 111
      
      And from an actual packet level it shows that we're using the back end
      server when talking via 147.75.207.20{7,8} front end:
      
        # tcpdump -i any udp
        [...]
        12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        [...]
      
      In order to be flexible and to have the same semantics as in sendmsg BPF
      programs, we only allow return codes in [1,1] range. In the sendmsg case
      the program is called if msg->msg_name is present, which can be the case
      in both connected and unconnected UDP.
      
      The former only relies on the sockaddr_in{,6} passed via connect(2) if
      the passed msg->msg_name was NULL. Therefore, on the recvmsg side, we act
      in a similar way and call into the BPF program whenever a non-NULL
      msg->msg_name was passed, independent of sk->sk_state being
      TCP_ESTABLISHED or not. Note that for the TCP case, msg->msg_name is
      ignored in the regular recvmsg path and is therefore not relevant.
      
      For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
      the hook is not called. This is intentional as it aligns with the same
      semantics as in case of TCP cgroup BPF hooks right now. This might be
      better addressed in future through a different bpf_attach_type such
      that this case can be distinguished from the regular recvmsg paths,
      for example.
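
      The rules above for when the recvmsg hook runs can be summarized as a
      small predicate. This is a user-space sketch for illustration, not
      kernel code: struct model_recv_ctx, enum model_proto and both functions
      are hypothetical names, and the fields merely stand in for
      msg->msg_name, sk->sk_state and MSG_ERRQUEUE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; they do not exist in the kernel. */
enum model_proto { MODEL_UDP, MODEL_TCP };

struct model_recv_ctx {
	enum model_proto proto;
	const void *msg_name;	/* stand-in for msg->msg_name */
	bool established;	/* stand-in for sk->sk_state == TCP_ESTABLISHED */
	bool errqueue;		/* stand-in for a MSG_ERRQUEUE receive */
};

/* True if, per the description above, the recvmsg BPF hook would be called. */
static bool model_recvmsg_hook_runs(const struct model_recv_ctx *c)
{
	assert(c != NULL);

	if (c->proto == MODEL_TCP)
		return false;	/* msg_name is ignored on the TCP recvmsg path */
	if (c->errqueue)
		return false;	/* ip{,v6}_recv_error() paths skip the hook */

	/* Connected or not does not matter; only a non-NULL msg_name does. */
	return c->msg_name != NULL;
}

/* Only return codes in the [1,1] range are accepted from the program. */
static bool model_retval_allowed(int ret)
{
	return ret == 1;
}
```

      Note that the established field is deliberately never read by the
      predicate, which mirrors the statement that the decision is independent
      of sk->sk_state.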
      
      Fixes: 1cedee13 ("bpf: Hooks for sys_sendmsg")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Martynas Pumputis <m@lambda.lt>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      983695fa
    • Merge branch 'parisc-5.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux · 16d72dd4
      Linus Torvalds authored
      Pull parisc fixes from Helge Deller:
      
       - Fix crashes when accessing PCI devices on some machines like C240 and
         J5000. The crashes were triggered because we replaced cache flushes
         by nops in the alternative coding where we shouldn't for some
         machines.
      
       - Dave fixed a race in the usage of the sr1 space register when used to
         load the coherence index.
      
       - Use the hardware lpa instruction to load the physical address of
         kernel virtual addresses in the iommu driver code.
      
       - The kernel may fail to link when CONFIG_MLONGCALLS isn't set. Solve
         that by rearranging functions in the final vmlinux executable.
      
       - Some defconfig cleanups and removal of compiler warnings.
      
      * 'parisc-5.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
        parisc: Fix crash due alternative coding for NP iopdir_fdc bit
        parisc: Use lpa instruction to load physical addresses in driver code
        parisc: configs: Remove useless UEVENT_HELPER_PATH
        parisc: Use implicit space register selection for loading the coherence index of I/O pdirs
        parisc: Fix compiler warnings in float emulation code
        parisc/slab: cleanup after /proc/slab_allocators removal
        parisc: Allow building 64-bit kernel without -mlong-calls compiler option
        parisc: Kconfig: remove ARCH_DISCARD_MEMBLOCK
      16d72dd4