- 23 Oct, 2015 4 commits
-
-
Eric Dumazet authored
Multiple cpus can process duplicates of incoming ACK messages matching a SYN_RECV request socket. This is a rare event under normal operations, but definitely can happen. Only one must win the race, otherwise corruption would occur. To fix this without adding new atomic ops, we use logic in inet_ehash_nolisten() to detect the request was present in the same ehash bucket where we try to insert the new child. If request socket was not found, we have to undo the child creation. This actually removes a spin_lock()/spin_unlock() pair in reqsk_queue_unlink() for the fast path. Fixes: e994b2f0 ("tcp: do not lock listener to process SYN packets") Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Paolo Abeni authored
Currently adding a new ipv4 address always cause the creation of the related network route, with default metric. When a host has multiple interfaces on the same network, multiple routes with the same metric are created. If the userspace wants to set specific metric on each routes, i.e. giving better metric to ethernet links in respect to Wi-Fi ones, the network routes must be deleted and recreated, which is error-prone. This patch implements the support for IFA_F_NOPREFIXROUTE for ipv4 address. When an address is added with such flag set, no associated network route is created, no network route is deleted when said IP is gone and it's up to the user space manage such route. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Michael Chan authored
Broadcom ethernet driver for the new family of NetXtreme-C/E ethernet devices. v5: - Removed empty blank lines at end of files (noted by David Miller). - Moved busy poll helper functions to bnxt.h to at least make the .c file look less cluttered with #ifdef (noted by Stephen Hemminger). v4: - Broke up 2 long message strings with "\n" (suggested by John Linville) - Constify an array of strings (suggested by Stephen Hemminger) - Improve bnxt_vf_pciid() (suggested by Stephen Hemminger) - Use PCI_VDEVICE() to populate pci_device_id table for more compact source. v3: - Fixed 2 more sparse warnings. - Removed some unused structures in .h files. v2: - Fixed all kbuild test robot reported warnings. - Fixed many of the checkpatch.pl errors and warnings. - Fixed the Kconfig description (noted by Dmitry Kravkov). Acked-by: Eddie Wai <eddie.wai@broadcom.com> Acked-by: Jeffrey Huang <huangjw@broadcom.com> Signed-off-by: Prashant Sreedharan <prashant@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vivien Didelot authored
It is preferable to have a common debugfs interface for DSA or switchdev instead of a driver specific one. Thus remove the mv88e6xxx debug code. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 22 Oct, 2015 32 commits
-
-
David S. Miller authored
Vivien Didelot says: ==================== net: dsa: implement port_fdb_dump in drivers Not all switch chips provide a Get Next kind of operation to dump FDB entries. It is preferred to let the driver handle the dump operation the way it works best for the chip. Thus, drop port_fdb_getnext and implement the port_fdb_dump operation in DSA, which pushes the switchdev FDB dump callback down to the drivers. mv88e6xxx is the only driver affected and is updated accordingly. v3 -> v4: fix rejects on latest net-next v2 -> v3: opencode switchdev_obj_dump_cb_t to avoid multiple typedef; use ether_addr_copy in fdb_dump v1 -> v2: fix a few "return err" instead of "goto unlock" in mv88e6xxx.c ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vivien Didelot authored
No driver implements port_fdb_getnext anymore, and port_fdb_dump is preferred anyway, so remove this function from DSA. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vivien Didelot authored
Now that port_fdb_dump is implemented and even simpler, get rid of port_fdb_getnext. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vivien Didelot authored
Implement the port_fdb_dump DSA operation. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vivien Didelot authored
There is no need to write the MAC address before every Get Next operation, since ATU MAC registers are not cleared between calls. Move the _mv88e6xxx_atu_mac_write call outside of _mv88e6xxx_atu_getnext so future code could call ATU Get Next multiple times and save a few register access. Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vivien Didelot authored
There is no need to write the VLAN ID before every Get Next operation, since the VTU VID register is not cleared between calls. Move the VID write call in a _mv88e6xxx_vtu_vid_write function outside of _mv88e6xxx_vtu_getnext so future code could call VTU Get Next multiple times and save a few register accesses. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vivien Didelot authored
Not all switch chips support a Get Next operation to iterate on its FDB. So add a more simple port_fdb_dump function for them. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Merge tag 'mac80211-next-for-davem-2015-10-21' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== Here's another set of patches for the current cycle: * I merged net-next back to avoid a conflict with the * cfg80211 scheduled scan API extensions * preparations for better scan result timestamping * regulatory cleanups * mac80211 statistics cleanups * a few other small cleanups and fixes ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
yankejian authored
the global Soc configuration is treated by syscon, and sub ctrl bus is Soc bus. it has to be treated by syscon. Signed-off-by: yankejian <yankejian@huawei.com> Signed-off-by: lisheng <lisheng011@huawei.com> Signed-off-by: lipeng <lipeng321@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Hariprasad Shenai says: ==================== Trivial fixes for cxgb4 driver This patch series updates driver description for next gen. adapters, updates firmware info., returns error for setup_rss error case, restores L1 configuration in case of FW rejects new config, updates and aligns ethtool get stats settings, etc This patch series has been created against net-next tree and includes patches on cxgb4 and cxgb4vf driver. We have included all the maintainers of respective drivers. Kindly review the change and let us know in case of any review comments. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Hariprasad Shenai authored
Update ethtool get_drvinfo to display regdump len and also update firmware string version print to display N/A in case FW isn't present Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Hariprasad Shenai authored
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Hariprasad Shenai authored
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Hariprasad Shenai authored
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Hariprasad Shenai authored
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Hariprasad Shenai authored
In the ethtool set_settings() routine we need to remember our old L1 Configuration in case the firmware rejects the request and then restore that. Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Hariprasad Shenai authored
For {1, 10, 40} Gb/s. Prohibiting turning off autonegotiation isn't anywhere in the standard. Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Hariprasad Shenai authored
Align the ethtool get stats settings with the rest so it looks uniform Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Pravin B Shelar authored
With use of lwtunnel, we can directly call dev_queue_xmit() rather than calling netdev vport send operation. Following change make tunnel vport code bit cleaner. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Pravin B Shelar authored
Patch fixes following sparse warning. net/openvswitch/flow_netlink.c:583:30: warning: incorrect type in assignment (different base types) net/openvswitch/flow_netlink.c:583:30: expected restricted __be16 [usertype] ipv4 net/openvswitch/flow_netlink.c:583:30: got int Fixes: 6b26ba3a ("openvswitch: netlink attributes for IPv6 tunneling") Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Alexei Starovoitov says: ==================== bpf_perf_event_output helper Over the last year there were multiple attempts to let eBPF programs output data into perf events by He Kuang and Wangnan. The last one was: https://lkml.org/lkml/2015/7/20/736 It was almost perfect with exception that all bpf programs would sent data into one global perf_event. This patch set takes different approach by letting user space open independent PERF_COUNT_SW_BPF_OUTPUT events, so that program output won't collide. Wangnan is working on corresponding perf patches. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexei Starovoitov authored
Performance test and example of bpf_perf_event_output(). kprobe is attached to sys_write() and trivial bpf program streams pid+cookie into userspace via PERF_COUNT_SW_BPF_OUTPUT event. Usage: $ sudo ./bld_x64/samples/bpf/trace_output recv 2968913 events per sec Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexei Starovoitov authored
This helper is used to send raw data from eBPF program into special PERF_TYPE_SOFTWARE/PERF_COUNT_SW_BPF_OUTPUT perf_event. User space needs to perf_event_open() it (either for one or all cpus) and store FD into perf_event_array (similar to bpf_perf_event_read() helper) before eBPF program can send data into it. Today the programs triggered by kprobe collect the data and either store it into the maps or print it via bpf_trace_printk() where latter is the debug facility and not suitable to stream the data. This new helper replaces such bpf_trace_printk() usage and allows programs to have dedicated channel into user space for post-processing of the raw data collected. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexei Starovoitov authored
Instead of WARN_ON in perf_event_output() on unpaded raw samples, pad them automatically. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Brenden Blanco authored
In the ipv4 outbound path of an ipvlan device in l3 mode, the ifindex is being grabbed from dev_get_iflink. This works for the physical device case, since as the documentation of that function notes: "Physical interfaces have the same 'ifindex' and 'iflink' values.". However, if the master device is a veth, and the pairs are in separate net namespaces, the route lookup will fail with -ENODEV due to outer veth pair being in a separate namespace from the ipvlan master/routing namespace. ns0 | ns1 | ns2 veth0a--|--veth0b--|--ipvl0 In ipvlan_process_v4_outbound(), a packet sent from ipvl0 in the above configuration will pass fl.flowi4_oif == veth0a to ip_route_output_flow(), but *net == ns1. Notice also that ipv6 processing is not using iflink. Since there is a discrepancy in usage, fixup both v4 and v6 case to use local dev variable. Tested this with l3 ipvlan on top of veth, as well as with single physical interface in the top namespace. Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Reviewed-by: Jiri Benc <jbenc@redhat.com> Acked-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Allowing an application to set whatever limit for the list of recently RST fastopen sessions [1] is not wise, as it open ways to deplete kernel memory. Cap the user provided limit by somaxconn sysctl, like listen() backlog. [1] https://tools.ietf.org/html/rfc7413#section-5.1Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vivien Didelot authored
This header file only contains the platform data structure definition, so move it to the include/linux/platform_data/ directory. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vivien Didelot authored
Remove the inclusion of linux/mdio-gpio.h in nas4220b, wbd111 and wbd222 boards since mdio-gpio is not used. Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Wu Fengguang authored
drivers/net/ethernet/hisilicon/hns/hnae.c:442:1-3: WARNING: PTR_ERR_OR_ZERO can be used Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR Generated by: scripts/coccinelle/api/ptr_ret.cocci CC: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Tom Herbert added SIT support to GRO with commit 19424e05 ("sit: Add gro callbacks to sit_offload"), later reverted by Herbert Xu. The problem came because Tom patch was building GRO packets without proper meta data : If packets were locally delivered, we would not care. But if packets needed to be forwarded, GSO engine was not able to segment individual segments. With the following patch, we correctly set skb->encapsulation and inner network header. We also update gso_type. Tested: Server : netserver modprobe dummy ifconfig dummy0 8.0.0.1 netmask 255.255.255.0 up arp -s 8.0.0.100 4e:32:51:04:47:e5 iptables -I INPUT -s 10.246.7.151 -j TEE --gateway 8.0.0.100 ifconfig sixtofour0 sixtofour0 Link encap:IPv6-in-IPv4 inet6 addr: 2002:af6:798::1/128 Scope:Global inet6 addr: 2002:af6:798::/128 Scope:Global UP RUNNING NOARP MTU:1480 Metric:1 RX packets:411169 errors:0 dropped:0 overruns:0 frame:0 TX packets:409414 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:0 RX bytes:20319631739 (20.3 GB) TX bytes:29529556 (29.5 MB) Client : netperf -H 2002:af6:798::1 -l 1000 & Checked on server traffic copied on dummy0 and verify segments were properly rebuilt, with proper IP headers, TCP checksums... tcpdump on eth0 shows proper GRO aggregation takes place. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
While testing my SIT/GRO patch using netfilter TEE module and a dummy device, I found some features were missing : TSO IPv6, UFO, and encapsulated traffic. ethtool -k dummy0 now gives : ... tcp-segmentation-offload: on tx-tcp-segmentation: on tx-tcp-ecn-segmentation: on tx-tcp6-segmentation: on udp-fragmentation-offload: on ... tx-gre-segmentation: on tx-ipip-segmentation: on tx-sit-segmentation: on tx-udp_tnl-segmentation: on Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Arad, Ronen authored
if_nlmsg_size() overestimates the minimum allocation size of netlink dump request (when called from rtnl_calcit()) or the size of the message (when called from rtnl_getlink()). This is because ext_filter_mask is not supported by rtnl_link_get_af_size() and rtnl_link_get_size(). The over-estimation is significant when at least one netdev has many VLANs configured (8 bytes for each configured VLAN). This patch-set "rightsizes" the protocol specific attribute size calculation by propagating ext_filter_mask to rtnl_link_get_af_size() and adding this a argument to get_link_af_size op in rtnl_af_ops. Bridge module already used filtering aware sizing for notifications. br_get_link_af_size_filtered() is consistent with the modified get_link_af_size op so it replaces br_get_link_af_size() in br_af_ops. br_get_link_af_size() becomes unused and thus removed. Signed-off-by: Ronen Arad <ronen.arad@intel.com> Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 21 Oct, 2015 4 commits
-
-
Elad Raz authored
Configure ageing time to the HW for newly bridged device CC: Scott Feldman <sfeldma@gmail.com> CC: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Elad Raz <eladr@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Yuchung Cheng says: ==================== RACK loss detection RACK (Recent ACK) loss recovery uses the notion of time instead of packet sequence (FACK) or counts (dupthresh). It's inspired by the FACK heuristic in tcp_mark_lost_retrans(): when a limited transmit (new data packet) is sacked in recovery, then any retransmission sent before that newly sacked packet was sent must have been lost, since at least one round trip time has elapsed. But that existing heuristic from tcp_mark_lost_retrans() has several limitations: 1) it can't detect tail drops since it depends on limited transmit 2) it's disabled upon reordering (assumes no reordering) 3) it's only enabled in fast recovery but not timeout recovery RACK addresses these limitations with a core idea: an unacknowledged packet P1 is deemed lost if a packet P2 that was sent later is is s/acked, since at least one round trip has passed. Since RACK cares about the time sequence instead of the data sequence of packets, it can detect tail drops when a later retransmission is s/acked, while FACK or dupthresh can't. For reordering RACK uses a dynamically adjusted reordering window ("reo_wnd") to reduce false positives on ever (small) degree of reordering, similar to the delayed Early Retransmit. In the current patch set RACK is only a supplemental loss detection and does not trigger fast recovery. However we are developing RACK to replace or consolidate FACK/dupthresh, early retransmit, and thin-dupack. These heuristics all implicitly bear the time notion. For example, the delayed Early Retransmit is simply applying RACK to trigger the fast recovery with small inflight. RACK requires measuring the minimum RTT. Tracking a global min is less robust due to traffic engineering pathing changes. Therefore it uses a windowed filter by Kathleen Nichols. The min RTT can also be useful for various other purposes like congestion control or stat monitoring. This patch has been used on Google servers for well over 1 year. RACK has also been implemented in the QUIC protocol. We are submitting an IETF draft as well. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Yuchung Cheng authored
This patch implements the second half of RACK that uses the the most recent transmit time among all delivered packets to detect losses. tcp_rack_mark_lost() is called upon receiving a dubious ACK. It then checks if an not-yet-sacked packet was sent at least "reo_wnd" prior to the sent time of the most recently delivered. If so the packet is deemed lost. The "reo_wnd" reordering window starts with 1msec for fast loss detection and changes to min-RTT/4 when reordering is observed. We found 1msec accommodates well on tiny degree of reordering (<3 pkts) on faster links. We use min-RTT instead of SRTT because reordering is more of a path property but SRTT can be inflated by self-inflicated congestion. The factor of 4 is borrowed from the delayed early retransmit and seems to work reasonably well. Since RACK is still experimental, it is now used as a supplemental loss detection on top of existing algorithms. It is only effective after the fast recovery starts or after the timeout occurs. The fast recovery is still triggered by FACK and/or dupack threshold instead of RACK. We introduce a new sysctl net.ipv4.tcp_recovery for future experiments of loss recoveries. For now RACK can be disabled by setting it to 0. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Yuchung Cheng authored
This patch is the first half of the RACK loss recovery. RACK loss recovery uses the notion of time instead of packet sequence (FACK) or counts (dupthresh). It's inspired by the previous FACK heuristic in tcp_mark_lost_retrans(): when a limited transmit (new data packet) is sacked, then current retransmitted sequence below the newly sacked sequence must been lost, since at least one round trip time has elapsed. But it has several limitations: 1) can't detect tail drops since it depends on limited transmit 2) is disabled upon reordering (assumes no reordering) 3) only enabled in fast recovery ut not timeout recovery RACK (Recently ACK) addresses these limitations with the notion of time instead: a packet P1 is lost if a later packet P2 is s/acked, as at least one round trip has passed. Since RACK cares about the time sequence instead of the data sequence of packets, it can detect tail drops when later retransmission is s/acked while FACK or dupthresh can't. For reordering RACK uses a dynamically adjusted reordering window ("reo_wnd") to reduce false positives on ever (small) degree of reordering. This patch implements tcp_advanced_rack() which tracks the most recent transmission time among the packets that have been delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp is the key to determine which packet has been lost. Consider an example that the sender sends six packets: T1: P1 (lost) T2: P2 T3: P3 T4: P4 T100: sack of P2. rack.mstamp = T2 T101: retransmit P1 T102: sack of P2,P3,P4. rack.mstamp = T4 T205: ACK of P4 since the hole is repaired. rack.mstamp = T101 We need to be careful about spurious retransmission because it may falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK to falsely mark all packets lost, just like a spurious timeout. We identify spurious retransmission by the ACK's TS echo value. If TS option is not applicable but the retransmission is acknowledged less than min-RTT ago, it is likely to be spurious. We refrain from using the transmission time of these spurious retransmissions. The second half is implemented in the next patch that marks packet lost using RACK timestamp. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-