Commits · 2a24444f8f2bea694003e3eac5c2f8d9a386bdc5 · nexedi / linux

14 Nov, 2011 1 commit

ipv6: reduce percpu needs for icmpv6msg mibs · 2a24444f

Eric Dumazet authored Nov 13, 2011

Reading /proc/net/snmp6 on a machine with a lot of cpus is very
expensive (can be ~88000 us).

This is because ICMPV6MSG MIB uses 4096 bytes per cpu, and folding
values for all possible cpus can read 16 Mbytes of memory (32MBytes on
non x86 arches)

ICMP messages are not considered as fast path on a typical server, and
eventually few cpus handle them anyway. We can afford an atomic
operation instead of using percpu data.

This saves 4096 bytes per cpu and per network namespace.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2a24444f

13 Nov, 2011 13 commits

net: introduce ethernet teaming device · 3d249d4c

Jiri Pirko authored Nov 11, 2011

This patch introduces new network device called team. It supposes to be
very fast, simple, userspace-driven alternative to existing bonding
driver.

Userspace library called libteam with couple of demo apps is available
here:
https://github.com/jpirko/libteam
Note it's still in its dipers atm.

team<->libteam use generic netlink for communication. That and rtnl
suppose to be the only way to configure team device, no sysfs etc.

Python binding of libteam was recently introduced.
Daemon providing arpmon/miimon active-backup functionality will be
introduced shortly. All what's necessary is already implemented in
kernel team driver.

v7->v8:
	- check ndo_ndo_vlan_rx_[add/kill]_vid functions before calling
	  them.
	- use dev_kfree_skb_any() instead of dev_kfree_skb()

v6->v7:
	- transmit and receive functions are not checked in hot paths.
	  That also resolves memory leak on transmit when no port is
	  present

v5->v6:
	- changed couple of _rcu calls to non _rcu ones in non-readers

v4->v5:
	- team_change_mtu() uses team->lock while travesing though port
	  list
	- mac address changes are moved completely to jurisdiction of
	  userspace daemon. This way the daemon can do FOM1, FOM2 and
	  possibly other weird things with mac addresses.
	  Only round-robin mode sets up all ports to bond's address then
	  enslaved.
	- Extended Kconfig text

v3->v4:
	- remove redundant synchronize_rcu from __team_change_mode()
	- revert "set and clear of mode_ops happens per pointer, not per
	  byte"
	- extend comment of function __team_change_mode()

v2->v3:
	- team_change_mtu() uses rcu version of list traversal to unwind
	- set and clear of mode_ops happens per pointer, not per byte
	- port hashlist changed to be embedded into team structure
	- error branch in team_port_enter() does cleanup now
	- fixed rtln->rtnl

v1->v2:
	- modes are made as modules. Makes team more modular and
	  extendable.
	- several commenters' nitpicks found on v1 were fixed
	- several other bugs were fixed.
	- note I ignored Eric's comment about roundrobin port selector
	  as Eric's way may be easily implemented as another mode (mode
	  "random") in future.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3d249d4c

bnx2x: update driver version to 1.70.35-0 · 5d70b88c

Dmitry Kravkov authored Nov 13, 2011

Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5d70b88c

bnx2x: Remove on-stack napi struct variable · 72754080

Ariel Elior authored Nov 13, 2011

Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

72754080

bnx2x: prevent race in statistics flow · 4a025f49

Dmitry Kravkov authored Nov 13, 2011

The race may cause access of registers while MAC hw block is
in reset state. As a result syslog will show error messages.
We can prevent this by using state from local variable.
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4a025f49

bnx2x: add fan failure event handling · 8304859a

Ariel Elior authored Nov 13, 2011

Shut down the device in case of fan failure to prevent HW damage.
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8304859a

bnx2x: remove unused #define · 46fa1309

Dmitry Kravkov authored Nov 13, 2011

Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

46fa1309

bnx2x: simplify definition of RX_SGE_MASK_LEN and use it. · b3637827

Dmitry Kravkov authored Nov 13, 2011

Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b3637827

bnx2x: DCBX: use #define instead of magic · f9c058b6

Dmitry Kravkov authored Nov 13, 2011

Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f9c058b6

bnx2x: propagate DCBX negotiation · 00253a8c

Dmitry Kravkov authored Nov 13, 2011

We need propagate the DCBX results from PMF to other functions
on the same port, in order to properly update netdev structure
and allow following new ETS and PFC configurations.
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

00253a8c

bnx2x: separate FCoE and iSCSI license initialization. · b306f5ed

Dmitry Kravkov authored Nov 13, 2011

FCoE license info must be initialized at probe(), but
iSCSI at open().
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b306f5ed

bnx2x: remove unused variable · ad756594

Dmitry Kravkov authored Nov 13, 2011

Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ad756594

bnx2x: use rx_queue index for skb_record_rx_queue() · f233cafe

Dmitry Kravkov authored Nov 13, 2011

Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f233cafe

bnx2x: allow FCoE and DCB for 578xx · 62ac0dc9

Dmitry Kravkov authored Nov 13, 2011

Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

62ac0dc9

12 Nov, 2011 4 commits

be2net: stop issuing FW cmds if any cmd times out · 6589ade0

Sathya Perla authored Nov 10, 2011

A FW cmd timeout (with a sufficiently large timeout value in the
order of tens of seconds) indicates an unresponsive FW. In this state
issuing further cmds and waiting for a completion will only stall the process.
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6589ade0

be2net: don't log more than one error on detecting EEH/UE errors · 434b3648

Sathya Perla authored Nov 10, 2011

Currently we're spamming error messages each time a FW cmd call is made
while in EEH/UE error state. One log msg on error detection is enough.
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

434b3648

be2net: stop checking the UE registers after an EEH error · 72f02485

Sathya Perla authored Nov 10, 2011

Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

72f02485

be2net: init (vf)_if_handle/vf_pmac_id to handle failure scenarios · 30128031

Sathya Perla authored Nov 10, 2011

Initialize if_handle, vf_if_handle and vf_pmac_id with "-1" so that in
failure cases when be_clear() is called, we can skip over
if_destroy/pmac_del cmds if they have not been created.
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

30128031

09 Nov, 2011 2 commits

ipv4: PKTINFO doesnt need dst reference · d826eb14

Eric Dumazet authored Nov 09, 2011

Le lundi 07 novembre 2011 à 15:33 +0100, Eric Dumazet a écrit :

> At least, in recent kernels we dont change dst->refcnt in forwarding
> patch (usinf NOREF skb->dst)
>
> One particular point is the atomic_inc(dst->refcnt) we have to perform
> when queuing an UDP packet if socket asked PKTINFO stuff (for example a
> typical DNS server has to setup this option)
>
> I have one patch somewhere that stores the information in skb->cb[] and
> avoid the atomic_{inc|dec}(dst->refcnt).
>

OK I found it, I did some extra tests and believe its ready.

[PATCH net-next] ipv4: IP_PKTINFO doesnt need dst reference

When a socket uses IP_PKTINFO notifications, we currently force a dst
reference for each received skb. Reader has to access dst to get needed
information (rt_iif & rt_spec_dst) and must release dst reference.

We also forced a dst reference if skb was put in socket backlog, even
without IP_PKTINFO handling. This happens under stress/load.

We can instead store the needed information in skb->cb[], so that only
softirq handler really access dst, improving cache hit ratios.

This removes two atomic operations per packet, and false sharing as
well.

On a benchmark using a mono threaded receiver (doing only recvmsg()
calls), I can reach 720.000 pps instead of 570.000 pps.

IP_PKTINFO is typically used by DNS servers, and any multihomed aware
UDP application.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d826eb14

ipv4: reduce percpu needs for icmpmsg mibs · acb32ba3

Eric Dumazet authored Nov 08, 2011

Reading /proc/net/snmp on a machine with a lot of cpus is very expensive
(can be ~88000 us).

This is because ICMPMSG MIB uses 4096 bytes per cpu, and folding values
for all possible cpus can read 16 Mbytes of memory.

ICMP messages are not considered as fast path on a typical server, and
eventually few cpus handle them anyway. We can afford an atomic
operation instead of using percpu data.

This saves 4096 bytes per cpu and per network namespace.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

acb32ba3

08 Nov, 2011 19 commits

net: rename sk_clone to sk_clone_lock · e56c57d0

Eric Dumazet authored Nov 08, 2011

Make clear that sk_clone() and inet_csk_clone() return a locked socket.

Add _lock() prefix and kerneldoc.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e56c57d0

sch_choke: use skb_header_pointer() · 9ecd04bc

Eric Dumazet authored Nov 08, 2011

Remove the assumption that skb_get_rxhash() makes IP header and ports
linear, and use skb_header_pointer() instead in choke_match_flow()

This permits __skb_get_rxhash() to use skb_header_pointer() eventually.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9ecd04bc

ll_temac: Add support for phy_mii_ioctl · 8d8bdfe8

Ricardo Ribalda authored Nov 07, 2011

This patch enables the ioctl support for the driver. So userspace
programs like mii-tool can work.

Resend in merge window
Signed-off-by: Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8d8bdfe8

net: make ipv6 PKTINFO honour freebind · 2563fa59

Maciej Żenczykowski authored Nov 07, 2011

This just makes it possible to spoof source IPv6 address on a socket
without having to create and bind a new socket for every source IP
we wish to spoof.
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2563fa59

net: make ipv6 bind honour freebind · f74024d9

Maciej Żenczykowski authored Nov 07, 2011

This makes native ipv6 bind follow the precedent set by:
  - native ipv4 bind behaviour
  - dual stack ipv4-mapped ipv6 bind behaviour.

This does allow an unpriviledged process to spoof its source IPv6
address, just like it currently can spoof its source IPv4 address
(for example when using UDP).
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f74024d9

sweep the floors and convert some .get_drvinfo routines to strlcpy · 68aad78c

Rick Jones authored Nov 07, 2011

Per the mention made by Ben Hutchings that strlcpy is now the preferred
string copy routine for a .get_drvinfo routine, do a bit of floor
sweeping and convert some of the as-yet unconverted ethernet drivers to
it.
Signed-off-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

68aad78c

sctp: fasthandoff with ASCONF at server-node · 34d2d89f

Michio Honda authored Jun 17, 2011

Retransmit chunks to newly confirmed destination when ASCONF and
HEARTBEAT negotiation has success with a single-homed peer.
Signed-off-by: Michio Honda <micchie@sfc.wide.ad.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>

34d2d89f

sctp: fasthandoff with ASCONF at mobile-node · ddc4bbee

Michio Honda authored Jun 17, 2011

Fast retransmission after changing the last address
with ASCONF negotiation
Signed-off-by: Michio Honda <micchie@sfc.wide.ad.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>

ddc4bbee

net: better pcpu data alignment · 8ce120f1

Eric Dumazet authored Nov 04, 2011

Tunnels can force an alignment of their percpu data to reduce number of
cache lines used in fast path, or read in .ndo_get_stats()

percpu_alloc() is a very fine grained allocator, so any small hole will
be used anyway.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8ce120f1

ipv4: Fix inetpeer expire time information · 2bc8ca40

Steffen Klassert authored Oct 11, 2011

As we update the learned pmtu informations on demand, we might
report a nagative expiration time value to userspace if the
pmtu informations are already expired and we have not send a
packet to that inetpeer after expiration. With this patch we
send a expire time of null to userspace after expiration
until the next packet is send to that inetpeer.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2bc8ca40

net: min_pmtu default is 552 · 20db93c3

Eric Dumazet authored Nov 08, 2011

Small fix in Documentation, since min_pmtu is 512 + 20 + 20 = 552
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

20db93c3

tcp: Fix comments for Nagle algorithm · 6d67e9be

Feng King authored Nov 05, 2011

TCP_NODELAY is weaker than TCP_CORK, when TCP_CORK was set, small
segments will always pass Nagle test regardless of TCP_NODELAY option.
Signed-off-by: Feng King <kinwin2008@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6d67e9be

sunhme: Allow usage on SBI based SBus systems · 59caa561

oftedal authored Nov 07, 2011

To prevent the SBus driver for Sun Happy Meal cards from being loaded for
PCI cards utilizing the same chipset, a filter was added to the probe
function in commit 0b492fce.

The filter was implemented by checking the name of the parent node in
the OF tree. This patch extends this filter, so that the driver will
load on SBus systems that are based upon SBI SBus Bridges.
Signed-off-by: Kjetil Oftedal <oftedal@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

59caa561

l2tp: fix l2tp_udp_recv_core() · e50e705c

Eric Dumazet authored Nov 08, 2011

pskb_may_pull() can change skb->data, so we have to load ptr/optr at the
right place.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e50e705c

Fix incorrect usage of NET_IP_ALIGN · ea1649de

Nico Erfurth authored Nov 08, 2011

The driver used NET_IP_ALIGN to remove some additional padding inside of
the rx_fixup function. On many architectures NET_IP_ALIGN defaults to 2
which removed the correct amount of bytes.

On MCORE2-machines commit ea812ca1
introduces a change which sets NET_IP_ALIGN to 0 by default. Which
triggered the bug on these machines.

This fix introduces a new RXW_PADDING define and uses this instead of
NET_IP_ALIGN. The name was taken from the original SMSC7500 driver which
is provided by SMSC.
Signed-off-by: Nico Erfurth <ne@erfurth.eu>
Tested-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>

ea1649de

ipv6: drop packets when source address is multicast · c457338d

Brian Haley authored Nov 08, 2011

RFC 4291 Section 2.7 says Multicast addresses must not be used as source
addresses in IPv6 packets - drop them on input so we don't process the
packet further.
Signed-off-by: Brian Haley <brian.haley@hp.com>
Reported-and-Tested-by: Kumar Sanghvi <divinekumar@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c457338d

r8169: increase the delay parameter of pm_schedule_suspend · 10953db8

hayeswang authored Nov 07, 2011

The link down would occur when reseting PHY. And it would take about 2 ~ 5 seconds
from link down to link up. If the delay of pm_schedule_suspend is not long enough,
the device would enter runtime_suspend before link up. After link up, the device
would wake up and reset PHY again. Then, you would find the driver keep in a loop
of runtime_suspend and rumtime_resume.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Acked-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

10953db8

Linux 3.2-rc1 · 1ea6b8f4

Linus Torvalds authored Nov 07, 2011

.. with new name.  Because nothing says "really solid kernel release"
like naming it after an extinct animal that just happened to be in the
news lately.

1ea6b8f4

Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap · 075cb105

Linus Torvalds authored Nov 07, 2011

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap: (31 commits)
  ARM: OMAP: Fix export.h or module.h includes
  ARM: OMAP: omap_device: Include linux/export.h
  ARM: OMAP2: Fix H4 matrix keyboard warning
  ARM: OMAP1: Remove unused omap-alsa.h
  ARM: OMAP1: Fix warnings about enabling 32 KiHz timer
  ARM: OMAP2+: timer: Remove omap_device_pm_latency
  ARM: OMAP2+: clock data: Remove redundant timer clkdev
  ARM: OMAP: Devkit8000: Remove double omap_mux_init_gpio
  ARM: OMAP: usb: musb: OMAP: Delete unused function
  MAINTAINERS: Update linux-omap git repository
  ARM: OMAP: change get_context_loss_count ret value to int
  ARM: OMAP4: hsmmc: configure SDMMC1_DR0 properly
  ARM: OMAP4: hsmmc: Fix Pbias configuration on regulator OFF
  ARM: OMAP3: hwmod: fix variant registration and remove SmartReflex from common list
  ARM: OMAP: I2C: Fix omap_register_i2c_bus() return value on success
  ARM: OMAP: dmtimer: Include linux/module.h
  ARM: OMAP2+: l3-noc: Include linux/module.h
  ARM: OMAP2+: devices: Fixes for McPDM
  ARM: OMAP: Fix errors and warnings when building for one board
  ARM: OMAP3: PM: restrict erratum i443 handling to OMAP3430 only
  ...

075cb105

07 Nov, 2011 1 commit

VFS: we need to set LOOKUP_JUMPED on mountpoint crossing · a3fbbde7

Al Viro authored Nov 07, 2011

Mountpoint crossing is similar to following procfs symlinks - we do
not get ->d_revalidate() called for dentry we have arrived at, with
unpleasant consequences for NFS4.

Simple way to reproduce the problem in mainline:

    cat >/tmp/a.c <<'EOF'
    #include <unistd.h>
    #include <fcntl.h>
    #include <stdio.h>
    main()
    {
            struct flock fl = {.l_type = F_RDLCK, .l_whence = SEEK_SET, .l_len = 1};
            if (fcntl(0, F_SETLK, &fl))
                    perror("setlk");
    }
    EOF
    cc /tmp/a.c -o /tmp/test

then on nfs4:

    mount --bind file1 file2
    /tmp/test < file1		# ok
    /tmp/test < file2		# spews "setlk: No locks available"...

What happens is the missing call of ->d_revalidate() after mountpoint
crossing and that's where NFS4 would issue OPEN request to server.

The fix is simple - treat mountpoint crossing the same way we deal with
following procfs-style symlinks.  I.e.  set LOOKUP_JUMPED...

Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a3fbbde7