Commits · 5e1459ca13eddea516f8c56122cb3253e7b0e03e · Kirill Smelkov / linux

13 Mar, 2015 22 commits

Merge branch 'tcp_metrics_netns_debloat' · 5e1459ca

David S. Miller authored Mar 13, 2015

Eric W. Biederman says:

====================
tcp_metrics: Network namespace bloat reduction v3

This is a small pile of patches that convert tcp_metrics from using a
hash table per network namespace to using a single hash table for all
network namespaces.

This is broken up into several patches so that each small step along
the way could be carefully scrutinized as I wrote it, and equally so
that each small step can be reviewed.

There are several cleanups included in this series.  The addition of
panic calls during boot where we can not handle failure, and not trying
simplifies the code.  The removal of the return code from
tcp_metrics_flush_all.

The motivation for this change is that the tcp_metrics hash table at
128KiB is one of the largest components of a freshly allocated network
namespace.

I am resending the the previous version I sent has suffered bitrot, so I
have respun the patches so that they apply.  I believe I have addressed
all of the review concerns except optimal behavior on little machines
with 32-byte cache lines, which is beyond me as even the current code
has bad behavior in that case.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

5e1459ca

tcp_metrics: Use a single hash table for all network namespaces. · 098a697b

Eric W. Biederman authored Mar 13, 2015

Now that all of the operations are safe on a single hash table
accross network namespaces, allocate a single global hash table
and update the code to use it.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

098a697b

tcp_metrics: Rewrite tcp_metrics_flush_all · 04f721c6

Eric W. Biederman authored Mar 13, 2015

Rewrite tcp_metrics_flush_all so that it can cope with entries from
different network namespaces on it's hash chain.

This is based on the logic in tcp_metrics_nl_cmd_del for deleting
a selection of entries from a tcp metrics hash chain.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

04f721c6

tcp_metrics: Remove the unused return code from tcp_metrics_flush_all · 8a4bff71

Eric W. Biederman authored Mar 13, 2015

tcp_metrics_flush_all always returns 0. Remove the unnecessary return code.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8a4bff71

tcp_metrics: Add a field tcpm_net and verify it matches on lookup · 849e8a0c

Eric W. Biederman authored Mar 13, 2015

In preparation for using one tcp metrics hash table for all network
namespaces add a field tcpm_net to struct tcp_metrics_block, and
verify that field on all hash table lookups.

Make the field tcpm_net of type possible_net_t so it takes no space
when network namespaces are disabled.

Further add a function tm_net to read that field so we can be
efficient when network namespaces are disabled and concise
the rest of the time.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

849e8a0c

tcp_metrics: Mix the network namespace into the hash function. · 3e5da62d

Eric W. Biederman authored Mar 13, 2015

In preparation for using one hash table for all network namespaces
mix the network namespace into the hash value.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3e5da62d

tcp_metrics: panic when tcp_metrics_init fails. · 6493517e

Eric W. Biederman authored Mar 13, 2015

There is not a practical way to cleanup during boot so
just panic if there is a problem initializing tcp_metrics.

That will at least give us a clear place to start debugging
if something does go wrong.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6493517e

vxlan: Don't set s_addr in vxlan_create_sock · 719a11cd

Simon Horman authored Mar 13, 2015

In the case of AF_INET s_addr was set to INADDR_ANY (0) which which both
symmetric with the AF_INET6 case, where s_addr is not set, and unnecessary
as udp_conf is zeroed out earlier in the same function.

I suspect this change does not have any run-time effect due to compiler
optimisations. But it does make the code a little easier on the/my eyes.

Cc: Tom Herbert <therbert@google.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

719a11cd

mpls: In mpls_egress verify the packet length. · 76fecd82

Eric W. Biederman authored Mar 12, 2015

Reobert Shearman noticed that mpls_egress is failing to verify that
the bytes to be examined are in fact present in the packet before
mpls_egress reads those bytes.

As suggested by David Miller reduce this to a single pskb_may_pull
call so that we don't do unnecessary work in the fast path.
Reported-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

76fecd82

net/macb: Only adjust tx_clk on link change · 2c29b235

Jaeden Amero authored Mar 12, 2015

The PHY state machine (in drivers/net/phy/phy.c) will unconditionally
call phydev->adjust_link (macb_handle_link_change) when polling in the
PHY_CHANGELINK state. As currently written, macb always ends up
requesting a new tx_clk frequency in macb_handle_link_change. It is a
waste of time to request a new tx_clk frequency if the link state hasn't
changed, as the tx_clk will already be configured properly.

Let's only request a new tx_clk clock frequency when necessary.
Signed-off-by: Jaeden Amero <jaeden.amero@ni.com>
Cc: Josh Cartwright <joshc@ni.com>
Cc: Soren Brinkmann <soren.brinkmann@xilinx.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2c29b235

rhashtable: Fix read-side crash during rehash · 39361947

Herbert Xu authored Mar 13, 2015

This patch fixes a typo rhashtable_lookup_compare where we fail
to recompute the hash when looking up the new table.  This causes
elements to be missed and potentially a crash during a resize.
Reported-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

39361947

rhashtable: kill ht->shift atomic operations · a5b6846f

Daniel Borkmann authored Mar 12, 2015

Commit c0c09bfd ("rhashtable: avoid unnecessary wakeup for worker
queue") changed ht->shift to be atomic, which is actually unnecessary.

Instead of leaving the current shift in the core rhashtable structure,
it can be cached inside the individual bucket tables.

There, it will only be initialized once during a new table allocation
in the shrink/expansion slow path, and from then onward it stays immutable
for the rest of the bucket table liftime.

That allows shift to be non-atomic. The patch also moves hash_rnd
management into the table setup. The rhashtable structure now consumes
3 instead of 4 cachelines.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Ying Xue <ying.xue@windriver.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

a5b6846f

rhashtable: Fix reader/rehash race · 9497df88

Herbert Xu authored Mar 12, 2015

There is a potential race condition between readers and the rehasher.
In particular, the rehasher could have started a rehash while the
reader finishes a scan of the old table but fails to see the new
table pointer.

This patch closes this window by adding smp_wmb/smp_rmb.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

9497df88

Merge branch 'listener_refactor' · 5ff0d16a

David S. Miller authored Mar 12, 2015

Eric Dumazet says:

====================
inet: tcp listener refactoring, part 8

These patches prepare request socks being hashed into general ehash
table : We declare 3 aliases (ireq_state, ireq_refcnt, ireq_family)

Note that refcnt is not yet handled, this will be done later.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

5ff0d16a

inet: introduce ireq_family · 3f66b083

Eric Dumazet authored Mar 12, 2015

Before inserting request socks into general hash table,
fill their socket family.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3f66b083

inet: get_openreq4() & get_openreq6() do not need listener · d4f06873

Eric Dumazet authored Mar 12, 2015

ireq->ir_num contains local port, use it.

Also, get_openreq4() dumping listen_sk->refcnt makes litle sense.

inet_diag_fill_req() can also use ireq->ir_num
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d4f06873

inet: prepare sock_edemux() & sock_gen_put() for new SYN_RECV state · 41b822c5

Eric Dumazet authored Mar 12, 2015

sock_edemux() & sock_gen_put() should be ready to cope with request socks.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

41b822c5

net: add req_prot_cleanup() & req_prot_init() helpers · 0159dfd3

Eric Dumazet authored Mar 12, 2015

Make proto_register() & proto_unregister() a bit nicer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0159dfd3

inet: add rsk_refcnt/ireq_refcnt to request socks · 1e2e0117

Eric Dumazet authored Mar 12, 2015

When request socks will be in ehash, they'll need to be refcounted.

This patch adds rsk_refcnt/ireq_refcnt macros, and adds
reqsk_put() function, but nothing yet use them.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1e2e0117

inet: add ireq_state field to inet_request_sock · d34ac51b

Eric Dumazet authored Mar 12, 2015

We need to identify request sock when they'll be visible in
global ehash table.

ireq_state is an alias to req.__req_common.skc_state.

Its value is set to TCP_NEW_SYN_RECV
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d34ac51b

inet: add TCP_NEW_SYN_RECV state · 10feb428

Eric Dumazet authored Mar 12, 2015

TCP_SYN_RECV state is currently used by fast open sockets.

Initial TCP requests (the pseudo sockets created when a SYN is received)
are not yet associated to a state. They are attached to their parent,
and the parent is in TCP_LISTEN state.

This commit adds TCP_NEW_SYN_RECV state, so that we can convert
TCP stack to a different schem gradually.

This state is not exported to user space.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

10feb428

ipv6: add missing ireq_net & ir_cookie initializations · bd337c58

Eric Dumazet authored Mar 12, 2015

I forgot to update dccp_v6_conn_request() & cookie_v6_check().
They both need to set ireq->ireq_net and ireq->ir_cookie

Lets clear ireq->ir_cookie in inet_reqsk_alloc()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 33cf7c90 ("net: add real socket cookies")
Signed-off-by: David S. Miller <davem@davemloft.net>

bd337c58

12 Mar, 2015 18 commits

cls_bpf: do eBPF invocation under non-bh RCU lock variant for maps · 54720df1

Daniel Borkmann authored Mar 12, 2015

Currently, it is possible in cls_bpf to access eBPF maps only under
rcu_read_lock_bh() variants: while on ingress side, that is, handle_ing(),
the classifier would be called from __netif_receive_skb_core() under
rcu_read_lock(); on egress side, however, it's rcu_read_lock_bh() via
__dev_queue_xmit().

This rcu/rcu_bh mix doesn't work together with eBPF maps as they require
soley to be called under rcu_read_lock(). eBPF maps could also be shared
among various other eBPF programs (possibly even with other eBPF program
types, f.e. tracing) and user space processes, so any context is assumed.

Therefore, a possible fix for cls_bpf is to wrap/nest eBPF program
invocation under non-bh RCU lock variant.

Fixes: e2e9b654 ("cls_bpf: add initial eBPF support for programmable classifiers")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

54720df1

Merge branch 'fib_trie_table_merge_fixes' · 06741d05

David S. Miller authored Mar 12, 2015

Alexander Duyck says:

====================
fib_trie: Minor fixes for table merge

This patch set addresses two issues reported with the tables merged, the
first is a NULL pointer dereference, and the other is to remove a WARN_ON
and set the ordering for aliases from different tables with the same slen
values.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

06741d05

fib_trie: Provide a deterministic order for fib_alias w/ tables merged · 0b65bd97

Alexander Duyck authored Mar 12, 2015

This change makes it so that we should always have a deterministic ordering
for the main and local aliases within the merged table when two leaves
overlap.

So for example if we have a leaf with a key of 192.168.254.0. If we
previously added two aliases with a prefix length of 24 from both local and
main the first entry would be first and the second would be second. When I
was coding this I had added a WARN_ON should such a situation occur as I
wasn't sure how likely it would be. However this WARN_ON has been
triggered so this is something that should be addressed.

With this patch the ordering of the aliases is as follows. First they are
sorted on prefix length, then on their table ID, then tos, and finally
priority. This way what we end up doing is essentially interleaving the
two tables on what used to be leaf_info structure boundaries.

Fixes: 0ddcf43d ("ipv4: FIB Local/MAIN table collapse")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0b65bd97

fib_trie: Avoid NULL pointer if local table is not allocated · 3c9e9f73

Alexander Duyck authored Mar 12, 2015

The function fib_unmerge assumed the local table had already been
allocated.  If that is not the case however when custom rules are applied
then this can result in a NULL pointer dereference.

In order to prevent this we must check the value of the local table pointer
and if it is NULL simply return 0 as there is no local table to separate
from the main.

Fixes: 0ddcf43d ("ipv4: FIB Local/MAIN table collapse")
Reported-by: Madhu Challa <challa@noironetworks.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3c9e9f73

ebpf: verifier: check that call reg with ARG_ANYTHING is initialized · 80f1d68c

Daniel Borkmann authored Mar 12, 2015

I noticed that a helper function with argument type ARG_ANYTHING does
not need to have an initialized value (register).

This can worst case lead to unintented stack memory leakage in future
helper functions if they are not carefully designed, or unintended
application behaviour in case the application developer was not careful
enough to match a correct helper function signature in the API.

The underlying issue is that ARG_ANYTHING should actually be split
into two different semantics:

  1) ARG_DONTCARE for function arguments that the helper function
     does not care about (in other words: the default for unused
     function arguments), and

  2) ARG_ANYTHING that is an argument actually being used by a
     helper function and *guaranteed* to be an initialized register.

The current risk is low: ARG_ANYTHING is only used for the 'flags'
argument (r4) in bpf_map_update_elem() that internally does strict
checking.

Fixes: 17a52670 ("bpf: verifier (add verifier core)")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

80f1d68c

Merge branch 'possible_net_t' · 20453d88

David S. Miller authored Mar 12, 2015

Eric W. Biederman says:

====================
Introduce possible_net_t

The current usage of write_pnet and read_pnet is a little laborious and
error prone as you only notice if you failed to include them if are
compiling with network namespaces enabled.

possible_net_t remedies that by using a type that is 0 bytes when
network namespaces are disabled and can only be read and written to with
read_pnet and write_pnet.

Aka less work and safer for the same effect.

I kill hold_net and release_net first as are they are haven't been used
since 2008 and are noise at the points where write_pnet and read_pnet
are used.

I have folded in Eric Dumazets suggestions to improve the killing of
hold_net and release net.  And respon.  I had to respin anyway as
there was enough changes elsewhere in the tree the previous version
of these patches did not quite apply cleanly.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

20453d88

net: Introduce possible_net_t · 0c5c9fb5

Eric W. Biederman authored Mar 11, 2015

Having to say
> #ifdef CONFIG_NET_NS
> 	struct net *net;
> #endif

in structures is a little bit wordy and a little bit error prone.

Instead it is possible to say:
> typedef struct {
> #ifdef CONFIG_NET_NS
>       struct net *net;
> #endif
> } possible_net_t;

And then in a header say:

> 	possible_net_t net;

Which is cleaner and easier to use and easier to test, as the
possible_net_t is always there no matter what the compile options.

Further this allows read_pnet and write_pnet to be functions in all
cases which is better at catching typos.

This change adds possible_net_t, updates the definitions of read_pnet
and write_pnet, updates optional struct net * variables that
write_pnet uses on to have the type possible_net_t, and finally fixes
up the b0rked users of read_pnet and write_pnet.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0c5c9fb5

net: Kill hold_net release_net · efd7ef1c

Eric W. Biederman authored Mar 11, 2015

hold_net and release_net were an idea that turned out to be useless.
The code has been disabled since 2008.  Kill the code it is long past due.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

efd7ef1c

Merge branch 'rhashtable-cleanups' · 6c7005f6

David S. Miller authored Mar 12, 2015

Herbert Xu says:

====================
rhashtable hash cleanups

This is a rebase on top of the nested lock annotation fix.

Nothing to see here, just a bunch of simple clean-ups before
I move onto something more substantial (hopefully).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

6c7005f6

rhashtable: Remove obj_raw_hashfn · ec9f71c5

Herbert Xu authored Mar 12, 2015

Now that the only caller of obj_raw_hashfn is head_hashfn, we can
simply kill it and fold it into the latter.

This patch also moves the common shift from head_hashfn/key_hashfn
into rht_bucket_index.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

ec9f71c5

rhashtable: Remove key length argument to key_hashfn · cffaa9cb

Herbert Xu authored Mar 12, 2015

key_hashfn has only one caller and it doesn't really need to supply
the key length as an extra parameter.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

cffaa9cb

rhashtable: Use head_hashfn instead of obj_raw_hashfn · eca84933

Herbert Xu authored Mar 12, 2015

Now that we don't have cross-table hashes, we no longer need to
keep the entire hash value so all users of obj_raw_hashfn can
use head_hashfn instead.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

eca84933

rhashtable: Move masking back into key_hashfn · 8d2b1879

Herbert Xu authored Mar 12, 2015

This patch reverts commit c88455ce
("rhashtable: key_hashfn() must return full hash value") because
the only user of it always masks the hash value.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

8d2b1879

net/mlx5_core: don't export static symbol · 6b9f53bc

Julia Lawall authored Mar 11, 2015

The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@r@
type T;
identifier f;
@@

static T f (...) { ... }

@@
identifier r.f;
declarer name EXPORT_SYMBOL;
@@

-EXPORT_SYMBOL(f);
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

6b9f53bc

rhashtable: Add annotation to nested lock · 84ed82b7

Herbert Xu authored Mar 12, 2015

Commit aa34a6cb ("rhashtable:
Add arbitrary rehash function") killed the annotation on the
nested lock which leads to bitching from lockdep.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

84ed82b7

net: fix CONFIG_NET_NS=n compilation · d77c555d

Eric Dumazet authored Mar 11, 2015

I forgot to use write_pnet() in three locations.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 33cf7c90 ("net: add real socket cookies")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d77c555d

ipv6: expose RFC4191 route preference via rtnetlink · c78ba6d6

Lubomir Rintel authored Mar 11, 2015

This makes it possible to retain the route preference when RAs are handled in
userspace.
Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>

c78ba6d6

switchdev: correct spelling of notifier in comments · ac70c05b

Simon Horman authored Mar 12, 2015

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ac70c05b