1. 06 Dec, 2016 9 commits
  2. 04 Dec, 2016 31 commits
    • netfilter: conntrack: add nf_conntrack_default_on sysctl · 481fa373
      Florian Westphal authored
      This switch (default on) can be used to disable automatic registration
      of connection tracking functionality in newly created network
      namespaces.
      
      This means that when a net namespace goes down (or the tracker protocol
      module is unloaded) we *might* have to unregister the hooks.
      
      We can either add another per-netns variable that tells if
      the hooks got registered by default, or, alternatively, just call
      the protocol _put() function and have the callee deal with a possible
      'extra' put() operation that doesn't pair with a get() one.
      
      This uses the latter approach, i.e. a put() without a get has no effect.
      
      Conntrack is still enabled automatically regardless of the new sysctl
      setting if the new net namespace requires connection tracking, e.g. when
      NAT rules are created.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      481fa373
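The "put() without a get() has no effect" rule described in this commit can be modelled in a few lines. This is only a userspace sketch in Python, not kernel code: the class `ConntrackNetns` and the helpers `netns_create()` and `netns_destroy()` are hypothetical names chosen for illustration.

```python
class ConntrackNetns:
    """Toy model of per-namespace conntrack hook state (hypothetical)."""

    def __init__(self):
        self.users = 0
        self.hooks_registered = False

    def get(self):
        # The first user registers the hooks.
        self.users += 1
        if self.users == 1:
            self.hooks_registered = True

    def put(self):
        # An unpaired put() is tolerated and does nothing.
        if self.users == 0:
            return
        self.users -= 1
        if self.users == 0:
            self.hooks_registered = False


def netns_create(default_on=True):
    """Create a namespace; register hooks only if the switch is on."""
    ns = ConntrackNetns()
    if default_on:
        ns.get()
    return ns


def netns_destroy(ns):
    """Always call put(); it is a no-op if nothing was registered by default."""
    ns.put()
```

With `default_on=False` the teardown path still calls `put()`, but since no matching `get()` happened the counter stays at zero, so no extra per-namespace flag is needed to remember whether the hooks were registered by default.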
    • netfilter: conntrack: register hooks in netns when needed by ruleset · 0c66dc1e
      Florian Westphal authored
      This makes use of nf_ct_netns_get/put added in previous patch.
      We add get/put functions to nf_conntrack_l3proto structure, ipv4 and ipv6
      then implement use-count to track how many users (nft or xtables modules)
      have a dependency on ipv4 and/or ipv6 connection tracking functionality.
      
      When count reaches zero, the hooks are unregistered.
      
      This delays activation of connection tracking inside a namespace until
      a stateful firewall rule or NAT rule gets added.
      
      This patch breaks backwards compatibility in the sense that connection
      tracking won't be active anymore when the protocol tracker module is
      loaded.  This breaks e.g. setups that use ctnetlink for flow accounting
      and the like, without any '-m conntrack' packet filter rules.
      
      A followup patch restores the old behaviour and makes the new delayed
      scheme optional via sysctl.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      0c66dc1e
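The use-count scheme from this commit can be pictured with a small model. This is a userspace sketch in Python under simplifying assumptions (balanced get/put calls, a single tracker); `L3ProtoNetns` is a hypothetical name, not a kernel structure.

```python
class L3ProtoNetns:
    """Toy per-namespace state for one L3 tracker (e.g. ipv4 or ipv6)."""

    def __init__(self):
        self.users = 0
        self.hooks_registered = False

    def get(self):
        # A rule that depends on conntrack takes a reference;
        # the first reference registers the hooks.
        self.users += 1
        if self.users == 1:
            self.hooks_registered = True

    def put(self):
        # Dropping the last reference unregisters the hooks.
        self.users -= 1
        if self.users == 0:
            self.hooks_registered = False


ns = L3ProtoNetns()
ns.get()                            # e.g. a '-m conntrack' rule is added
ns.get()                            # e.g. a NAT rule is added
ns.put()                            # one rule removed, one user remains
still_active = ns.hooks_registered  # hooks stay registered
ns.put()                            # last rule removed, hooks go away
```

Until the first `get()`, the namespace never sees conntrack hooks at all, which is the delayed activation the commit message describes.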
    • netfilter: nf_tables: add conntrack dependencies for nat/masq/redir expressions · 20afd423
      Florian Westphal authored
      so that conntrack core will add the needed hooks in this namespace.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      20afd423
    • netfilter: nat: add dependencies on conntrack module · a357b3f8
      Florian Westphal authored
      MASQUERADE, S/DNAT and REDIRECT already call functions that depend on the
      conntrack module.
      
      However, since the conntrack hooks are now registered in a lazy fashion
      (i.e., only when needed) a symbol reference is not enough.
      
      Thus, when something is added to a nat table, make sure that it will see
      packets by calling nf_ct_netns_get() which will register the conntrack
      hooks in the current netns.
      
      An alternative would be to add these dependencies to the NAT table.
      
      However, that has problems when using non-modular builds -- we might
      register e.g. ipv6 conntrack before its initcall has run, leading to NULL
      deref crashes since its per-netns storage has not yet been allocated.
      
      Adding the dependency in the modules instead has the advantage that nat
      table also does not register its hooks until rules are added.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      a357b3f8
    • netfilter: add and use nf_ct_netns_get/put · ecb2421b
      Florian Westphal authored
      currently aliased to try_module_get/_put.
      Will be changed in next patch when we add functions to make use of ->net
      argument to store usercount per l3proto tracker.
      
      This is needed to avoid registering the conntrack hooks in all netns and
      later only enable connection tracking in those that need conntrack.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      ecb2421b
    • netfilter: conntrack: remove unused init_net hook · a379854d
      Florian Westphal authored
      since adf05168 ("netfilter: remove ip_conntrack* sysctl compat code")
      the only user (ipv4 tracker) sets this to an empty stub function.
      
      After this change nf_ct_l3proto_pernet_register() is also empty,
      but this will change in a followup patch to add conditional register
      of the hooks.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      a379854d
    • netfilter: conntrack: built-in support for UDPlite · 9b91c96c
      Davide Caratti authored
      CONFIG_NF_CT_PROTO_UDPLITE is no longer a tristate. When set to y,
      connection tracking support for the UDPlite protocol is built into
      nf_conntrack.ko.
      
      footprint test:
      $ ls -l net/netfilter/nf_conntrack{_proto_udplite,}.ko \
              net/ipv4/netfilter/nf_conntrack_ipv4.ko \
              net/ipv6/netfilter/nf_conntrack_ipv6.ko
      
      (builtin)|| udplite|  ipv4  |  ipv6  |nf_conntrack
      ---------++--------+--------+--------+--------------
      none     || 432538 | 828755 | 828676 | 6141434
      UDPlite  ||   -    | 829649 | 829362 | 6498204
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      9b91c96c
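One way to read the footprint table above is to compare the summed module sizes before and after the change. The sketch below does that arithmetic in Python using only the byte counts from the table; treating the sum of the listed files as the relevant footprint is an assumption made here, not something the commit states.

```python
# File sizes in bytes, copied from the UDPlite footprint table.
sizes_modular = {
    "nf_conntrack_proto_udplite": 432538,
    "nf_conntrack_ipv4": 828755,
    "nf_conntrack_ipv6": 828676,
    "nf_conntrack": 6141434,
}
sizes_builtin = {
    "nf_conntrack_ipv4": 829649,
    "nf_conntrack_ipv6": 829362,
    "nf_conntrack": 6498204,
}

total_modular = sum(sizes_modular.values())
total_builtin = sum(sizes_builtin.values())

# Positive value: the built-in variant is smaller overall.
saving = total_modular - total_builtin
print(f"modular: {total_modular}, built-in: {total_builtin}, saving: {saving}")
```

Under that reading, nf_conntrack.ko grows when UDPlite is built in, but the separate module disappears, so the combined size goes down.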
    • netfilter: conntrack: built-in support for SCTP · a85406af
      Davide Caratti authored
      CONFIG_NF_CT_PROTO_SCTP is no longer a tristate. When set to y, connection
      tracking support for the SCTP protocol is built into nf_conntrack.ko.
      
      footprint test:
      $ ls -l net/netfilter/nf_conntrack{_proto_sctp,}.ko \
              net/ipv4/netfilter/nf_conntrack_ipv4.ko \
              net/ipv6/netfilter/nf_conntrack_ipv6.ko
      
      (builtin)||  sctp  |  ipv4  |  ipv6  | nf_conntrack
      ---------++--------+--------+--------+--------------
      none     || 498243 | 828755 | 828676 | 6141434
      SCTP     ||   -    | 829254 | 829175 | 6547872
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      a85406af
    • netfilter: conntrack: built-in support for DCCP · c51d3901
      Davide Caratti authored
      CONFIG_NF_CT_PROTO_DCCP is no longer a tristate. When set to y, connection
      tracking support for the DCCP protocol is built into nf_conntrack.ko.
      
      footprint test:
      $ ls -l net/netfilter/nf_conntrack{_proto_dccp,}.ko \
              net/ipv4/netfilter/nf_conntrack_ipv4.ko \
              net/ipv6/netfilter/nf_conntrack_ipv6.ko
      
      (builtin)||  dccp  |  ipv4  |  ipv6  | nf_conntrack
      ---------++--------+--------+--------+--------------
      none     || 469140 | 828755 | 828676 | 6141434
      DCCP     ||   -    | 830566 | 829935 | 6533526
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      c51d3901
    • netfilter: nf_conntrack_tuple_common.h: fix #include · 3fefeb88
      Davide Caratti authored
      To allow usage of enum ip_conntrack_dir in include/net/netns/conntrack.h,
      this patch encloses #include <linux/netfilter.h> in a #ifndef __KERNEL__
      directive, so that compiler errors caused by unwanted inclusion of
      include/linux/netfilter.h are avoided.
      In addition, an #include <linux/netfilter/nf_conntrack_common.h> line has
      been added so that the CTINFO2DIR macro resolves correctly.
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Acked-by: Mikko Rapeli <mikko.rapeli@iki.fi>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      3fefeb88
    • Merge tag 'ipvs-for-v4.10' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next · f6b3ef5e
      Pablo Neira Ayuso authored
      Simon Horman says:
      
      ====================
      IPVS Updates for v4.10
      
      please consider these enhancements to the IPVS for v4.10.
      
      * Decrement the IP ttl in all the modes in order to prevent infinite
        route loops. Thanks to Dwip Banerjee.
      * Use IS_ERR_OR_NULL macro. Clean-up from Gao Feng.
      ====================
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      f6b3ef5e
    • netfilter: nfnetlink_log: add "nf-logger-5-1" module alias name · a7647080
      Liping Zhang authored
      So we can autoload nfnetlink_log.ko when the user adds an nft log
      group X rule in the netdev family.
      Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      a7647080
    • netfilter: nf_log: do not assume ethernet header in netdev family · 673ab46f
      Liping Zhang authored
      In the netdev family, we also handle non-Ethernet packets, so using
      eth_hdr(skb)->h_proto is incorrect.
      
      Meanwhile, packets can be sent via socket(AF_PACKET...), so
      skb->protocol is not always set in the bridge family.
      
      Add an extra parameter into nf_log_l2packet to solve this issue.
      
      Fixes: 1fddf4ba ("netfilter: nf_log: add packet logging for netdev family")
      Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      673ab46f
    • netfilter: built-in NAT support for UDPlite · b8ad652f
      Davide Caratti authored
      CONFIG_NF_NAT_PROTO_UDPLITE is no longer a tristate. When set to y, NAT
      support for the UDPlite protocol is built into nf_nat.ko.
      
      footprint test:
      
      (nf_nat_proto_)           |udplite || nf_nat
      --------------------------+--------++--------
      no builtin                | 408048 || 2241312
      UDPLITE builtin           |   -    || 2577256
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      b8ad652f
    • netfilter: built-in NAT support for SCTP · 7a2dd28c
      Davide Caratti authored
      CONFIG_NF_NAT_PROTO_SCTP is no longer a tristate. When set to y, NAT
      support for the SCTP protocol is built into nf_nat.ko.
      
      footprint test:
      
      (nf_nat_proto_)           | sctp   || nf_nat
      --------------------------+--------++--------
      no builtin                | 428344 || 2241312
      SCTP builtin              |   -    || 2597032
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      7a2dd28c
    • netfilter: built-in NAT support for DCCP · 0c4e966e
      Davide Caratti authored
      CONFIG_NF_NAT_PROTO_DCCP is no longer a tristate. When set to y, NAT
      support for the DCCP protocol is built into nf_nat.ko.
      
      footprint test:
      
      (nf_nat_proto_)           | dccp   || nf_nat
      --------------------------+--------++--------
      no builtin                | 409800 || 2241312
      DCCP builtin              |   -    || 2578968
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      0c4e966e
    • netfilter: update Arturo Borrero Gonzalez email address · cd727514
      Arturo Borrero Gonzalez authored
      The email address has changed, let's update the copyright statements.
      Signed-off-by: Arturo Borrero Gonzalez <arturo@debian.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      cd727514
    • ipv6 addrconf: Implemented enhanced DAD (RFC7527) · adc176c5
      Erik Nordmark authored
      Implemented RFC7527 Enhanced DAD.
      IPv6 duplicate address detection can fail if there is some temporary
      loopback of Ethernet frames. RFC7527 solves this by including a random
      nonce in the NS messages used for DAD, and if an NS is received with the
      same nonce it is assumed to be a looped back DAD probe and is ignored.
      RFC7527 is enabled by default. It can be disabled by setting both of
      conf/{all,interface}/enhanced_dad to zero.
      Signed-off-by: Erik Nordmark <nordmark@arista.com>
      Signed-off-by: Bob Gilligan <gilligan@arista.com>
      Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      adc176c5
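The nonce check described in this commit can be sketched as follows. This is a simplified Python model, not the kernel implementation: the class `DadState`, the dictionary layout of an NS message, and the 6-byte nonce length are all assumptions made for illustration.

```python
import secrets

NONCE_BYTES = 6  # assumed nonce length for this sketch


class DadState:
    """Toy per-address DAD state holding the nonce of our own probes."""

    def __init__(self):
        self.nonce = secrets.token_bytes(NONCE_BYTES)

    def build_probe(self, target):
        # The NS used for DAD carries our random nonce.
        return {"target": target, "nonce": self.nonce}

    def is_duplicate(self, ns, target):
        """Return True if a received NS signals a real address conflict."""
        if ns["target"] != target:
            return False
        if ns.get("nonce") == self.nonce:
            # Same nonce: assume our own probe was looped back and ignore it.
            return False
        return True
```

An NS for the same target with a different nonce, or with no nonce at all, is still treated as a conflict in this model, so only looped-back copies of the node's own probe are filtered out.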
    • Merge branch 'mv88e6390-batch-three' · ce84c7c6
      David S. Miller authored
      Andrew Lunn says:
      
      ====================
      mv88e6390 batch 3
      
      More patches to support the MV88e6390. This is mostly refactoring
      existing code and adding implementations for the mv88e6390.  This
      patchset sets which reserved frames are sent to the CPU and the size
      of jumbo frames that will be accepted, turns off egress rate limiting,
      and configures pause frames.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ce84c7c6
    • net: dsa: mv88e6xxx: Implement mv88e6390 pause control · 3ce0e65e
      Andrew Lunn authored
      The mv88e6390 has a number of flow control registers accessed via the
      Flow Control register. Use these to set the pause control.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3ce0e65e
    • net: dsa: mv88e6xxx: Refactor pause configuration · b35d322a
      Andrew Lunn authored
      The mv88e6390 has a different mechanism for configuring pause.
      Refactor the code into an ops function, and for the moment, don't add
      any mv88e6390 code yet.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b35d322a
    • net: dsa: mv88e6xxx: Refactor egress rate limiting · ef70b111
      Andrew Lunn authored
      There are two different rate limiting configurations, depending on the
      switch generation. Refactor this into ops.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ef70b111
    • net: dsa: mv88e6xxx: Refactor setting of jumbo frames · 5f436666
      Andrew Lunn authored
      Some switches support jumbo frames. Refactor this code into operations
      in the ops structure.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5f436666
    • net: dsa: mv88e6xxx: Reserved Management frames to CPU · 6e55f698
      Andrew Lunn authored
      Older devices have a couple of registers in global2. The mv88e6390
      family has a single register in global1 behind which hides similar
      configuration. Implement an op for this.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6e55f698
    • Merge branch 'mv88e6390-batch-two' · 7a6c5cb9
      David S. Miller authored
      Andrew Lunn says:
      
      ====================
      MV88E6390 batch two
      
      This is the second batch of patches adding support for the
      MV88e6390. They are not sufficient to make it work properly.
      
      The mv88e6390 has a much expanded set of priority maps. Refactor the
      existing code, and implement basic support for the new device.
      
      Similarly, the monitor control register has been reworked.
      
      The mv88e6390 has something odd in its EDSA tagging implementation,
      which means it is not possible to use it. So we need to use DSA
      tagging. This is the first device with EDSA support where we need to
      use DSA, and the code does not support this. So two patches refactor
      the existing code. The two different register definitions are
      separated out, and using DSA on an EDSA capable device is added.
      
      v2:
      Add port prefix
      Add helper function for 6390
      Add _IEEE_ into #defines
      Split monitor_ctrl into a number of separate ops.
      Remove 6390 code which is management, used in a later patch
      s/EGREES/EGRESS/.
      Broke up setup_port_dsa() and set_port_dsa() into a number of ops
      
      v3:
      Verify mandatory ops for port setup
      Don't set ether type for DSA port.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7a6c5cb9
    • net: dsa: mv88e6xxx: Refactor CPU and DSA port setup · 56995cbc
      Andrew Lunn authored
      Older chips only support DSA tagging. Newer chips have both DSA and
      EDSA tagging. Refactor the code by adding port functions for setting the
      frame mode, the egress mode, and whether to forward unknown frames.
      
      This results in the helper mv88e6xxx_6065_family() becoming unused, so
      remove it.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      v3:
      Verify mandatory ops for port setup
      Don't set ether type for DSA port.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      56995cbc
    • net: dsa: mv88e6xxx: Move the tagging protocol into info · 443d5a1b
      Andrew Lunn authored
      Older chips support a single tagging protocol, DSA. New chips support
      both DSA and EDSA, an enhanced version. Having both as an option
      changes the register layouts. Up until now, it has been assumed that
      if EDSA is supported, it will be used. Hence the register layout has
      been determined by which protocol should be used. However, the mv88e6390
      has a different implementation of EDSA, which requires that we use
      DSA tagging. Hence separate the selection of the protocol from the
      register layout.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      443d5a1b
    • net: dsa: mv88e6xxx: Monitor and Management tables · 33641994
      Andrew Lunn authored
      The mv88e6390 changes the monitor control register into the Monitor
      and Management control, which is an indirection register to various
      registers.
      
      Add ops to global1 to set the CPU port and the ingress/egress port
      for both register layouts.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      33641994
    • net: dsa: mv88e6xxx: Implement mv88e6390 tag remap · ef0a7318
      Andrew Lunn authored
      The mv88e6390 does not have the two registers to set the frame
      priority map. Instead it has an indirection register for setting a
      number of different priority maps. Refactor the old code into a
      function, implement the mv88e6390 version, and use an op to call the
      right one.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ef0a7318
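The "use an op to call the right one" pattern from this commit is a per-chip function table. The Python sketch below only illustrates that dispatch idea: the function names, the `CHIP_OPS` table, the register keys, and the identity priority mapping are all hypothetical and do not reflect the real driver or hardware values.

```python
def tag_remap_legacy(regs):
    """Older chips: two dedicated registers hold the priority map."""
    regs["pri_map_lo"] = [0, 1, 2, 3]   # illustrative values only
    regs["pri_map_hi"] = [4, 5, 6, 7]


def tag_remap_6390(regs):
    """mv88e6390 style: one indirection register, written once per entry."""
    writes = regs.setdefault("indirect_writes", [])
    for pri in range(8):
        writes.append((pri, pri))       # (table index, mapped priority)


# One ops table per chip family; the setup code never checks the chip type.
CHIP_OPS = {
    "legacy": {"tag_remap": tag_remap_legacy},
    "mv88e6390": {"tag_remap": tag_remap_6390},
}


def setup_tag_remap(chip, regs):
    """Look up the chip's op and call it on the given register model."""
    CHIP_OPS[chip]["tag_remap"](regs)
```

Adding support for another register layout then means adding one function and one table entry, while the common setup path stays unchanged.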
    • Merge branch 'fib-notifier-event-replay' · 69248719
      David S. Miller authored
      Jiri Pirko says:
      
      ====================
      ipv4: fib: Replay events when registering FIB notifier
      
      Ido says:
      
      In kernel 4.9 the switchdev-specific FIB offload mechanism was replaced
      by a new FIB notification chain to which modules could register in order
      to be notified about the addition and deletion of FIB entries. The
      motivation for this change was that switchdev drivers need to be able to
      reflect the entire FIB table and not only FIBs configured on top of the
      port netdevs themselves. This is useful in case of in-band management.
      
      The fundamental problem with this approach is that upon registration
      listeners lose all the information previously sent in the chain and
      thus have an incomplete view of the FIB tables, which can result in
      packet loss. This patchset fixes that by dumping the FIB tables and
      replaying notifications previously sent in the chain for the registered
      notification block.
      
      The entire dump process is done under RCU and thus the FIB notification
      chain is converted to be atomic. The listeners are modified accordingly.
      This is done in the first eight patches.
      
      The ninth patch adds a change sequence counter to ensure the integrity
      of the FIB dump. The last patch adds the dump itself to the FIB chain
      registration function and modifies existing listeners to pass a callback
      to be executed in case dump was inconsistent.
      
      ---
      v3->v4:
      - Register the notification block after the dump and protect it using
        the change sequence counter (Hannes Frederic Sowa).
      - Since we now integrate the dump into the registration function, drop
        the sysctl to set maximum number of retries and instead set it to a
        fixed number. Let's see if it's really a problem before adding something
        we can never remove.
      - For the same reason, dump FIB tables for all net namespaces.
      - Add a comment regarding guarantees provided by mutex semantics.
      
      v2->v3:
      - Add sysctl to set the number of FIB dump retries (Hannes Frederic Sowa).
      - Read the sequence counter under RTNL to ensure synchronization
        between the dump process and other processes changing the routing
        tables (Hannes Frederic Sowa).
      - Pass a callback to the dump function to be executed prior to a retry.
      - Limit the dump to a single net namespace.
      
      v1->v2:
      - Add a sequence counter to ensure the integrity of the FIB dump
        (David S. Miller, Hannes Frederic Sowa).
      - Protect notifications from re-ordering in listeners by using an
        ordered workqueue (Hannes Frederic Sowa).
      - Introduce fib_info_hold() (Jiri Pirko).
      - Relieve rocker from the need to invoke the FIB dump by registering
        to the FIB notification chain prior to ports creation.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      69248719
    • ipv4: fib: Replay events when registering FIB notifier · c3852ef7
      Ido Schimmel authored
      Commit b90eb754 ("fib: introduce FIB notification infrastructure")
      introduced a new notification chain to notify listeners (e.g., switchdev
      drivers) about addition and deletion of routes.
      
      However, upon registration to the chain the FIB tables can already be
      populated, which means potential listeners will have an incomplete view
      of the tables.
      
      Solve that by dumping the FIB tables and replaying the events to the
      passed notification block. The dump itself is done using RCU in order
      not to starve consumers that need RTNL to make progress.
      
      The integrity of the dump is ensured by reading the FIB change sequence
      counter before and after the dump under RTNL. This allows us to avoid
      the problematic situation in which the dumping process sends an ENTRY_ADD
      notification following an ENTRY_DEL generated by another process holding
      RTNL.
      
      Callers of the registration function may pass a callback that is
      executed in case the dump was inconsistent with current FIB tables.
      
      The number of retries until a consistent dump is achieved is set to a
      fixed number to prevent callers from looping for long periods of time.
      In case current limit proves to be problematic in the future, it can be
      easily converted to be configurable using a sysctl.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c3852ef7
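The dump-and-retry logic described in this commit can be modelled roughly as below. This is a Python sketch under stated assumptions, not the kernel code: `FibTable`, `register_notifier()`, the `MAX_RETRIES` value, and the event tuples are hypothetical, and locking (RCU, RTNL) is not modelled at all.

```python
MAX_RETRIES = 5  # hypothetical fixed limit


class FibTable:
    """Toy FIB with a change sequence counter bumped on every update."""

    def __init__(self):
        self.routes = []
        self.seq = 0
        self.listeners = []

    def add(self, route):
        self.routes.append(route)
        self.seq += 1

    def delete(self, route):
        self.routes.remove(route)
        self.seq += 1


def register_notifier(fib, listener, on_inconsistent=None):
    """Replay existing routes to listener, retrying if the FIB changed."""
    for _ in range(MAX_RETRIES):
        seq_before = fib.seq
        for route in list(fib.routes):
            listener("ENTRY_ADD", route)
        if fib.seq == seq_before:
            # Dump was consistent: the listener now joins the chain.
            fib.listeners.append(listener)
            return True
        # The table changed during the dump; let the caller discard state.
        if on_inconsistent is not None:
            on_inconsistent()
    return False
```

The callback gives the listener a chance to flush whatever it accumulated from an inconsistent dump before the next attempt, and the fixed retry limit keeps a busy routing table from stalling registration indefinitely.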