1. 15 May, 2015 3 commits
    • Daniel Borkmann's avatar
      test_bpf: add tests related to BPF_MAXINSNS · a4afd37b
      Daniel Borkmann authored
      A couple of torture test cases related to the bug fixed in 0b59d880
      ("ARM: net: delegate filter to kernel interpreter when imm_offset()
      return value can't fit into 12bits.").
      
      I've added a helper to allocate and fill the insn space. Output on
      x86_64 from my laptop:
      
      test_bpf: #233 BPF_MAXINSNS: Maximum possible literals jited:0 7 PASS
      test_bpf: #234 BPF_MAXINSNS: Single literal jited:0 8 PASS
      test_bpf: #235 BPF_MAXINSNS: Run/add until end jited:0 11553 PASS
      test_bpf: #236 BPF_MAXINSNS: Too many instructions PASS
      test_bpf: #237 BPF_MAXINSNS: Very long jump jited:0 9 PASS
      test_bpf: #238 BPF_MAXINSNS: Ctx heavy transformations jited:0 20329 20398 PASS
      test_bpf: #239 BPF_MAXINSNS: Call heavy transformations jited:0 32178 32475 PASS
      test_bpf: #240 BPF_MAXINSNS: Jump heavy test jited:0 10518 PASS
      
      test_bpf: #233 BPF_MAXINSNS: Maximum possible literals jited:1 4 PASS
      test_bpf: #234 BPF_MAXINSNS: Single literal jited:1 4 PASS
      test_bpf: #235 BPF_MAXINSNS: Run/add until end jited:1 1625 PASS
      test_bpf: #236 BPF_MAXINSNS: Too many instructions PASS
      test_bpf: #237 BPF_MAXINSNS: Very long jump jited:1 8 PASS
      test_bpf: #238 BPF_MAXINSNS: Ctx heavy transformations jited:1 3301 3174 PASS
      test_bpf: #239 BPF_MAXINSNS: Call heavy transformations jited:1 24107 23491 PASS
      test_bpf: #240 BPF_MAXINSNS: Jump heavy test jited:1 8651 PASS
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Nicolas Schichan <nschichan@freebox.fr>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a4afd37b
    • Eric Dumazet's avatar
      tcp: syncookies: extend validity range · 264ea103
      Eric Dumazet authored
      Now that we allow storing more request socks per listener, we might
      hit syncookie mode less often and hit the following bug in our stack:
      
      When we send a burst of syncookies, then exit this mode,
      tcp_synq_no_recent_overflow() can return false if the ACK packets
      from clients arrive three seconds after the end of the syncookie
      episode.
      
      This requirement is far too strict and conflicts with the rest of the
      syncookie code, which allows an ACK to be aged up to 2 minutes.
      
      Perfectly valid ACK packets are dropped just because clients might be
      in a crowded wifi environment or on another planet.
      
      So let's fix this, and also change tcp_synq_overflow() to not
      dirty a cache line for every syncookie we send, as we are under attack.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Florian Westphal <fw@strlen.de>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      264ea103
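The two behaviours described in this commit (a validity window of about two minutes, and not writing the overflow timestamp for every syncookie) can be sketched in userspace C. This is only an illustration under stated assumptions: the struct, the helper names, the `HZ` value and the once-per-second write throttle are hypothetical choices for the sketch, not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>

#define HZ 1000UL                       /* assumed tick rate for this sketch */
#define SYNCOOKIE_VALID (2 * 60 * HZ)   /* two minutes, as stated in the commit text */

/* Hypothetical stand-in for the per-listener state. */
struct listener {
    unsigned long last_overflow;        /* tick of the last recorded overflow */
};

/* Wrap-safe "a is later than b" for a free-running tick counter. */
static bool tick_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* Record an overflow, but store at most once per second (an assumed
 * throttle) so a flood of syncookies does not dirty the cache line
 * on every packet. */
static void note_overflow(struct listener *l, unsigned long now)
{
    if (tick_after(now, l->last_overflow + HZ))
        l->last_overflow = now;
}

/* True while a cookie ACK may still be accepted, i.e. an overflow was
 * recorded within the validity window. */
static bool recent_overflow(const struct listener *l, unsigned long now)
{
    return !tick_after(now, l->last_overflow + SYNCOOKIE_VALID);
}
```

With these definitions an ACK arriving three seconds after the last recorded overflow is still inside the window, while one arriving more than two minutes later is not.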
    • Alexander Duyck's avatar
      ip_tunnel: Report Rx dropped in ip_tunnel_get_stats64 · c24a5964
      Alexander Duyck authored
      The rx_dropped stat wasn't being reported when ip_tunnel_get_stats64 was
      called.  This led to some confusing results while debugging, as I was
      seeing rx_errors increment but no other counter that pointed me toward
      the type of error being seen.
      
      This change corrects that by using netdev_stats_to_stats64 to copy all
      available dev stats instead of just the few that were hand picked.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c24a5964
  2. 14 May, 2015 28 commits
    • Willem de Bruijn's avatar
      packet: fix warnings in rollover lock contention · 54d7c01d
      Willem de Bruijn authored
      Avoid two xchg calls whose return values were unused, causing a
      warning on some architectures.
      
      The relevant variable is a hint and read without mutual exclusion.
      This fix makes all writers hold the receive_queue lock.
      Suggested-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      54d7c01d
    • françois romieu's avatar
      net: batch of last_rx update avoidance in ethernet drivers. · 4ffd3c73
      françois romieu authored
      None of those drivers uses last_rx for its own needs.
      
      See 4dc89133 ("net: add a comment on
      netdev->last_rx") for reference.
      Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Zhangfei Gao <zhangfei.gao@linaro.org>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Wingman Kwok <w-kwok2@ti.com>
      Cc: Murali Karicheri <m-karicheri2@ti.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4ffd3c73
    • David S. Miller's avatar
      Merge branch 'phy_turn_around' · 7852dada
      David S. Miller authored
      Florian Fainelli says:
      
      ====================
      net: phy: broken turn-around support
      
      This is an attempt at solving the broken turn-around problem in a way that
      is not specific to the mdio-gpio driver, since it affects different kinds of
      platforms.
      
      We cannot make that localized to PHY device drivers because probing the PHY
      device which has a broken turn-around can fail as early as in get_phy_id(),
      therefore we need a bit of help from Device Tree/platform_data.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7852dada
    • Florian Fainelli's avatar
      net: phy: mdio-gpio: Handle phy_ignore_ta_mask · ea48b2b8
      Florian Fainelli authored
      Update mdiobb_read() to read whether the PHY has a broken turn-around,
      and if it does, ignore it to make the read succeed.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ea48b2b8
    • Florian Fainelli's avatar
      of: mdio: Add a "broken-turn-around" property · ab6016e0
      Florian Fainelli authored
      Some Ethernet PHY devices/switches may not properly release the MDIO bus
      during turn-around time, and fail to drive it low, which can be seen by
      some controllers as a read failure, while the data clocked in is still
      correct.
      
      Add a boolean property "broken-turn-around" which is parsed by the
      generic MDIO bus probing code and will set the corresponding bit in the
      MDIO bus phy_ignore_ta_mask bitmask for MDIO bus drivers to utilize that
      information.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ab6016e0
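The per-PHY bitmask described in this series can be illustrated with a small C sketch. It assumes one bit per MDIO address (0 to 31) and models the turn-around check as a single boolean; the struct and function names are hypothetical and do not reproduce the kernel's mdio-bitbang code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PHY_MAX_ADDR 32   /* an MDIO bus addresses at most 32 devices */

/* Hypothetical stand-in for the bus structure carrying the mask. */
struct mdio_bus_sketch {
    uint32_t phy_ignore_ta_mask;   /* bit N set: PHY N has a broken turn-around */
};

/* What probing could do when it finds "broken-turn-around" on a PHY node. */
static void mark_broken_ta(struct mdio_bus_sketch *bus, unsigned int addr)
{
    assert(addr < PHY_MAX_ADDR);
    bus->phy_ignore_ta_mask |= UINT32_C(1) << addr;
}

/* Decide whether a read should be treated as valid. ta_bit_high means the
 * PHY failed to drive the line low during turn-around; that is normally a
 * read failure, unless the PHY is flagged in the mask. */
static bool read_is_valid(const struct mdio_bus_sketch *bus,
                          unsigned int addr, bool ta_bit_high)
{
    assert(addr < PHY_MAX_ADDR);
    if (!ta_bit_high)
        return true;
    return (bus->phy_ignore_ta_mask >> addr) & 1u;
}
```

Keeping the quirk in a bus-level mask, rather than in a PHY driver, matches the reasoning in the cover letter: the failure can occur before any PHY driver is bound.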
    • Florian Fainelli's avatar
      net: phy: Add phy_ignore_ta_mask to account for broken turn-around · 922f2dd1
      Florian Fainelli authored
      Some PHY devices/switches will not release the turn-around line as they
      should at the end of an MDIO transaction. To help with such
      situations, allow MDIO bus drivers to be made aware of such
      restrictions.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      922f2dd1
    • Ying Xue's avatar
      tipc: use sock_create_kern interface to create kernel socket · fa787ae0
      Ying Xue authored
      After commit eeb1bd5c ("net: Add a struct net parameter to
      sock_create_kern"), we should use sock_create_kern() to create kernel
      socket as the interface doesn't reference count struct net any more.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fa787ae0
    • Brian Haley's avatar
      cls_flower: Fix compile error · dd3aa3b5
      Brian Haley authored
      Fix compile error in net/sched/cls_flower.c
      
          net/sched/cls_flower.c: In function ‘fl_set_key’:
          net/sched/cls_flower.c:240:3: error: implicit declaration of
           function ‘tcf_change_indev’ [-Werror=implicit-function-declaration]
             err = tcf_change_indev(net, tb[TCA_FLOWER_INDEV]);
      
      Introduced in 77b9900e
      
      Fixes: 77b9900e ("tc: introduce Flower classifier")
      Signed-off-by: Brian Haley <brian.haley@hp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dd3aa3b5
    • David S. Miller's avatar
      Merge branch 'tipc-next' · b55b10be
      David S. Miller authored
      Jon Maloy says:
      
      ====================
      tipc: some link layer improvements
      
      We continue eliminating redundant complexity at the link layer, and
      add a couple of improvements to the packet sending functionality.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b55b10be
    • Jon Paul Maloy's avatar
      tipc: add packet sequence number at instant of transmission · dd3f9e70
      Jon Paul Maloy authored
      Currently, the packet sequence number is updated and added to each
      packet at the moment a packet is added to the link backlog queue.
      This is wasteful, since it forces the code to traverse the send
      packet list packet by packet when adding them to the backlog queue.
      It would be better to just splice the whole packet list into the
      backlog queue when that is the right action to do.
      
      In this commit, we make this change. Also, since the sequence numbers
      can no longer be assigned to the packets at the moment they are added
      to the backlog queue, we instead calculate and add them at the moment
      of transmission, when the backlog queue has to be traversed anyway.
      We do this in the function tipc_link_push_packet().
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dd3f9e70
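The idea of splicing a whole packet list in one step and stamping sequence numbers only at transmission time can be sketched as follows. The queue type and helper names are hypothetical, and a plain singly linked list stands in for the kernel's socket buffer queues.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet and queue types for this sketch. */
struct pkt {
    struct pkt *next;
    uint16_t seqno;
};

struct pktq {
    struct pkt *head;
    struct pkt *tail;
};

/* Append all of src to dst in O(1), without visiting each packet. */
static void q_splice_tail(struct pktq *dst, struct pktq *src)
{
    assert(dst && src);
    if (!src->head)
        return;
    if (dst->tail)
        dst->tail->next = src->head;
    else
        dst->head = src->head;
    dst->tail = src->tail;
    src->head = src->tail = NULL;
}

/* Take the next packet off the backlog and assign its sequence number
 * now, at the moment of transmission. snd_nxt wraps at 16 bits. */
static struct pkt *push_packet(struct pktq *backlog, uint16_t *snd_nxt)
{
    struct pkt *p = backlog->head;

    if (!p)
        return NULL;
    backlog->head = p->next;
    if (!backlog->head)
        backlog->tail = NULL;
    p->next = NULL;
    p->seqno = (*snd_nxt)++;
    return p;
}
```

The per-packet work moves to the transmit loop, which has to walk the backlog anyway, so enqueueing no longer costs one pass over the list.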
    • Jon Paul Maloy's avatar
      tipc: improve link congestion algorithm · f21e897e
      Jon Paul Maloy authored
      The link congestion algorithm used until now has two problems.
      
      - It is too generous towards lower-level messages in situations of high
        load by giving "absolute" bandwidth guarantees to the different
        priority levels. LOW traffic is guaranteed 10%, MEDIUM is guaranteed
        20%, HIGH is guaranteed 30%, and CRITICAL is guaranteed 40% of the
        available bandwidth. But, in the absence of higher level traffic, the
        ratio between two distinct levels becomes unreasonable. E.g. if there
        is only LOW and MEDIUM traffic on a system, the former is guaranteed
        1/3 of the bandwidth, and the latter 2/3. This again means that if
        there is e.g. one LOW user and 10 MEDIUM users, the former will have
        33.3% of the bandwidth, and the others will have to compete for the
        remainder, i.e. each will end up with 6.7% of the capacity.
      
      - Packets of type MSG_BUNDLER are created at SYSTEM importance level,
        but only after the packets bundled into it have passed the congestion
        test for their own respective levels. Since bundled packets don't
        result in incrementing the level counter for their own importance,
        only occasionally for the SYSTEM level counter, they do in practice
        obtain SYSTEM level importance. Hence, the current implementation
        provides a gap in the congestion algorithm that in the worst case
        may lead to a link reset.
      
      We now refine the congestion algorithm as follows:
      
      - A message is accepted to the link backlog only if its own level
        counter, and all superior level counters, permit it.
      
      - The importance of a created bundle packet is set according to its
        contents. A bundle packet created from messages at levels LOW to
        CRITICAL is given importance level CRITICAL, while a bundle created
        from a SYSTEM level message is given importance SYSTEM. In the latter
        case only subsequent SYSTEM level messages are allowed to be bundled
        into it.
      
      This solves the first problem described above, by making the bandwidth
      guarantee relative to the total number of users at all levels; only
      the upper limit for each level remains absolute. In the example
      described above, the single LOW user would use 1/11th of the bandwidth,
      the same as each of the ten MEDIUM users, but he still has the same
      guarantee against starvation as the latter ones.
      
      The fix also solves the second problem. If the CRITICAL level is filled
      up by bundle packets of that level, no lower level packets will be
      accepted any more.
      Suggested-by: Gergely Kiss <gergely.kiss@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f21e897e
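The refined admission rule (a message is accepted only if its own level counter and every superior level counter permit it) can be sketched in C. The level ordering follows the commit text; the struct, the function names and any limit values used with them are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

/* Importance levels in ascending order, as named in the commit text. */
enum { LOW, MEDIUM, HIGH, CRITICAL, SYSTEM, NUM_LEVELS };

/* Hypothetical per-level backlog accounting. */
struct backlog_level {
    unsigned int len;     /* messages currently queued at this level */
    unsigned int limit;   /* absolute upper limit for this level */
};

/* A message at level imp is admitted only if its own counter and the
 * counters of all higher levels are below their limits. */
static bool backlog_accepts(const struct backlog_level bl[NUM_LEVELS], int imp)
{
    assert(imp >= LOW && imp < NUM_LEVELS);
    for (int i = imp; i < NUM_LEVELS; i++)
        if (bl[i].len >= bl[i].limit)
            return false;
    return true;
}

/* Try to queue one message; the counter is stepped only on success. */
static bool backlog_enqueue(struct backlog_level bl[NUM_LEVELS], int imp)
{
    if (!backlog_accepts(bl, imp))
        return false;
    bl[imp].len++;
    return true;
}
```

Under this rule a full CRITICAL level blocks LOW, MEDIUM and HIGH as well, which is how bundle packets created at CRITICAL importance stop lower-level traffic from slipping past the congestion check.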
    • Jon Paul Maloy's avatar
      tipc: simplify link supervision checkpointing · cd4eee3c
      Jon Paul Maloy authored
      We change the sequence number checkpointing that is performed
      by the timer in order to discover if the peer is active. Currently,
      we store a checkpoint of the next expected sequence number "rcv_nxt"
      at each timer expiration, and compare it to the current expected
      number at next timeout expiration. Instead, we now use the already
      existing field "silent_intv_cnt" for this task. We step the counter
      at each timeout expiration, and zero it at each valid received packet.
      If no valid packet has been received from the peer after "abort_limit"
      number of silent timer intervals, the link is declared faulty and reset.
      
      We also remove the multiple instances of timer activation from inside
      the FSM function "link_state_event()", and now do it at only one place;
      at the end of the timer function itself.
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cd4eee3c
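The counter-based supervision described here can be sketched as a tiny state machine: step a counter on every timer expiry, zero it on every valid packet, and reset the link once the counter reaches the limit. The field names follow the commit text; the struct, the functions and the exact comparison against `abort_limit` are assumptions of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced view of a link for this sketch. */
struct link_sketch {
    unsigned int silent_intv_cnt;   /* timer intervals without a valid packet */
    unsigned int abort_limit;       /* silent intervals tolerated before reset */
    bool up;
};

/* Called for every valid packet received from the peer. */
static void link_rcv_valid(struct link_sketch *l)
{
    l->silent_intv_cnt = 0;
}

/* Called on every timer expiry. Returns true if the link was declared
 * faulty and reset by this call. */
static bool link_timeout(struct link_sketch *l)
{
    assert(l->abort_limit > 0);
    if (!l->up)
        return false;
    if (++l->silent_intv_cnt >= l->abort_limit) {
        l->up = false;
        l->silent_intv_cnt = 0;
        return true;
    }
    return false;
}
```

Compared with storing and comparing a sequence number checkpoint, the counter needs no extra field and makes the failure condition a single comparison.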
    • Jon Paul Maloy's avatar
      tipc: rename fields in struct tipc_link · a97b9d3f
      Jon Paul Maloy authored
      We rename some fields in struct tipc_link, in order to give them more
      descriptive names:
      
      next_in_no -> rcv_nxt
      next_out_no-> snd_nxt
      fsm_msg_cnt-> silent_intv_cnt
      cont_intv  -> keepalive_intv
      last_retransmitted -> last_retransm
      
      There are no functional changes in this commit.
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a97b9d3f
    • Jon Paul Maloy's avatar
      tipc: simplify packet sequence number handling · e4bf4f76
      Jon Paul Maloy authored
      Although the sequence number in the TIPC protocol is 16 bits, we have
      until now stored it internally as an unsigned 32-bit integer.
      We got around this by always doing explicit modulo-65536 operations
      whenever we need to access a sequence number.
      
      We now make the incoming and outgoing sequence numbers unsigned
      16-bit integers, and remove the modulo operations where applicable.
      
      We also move the arithmetic inline functions for 16 bit integers
      to core.h, and the function buf_seqno() to msg.h, so they can easily
      be accessed from anywhere in the code.
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e4bf4f76
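Storing sequence numbers in a 16-bit unsigned type lets the type's natural wrap-around replace explicit modulo operations. A minimal sketch of such arithmetic helpers is shown below; the function names are hypothetical and are not claimed to match the inline functions moved to core.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Next sequence number; the cast makes the 16-bit wrap explicit. */
static inline uint16_t seq_next(uint16_t s)
{
    return (uint16_t)(s + 1u);
}

/* left <= right in circular 16-bit sequence space: true when right is
 * at most half the number space ahead of left. */
static inline bool seq_less_eq(uint16_t left, uint16_t right)
{
    return (uint16_t)(right - left) < 32768u;
}

/* Strict variant of the comparison above. */
static inline bool seq_less(uint16_t left, uint16_t right)
{
    return left != right && seq_less_eq(left, right);
}
```

The cast back to `uint16_t` matters: in C the subtraction is carried out in `int` after integer promotion, so without the cast a wrapped difference would be negative instead of a small positive distance.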
    • Jon Paul Maloy's avatar
      tipc: simplify include dependencies · a6bf70f7
      Jon Paul Maloy authored
      When we try to add new inline functions in the code, we sometimes
      run into circular include dependencies.
      
      The main problem is that the file core.h, which really should be at
      the root of the dependency chain, instead is a leaf. I.e., core.h
      includes a number of header files that themselves should be allowed
      to include core.h. In reality this is unnecessary, because core.h does
      not need to know the full signature of any of the structs it refers to,
      only their type declaration.
      
      In this commit, we remove all dependencies from core.h towards any
      other tipc header file.
      
      As a consequence of this change, we can now move the function
      tipc_own_addr(net) from addr.c to addr.h, and make it inline.
      
      There are no functional changes in this commit.
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a6bf70f7
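The point that a root header only needs type declarations, not full struct definitions, can be shown in a single C file that plays the role of two headers. All names here are hypothetical; the sketch only demonstrates that a pointer to an incomplete type is enough until a member is actually accessed.

```c
#include <assert.h>

/* Part 1: what a root header would contain. Only the type name is
 * declared, so no other header has to be included here. */
struct node_sketch;

struct core_sketch {
    struct node_sketch *self;   /* pointer to an incomplete type is legal */
};

/* Part 2: a dependent header that provides the full definition and can
 * therefore host inline accessors that dereference the pointer. */
struct node_sketch {
    unsigned int addr;
};

static inline unsigned int core_own_addr(const struct core_sketch *c)
{
    assert(c->self);
    return c->self->addr;
}
```

Because the accessor needs the complete type, it lives next to the full definition rather than in the root header, which is the same reason an inline function can move into a dependent header once the circular include is gone.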
    • Jon Paul Maloy's avatar
      tipc: simplify link timer handling · 75b44b01
      Jon Paul Maloy authored
      Prior to this commit, the link timer has been running at a "continuity
      interval" of configured link tolerance/4. When a timer wakes up and
      discovers that there has been no sign of life from the peer during the
      previous interval, it divides its own timer interval by another factor
      of four, and starts sending one probe per new interval. When the configured
      link tolerance time has passed without answer, i.e. after 16 unacked
      probes, the link is declared faulty and reset.
      
      This is unnecessarily complex. It is sufficient to continue with the
      original continuity interval, and instead reset the link after four
      missed probe responses. This makes the timer handling in the link
      simpler, and opens up for some planned later changes in this area.
      This commit implements this change.
      Reviewed-by: Richard Alpe <richard.alpe@ericsson.com>
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      75b44b01
    • Jon Paul Maloy's avatar
      tipc: simplify resetting and disabling of bearers · b1c29f6b
      Jon Paul Maloy authored
      Since commit 4b475e3f2f8e4e241de101c8240f1d74d0470494
      ("tipc: eliminate delayed link deletion at link failover") the extra
      boolean parameter "shutting_down" is no longer needed for the
      functions bearer_disable() and tipc_link_delete_list().
      
      Furthermore, the function tipc_link_reset_links(), called from
      bearer_reset(), is now unnecessary. We can just as well delete
      all the links, as we do in bearer_disable(), and start over with
      creating new links.
      
      This commit introduces those changes.
      Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b1c29f6b
    • David S. Miller's avatar
      Merge branch 'be2net-next' · c16ead79
      David S. Miller authored
      Venkat Duvvuru says:
      
      ====================
      be2net: patch-set
      
      The following patch set has one new feature addition and two fixes.
      
      Patch 1 adds support for hwmon sysfs interface to display board temperature.
      Board temperature display through ethtool statistics is removed.
      
      Patch 2 reports "link down" in a few more error cases which are not handled
      currently.
      
      Patch 3 adds support for os2bmc. OS2BMC feature will allow the server to
      communicate with the on-board BMC/idrac (Baseboard Management Controller)
      over the LOM via standard Ethernet. More details are added in the commit log.
      
      Please review.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c16ead79
    • Venkata Duvvuru's avatar
      be2net: Support for OS2BMC. · 760c295e
      Venkata Duvvuru authored
      OS2BMC feature will allow the server to communicate with the on-board
      BMC/idrac (Baseboard Management Controller) over the LOM via
      standard Ethernet.
      
      When OS2BMC feature is enabled, the LOM will filter traffic coming
      from the host. If the destination MAC address matches the iDRAC MAC
      address, it will forward the packet to the NC-SI side band interface
      for iDRAC processing. Otherwise, it would send it out on the wire to
      the external network. Broadcast and multicast packets are sent on the
      side-band NC-SI channel and on the wire as well. Some of the packet
      filters are not supported in the NIC and hence driver will identify
      such packets and will hint the NIC to send those packets to the BMC.
      This is done by duplicating packets on the management ring. Packets
      are sent to the management ring, by setting mgmt bit in the wrb header.
      The NIC will forward the packets on the management ring to the BMC
      through the side-band NC-SI channel.
      
      Please refer to this online document for more details:
      http://www.dell.com/downloads/global/products/pedge/os_to_bmc_passthrough_a_new_chapter_in_system_management.pdf
      Signed-off-by: Venkat Duvvuru <VenkatKumar.Duvvuru@Emulex.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      760c295e
    • Venkata Duvvuru's avatar
      be2net: Report a "link down" to the stack when a fatal error or fw reset happens. · 954f6825
      Venkata Duvvuru authored
      When an error (related to HW or FW) is detected on a function, the driver
      must pro-actively report a "link down" to the stack so that a possible
      failover can be initiated. This is being done currently only for some
      HW errors. This patch reports a "link down" even for fatal FW errors and
      EEH errors.
      Signed-off-by: Venkat Duvvuru <VenkatKumar.Duvvuru@Emulex.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      954f6825
    • Venkata Duvvuru's avatar
      be2net: Export board temperature using hwmon-sysfs interface. · 29e9122b
      Venkata Duvvuru authored
      Ethtool statistics are not the right place to display board temperature.
      This patch adds support to export die temperature of devices supported
      by be2net driver via the sysfs hwmon interface.
      Signed-off-by: Venkat Duvvuru <VenkatKumar.Duvvuru@Emulex.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      29e9122b
    • David S. Miller's avatar
      Merge branch 'nf-ingress' · 5a99e7f2
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter ingress support (v4)
      
      This is the v4 round of patches to add the Netfilter ingress hook, it basically
      comes in two steps:
      
      1) Add the CONFIG_NET_INGRESS switch to wrap the ingress static key around it.
         The idea is to use the same global static key to avoid adding more code to
         the hot path.
      
      2) Add the Netfilter ingress hook after the tc ingress hook, under the global
         ingress_needed static key. As I said, the netfilter ingress hook also has
         its own static key, that is nested under the global static key. Please, see
         patch 5/5 for performance numbers and more information.
      
      I originally started this next round, as it was suggested, exploring the
      independent static key for netfilter ingress just after tc ingress, but the
      results that I gathered from that patch are not good for non-users:
      
      Result: OK: 6425927(c6425843+d83) usec, 100000000 (60byte,0frags)
        15561955pps 7469Mb/sec (7469738400bps) errors: 100000000
      
      this roughly means 500Kpps less performance wrt. the base numbers, so that's
      the reason why I discarded that approach and I focused on this.
      
      The idea of this patchset is to open the window to nf_tables, which comes with
      features that will work out-of-the-box (once the boilerplate code to support
      the 'netdev' table family is in place). To avoid repeating myself [1], the most
      relevant features are:
      
      1) Multi-dimensional key dictionary lookups.
      2) Arbitrary stateful flow tables.
      3) Transactions and good support for dynamic updates.
      
      But there are also interesting aspects to consider from userspace, such as the
      ability to support new layer 2 protocols without kernel updates, a well-defined
      netlink interface, userspace libraries and utilities for third party
      applications, among others.
      
      I hope we can be happy with this approach.
      
      Please, apply. Thanks.
      
      [1] http://marc.info/?l=netfilter-devel&m=143033337020328&w=2
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5a99e7f2
    • Pablo Neira's avatar
      netfilter: add netfilter ingress hook after handle_ing() under unique static key · e687ad60
      Pablo Neira authored
      This patch adds the Netfilter ingress hook just after the existing tc ingress
      hook, which seems to be the consensus solution for this.
      
      Note that the Netfilter hook resides under the global static key that enables
      ingress filtering. Nonetheless, Netfilter still also has its own static key for
      minimal impact on the existing handle_ing().
      
      * Without this patch:
      
      Result: OK: 6216490(c6216338+d152) usec, 100000000 (60byte,0frags)
        16086246pps 7721Mb/sec (7721398080bps) errors: 100000000
      
          42.46%  kpktgend_0   [kernel.kallsyms]   [k] __netif_receive_skb_core
          25.92%  kpktgend_0   [kernel.kallsyms]   [k] kfree_skb
           7.81%  kpktgend_0   [pktgen]            [k] pktgen_thread_worker
           5.62%  kpktgend_0   [kernel.kallsyms]   [k] ip_rcv
           2.70%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_internal
           2.34%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_sk
           1.44%  kpktgend_0   [kernel.kallsyms]   [k] __build_skb
      
      * With this patch:
      
      Result: OK: 6214833(c6214731+d101) usec, 100000000 (60byte,0frags)
        16090536pps 7723Mb/sec (7723457280bps) errors: 100000000
      
          41.23%  kpktgend_0      [kernel.kallsyms]  [k] __netif_receive_skb_core
          26.57%  kpktgend_0      [kernel.kallsyms]  [k] kfree_skb
           7.72%  kpktgend_0      [pktgen]           [k] pktgen_thread_worker
           5.55%  kpktgend_0      [kernel.kallsyms]  [k] ip_rcv
           2.78%  kpktgend_0      [kernel.kallsyms]  [k] netif_receive_skb_internal
           2.06%  kpktgend_0      [kernel.kallsyms]  [k] netif_receive_skb_sk
           1.43%  kpktgend_0      [kernel.kallsyms]  [k] __build_skb
      
      * Without this patch + tc ingress:
      
              tc filter add dev eth4 parent ffff: protocol ip prio 1 \
                      u32 match ip dst 4.3.2.1/32
      
      Result: OK: 9269001(c9268821+d179) usec, 100000000 (60byte,0frags)
        10788648pps 5178Mb/sec (5178551040bps) errors: 100000000
      
          40.99%  kpktgend_0   [kernel.kallsyms]  [k] __netif_receive_skb_core
          17.50%  kpktgend_0   [kernel.kallsyms]  [k] kfree_skb
          11.77%  kpktgend_0   [cls_u32]          [k] u32_classify
           5.62%  kpktgend_0   [kernel.kallsyms]  [k] tc_classify_compat
           5.18%  kpktgend_0   [pktgen]           [k] pktgen_thread_worker
           3.23%  kpktgend_0   [kernel.kallsyms]  [k] tc_classify
           2.97%  kpktgend_0   [kernel.kallsyms]  [k] ip_rcv
           1.83%  kpktgend_0   [kernel.kallsyms]  [k] netif_receive_skb_internal
           1.50%  kpktgend_0   [kernel.kallsyms]  [k] netif_receive_skb_sk
           0.99%  kpktgend_0   [kernel.kallsyms]  [k] __build_skb
      
      * With this patch + tc ingress:
      
              tc filter add dev eth4 parent ffff: protocol ip prio 1 \
                      u32 match ip dst 4.3.2.1/32
      
      Result: OK: 9308218(c9308091+d126) usec, 100000000 (60byte,0frags)
        10743194pps 5156Mb/sec (5156733120bps) errors: 100000000
      
          42.01%  kpktgend_0   [kernel.kallsyms]   [k] __netif_receive_skb_core
          17.78%  kpktgend_0   [kernel.kallsyms]   [k] kfree_skb
          11.70%  kpktgend_0   [cls_u32]           [k] u32_classify
           5.46%  kpktgend_0   [kernel.kallsyms]   [k] tc_classify_compat
           5.16%  kpktgend_0   [pktgen]            [k] pktgen_thread_worker
           2.98%  kpktgend_0   [kernel.kallsyms]   [k] ip_rcv
           2.84%  kpktgend_0   [kernel.kallsyms]   [k] tc_classify
           1.96%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_internal
           1.57%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_sk
      
      Note that the results are very similar before and after.
      
      I can see gcc gets the code under the ingress static key out of the hot path.
      Then, on that cold branch, it generates the code to accommodate the netfilter
      ingress static key. My explanation for this is that this reduces the pressure
      on the instruction cache for non-users as the new code is out of the hot path,
      and it comes with minimal impact for tc ingress users.
      
      Using gcc version 4.8.4 on:
      
      Architecture:          x86_64
      CPU op-mode(s):        32-bit, 64-bit
      Byte Order:            Little Endian
      CPU(s):                8
      [...]
      L1d cache:             16K
      L1i cache:             64K
      L2 cache:              2048K
      L3 cache:              8192K
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e687ad60
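The nesting of the Netfilter key under the global ingress key can be modelled with plain booleans. This is only a control-flow illustration: real static keys are branches patched at runtime, which ordinary variables cannot reproduce, and the struct and function names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Plain booleans standing in for the two static keys. */
struct ingress_keys {
    bool ingress_needed;      /* global key: any ingress user at all */
    bool nf_ingress_needed;   /* nested key: Netfilter ingress hooks present */
};

/* Counters that record which hooks ran, for illustration only. */
struct rx_stats {
    unsigned int tc_runs;
    unsigned int nf_runs;
};

static void receive_packet(const struct ingress_keys *k, struct rx_stats *s)
{
    assert(k && s);
    if (!k->ingress_needed)
        return;               /* non-users pay for one check only */
    s->tc_runs++;             /* tc ingress hook runs first */
    if (k->nf_ingress_needed)
        s->nf_runs++;         /* Netfilter hook, checked on the cold branch */
}
```

In this model the nested check is reached only when the global key is on, which mirrors the commit's observation that the extra code stays out of the hot path for non-users.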
    • Pablo Neira's avatar
      net: add CONFIG_NET_INGRESS to enable ingress filtering · 1cf51900
      Pablo Neira authored
      This new config switch enables the ingress filtering infrastructure that is
      controlled through the ingress_needed static key. This prepares the
      introduction of the Netfilter ingress hook that resides under this unique
      static key.
      
      Note that CONFIG_SCH_INGRESS automatically selects this, which should be no
      problem since this also depends on CONFIG_NET_CLS_ACT.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1cf51900
    • Pablo Neira's avatar
      netfilter: add nf_hook_list_active() · b8d0aad0
      Pablo Neira authored
      In preparation for a per-device netfilter ingress hook list.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b8d0aad0
    • Pablo Neira's avatar
      f7191483
    • Pablo Neira's avatar
    • Dan Carpenter's avatar
      net: macb: OR vs AND typos · a104a6b3
      Dan Carpenter authored
      The bitwise tests are always true here because it uses '|' where '&' is
      intended.
      
      Fixes: 98b5a0f4 ('net: macb: Add support for jumbo frames')
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a104a6b3
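The class of bug fixed here is easy to reproduce in isolation: OR-ing a value with a non-zero flag is never zero, so the test always passes. The flag name and value below are hypothetical and are not taken from the macb driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CAP_JUMBO (UINT32_C(1) << 3)   /* hypothetical capability bit */

/* Buggy form: (caps | CAP_JUMBO) always has bit 3 set, so the result is
 * true for every input, including caps == 0. */
static bool has_jumbo_buggy(uint32_t caps)
{
    return (caps | CAP_JUMBO) != 0;
}

/* Intended form: AND isolates the bit, so the result depends on caps. */
static bool has_jumbo_fixed(uint32_t caps)
{
    return (caps & CAP_JUMBO) != 0;
}
```

Some compilers can flag the buggy form as an always-true condition, but only when the relevant warnings are enabled.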
  3. 13 May, 2015 9 commits