An error occurred fetching the project authors.
  1. 26 Feb, 2014 2 commits
    • Hannes Frederic Sowa's avatar
      ipv4: yet another new IP_MTU_DISCOVER option IP_PMTUDISC_OMIT · 1b346576
      Hannes Frederic Sowa authored
      IP_PMTUDISC_INTERFACE has a design error: because it does not allow the
      generation of fragments if the interface mtu is exceeded, it is very
      hard to make use of this option in already deployed name server software
      for which I introduced this option.
      
      This patch adds yet another new IP_MTU_DISCOVER option to not honor any
      path mtu information and not accepting new icmp notifications destined for
      the socket this option is enabled on. But we allow outgoing fragmentation
      in case the packet size exceeds the outgoing interface mtu.
      
      As such this new option can be used as a drop-in replacement for
      IP_PMTUDISC_DONT, which is currently in use by most name server software
      making the adoption of this option very smooth and easy.
      
      The original advantage of IP_PMTUDISC_INTERFACE is still maintained:
      ignoring incoming path MTU updates and not honoring discovered path MTUs
      in the output path.
      
      Fixes: 482fc609 ("ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE")
      Cc: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1b346576
    • Hannes Frederic Sowa's avatar
      ipv4: use ip_skb_dst_mtu to determine mtu in ip_fragment · 69647ce4
      Hannes Frederic Sowa authored
      ip_skb_dst_mtu mostly falls back to ip_dst_mtu_maybe_forward if no socket
      is attached to the skb (in case of forwarding) or determines the mtu like
      we do in ip_finish_output, which actually checks if we should branch to
      ip_fragment. Thus use the same function to determine the mtu here, too.
      
      This is important for the introduction of IP_PMTUDISC_OMIT, where we
      want the packets getting cut in pieces of the size of the outgoing
      interface mtu. IPv6 already does this correctly.
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      69647ce4
  2. 14 Jan, 2014 1 commit
  3. 13 Jan, 2014 1 commit
    • Hannes Frederic Sowa's avatar
      ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing · f87c10a8
      Hannes Frederic Sowa authored
      While forwarding we should not use the protocol path mtu to calculate
      the mtu for a forwarded packet but instead use the interface mtu.
      
      We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was
      introduced for multicast forwarding. But as it does not conflict with
      our usage in unicast code path it is perfect for reuse.
      
      I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu
      along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular
      dependencies because of IPSKB_FORWARDED.
      
      Because someone might have written a software which does probe
      destinations manually and expects the kernel to honour those path mtus
      I introduced a new per-namespace "ip_forward_use_pmtu" knob so someone
      can disable this new behaviour. We also still use mtus which are locked on a
      route for forwarding.
      
      The reason for this change is, that path mtus information can be injected
      into the kernel via e.g. icmp_err protocol handler without verification
      of local sockets. As such, this could cause the IPv4 forwarding path to
      wrongfully emit fragmentation needed notifications or start to fragment
      packets along a path.
      
      Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED
      won't be set and further fragmentation logic will use the path mtu to
      determine the fragmentation size. They also recheck packet size with
      help of path mtu discovery and report appropriate errors.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: John Heffner <johnwheffner@gmail.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f87c10a8
  4. 22 Dec, 2013 1 commit
  5. 06 Nov, 2013 1 commit
    • Hannes Frederic Sowa's avatar
      ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE · 482fc609
      Hannes Frederic Sowa authored
      Sockets marked with IP_PMTUDISC_INTERFACE won't do path mtu discovery,
      their sockets won't accept and install new path mtu information and they
      will always use the interface mtu for outgoing packets. It is guaranteed
      that the packet is not fragmented locally. But we won't set the DF-Flag
      on the outgoing frames.
      
      Florian Weimer had the idea to use this flag to ensure DNS servers are
      never generating outgoing fragments. They may well be fragmented on the
      path, but the server never stores or usees path mtu values, which could
      well be forged in an attack.
      
      (The root of the problem with path MTU discovery is that there is
      no reliable way to authenticate ICMP Fragmentation Needed But DF Set
      messages because they are sent from intermediate routers with their
      source addresses, and the IMCP payload will not always contain sufficient
      information to identify a flow.)
      
      Recent research in the DNS community showed that it is possible to
      implement an attack where DNS cache poisoning is feasible by spoofing
      fragments. This work was done by Amir Herzberg and Haya Shulman:
      <https://sites.google.com/site/hayashulman/files/fragmentation-poisoning.pdf>
      
      This issue was previously discussed among the DNS community, e.g.
      <http://www.ietf.org/mail-archive/web/dnsext/current/msg01204.html>,
      without leading to fixes.
      
      This patch depends on the patch "ipv4: fix DO and PROBE pmtu mode
      regarding local fragmentation with UFO/CORK" for the enforcement of the
      non-fragmentable checks. If other users than ip_append_page/data should
      use this semantic too, we have to add a new flag to IPCB(skb)->flags to
      suppress local fragmentation and check for this in ip_finish_output.
      
      Many thanks to Florian Weimer for the idea and feedback while implementing
      this patch.
      
      Cc: David S. Miller <davem@davemloft.net>
      Suggested-by: default avatarFlorian Weimer <fweimer@redhat.com>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      482fc609
  6. 29 Oct, 2013 1 commit
    • Hannes Frederic Sowa's avatar
      ipv4: fix DO and PROBE pmtu mode regarding local fragmentation with UFO/CORK · daba287b
      Hannes Frederic Sowa authored
      UFO as well as UDP_CORK do not respect IP_PMTUDISC_DO and
      IP_PMTUDISC_PROBE well enough.
      
      UFO enabled packet delivery just appends all frags to the cork and hands
      it over to the network card. So we just deliver non-DF udp fragments
      (DF-flag may get overwritten by hardware or virtual UFO enabled
      interface).
      
      UDP_CORK does enqueue the data until the cork is disengaged. At this
      point it sets the correct IP_DF and local_df flags and hands it over to
      ip_fragment which in this case will generate an icmp error which gets
      appended to the error socket queue. This is not reflected in the syscall
      error (of course, if UFO is enabled this also won't happen).
      
      Improve this by checking the pmtudisc flags before appending data to the
      socket and if we still can fit all data in one packet when IP_PMTUDISC_DO
      or IP_PMTUDISC_PROBE is set, only then proceed.
      
      We use (mtu-fragheaderlen) to check for the maximum length because we
      ensure not to generate a fragment and non-fragmented data does not need
      to have its length aligned on 64 bit boundaries. Also the passed in
      ip_options are already aligned correctly.
      
      Maybe, we can relax some other checks around ip_fragment. This needs
      more research.
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      daba287b
  7. 19 Oct, 2013 1 commit
  8. 28 Sep, 2013 1 commit
    • Francesco Fusco's avatar
      ipv4: processing ancillary IP_TOS or IP_TTL · aa661581
      Francesco Fusco authored
      If IP_TOS or IP_TTL are specified as ancillary data, then sendmsg() sends out
      packets with the specified TTL or TOS overriding the socket values specified
      with the traditional setsockopt().
      
      The struct inet_cork stores the values of TOS, TTL and priority that are
      passed through the struct ipcm_cookie. If there are user-specified TOS
      (tos != -1) or TTL (ttl != 0) in the struct ipcm_cookie, these values are
      used to override the per-socket values. In case of TOS also the priority
      is changed accordingly.
      
      Two helper functions get_rttos and get_rtconn_flags are defined to take
      into account the presence of a user specified TOS value when computing
      RT_TOS and RT_CONN_FLAGS.
      Signed-off-by: default avatarFrancesco Fusco <ffusco@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      aa661581
  9. 19 Sep, 2013 2 commits
  10. 14 Aug, 2013 1 commit
  11. 11 May, 2013 1 commit
  12. 01 Apr, 2013 1 commit
  13. 13 Feb, 2013 1 commit
    • Pravin B Shelar's avatar
      net: Fix possible wrong checksum generation. · c9af6db4
      Pravin B Shelar authored
      Patch cef401de (net: fix possible wrong checksum
      generation) fixed wrong checksum calculation but it broke TSO by
      defining new GSO type but not a netdev feature for that type.
      net_gso_ok() would not allow hardware checksum/segmentation
      offload of such packets without the feature.
      
      Following patch fixes TSO and wrong checksum. This patch uses
      same logic that Eric Dumazet used. Patch introduces new flag
      SKBTX_SHARED_FRAG if at least one frag can be modified by
      the user. but SKBTX_SHARED_FRAG flag is kept in skb shared
      info tx_flags rather than gso_type.
      
      tx_flags is better compared to gso_type since we can have skb with
      shared frag without gso packet. It does not link SHARED_FRAG to
      GSO, So there is no need to define netdev feature for this.
      Signed-off-by: default avatarPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c9af6db4
  14. 09 Dec, 2012 1 commit
  15. 08 Oct, 2012 1 commit
    • Julian Anastasov's avatar
      ipv4: introduce rt_uses_gateway · 155e8336
      Julian Anastasov authored
      Add new flag to remember when route is via gateway.
      We will use it to allow rt_gateway to contain address of
      directly connected host for the cases when DST_NOCACHE is
      used or when the NH exception caches per-destination route
      without DST_NOCACHE flag, i.e. when routes are not used for
      other destinations. By this way we force the neighbour
      resolving to work with the routed destination but we
      can use different address in the packet, feature needed
      for IPVS-DR where original packet for virtual IP is routed
      via route to real IP.
      Signed-off-by: default avatarJulian Anastasov <ja@ssi.bg>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      155e8336
  16. 24 Sep, 2012 1 commit
    • Eric Dumazet's avatar
      net: use a per task frag allocator · 5640f768
      Eric Dumazet authored
      We currently use a per socket order-0 page cache for tcp_sendmsg()
      operations.
      
      This page is used to build fragments for skbs.
      
      Its done to increase probability of coalescing small write() into
      single segments in skbs still in write queue (not yet sent)
      
      But it wastes a lot of memory for applications handling many mostly
      idle sockets, since each socket holds one page in sk->sk_sndmsg_page
      
      Its also quite inefficient to build TSO 64KB packets, because we need
      about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
      page allocator more than wanted.
      
      This patch adds a per task frag allocator and uses bigger pages,
      if available. An automatic fallback is done in case of memory pressure.
      
      (up to 32768 bytes per frag, thats order-3 pages on x86)
      
      This increases TCP stream performance by 20% on loopback device,
      but also benefits on other network devices, since 8x less frags are
      mapped on transmit and unmapped on tx completion. Alexander Duyck
      mentioned a probable performance win on systems with IOMMU enabled.
      
      Its possible some SG enabled hardware cant cope with bigger fragments,
      but their ndo_start_xmit() should already handle this, splitting a
      fragment in sub fragments, since some arches have PAGE_SIZE=65536
      
      Successfully tested on various ethernet devices.
      (ixgbe, igb, bnx2x, tg3, mellanox mlx4)
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: default avatarVijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5640f768
  17. 26 Aug, 2012 1 commit
  18. 21 Aug, 2012 1 commit
  19. 10 Aug, 2012 2 commits
  20. 06 Aug, 2012 1 commit
  21. 22 Jul, 2012 1 commit
  22. 20 Jul, 2012 1 commit
  23. 19 Jul, 2012 1 commit
    • Eric Dumazet's avatar
      ipv4: tcp: remove per net tcp_sock · be9f4a44
      Eric Dumazet authored
      tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket
      per network namespace.
      
      This leads to bad behavior on multiqueue NICS, because many cpus
      contend for the socket lock and once socket lock is acquired, extra
      false sharing on various socket fields slow down the operations.
      
      To better resist to attacks, we use a percpu socket. Each cpu can
      run without contention, using appropriate memory (local node)
      
      Additional features :
      
      1) We also mirror the queue_mapping of the incoming skb, so that
      answers use the same queue if possible.
      
      2) Setting SOCK_USE_WRITE_QUEUE socket flag speedup sock_wfree()
      
      3) We now limit the number of in-flight RST/ACK [1] packets
      per cpu, instead of per namespace, and we honor the sysctl_wmem_default
      limit dynamically. (Prior to this patch, sysctl_wmem_default value was
      copied at boot time, so any further change would not affect tcp_sock
      limit)
      
      [1] These packets are only generated when no socket was matched for
      the incoming packet.
      Reported-by: default avatarBill Sommerfeld <wsommerfeld@google.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      be9f4a44
  24. 05 Jul, 2012 2 commits
  25. 28 Jun, 2012 1 commit
  26. 13 Jun, 2012 1 commit
    • Michel Machado's avatar
      net-next: add dev_loopback_xmit() to avoid duplicate code · 95603e22
      Michel Machado authored
      Add dev_loopback_xmit() in order to deduplicate functions
      ip_dev_loopback_xmit() (in net/ipv4/ip_output.c) and
      ip6_dev_loopback_xmit() (in net/ipv6/ip6_output.c).
      
      I was about to reinvent the wheel when I noticed that
      ip_dev_loopback_xmit() and ip6_dev_loopback_xmit() do exactly what I
      need and are not IP-only functions, but they were not available to reuse
      elsewhere.
      
      ip6_dev_loopback_xmit() does not have line "skb_dst_force(skb);", but I
      understand that this is harmless, and should be in dev_loopback_xmit().
      Signed-off-by: default avatarMichel Machado <michel@digirati.com.br>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      CC: James Morris <jmorris@namei.org>
      CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jiri Pirko <jpirko@redhat.com>
      CC: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
      CC: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      95603e22
  27. 04 Jun, 2012 1 commit
  28. 15 May, 2012 1 commit
  29. 28 Mar, 2012 1 commit
  30. 05 Dec, 2011 1 commit
  31. 01 Dec, 2011 1 commit
  32. 24 Oct, 2011 1 commit
    • Eric Dumazet's avatar
      ipv4: tcp: fix TOS value in ACK messages sent from TIME_WAIT · 66b13d99
      Eric Dumazet authored
      There is a long standing bug in linux tcp stack, about ACK messages sent
      on behalf of TIME_WAIT sockets.
      
      In the IP header of the ACK message, we choose to reflect TOS field of
      incoming message, and this might break some setups.
      
      Example of things that were broken :
        - Routing using TOS as a selector
        - Firewalls
        - Trafic classification / shaping
      
      We now remember in timewait structure the inet tos field and use it in
      ACK generation, and route lookup.
      
      Notes :
       - We still reflect incoming TOS in RST messages.
       - We could extend MuraliRaja Muniraju patch to report TOS value in
      netlink messages for TIME_WAIT sockets.
       - A patch is needed for IPv6
      Signed-off-by: default avatarEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      66b13d99
  33. 19 Oct, 2011 1 commit
  34. 25 Aug, 2011 1 commit
  35. 08 Aug, 2011 1 commit
  36. 03 Aug, 2011 1 commit