1. 23 Jul, 2012 17 commits
    • ipv4: Change rt->rt_iif encoding. · 13378cad
      David S. Miller authored
      On input packet processing, rt->rt_iif will be zero if we should
      use skb->dev->ifindex.
      
      Since we access rt->rt_iif consistently via inet_iif(), that is
      the only spot whose interpretation has to be adjusted.
      Signed-off-by: David S. Miller <davem@davemloft.net>
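The rt_iif encoding described in this commit can be illustrated with a small user-space sketch. The struct layouts and the name sketch_inet_iif() are hypothetical stand-ins, not the kernel's definitions; only the rule itself (zero means "use skb->dev->ifindex") comes from the commit message.

```c
/* User-space mock-up; these structs only mimic the fields named above. */
struct sketch_net_device { int ifindex; };
struct sketch_rtable     { int rt_iif; };
struct sketch_sk_buff {
	struct sketch_net_device *dev;
	struct sketch_rtable *rt;
};

/* Zero in rt_iif means "use the index of the device the skb arrived on". */
static int sketch_inet_iif(const struct sketch_sk_buff *skb)
{
	int iif = skb->rt->rt_iif;

	return iif ? iif : skb->dev->ifindex;
}
```

Keeping the decision in one accessor is what lets the encoding change without touching every caller.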
    • net: Make skb->skb_iif always track skb->dev · b6858177
      David S. Miller authored
      Make it follow device decapsulation, from things such as VLAN and
      bonding.
      
      The code that actually cares about pre-demuxed device pointers is
      handled by the "orig_dev" variable in __netif_receive_skb(), and
      the only consumer of that is the po->origdev feature of AF_PACKET
      sockets.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Prepare for change of rt->rt_iif encoding. · 92101b3b
      David S. Miller authored
      Use inet_iif() consistently, and for TCP record the input interface of
      cached RX dst in inet sock.
      
      rt->rt_iif is going to be encoded differently, so that we can
      legitimately cache input routes in the FIB info more aggressively.
      
      When the input interface is "use SKB device index" the rt->rt_iif will
      be set to zero.
      
      This forces us to move the TCP RX dst cache installation into the
      ipv4-specific code, which is where it belongs anyway: caching the
      route for ipv6 is pointless at the moment, since the ipv6 input
      paths do not inspect it yet.
      
      Also, remove the unlikely() on dst->obsolete: all ipv4 dsts have
      obsolete set to a non-zero value to force invocation of the check
      callback.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Remove all RTCF_DIRECTSRC handling. · fe3edf45
      David S. Miller authored
      The last and final kernel user, ICMP address replies,
      has been removed.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Really ignore ICMP address requests/replies. · 838942a5
      David S. Miller authored
      Alexey removed kernel side support for requests, and the
      only thing we do for replies is log a message if something
      doesn't look right.
      
      As Alexey's comment indicates, this belongs in userspace (if
      anywhere), and thus we can safely just get rid of this code.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • decnet: Don't set RTCF_DIRECTSRC. · 8acfaa94
      David S. Miller authored
      It's an ipv4 defined route flag, and only ipv4 uses it.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ipv4/ip_vti.c: Fix __rcu warnings detected by sparse. · e7d4b18c
      Saurabh authored
      With CONFIG_SPARSE_RCU_POINTER=y, sparse identified references in
      ip_vti.c which did not specify __rcu.
      Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Remove redundant assignment · 8fe5cb87
      Lin Ming authored
      It is redundant to set no_addr and accept_local to 0 and then set them
      with other values just after that.
      Signed-off-by: Lin Ming <mlin@ss.pku.edu.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rds: set correct msg_namelen · 06b6a1cf
      Weiping Pan authored
      Jay Fenlason (fenlason@redhat.com) found a bug:
      recvfrom() on an RDS socket can return the contents of random kernel
      memory to userspace if it is called with an address length larger than
      sizeof(struct sockaddr_in).
      rds_recvmsg() also fails to set the addr_len parameter properly before
      returning, but that's just a bug.
      There are also a number of cases where recvfrom() can return an entirely bogus
      address. Anything in rds_recvmsg() that returns a non-negative value but does
      not go through the "sin = (struct sockaddr_in *)msg->msg_name;" code path
      at the end of the while(1) loop will return up to 128 bytes of kernel memory
      to userspace.
      
      I wrote two test programs to reproduce this bug. In rds_server,
      fromAddr will be overwritten and the sock_fd that follows it will be
      clobbered.
      Yes, it is the programmer's fault for setting msg_namelen incorrectly, but it is
      better to make the kernel copy the real length of the address to user space in
      such a case.

      How to run the test programs:
      I tested them on a 32-bit x86 system running 3.5.0-rc7.
      
      1 compile
      gcc -o rds_client rds_client.c
      gcc -o rds_server rds_server.c
      
      2 run ./rds_server on one console
      
      3 run ./rds_client on another console
      
      4 you will see something like:
      server is waiting to receive data...
      old socket fd=3
      server received data from client:data from client
      msg.msg_namelen=32
      new socket fd=-1067277685
      sendmsg()
      : Bad file descriptor
      
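      /***************** rds_client.c ********************/

      /* Headers the listing needs in order to compile. */
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>
      #include <sys/types.h>
      #include <sys/socket.h>
      #include <netinet/in.h>
      #include <arpa/inet.h>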
      
      int main(void)
      {
      	int sock_fd;
      	struct sockaddr_in serverAddr;
      	struct sockaddr_in toAddr;
      	char recvBuffer[128] = "data from client";
      	struct msghdr msg;
      	struct iovec iov;
      
      	sock_fd = socket(AF_RDS, SOCK_SEQPACKET, 0);
      	if (sock_fd < 0) {
      		perror("create socket error\n");
      		exit(1);
      	}
      
      	memset(&serverAddr, 0, sizeof(serverAddr));
      	serverAddr.sin_family = AF_INET;
      	serverAddr.sin_addr.s_addr = inet_addr("127.0.0.1");
      	serverAddr.sin_port = htons(4001);
      
      	if (bind(sock_fd, (struct sockaddr*)&serverAddr, sizeof(serverAddr)) < 0) {
      		perror("bind() error\n");
      		close(sock_fd);
      		exit(1);
      	}
      
      	memset(&toAddr, 0, sizeof(toAddr));
      	toAddr.sin_family = AF_INET;
      	toAddr.sin_addr.s_addr = inet_addr("127.0.0.1");
      	toAddr.sin_port = htons(4000);
      	msg.msg_name = &toAddr;
      	msg.msg_namelen = sizeof(toAddr);
      	msg.msg_iov = &iov;
      	msg.msg_iovlen = 1;
      	msg.msg_iov->iov_base = recvBuffer;
      	msg.msg_iov->iov_len = strlen(recvBuffer) + 1;
      	msg.msg_control = 0;
      	msg.msg_controllen = 0;
      	msg.msg_flags = 0;
      
      	if (sendmsg(sock_fd, &msg, 0) == -1) {
      		perror("sendto() error\n");
      		close(sock_fd);
      		exit(1);
      	}
      
      	printf("client send data:%s\n", recvBuffer);
      
      	memset(recvBuffer, '\0', 128);
      
      	msg.msg_name = &toAddr;
      	msg.msg_namelen = sizeof(toAddr);
      	msg.msg_iov = &iov;
      	msg.msg_iovlen = 1;
      	msg.msg_iov->iov_base = recvBuffer;
      	msg.msg_iov->iov_len = 128;
      	msg.msg_control = 0;
      	msg.msg_controllen = 0;
      	msg.msg_flags = 0;
      	if (recvmsg(sock_fd, &msg, 0) == -1) {
      		perror("recvmsg() error\n");
      		close(sock_fd);
      		exit(1);
      	}
      
      	printf("receive data from server:%s\n", recvBuffer);
      
      	close(sock_fd);
      
      	return 0;
      }
      
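      /***************** rds_server.c ********************/

      /* Headers the listing needs in order to compile. */
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>
      #include <sys/types.h>
      #include <sys/socket.h>
      #include <netinet/in.h>
      #include <arpa/inet.h>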
      
      int main(void)
      {
      	struct sockaddr_in fromAddr;
      	int sock_fd;
      	struct sockaddr_in serverAddr;
      	unsigned int addrLen;
      	char recvBuffer[128];
      	struct msghdr msg;
      	struct iovec iov;
      
      	sock_fd = socket(AF_RDS, SOCK_SEQPACKET, 0);
      	if(sock_fd < 0) {
      		perror("create socket error\n");
      		exit(0);
      	}
      
      	memset(&serverAddr, 0, sizeof(serverAddr));
      	serverAddr.sin_family = AF_INET;
      	serverAddr.sin_addr.s_addr = inet_addr("127.0.0.1");
      	serverAddr.sin_port = htons(4000);
      	if (bind(sock_fd, (struct sockaddr*)&serverAddr, sizeof(serverAddr)) < 0) {
      		perror("bind error\n");
      		close(sock_fd);
      		exit(1);
      	}
      
      	printf("server is waiting to receive data...\n");
      	msg.msg_name = &fromAddr;
      
      	/*
      	 * I add 16 to sizeof(fromAddr), ie 32,
      	 * and pay attention to the definition of fromAddr,
      	 * recvmsg() will overwrite sock_fd,
      	 * since kernel will copy 32 bytes to userspace.
      	 *
      	 * If you just use sizeof(fromAddr), it works fine.
      	 * */
      	msg.msg_namelen = sizeof(fromAddr) + 16;
      	/* msg.msg_namelen = sizeof(fromAddr); */
      	msg.msg_iov = &iov;
      	msg.msg_iovlen = 1;
      	msg.msg_iov->iov_base = recvBuffer;
      	msg.msg_iov->iov_len = 128;
      	msg.msg_control = 0;
      	msg.msg_controllen = 0;
      	msg.msg_flags = 0;
      
      	while (1) {
      		printf("old socket fd=%d\n", sock_fd);
      		if (recvmsg(sock_fd, &msg, 0) == -1) {
      			perror("recvmsg() error\n");
      			close(sock_fd);
      			exit(1);
      		}
      		printf("server received data from client:%s\n", recvBuffer);
      		printf("msg.msg_namelen=%d\n", msg.msg_namelen);
      		printf("new socket fd=%d\n", sock_fd);
      		strcat(recvBuffer, "--data from server");
      		if (sendmsg(sock_fd, &msg, 0) == -1) {
      			perror("sendmsg()\n");
      			close(sock_fd);
      			exit(1);
      		}
      	}
      
      	close(sock_fd);
      	return 0;
      }
      Signed-off-by: Weiping Pan <wpan@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
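The direction of the fix described in this commit (report the real address length rather than echoing the caller's value) can be sketched in user space. The helper name sketch_set_peer_name() is hypothetical and this is not the kernel's rds_recvmsg() code; it only illustrates the rule.

```c
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Hypothetical helper: hand the peer address to the caller and always
 * set msg_namelen to the number of bytes that are actually valid.
 */
static void sketch_set_peer_name(struct msghdr *msg,
				 const struct sockaddr_in *peer)
{
	socklen_t room = msg->msg_namelen;
	socklen_t len = sizeof(*peer);

	if (msg->msg_name == NULL)
		return;

	/* Never copy more than one address, whatever length was claimed. */
	memcpy(msg->msg_name, peer, room < len ? room : len);

	/* Report the real address length, not the value passed in. */
	msg->msg_namelen = len;
}
```

With this rule, a caller that passes sizeof(fromAddr) + 16 as in rds_server.c gets sizeof(struct sockaddr_in) back in msg_namelen, and nothing beyond the address is written.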
    • openvswitch: potential NULL deref in sample() · 5b3e7e6c
      Dan Carpenter authored
      If there is no OVS_SAMPLE_ATTR_ACTIONS set then "acts_list" is NULL and
      it leads to a NULL dereference when we call nla_len(acts_list).  This
      is a static checker fix, not something I have seen in testing.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: don't drop MTU reduction indications · 563d34d0
      Eric Dumazet authored
      ICMP messages generated in the output path when the frame length is
      bigger than the MTU are actually lost, because the socket is owned by
      the user (doing the xmit).
      
      One example is the ipgre_tunnel_xmit() calling
      icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));
      
      We had a similar case fixed in commit a34a101e (ipv6: disable GSO on
      sockets hitting dst_allfrag).
      
      The problem with such a fix is that it relied on retransmit timers, so
      short TCP sessions paid too high a price in added latency.
      
      This patch uses the tcp_release_cb() infrastructure so that MTU
      reduction messages (ICMP messages) are not lost, and no extra delay
      is added in TCP transmits.
      Reported-by: Maciej Żenczykowski <maze@google.com>
      Diagnosed-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Tore Anderson <tore@fud.no>
      Signed-off-by: David S. Miller <davem@davemloft.net>
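The deferral idea in this commit (remember the MTU indication while the socket is owned, act on it at release time) can be sketched as follows. Every name here is a hypothetical user-space stand-in; the real patch uses the tcp_release_cb() infrastructure, whose details are not reproduced.

```c
#include <stdbool.h>

#define SKETCH_MTU_REDUCED_DEFERRED 0x1u	/* hypothetical flag bit */

struct sketch_sock {
	bool owned_by_user;	/* a sender currently holds the socket */
	unsigned int deferred;	/* work to run when the owner releases it */
	unsigned int mtu;	/* path MTU currently in use */
	unsigned int mtu_info;	/* MTU reported by the latest ICMP message */
};

/* ICMP "fragmentation needed" handler: never drop the indication. */
static void sketch_frag_needed(struct sketch_sock *sk, unsigned int new_mtu)
{
	sk->mtu_info = new_mtu;
	if (sk->owned_by_user)
		sk->deferred |= SKETCH_MTU_REDUCED_DEFERRED;	/* handle later */
	else
		sk->mtu = new_mtu;				/* handle now */
}

/* Release callback: run whatever was deferred while the socket was owned. */
static void sketch_release(struct sketch_sock *sk)
{
	sk->owned_by_user = false;
	if (sk->deferred & SKETCH_MTU_REDUCED_DEFERRED) {
		sk->deferred &= ~SKETCH_MTU_REDUCED_DEFERRED;
		sk->mtu = sk->mtu_info;
	}
}
```

Because the update runs as soon as the owner releases the socket, no retransmit timer is needed to notice the smaller MTU.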
    • bnx2x: Add new 57840 device IDs · c3def943
      Yuval Mintz authored
      The 57840 boards come in two flavours: 2 x 20G and 4 x 10G.
      To better differentiate between the two flavours, a separate device ID
      was assigned to each.
      The silicon default value is still the currently supported 57840 device ID
      (0x168d), and since a user can damage the nvram (e.g., 'ethtool -E')
      the driver will still support this device ID to allow the user to amend the
      nvram back into a supported configuration.
      
      Notice this patch contains lines longer than 80 characters (strings).
      Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
      Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: avoid oops in tcp_metrics and reset tcpm_stamp · 9a0a9502
      Julian Anastasov authored
      In tcp_tw_remember_stamp we incorrectly checked tw instead of tm,
      which can lead to an oops if the cached entry is not found.

      tcpm_stamp was not updated in tcpm_check_stamp when tcpm_suck_dst
      was called. Move the update into tcpm_suck_dst, so that we do not
      call it infinitely on every next cache hit after TCP_METRICS_TIMEOUT.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • niu: Change niu_rbr_fill() to use unlikely() to check niu_rbr_add_page() return value · 9b70749e
      Shuah Khan authored
      Change niu_rbr_fill() to use unlikely() to check niu_rbr_add_page() return
      value to be consistent with the rest of the checks after niu_rbr_add_page()
      calls in this file.
      Signed-off-by: Shuah Khan <shuah.khan@hp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • niu: Fix to check for dma mapping errors. · ec2deec1
      Shuah Khan authored
      Fix Neptune ethernet driver to check dma mapping error after map_page()
      interface returns.
      Signed-off-by: Shuah Khan <shuah.khan@hp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Fix references to out-of-scope variables in put_cmsg_compat() · 81881047
      Jesper Juhl authored
      In net/compat.c::put_cmsg_compat() we may assign 'data' the address of
      either the 'ctv' or 'cts' local variables inside the 'if
      (!COMPAT_USE_64BIT_TIME)' branch.
      
      Those variables go out of scope at the end of the 'if' statement, so
      when we use 'data' further down in 'copy_to_user(CMSG_COMPAT_DATA(cm),
      data, cmlen - sizeof(struct compat_cmsghdr))' there's no telling what
      it may be referring to - not good.
      
      Fix the problem by simply giving 'ctv' and 'cts' function scope.
      Signed-off-by: Jesper Juhl <jj@chaosbits.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
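The scoping problem fixed by this commit can be shown with a simplified user-space sketch. The type and function names are hypothetical and the logic is reduced to the essentials; it is not the code in net/compat.c.

```c
#include <stddef.h>
#include <string.h>

struct sketch_compat_timeval { int tv_sec; int tv_usec; };

/* Copies a timestamp to 'out' in either native or compat layout. */
static size_t sketch_put_timestamp(int use_compat, long sec, long usec,
				   void *out)
{
	/* Function scope: stays valid until the function returns. */
	struct sketch_compat_timeval ctv;
	long native[2];
	const void *data = native;
	size_t len = sizeof(native);

	native[0] = sec;
	native[1] = usec;

	if (use_compat) {
		/* Declaring 'ctv' inside this block would leave 'data' dangling. */
		ctv.tv_sec = (int)sec;
		ctv.tv_usec = (int)usec;
		data = &ctv;
		len = sizeof(ctv);
	}

	memcpy(out, data, len);	/* 'data' is used after the 'if' has ended */
	return len;
}
```

Hoisting the declaration costs nothing at run time; it only extends the lifetime of the object that 'data' may point to.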
    • Merge branch 'kill_rtcache' · 5e9965c1
      David S. Miller authored
      The ipv4 routing cache is non-deterministic, performance wise, and is
      subject to reasonably easy to launch denial of service attacks.
      
      The routing cache works great for well behaved traffic, and the world
      was a much friendlier place when the tradeoffs that led to the routing
      cache's design were considered.
      
      What it boils down to is that the performance of the routing cache is
      a product of the traffic patterns seen by a system rather than being a
      product of the contents of the routing tables.  The former is
      controllable by external entities.
      
      Even for "well behaved" legitimate traffic, high volume sites can see
      hit rates in the routing cache of only ~10%.
      
      The general flow of this patch series is that first the routing cache
      is removed.  We build a completely new rtable entry every lookup
      request.
      
      Next we make some simplifications due to the fact that removing the
      routing cache causes several members of struct rtable to become no
      longer necessary.
      
      Then we need to make some amends such that we can legally cache
      pre-constructed routes in the FIB nexthops.  Firstly, we need to
      invalidate routes which are hit with nexthop exceptions.  Secondly we
      have to change the semantics of rt->rt_gateway such that zero means
      that the destination is on-link and non-zero otherwise.
      
      Now that the preparations are ready, we start caching precomputed
      routes in the FIB nexthops.  Output and input routes need different
      kinds of care when determining if we can legally do such caching or
      not.  The details are in the commit log messages for those changes.
      
      The patch series then winds down with some more struct rtable
      simplifications and other tidy ups that remove unnecessary overhead.
      
      On a SPARC-T3 output route lookups are ~876 cycles.  Input route
      lookups are ~1169 cycles with rpfilter disabled, and about ~1468
      cycles with rpfilter enabled.
      
      These measurements were taken with the kbench_mod test module in the
      net_test_tools GIT tree:
      
      git://git.kernel.org/pub/scm/linux/kernel/git/davem/net_test_tools.git
      
      That GIT tree also includes a udpflood tester tool, which stresses
      route lookups on packet output.
      
      For example, on the same SPARC-T3 system we can run:
      
      	time ./udpflood -l 10000000 10.2.2.11
      
      with routing cache:
      real    1m21.955s       user    0m6.530s        sys     1m15.390s
      
      without routing cache:
      real    1m31.678s       user    0m6.520s        sys     1m25.140s
      
      Performance undoubtedly can easily be improved further.
      
      For example fib_table_lookup() performs a lot of excessive
      computations with all the masking and shifting, some of it
      conditionalized to deal with edge cases.
      
      Also, Eric's no-ref optimization for input route lookups can be
      re-instated for the FIB nexthop caching code path.  I would be really
      pleased if someone would work on that.
      
      In fact anyone suitably motivated can just fire up perf on the loading
      of the net_test_tools benchmark kernel module.  I spend much of
      my time going:
      
      bash# perf record insmod ./kbench_mod.ko dst=172.30.42.22 src=74.128.0.1 iif=2
      bash# perf report
      
      Thanks to helpful feedback from Joe Perches, Eric Dumazet, Ben
      Hutchings, and others.
      Signed-off-by: David S. Miller <davem@davemloft.net>
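The new rt->rt_gateway semantics mentioned in the merge message (zero means the destination is on-link) imply a simple rule for picking the next hop. A minimal sketch, using a hypothetical one-field mock struct and helper name rather than the kernel's struct rtable:

```c
#include <stdint.h>

struct sketch_route { uint32_t rt_gateway; };	/* mock of one field only */

/* Zero rt_gateway: destination is on-link, so the next hop is daddr itself. */
static uint32_t sketch_nexthop(const struct sketch_route *rt, uint32_t daddr)
{
	return rt->rt_gateway ? rt->rt_gateway : daddr;
}
```

Since an on-link route no longer stores the destination, one such route can be shared by lookups for different destinations, which is what makes caching it in the FIB nexthop legal.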
  2. 22 Jul, 2012 22 commits
  3. 21 Jul, 2012 1 commit