1. 14 May, 2014 24 commits
    • David S. Miller's avatar
      Merge branch 'mlx4-next' · 005e35f5
      David S. Miller authored
      Or Gerlitz says:
      
      ====================
      Mellanox driver update 2014-05-12
      
      This patchset introduces some small bug fixes:
      
      Eyal fixed some compilation and syntactic checker warnings. Ido fixed a
      corruption in user priority mapping when changing the number of channels. Shani
      fixed some other problems when modifying the MAC address. Yuval fixed a problem
      when changing IRQ affinity during high traffic: IRQ changes took effect
      only after the first pause in traffic.
      
      This patchset was tested and applied over commit 93dccc59: "mdio_bus: fix
      devm_mdiobus_alloc_size export"
      
      Changes from V1:
      - applied feedback from Dave to use true/false and not 0/1 in patch 1/9
      - removed the patch from Noa which addressed a bug in the flow steering table
        when using a bond device, as the fix might need to be in the bonding driver;
        this is now discussed in the netdev thread "bonding directly changes
        underlying device address"
      
      Changes from V0:
      - Patch 1/9 - net/mlx4_core: Enforce irq affinity changes immediatly
        - Moved the new members to a hot cache line as Eric suggested
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Eyal Perry's avatar
      net/mlx4_core: Fix inaccurate return value of mlx4_flow_attach() · 75720384
      Eyal Perry authored
      Adopt the "info: why not propagate 'ret' from parse_trans_rule()..."
      suggestion made by the smatch semantic checker on:
      drivers/net/ethernet/mellanox/mlx4/mcg.c:867 mlx4_flow_attach()
      Signed-off-by: Eyal Perry <eyalpe@mellanox.com>
      Signed-off-by: Amir Vadai <amirv@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Eyal Perry's avatar
      net/mlx4_en: Using positive error value for unsigned · c3ca5205
      Eyal Perry authored
      Use a positive value for the error, MLX4_NET_TRANS_RULE_NUM, instead
      of -EPROTONOSUPPORT, to remove a compilation warning.
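      The underlying problem is that a function returning an unsigned type
      cannot carry a negative errno. A minimal sketch of the pattern follows;
      the enum, the function names and the protocol numbers are hypothetical
      and are not taken from the driver. The enum's terminal count value
      serves as the error marker, in the way the commit uses
      MLX4_NET_TRANS_RULE_NUM.

```c
#include <errno.h>

/* Hypothetical rule IDs; RULE_NUM doubles as the "unsupported" marker. */
enum rule_id { RULE_ETH, RULE_IPV4, RULE_TCP, RULE_NUM };

/* Returns an unsigned ID. A negative errno returned here would wrap to a
 * large positive value, which is the kind of thing compilers warn about. */
static unsigned int map_proto_to_rule(int proto)
{
    switch (proto) {
    case 0:
        return RULE_ETH;
    case 4:
        return RULE_IPV4;
    case 6:
        return RULE_TCP;
    default:
        return RULE_NUM; /* positive error marker */
    }
}

/* Callers translate the marker back into a signed errno. */
static int lookup_rule(int proto, unsigned int *out)
{
    unsigned int id = map_proto_to_rule(proto);

    if (id == RULE_NUM)
        return -EPROTONOSUPPORT;
    *out = id;
    return 0;
}
```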
      Signed-off-by: Eyal Perry <eyalpe@mellanox.com>
      Signed-off-by: Amir Vadai <amirv@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Shani Michaelli's avatar
      net/mlx4_en: Protect MAC address modification with the state_lock mutex · fe1ff29d
      Shani Michaelli authored
      This patch solves an issue that could arise when modifying the
      device's MAC. It occurs due to simultaneous access to priv->mac_hash
      from two contexts. The buggy scenario is described below:
      Context 1: copy the new mac address to the dev->dev_addr field.
      Context 2: mlx4_en_do_uc_filter removes prev_mac entry from the mac_hash
                 db since it is not in dev->uc and not equal to dev->dev_addr.
      Context 1: mlx4_en_do_set_mac() calls mlx4_en_replace_mac() to replace
                 prev_mac with dev_addr but it fails to update the mac_hash db
                 since it no longer contains prev_mac, therefore it returns
                 with an error.
      
      The fix is to prevent mlx4_en_do_uc_filter from being executed between
      the two context 1 steps described above. This is done by putting them
      both under the mdev->state_lock lock, which solves the issue since
      mlx4_en_do_uc_filter is already protected by mdev->state_lock.
      Reviewed-by: Eyal Perry <eyalpe@mellanox.com>
      Signed-off-by: Shani Michaeli <shanim@mellanox.com>
      Signed-off-by: Amir Vadai <amirv@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Eyal Perry's avatar
      net/mlx4_core: Removed unnecessary bit operation condition · 483e0132
      Eyal Perry authored
      Fix the "warn: suspicious bitop condition" reported by the smatch semantic
      checker on:
      drivers/net/ethernet/mellanox/mlx4/main.c:509 mlx4_slave_cap()
      Signed-off-by: Eyal Perry <eyalpe@mellanox.com>
      Signed-off-by: Amir Vadai <amirv@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Eyal Perry's avatar
      net/mlx4_core: Fix smatch error - possible access to a null variable · c05a116f
      Eyal Perry authored
      Fix the "error: we previously assumed 'out_param' could be null" found
      by smatch semantic checker on:
      drivers/net/ethernet/mellanox/mlx4/cmd.c:506 mlx4_cmd_poll()
      drivers/net/ethernet/mellanox/mlx4/cmd.c:578 mlx4_cmd_wait()
      Signed-off-by: Eyal Perry <eyalpe@mellanox.com>
      Signed-off-by: Amir Vadai <amirv@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Shani Michaelli's avatar
      net/mlx4_en: Fix errors in MAC address changing when port is down · ee755324
      Shani Michaelli authored
      This patch fixes an issue that happens when changing the MAC address while
      the port is down, described as follows:
      1. Set the port down.
      2. Change the MAC address - mlx4_en_set_mac() will change dev->dev_addr.
      3. Set the port up - will result in mlx4_en_do_uc_filter that will
         remove the prev_mac entry from the mac_hash db.
      4. Changing the MAC address again will eventually trigger the call to
         mlx4_en_replace_mac() in order to replace prev_mac with dev_addr, but
         the prev_mac entry no longer exists in the mac_hash db, therefore
         the operation fails.
      
      The fix is to set prev_mac to the new MAC address, so that in step 3
      above, after setting the port up, mlx4_en_get_qp() updates the
      mac_hash with the entry of dev_addr, which is equal to prev_mac.
      Therefore in step 4, when calling mlx4_en_replace_mac, the entry related
      to prev_mac exists in mac_hash and the replace operation succeeds.
      Reviewed-by: Eyal Perry <eyalpe@mellanox.com>
      Signed-off-by: Shani Michaeli <shanim@mellanox.com>
      Signed-off-by: Amir Vadai <amirv@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Ido Shamay's avatar
      net/mlx4_en: User prio mapping gets corrupted when changing number of channels · f5b6345b
      Ido Shamay authored
      When using ethtool set_channels, mlx4_en_setup_tc() is always called, even
      when it was not configured. Fix the code to call mlx4_en_setup_tc() only
      if needed.
      Signed-off-by: Ido Shamay <idos@mellanox.com>
      Signed-off-by: Amir Vadai <amirv@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Yuval Atias's avatar
      net/mlx4_core: Enforce irq affinity changes immediatly · 2eacc23c
      Yuval Atias authored
      During heavy traffic, napi is constantly polling the completion queue
      and no interrupt is fired. Because of that, changes to irq affinity are
      ignored until traffic is stopped and resumed.
      
      By registering to the irq notifier mechanism, and forcing an interrupt when
      affinity is changed, irq affinity changes will be enforced immediately.
      Signed-off-by: Yuval Atias <yuvala@mellanox.com>
      Signed-off-by: Amir Vadai <amirv@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dingtianhong's avatar
      macvlan: Propagate lowerdev MTU changes · 3763e7ef
      dingtianhong authored
      When the physical MTU changes we should ensure that the MTU of every
      existing MACVLAN dev does not exceed the new lowerdev MTU. This patch adds
      that propagation.
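      The propagation rule can be sketched in plain C as below. This is a
      simplified userspace model with hypothetical names (dev_sketch,
      propagate_lower_mtu), not the macvlan code itself: each upper device
      whose MTU exceeds the new lower-device MTU is clamped, and the others
      are left alone.

```c
#include <stddef.h>

/* Hypothetical stand-in for a MACVLAN device; only the MTU is modelled. */
struct dev_sketch {
    unsigned int mtu;
};

/* Clamp every upper device that exceeds the new lower-device MTU. */
static void propagate_lower_mtu(unsigned int lower_mtu,
                                struct dev_sketch *uppers, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (uppers[i].mtu <= lower_mtu)
            continue; /* already within the new limit */
        uppers[i].mtu = lower_mtu;
    }
}
```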
      Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
      Reviewed-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • wangweidong's avatar
      dccp: make the request_retries minimum is 1 · 8ba7e7bf
      wangweidong authored
      Documentation/networking/dccp.txt states that request_retries
      should be greater than 0. So set extra1 to &one instead
      of &zero.
      Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • WANG Cong's avatar
      snmp: fix some left over of snmp stats · c9f2dba6
      WANG Cong authored
      Fengguang reported the following sparse warning:
      
      >> net/ipv6/proc.c:198:41: sparse: incorrect type in argument 1 (different address spaces)
         net/ipv6/proc.c:198:41:    expected void [noderef] <asn:3>*mib
         net/ipv6/proc.c:198:41:    got void [noderef] <asn:3>**pcpumib
      
      Fixes: commit 698365fa (net: clean up snmp stats code)
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • WANG Cong's avatar
      ipv4: make ip_local_reserved_ports per netns · 122ff243
      WANG Cong authored
      ip_local_port_range is already per netns, so ip_local_reserved_ports should
      be too. And since it is empty by default we don't actually need it when
      CONFIG_SYSCTL is not enabled.
      
      By the way, rename inet_is_reserved_local_port() to inet_is_local_reserved_port()
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Laurent Pinchart's avatar
      irda: sh_irda: Enable driver compilation with COMPILE_TEST · 9cc5e36d
      Laurent Pinchart authored
      This helps increase build testing coverage.
      Signed-off-by: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
      Acked-by: Simon Horman <horms@verge.net.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'tipc-next' · d6cc76d3
      David S. Miller authored
      Jon Maloy says:
      
      ====================
      tipc: bug fixes and improvements
      
      Intensive and extensive testing has revealed some rather infrequent
      problems related to flow control, buffer handling and link
      establishment. Commits #1 to #4 deal with these problems.
      
      The remaining four commits are just code improvements, aiming at
      making the code more comprehensible and maintainable. There are
      no functional enhancements in this series.
      
      v2: Fixed a typo in commit log #2. Otherwise no changes from v1.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jon Paul Maloy's avatar
      tipc: merge port message reception into socket reception function · 9816f061
      Jon Paul Maloy authored
      In order to reduce complexity and save a call level during message
      reception at port/socket level, we remove the function tipc_port_rcv()
      and merge its functionality into tipc_sk_rcv().
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jon Paul Maloy's avatar
      tipc: clean up neigbor discovery message reception · c82910e2
      Jon Paul Maloy authored
      The function tipc_disc_rcv(), which is handling received neighbor
      discovery messages, is perceived as messy, and it is hard to verify
      its correctness by code inspection. The fact that the task it is set
      to resolve is fairly complex does not make the situation better.
      
      In this commit we try to take a more systematic approach to the
      problem. We define a decision machine which takes three state flags
      as input, and produces three action flags as output. We then walk
      through all permutations of the state flags, and for each of them we
      describe verbally what is going on and set zero or more of
      the action flags. The action flags indicate what should be done once
      the decision machine has finished its job, while the last part of the
      function deals with performing those actions.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jon Paul Maloy's avatar
      tipc: improve and extend media address conversion functions · 38504c28
      Jon Paul Maloy authored
      TIPC currently handles two media specific addresses: Ethernet MAC
      addresses and InfiniBand addresses. Those are kept in three different
      formats:
      
      1) A "raw" format as obtained from the device. This format is known
         only by the media specific adapter code in eth_media.c and
         ib_media.c.
      2) A "generic" internal format, in the form of struct tipc_media_addr,
         which can be referenced and passed around by the generic media-
         unaware code.
      3) A serialized version of the latter, to be conveyed in neighbor
         discovery messages.
      
      Conversion between the three formats can only be done by the media
      specific code, so we have function pointers for this purpose in
      struct tipc_media. Here, the media adapters can install their own
      conversion functions at startup.
      
      We now introduce a new such function, 'raw2addr()', whose purpose
      is to convert from format 1 to format 2 above. We also try, as far
      as possible, to make the commenting, variable names and usage of these
      functions uniform, with the purpose of making them more comprehensible.
      
      We can now also remove the function tipc_l2_media_addr_set(), whose
      job is done better by the new function.
      
      Finally, we expand the field for serialized addresses (format 3)
      in discovery messages from 20 to 32 bytes. This is permitted
      according to the spec, and reduces the risk of problems when we
      add new media in the future.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jon Paul Maloy's avatar
      tipc: rename and move message reassembly function · 37e22164
      Jon Paul Maloy authored
      The function tipc_link_frag_rcv() is in reality a re-entrant generic
      message reassembly function that has nothing in particular to do with
      the link, where it is defined now. This becomes obvious when we see
      the need to call the function from other places in the code.
      
      In this commit we rename it to tipc_buf_append() and move it to the file
      msg.c. We also simplify its signature by moving the tail pointer to
      the control block of the head buffer, hence making the head buffer
      self-contained.
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jon Paul Maloy's avatar
      tipc: mark head of reassembly buffer as non-linear · 5074ab89
      Jon Paul Maloy authored
      The message reassembly function does not update the 'len' and 'data_len'
      fields of the head skbuff correctly when fragments are chained to it.
      This may sometimes lead to obscure errors, such as fragment reordering
      when we receive fragments which are cloned buffers.
      
      This commit fixes this, by ensuring that the two fields are updated
      correctly.
      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jon Paul Maloy's avatar
      tipc: don't record link RESET or ACTIVATE messages as traffic · ec37dcd3
      Jon Paul Maloy authored
      In the current code, all incoming LINK_PROTOCOL messages, irrespective
      of type, nudge the "last message received" checkpoint, informing the
      link state machine that a message was received from the peer since last
      supervision timeout event. This inhibits the link from starting probing
      the peer unnecessarily.
      
      However, not only STATE messages are recorded as legitimate incoming
      traffic this way, but even RESET and ACTIVATE messages, which in
      reality are there to inform the link that the peer endpoint has been
      reset. At the same time, some RESET messages may be dropped instead
      of causing a link reset. This happens when the link endpoint thinks
      it is fully up and working, and the session number of the RESET is
      lower than or equal to the current link session. In such cases the
      RESET is perceived as a delayed remnant from an earlier session, or
      the current one, and dropped.
      
      Now, if a TIPC module is removed and then immediately reinserted, e.g.
      when using a script, RESET messages may arrive at the peer link endpoint
      before this one has had time to discover the failure. The RESET may be
      dropped because of the session number, but only after it has been
      recorded as a legitimate traffic event. Hence, the receiving link will
      not start probing, and not discover that the peer endpoint is down, at
      the same time ignoring the periodic RESET messages coming from that
      endpoint. We have ended up in a stale state where a failed link cannot
      be re-established.
      
      In this commit, we remedy this by nudging the checkpoint only for
      received STATE messages, not for RESET or ACTIVATE messages.
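      The effect of the change can be illustrated with a deliberately
      simplified model. The types and names below (link_sketch,
      on_protocol_msg, timeout_should_probe) are hypothetical and the
      checkpoint is modelled as a plain counter, which is an assumption
      rather than TIPC's actual bookkeeping. Only STATE messages count as
      evidence that the peer is alive, so a stream of RESET messages no
      longer suppresses probing.

```c
#include <stdbool.h>

/* Hypothetical message types; not TIPC's definitions. */
enum proto_msg_type { MSG_STATE, MSG_RESET, MSG_ACTIVATE };

struct link_sketch {
    unsigned int rcv_count;  /* messages counted as live traffic */
    unsigned int checkpoint; /* rcv_count as seen at the last timeout */
};

/* Only STATE messages are recorded as legitimate incoming traffic. */
static void on_protocol_msg(struct link_sketch *l, enum proto_msg_type t)
{
    if (t == MSG_STATE)
        l->rcv_count++;
}

/* At each supervision timeout: probe if nothing was counted since the
 * previous timeout, then move the checkpoint forward. */
static bool timeout_should_probe(struct link_sketch *l)
{
    bool silent = (l->rcv_count == l->checkpoint);

    l->checkpoint = l->rcv_count;
    return silent;
}
```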
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jon Paul Maloy's avatar
      tipc: compensate for double accounting in socket rcv buffer · 4f4482dc
      Jon Paul Maloy authored
      The function net/core/sock.c::__release_sock() runs a tight loop
      to move buffers from the socket backlog queue to the receive queue.
      
      As a security measure, sk_backlog.len of the receiving socket
      is not set to zero until after the loop is finished, i.e., until
      the whole backlog queue has been transferred to the receive queue.
      During this transfer, the data that has already been moved is counted
      both in the backlog queue and the receive queue, hence giving an
      incorrect picture of the available queue space for new arriving buffers.
      
      This leads to unnecessary rejection of buffers by sk_add_backlog(),
      which in TIPC leads to unnecessarily broken connections.
      
      In this commit, we compensate for this double accounting by adding
      a counter that keeps track of it. The function socket.c::backlog_rcv()
      receives buffers one by one from __release_sock(), and adds them to the
      socket receive queue. If the transfer is successful, it increases a new
      atomic counter 'tipc_sock::dupl_rcvcnt' with 'truesize' of the
      transferred buffer. If a new buffer arrives during this transfer and
      finds the socket busy (owned), we attempt to add it to the backlog.
      However, when sk_add_backlog() is called, we adjust the 'limit'
      parameter with the value of the new counter, so that the risk of
      inadvertent rejection is eliminated.
      
      It should be noted that this change does not invalidate the original
      purpose of zeroing 'sk_backlog.len' after the full transfer. We set an
      upper limit for dupl_rcvcnt, so that if a 'wild' sender (i.e., one that
      doesn't respect the send window) keeps pumping in buffers to
      sk_add_backlog(), he will eventually reach an upper limit,
      (2 x TIPC_CONN_OVERLOAD_LIMIT). After that, no messages can be added
      to the backlog, and the connection will be broken. Ordinary, well-
      behaved senders will never reach this buffer limit at all.
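      
      The compensation described above can be modelled in userspace roughly
      as follows. This is a sketch under stated assumptions: sock_sketch,
      backlog_moved(), backlog_add() and the numeric CONN_OVERLOAD_LIMIT are
      hypothetical stand-ins, and the real code uses an atomic counter and
      sk_add_backlog() rather than these helpers.

```c
#include <stdbool.h>

#define CONN_OVERLOAD_LIMIT 1000u /* hypothetical base limit, in bytes */

struct sock_sketch {
    unsigned int backlog_len; /* not reduced until the whole transfer ends */
    unsigned int rcv_len;     /* bytes in the receive queue */
    unsigned int dupl_rcvcnt; /* bytes currently counted in both queues */
};

/* One buffer moved from the backlog to the receive queue. */
static void backlog_moved(struct sock_sketch *sk, unsigned int truesize)
{
    sk->rcv_len += truesize;
    /* Stop growing once the counter has reached the base limit, so the
     * effective limit stays bounded at roughly twice the base limit. */
    if (sk->dupl_rcvcnt < CONN_OVERLOAD_LIMIT)
        sk->dupl_rcvcnt += truesize;
}

/* A new buffer arrives while the socket is busy: try the backlog. */
static bool backlog_add(struct sock_sketch *sk, unsigned int truesize)
{
    unsigned int limit = CONN_OVERLOAD_LIMIT + sk->dupl_rcvcnt;

    if (sk->backlog_len + sk->rcv_len + truesize > limit)
        return false; /* rejected */
    sk->backlog_len += truesize;
    return true;
}
```

      With 600 bytes in the backlog and 300 of them already moved, a new
      300-byte buffer is accepted because the limit has been raised by the
      300 double-counted bytes; without the counter it would be rejected.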
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jon Paul Maloy's avatar
      tipc: decrease connection flow control window · 6163a194
      Jon Paul Maloy authored
      Memory overhead when allocating big buffers for data transfer may
      be quite significant. E.g., truesize of a 64 KB buffer turns out
      to be 132 KB, 2 x the requested size.
      
      This invalidates the "worst case" calculation we have been
      using to determine the default socket receive buffer limit,
      which is based on the assumption that 1024x64KB = 67MB buffers
      may be queued up on a socket.
      
      Since TIPC connections cannot survive hitting the buffer limit,
      we have to compensate for this overhead.
      
      We do that in this commit by halving the fixed connection flow
      control window from 1024 (2*512) messages to 512 (2*256). Since
      older version nodes send out acks at 512 message intervals,
      compatibility with such nodes is guaranteed, although performance
      may be non-optimal in such cases.
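      
      The arithmetic behind this choice can be checked with a few lines of C.
      The helper name is hypothetical; the sizes are the ones quoted above
      (64 KB requested, 132 KB truesize).

```c
#include <stdint.h>

#define MSG_SIZE_BYTES (64u * 1024u)  /* requested buffer size */
#define TRUESIZE_BYTES (132u * 1024u) /* truesize quoted in this commit */

/* Worst-case number of bytes queued on a socket for a given window. */
static uint64_t worst_case_queue_bytes(uint32_t window_msgs, uint32_t per_msg)
{
    return (uint64_t)window_msgs * per_msg;
}
```

      worst_case_queue_bytes(1024, MSG_SIZE_BYTES) gives 67108864 bytes, the
      67 MB of the old assumption. With the real truesize the same window
      could hold 138412032 bytes, while the halved window of 512 messages
      brings the worst case back to 69206016 bytes.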
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dingtianhong's avatar
      bonding: alloc the structure ad_info dynamically in per slave · 3fdddd85
      dingtianhong authored
      The struct ad_slave_info is very large and is only used in 802.3ad mode,
      so allocating the structure dynamically could save 356 Bits for every slave in
      non 802.3ad mode.
      
      Cc: Jay Vosburgh <j.vosburgh@gmail.com>
      Cc: Veaceslav Falico <vfalico@gmail.com>
      Cc: Andy Gospodarek <andy@greyhouse.net>
      Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
      Acked-by: Veaceslav Falico <vfalico@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 13 May, 2014 16 commits
    • Sergei Shtylyov's avatar
      sh_eth: replace devm_kzalloc() with devm_kmalloc_array() · 86b5d251
      Sergei Shtylyov authored
      When I was converting the driver to the managed device API, only devm_kzalloc()
      was available for memory allocation, so I had to use it, even though zeroing out the
      PHY IRQ array right before initializing all its entries to PHY_POLL was quite
      stupid. Now that devm_kmalloc_array() has become available, we can avoid the
      needless zeroing out...
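      A userspace analogue of the same idea, as a sketch only: malloc_array()
      and alloc_phy_irqs() are hypothetical helpers, and the two macros are
      stand-ins for the kernel's PHY_MAX_ADDR and PHY_POLL. The array is
      allocated with an overflow check but without zeroing, because every
      entry is overwritten immediately afterwards.

```c
#include <stdint.h>
#include <stdlib.h>

#define PHY_MAX_ADDR_SKETCH 32 /* stand-in for the kernel's PHY_MAX_ADDR */
#define PHY_POLL_SKETCH (-1)   /* stand-in for the kernel's PHY_POLL */

/* Overflow-checked array allocation that does not zero the memory. */
static void *malloc_array(size_t n, size_t size)
{
    if (size != 0 && n > SIZE_MAX / size)
        return NULL;
    return malloc(n * size);
}

/* Allocate the IRQ table and set every entry to "poll". */
static int *alloc_phy_irqs(void)
{
    int *irq = malloc_array(PHY_MAX_ADDR_SKETCH, sizeof(*irq));
    int i;

    if (!irq)
        return NULL;
    for (i = 0; i < PHY_MAX_ADDR_SKETCH; i++)
        irq[i] = PHY_POLL_SKETCH;
    return irq;
}
```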
      Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'tg3-next' · 1e1c77bf
      David S. Miller authored
      Michael Chan says:
      
      ====================
      tg3: TSO related enhancements to prevent memory allocation failure
      
      Michael Chan (3):
        tg3: Don't modify ip header fields when doing GSO
        tg3: Prevent page allocation failure during TSO workaround
        tg3: Update copyright and version to 3.137
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Michael Chan's avatar
      de750e4c
    • Michael Chan's avatar
      tg3: Prevent page allocation failure during TSO workaround · d3f6f3a1
      Michael Chan authored
      If any TSO fragment hits hardware bug conditions (e.g. 4G boundary), the
      driver will work around it by calling skb_copy() to copy to a linear SKB.  Users
      have reported page allocation failures as the TSO packet can be up to 64K.
      Copying such a large packet is also very inefficient.  We fix this by using
      existing tg3_tso_bug() to transmit the packet using GSO.
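      
      As an illustration of the "4G boundary" condition mentioned above, a
      DMA region crosses a 4 GB boundary when the low 32 bits of its address
      wrap around before the region ends. The helper below is a hypothetical
      sketch of that test only; the driver's own check is not shown in this
      log and may differ in detail.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if [addr, addr + len) spans a 4 GB boundary, i.e. the low 32 bits
 * of the last byte's address are smaller than those of the first byte. */
static bool crosses_4g_boundary(uint64_t addr, uint32_t len)
{
    uint32_t low = (uint32_t)addr;

    return len != 0 && (uint32_t)(low + len - 1) < low;
}
```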
      Signed-off-by: Prashant Sreedharan <prashant@broadcom.com>
      Signed-off-by: Michael Chan <mchan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Michael Chan's avatar
      tg3: Don't modify ip header fields when doing GSO · d71c0dc4
      Michael Chan authored
      tg3 uses GSO as a workaround if the hardware cannot perform TSO on certain
      packets.  We should not modify the ip header fields if we do GSO on the
      packet.  It happens to work by accident because GSO recalculates the IP
      checksum and IP total length.
      
      Also fix the tg3_start_xmit comment to reflect that this is the only
      xmit function for all devices.
      Signed-off-by: Prashant Sreedharan <prashant@broadcom.com>
      Signed-off-by: Michael Chan <mchan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'inet_fwmark_reflect' · b6bd26c4
      David S. Miller authored
      Lorenzo Colitti says:
      
      ====================
      Make mark-based routing work better with multiple separate networks.
      
      Mark-based routing (ip rule fwmark 17 lookup 100) combined with
      either iptables marking (iptables -j MARK --set-mark 17) or
      application-based marking (the SO_MARK setsockopt) is a good
      way to deal with connecting simultaneously to multiple networks.
      
      Each network can be given a routing table, and ip rules can
      be configured to make different fwmarks select different
      networks. Applications can select networks by setting
      appropriate socket marks, and iptables rules can be used to
      handle non-aware applications, enforce policy, etc.
      
      This patch series improves functionality when mark-based routing
      is used in this way. Current behaviour has the following
      limitations:
      
      1. Kernel-originated replies that are not associated with a
         socket always use a mark of zero. This means that, for
         example, when the kernel sends a ping reply or a TCP reset,
         it does not send it on the network from which it received the
         original packet.
      2. Path MTU discovery, which is triggered by incoming packets,
         does not always work correctly, because the routing lookups it
         uses to clone routes do not take the fwmark into account and
         thus can happen in the wrong routing table.
      3. Application-based marking works well for outbound connections,
         but does not work well for incoming connections. Marking a
         listening socket causes that socket to only accept
         connections from a given network, and sockets that are
         returned by accept() are not marked (and are thus not routed
         correctly).
      
      sysctl. This causes route lookups for kernel-generated replies
      and PMTUD to use the fwmark of the packet that caused them.
      
      which causes TCP sockets returned by accept() to be marked with
      the same mark as the initial SYN packet.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Lorenzo Colitti's avatar
      net: support marking accepting TCP sockets · 84f39b08
      Lorenzo Colitti authored
      When using mark-based routing, sockets returned from accept()
      may need to be marked differently depending on the incoming
      connection request.
      
      This is the case, for example, if different socket marks identify
      different networks: a listening socket may want to accept
      connections from all networks, but each connection should be
      marked with the network that the request came in on, so that
      subsequent packets are sent on the correct network.
      
      This patch adds a sysctl to mark TCP sockets based on the fwmark
      of the incoming SYN packet. If enabled, and an unmarked socket
      receives a SYN, then the SYN packet's fwmark is written to the
      connection's inet_request_sock, and later written back to the
      accepted socket when the connection is established.  If the
      socket already has a nonzero mark, then the behaviour is the same
      as it is today, i.e., the listening socket's fwmark is used.
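      
      The selection rule in the paragraph above can be written as a small
      pure function. This is a sketch with a hypothetical name and signature
      (request_sock_mark); it models only the decision, not the kernel data
      structures involved.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mark to store in the request sock for an incoming SYN:
 * - listener unmarked and sysctl enabled: take the SYN packet's fwmark
 * - otherwise: keep the listening socket's mark (today's behaviour) */
static uint32_t request_sock_mark(bool fwmark_accept, uint32_t listener_mark,
                                  uint32_t syn_mark)
{
    if (fwmark_accept && listener_mark == 0)
        return syn_mark;
    return listener_mark;
}
```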
      
      Black-box tested using user-mode linux:
      
      - IPv4/IPv6 SYN+ACK, FIN, etc. packets are routed based on the
        mark of the incoming SYN packet.
      - The socket returned by accept() is marked with the mark of the
        incoming SYN packet.
      - Tested with syncookies=1 and syncookies=2.
      Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Lorenzo Colitti's avatar
      net: Use fwmark reflection in PMTU discovery. · 1b3c61dc
      Lorenzo Colitti authored
      Currently, routing lookups used for Path MTU Discovery in
      absence of a socket or on unmarked sockets use a mark of 0.
      This causes PMTUD not to work when using routing based on
      netfilter fwmark mangling and fwmark ip rules, such as:
      
        iptables -j MARK --set-mark 17
        ip rule add fwmark 17 lookup 100
      
      This patch causes these route lookups to use the fwmark from the
      received ICMP error when the fwmark_reflect sysctl is enabled.
      This allows the administrator to make PMTUD work by configuring
      appropriate fwmark rules to mark the inbound ICMP packets.
      
      Black-box tested using user-mode linux by pointing different
      fwmarks at routing tables egressing on different interfaces, and
      using iptables mangling to mark packets inbound on each interface
      with the interface's fwmark. ICMPv4 and ICMPv6 PMTU discovery
      work as expected when mark reflection is enabled and fail when
      it is disabled.
      Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Lorenzo Colitti's avatar
      net: add a sysctl to reflect the fwmark on replies · e110861f
      Lorenzo Colitti authored
      Kernel-originated IP packets that have no user socket associated
      with them (e.g., ICMP errors and echo replies, TCP RSTs, etc.)
      are emitted with a mark of zero. Add a sysctl to make them have
      the same mark as the packet they are replying to.
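      
      In its simplest form the behaviour amounts to the following choice,
      shown here as a sketch with a hypothetical helper name (reply_mark);
      the kernel's own helper and its placement are not shown in this log.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mark for a kernel-originated reply that has no user socket:
 * reflect the incoming packet's mark only when the sysctl is enabled. */
static uint32_t reply_mark(bool fwmark_reflect, uint32_t incoming_mark)
{
    return fwmark_reflect ? incoming_mark : 0;
}
```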
      
      This allows an administrator that wishes to do so to use
      mark-based routing, firewalling, etc. for these replies by
      marking the original packets inbound.
      
      Tested using user-mode linux:
       - ICMP/ICMPv6 echo replies and errors.
       - TCP RST packets (IPv4 and IPv6).
      Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e110861f
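The rule described in this commit, that replies carry a mark of zero unless the sysctl is enabled, reduces to a single conditional. The following is a minimal sketch; `struct net_model` and `reply_mark()` are hypothetical names introduced for illustration and do not reproduce the kernel's own definitions.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-namespace sysctl setting. */
struct net_model {
    int fwmark_reflect;  /* 0 = replies carry mark 0, non-zero = reflect */
};

/* Mark for a kernel-originated reply (ICMP error, echo reply, TCP RST)
 * to a packet that arrived carrying `incoming_mark`. */
static uint32_t reply_mark(const struct net_model *net, uint32_t incoming_mark)
{
    return net->fwmark_reflect ? incoming_mark : 0;
}
```

Keeping the default at zero preserves the old behaviour for systems that never enable the sysctl.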
    • David S. Miller's avatar
      Merge branch 'arc_emac-next' · 87e067cd
      David S. Miller authored
      Beniamino Galvani says:
      
      ====================
      arc_emac: promiscuous/multicast mode and netpoll support
      
      These patches add support for promiscuous mode, multicast filtering
      and netpoll to the ARC EMAC driver.
      
      They were both tested on a Radxa Rock board, which uses an ARC EMAC IP
      core integrated in the Rockchip RK3188 SoC.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      87e067cd
    • Beniamino Galvani's avatar
      5a45e57a
    • Beniamino Galvani's avatar
      arc_emac: implement promiscuous mode and multicast filtering · 775dd682
      Beniamino Galvani authored
      This patch implements the set_rx_mode function to enable/disable
      promiscuous or all-multicast modes and to update the multicast
      filtering list of the device.
      Signed-off-by: Beniamino Galvani <b.galvani@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      775dd682
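The three cases a set_rx_mode handler typically distinguishes (promiscuous, all-multicast, and a filtered multicast list) can be sketched in user space as follows. This is an illustration under assumptions, not the driver's code: `struct rx_mode` and `compute_rx_mode()` are hypothetical, and the 64-bin hash filter indexed by the top six bits of a little-endian CRC-32 is a common Ethernet MAC scheme assumed here rather than taken from the commit.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_ALEN 6

/* Hypothetical result of an rx-mode update; a real driver would write
 * equivalent values to device registers instead. */
struct rx_mode {
    int promisc;          /* accept every frame */
    uint64_t mc_filter;   /* 64-bin multicast hash filter */
};

/* Bitwise little-endian CRC-32 (polynomial 0xEDB88320), no final inversion. */
static uint32_t crc32_le_bits(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
    }
    return crc;
}

static struct rx_mode compute_rx_mode(int promisc, int allmulti,
                                      const uint8_t (*mc_list)[ETH_ALEN],
                                      size_t mc_count)
{
    struct rx_mode mode = { 0, 0 };

    if (promisc) {                 /* promiscuous: no filtering at all */
        mode.promisc = 1;
        return mode;
    }
    if (allmulti) {                /* all-multicast: open every hash bin */
        mode.mc_filter = ~(uint64_t)0;
        return mode;
    }
    for (size_t i = 0; i < mc_count; i++) {
        /* Top 6 CRC bits pick one of the 64 bins (an assumed mapping). */
        uint32_t bin = crc32_le_bits(mc_list[i], ETH_ALEN) >> 26;
        mode.mc_filter |= (uint64_t)1 << bin;
    }
    return mode;
}
```

A hash filter admits false positives (two addresses can share a bin), so the network stack still has to drop frames for groups it has not joined.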
    • David S. Miller's avatar
      Merge branch 'tcp-fastopen-ipv6' · ae8b42c6
      David S. Miller authored
      Yuchung Cheng says:
      
      ====================
      tcp: IPv6 support for fastopen server
      
      This patch series adds IPv6 support for fastopen server. To minimize
      code duplication in IPv4 and IPv6, the current v4 only code is
      refactored and common code is moved into net/ipv4/tcp_fastopen.c.
      
      Also the current code uses a different function from
      tcp_v4_send_synack() to send the first SYN-ACK in fastopen.
      The new code eliminates this separate function by refactoring the
      child-socket and SYN-ACK creation code.  After this refactoring
      in the first four patches, we can easily add the fastopen code in
      IPv6 by changing corresponding IPv6 functions.
      
      Note that the Fast Open client already supports IPv6. This patch is for
      the server-side (passive open) IPv6 support only.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ae8b42c6
    • Daniel Lee's avatar
      tcp: IPv6 support for fastopen server · 3a19ce0e
      Daniel Lee authored
      After all the preparatory work, supporting IPv6 in Fast Open is now easy.
      We pretty much just mirror the v4 code. The only difference is how we
      generate the Fast Open cookie for IPv6 sockets. Since the Fast Open cookie
      is 128 bits and we use AES-128, we run CBC-MAC over both the source and
      destination IPv6 addresses; the resulting MAC tag is the cookie.
      Signed-off-by: Daniel Lee <longinus00@gmail.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Jerry Chu <hkchu@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3a19ce0e
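The CBC-MAC construction mentioned in this commit can be sketched for the two-block case (source address, then destination address). This is a minimal illustration under assumptions: the zero IV and the function names are choices made here, and `toy_encrypt()` is a placeholder that is NOT AES; a real AES-128 block encryption routine would be passed in its place.

```c
#include <stdint.h>

#define BLOCK_LEN 16  /* 128 bits: one IPv6 address, one AES block, one cookie */

/* Block cipher callback: encrypt one 16-byte block under `key`. */
typedef void (*block_encrypt_fn)(const uint8_t key[BLOCK_LEN],
                                 const uint8_t in[BLOCK_LEN],
                                 uint8_t out[BLOCK_LEN]);

/* CBC-MAC over two blocks with a zero IV: tag = E(E(src) XOR dst). */
static void cbc_mac_two_blocks(block_encrypt_fn encrypt,
                               const uint8_t key[BLOCK_LEN],
                               const uint8_t src[BLOCK_LEN],
                               const uint8_t dst[BLOCK_LEN],
                               uint8_t tag[BLOCK_LEN])
{
    uint8_t state[BLOCK_LEN];

    encrypt(key, src, state);              /* first block, IV = 0 */
    for (int i = 0; i < BLOCK_LEN; i++)
        state[i] ^= dst[i];                /* chain in the second block */
    encrypt(key, state, tag);              /* final ciphertext block is the tag */
}

/* Placeholder "cipher" for demonstration only. It is NOT AES and is not
 * secure; substitute a real AES-128 implementation in practice. */
static void toy_encrypt(const uint8_t key[BLOCK_LEN],
                        const uint8_t in[BLOCK_LEN],
                        uint8_t out[BLOCK_LEN])
{
    for (int i = 0; i < BLOCK_LEN; i++)
        out[i] = (uint8_t)((in[i] ^ key[i]) + 1);
}
```

Because the message length is fixed at exactly two blocks, plain CBC-MAC avoids the length-extension weakness it has on variable-length input.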
    • Yuchung Cheng's avatar
      tcp: improve fastopen icmp handling · 0a672f74
      Yuchung Cheng authored
      If a Fast Open socket has already been accepted by the user, it should
      be treated like a connected socket and the ICMP error recorded in
      sk_err_soft, so the user can fetch it. Do that in both tcp_v4_err
      and tcp_v6_err.
      
      Also refactor the sequence window check to improve readability
      (e.g., there were two local variables named 'req').
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Daniel Lee <longinus00@gmail.com>
      Signed-off-by: Jerry Chu <hkchu@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a672f74
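A sequence window check of the kind this commit refactors can be sketched with wraparound-safe 32-bit arithmetic. This is a sketch under assumptions, not the patched code: `struct conn_model` and `icmp_seq_in_window()` are hypothetical, and choosing the SYN-ACK's initial sequence number as the lower bound while a Fast Open request is pending is an assumption made here for illustration.

```c
#include <stdint.h>

/* True when seq2 <= seq1 <= seq3 in 32-bit sequence space, allowing for
 * wraparound. */
static int seq_between(uint32_t seq1, uint32_t seq2, uint32_t seq3)
{
    return (uint32_t)(seq3 - seq2) >= (uint32_t)(seq1 - seq2);
}

/* Hypothetical condensed view of the fields the check needs. */
struct conn_model {
    int has_fastopen_req;  /* Fast Open request still pending on the socket */
    uint32_t snt_isn;      /* initial sequence number sent in the SYN-ACK */
    uint32_t snd_una;      /* oldest unacknowledged sequence number */
    uint32_t snd_nxt;      /* next sequence number to send */
};

/* An ICMP error is only credible if the sequence number it quotes falls
 * inside the range of data that could be in flight. */
static int icmp_seq_in_window(const struct conn_model *c, uint32_t seq)
{
    uint32_t lower = c->has_fastopen_req ? c->snt_isn : c->snd_una;

    return seq_between(seq, lower, c->snd_nxt);
}
```

Computing the lower bound once, before the comparison, keeps the check itself to a single expression regardless of whether the socket is a Fast Open child.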
    • Yuchung Cheng's avatar
      tcp: use tcp_v4_send_synack on first SYN-ACK · 843f4a55
      Yuchung Cheng authored
      To avoid large code duplication in IPv6, we need to first simplify
      the complicated SYN-ACK sending code in tcp_v4_conn_request().
      
      To use tcp_v4(6)_send_synack() to send all SYN-ACKs, we need to
      initialize the mini socket's receive window before trying to
      create the child socket and/or building the SYN-ACK packet. So we move
      that initialization from tcp_make_synack() to tcp_v4_conn_request()
      as a new function tcp_openreq_init_req_rwin().
      
      After this refactoring the SYN-ACK sending code is simpler, which makes
      it easier to implement Fast Open for IPv6.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Daniel Lee <longinus00@gmail.com>
      Signed-off-by: Jerry Chu <hkchu@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      843f4a55