1. 16 Apr, 2014 16 commits
    • Nicolas Dichtel's avatar
      sit: use the right netns in ioctl handler · 9aad77c3
      Nicolas Dichtel authored
      Because the netdevice may be in another netns than the i/o netns, we should
      use the i/o netns instead of dev_net(dev).
      
      Note that netdev_priv(dev) cannot bu NULL, hence we can remove these useless
      checks.
      Signed-off-by: default avatarNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9aad77c3
    • Nicolas Dichtel's avatar
      ip_tunnel: use the right netns in ioctl handler · 8c923ce2
      Nicolas Dichtel authored
      Because the netdevice may be in another netns than the i/o netns, we should
      use the i/o netns instead of dev_net(dev).
      
      The variable 'tunnel' was used only to get 'itn', hence to simplify code I
      remove it and use 't' instead.
      Signed-off-by: default avatarNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8c923ce2
    • Jan Glauber's avatar
      net: use SYSCALL_DEFINEx for sys_recv · b7c0ddf5
      Jan Glauber authored
      Make sys_recv a first class citizen by using the SYSCALL_DEFINEx
      macro. Besides being cleaner this will also generate meta data
      for the system call so tracing tools like ftrace or LTTng can
      resolve this system call.
      Signed-off-by: default avatarJan Glauber <jan.glauber@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b7c0ddf5
    • David S. Miller's avatar
      Merge branch 'mdio-gpio' · c3206e6f
      David S. Miller authored
      Guenter Roeck says:
      
      ====================
      net: mdio-gpio enhancements
      
      The following series of patches adds support for active-low gpio pins
      as well as for systems with separate MDI and MDO pins to the mdio-gpio
      driver.
      
      A board using those features is based on a COM Express CPU board.
      The COM Express standard supports GPIO pins on its connector,
      with one caveat: The pins on the connector have fixed direction
      and are hard configured either as input or output pins.
      The COM Express Design Guide [1] provides additional details.
      
      The hardware uses three of the GPO/GPI pins from the COM Express board
      to drive an MDIO bus. Connectivity between GPI/GPO pins and the MDIO bus
      is as follows.
      
      GPI2 --------------------+------------ MDIO
                               |
                  +--------+   |
      GPO2 ---+---G        |   |
              |   |        |   |
             4.7k | 2N7002 D---+
      	|   |        |
      	+---S        |
      	|   +--------+
             GND
      
      GPO1 --------------------------------- MDC
      
      To support this hardware, two extensions to the driver were necessary.
      
      - Due to the FET in the MDO path (GPO2), the MDO signal is inverted.
        The driver therefore has to support active-low GPIO pins.
      
      - The MDIO signal must be separated into MDI and MDO.
      
      Those changes are implemented in patch 2/3 and 3/3.
      Patch 1/3 simplifies the error path and thus the subsequent
      patches.
      
      [1] http://www.picmg.org/pdf/picmg_comdg_100.pdf
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c3206e6f
    • Guenter Roeck's avatar
      net: mdio-gpio: Add support for separate MDI and MDO gpio pins · f1d54c47
      Guenter Roeck authored
      This is for a system with fixed assignments of input and output pins
      (various variants of Kontron COMe).
      Signed-off-by: default avatarGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f1d54c47
    • Guenter Roeck's avatar
      net: mdio-gpio: Add support for active low gpio pins · 1d251481
      Guenter Roeck authored
      Some systems using mdio-gpio may use active-low gpio pins
      (eg with inverters or FETs connected to all or some of the
      gpio pins).
      Signed-off-by: default avatarGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1d251481
    • Guenter Roeck's avatar
      net: mdio-gpio: Use devm_ functions where possible · 78cdb079
      Guenter Roeck authored
      This simplifies error path and deinit/removal functions.
      Signed-off-by: default avatarGuenter Roeck <linux@roeck-us.net>
      Tested-by: default avatarChris Healy <cphealy@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      78cdb079
    • David S. Miller's avatar
      Merge branch 'fib_validate_loopback' · bc383ea5
      David S. Miller authored
      Cong Wang says:
      
      ====================
      ipv4: fix flowi4_iif for input routing
      
      This patchset fixes ->flowi4_iif for input routing and rp filter,
      based on suggestion from Julian. See per patch for details.
      
      v1 -> v2:
      * merge the first two patches into one
      * fix fib_check_nh() too
      * add this cover letter
      ====================
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarCong Wang <cwang@twopensource.com>
      Reviewed-by: default avatarJulian Anastasov <ja@ssi.bg>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bc383ea5
    • Cong Wang's avatar
      ipv4, route: pass 0 instead of LOOPBACK_IFINDEX to fib_validate_source() · 0d5edc68
      Cong Wang authored
      In my special case, when a packet is redirected from veth0 to lo,
      its skb->dev->ifindex would be LOOPBACK_IFINDEX. Meanwhile we
      pass the hard-coded LOOPBACK_IFINDEX to fib_validate_source()
      in ip_route_input_slow(). This would cause the following check
      in fib_validate_source() fail:
      
                  (dev->ifindex != oif || !IN_DEV_TX_REDIRECTS(idev))
      
      when rp_filter is disabeld on loopback. As suggested by Julian,
      the caller should pass 0 here so that we will not end up by
      calling __fib_validate_source().
      
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarCong Wang <cwang@twopensource.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0d5edc68
    • Cong Wang's avatar
      ipv4, fib: pass LOOPBACK_IFINDEX instead of 0 to flowi4_iif · 6a662719
      Cong Wang authored
      As suggested by Julian:
      
      	Simply, flowi4_iif must not contain 0, it does not
      	look logical to ignore all ip rules with specified iif.
      
      because in fib_rule_match() we do:
      
              if (rule->iifindex && (rule->iifindex != fl->flowi_iif))
                      goto out;
      
      flowi4_iif should be LOOPBACK_IFINDEX by default.
      
      We need to move LOOPBACK_IFINDEX to include/net/flow.h:
      
      1) It is mostly used by flowi_iif
      
      2) Fix the following compile error if we use it in flow.h
      by the patches latter:
      
      In file included from include/linux/netfilter.h:277:0,
                       from include/net/netns/netfilter.h:5,
                       from include/net/net_namespace.h:21,
                       from include/linux/netdevice.h:43,
                       from include/linux/icmpv6.h:12,
                       from include/linux/ipv6.h:61,
                       from include/net/ipv6.h:16,
                       from include/linux/sunrpc/clnt.h:27,
                       from include/linux/nfs_fs.h:30,
                       from init/do_mounts.c:32:
      include/net/flow.h: In function ‘flowi4_init_output’:
      include/net/flow.h:84:32: error: ‘LOOPBACK_IFINDEX’ undeclared (first use in this function)
      
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarCong Wang <cwang@twopensource.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6a662719
    • Chris Mason's avatar
      mlx4_en: don't use napi_synchronize inside mlx4_en_netpoll · c98235cb
      Chris Mason authored
      The mlx4 driver is triggering schedules while atomic inside
      mlx4_en_netpoll:
      
      	spin_lock_irqsave(&cq->lock, flags);
      	napi_synchronize(&cq->napi);
      		^^^^^ msleep here
      	mlx4_en_process_rx_cq(dev, cq, 0);
      	spin_unlock_irqrestore(&cq->lock, flags);
      
      This was part of a patch by Alexander Guller from Mellanox in 2011,
      but it still isn't upstream.
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      cc: stable@vger.kernel.org
      Acked-By: default avatarAmir Vadai <amirv@mellanox.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c98235cb
    • David S. Miller's avatar
      Merge branch 'mvneta_qsgmii' · b07afe07
      David S. Miller authored
      Thomas Petazzoni says:
      
      ====================
      net: mvneta: fix usage as a module, and support QSGMII properly
      
      This set of patches is a new attempt at fixing the operation of the
      mvneta driver when built as a module. For the record, the previous
      attempt, merged in commit e3a8786c
      ('net: mvneta: fix usage as a module on RGMII configurations') caused
      problems for all RGMII configurations.
      
      In fact, it turned out that the MAC to PHY connection on the Armada XP
      GP, which was described as using RGMII-ID according to its Device
      Tree, is in fact a QSGMII connection. And the RGMII and QSGMII
      configurations have to be handled in a different way in the driver,
      because the SERDES configuration is different in those two cases.
      
      So, this patch series fixes that by:
      
       * Adding minimal handling of a "qsgmii" connection type in the PHY
         layer. Mainly to make sure that a "qsgmii" phy-mode in the Device
         Tree is recognized, and handed over to the driver as
         PHY_INTERFACE_QSGMII.
      
       * Changing the mvneta driver to properly configure the RGMIIEn and
         PCSEn bits in the GMAC_CTRL_2 register, and configure the SERDES
         register, in the three possible cases: RGMII, SGMII and QSGMII.
      
       * Updating the Device Tree of the Armada XP GP board to reflect the
         fact that it uses a QSGMII MAC/PHY connection.
      
      PATCH 1 and 2 would be merged by David Miller, through the net tree,
      while PATCH 3 would be merged by the mach-mvebu maintainers, through
      their tree and arm-soc.
      
      This set of patches has been tested on:
      
       * Armada XP GP (four QSGMII interfaces)
       * Armada XP DB (two RGMII interfaces and two SGMII interfaces)
       * Armada 370 Mirabox (two RGMII interfaces)
      
      I've tested both the driver built-in, and compiled as a module.
      
      Since the last attempt at fixing this was quite a fiasco, I'd like
      this new attempt to be tested more widely before being applied. I'll
      try to do some testing on other Armada boards I have, but independent
      testing from other persons would also be appreciated.
      
      Note that these patches apply after reverting the previous attempt,
      obviously.
      ====================
      Tested-by: default avatarArnaud Ebalard <arno@natisbad.org>
      Tested-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b07afe07
    • Thomas Petazzoni's avatar
      net: mvneta: properly configure the MAC <-> PHY connection in all situations · 3f1dd4bc
      Thomas Petazzoni authored
      Commit 5445eaf3 ('mvneta: Try to fix mvneta when compiled as
      module') fixed the mvneta driver to make it work properly when loaded
      as a module in SGMII configuration, which was tested successful by the
      author on the Armada XP OpenBlocks AX3, which uses SGMII.
      
      However, some other platforms, namely the Armada XP GP don't use
      SGMII, but a QSGMII connection between the MAC and the PHY, and this
      case was not supported by the mvneta driver, which was relying on
      configuration put in place by the bootloader. While this works when
      the mvneta driver is built-in (because clocks are not gated), it
      breaks when mvneta is built as a module, because the clock is gated
      (all configuration is lost) and then re-enabled when the mvneta driver
      is loaded.
      
      In order to support all of RGMII, SGMII and QSGMII, this commit
      reworks how the PHY interface configuration is done, and simplifies
      it: it removes the mvneta_port_sgmii_config() and
      mvneta_gmac_rgmii_set() functions, which were strange because
      mvneta_gmac_rgmii_set() was called in all cases, even for SGMII
      configurations. Also, the mvneta_gmac_rgmii_set() function was taking
      a boolean as argument, which was always true.
      
      Instead, all the PHY interface configuration logic is moved into the
      mvneta_port_power_up() function, in a much simpler 'switch' construct,
      with four cases:
      
       - QSGMII: the RGMIIEn bit, the PCSEn bit in GMAC_CTRL_2 are set, and
         the SERDES is configured in QSGMII. Technically speaking,
         configuring the SERDES of the first port would be sufficient, but
         it is simpler to do it on all ports.
      
       - SGMII: the RGMIIEn bit, the PCSEn bit in GMAC_CTRL_2 are set, and
         the SERDES is configured as SGMII.
      
       - RGMII: the RGMIIEn bit in GMAC_CTRL_2 is set. The PCSEn bit is kept
         cleared, and no SERDES configuration is done, because RGMII is not
         using SERDES lanes.
      
       - other: an error is returned. For this reason, the
         mvneta_port_power_up() now returns an int instead of nothing, and
         the return value is checked by mvneta_probe().
      
      This has been successfully tested on:
      
       * Armada XP DB, which has two RGMII and two SGMII connections
       * Armada XP GP, which uses QSGMII for its four interfaces
       * Armada 370 Mirabox, which has two RGMII connections
      Signed-off-by: default avatarThomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3f1dd4bc
    • Thomas Petazzoni's avatar
      net: phy: add minimal support for QSGMII PHY · b9d12085
      Thomas Petazzoni authored
      This commit adds the necessary definitions for the PHY layer to
      recognize "qsgmii" as a valid PHY interface. A QSMII interface, as
      defined at
      http://en.wikipedia.org/wiki/Media_Independent_Interface#Quad_Serial_Gigabit_Media_Independent_Interface,
      is "is a method of combining four SGMII lines into a 5Gbit/s
      interface. QSGMII, like SGMII, uses LVDS signalling for the TX and RX
      data and a single LVDS clock signal. QSGMII uses significantly fewer
      signal lines than four SGMII busses."
      
      This type of MAC <-> PHY connection might require special handling on
      the MAC driver side, so it should be possible to express this type of
      MAC <-> PHY connection, for example in the Device Tree.
      Signed-off-by: default avatarThomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Cc: devicetree@vger.kernel.org
      Reviewed-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b9d12085
    • Edward Cree's avatar
      sfc:On MCDI timeout, issue an FLR (and mark MCDI to fail-fast) · e283546c
      Edward Cree authored
      When an MCDI command times out (whether or not we find it
      completed when we poll), call efx_mcdi_abandon(), which tells
      all subsequent MCDI calls to fail-fast, and queues up an FLR.
      
      Because an FLR doesn't lead to receiving any reboot even from
      the MC (unlike most other types of reset), we have to call
      efx_ef10_reset_mc_allocations.
      In efx_start_all(), if a reset (of any kind) is pending, we
      bail out.
      Without this, attempts to reconfigure (e.g. change mtu) can
      cause driver/mc state inconsistency if the first MCDI call
      triggers an FLR.
      
      For similar reasons, on EF10, in
      efx_reset_down(method=RESET_TYPE_MCDI_TIMEOUT), set the number
      of active queues to zero before calling efx_stop_all().
      And, on farch, in efx_reset_up(method=RESET_TYPE_MCDI_TIMEOUT),
      set active_queues and flushes pending & outstanding to zero.
      
      efx_mcdi_mode_{poll,event}() should not take us out of fail-fast
       mode. Instead, this is done by efx_mcdi_reset() after the FLR
      completes.
      
      The new FLR reset_type RESET_TYPE_MCDI_TIMEOUT doesn't really
      fit into the hierarchy of reset 'scopes' whereby efx_reset()
      decides some resets subsume others.  Thus, it uses separate logic.
      
      Also, fixed up some inconsistency around RESET_TYPE_MC_BIST,
      which was in the wrong place in that hierarchy.
      Signed-off-by: default avatarShradha Shah <sshah@solarflare.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e283546c
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 10ec34fc
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Fix BPF filter validation of netlink attribute accesses, from
          Mathias Kruase.
      
       2) Netfilter conntrack generation seqcount not initialized properly,
          from Andrey Vagin.
      
       3) Fix comparison mask computation on big-endian in nft_cmp_fast(),
          from Patrick McHardy.
      
       4) Properly limit MTU over ipv6, from Eric Dumazet.
      
       5) Fix seccomp system call argument population on 32-bit, from Daniel
          Borkmann.
      
       6) skb_network_protocol() should not use hard-coded ETH_HLEN, instead
          skb->mac_len needs to be used.  From Vlad Yasevich.
      
       7) We have several cases of using socket based communications to
          implement a tunnel.  For example, some tunnels are encapsulations
          over UDP so we use an internal kernel UDP socket to do the
          transmits.
      
          These tunnels should behave just like other software devices and
          pass the packets on down to the next layer.
      
          Most importantly we want the top-level socket (eg TCP) that created
          the traffic to be charged for the SKB memory.
      
          However, once you get into the IP output path, we have code that
          assumed that whatever was attached to skb->sk is an IP socket.
      
          To keep the top-level socket being charged for the SKB memory,
          whilst satisfying the needs of the IP output path, we now pass in an
          explicit 'sk' argument.
      
          From Eric Dumazet.
      
       8) ping_init_sock() leaks group info, from Xiaoming Wang.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (33 commits)
        cxgb4: use the correct max size for firmware flash
        qlcnic: Fix MSI-X initialization code
        ip6_gre: don't allow to remove the fb_tunnel_dev
        ipv4: add a sock pointer to dst->output() path.
        ipv4: add a sock pointer to ip_queue_xmit()
        driver/net: cosa driver uses udelay incorrectly
        at86rf230: fix __at86rf230_read_subreg function
        at86rf230: remove check if AVDD settled
        net: cadence: Add architecture dependencies
        net: Start with correct mac_len in skb_network_protocol
        Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer"
        cxgb4: Save the correct mac addr for hw-loopback connections in the L2T
        net: filter: seccomp: fix wrong decoding of BPF_S_ANC_SECCOMP_LD_W
        seccomp: fix populating a0-a5 syscall args in 32-bit x86 BPF
        qlcnic: Do not disable SR-IOV when VFs are assigned to VMs
        qlcnic: Fix QLogic application/driver interface for virtual NIC configuration
        qlcnic: Fix PVID configuration on eSwitch port.
        qlcnic: Fix max ring count calculation
        qlcnic: Fix to send INIT_NIC_FUNC as first mailbox.
        qlcnic: Fix panic due to uninitialzed delayed_work struct in use.
        ...
      10ec34fc
  2. 15 Apr, 2014 9 commits
  3. 14 Apr, 2014 15 commits
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/virt/kvm/kvm · 55101e2d
      Linus Torvalds authored
      Pull KVM fixes from Marcelo Tosatti:
       - Fix for guest triggerable BUG_ON (CVE-2014-0155)
       - CR4.SMAP support
       - Spurious WARN_ON() fix
      
      * git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: x86: remove WARN_ON from get_kernel_ns()
        KVM: Rename variable smep to cr4_smep
        KVM: expose SMAP feature to guest
        KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode
        KVM: Add SMAP support when setting CR4
        KVM: Remove SMAP bit from CR4_RESERVED_BITS
        KVM: ioapic: try to recover if pending_eoi goes out of range
        KVM: ioapic: fix assignment of ioapic->rtc_status.pending_eoi (CVE-2014-0155)
      55101e2d
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · dafe344d
      Linus Torvalds authored
      Pull bmc2835 crypto fix from Herbert Xu:
       "This fixes a potential boot crash on bcm2835 due to the recent change
        that now causes hardware RNGs to be accessed on registration"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
        hwrng: bcm2835 - fix oops when rng h/w is accessed during registration
      dafe344d
    • Mikulas Patocka's avatar
      user namespace: fix incorrect memory barriers · e79323bd
      Mikulas Patocka authored
      smp_read_barrier_depends() can be used if there is data dependency between
      the readers - i.e. if the read operation after the barrier uses address
      that was obtained from the read operation before the barrier.
      
      In this file, there is only control dependency, no data dependecy, so the
      use of smp_read_barrier_depends() is incorrect. The code could fail in the
      following way:
      * the cpu predicts that idx < entries is true and starts executing the
        body of the for loop
      * the cpu fetches map->extent[0].first and map->extent[0].count
      * the cpu fetches map->nr_extents
      * the cpu verifies that idx < extents is true, so it commits the
        instructions in the body of the for loop
      
      The problem is that in this scenario, the cpu read map->extent[0].first
      and map->nr_extents in the wrong order. We need a full read memory barrier
      to prevent it.
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e79323bd
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · 00cbc3dc
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains three Netfilter fixes for your net tree,
      they are:
      
      * Fix missing generation sequence initialization which results in a splat
        if lockdep is enabled, it was introduced in the recent works to improve
        nf_conntrack scalability, from Andrey Vagin.
      
      * Don't flush the GRE keymap list in nf_conntrack when the pptp helper is
        disabled otherwise this crashes due to a double release, from Andrey
        Vagin.
      
      * Fix nf_tables cmp fast in big endian, from Patrick McHardy.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      00cbc3dc
    • Vlad Yasevich's avatar
      net: Start with correct mac_len in skb_network_protocol · 1e785f48
      Vlad Yasevich authored
      Sometimes, when the packet arrives at skb_mac_gso_segment()
      its skb->mac_len already accounts for some of the mac lenght
      headers in the packet.  This seems to happen when forwarding
      through and OpenSSL tunnel.
      
      When we start looking for any vlan headers in skb_network_protocol()
      we seem to ignore any of the already known mac headers and start
      with an ETH_HLEN.  This results in an incorrect offset, dropped
      TSO frames and general slowness of the connection.
      
      We can start counting from the known skb->mac_len
      and return at least that much if all mac level headers
      are known and accounted for.
      
      Fixes: 53d6471c (net: Account for all vlan headers in skb_mac_gso_segment)
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Daniel Borkman <dborkman@redhat.com>
      Tested-by: default avatarMartin Filip <nexus+kernel@smoula.net>
      Signed-off-by: default avatarVlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1e785f48
    • Marcelo Tosatti's avatar
      b351c39c
    • Feng Wu's avatar
      KVM: Rename variable smep to cr4_smep · 66386ade
      Feng Wu authored
      Rename variable smep to cr4_smep, which can better reflect the
      meaning of the variable.
      Signed-off-by: default avatarFeng Wu <feng.wu@intel.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      66386ade
    • Feng Wu's avatar
      KVM: expose SMAP feature to guest · de935ae1
      Feng Wu authored
      This patch exposes SMAP feature to guest
      Signed-off-by: default avatarFeng Wu <feng.wu@intel.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      de935ae1
    • Feng Wu's avatar
      KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode · e1e746b3
      Feng Wu authored
      SMAP is disabled if CPU is in non-paging mode in hardware.
      However KVM always uses paging mode to emulate guest non-paging
      mode with TDP. To emulate this behavior, SMAP needs to be
      manually disabled when guest switches to non-paging mode.
      Signed-off-by: default avatarFeng Wu <feng.wu@intel.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      e1e746b3
    • Feng Wu's avatar
      KVM: Add SMAP support when setting CR4 · 97ec8c06
      Feng Wu authored
      This patch adds SMAP handling logic when setting CR4 for guests
      
      Thanks a lot to Paolo Bonzini for his suggestion to use the branchless
      way to detect SMAP violation.
      Signed-off-by: default avatarFeng Wu <feng.wu@intel.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      97ec8c06
    • Feng Wu's avatar
      KVM: Remove SMAP bit from CR4_RESERVED_BITS · 56d6efc2
      Feng Wu authored
      This patch removes SMAP bit from CR4_RESERVED_BITS.
      Signed-off-by: default avatarFeng Wu <feng.wu@intel.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      56d6efc2
    • Daniel Borkmann's avatar
      Revert "net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer" · 362d5204
      Daniel Borkmann authored
      This reverts commit ef2820a7 ("net: sctp: Fix a_rwnd/rwnd management
      to reflect real state of the receiver's buffer") as it introduced a
      serious performance regression on SCTP over IPv4 and IPv6, though a not
      as dramatic on the latter. Measurements are on 10Gbit/s with ixgbe NICs.
      
      Current state:
      
      [root@Lab200slot2 ~]# iperf3 --sctp -4 -c 192.168.241.3 -V -l 1452 -t 60
      iperf version 3.0.1 (10 January 2014)
      Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
      Time: Fri, 11 Apr 2014 17:56:21 GMT
      Connecting to host 192.168.241.3, port 5201
            Cookie: Lab200slot2.1397238981.812898.548918
      [  4] local 192.168.241.2 port 38616 connected to 192.168.241.3 port 5201
      Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
      [ ID] Interval           Transfer     Bandwidth
      [  4]   0.00-1.09   sec  20.8 MBytes   161 Mbits/sec
      [  4]   1.09-2.13   sec  10.8 MBytes  86.8 Mbits/sec
      [  4]   2.13-3.15   sec  3.57 MBytes  29.5 Mbits/sec
      [  4]   3.15-4.16   sec  4.33 MBytes  35.7 Mbits/sec
      [  4]   4.16-6.21   sec  10.4 MBytes  42.7 Mbits/sec
      [  4]   6.21-6.21   sec  0.00 Bytes    0.00 bits/sec
      [  4]   6.21-7.35   sec  34.6 MBytes   253 Mbits/sec
      [  4]   7.35-11.45  sec  22.0 MBytes  45.0 Mbits/sec
      [  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
      [  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
      [  4]  11.45-11.45  sec  0.00 Bytes    0.00 bits/sec
      [  4]  11.45-12.51  sec  16.0 MBytes   126 Mbits/sec
      [  4]  12.51-13.59  sec  20.3 MBytes   158 Mbits/sec
      [  4]  13.59-14.65  sec  13.4 MBytes   107 Mbits/sec
      [  4]  14.65-16.79  sec  33.3 MBytes   130 Mbits/sec
      [  4]  16.79-16.79  sec  0.00 Bytes    0.00 bits/sec
      [  4]  16.79-17.82  sec  5.94 MBytes  48.7 Mbits/sec
      (etc)
      
      [root@Lab200slot2 ~]#  iperf3 --sctp -6 -c 2001:db8:0:f101::1 -V -l 1400 -t 60
      iperf version 3.0.1 (10 January 2014)
      Linux Lab200slot2 3.14.0 #1 SMP Thu Apr 3 23:18:29 EDT 2014 x86_64
      Time: Fri, 11 Apr 2014 19:08:41 GMT
      Connecting to host 2001:db8:0:f101::1, port 5201
            Cookie: Lab200slot2.1397243321.714295.2b3f7c
      [  4] local 2001:db8:0:f101::2 port 55804 connected to 2001:db8:0:f101::1 port 5201
      Starting Test: protocol: SCTP, 1 streams, 1400 byte blocks, omitting 0 seconds, 60 second test
      [ ID] Interval           Transfer     Bandwidth
      [  4]   0.00-1.00   sec   169 MBytes  1.42 Gbits/sec
      [  4]   1.00-2.00   sec   201 MBytes  1.69 Gbits/sec
      [  4]   2.00-3.00   sec   188 MBytes  1.58 Gbits/sec
      [  4]   3.00-4.00   sec   174 MBytes  1.46 Gbits/sec
      [  4]   4.00-5.00   sec   165 MBytes  1.39 Gbits/sec
      [  4]   5.00-6.00   sec   199 MBytes  1.67 Gbits/sec
      [  4]   6.00-7.00   sec   163 MBytes  1.36 Gbits/sec
      [  4]   7.00-8.00   sec   174 MBytes  1.46 Gbits/sec
      [  4]   8.00-9.00   sec   193 MBytes  1.62 Gbits/sec
      [  4]   9.00-10.00  sec   196 MBytes  1.65 Gbits/sec
      [  4]  10.00-11.00  sec   157 MBytes  1.31 Gbits/sec
      [  4]  11.00-12.00  sec   175 MBytes  1.47 Gbits/sec
      [  4]  12.00-13.00  sec   192 MBytes  1.61 Gbits/sec
      [  4]  13.00-14.00  sec   199 MBytes  1.67 Gbits/sec
      (etc)
      
      After patch:
      
      [root@Lab200slot2 ~]#  iperf3 --sctp -4 -c 192.168.240.3 -V -l 1452 -t 60
      iperf version 3.0.1 (10 January 2014)
      Linux Lab200slot2 3.14.0+ #1 SMP Mon Apr 14 12:06:40 EDT 2014 x86_64
      Time: Mon, 14 Apr 2014 16:40:48 GMT
      Connecting to host 192.168.240.3, port 5201
            Cookie: Lab200slot2.1397493648.413274.65e131
      [  4] local 192.168.240.2 port 50548 connected to 192.168.240.3 port 5201
      Starting Test: protocol: SCTP, 1 streams, 1452 byte blocks, omitting 0 seconds, 60 second test
      [ ID] Interval           Transfer     Bandwidth
      [  4]   0.00-1.00   sec   240 MBytes  2.02 Gbits/sec
      [  4]   1.00-2.00   sec   239 MBytes  2.01 Gbits/sec
      [  4]   2.00-3.00   sec   240 MBytes  2.01 Gbits/sec
      [  4]   3.00-4.00   sec   239 MBytes  2.00 Gbits/sec
      [  4]   4.00-5.00   sec   245 MBytes  2.05 Gbits/sec
      [  4]   5.00-6.00   sec   240 MBytes  2.01 Gbits/sec
      [  4]   6.00-7.00   sec   240 MBytes  2.02 Gbits/sec
      [  4]   7.00-8.00   sec   239 MBytes  2.01 Gbits/sec
      
      With the reverted patch applied, the SCTP/IPv4 performance is back
      to normal on latest upstream for IPv4 and IPv6 and has same throughput
      as 3.4.2 test kernel, steady and interval reports are smooth again.
      
      Fixes: ef2820a7 ("net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer")
      Reported-by: default avatarPeter Butler <pbutler@sonusnet.com>
      Reported-by: default avatarDongsheng Song <dongsheng.song@gmail.com>
      Reported-by: default avatarFengguang Wu <fengguang.wu@intel.com>
      Tested-by: default avatarPeter Butler <pbutler@sonusnet.com>
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Cc: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
      Cc: Alexander Sverdlin <alexander.sverdlin@nsn.com>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Acked-by: default avatarVlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      362d5204
    • Steve Wise's avatar
      cxgb4: Save the correct mac addr for hw-loopback connections in the L2T · bfae2324
      Steve Wise authored
      Hardware needs the local device mac address to support hw loopback for
      rdma loopback connections.
      Signed-off-by: default avatarSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bfae2324
    • Daniel Borkmann's avatar
      net: filter: seccomp: fix wrong decoding of BPF_S_ANC_SECCOMP_LD_W · 8c482cdc
      Daniel Borkmann authored
      While reviewing seccomp code, we found that BPF_S_ANC_SECCOMP_LD_W has
      been wrongly decoded by commit a8fc9277 ("sk-filter: Add ability to
      get socket filter program (v2)") into the opcode BPF_LD|BPF_B|BPF_ABS
      although it should have been decoded as BPF_LD|BPF_W|BPF_ABS.
      
      In practice, this should not have much side-effect though, as such
      conversion is/was being done through prctl(2) PR_SET_SECCOMP. Reverse
      operation PR_GET_SECCOMP will only return the current seccomp mode, but
      not the filter itself. Since the transition to the new BPF infrastructure,
      it's also not used anymore, so we can simply remove this as it's
      unreachable.
      
      Fixes: a8fc9277 ("sk-filter: Add ability to get socket filter program (v2)")
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8c482cdc
    • Daniel Borkmann's avatar
      seccomp: fix populating a0-a5 syscall args in 32-bit x86 BPF · 2eac7648
      Daniel Borkmann authored
      Linus reports that on 32-bit x86 Chromium throws the following seccomp
      resp. audit log messages:
      
        audit: type=1326 audit(1397359304.356:28108): auid=500 uid=500
      gid=500 ses=2 subj=unconfined_u:unconfined_r:chrome_sandbox_t:s0-s0:c0.c1023
      pid=3677 comm="chrome" exe="/opt/google/chrome/chrome" sig=0
      syscall=172 compat=0 ip=0xb2dd9852 code=0x30000
      
        audit: type=1326 audit(1397359304.356:28109): auid=500 uid=500
      gid=500 ses=2 subj=unconfined_u:unconfined_r:chrome_sandbox_t:s0-s0:c0.c1023
      pid=3677 comm="chrome" exe="/opt/google/chrome/chrome" sig=0 syscall=5
      compat=0 ip=0xb2dd9852 code=0x50000
      
      These audit messages are being triggered via audit_seccomp() through
      __secure_computing() in seccomp mode (BPF) filter with seccomp return
      codes 0x30000 (== SECCOMP_RET_TRAP) and 0x50000 (== SECCOMP_RET_ERRNO)
      during filter runtime. Moreover, Linus reports that x86_64 Chromium
      seems fine.
      
      The underlying issue that explains this is that the implementation of
      populate_seccomp_data() is wrong. Our seccomp data structure sd that
      is being shared with user ABI is:
      
        struct seccomp_data {
          int nr;
          __u32 arch;
          __u64 instruction_pointer;
          __u64 args[6];
        };
      
      Therefore, a simple cast to 'unsigned long *' for storing the value of
      the syscall argument via syscall_get_arguments() is just wrong as on
      32-bit x86 (or any other 32bit arch), it would result in storing a0-a5
      at wrong offsets in args[] member, and thus i) could leak stack memory
      to user space and ii) tampers with the logic of seccomp BPF programs
      that read out and check for syscall arguments:
      
        syscall_get_arguments(task, regs, 0, 1, (unsigned long *) &sd->args[0]);
      
      Tested on 32-bit x86 with Google Chrome, unfortunately only via remote
      test machine through slow ssh X forwarding, but it fixes the issue on
      my side. So fix it up by storing args in type correct variables, gcc
      is clever and optimizes the copy away in other cases, e.g. x86_64.
      
      Fixes: bd4cf0ed ("net: filter: rework/optimize internal BPF interpreter's instruction set")
      Reported-and-bisected-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2eac7648